Re: [Qemu-block] [Qemu-devel] [RFC v1 2/4] util/oslib-win32: Remove invalid check

2017-06-27 Thread Philippe Mathieu-Daudé
On Tue, Jun 27, 2017 at 8:57 PM, Alistair Francis wrote:
> There is no way nhandles can be zero in this section so that part of the
> if statement will always be false. Let's just remove it to make the code
> easier to read.
>
> Signed-off-by: Alistair Francis 
> Acked-by: Edgar E. Iglesias 

Reviewed-by: Philippe Mathieu-Daudé 

> ---
>
>  util/oslib-win32.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index 80e4668935..7ec0f8e083 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -414,7 +414,7 @@ static int poll_rest(gboolean poll_msgs, HANDLE *handles, 
> gint nhandles,
>  /* If we have a timeout, or no handles to poll, be satisfied
>   * with just noticing we have messages waiting.
>   */
> -if (timeout != 0 || nhandles == 0) {
> +if (timeout != 0) {
>  return 1;
>  }
>
> --
> 2.11.0
>
>



Re: [Qemu-block] [PATCH v3 20/20] block: Make bdrv_is_allocated_above() byte-based

2017-06-27 Thread Xie Changlong

On 6/28/2017 3:24 AM, Eric Blake wrote:

We are gradually moving away from sector-based interfaces, towards
byte-based.  In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.
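
To make the precision point concrete, a small standalone illustration
(hypothetical 1000-byte image; DIV_ROUND_UP is a simplified restatement
of QEMU's osdep.h macro, not code from this patch):

    #include <stdint.h>
    #include <stdio.h>

    #define BDRV_SECTOR_SIZE   512
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    int main(void)
    {
        int64_t file_len = 1000; /* hypothetical unaligned image length */

        /* A sector-based interface has to round the tail up to a whole
         * 512-byte sector, overstating the allocated range: */
        int64_t sectors = DIV_ROUND_UP(file_len, BDRV_SECTOR_SIZE);
        printf("sector-based: %lld bytes\n",
               (long long)(sectors * BDRV_SECTOR_SIZE)); /* prints 1024 */

        /* A byte-based *pnum can report the length exactly: */
        printf("byte-based:   %lld bytes\n", (long long)file_len); /* 1000 */
        return 0;
    }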

Changing the signature of the function to use int64_t *pnum ensures
that the compiler enforces that all callers are updated.  For now,
the io.c layer still assert()s that all callers are sector-aligned,
but that can be relaxed when a later patch implements byte-based
block status.  Therefore, for the most part this patch is just the
addition of scaling at the callers followed by inverse scaling at
bdrv_is_allocated().  But some code, particularly stream_run(),
gets a lot simpler because it no longer has to mess with sectors.

For ease of review, bdrv_is_allocated() was tackled separately.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: tweak function comments, favor bdrv_getlength() over ->total_sectors
---
  include/block/block.h |  2 +-
  block/commit.c| 20 
  block/io.c| 42 --
  block/mirror.c|  5 -
  block/replication.c   | 17 -


For replication part:

Reviewed-by: Xie Changlong 


  block/stream.c| 21 +
  qemu-img.c| 10 +++---
  7 files changed, 61 insertions(+), 56 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 9b9d87b..13022d5 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -428,7 +428,7 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
  int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
int64_t *pnum);
  int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
-int64_t sector_num, int nb_sectors, int *pnum);
+int64_t offset, int64_t bytes, int64_t *pnum);

  bool bdrv_is_read_only(BlockDriverState *bs);
  bool bdrv_is_writable(BlockDriverState *bs);
diff --git a/block/commit.c b/block/commit.c
index 241aa95..774a8a5 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -146,7 +146,7 @@ static void coroutine_fn commit_run(void *opaque)
  int64_t offset;
  uint64_t delay_ns = 0;
  int ret = 0;
-int n = 0; /* sectors */
+int64_t n = 0; /* bytes */
  void *buf = NULL;
  int bytes_written = 0;
  int64_t base_len;
@@ -171,7 +171,7 @@ static void coroutine_fn commit_run(void *opaque)

  buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);

-for (offset = 0; offset < s->common.len; offset += n * BDRV_SECTOR_SIZE) {
+for (offset = 0; offset < s->common.len; offset += n) {
  bool copy;

  /* Note that even when no rate limit is applied we need to yield
@@ -183,15 +183,12 @@ static void coroutine_fn commit_run(void *opaque)
  }
  /* Copy if allocated above the base */
  ret = bdrv_is_allocated_above(blk_bs(s->top), blk_bs(s->base),
-  offset / BDRV_SECTOR_SIZE,
-  COMMIT_BUFFER_SIZE / BDRV_SECTOR_SIZE,
-  &n);
+  offset, COMMIT_BUFFER_SIZE, &n);
  copy = (ret == 1);
-trace_commit_one_iteration(s, offset, n * BDRV_SECTOR_SIZE, ret);
+trace_commit_one_iteration(s, offset, n, ret);
  if (copy) {
-ret = commit_populate(s->top, s->base, offset,
-  n * BDRV_SECTOR_SIZE, buf);
-bytes_written += n * BDRV_SECTOR_SIZE;
+ret = commit_populate(s->top, s->base, offset, n, buf);
+bytes_written += n;
  }
  if (ret < 0) {
  BlockErrorAction action =
@@ -204,11 +201,10 @@ static void coroutine_fn commit_run(void *opaque)
  }
  }
  /* Publish progress */
-s->common.offset += n * BDRV_SECTOR_SIZE;
+s->common.offset += n;

  if (copy && s->common.speed) {
-delay_ns = ratelimit_calculate_delay(&s->limit,
- n * BDRV_SECTOR_SIZE);
+delay_ns = ratelimit_calculate_delay(&s->limit, n);
  }
  }

diff --git a/block/io.c b/block/io.c
index 5bbf153..061a162 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1903,54 +1903,52 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState 
*bs, int64_t offset,
  /*
   * Given an image chain: ... -> [BASE] -> [INTER1] -> [INTER2] -> [TOP]
   *
- * Return true if the given sector is allocated in any image between
- * BASE and TOP (inclusive).  BASE can be NULL to check if the given
- * sector is allocated in any image of the chain.  Return false 

Re: [Qemu-block] [PATCH v3 15/20] backup: Switch block_backup.h to byte-based

2017-06-27 Thread Xie Changlong

On 6/28/2017 3:24 AM, Eric Blake wrote:

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Continue by converting
the public interface to backup jobs (no semantic change), including
a change to CowRequest to track by bytes instead of cluster indices.

Note that this does not change the difference between the public
interface (starting point, and size of the subsequent range) and
the internal interface (starting and end points).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: change a couple more parameter names
---
  include/block/block_backup.h | 11 +--
  block/backup.c   | 33 -
  block/replication.c  | 12 


Reviewed-by: Xie Changlong 


  3 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/include/block/block_backup.h b/include/block/block_backup.h
index 8a75947..994a3bd 100644
--- a/include/block/block_backup.h
+++ b/include/block/block_backup.h
@@ -21,17 +21,16 @@
  #include "block/block_int.h"

  typedef struct CowRequest {
-int64_t start;
-int64_t end;
+int64_t start_byte;
+int64_t end_byte;
  QLIST_ENTRY(CowRequest) list;
  CoQueue wait_queue; /* coroutines blocked on this request */
  } CowRequest;

-void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
-  int nb_sectors);
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t offset,
+  uint64_t bytes);
  void backup_cow_request_begin(CowRequest *req, BlockJob *job,
-  int64_t sector_num,
-  int nb_sectors);
+  int64_t offset, uint64_t bytes);
  void backup_cow_request_end(CowRequest *req);

  void backup_do_checkpoint(BlockJob *job, Error **errp);
diff --git a/block/backup.c b/block/backup.c
index 4e64710..cfbd921 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -55,7 +55,7 @@ static inline int64_t cluster_size_sectors(BackupBlockJob 
*job)

  /* See if in-flight requests overlap and wait for them to complete */
  static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
-   int64_t start,
+   int64_t offset,
 int64_t end)
  {
  CowRequest *req;
@@ -64,7 +64,7 @@ static void coroutine_fn 
wait_for_overlapping_requests(BackupBlockJob *job,
  do {
  retry = false;
 QLIST_FOREACH(req, &job->inflight_reqs, list) {
-if (end > req->start && start < req->end) {
+if (end > req->start_byte && offset < req->end_byte) {
  qemu_co_queue_wait(&req->wait_queue, NULL);
  retry = true;
  break;
@@ -75,10 +75,10 @@ static void coroutine_fn 
wait_for_overlapping_requests(BackupBlockJob *job,

  /* Keep track of an in-flight request */
  static void cow_request_begin(CowRequest *req, BackupBlockJob *job,
- int64_t start, int64_t end)
+  int64_t offset, int64_t end)
  {
-req->start = start;
-req->end = end;
+req->start_byte = offset;
+req->end_byte = end;
 qemu_co_queue_init(&req->wait_queue);
 QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
  }
@@ -114,8 +114,10 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
sector_num * BDRV_SECTOR_SIZE,
nb_sectors * BDRV_SECTOR_SIZE);

-wait_for_overlapping_requests(job, start, end);
-cow_request_begin(&cow_request, job, start, end);
+wait_for_overlapping_requests(job, start * job->cluster_size,
+  end * job->cluster_size);
+cow_request_begin(&cow_request, job, start * job->cluster_size,
+  end * job->cluster_size);

  for (; start < end; start++) {
  if (test_bit(start, job->done_bitmap)) {
@@ -277,32 +279,29 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
  bitmap_zero(backup_job->done_bitmap, len);
  }

-void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
-  int nb_sectors)
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t offset,
+  uint64_t bytes)
  {
  BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
-int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
  int64_t start, end;

  assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);

-start = sector_num / sectors_per_cluster;
-end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+start = QEMU_ALIGN_DOWN(offset, 

[Qemu-block] [RFC v1 0/4] Windows runtime improvements

2017-06-27 Thread Alistair Francis
At Xilinx we have started to run our fork of QEMU on Windows and are
starting to distribute this to customers. As part of this, QEMU is
launched from an Eclipse-based GUI on Windows hosts. This uncovered a
few performance issues in QEMU when running on CPU-restricted machines.

What we see is that when QEMU is run on a machine where there aren't
enough spare CPUs on the host, QEMU is extremely slow. We especially see
this when running our multi-architecture QEMU setup (MicroBlaze and ARM)
because we are running two QEMU instances as well as whatever else is
running on the machine.

Generally two instances of QEMU will never reach a Linux login prompt
when run on four host CPUs, but will reach the login prompt when run on
eight CPUs.

We investigated the issue and realised that it was mostly because QEMU
did not block; instead the IO thread just busy-looped. This basically
locked up two CPUs (one per QEMU instance) with IO threads, not leaving
enough resources on the machine. A rough sketch of the difference is
below.

This patch series fixed the issue for us on our fork of QEMU. Our fork
is based on QEMU 2.8.1 and does not include MTTCG support. I have not
tested this on mainline QEMU or with MTTCG. It is on my todo list to
see if I can reproduce the same issue on Windows with mainline QEMU. In
the meantime I wanted to send this series to see if anyone else has seen
Windows performance issues and if this helps with the problems. In order
to see the issue we had to do a full Linux boot; smaller bare-metal
applications generally don't reproduce the same issue.

Also, most of these patches edit QEMU's implementation of GLib code;
obviously these fixes (if applicable) will need to be ported to GLib and
applied there as well. I just wanted to start here.

Alistair Francis (4):
  util/aio-win32: Only select on what we are actually waiting for
  util/oslib-win32: Remove invalid check
  util/oslib-win32: Fix up if conditional
  util/oslib-win32: Recursively pass the timeout

 util/aio-win32.c   | 13 ++---
 util/oslib-win32.c | 25 +++--
 2 files changed, 29 insertions(+), 9 deletions(-)

-- 
2.11.0




[Qemu-block] [RFC v1 1/4] util/aio-win32: Only select on what we are actually waiting for

2017-06-27 Thread Alistair Francis
Signed-off-by: Alistair Francis 
Acked-by: Edgar E. Iglesias 
---

 util/aio-win32.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/util/aio-win32.c b/util/aio-win32.c
index bca496a47a..949979c2f5 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -71,6 +71,7 @@ void aio_set_fd_handler(AioContext *ctx,
 }
 } else {
 HANDLE event;
+long bitmask = 0;
 
 if (node == NULL) {
 /* Alloc and insert if it's not already there */
@@ -95,10 +96,16 @@ void aio_set_fd_handler(AioContext *ctx,
 node->io_write = io_write;
 node->is_external = is_external;
 
+if (io_read) {
+bitmask |= FD_READ;
+}
+
+if (io_write) {
+bitmask |= FD_WRITE;
+}
+
 event = event_notifier_get_handle(&node->notifier);
-WSAEventSelect(node->pfd.fd, event,
-   FD_READ | FD_ACCEPT | FD_CLOSE |
-   FD_CONNECT | FD_WRITE | FD_OOB);
+WSAEventSelect(node->pfd.fd, event, bitmask);
 }
 
 qemu_lockcnt_unlock(>list_lock);
-- 
2.11.0




[Qemu-block] [RFC v1 4/4] util/oslib-win32: Recursively pass the timeout

2017-06-27 Thread Alistair Francis
Signed-off-by: Alistair Francis 
Acked-by: Edgar E. Iglesias 
---

 util/oslib-win32.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index a015e1ac96..3630e46499 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -432,10 +432,10 @@ static int poll_rest(gboolean poll_msgs, HANDLE *handles, 
gint nhandles,
 }
 }
 
-/* If no timeout and polling several handles, recurse to poll
- * the rest of them.
+/* We only found one and we are waiting on more than one. Let's try
+ * again.
  */
-if (timeout == 0 && nhandles > 1) {
+if (nhandles > 1) {
 /* Remove the handle that fired */
 int i;
 if ((ready - WAIT_OBJECT_0) < nhandles - 1) {
@@ -444,7 +444,20 @@ static int poll_rest(gboolean poll_msgs, HANDLE *handles, 
gint nhandles,
 }
 }
 nhandles--;
-recursed_result = poll_rest(FALSE, handles, nhandles, fds, nfds, 0);
+
+/* If we just had a very small timeout let's increase it when we
+ * recurse to ensure we don't just busy wait. This ensures we let
+ * the Windows threads block at least a little. If we previously
+ * had some wait let's set it to zero to avoid blocking for too
+ * long.
+ */
+if (timeout < 10) {
+timeout = timeout + 1;
+} else {
+timeout = 0;
+}
+recursed_result = poll_rest(FALSE, handles, nhandles, fds,
+nfds, timeout);
 return (recursed_result == -1) ? -1 : 1 + recursed_result;
 }
 return 1;
-- 
2.11.0




[Qemu-block] [RFC v1 3/4] util/oslib-win32: Fix up if conditional

2017-06-27 Thread Alistair Francis
Signed-off-by: Alistair Francis 
Acked-by: Edgar E. Iglesias 
---

 util/oslib-win32.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 7ec0f8e083..a015e1ac96 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -438,7 +438,7 @@ static int poll_rest(gboolean poll_msgs, HANDLE *handles, 
gint nhandles,
 if (timeout == 0 && nhandles > 1) {
 /* Remove the handle that fired */
 int i;
-if (ready < nhandles - 1) {
+if ((ready - WAIT_OBJECT_0) < nhandles - 1) {
 for (i = ready - WAIT_OBJECT_0 + 1; i < nhandles; i++) {
 handles[i-1] = handles[i];
 }
-- 
2.11.0




[Qemu-block] [PATCH 2/2] block: add default implementations for bdrv_co_get_block_status()

2017-06-27 Thread Manos Pitsidianakis
bdrv_co_get_block_status_from_file() and
bdrv_co_get_block_status_from_backing() set *file to bs->file and
bs->backing respectively, so that bdrv_co_get_block_status() can recurse
to them. Future block drivers won't have to duplicate code to implement
this.

Signed-off-by: Manos Pitsidianakis 
---
 block/blkdebug.c  | 12 +---
 block/commit.c| 12 +---
 block/io.c| 24 
 block/mirror.c| 12 +---
 include/block/block_int.h | 16 
 5 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index b25856c4..f1539db6 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -641,16 +641,6 @@ static int coroutine_fn 
blkdebug_co_pdiscard(BlockDriverState *bs,
 return bdrv_co_pdiscard(bs->file->bs, offset, bytes);
 }
 
-static int64_t coroutine_fn blkdebug_co_get_block_status(
-BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
-BlockDriverState **file)
-{
-*pnum = nb_sectors;
-*file = bs->file->bs;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-(sector_num << BDRV_SECTOR_BITS);
-}
-
 static void blkdebug_close(BlockDriverState *bs)
 {
 BDRVBlkdebugState *s = bs->opaque;
@@ -925,7 +915,7 @@ static BlockDriver bdrv_blkdebug = {
 .bdrv_co_flush_to_disk  = blkdebug_co_flush,
 .bdrv_co_pwrite_zeroes  = blkdebug_co_pwrite_zeroes,
 .bdrv_co_pdiscard   = blkdebug_co_pdiscard,
-.bdrv_co_get_block_status = blkdebug_co_get_block_status,
+.bdrv_co_get_block_status = bdrv_co_get_block_status_from_file,
 
 .bdrv_debug_event   = blkdebug_debug_event,
 .bdrv_debug_breakpoint  = blkdebug_debug_breakpoint,
diff --git a/block/commit.c b/block/commit.c
index 524bd549..5b04f832 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -247,16 +247,6 @@ static int coroutine_fn 
bdrv_commit_top_preadv(BlockDriverState *bs,
 return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
 }
 
-static int64_t coroutine_fn bdrv_commit_top_get_block_status(
-BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
-BlockDriverState **file)
-{
-*pnum = nb_sectors;
-*file = bs->backing->bs;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
-}
-
 static void bdrv_commit_top_refresh_filename(BlockDriverState *bs, QDict *opts)
 {
 bdrv_refresh_filename(bs->backing->bs);
@@ -282,7 +272,7 @@ static void bdrv_commit_top_child_perm(BlockDriverState 
*bs, BdrvChild *c,
 static BlockDriver bdrv_commit_top = {
 .format_name= "commit_top",
 .bdrv_co_preadv = bdrv_commit_top_preadv,
-.bdrv_co_get_block_status   = bdrv_commit_top_get_block_status,
+.bdrv_co_get_block_status   = bdrv_co_get_block_status_from_backing,
 .bdrv_refresh_filename  = bdrv_commit_top_refresh_filename,
 .bdrv_close = bdrv_commit_top_close,
 .bdrv_child_perm= bdrv_commit_top_child_perm,
diff --git a/block/io.c b/block/io.c
index c1b73226..aec520b4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1706,6 +1706,30 @@ typedef struct BdrvCoGetBlockStatusData {
 bool done;
 } BdrvCoGetBlockStatusData;
 
+int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
+ int64_t sector_num,
+ int nb_sectors, int *pnum,
+ BlockDriverState **file)
+{
+assert(bs->file && bs->file->bs);
+*pnum = nb_sectors;
+*file = bs->file->bs;
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
+   (sector_num << BDRV_SECTOR_BITS);
+}
+
+int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
+ int64_t sector_num,
+ int nb_sectors, int *pnum,
+ BlockDriverState **file)
+{
+assert(bs->backing && bs->backing->bs);
+*pnum = nb_sectors;
+*file = bs->backing->bs;
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
+   (sector_num << BDRV_SECTOR_BITS);
+}
+
 /*
  * Returns the allocation status of the specified sectors.
  * Drivers not implementing the functionality are assumed to not support
diff --git a/block/mirror.c b/block/mirror.c
index 61a862dc..e8bf5f40 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1052,16 +1052,6 @@ static int coroutine_fn 
bdrv_mirror_top_flush(BlockDriverState *bs)
 return bdrv_co_flush(bs->backing->bs);
 }
 
-static int64_t coroutine_fn bdrv_mirror_top_get_block_status(
-BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
-BlockDriverState **file)
-{
-*pnum = nb_sectors;
-*file = bs->backing->bs;
-return 

[Qemu-block] [PATCH 0/2] block: block driver callbacks fixes

2017-06-27 Thread Manos Pitsidianakis
This series makes implementing some of the bdrv_* callbacks easier for
block drivers by passing requests on to bs->file when bs->drv does not
implement them, instead of failing, and by adding default
bdrv_co_get_block_status() implementations. The delegation pattern is
sketched below.
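
The shape of the delegation, distilled from patch 1 (a sketch using
bdrv_probe_blocksizes(); the real hunks are in the patches below):

    int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
    {
        BlockDriver *drv = bs->drv;

        if (drv && drv->bdrv_probe_blocksizes) {
            /* The driver implements the callback itself. */
            return drv->bdrv_probe_blocksizes(bs, bsz);
        } else if (bs->file && bs->file->bs) {
            /* Filter driver: delegate to the child it filters. */
            return bdrv_probe_blocksizes(bs->file->bs, bsz);
        }
        return -ENOTSUP;
    }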

This is based against Kevin Wolf's block branch, commit
0dd2ec6bc46bd93ea0b9df602962883417f31400

Manos Pitsidianakis (2):
  block: pass bdrv_* methods to bs->file by default
  block: add default implementations for bdrv_co_get_block_status()

 block.c   | 45 ++---
 block/blkdebug.c  | 12 +---
 block/commit.c| 12 +---
 block/io.c| 28 
 block/mirror.c| 12 +---
 include/block/block_int.h | 16 
 6 files changed, 81 insertions(+), 44 deletions(-)

-- 
2.11.0




[Qemu-block] [PATCH 1/2] block: pass bdrv_* methods to bs->file by default

2017-06-27 Thread Manos Pitsidianakis
The following functions fail if bs->drv does not implement them:

bdrv_probe_blocksizes
bdrv_probe_geometry
bdrv_truncate
bdrv_has_zero_init
bdrv_get_info
bdrv_media_changed
bdrv_eject
bdrv_lock_medium
bdrv_co_ioctl

Instead, the call should be passed to bs->file if it exists, to allow
filter drivers to support those methods without implementing them.

Signed-off-by: Manos Pitsidianakis 
---
 block.c| 45 ++---
 block/io.c |  4 
 2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/block.c b/block.c
index 69439628..ad11230a 100644
--- a/block.c
+++ b/block.c
@@ -494,6 +494,8 @@ int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes 
*bsz)
 
 if (drv && drv->bdrv_probe_blocksizes) {
 return drv->bdrv_probe_blocksizes(bs, bsz);
+} else if (bs->file && bs->file->bs) {
+return bdrv_probe_blocksizes(bs->file->bs, bsz);
 }
 
 return -ENOTSUP;
@@ -511,6 +513,8 @@ int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry 
*geo)
 
 if (drv && drv->bdrv_probe_geometry) {
 return drv->bdrv_probe_geometry(bs, geo);
+} else if (bs->file && bs->file->bs) {
+return bdrv_probe_geometry(bs->file->bs, geo);
 }
 
 return -ENOTSUP;
@@ -3406,13 +3410,18 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, 
Error **errp)
 
 assert(child->perm & BLK_PERM_RESIZE);
 
-if (!drv) {
-error_setg(errp, "No medium inserted");
-return -ENOMEDIUM;
-}
-if (!drv->bdrv_truncate) {
-error_setg(errp, "Image format driver does not support resize");
-return -ENOTSUP;
+if (!drv || !drv->bdrv_truncate) {
+if (bs->file && bs->file->bs) {
+return bdrv_truncate(bs->file, offset, errp);
+}
+if (!drv) {
+error_setg(errp, "No medium inserted");
+return -ENOMEDIUM;
+}
+if (!drv->bdrv_truncate) {
+error_setg(errp, "Image format driver does not support resize");
+return -ENOTSUP;
+}
 }
 if (bs->read_only) {
 error_setg(errp, "Image is read-only");
@@ -3778,6 +3787,9 @@ int bdrv_has_zero_init(BlockDriverState *bs)
 if (bs->drv->bdrv_has_zero_init) {
 return bs->drv->bdrv_has_zero_init(bs);
 }
+if (bs->file && bs->file->bs) {
+return bdrv_has_zero_init(bs->file->bs);
+}
 
 /* safe default */
 return 0;
@@ -3832,10 +3844,15 @@ void bdrv_get_backing_filename(BlockDriverState *bs,
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BlockDriver *drv = bs->drv;
-if (!drv)
-return -ENOMEDIUM;
-if (!drv->bdrv_get_info)
-return -ENOTSUP;
+if (!drv || !drv->bdrv_get_info) {
+if (bs->file && bs->file->bs) {
+return bdrv_get_info(bs->file->bs, bdi);
+}
+if (!drv)
+return -ENOMEDIUM;
+if (!drv->bdrv_get_info)
+return -ENOTSUP;
+}
 memset(bdi, 0, sizeof(*bdi));
 return drv->bdrv_get_info(bs, bdi);
 }
@@ -4205,6 +4222,8 @@ int bdrv_media_changed(BlockDriverState *bs)
 
 if (drv && drv->bdrv_media_changed) {
 return drv->bdrv_media_changed(bs);
+} else if (bs->file && bs->file->bs) {
+bdrv_media_changed(bs->file->bs);
 }
 return -ENOTSUP;
 }
@@ -4218,6 +4237,8 @@ void bdrv_eject(BlockDriverState *bs, bool eject_flag)
 
 if (drv && drv->bdrv_eject) {
 drv->bdrv_eject(bs, eject_flag);
+} else if (bs->file && bs->file->bs) {
+bdrv_eject(bs->file->bs, eject_flag);
 }
 }
 
@@ -4233,6 +4254,8 @@ void bdrv_lock_medium(BlockDriverState *bs, bool locked)
 
 if (drv && drv->bdrv_lock_medium) {
 drv->bdrv_lock_medium(bs, locked);
+} else if (bs->file && bs->file->bs) {
+bdrv_lock_medium(bs->file->bs, locked);
 }
 }
 
diff --git a/block/io.c b/block/io.c
index c72d7015..c1b73226 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2403,6 +2403,10 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void 
*buf)
 
 bdrv_inc_in_flight(bs);
 if (!drv || (!drv->bdrv_aio_ioctl && !drv->bdrv_co_ioctl)) {
+if (bs->file && bs->file->bs) {
+bdrv_dec_in_flight(bs);
+return bdrv_co_ioctl(bs->file->bs, req, buf);
+}
 co.ret = -ENOTSUP;
 goto out;
 }
-- 
2.11.0




Re: [Qemu-block] [Xen-devel] [PATCH v2 0/3] xen-disk: performance improvements

2017-06-27 Thread Stefano Stabellini
On Wed, 21 Jun 2017, Paul Durrant wrote:
> Paul Durrant (3):
>   xen-disk: only advertize feature-persistent if grant copy is not
> available
>   xen-disk: add support for multi-page shared rings
>   xen-disk: use an IOThread per instance
> 
>  hw/block/trace-events |   7 ++
>  hw/block/xen_disk.c   | 228 +++---
>  2 files changed, 188 insertions(+), 47 deletions(-)

While waiting for an answer on patch #3, I sent a pull request for the
first 2 patches.



Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: Add test for failing qemu-img commit

2017-06-27 Thread Eric Blake
On 06/16/2017 08:58 AM, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
> In order to pass, this depends on "fix: avoid an infinite loop or a
> dangling pointer problem in img_commit"
> (http://lists.nongnu.org/archive/html/qemu-block/2017-06/msg00443.html)

Looks like that is now 4172a003

> and on the "block: Don't compare strings in bdrv_reopen_prepare()"
> series
> (http://lists.nongnu.org/archive/html/qemu-block/2017-06/msg00424.html).

Still pending.

> ---
>  tests/qemu-iotests/020 | 27 +++
>  tests/qemu-iotests/020.out | 17 +
>  2 files changed, 44 insertions(+)
> 
> diff --git a/tests/qemu-iotests/020 b/tests/qemu-iotests/020
> index 7a0..83d5ef0 100755
> --- a/tests/qemu-iotests/020
> +++ b/tests/qemu-iotests/020
> @@ -110,6 +110,33 @@ for offset in $TEST_OFFSETS; do
>  io_zero readv $(( offset + 64 * 1024 + 65536 * 4 )) 65536 65536 1
>  done
>  _check_test_img
> +_cleanup
> +TEST_IMG=$TEST_IMG_SAVE
> +
> +echo
> +echo 'Testing failing commit'
> +echo
> +
> +# Create an image with a null backing file to which committing will fail (with
> +# ENOSPC so we can distinguish the result from some generic EIO which may be
> +# generated anywhere in the block layer)
> +_make_test_img -b "json:{'driver': 'raw',
> + 'file': {
> + 'driver': 'blkdebug',
> + 'inject-error': [{
> + 'event': 'write_aio',
> + 'errno': 28,

Still nasty that we encoded an errno value rather than using a
stringified enum, but that's not your problem.

> + 'once': true
> + }],
> + 'image': {
> + 'driver': 'null-co'
> + }}}"

The comment does wonders for explaining how it all works!

> +
> +# Just write anything so comitting will not be a no-op

s/comitting/committing/

With the typo fix,
Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH] nbd: fix NBD over TLS

2017-06-27 Thread Eric Blake
On 06/15/2017 02:51 PM, Paolo Bonzini wrote:
> When attaching the NBD QIOChannel to an AioContext, the TLS channel should
> be used, not the underlying socket channel.  This is because, trivially,
> the TLS channel will be the one that we read/write to and thus the one
> that will get the qio_channel_yield() call.
> 
> Fixes: ff82911cd3f69f028f2537825c9720ff78bc3f19
> Cc: qemu-sta...@nongnu.org
> Cc: Daniel P. Berrange 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd-client.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
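
For readers without the patch handy, a hedged sketch of what the fix
amounts to (field and function names assumed from block/nbd-client.c of
this era, not copied from the patch):

    static void nbd_client_attach_aio_context(BlockDriverState *bs,
                                              AioContext *new_context)
    {
        NBDClientSession *client = nbd_get_client_session(bs);

        /* client->ioc is the outermost channel -- the TLS channel when
         * TLS is in use -- while client->sioc is the raw socket it
         * wraps.  Reads and writes go through client->ioc, so that is
         * the channel whose qio_channel_yield() must be woken from the
         * AioContext. */
        qio_channel_attach_aio_context(QIO_CHANNEL(client->ioc), new_context);
    }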

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v3 15/20] backup: Switch block_backup.h to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Continue by converting
the public interface to backup jobs (no semantic change), including
a change to CowRequest to track by bytes instead of cluster indices.

Note that this does not change the difference between the public
interface (starting point, and size of the subsequent range) and
the internal interface (starting and end points).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: change a couple more parameter names
---
 include/block/block_backup.h | 11 +--
 block/backup.c   | 33 -
 block/replication.c  | 12 
 3 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/include/block/block_backup.h b/include/block/block_backup.h
index 8a75947..994a3bd 100644
--- a/include/block/block_backup.h
+++ b/include/block/block_backup.h
@@ -21,17 +21,16 @@
 #include "block/block_int.h"

 typedef struct CowRequest {
-int64_t start;
-int64_t end;
+int64_t start_byte;
+int64_t end_byte;
 QLIST_ENTRY(CowRequest) list;
 CoQueue wait_queue; /* coroutines blocked on this request */
 } CowRequest;

-void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
-  int nb_sectors);
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t offset,
+  uint64_t bytes);
 void backup_cow_request_begin(CowRequest *req, BlockJob *job,
-  int64_t sector_num,
-  int nb_sectors);
+  int64_t offset, uint64_t bytes);
 void backup_cow_request_end(CowRequest *req);

 void backup_do_checkpoint(BlockJob *job, Error **errp);
diff --git a/block/backup.c b/block/backup.c
index 4e64710..cfbd921 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -55,7 +55,7 @@ static inline int64_t cluster_size_sectors(BackupBlockJob 
*job)

 /* See if in-flight requests overlap and wait for them to complete */
 static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
-   int64_t start,
+   int64_t offset,
int64_t end)
 {
 CowRequest *req;
@@ -64,7 +64,7 @@ static void coroutine_fn 
wait_for_overlapping_requests(BackupBlockJob *job,
 do {
 retry = false;
 QLIST_FOREACH(req, &job->inflight_reqs, list) {
-if (end > req->start && start < req->end) {
+if (end > req->start_byte && offset < req->end_byte) {
 qemu_co_queue_wait(&req->wait_queue, NULL);
 retry = true;
 break;
@@ -75,10 +75,10 @@ static void coroutine_fn 
wait_for_overlapping_requests(BackupBlockJob *job,

 /* Keep track of an in-flight request */
 static void cow_request_begin(CowRequest *req, BackupBlockJob *job,
- int64_t start, int64_t end)
+  int64_t offset, int64_t end)
 {
-req->start = start;
-req->end = end;
+req->start_byte = offset;
+req->end_byte = end;
 qemu_co_queue_init(&req->wait_queue);
 QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
 }
@@ -114,8 +114,10 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
   sector_num * BDRV_SECTOR_SIZE,
   nb_sectors * BDRV_SECTOR_SIZE);

-wait_for_overlapping_requests(job, start, end);
-cow_request_begin(&cow_request, job, start, end);
+wait_for_overlapping_requests(job, start * job->cluster_size,
+  end * job->cluster_size);
+cow_request_begin(&cow_request, job, start * job->cluster_size,
+  end * job->cluster_size);

 for (; start < end; start++) {
 if (test_bit(start, job->done_bitmap)) {
@@ -277,32 +279,29 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 bitmap_zero(backup_job->done_bitmap, len);
 }

-void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
-  int nb_sectors)
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t offset,
+  uint64_t bytes)
 {
 BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
-int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
 int64_t start, end;

 assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);

-start = sector_num / sectors_per_cluster;
-end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+start = QEMU_ALIGN_DOWN(offset, backup_job->cluster_size);
+end = QEMU_ALIGN_UP(offset + bytes, backup_job->cluster_size);
 wait_for_overlapping_requests(backup_job, start, end);
 }

 void 

[Qemu-block] [PATCH v3 19/20] block: Minimize raw use of bds->total_sectors

2017-06-27 Thread Eric Blake
bdrv_is_allocated_above() was relying on intermediate->total_sectors,
which is a field that can have stale contents depending on the value
of intermediate->has_variable_length.  An audit shows that we are safe
(we were first calling through bdrv_co_get_block_status() which in
turn calls bdrv_nb_sectors() and therefore just refreshed the current
length), but it's nicer to favor our accessor functions to avoid having
to repeat such an audit, even if it means refresh_total_sectors() is
called more frequently.
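
A minimal sketch of the distinction (assuming the names used above):

    /* Raw field: may be stale for drivers with variable-length images
     * (bs->has_variable_length), since nothing here refreshes it. */
    int64_t maybe_stale = intermediate->total_sectors;

    /* Accessor: refreshes the length first via refresh_total_sectors(),
     * so the value is current; a negative return is an -errno. */
    int64_t size_inter = bdrv_nb_sectors(intermediate);
    if (size_inter < 0) {
        return size_inter; /* propagate the error */
    }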

Suggested-by: John Snow 
Signed-off-by: Eric Blake 

---
v2: new patch
---
 block/io.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index 0545180..5bbf153 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1924,6 +1924,7 @@ int bdrv_is_allocated_above(BlockDriverState *top,
 intermediate = top;
 while (intermediate && intermediate != base) {
 int64_t pnum_inter;
+int64_t size_inter;
 int psectors_inter;

 ret = bdrv_is_allocated(intermediate, sector_num * BDRV_SECTOR_SIZE,
@@ -1941,13 +1942,14 @@ int bdrv_is_allocated_above(BlockDriverState *top,

 /*
  * [sector_num, nb_sectors] is unallocated on top but intermediate
- * might have
- *
- * [sector_num+x, nr_sectors] allocated.
+ * might have [sector_num+x, nb_sectors-x] allocated.
  */
+size_inter = bdrv_nb_sectors(intermediate);
+if (size_inter < 0) {
+return size_inter;
+}
 if (n > psectors_inter &&
-(intermediate == top ||
- sector_num + psectors_inter < intermediate->total_sectors)) {
+(intermediate == top || sector_num + psectors_inter < size_inter)) {
 n = psectors_inter;
 }

-- 
2.9.4




[Qemu-block] [PATCH v3 18/20] block: Make bdrv_is_allocated() byte-based

2017-06-27 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.

Changing the signature of the function to use int64_t *pnum ensures
that the compiler enforces that all callers are updated.  For now,
the io.c layer still assert()s that all callers are sector-aligned
on input and that *pnum is sector-aligned on return to the caller,
but that can be relaxed when a later patch implements byte-based
block status.  Therefore, this code adds usages like
DIV_ROUND_UP(,BDRV_SECTOR_SIZE) to callers that still want aligned
values, where the call might reasonably give non-aligned results
in the future; on the other hand, no rounding is needed for callers
that should just continue to work with byte alignment.
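
As a concrete illustration of the scaling mentioned above (the macros
are simplified restatements of QEMU's osdep.h helpers, with hypothetical
numbers):

    #include <stdint.h>
    #include <stdio.h>

    #define DIV_ROUND_UP(n, d)    (((n) + (d) - 1) / (d))
    #define QEMU_ALIGN_DOWN(n, m) (((n) / (m)) * (m))
    #define QEMU_ALIGN_UP(n, m)   QEMU_ALIGN_DOWN((n) + (m) - 1, (m))

    int main(void)
    {
        int64_t offset = 1000, bytes = 3000; /* hypothetical byte range */

        /* A caller that still wants whole sectors rounds the count up: */
        printf("%lld sectors\n",
               (long long)DIV_ROUND_UP(bytes, 512)); /* prints 6 */

        /* Cluster-granularity code (e.g. the backup patches in this
         * series) brackets a byte range to cluster boundaries instead: */
        printf("[%lld, %lld)\n",
               (long long)QEMU_ALIGN_DOWN(offset, 4096),
               (long long)QEMU_ALIGN_UP(offset + bytes, 4096)); /* [0, 4096) */
        return 0;
    }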

For the most part this patch is just the addition of scaling at the
callers followed by inverse scaling at bdrv_is_allocated().  But
some code, particularly bdrv_commit(), gets a lot simpler because it
no longer has to mess with sectors; also, it is now possible to pass
NULL if the caller does not care how much of the image is allocated
beyond the initial offset.

For ease of review, bdrv_is_allocated_above() will be tackled
separately.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: rebase to earlier changes, tweak commit message
---
 include/block/block.h |  4 +--
 block/backup.c| 17 -
 block/commit.c| 21 +++-
 block/io.c| 49 +---
 block/stream.c|  5 ++--
 block/vvfat.c | 34 ++---
 migration/block.c |  9 ---
 qemu-img.c|  5 +++-
 qemu-io-cmds.c| 70 +++
 9 files changed, 114 insertions(+), 100 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 5cdd690..9b9d87b 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -425,8 +425,8 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 int64_t sector_num,
 int nb_sectors, int *pnum,
 BlockDriverState **file);
-int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
-  int *pnum);
+int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
+  int64_t *pnum);
 int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
 int64_t sector_num, int nb_sectors, int *pnum);

diff --git a/block/backup.c b/block/backup.c
index 04def91..b2048bf 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -47,12 +47,6 @@ typedef struct BackupBlockJob {
 QLIST_HEAD(, CowRequest) inflight_reqs;
 } BackupBlockJob;

-/* Size of a cluster in sectors, instead of bytes. */
-static inline int64_t cluster_size_sectors(BackupBlockJob *job)
-{
-  return job->cluster_size / BDRV_SECTOR_SIZE;
-}
-
 /* See if in-flight requests overlap and wait for them to complete */
 static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
int64_t offset,
@@ -433,7 +427,6 @@ static void coroutine_fn backup_run(void *opaque)
 BackupCompleteData *data;
 BlockDriverState *bs = blk_bs(job->common.blk);
 int64_t offset;
-int64_t sectors_per_cluster = cluster_size_sectors(job);
 int ret = 0;

 QLIST_INIT(&job->inflight_reqs);
@@ -465,12 +458,13 @@ static void coroutine_fn backup_run(void *opaque)
 }

 if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
-int i, n;
+int i;
+int64_t n;

 /* Check to see if these blocks are already in the
  * backing file. */

-for (i = 0; i < sectors_per_cluster;) {
+for (i = 0; i < job->cluster_size;) {
 /* bdrv_is_allocated() only returns true/false based
  * on the first set of sectors it comes across that
  * are are all in the same state.
@@ -478,9 +472,8 @@ static void coroutine_fn backup_run(void *opaque)
  * backup cluster length.  We end up copying more than
  * needed but at some point that is always the case. */
 alloced =
-bdrv_is_allocated(bs,
-  (offset >> BDRV_SECTOR_BITS) + i,
-  sectors_per_cluster - i, &n);
+bdrv_is_allocated(bs, offset + i,
+  job->cluster_size - i, &n);
 i 

[Qemu-block] [PATCH v3 17/20] backup: Switch backup_run() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the internal
loop iteration of backups to track by bytes instead of sectors
(although we are still guaranteed that we iterate by steps that
are cluster-aligned).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/backup.c | 32 +++-
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index c029d44..04def91 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -370,11 +370,10 @@ static int coroutine_fn 
backup_run_incremental(BackupBlockJob *job)
 int ret = 0;
 int clusters_per_iter;
 uint32_t granularity;
-int64_t sector;
+int64_t offset;
 int64_t cluster;
 int64_t end;
 int64_t last_cluster = -1;
-int64_t sectors_per_cluster = cluster_size_sectors(job);
 BdrvDirtyBitmapIter *dbi;

 granularity = bdrv_dirty_bitmap_granularity(job->sync_bitmap);
@@ -382,8 +381,8 @@ static int coroutine_fn 
backup_run_incremental(BackupBlockJob *job)
 dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);

 /* Find the next dirty sector(s) */
-while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
-cluster = sector / sectors_per_cluster;
+while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) {
+cluster = offset / job->cluster_size;

 /* Fake progress updates for any clusters we skipped */
 if (cluster != last_cluster + 1) {
@@ -410,7 +409,8 @@ static int coroutine_fn 
backup_run_incremental(BackupBlockJob *job)
 /* If the bitmap granularity is smaller than the backup granularity,
  * we need to advance the iterator pointer to the next cluster. */
 if (granularity < job->cluster_size) {
-bdrv_set_dirty_iter(dbi, cluster * sectors_per_cluster);
+bdrv_set_dirty_iter(dbi,
+cluster * job->cluster_size / BDRV_SECTOR_SIZE);
 }

 last_cluster = cluster - 1;
@@ -432,17 +432,15 @@ static void coroutine_fn backup_run(void *opaque)
 BackupBlockJob *job = opaque;
 BackupCompleteData *data;
 BlockDriverState *bs = blk_bs(job->common.blk);
-int64_t start, end;
+int64_t offset;
 int64_t sectors_per_cluster = cluster_size_sectors(job);
 int ret = 0;

 QLIST_INIT(&job->inflight_reqs);
 qemu_co_rwlock_init(&job->flush_rwlock);

-start = 0;
-end = DIV_ROUND_UP(job->common.len, job->cluster_size);
-
-job->done_bitmap = bitmap_new(end);
+job->done_bitmap = bitmap_new(DIV_ROUND_UP(job->common.len,
+   job->cluster_size));

 job->before_write.notify = backup_before_write_notify;
 bdrv_add_before_write_notifier(bs, &job->before_write);
@@ -457,7 +455,8 @@ static void coroutine_fn backup_run(void *opaque)
 ret = backup_run_incremental(job);
 } else {
 /* Both FULL and TOP SYNC_MODE's require copying.. */
-for (; start < end; start++) {
+for (offset = 0; offset < job->common.len;
+ offset += job->cluster_size) {
 bool error_is_read;
 int alloced = 0;

@@ -480,8 +479,8 @@ static void coroutine_fn backup_run(void *opaque)
  * needed but at some point that is always the case. */
 alloced =
 bdrv_is_allocated(bs,
-start * sectors_per_cluster + i,
-sectors_per_cluster - i, &n);
+  (offset >> BDRV_SECTOR_BITS) + i,
+  sectors_per_cluster - i, &n);
 i += n;

 if (alloced || n == 0) {
@@ -499,9 +498,8 @@ static void coroutine_fn backup_run(void *opaque)
 if (alloced < 0) {
 ret = alloced;
 } else {
-ret = backup_do_cow(job, start * job->cluster_size,
-job->cluster_size, &error_is_read,
-false);
+ret = backup_do_cow(job, offset, job->cluster_size,
+&error_is_read, false);
 }
 if (ret < 0) {
 /* Depending on error action, fail now or retry cluster */
@@ -510,7 +508,7 @@ static void coroutine_fn backup_run(void *opaque)
 if (action == BLOCK_ERROR_ACTION_REPORT) {
 break;
 } else {
-start--;
+offset -= job->cluster_size;
 continue;
 }
 }
-- 
2.9.4




[Qemu-block] [PATCH v3 20/20] block: Make bdrv_is_allocated_above() byte-based

2017-06-27 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.

Changing the signature of the function to use int64_t *pnum ensures
that the compiler enforces that all callers are updated.  For now,
the io.c layer still assert()s that all callers are sector-aligned,
but that can be relaxed when a later patch implements byte-based
block status.  Therefore, for the most part this patch is just the
addition of scaling at the callers followed by inverse scaling at
bdrv_is_allocated().  But some code, particularly stream_run(),
gets a lot simpler because it no longer has to mess with sectors.

For ease of review, bdrv_is_allocated() was tackled separately.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: tweak function comments, favor bdrv_getlength() over ->total_sectors
---
 include/block/block.h |  2 +-
 block/commit.c| 20 
 block/io.c| 42 --
 block/mirror.c|  5 -
 block/replication.c   | 17 -
 block/stream.c| 21 +
 qemu-img.c| 10 +++---
 7 files changed, 61 insertions(+), 56 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 9b9d87b..13022d5 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -428,7 +428,7 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
   int64_t *pnum);
 int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
-int64_t sector_num, int nb_sectors, int *pnum);
+int64_t offset, int64_t bytes, int64_t *pnum);

 bool bdrv_is_read_only(BlockDriverState *bs);
 bool bdrv_is_writable(BlockDriverState *bs);
diff --git a/block/commit.c b/block/commit.c
index 241aa95..774a8a5 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -146,7 +146,7 @@ static void coroutine_fn commit_run(void *opaque)
 int64_t offset;
 uint64_t delay_ns = 0;
 int ret = 0;
-int n = 0; /* sectors */
+int64_t n = 0; /* bytes */
 void *buf = NULL;
 int bytes_written = 0;
 int64_t base_len;
@@ -171,7 +171,7 @@ static void coroutine_fn commit_run(void *opaque)

 buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);

-for (offset = 0; offset < s->common.len; offset += n * BDRV_SECTOR_SIZE) {
+for (offset = 0; offset < s->common.len; offset += n) {
 bool copy;

 /* Note that even when no rate limit is applied we need to yield
@@ -183,15 +183,12 @@ static void coroutine_fn commit_run(void *opaque)
 }
 /* Copy if allocated above the base */
 ret = bdrv_is_allocated_above(blk_bs(s->top), blk_bs(s->base),
-  offset / BDRV_SECTOR_SIZE,
-  COMMIT_BUFFER_SIZE / BDRV_SECTOR_SIZE,
-  &n);
+  offset, COMMIT_BUFFER_SIZE, &n);
 copy = (ret == 1);
-trace_commit_one_iteration(s, offset, n * BDRV_SECTOR_SIZE, ret);
+trace_commit_one_iteration(s, offset, n, ret);
 if (copy) {
-ret = commit_populate(s->top, s->base, offset,
-  n * BDRV_SECTOR_SIZE, buf);
-bytes_written += n * BDRV_SECTOR_SIZE;
+ret = commit_populate(s->top, s->base, offset, n, buf);
+bytes_written += n;
 }
 if (ret < 0) {
 BlockErrorAction action =
@@ -204,11 +201,10 @@ static void coroutine_fn commit_run(void *opaque)
 }
 }
 /* Publish progress */
-s->common.offset += n * BDRV_SECTOR_SIZE;
+s->common.offset += n;

 if (copy && s->common.speed) {
-delay_ns = ratelimit_calculate_delay(&s->limit,
- n * BDRV_SECTOR_SIZE);
+delay_ns = ratelimit_calculate_delay(&s->limit, n);
 }
 }

diff --git a/block/io.c b/block/io.c
index 5bbf153..061a162 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1903,54 +1903,52 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState 
*bs, int64_t offset,
 /*
  * Given an image chain: ... -> [BASE] -> [INTER1] -> [INTER2] -> [TOP]
  *
- * Return true if the given sector is allocated in any image between
- * BASE and TOP (inclusive).  BASE can be NULL to check if the given
- * sector is allocated in any image of the chain.  Return false otherwise,
+ * Return true if the (prefix of the) given range is allocated in any image
+ * between BASE and TOP (inclusive).  BASE can be NULL to check if the given
+ * 

[Qemu-block] [PATCH v3 14/20] backup: Switch BackupBlockJob to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Continue by converting an
internal structure (no semantic change), and all references to
tracking progress.  Drop a redundant local variable bytes_per_cluster.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/backup.c | 33 +++--
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 06431ac..4e64710 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -39,7 +39,7 @@ typedef struct BackupBlockJob {
 BlockdevOnError on_source_error;
 BlockdevOnError on_target_error;
 CoRwlock flush_rwlock;
-uint64_t sectors_read;
+uint64_t bytes_read;
 unsigned long *done_bitmap;
 int64_t cluster_size;
 bool compress;
@@ -102,16 +102,15 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 void *bounce_buffer = NULL;
 int ret = 0;
 int64_t sectors_per_cluster = cluster_size_sectors(job);
-int64_t bytes_per_cluster = sectors_per_cluster * BDRV_SECTOR_SIZE;
-int64_t start, end;
-int n;
+int64_t start, end; /* clusters */
+int n; /* bytes */

 qemu_co_rwlock_rdlock(&job->flush_rwlock);

 start = sector_num / sectors_per_cluster;
 end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);

-trace_backup_do_cow_enter(job, start * bytes_per_cluster,
+trace_backup_do_cow_enter(job, start * job->cluster_size,
   sector_num * BDRV_SECTOR_SIZE,
   nb_sectors * BDRV_SECTOR_SIZE);

@@ -120,28 +119,27 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,

 for (; start < end; start++) {
 if (test_bit(start, job->done_bitmap)) {
-trace_backup_do_cow_skip(job, start * bytes_per_cluster);
+trace_backup_do_cow_skip(job, start * job->cluster_size);
 continue; /* already copied */
 }

-trace_backup_do_cow_process(job, start * bytes_per_cluster);
+trace_backup_do_cow_process(job, start * job->cluster_size);

-n = MIN(sectors_per_cluster,
-job->common.len / BDRV_SECTOR_SIZE -
-start * sectors_per_cluster);
+n = MIN(job->cluster_size,
+job->common.len - start * job->cluster_size);

 if (!bounce_buffer) {
 bounce_buffer = blk_blockalign(blk, job->cluster_size);
 }
 iov.iov_base = bounce_buffer;
-iov.iov_len = n * BDRV_SECTOR_SIZE;
+iov.iov_len = n;
 qemu_iovec_init_external(&bounce_qiov, &iov, 1);

 ret = blk_co_preadv(blk, start * job->cluster_size,
 bounce_qiov.size, &bounce_qiov,
 is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
 if (ret < 0) {
-trace_backup_do_cow_read_fail(job, start * bytes_per_cluster, ret);
+trace_backup_do_cow_read_fail(job, start * job->cluster_size, ret);
 if (error_is_read) {
 *error_is_read = true;
 }
@@ -157,7 +155,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
  job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
 }
 if (ret < 0) {
-trace_backup_do_cow_write_fail(job, start * bytes_per_cluster, ret);
+trace_backup_do_cow_write_fail(job, start * job->cluster_size, ret);
 if (error_is_read) {
 *error_is_read = false;
 }
@@ -169,8 +167,8 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 /* Publish progress, guest I/O counts as progress too.  Note that the
  * offset field is an opaque progress value, it is not a disk offset.
  */
-job->sectors_read += n;
-job->common.offset += n * BDRV_SECTOR_SIZE;
+job->bytes_read += n;
+job->common.offset += n;
 }

 out:
@@ -363,9 +361,8 @@ static bool coroutine_fn yield_and_check(BackupBlockJob 
*job)
  */
 if (job->common.speed) {
 uint64_t delay_ns = ratelimit_calculate_delay(&job->limit,
-  job->sectors_read *
-  BDRV_SECTOR_SIZE);
-job->sectors_read = 0;
+  job->bytes_read);
+job->bytes_read = 0;
 block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, delay_ns);
 } else {
 block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, 0);
-- 
2.9.4




[Qemu-block] [PATCH v3 16/20] backup: Switch backup_do_cow() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/backup.c | 62 --
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index cfbd921..c029d44 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -91,7 +91,7 @@ static void cow_request_end(CowRequest *req)
 }

 static int coroutine_fn backup_do_cow(BackupBlockJob *job,
-  int64_t sector_num, int nb_sectors,
+  int64_t offset, uint64_t bytes,
   bool *error_is_read,
   bool is_write_notifier)
 {
@@ -101,34 +101,28 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 QEMUIOVector bounce_qiov;
 void *bounce_buffer = NULL;
 int ret = 0;
-int64_t sectors_per_cluster = cluster_size_sectors(job);
-int64_t start, end; /* clusters */
+int64_t start, end; /* bytes */
 int n; /* bytes */

 qemu_co_rwlock_rdlock(&job->flush_rwlock);

-start = sector_num / sectors_per_cluster;
-end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+start = QEMU_ALIGN_DOWN(offset, job->cluster_size);
+end = QEMU_ALIGN_UP(bytes + offset, job->cluster_size);

-trace_backup_do_cow_enter(job, start * job->cluster_size,
-  sector_num * BDRV_SECTOR_SIZE,
-  nb_sectors * BDRV_SECTOR_SIZE);
+trace_backup_do_cow_enter(job, start, offset, bytes);

-wait_for_overlapping_requests(job, start * job->cluster_size,
-  end * job->cluster_size);
-cow_request_begin(&cow_request, job, start * job->cluster_size,
-  end * job->cluster_size);
+wait_for_overlapping_requests(job, start, end);
+cow_request_begin(&cow_request, job, start, end);

-for (; start < end; start++) {
-if (test_bit(start, job->done_bitmap)) {
-trace_backup_do_cow_skip(job, start * job->cluster_size);
+for (; start < end; start += job->cluster_size) {
+if (test_bit(start / job->cluster_size, job->done_bitmap)) {
+trace_backup_do_cow_skip(job, start);
 continue; /* already copied */
 }

-trace_backup_do_cow_process(job, start * job->cluster_size);
+trace_backup_do_cow_process(job, start);

-n = MIN(job->cluster_size,
-job->common.len - start * job->cluster_size);
+n = MIN(job->cluster_size, job->common.len - start);

 if (!bounce_buffer) {
 bounce_buffer = blk_blockalign(blk, job->cluster_size);
@@ -137,11 +131,10 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 iov.iov_len = n;
 qemu_iovec_init_external(&bounce_qiov, &iov, 1);

-ret = blk_co_preadv(blk, start * job->cluster_size,
-bounce_qiov.size, &bounce_qiov,
+ret = blk_co_preadv(blk, start, bounce_qiov.size, &bounce_qiov,
 is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
 if (ret < 0) {
-trace_backup_do_cow_read_fail(job, start * job->cluster_size, ret);
+trace_backup_do_cow_read_fail(job, start, ret);
 if (error_is_read) {
 *error_is_read = true;
 }
@@ -149,22 +142,22 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 }

 if (buffer_is_zero(iov.iov_base, iov.iov_len)) {
-ret = blk_co_pwrite_zeroes(job->target, start * job->cluster_size,
+ret = blk_co_pwrite_zeroes(job->target, start,
bounce_qiov.size, BDRV_REQ_MAY_UNMAP);
 } else {
-ret = blk_co_pwritev(job->target, start * job->cluster_size,
+ret = blk_co_pwritev(job->target, start,
  bounce_qiov.size, &bounce_qiov,
  job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
 }
 if (ret < 0) {
-trace_backup_do_cow_write_fail(job, start * job->cluster_size, 
ret);
+trace_backup_do_cow_write_fail(job, start, ret);
 if (error_is_read) {
 *error_is_read = false;
 }
 goto out;
 }

-set_bit(start, job->done_bitmap);
+set_bit(start / job->cluster_size, job->done_bitmap);

 /* Publish progress, guest I/O counts as progress too.  Note that the
  * offset field is an opaque progress value, it is not a disk offset.
@@ -180,8 +173,7 @@ out:

 cow_request_end(_request);

-trace_backup_do_cow_return(job, sector_num * BDRV_SECTOR_SIZE,
-   nb_sectors * BDRV_SECTOR_SIZE, ret);
+

[Qemu-block] [PATCH v3 12/20] mirror: Switch mirror_iteration() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the internal
loop iteration of mirroring to track by bytes instead of sectors
(although we are still guaranteed that we iterate by steps that
are both sector-aligned and multiples of the granularity).  Drop
the now-unused mirror_clip_sectors().

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v3: rebase to Paolo's thread-safety changes, R-b kept
v2: straightforward rebase to earlier mirror_clip_bytes() change, R-b kept
---
 block/mirror.c | 105 +
 1 file changed, 46 insertions(+), 59 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 81ff784..0eb2af4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -184,15 +184,6 @@ static inline int64_t mirror_clip_bytes(MirrorBlockJob *s,
 return MIN(bytes, s->bdev_length - offset);
 }

-/* Clip nb_sectors relative to sector_num to not exceed end-of-file */
-static inline int mirror_clip_sectors(MirrorBlockJob *s,
-  int64_t sector_num,
-  int nb_sectors)
-{
-return MIN(nb_sectors,
-   s->bdev_length / BDRV_SECTOR_SIZE - sector_num);
-}
-
 /* Round offset and/or bytes to target cluster if COW is needed, and
  * return the offset of the adjusted tail against original. */
 static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
@@ -336,30 +327,28 @@ static void mirror_do_zero_or_discard(MirrorBlockJob *s,
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
 BlockDriverState *source = s->source;
-int64_t sector_num, first_chunk;
+int64_t offset, first_chunk;
 uint64_t delay_ns = 0;
 /* At least the first dirty chunk is mirrored in one iteration. */
 int nb_chunks = 1;
-int64_t end = s->bdev_length / BDRV_SECTOR_SIZE;
 int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
 bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
 int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);

 bdrv_dirty_bitmap_lock(s->dirty_bitmap);
-sector_num = bdrv_dirty_iter_next(s->dbi);
-if (sector_num < 0) {
+offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+if (offset < 0) {
 bdrv_set_dirty_iter(s->dbi, 0);
-sector_num = bdrv_dirty_iter_next(s->dbi);
+offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
 trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) *
   BDRV_SECTOR_SIZE);
-assert(sector_num >= 0);
+assert(offset >= 0);
 }
 bdrv_dirty_bitmap_unlock(s->dirty_bitmap);

-first_chunk = sector_num / sectors_per_chunk;
+first_chunk = offset / s->granularity;
 while (test_bit(first_chunk, s->in_flight_bitmap)) {
-trace_mirror_yield_in_flight(s, sector_num * BDRV_SECTOR_SIZE,
- s->in_flight);
+trace_mirror_yield_in_flight(s, offset, s->in_flight);
 mirror_wait_for_io(s);
 }

@@ -368,25 +357,26 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 /* Find the number of consective dirty chunks following the first dirty
  * one, and wait for in flight requests in them. */
 bdrv_dirty_bitmap_lock(s->dirty_bitmap);
-while (nb_chunks * sectors_per_chunk < (s->buf_size >> BDRV_SECTOR_BITS)) {
+while (nb_chunks * s->granularity < s->buf_size) {
 int64_t next_dirty;
-int64_t next_sector = sector_num + nb_chunks * sectors_per_chunk;
-int64_t next_chunk = next_sector / sectors_per_chunk;
-if (next_sector >= end ||
-!bdrv_get_dirty_locked(source, s->dirty_bitmap, next_sector)) {
+int64_t next_offset = offset + nb_chunks * s->granularity;
+int64_t next_chunk = next_offset / s->granularity;
+if (next_offset >= s->bdev_length ||
+!bdrv_get_dirty_locked(source, s->dirty_bitmap,
+   next_offset >> BDRV_SECTOR_BITS)) {
 break;
 }
 if (test_bit(next_chunk, s->in_flight_bitmap)) {
 break;
 }

-next_dirty = bdrv_dirty_iter_next(s->dbi);
-if (next_dirty > next_sector || next_dirty < 0) {
+next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+if (next_dirty > next_offset || next_dirty < 0) {
 /* The bitmap iterator's cache is stale, refresh it */
-bdrv_set_dirty_iter(s->dbi, next_sector);
-next_dirty = bdrv_dirty_iter_next(s->dbi);
+bdrv_set_dirty_iter(s->dbi, next_offset >> BDRV_SECTOR_BITS);
+next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
 }
-assert(next_dirty == next_sector);
+assert(next_dirty == next_offset);
 nb_chunks++;
 }

@@ -394,14 

[Qemu-block] [PATCH v3 13/20] block: Drop unused bdrv_round_sectors_to_clusters()

2017-06-27 Thread Eric Blake
Now that the last user [mirror_iteration()] has converted to using
bytes, we no longer need a function to round sectors to clusters.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: hoist to earlier series, no change
---
 include/block/block.h |  4 
 block/io.c| 21 -
 2 files changed, 25 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 3e91cac..5cdd690 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -473,10 +473,6 @@ const char *bdrv_get_device_or_node_name(const BlockDriverState *bs);
 int bdrv_get_flags(BlockDriverState *bs);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs);
-void bdrv_round_sectors_to_clusters(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors,
-int64_t *cluster_sector_num,
-int *cluster_nb_sectors);
 void bdrv_round_to_clusters(BlockDriverState *bs,
 int64_t offset, unsigned int bytes,
 int64_t *cluster_offset,
diff --git a/block/io.c b/block/io.c
index c72d701..d9fec1f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -419,27 +419,6 @@ static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 }

 /**
- * Round a region to cluster boundaries (sector-based)
- */
-void bdrv_round_sectors_to_clusters(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors,
-int64_t *cluster_sector_num,
-int *cluster_nb_sectors)
-{
-BlockDriverInfo bdi;
-
-if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
-*cluster_sector_num = sector_num;
-*cluster_nb_sectors = nb_sectors;
-} else {
-int64_t c = bdi.cluster_size / BDRV_SECTOR_SIZE;
-*cluster_sector_num = QEMU_ALIGN_DOWN(sector_num, c);
-*cluster_nb_sectors = QEMU_ALIGN_UP(sector_num - *cluster_sector_num +
-nb_sectors, c);
-}
-}
-
-/**
  * Round a region to cluster boundaries
  */
 void bdrv_round_to_clusters(BlockDriverState *bs,
-- 
2.9.4




[Qemu-block] [PATCH v3 07/20] mirror: Switch MirrorBlockJob to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Continue by converting an
internal structure (no semantic change), and all references to the
buffer size.

[checkpatch has a false positive on use of MIN() in this patch]
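
One quick way to see that the buffer-size change below is
value-preserving (a standalone check; BDRV_SECTOR_BITS is 9 in QEMU):

    #include <assert.h>
    #define BDRV_SECTOR_BITS 9
    #define BDRV_SECTOR_SIZE (1 << BDRV_SECTOR_BITS)
    #define MAX_IN_FLIGHT 16
    #define MAX_IO_SECTORS ((1 << 20) >> BDRV_SECTOR_BITS)  /* old */
    #define MAX_IO_BYTES (1 << 20)                          /* new */
    int main(void)
    {
        /* both spellings of DEFAULT_MIRROR_BUF_SIZE come to 16 MiB */
        assert(MAX_IN_FLIGHT * MAX_IO_SECTORS * BDRV_SECTOR_SIZE ==
               MAX_IN_FLIGHT * MAX_IO_BYTES);
        return 0;
    }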

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/mirror.c | 79 --
 1 file changed, 38 insertions(+), 41 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index b4dfe95..9e28d59 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -24,9 +24,8 @@

 #define SLICE_TIME1ULL /* ns */
 #define MAX_IN_FLIGHT 16
-#define MAX_IO_SECTORS ((1 << 20) >> BDRV_SECTOR_BITS) /* 1 Mb */
-#define DEFAULT_MIRROR_BUF_SIZE \
-(MAX_IN_FLIGHT * MAX_IO_SECTORS * BDRV_SECTOR_SIZE)
+#define MAX_IO_BYTES (1 << 20) /* 1 Mb */
+#define DEFAULT_MIRROR_BUF_SIZE (MAX_IN_FLIGHT * MAX_IO_BYTES)

 /* The mirroring buffer is a list of granularity-sized chunks.
  * Free chunks are organized in a list.
@@ -67,11 +66,11 @@ typedef struct MirrorBlockJob {
 uint64_t last_pause_ns;
 unsigned long *in_flight_bitmap;
 int in_flight;
-int64_t sectors_in_flight;
+int64_t bytes_in_flight;
 int ret;
 bool unmap;
 bool waiting_for_io;
-int target_cluster_sectors;
+int target_cluster_size;
 int max_iov;
 bool initial_zeroing_ongoing;
 } MirrorBlockJob;
@@ -79,8 +78,8 @@ typedef struct MirrorBlockJob {
 typedef struct MirrorOp {
 MirrorBlockJob *s;
 QEMUIOVector qiov;
-int64_t sector_num;
-int nb_sectors;
+int64_t offset;
+uint64_t bytes;
 } MirrorOp;

 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
@@ -101,13 +100,12 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
 MirrorBlockJob *s = op->s;
 struct iovec *iov;
 int64_t chunk_num;
-int i, nb_chunks, sectors_per_chunk;
+int i, nb_chunks;

-trace_mirror_iteration_done(s, op->sector_num * BDRV_SECTOR_SIZE,
-op->nb_sectors * BDRV_SECTOR_SIZE, ret);
+trace_mirror_iteration_done(s, op->offset, op->bytes, ret);

 s->in_flight--;
-s->sectors_in_flight -= op->nb_sectors;
+s->bytes_in_flight -= op->bytes;
 iov = op->qiov.iov;
 for (i = 0; i < op->qiov.niov; i++) {
 MirrorBuffer *buf = (MirrorBuffer *) iov[i].iov_base;
@@ -115,16 +113,15 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
 s->buf_free_count++;
 }

-sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
-chunk_num = op->sector_num / sectors_per_chunk;
-nb_chunks = DIV_ROUND_UP(op->nb_sectors, sectors_per_chunk);
+chunk_num = op->offset / s->granularity;
+nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
 bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
 if (ret >= 0) {
 if (s->cow_bitmap) {
 bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
 }
 if (!s->initial_zeroing_ongoing) {
-s->common.offset += (uint64_t)op->nb_sectors * BDRV_SECTOR_SIZE;
+s->common.offset += op->bytes;
 }
 }
 qemu_iovec_destroy(&op->qiov);
@@ -144,7 +141,8 @@ static void mirror_write_complete(void *opaque, int ret)
 if (ret < 0) {
 BlockErrorAction action;

-bdrv_set_dirty_bitmap(s->dirty_bitmap, op->sector_num, op->nb_sectors);
+bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset >> BDRV_SECTOR_BITS,
+  op->bytes >> BDRV_SECTOR_BITS);
 action = mirror_error_action(s, false, -ret);
 if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
 s->ret = ret;
@@ -163,7 +161,8 @@ static void mirror_read_complete(void *opaque, int ret)
 if (ret < 0) {
 BlockErrorAction action;

-bdrv_set_dirty_bitmap(s->dirty_bitmap, op->sector_num, op->nb_sectors);
+bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset >> BDRV_SECTOR_BITS,
+  op->bytes >> BDRV_SECTOR_BITS);
 action = mirror_error_action(s, true, -ret);
 if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
 s->ret = ret;
@@ -171,7 +170,7 @@ static void mirror_read_complete(void *opaque, int ret)

 mirror_iteration_done(op, ret);
 } else {
-blk_aio_pwritev(s->target, op->sector_num * BDRV_SECTOR_SIZE, &op->qiov,
+blk_aio_pwritev(s->target, op->offset, &op->qiov,
 0, mirror_write_complete, op);
 }
 aio_context_release(blk_get_aio_context(s->common.blk));
@@ -211,7 +210,8 @@ static int mirror_cow_align(MirrorBlockJob *s,
 align_nb_sectors = max_sectors;
 if (need_cow) {
 align_nb_sectors = QEMU_ALIGN_DOWN(align_nb_sectors,
-   s->target_cluster_sectors);
+   s->target_cluster_size >>
+   BDRV_SECTOR_BITS);

[Qemu-block] [PATCH v3 06/20] commit: Switch commit_run() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the internal
loop iteration of committing to track by bytes instead of sectors
(although we are still guaranteed that we iterate by steps that
are sector-aligned).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/commit.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/block/commit.c b/block/commit.c
index 6f67d78..c3a7bca 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -143,17 +143,16 @@ static void coroutine_fn commit_run(void *opaque)
 {
 CommitBlockJob *s = opaque;
 CommitCompleteData *data;
-int64_t sector_num, end;
+int64_t offset;
 uint64_t delay_ns = 0;
 int ret = 0;
-int n = 0;
+int n = 0; /* sectors */
 void *buf = NULL;
 int bytes_written = 0;
 int64_t base_len;

 ret = s->common.len = blk_getlength(s->top);

-
 if (s->common.len < 0) {
 goto out;
 }
@@ -170,10 +169,9 @@ static void coroutine_fn commit_run(void *opaque)
 }
 }

-end = s->common.len >> BDRV_SECTOR_BITS;
 buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);

-for (sector_num = 0; sector_num < end; sector_num += n) {
+for (offset = 0; offset < s->common.len; offset += n * BDRV_SECTOR_SIZE) {
 bool copy;

 /* Note that even when no rate limit is applied we need to yield
@@ -185,15 +183,13 @@ static void coroutine_fn commit_run(void *opaque)
 }
 /* Copy if allocated above the base */
 ret = bdrv_is_allocated_above(blk_bs(s->top), blk_bs(s->base),
-  sector_num,
+  offset / BDRV_SECTOR_SIZE,
   COMMIT_BUFFER_SIZE / BDRV_SECTOR_SIZE,
  &n);
 copy = (ret == 1);
-trace_commit_one_iteration(s, sector_num * BDRV_SECTOR_SIZE,
-   n * BDRV_SECTOR_SIZE, ret);
+trace_commit_one_iteration(s, offset, n * BDRV_SECTOR_SIZE, ret);
 if (copy) {
-ret = commit_populate(s->top, s->base,
-  sector_num * BDRV_SECTOR_SIZE,
+ret = commit_populate(s->top, s->base, offset,
   n * BDRV_SECTOR_SIZE, buf);
 bytes_written += n * BDRV_SECTOR_SIZE;
 }
-- 
2.9.4




[Qemu-block] [PATCH v3 09/20] mirror: Update signature of mirror_clip_sectors()

2017-06-27 Thread Eric Blake
Rather than having a void function that modifies its input
in-place as the output, change the signature to reduce a layer
of indirection and return the result.
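
The same refactor in miniature (a generic sketch, not the mirror code):

    #include <stdint.h>
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* before: the result comes back through a pointer parameter */
    static void clip_old(int64_t len, int64_t pos, int *n)
    {
        *n = MIN(*n, len - pos);
    }

    /* after: the result is simply returned, so a call site reads
     * n = clip_new(len, pos, n); -- one less level of indirection */
    static int clip_new(int64_t len, int64_t pos, int n)
    {
        return MIN(n, len - pos);
    }

    int main(void)
    {
        int n = 100;
        clip_old(512, 480, &n);             /* n becomes 32 */
        return n == clip_new(512, 480, 100) ? 0 : 1;
    }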

Suggested-by: John Snow 
Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: new patch
---
 block/mirror.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index af27bcc..1a43304 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -176,12 +176,12 @@ static void mirror_read_complete(void *opaque, int ret)
 aio_context_release(blk_get_aio_context(s->common.blk));
 }

-static inline void mirror_clip_sectors(MirrorBlockJob *s,
-   int64_t sector_num,
-   int *nb_sectors)
+static inline int mirror_clip_sectors(MirrorBlockJob *s,
+  int64_t sector_num,
+  int nb_sectors)
 {
-*nb_sectors = MIN(*nb_sectors,
-  s->bdev_length / BDRV_SECTOR_SIZE - sector_num);
+return MIN(nb_sectors,
+   s->bdev_length / BDRV_SECTOR_SIZE - sector_num);
 }

 /* Round sector_num and/or nb_sectors to target cluster if COW is needed, and
@@ -216,7 +216,8 @@ static int mirror_cow_align(MirrorBlockJob *s,
 }
 /* Clipping may result in align_nb_sectors unaligned to chunk boundary, but
  * that doesn't matter because it's already the end of source image. */
-mirror_clip_sectors(s, align_sector_num, &align_nb_sectors);
+align_nb_sectors = mirror_clip_sectors(s, align_sector_num,
+   align_nb_sectors);

 ret = align_sector_num + align_nb_sectors - (*sector_num + *nb_sectors);
 *sector_num = align_sector_num;
@@ -445,7 +446,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 return 0;
 }

-mirror_clip_sectors(s, sector_num, &io_sectors);
+io_sectors = mirror_clip_sectors(s, sector_num, io_sectors);
 switch (mirror_method) {
 case MIRROR_METHOD_COPY:
 io_sectors = mirror_do_read(s, sector_num, io_sectors);
-- 
2.9.4




[Qemu-block] [PATCH v3 11/20] mirror: Switch mirror_do_read() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change).
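
A worked example of the clamping the converted function performs up
front, with hypothetical values (buf_size, granularity and max_iov as
named in the diff below):

    #include <assert.h>
    #include <stdint.h>
    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    int main(void)
    {
        uint64_t bytes = 3 << 20;            /* caller asks for 3 MiB  */
        uint64_t buf_size = 1 << 20;         /* mirror buffer: 1 MiB   */
        uint64_t max_bytes = 65536 * 8;      /* granularity * max_iov  */
        bytes = MIN(buf_size, MIN(max_bytes, bytes));
        assert(bytes == 512 * 1024);         /* clamped to 512 KiB     */
        return 0;
    }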

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: rebase to earlier changes
---
 block/mirror.c | 75 ++
 1 file changed, 33 insertions(+), 42 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 1a4602a..81ff784 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -196,7 +196,7 @@ static inline int mirror_clip_sectors(MirrorBlockJob *s,
 /* Round offset and/or bytes to target cluster if COW is needed, and
  * return the offset of the adjusted tail against original. */
 static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
-unsigned int *bytes)
+uint64_t *bytes)
 {
 bool need_cow;
 int ret = 0;
@@ -204,6 +204,7 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
 unsigned int align_bytes = *bytes;
 int max_bytes = s->granularity * s->max_iov;

+assert(*bytes < INT_MAX);
 need_cow = !test_bit(*offset / s->granularity, s->cow_bitmap);
 need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
   s->cow_bitmap);
@@ -239,59 +240,50 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
 }

 /* Submit async read while handling COW.
- * Returns: The number of sectors copied after and including sector_num,
- *  excluding any sectors copied prior to sector_num due to alignment.
- *  This will be nb_sectors if no alignment is necessary, or
- *  (new_end - sector_num) if tail is rounded up or down due to
+ * Returns: The number of bytes copied after and including offset,
+ *  excluding any bytes copied prior to offset due to alignment.
+ *  This will be @bytes if no alignment is necessary, or
+ *  (new_end - offset) if tail is rounded up or down due to
  *  alignment or buffer limit.
  */
-static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
-  int nb_sectors)
+static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
+   uint64_t bytes)
 {
 BlockBackend *source = s->common.blk;
-int sectors_per_chunk, nb_chunks;
-int ret;
+int nb_chunks;
+uint64_t ret;
 MirrorOp *op;
-int max_sectors;
+uint64_t max_bytes;

-sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
-max_sectors = sectors_per_chunk * s->max_iov;
+max_bytes = s->granularity * s->max_iov;

 /* We can only handle as much as buf_size at a time. */
-nb_sectors = MIN(s->buf_size >> BDRV_SECTOR_BITS, nb_sectors);
-nb_sectors = MIN(max_sectors, nb_sectors);
-assert(nb_sectors);
-assert(nb_sectors < BDRV_REQUEST_MAX_SECTORS);
-ret = nb_sectors;
+bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
+assert(bytes);
+assert(bytes < BDRV_REQUEST_MAX_BYTES);
+ret = bytes;

 if (s->cow_bitmap) {
-int64_t offset = sector_num * BDRV_SECTOR_SIZE;
-unsigned int bytes = nb_sectors * BDRV_SECTOR_SIZE;
-int gap;
-
-gap = mirror_cow_align(s, &offset, &bytes);
-sector_num = offset / BDRV_SECTOR_SIZE;
-nb_sectors = bytes / BDRV_SECTOR_SIZE;
-ret += gap / BDRV_SECTOR_SIZE;
+ret += mirror_cow_align(s, &offset, &bytes);
 }
-assert(nb_sectors << BDRV_SECTOR_BITS <= s->buf_size);
-/* The sector range must meet granularity because:
+assert(bytes <= s->buf_size);
+/* The range will be sector-aligned because:
  * 1) Caller passes in aligned values;
- * 2) mirror_cow_align is used only when target cluster is larger. */
-assert(!(sector_num % sectors_per_chunk));
-nb_chunks = DIV_ROUND_UP(nb_sectors, sectors_per_chunk);
+ * 2) mirror_cow_align is used only when target cluster is larger.
+ * But it might not be cluster-aligned at end-of-file. */
+assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
+nb_chunks = DIV_ROUND_UP(bytes, s->granularity);

 while (s->buf_free_count < nb_chunks) {
-trace_mirror_yield_in_flight(s, sector_num * BDRV_SECTOR_SIZE,
- s->in_flight);
+trace_mirror_yield_in_flight(s, offset, s->in_flight);
 mirror_wait_for_io(s);
 }

 /* Allocate a MirrorOp that is used as an AIO callback.  */
 op = g_new(MirrorOp, 1);
 op->s = s;
-op->offset = sector_num * BDRV_SECTOR_SIZE;
-op->bytes = nb_sectors * BDRV_SECTOR_SIZE;
+op->offset = offset;
+op->bytes = bytes;

 /* Now make a QEMUIOVector taking enough granularity-sized chunks
  * from s->buf_free.
@@ -299,7 +291,7 @@ static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
 qemu_iovec_init(&op->qiov, nb_chunks);
 while (nb_chunks-- > 0) {
 MirrorBuffer *buf = QSIMPLEQ_FIRST(&s->buf_free);
-size_t remaining = nb_sectors * BDRV_SECTOR_SIZE - op->qiov.size;

[Qemu-block] [PATCH v3 08/20] mirror: Switch mirror_do_zero_or_discard() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/mirror.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 9e28d59..af27bcc 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -305,8 +305,8 @@ static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
 }

 static void mirror_do_zero_or_discard(MirrorBlockJob *s,
-  int64_t sector_num,
-  int nb_sectors,
+  int64_t offset,
+  uint64_t bytes,
   bool is_discard)
 {
 MirrorOp *op;
@@ -315,16 +315,16 @@ static void mirror_do_zero_or_discard(MirrorBlockJob *s,
  * so the freeing in mirror_iteration_done is nop. */
 op = g_new0(MirrorOp, 1);
 op->s = s;
-op->offset = sector_num * BDRV_SECTOR_SIZE;
-op->bytes = nb_sectors * BDRV_SECTOR_SIZE;
+op->offset = offset;
+op->bytes = bytes;

 s->in_flight++;
-s->bytes_in_flight += nb_sectors * BDRV_SECTOR_SIZE;
+s->bytes_in_flight += bytes;
 if (is_discard) {
-blk_aio_pdiscard(s->target, sector_num << BDRV_SECTOR_BITS,
+blk_aio_pdiscard(s->target, offset,
  op->bytes, mirror_write_complete, op);
 } else {
-blk_aio_pwrite_zeroes(s->target, sector_num * BDRV_SECTOR_SIZE,
+blk_aio_pwrite_zeroes(s->target, offset,
   op->bytes, s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
   mirror_write_complete, op);
 }
@@ -453,7 +453,8 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 break;
 case MIRROR_METHOD_ZERO:
 case MIRROR_METHOD_DISCARD:
-mirror_do_zero_or_discard(s, sector_num, io_sectors,
+mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
+  io_sectors * BDRV_SECTOR_SIZE,
   mirror_method == MIRROR_METHOD_DISCARD);
 if (write_zeroes_ok) {
 io_bytes_acct = 0;
@@ -657,7 +658,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 continue;
 }

-mirror_do_zero_or_discard(s, sector_num, nb_sectors, false);
+mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
+  nb_sectors * BDRV_SECTOR_SIZE, false);
 sector_num += nb_sectors;
 }

-- 
2.9.4




[Qemu-block] [PATCH v3 10/20] mirror: Switch mirror_cow_align() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change), and add mirror_clip_bytes() as a
counterpart to mirror_clip_sectors().  Some of the conversion is
a bit tricky, requiring temporaries to convert between units; it
will be cleared up in a following patch.
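
A worked example of what the byte-based mirror_cow_align() computes
when COW is needed (hypothetical 64 KiB target clusters; not code from
the patch):

    #include <assert.h>
    #include <stdint.h>
    int main(void)
    {
        int64_t cluster = 64 * 1024;
        int64_t offset = 70000, bytes = 10000;  /* request [70000, 80000) */
        int64_t align_offset = offset / cluster * cluster;       /*  65536 */
        int64_t align_end = (offset + bytes + cluster - 1)
                            / cluster * cluster;                 /* 131072 */
        int64_t align_bytes = align_end - align_offset;          /*  65536 */
        /* tail shift handed back to the caller, as in the patch */
        int64_t ret = align_offset + align_bytes - (offset + bytes);
        assert(ret == 51072);
        return 0;
    }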

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: tweak mirror_clip_bytes() signature to match previous patch
---
 block/mirror.c | 64 ++
 1 file changed, 38 insertions(+), 26 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 1a43304..1a4602a 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -176,6 +176,15 @@ static void mirror_read_complete(void *opaque, int ret)
 aio_context_release(blk_get_aio_context(s->common.blk));
 }

+/* Clip bytes relative to offset to not exceed end-of-file */
+static inline int64_t mirror_clip_bytes(MirrorBlockJob *s,
+int64_t offset,
+int64_t bytes)
+{
+return MIN(bytes, s->bdev_length - offset);
+}
+
+/* Clip nb_sectors relative to sector_num to not exceed end-of-file */
 static inline int mirror_clip_sectors(MirrorBlockJob *s,
   int64_t sector_num,
   int nb_sectors)
@@ -184,44 +193,39 @@ static inline int mirror_clip_sectors(MirrorBlockJob *s,
s->bdev_length / BDRV_SECTOR_SIZE - sector_num);
 }

-/* Round sector_num and/or nb_sectors to target cluster if COW is needed, and
- * return the offset of the adjusted tail sector against original. */
-static int mirror_cow_align(MirrorBlockJob *s,
-int64_t *sector_num,
-int *nb_sectors)
+/* Round offset and/or bytes to target cluster if COW is needed, and
+ * return the offset of the adjusted tail against original. */
+static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
+unsigned int *bytes)
 {
 bool need_cow;
 int ret = 0;
-int chunk_sectors = s->granularity >> BDRV_SECTOR_BITS;
-int64_t align_sector_num = *sector_num;
-int align_nb_sectors = *nb_sectors;
-int max_sectors = chunk_sectors * s->max_iov;
+int64_t align_offset = *offset;
+unsigned int align_bytes = *bytes;
+int max_bytes = s->granularity * s->max_iov;

-need_cow = !test_bit(*sector_num / chunk_sectors, s->cow_bitmap);
-need_cow |= !test_bit((*sector_num + *nb_sectors - 1) / chunk_sectors,
+need_cow = !test_bit(*offset / s->granularity, s->cow_bitmap);
+need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
   s->cow_bitmap);
 if (need_cow) {
-bdrv_round_sectors_to_clusters(blk_bs(s->target), *sector_num,
-   *nb_sectors, &align_sector_num,
-   &align_nb_sectors);
+bdrv_round_to_clusters(blk_bs(s->target), *offset, *bytes,
+   &align_offset, &align_bytes);
 }

-if (align_nb_sectors > max_sectors) {
-align_nb_sectors = max_sectors;
+if (align_bytes > max_bytes) {
+align_bytes = max_bytes;
 if (need_cow) {
-align_nb_sectors = QEMU_ALIGN_DOWN(align_nb_sectors,
-   s->target_cluster_size >>
-   BDRV_SECTOR_BITS);
+align_bytes = QEMU_ALIGN_DOWN(align_bytes,
+  s->target_cluster_size);
 }
 }
-/* Clipping may result in align_nb_sectors unaligned to chunk boundary, but
+/* Clipping may result in align_bytes unaligned to chunk boundary, but
  * that doesn't matter because it's already the end of source image. */
-align_nb_sectors = mirror_clip_sectors(s, align_sector_num,
-   align_nb_sectors);
+align_bytes = mirror_clip_bytes(s, align_offset, align_bytes);

-ret = align_sector_num + align_nb_sectors - (*sector_num + *nb_sectors);
-*sector_num = align_sector_num;
-*nb_sectors = align_nb_sectors;
+ret = align_offset + align_bytes - (*offset + *bytes);
+*offset = align_offset;
+*bytes = align_bytes;
 assert(ret >= 0);
 return ret;
 }
@@ -257,10 +261,18 @@ static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
 nb_sectors = MIN(s->buf_size >> BDRV_SECTOR_BITS, nb_sectors);
 nb_sectors = MIN(max_sectors, nb_sectors);
 assert(nb_sectors);
+assert(nb_sectors < BDRV_REQUEST_MAX_SECTORS);
 ret = nb_sectors;

 if (s->cow_bitmap) {
-ret += mirror_cow_align(s, &sector_num, &nb_sectors);
+int64_t offset = sector_num * BDRV_SECTOR_SIZE;
+unsigned int bytes = nb_sectors * BDRV_SECTOR_SIZE;
+int gap;
+
+gap = mirror_cow_align(s, &offset, &bytes);

[Qemu-block] [PATCH v3 05/20] commit: Switch commit_populate() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Start by converting an
internal function (no semantic change).
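
(The conversion pattern here is identical to the stream_populate() one;
see the sketch under patch 03 for why the SIZE_MAX assertion below is
needed before the byte count flows into an iovec.)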

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/commit.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/commit.c b/block/commit.c
index 4cda7f2..6f67d78 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -47,26 +47,25 @@ typedef struct CommitBlockJob {
 } CommitBlockJob;

 static int coroutine_fn commit_populate(BlockBackend *bs, BlockBackend *base,
-int64_t sector_num, int nb_sectors,
+int64_t offset, uint64_t bytes,
 void *buf)
 {
 int ret = 0;
 QEMUIOVector qiov;
 struct iovec iov = {
 .iov_base = buf,
-.iov_len = nb_sectors * BDRV_SECTOR_SIZE,
+.iov_len = bytes,
 };

+assert(bytes < SIZE_MAX);
 qemu_iovec_init_external(&qiov, &iov, 1);

-ret = blk_co_preadv(bs, sector_num * BDRV_SECTOR_SIZE,
-qiov.size, &qiov, 0);
+ret = blk_co_preadv(bs, offset, qiov.size, &qiov, 0);
 if (ret < 0) {
 return ret;
 }

-ret = blk_co_pwritev(base, sector_num * BDRV_SECTOR_SIZE,
- qiov.size, &qiov, 0);
+ret = blk_co_pwritev(base, offset, qiov.size, &qiov, 0);
 if (ret < 0) {
 return ret;
 }
@@ -193,7 +192,9 @@ static void coroutine_fn commit_run(void *opaque)
 trace_commit_one_iteration(s, sector_num * BDRV_SECTOR_SIZE,
n * BDRV_SECTOR_SIZE, ret);
 if (copy) {
-ret = commit_populate(s->top, s->base, sector_num, n, buf);
+ret = commit_populate(s->top, s->base,
+  sector_num * BDRV_SECTOR_SIZE,
+  n * BDRV_SECTOR_SIZE, buf);
 bytes_written += n * BDRV_SECTOR_SIZE;
 }
 if (ret < 0) {
-- 
2.9.4




[Qemu-block] [PATCH v3 04/20] stream: Switch stream_run() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the internal
loop iteration of streaming to track by bytes instead of sectors
(although we are still guaranteed that we iterate by steps that
are sector-aligned).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/stream.c | 24 ++--
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 746d525..2f9618b 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -108,12 +108,11 @@ static void coroutine_fn stream_run(void *opaque)
 BlockBackend *blk = s->common.blk;
 BlockDriverState *bs = blk_bs(blk);
 BlockDriverState *base = s->base;
-int64_t sector_num = 0;
-int64_t end = -1;
+int64_t offset = 0;
 uint64_t delay_ns = 0;
 int error = 0;
 int ret = 0;
-int n = 0;
+int n = 0; /* sectors */
 void *buf;

 if (!bs->backing) {
@@ -126,7 +125,6 @@ static void coroutine_fn stream_run(void *opaque)
 goto out;
 }

-end = s->common.len >> BDRV_SECTOR_BITS;
 buf = qemu_blockalign(bs, STREAM_BUFFER_SIZE);

 /* Turn on copy-on-read for the whole block device so that guest read
@@ -138,7 +136,7 @@ static void coroutine_fn stream_run(void *opaque)
 bdrv_enable_copy_on_read(bs);
 }

-for (sector_num = 0; sector_num < end; sector_num += n) {
+for ( ; offset < s->common.len; offset += n * BDRV_SECTOR_SIZE) {
 bool copy;

 /* Note that even when no rate limit is applied we need to yield
@@ -151,28 +149,26 @@ static void coroutine_fn stream_run(void *opaque)

 copy = false;

-ret = bdrv_is_allocated(bs, sector_num,
+ret = bdrv_is_allocated(bs, offset / BDRV_SECTOR_SIZE,
 STREAM_BUFFER_SIZE / BDRV_SECTOR_SIZE, &n);
 if (ret == 1) {
 /* Allocated in the top, no need to copy.  */
 } else if (ret >= 0) {
 /* Copy if allocated in the intermediate images.  Limit to the
- * known-unallocated area [sector_num, sector_num+n).  */
+ * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE).  */
 ret = bdrv_is_allocated_above(backing_bs(bs), base,
-  sector_num, n, &n);
+  offset / BDRV_SECTOR_SIZE, n, &n);

 /* Finish early if end of backing file has been reached */
 if (ret == 0 && n == 0) {
-n = end - sector_num;
+n = (s->common.len - offset) / BDRV_SECTOR_SIZE;
 }

 copy = (ret == 1);
 }
-trace_stream_one_iteration(s, sector_num * BDRV_SECTOR_SIZE,
-   n * BDRV_SECTOR_SIZE, ret);
+trace_stream_one_iteration(s, offset, n * BDRV_SECTOR_SIZE, ret);
 if (copy) {
-ret = stream_populate(blk, sector_num * BDRV_SECTOR_SIZE,
-  n * BDRV_SECTOR_SIZE, buf);
+ret = stream_populate(blk, offset, n * BDRV_SECTOR_SIZE, buf);
 }
 if (ret < 0) {
 BlockErrorAction action =
@@ -211,7 +207,7 @@ out:
 /* Modify backing chain and close BDSes in main loop */
 data = g_malloc(sizeof(*data));
 data->ret = ret;
-data->reached_end = sector_num == end;
+data->reached_end = offset == s->common.len;
 block_job_defer_to_main_loop(&s->common, stream_complete, data);
 }

-- 
2.9.4




[Qemu-block] [PATCH v3 02/20] trace: Show blockjob actions via bytes, not sectors

2017-06-27 Thread Eric Blake
Upcoming patches are going to switch to byte-based interfaces
instead of sector-based.  Even worse, trace_backup_do_cow_enter()
had a weird mix of cluster and sector indices.

The trace interface is low enough that there are no stability
guarantees, and therefore nothing wrong with changing our units,
even in cases like trace_backup_do_cow_skip() where we are not
changing the trace output.  So make the tracing uniformly use
bytes.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: improve commit message, no code change
---
 block/backup.c | 16 ++--
 block/commit.c |  3 ++-
 block/mirror.c | 26 +-
 block/stream.c |  3 ++-
 block/trace-events | 14 +++---
 5 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 9ca1d8e..06431ac 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -102,6 +102,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 void *bounce_buffer = NULL;
 int ret = 0;
 int64_t sectors_per_cluster = cluster_size_sectors(job);
+int64_t bytes_per_cluster = sectors_per_cluster * BDRV_SECTOR_SIZE;
 int64_t start, end;
 int n;

@@ -110,18 +111,20 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 start = sector_num / sectors_per_cluster;
 end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);

-trace_backup_do_cow_enter(job, start, sector_num, nb_sectors);
+trace_backup_do_cow_enter(job, start * bytes_per_cluster,
+  sector_num * BDRV_SECTOR_SIZE,
+  nb_sectors * BDRV_SECTOR_SIZE);

 wait_for_overlapping_requests(job, start, end);
 cow_request_begin(&cow_request, job, start, end);

 for (; start < end; start++) {
 if (test_bit(start, job->done_bitmap)) {
-trace_backup_do_cow_skip(job, start);
+trace_backup_do_cow_skip(job, start * bytes_per_cluster);
 continue; /* already copied */
 }

-trace_backup_do_cow_process(job, start);
+trace_backup_do_cow_process(job, start * bytes_per_cluster);

 n = MIN(sectors_per_cluster,
 job->common.len / BDRV_SECTOR_SIZE -
 start * sectors_per_cluster);
@@ -138,7 +141,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 bounce_qiov.size, &bounce_qiov,
 is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
 if (ret < 0) {
-trace_backup_do_cow_read_fail(job, start, ret);
+trace_backup_do_cow_read_fail(job, start * bytes_per_cluster, ret);
 if (error_is_read) {
 *error_is_read = true;
 }
@@ -154,7 +157,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
   job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
 }
 if (ret < 0) {
-trace_backup_do_cow_write_fail(job, start, ret);
+trace_backup_do_cow_write_fail(job, start * bytes_per_cluster, ret);
 if (error_is_read) {
 *error_is_read = false;
 }
@@ -177,7 +180,8 @@ out:

 cow_request_end(_request);

-trace_backup_do_cow_return(job, sector_num, nb_sectors, ret);
+trace_backup_do_cow_return(job, sector_num * BDRV_SECTOR_SIZE,
+   nb_sectors * BDRV_SECTOR_SIZE, ret);

 qemu_co_rwlock_unlock(&job->flush_rwlock);

diff --git a/block/commit.c b/block/commit.c
index 6993994..4cda7f2 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -190,7 +190,8 @@ static void coroutine_fn commit_run(void *opaque)
   COMMIT_BUFFER_SIZE / BDRV_SECTOR_SIZE,
  &n);
 copy = (ret == 1);
-trace_commit_one_iteration(s, sector_num, n, ret);
+trace_commit_one_iteration(s, sector_num * BDRV_SECTOR_SIZE,
+   n * BDRV_SECTOR_SIZE, ret);
 if (copy) {
 ret = commit_populate(s->top, s->base, sector_num, n, buf);
 bytes_written += n * BDRV_SECTOR_SIZE;
diff --git a/block/mirror.c b/block/mirror.c
index eb27efc..b4dfe95 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -103,7 +103,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
 int64_t chunk_num;
 int i, nb_chunks, sectors_per_chunk;

-trace_mirror_iteration_done(s, op->sector_num, op->nb_sectors, ret);
+trace_mirror_iteration_done(s, op->sector_num * BDRV_SECTOR_SIZE,
+op->nb_sectors * BDRV_SECTOR_SIZE, ret);

 s->in_flight--;
 s->sectors_in_flight -= op->nb_sectors;
@@ -268,7 +269,8 @@ static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
 nb_chunks = DIV_ROUND_UP(nb_sectors, sectors_per_chunk);

 while (s->buf_free_count < nb_chunks) {
-trace_mirror_yield_in_flight(s, sector_num, s->in_flight);
+trace_mirror_yield_in_flight(s, sector_num * BDRV_SECTOR_SIZE, s->in_flight);

[Qemu-block] [PATCH v3 03/20] stream: Switch stream_populate() to byte-based

2017-06-27 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Start by converting an
internal function (no semantic change).
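
The assert(bytes < SIZE_MAX) added below guards the 64-bit byte count
before it is narrowed into iov_len, which is a size_t and may be only
32 bits wide on some hosts; schematically (a sketch, not the QEMU
code):

    #include <assert.h>
    #include <stdint.h>
    #include <sys/uio.h>
    static struct iovec make_iov(void *buf, uint64_t bytes)
    {
        assert(bytes < SIZE_MAX);   /* catch truncation before it happens */
        struct iovec iov = {
            .iov_base = buf,
            .iov_len  = bytes,      /* implicit narrowing to size_t */
        };
        return iov;
    }
    int main(void)
    {
        char buf[512];
        return make_iov(buf, sizeof(buf)).iov_len == 512 ? 0 : 1;
    }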

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: no change
---
 block/stream.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 6cb3939..746d525 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -41,20 +41,20 @@ typedef struct StreamBlockJob {
 } StreamBlockJob;

 static int coroutine_fn stream_populate(BlockBackend *blk,
-int64_t sector_num, int nb_sectors,
+int64_t offset, uint64_t bytes,
 void *buf)
 {
 struct iovec iov = {
 .iov_base = buf,
-.iov_len  = nb_sectors * BDRV_SECTOR_SIZE,
+.iov_len  = bytes,
 };
 QEMUIOVector qiov;

+assert(bytes < SIZE_MAX);
 qemu_iovec_init_external(&qiov, &iov, 1);

 /* Copy-on-read the unallocated clusters */
-return blk_co_preadv(blk, sector_num * BDRV_SECTOR_SIZE, qiov.size, &qiov,
- BDRV_REQ_COPY_ON_READ);
+return blk_co_preadv(blk, offset, qiov.size, &qiov, BDRV_REQ_COPY_ON_READ);
 }

 typedef struct {
@@ -171,7 +171,8 @@ static void coroutine_fn stream_run(void *opaque)
 trace_stream_one_iteration(s, sector_num * BDRV_SECTOR_SIZE,
n * BDRV_SECTOR_SIZE, ret);
 if (copy) {
-ret = stream_populate(blk, sector_num, n, buf);
+ret = stream_populate(blk, sector_num * BDRV_SECTOR_SIZE,
+  n * BDRV_SECTOR_SIZE, buf);
 }
 if (ret < 0) {
 BlockErrorAction action =
-- 
2.9.4




[Qemu-block] [PATCH v3 01/20] blockjob: Track job ratelimits via bytes, not sectors

2017-06-27 Thread Eric Blake
The user interface specifies job rate limits in bytes/second.
It's pointless to have our internal representation track things
in sectors/second, particularly since we want to move away from
sector-based interfaces.

Fix up a doc typo found while verifying that the ratelimit
code handles the scaling difference.

Repetition of expressions like 'n * BDRV_SECTOR_SIZE' will be
cleaned up later when functions are converted to iterate over
images by bytes rather than by sectors.
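
The ratelimit code itself needs no structural change because its delay
calculation only compares units sent against units allowed per slice;
feeding it bytes with a bytes/sec speed behaves like feeding it sectors
with a sectors/sec speed whenever the speed is a multiple of the sector
size. A simplified standalone model (not the actual ratelimit.h code):

    #include <assert.h>
    #include <stdint.h>
    /* toy model: true when this slice's quota is exhausted */
    static int over_quota(uint64_t sent, uint64_t speed_per_sec,
                          uint64_t slice_ns)
    {
        return sent > speed_per_sec * slice_ns / 1000000000ULL;
    }
    int main(void)
    {
        uint64_t slice_ns = 100000000ULL;            /* 100 ms slice */
        uint64_t n_sectors = 4096;
        uint64_t speed_bps = 10 << 20;               /* 10 MiB/s */
        assert(over_quota(n_sectors * 512, speed_bps, slice_ns) ==
               over_quota(n_sectors, speed_bps / 512, slice_ns));
        return 0;
    }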

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v2: adjust commit message based on review; no code change
---
 include/qemu/ratelimit.h |  3 ++-
 block/backup.c   |  5 +++--
 block/commit.c   |  5 +++--
 block/mirror.c   | 13 +++--
 block/stream.c   |  5 +++--
 5 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/qemu/ratelimit.h b/include/qemu/ratelimit.h
index 8da1232..8dece48 100644
--- a/include/qemu/ratelimit.h
+++ b/include/qemu/ratelimit.h
@@ -24,7 +24,8 @@ typedef struct {

 /** Calculate and return delay for next request in ns
  *
- * Record that we sent @p n data units. If we may send more data units
+ * Record that we sent @n data units (where @n matches the scale chosen
+ * during ratelimit_set_speed). If we may send more data units
  * in the current time slice, return 0 (i.e. no delay). Otherwise
  * return the amount of time (in ns) until the start of the next time
  * slice that will permit sending the next chunk of data.
diff --git a/block/backup.c b/block/backup.c
index 5387fbd..9ca1d8e 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -208,7 +208,7 @@ static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
 error_setg(errp, QERR_INVALID_PARAMETER, "speed");
 return;
 }
-ratelimit_set_speed(&job->limit, speed / BDRV_SECTOR_SIZE, SLICE_TIME);
+ratelimit_set_speed(&job->limit, speed, SLICE_TIME);
 }

 static void backup_cleanup_sync_bitmap(BackupBlockJob *job, int ret)
@@ -359,7 +359,8 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
  */
 if (job->common.speed) {
 uint64_t delay_ns = ratelimit_calculate_delay(&job->limit,
-  job->sectors_read);
+  job->sectors_read *
+  BDRV_SECTOR_SIZE);
 job->sectors_read = 0;
 block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, delay_ns);
 } else {
diff --git a/block/commit.c b/block/commit.c
index 524bd54..6993994 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -209,7 +209,8 @@ static void coroutine_fn commit_run(void *opaque)
 s->common.offset += n * BDRV_SECTOR_SIZE;

 if (copy && s->common.speed) {
-delay_ns = ratelimit_calculate_delay(&s->limit, n);
+delay_ns = ratelimit_calculate_delay(&s->limit,
+ n * BDRV_SECTOR_SIZE);
 }
 }

@@ -231,7 +232,7 @@ static void commit_set_speed(BlockJob *job, int64_t speed, Error **errp)
 error_setg(errp, QERR_INVALID_PARAMETER, "speed");
 return;
 }
-ratelimit_set_speed(&s->limit, speed / BDRV_SECTOR_SIZE, SLICE_TIME);
+ratelimit_set_speed(&s->limit, speed, SLICE_TIME);
 }

 static const BlockJobDriver commit_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index 61a862d..eb27efc 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -396,7 +396,8 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 bitmap_set(s->in_flight_bitmap, sector_num / sectors_per_chunk, nb_chunks);
 while (nb_chunks > 0 && sector_num < end) {
 int64_t ret;
-int io_sectors, io_sectors_acct;
+int io_sectors;
+int64_t io_bytes_acct;
 BlockDriverState *file;
 enum MirrorMethod {
 MIRROR_METHOD_COPY,
@@ -444,16 +445,16 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 switch (mirror_method) {
 case MIRROR_METHOD_COPY:
 io_sectors = mirror_do_read(s, sector_num, io_sectors);
-io_sectors_acct = io_sectors;
+io_bytes_acct = io_sectors * BDRV_SECTOR_SIZE;
 break;
 case MIRROR_METHOD_ZERO:
 case MIRROR_METHOD_DISCARD:
 mirror_do_zero_or_discard(s, sector_num, io_sectors,
   mirror_method == MIRROR_METHOD_DISCARD);
 if (write_zeroes_ok) {
-io_sectors_acct = 0;
+io_bytes_acct = 0;
 } else {
-io_sectors_acct = io_sectors;
+io_bytes_acct = io_sectors * BDRV_SECTOR_SIZE;
 }
 break;
 default:
@@ -463,7 +464,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 sector_num += io_sectors;
 nb_chunks -= DIV_ROUND_UP(io_sectors, sectors_per_chunk);

[Qemu-block] [PATCH v3 00/20] make bdrv_is_allocated[_above] byte-based

2017-06-27 Thread Eric Blake
There are patches floating around to add NBD_CMD_BLOCK_STATUS,
but NBD wants to report status on byte granularity (even if the
reporting will probably be naturally aligned to sectors or even
much higher levels).  I've therefore started the task of
converting our block status code to report at a byte granularity
rather than sectors.

This is part one of that conversion: bdrv_is_allocated().
Other parts still need a v3, but here's the link to their most
recent posting:
tracking dirty bitmaps by bytes:
https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg03859.html
replacing bdrv_get_block_status() with a byte based callback
in all the drivers:
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02642.html

Available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-allocated-v3

Depends on Kevin's block branch

Changes since v2 are limited to rebase artifacts (Paolo's conversion
to thread-safety being the biggest cause of context conflicts, and
also affecting patch 12 - but where I think the resolution is sane
enough to keep R-b).  Added R-b on patches where it has been given,
leaving 19/20 as the only unreviewed patch.

001/20:[] [--] 'blockjob: Track job ratelimits via bytes, not sectors'
002/20:[] [-C] 'trace: Show blockjob actions via bytes, not sectors'
003/20:[] [--] 'stream: Switch stream_populate() to byte-based'
004/20:[] [--] 'stream: Switch stream_run() to byte-based'
005/20:[] [--] 'commit: Switch commit_populate() to byte-based'
006/20:[] [--] 'commit: Switch commit_run() to byte-based'
007/20:[] [-C] 'mirror: Switch MirrorBlockJob to byte-based'
008/20:[] [--] 'mirror: Switch mirror_do_zero_or_discard() to byte-based'
009/20:[] [--] 'mirror: Update signature of mirror_clip_sectors()'
010/20:[] [--] 'mirror: Switch mirror_cow_align() to byte-based'
011/20:[] [--] 'mirror: Switch mirror_do_read() to byte-based'
012/20:[0012] [FC] 'mirror: Switch mirror_iteration() to byte-based'
013/20:[] [--] 'block: Drop unused bdrv_round_sectors_to_clusters()'
014/20:[] [--] 'backup: Switch BackupBlockJob to byte-based'
015/20:[] [--] 'backup: Switch block_backup.h to byte-based'
016/20:[] [--] 'backup: Switch backup_do_cow() to byte-based'
017/20:[] [--] 'backup: Switch backup_run() to byte-based'
018/20:[] [--] 'block: Make bdrv_is_allocated() byte-based'
019/20:[] [--] 'block: Minimize raw use of bds->total_sectors'
020/20:[] [-C] 'block: Make bdrv_is_allocated_above() byte-based'

Eric Blake (20):
  blockjob: Track job ratelimits via bytes, not sectors
  trace: Show blockjob actions via bytes, not sectors
  stream: Switch stream_populate() to byte-based
  stream: Switch stream_run() to byte-based
  commit: Switch commit_populate() to byte-based
  commit: Switch commit_run() to byte-based
  mirror: Switch MirrorBlockJob to byte-based
  mirror: Switch mirror_do_zero_or_discard() to byte-based
  mirror: Update signature of mirror_clip_sectors()
  mirror: Switch mirror_cow_align() to byte-based
  mirror: Switch mirror_do_read() to byte-based
  mirror: Switch mirror_iteration() to byte-based
  block: Drop unused bdrv_round_sectors_to_clusters()
  backup: Switch BackupBlockJob to byte-based
  backup: Switch block_backup.h to byte-based
  backup: Switch backup_do_cow() to byte-based
  backup: Switch backup_run() to byte-based
  block: Make bdrv_is_allocated() byte-based
  block: Minimize raw use of bds->total_sectors
  block: Make bdrv_is_allocated_above() byte-based

 include/block/block.h|  10 +-
 include/block/block_backup.h |  11 +-
 include/qemu/ratelimit.h |   3 +-
 block/backup.c   | 130 ---
 block/commit.c   |  54 
 block/io.c   |  92 +++--
 block/mirror.c   | 302 ++-
 block/replication.c  |  29 +++--
 block/stream.c   |  35 +++--
 block/vvfat.c|  34 +++--
 migration/block.c|   9 +-
 qemu-img.c   |  15 ++-
 qemu-io-cmds.c   |  70 +-
 block/trace-events   |  14 +-
 14 files changed, 397 insertions(+), 411 deletions(-)

-- 
2.9.4




Re: [Qemu-block] [PATCH v4 0/4] more blkdebug tweaks

2017-06-27 Thread Kevin Wolf
Am 27.06.2017 um 18:01 hat Eric Blake geschrieben:
> On 06/06/2017 07:26 AM, Kevin Wolf wrote:
> > Am 05.06.2017 um 22:38 hat Eric Blake geschrieben:
> >> I found a crasher and some odd behavior while rebasing my
> >> bdrv_get_block_status series, so I figured I'd get these things
> >> fixed first.  This is based on top of Max's block branch.
> >>
> >> Available as a tag at:
> >> git fetch git://repo.or.cz/qemu/ericb.git nbd-blkdebug-status-v4
> >>
> >> Since v3:
> >> - check all qemu-iotests (patch 1)
> > 
> > Whoops, somehow I ended up applying and reviewing v4 while looking at
> > the v3 thread... Looks like there really aren't any missing test case
> > updates any more.
> > 
> > Thanks, applied to the block branch.
> 
> Did this get lost somehow?

No idea what happened there, but it looks like it. It's queued now.

Kevin




Re: [Qemu-block] [RFC] QMP design: Fixing query-block and friends

2017-06-27 Thread John Snow


On 06/27/2017 12:31 PM, Kevin Wolf wrote:
> Hi,
> 
> I haven't really liked query-block for a long time, but now that
> blockdev-add and -blockdev have settled, it might finally be the time to
> actually do something about it. In fact, if used together with these
> modern interfaces, our query commands are simply broken, so we have to
> fix something.
> 

[...words...]

> 
> So how do we go forward from here?
> 
> I guess we could add a few hacks to fix the really urgent things, and
> just adding more information is always possible (at the cost of even
> more duplication).
> 

I think you've included this suggestion so that you can summarily
dismiss it as foolish.

> However, it appears to me that I dislike so many thing about our current
> query commands that I'm tempted to say: Throw it all away and start
> over.
> 

Inclined to agree. The structure of the block layer has changed so much
in the past few years and this is easily seen by the gap you've outlined
here.

We have to keep the old query commands around for a while as Eric says,
but I worry that they are so wrong and misleading as to be actively harmful.

Maybe there's some hair trigger somewhere: if $NEW_FEATURE_X is used
to configure QEMU in some way, the old commands can be deprecated
at runtime, such that we can more aggressively force their retirement.

> If that's what we're going to do, I think I can figure out something
> nice for block nodes. That shouldn't be too hard. The only question
> would be whether we want a command to query one node or whether we would
> keep returning all of them.
> 

Personally in favor of allowing filtering, as an option. I don't see why
we couldn't, or why it would be a problem, or worse in any way.

> I am, however, a bit less confident about BBs. As I said above, I
> consider them part of the qdev devices. As far as I know, there is no
> high-level infrastructure to query runtime state of devices and qdev
> properties are supposed to be read-only. Maybe non-qdev QOM properties
> can be used somehow? But QOM isn't really nice to use as you need to
> query each property individually.
> 
> Another option would be to have a QMP command that takes a qdev ID of a
> block device and queries its BB. Or maybe it should stay like
> query-block and return an array of all of them, but include the qdev
> device name. Actually, maybe query-block can stay as it contains only
> two fields that are useless in the new world.
> 
> 
> I think this has become long enough now, so any opinions? On anything I
> said above, but preferably also about what a new interface should look
> like?
> 

Sadly (for you, if you want advice) you're probably in the best shape to
decide such a thing. I like the idea of being able to query parts of the
tree on a per-device basis, as I think this likely reflects real world
usage the most.

"I want to do a thing to /dev/sda... what are the backends I am dealing
with for that?"

I think that's probably most along the "QMP command that takes a qdev
ID" option that you proposed.

> Kevin
> 



Re: [Qemu-block] [PATCH 00/10] Block layer thread-safety, part 2

2017-06-27 Thread Paolo Bonzini


On 27/06/2017 17:22, Paolo Bonzini wrote:
> This part takes care of drivers and devices, making sure that they can
> accept concurrent I/O from multiple AioContext.
> 
> The following drivers are thread-safe without using any QemuMutex/CoMutex:
> crypto, gluster, null, rbd, win32-aio.  NBD has already been fixed,
> because the patch fixed an unrelated testcase.
> 
> The following drivers already use mutexes for everything except possibly
> snapshots, which do not (yet?) need protection: bochs, cloop, dmg, qcow,
> parallels, vhdx, vmdk, curl, iscsi, nfs.
> 
> The following drivers already use mutexes for _almost_ everything: vpc
> (missing get_block_status), vdi (missing bitmap access), vvfat (missing
> commit, not protected), qcow2 (must call CoQueue APIs under CoMutex).
> They are fixed by patches 1-5.
> 
> The following drivers must be changed to use CoMutex to protect internal
> data: qed (patches 6-8), sheepdog (patch 9).
> 
> The following driver must be changed to support I/O from any AioContext:
> ssh.  It is fixed by patch 10.
> 
> Paolo

Nevermind---my tests were confused by some pre-existing failures in
qemu-iotests (068 for qcow2, caused by the patch dropped in v2 of the
pull request; 181 for vmdk and vpc).  This series is no good.

Paolo

> Paolo Bonzini (10):
>   qcow2: call CoQueue APIs under CoMutex
>   coroutine-lock: add qemu_co_rwlock_downgrade and
> qemu_co_rwlock_upgrade
>   vdi: make it thread-safe
>   vpc: make it thread-safe
>   vvfat: make it thread-safe
>   qed: move tail of qed_aio_write_main to qed_aio_write_{cow,alloc}
>   block: invoke .bdrv_drain callback in coroutine context and from
> AioContext
>   qed: protect table cache with CoMutex
>   sheepdog: add queue_lock
>   ssh: support I/O from any AioContext
> 
>  block/io.c |  42 +++---
>  block/qcow2.c  |   4 +-
>  block/qed-cluster.c|   4 +-
>  block/qed-l2-cache.c   |   6 ++
>  block/qed-table.c  |  24 --
>  block/qed.c| 195 
> +++--
>  block/qed.h|  11 ++-
>  block/sheepdog.c   |  21 -
>  block/ssh.c|  24 --
>  block/vdi.c|  48 +--
>  block/vpc.c|  20 ++---
>  block/vvfat.c  |   8 +-
>  include/block/block_int.h  |   2 +-
>  include/qemu/coroutine.h   |  18 +
>  util/qemu-coroutine-lock.c |  35 
>  15 files changed, 316 insertions(+), 146 deletions(-)
> 



Re: [Qemu-block] null-co undefined read behavior

2017-06-27 Thread Eric Blake
On 06/13/2017 05:55 AM, Kevin Wolf wrote:
> 
> We could possibly make read-zeroes a mandatory option in QAPI
> (currently it seems it doesn't even exist - oops!) and if we really want
> to print warnings for -drive, do it only if the default is used so that
> passing an explicit read-zeroes=off removes the warning. Or we could
> make it mandatory there, too.

I'm late to the conversation, but I like this approach...

> 
> These changes are incompatible, but I think the null-* use cases are
> mostly about testing and debugging, so I feel we can be a bit laxer
> about compatibility (like we already are with blkdebug).

Since null-co is a debug-only interface, I'm not as worried about
backwards-compatibility, and making the option mandatory forces clients
that are debugging to be explicit on whether they want defined reads or
fast behavior.  Once the option is mandatory, then it is also feasible
to use valgrind macros to unconditionally shut up the warning about
uninitialized reads, even when we are intentionally not zeroing the
memory, because at that point the request for fast behavior is intentional.
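
For reference, the valgrind-macro idea could look roughly like this in
the driver's read path (a sketch; the function name and read_zeroes
flag are hypothetical, but VALGRIND_MAKE_MEM_DEFINED is the real
memcheck.h macro):

    #include <stdbool.h>
    #include <stdint.h>
    #include <valgrind/memcheck.h>

    static void null_read_done(void *buf, uint64_t bytes, bool read_zeroes)
    {
        if (!read_zeroes) {
            /* contents are intentionally left undefined; mark them
             * defined so memcheck stays quiet about the fast path */
            VALGRIND_MAKE_MEM_DEFINED(buf, bytes);
        }
    }

    int main(void)
    {
        char buf[64];
        null_read_done(buf, sizeof(buf), false);
        return 0;
    }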

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH 1/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Peter Lieven

Am 27.06.2017 um 17:04 schrieb Eric Blake:

On 06/27/2017 09:49 AM, Peter Lieven wrote:


Before I continue, can you please give feedback on the following spec
change:

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 80cdfd0..f1428e9 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -85,7 +85,11 @@ in the description of a field.
  be written to (unless for regaining
  consistency).

-Bits 2-63:  Reserved (set to 0)
+Bit 2:  Compression format bit.  Iff this bit is set then

I know what this means, but spelling it "If and only if" or "When" might
make more sense to other readers, as "Iff" is not common in English.

+the compression format extension MUST be present
+and MUST be parsed and checked for compatibility.
+
+Bits 3-63:  Reserved (set to 0)

   80 -  87:  compatible_features
  Bitmask of compatible features. An implementation can
@@ -135,6 +139,7 @@ be stored. Each extension has a structure like the
following:
  0xE2792ACA - Backing file format name
  0x6803f857 - Feature name table
  0x23852875 - Bitmaps extension
+0xC0318300 - Compression format extension

Now that you aren't burning 256 magic numbers, it may make sense to have
the last two hex digits be non-zero.



+== Compression format extension ==
+
+The compression format extension is an optional header extension. It provides

Inline pasting created interesting wrapping, but the actual patch will
obviously read better.


+the ability to specify the compression algorithm and compression parameters
+that are used for compressed clusters. This new header MUST be present if
+the incompatible-feature bit "compression format bit" is set and MUST be absent
+otherwise.
+
+The fields of the compression format extension are:
+
+Byte  0 - 15:  compression_format_name (padded with zeros, but not
+   necessarily null terminated if it has full length)

Do we really want arbitrary names of formats, or do we want to specify
specific algorithms (gzip, lzo, zstd) as an enum?  Which way gives us
maximum likelihood of interoperability?


At some point I have to parse the name and convert it into a number.
This could be the same routine for the parsing of the compress.format
parameter and the compression_format_name in the header field.
The idea was that an accidental change in the enum cannot break
anything, as it is local to the qemu binary. But I am fine with having an
enum if the values for each format are fixed. In this case it might be
sufficient to have a uint8_t for the compression format + a uint8_t for
the compression level, then 2 bytes reserved + a uint32_t for the extra
data size. So currently (the extra data is not necessary at the moment
for any format) the header would be limited to 8 bytes.
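
In C terms, that 8-byte layout would look something like this (field
names invented for illustration; not a committed qcow2 spec):

    #include <stdint.h>
    /* hypothetical on-disk layout; multi-byte fields big-endian,
     * as usual for qcow2 header extensions */
    typedef struct Qcow2CompressionExt {
        uint8_t  format;          /* fixed enum value, e.g. gzip/lzo/zstd */
        uint8_t  level;           /* compression level, 0 = default */
        uint8_t  reserved[2];     /* must be zero */
        uint32_t extra_data_size; /* format-specific extra data, 0 today */
    } Qcow2CompressionExt;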

Anyway, if we glue this into a QAPI scheme, I would appreciate some help
with it, as I am not that familiar with it yet.

Thanks,
Peter




[Qemu-block] [PATCH 06/10] qed: move tail of qed_aio_write_main to qed_aio_write_{cow, alloc}

2017-06-27 Thread Paolo Bonzini
This part is never called for in-place writes; move it away to avoid
the "backwards" coding style typical of callback-based code.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/qed.c | 69 +++--
 1 file changed, 31 insertions(+), 38 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 385381a78a..3ffdc1716d 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -982,40 +982,11 @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
 BDRVQEDState *s = acb_to_s(acb);
 uint64_t offset = acb->cur_cluster +
   qed_offset_into_cluster(s, acb->cur_pos);
-int ret;
 
 trace_qed_aio_write_main(s, acb, 0, offset, acb->cur_qiov.size);
 
 BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
-ret = bdrv_co_pwritev(s->bs->file, offset, acb->cur_qiov.size,
-  &acb->cur_qiov, 0);
-if (ret < 0) {
-return ret;
-}
-
-if (acb->find_cluster_ret != QED_CLUSTER_FOUND) {
-if (s->bs->backing) {
-/*
- * Flush new data clusters before updating the L2 table
- *
- * This flush is necessary when a backing file is in use.  A crash
- * during an allocating write could result in empty clusters in the
- * image.  If the write only touched a subregion of the cluster,
- * then backing image sectors have been lost in the untouched
- * region.  The solution is to flush after writing a new data
- * cluster and before updating the L2 table.
- */
-ret = bdrv_co_flush(s->bs->file->bs);
-if (ret < 0) {
-return ret;
-}
-}
-ret = qed_aio_write_l2_update(acb, acb->cur_cluster);
-if (ret < 0) {
-return ret;
-}
-}
-return 0;
+return bdrv_co_pwritev(s->bs->file, offset, acb->cur_qiov.size,
+   &acb->cur_qiov, 0);
 }
 
 /**
@@ -1050,7 +1021,29 @@ static int coroutine_fn qed_aio_write_cow(QEDAIOCB *acb)
 return ret;
 }
 
-return qed_aio_write_main(acb);
+ret = qed_aio_write_main(acb);
+if (ret < 0) {
+return ret;
+}
+
+if (s->bs->backing) {
+/*
+ * Flush new data clusters before updating the L2 table
+ *
+ * This flush is necessary when a backing file is in use.  A crash
+ * during an allocating write could result in empty clusters in the
+ * image.  If the write only touched a subregion of the cluster,
+ * then backing image sectors have been lost in the untouched
+ * region.  The solution is to flush after writing a new data
+ * cluster and before updating the L2 table.
+ */
+ret = bdrv_co_flush(s->bs->file->bs);
+if (ret < 0) {
+return ret;
+}
+}
+
+return 0;
 }
 
 /**
@@ -1103,6 +1096,7 @@ static int coroutine_fn qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
 return 0;
 }
+acb->cur_cluster = 1;
 } else {
 acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
 }
@@ -1115,15 +1109,14 @@ static int coroutine_fn qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 }
 }
 
-if (acb->flags & QED_AIOCB_ZERO) {
-ret = qed_aio_write_l2_update(acb, 1);
-} else {
+if (!(acb->flags & QED_AIOCB_ZERO)) {
 ret = qed_aio_write_cow(acb);
+if (ret < 0) {
+return ret;
+}
 }
-if (ret < 0) {
-return ret;
-}
-return 0;
+
+return qed_aio_write_l2_update(acb, acb->cur_cluster);
 }
 
 /**
-- 
2.13.0





[Qemu-block] [PATCH 07/10] block: invoke .bdrv_drain callback in coroutine context and from AioContext

2017-06-27 Thread Paolo Bonzini
This will let the callback take a CoMutex in the next patch.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/io.c| 42 +-
 block/qed.c   |  6 +++---
 include/block/block_int.h |  2 +-
 3 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index 9bba730a7e..68f19bbe69 100644
--- a/block/io.c
+++ b/block/io.c
@@ -149,6 +149,37 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 return false;
 }
 
+typedef struct {
+Coroutine *co;
+BlockDriverState *bs;
+bool done;
+} BdrvCoDrainData;
+
+static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
+{
+BdrvCoDrainData *data = opaque;
+BlockDriverState *bs = data->bs;
+
+bs->drv->bdrv_co_drain(bs);
+
+/* Set data->done before reading bs->wakeup.  */
+atomic_mb_set(&data->done, true);
+bdrv_wakeup(bs);
+}
+
+static void bdrv_drain_invoke(BlockDriverState *bs)
+{
+BdrvCoDrainData data = { .bs = bs, .done = false };
+
+if (!bs->drv || !bs->drv->bdrv_co_drain) {
+return;
+}
+
+data.co = qemu_coroutine_create(bdrv_drain_invoke_entry, &data);
+bdrv_coroutine_enter(bs, data.co);
+BDRV_POLL_WHILE(bs, !data.done);
+}
+
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
 BdrvChild *child, *tmp;
@@ -156,9 +187,8 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
 
waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
 
-if (bs->drv && bs->drv->bdrv_drain) {
-bs->drv->bdrv_drain(bs);
-}
+/* Ensure any pending metadata writes are submitted to bs->file.  */
+bdrv_drain_invoke(bs);
 
QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
 BlockDriverState *bs = child->bs;
@@ -184,12 +214,6 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
 return waited;
 }
 
-typedef struct {
-Coroutine *co;
-BlockDriverState *bs;
-bool done;
-} BdrvCoDrainData;
-
 static void bdrv_co_drain_bh_cb(void *opaque)
 {
 BdrvCoDrainData *data = opaque;
diff --git a/block/qed.c b/block/qed.c
index 3ffdc1716d..454b7f0f2b 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -350,7 +350,7 @@ static void bdrv_qed_attach_aio_context(BlockDriverState *bs,
 }
 }
 
-static void bdrv_qed_drain(BlockDriverState *bs)
+static void coroutine_fn bdrv_qed_co_drain(BlockDriverState *bs)
 {
 BDRVQEDState *s = bs->opaque;
 
@@ -359,7 +359,7 @@ static void bdrv_qed_drain(BlockDriverState *bs)
  */
 if (s->need_check_timer && timer_pending(s->need_check_timer)) {
 qed_cancel_need_check_timer(s);
-qed_need_check_timer_cb(s);
+qed_need_check_timer_entry(s);
 }
 }
 
@@ -1540,7 +1540,7 @@ static BlockDriver bdrv_qed = {
 .bdrv_check   = bdrv_qed_check,
 .bdrv_detach_aio_context  = bdrv_qed_detach_aio_context,
 .bdrv_attach_aio_context  = bdrv_qed_attach_aio_context,
-.bdrv_drain   = bdrv_qed_drain,
+.bdrv_co_drain= bdrv_qed_co_drain,
 };
 
 static void bdrv_qed_init(void)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 15fa602150..6d3fbbfc1e 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -321,7 +321,7 @@ struct BlockDriver {
  * Drain and stop any internal sources of requests in the driver, and
  * remain so until next I/O callback (e.g. bdrv_co_writev) is called.
  */
-void (*bdrv_drain)(BlockDriverState *bs);
+void coroutine_fn (*bdrv_co_drain)(BlockDriverState *bs);
 
 void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
Error **errp);
-- 
2.13.0





Re: [Qemu-block] [PATCH] nbd: Fix regression on resiliency to port scan

2017-06-27 Thread Eric Blake
On 06/11/2017 06:50 AM, Max Reitz wrote:
> On 2017-06-09 00:26, Eric Blake wrote:
>> Back in qemu 2.5, qemu-nbd was immune to port probes (a transient
>> server would not quit, regardless of how many probe connections
>> came and went, until a connection actually negotiated).  But we
>> broke that in commit ee7d7aa

>> Simple test across two terminals:
>> $ qemu-nbd -f raw -p 30001 file
> 
> Maybe this command line should contain a -t, because otherwise I don't
> find it very CVE-worthy.

I was on vacation when this patch got merged, so it's too late to modify
the commit message now.  However, I'm following up now; first to point
out that we now have the actual CVE number (it was not yet assigned when
I originally posted - and if we hadn't merged yet, I would also have
tweaked the commit message to call the id out): CVE-2017-9524

> 
> First, sure, you can kill this NBD server with the below nmap
> invocation, or with an "ncat localhost 30001". But you can also do so
> with a plain "qemu-io -c quit nbd://localhost:30001".

But in that case, without -t, you specifically wanted the server to go
away after the first successful client, and that was indeed a successful
client.

> 
> With -t you actually cannot kill it using ncat or qemu-io, but you can
> with nmap. So this is what the actual issue is.

You do make a point, though.  The command line example I gave is the
old-style NBD, where there is no named export, and where any client is
first in line for the given export - which is rather weak, and why
upstream NBD discourages the old protocol in favor of the new where the
client has to specify WHICH named export it wants.  And in the new
protocol, if you are not using -t, then you want the server to ignore
clients that aren't requesting the export that the server is willing to
serve, rather than hanging up on the first random client that requests
the wrong name.  And you properly investigated that scenario below...

> 
> And secondly, note that even after this patch you can make the NBD
> server quit with "ncat localhost 30001" (just like with qemu-io).
> 
>> $ nmap 127.0.0.1 -p 30001 && \
>>   qemu-io -c 'r 0 512' -f raw nbd://localhost:30001
>>
>> Note that this patch does not change what constitutes successful
>> negotiation (thus, a client must enter transmission phase before
>> that client can be considered as a reason to terminate the server
>> when the connection ends).
> 
> Well, the thing is that transmission phase starts immediately for the
> old-style protocol:
> 
> $ ./qemu-nbd -f raw null-co:// &
> [1] 24581
> $ ncat --send-only localhost 10809 < /dev/null
> nbd/server.c:nbd_receive_request():L706: read failed
> [1]  + 24581 done   ./qemu-nbd -f raw null-co://
> 
> I'm not sure whether this is intended, but then again I'm not sure
> whether we can or have to do anything about this (as I said above, you
> can easily make the server quit with a qemu-io invocation if you're so
> inclined.

Old-style negotiation is not as important these days as new-style where
a handshake occurs (in fact, upstream NBD documents but no longer
implements old-style negotiation, so it has definitely moved into
deprecated usage).

> 
> But the thing is that this really shows there should be a -t in the
> qemu-nbd invocation in this commit message. The merit of this patch is
> that it fixes a *crash* in qemu-nbd if someone closes the connection
> immediately after opening it, not that it allows qemu-nbd to continue to
> run when someone connects who is not an NBD client -- because it really
> doesn't.

So yes, your arguments are good that the REAL denial-of-service is not
against a server that expects only one client (or, with new-style, one
client that knows about the specific export), but against a server
running with -t that is supposed to be persistent regardless of how many
clients consecutively connect to it.

> 
> (With the new style, option negotiation fails so we do not go to
> transmission phase and the server stays alive:
> 
> $ ./qemu-nbd -x foo -f raw null-co:// &
> [1] 24609
> $ ncat --send-only localhost 10809 < /dev/null
> nbd/server.c:nbd_negotiate_options():L442: read failed
> nbd/server.c:nbd_negotiate():L673: option negotiation failed
> $ kill %1
> [1]  + 24609 done   ./qemu-nbd -x foo -f raw null-co://
> 
> )
> 
>> Perhaps we may want to tweak things
>> in a later patch to also treat a client that uses NBD_OPT_ABORT
>> as being a 'successful' negotiation (the client correctly talked
>> the NBD protocol, and informed us it was not going to use our
>> export after all), but that's a discussion for another day.
> 
> Well, about that...
> 
> $ ./qemu-nbd -x foo -f raw null-co:// &
> [1] 24389
> $ ./qemu-io -c quit nbd://localhost/bar
> can't open device nbd://localhost/bar: No export with name 'bar' available
> [1]  + 24389 broken pipe  ./qemu-nbd -x foo -f raw null-co://
> 
> Even worse, the same happens with -t, both before and after this patch.

You later 

Re: [Qemu-block] [PATCH 1/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Peter Lieven

Am 27.06.2017 um 17:11 schrieb Peter Lieven:

Am 27.06.2017 um 17:04 schrieb Eric Blake:

On 06/27/2017 09:49 AM, Peter Lieven wrote:


Before I continue, can you please give feedback on the following spec
change:

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 80cdfd0..f1428e9 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -85,7 +85,11 @@ in the description of a field.
  be written to (unless for regaining
  consistency).

-Bits 2-63:  Reserved (set to 0)
+Bit 2:  Compression format bit.  Iff this bit

I know what this means, but spelling it "If and only if" or "When" might
make more sense to other readers, as "Iff" is not common in English.


is set then
+the compression format extension MUST
be present
+and MUST be parsed and checked for
compatibility.
+
+Bits 3-63:  Reserved (set to 0)

   80 -  87:  compatible_features
  Bitmask of compatible features. An implementation can
@@ -135,6 +139,7 @@ be stored. Each extension has a structure like the
following:
  0xE2792ACA - Backing file format name
  0x6803f857 - Feature name table
  0x23852875 - Bitmaps extension
+0xC0318300 - Compression format extension

Now that you aren't burning 256 magic numbers, it may make sense to have
the last two hex digits be non-zero.



+== Compression format extension ==
+
+The compression format extension is an optional header extension. It
provides

Inline pasting created interesting wrapping, but the actual patch will
obviously read better.


+the ability to specify the compression algorithm and compression
parameters
+that are used for compressed clusters. This new header MUST be present if
+the incompatible-feature bit "compression format bit" is set and MUST
be absent
+otherwise.
+
+The fields of the compression format extension are:
+
+Byte  0 - 15:  compression_format_name (padded with zeros, but not
+   necessarily null terminated if it has full length)

Do we really want arbitrary names of formats, or do we want to specify
specific algorithms (gzip, lzo, zstd) as an enum?  Which way gives us
maximum likelihood of interoperability?


At some point I have to parse the name and convert it into a number.
This could be the same routine for the parsing of the compress.format
parameter and the compression_format_name in the header field.
The idea was that an accidental change in the enum cannot break
anything as it is local to the qemu binary. But I am ok to have an enum
if the values for each format are fixed. In this case it might be sufficient
to have a uint8_t for the compression format + a uint8_t for the compression
level. Then 2 bytes reserved + a uint32_t for the extra data size. So
currently (the extra data is not necessary at the moment for any format)
the header would be limited to 8 bytes.

Anyway if we glue this into a QAPI scheme, I would appreciate to get
some help with it as I am not that familiar with it yet.


Forgot to mention, the idea to store the name here was to be able to
display an appropriate error message if a format is not supported (like
"compression format xyz is not supported").

Peter
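
A sketch of the 8-byte header Peter describes (the struct name, field
names and the fixed per-format values are all hypothetical; QEMU_PACKED
is the existing QEMU packing macro):

#include "qemu/osdep.h"

/* hypothetical on-disk layout, 8 bytes total */
typedef struct QEMU_PACKED Qcow2CompressionFormatExt {
    uint8_t  format;           /* fixed value per algorithm, e.g. 0 = zlib */
    uint8_t  level;            /* 0 = default compression level */
    uint8_t  reserved[2];      /* must be zero */
    uint32_t extra_data_size;  /* currently always 0 for all formats */
} Qcow2CompressionFormatExt;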



Re: [Qemu-block] [PATCH] docs: add qemu-block-drivers(7) man page

2017-06-27 Thread Eric Blake
On 06/22/2017 07:17 AM, Stefan Hajnoczi wrote:
> Block driver documentation is available in qemu-doc.html.  It would be
> convenient to have documentation for formats, protocols, and filter
> drivers in a man page.
> 
> Extract the relevant part of qemu-doc.html into a new file called
> docs/qemu-block-drivers.texi.  This file can also be built as a
> stand-alone document (man, html, etc).
> 
> Signed-off-by: Stefan Hajnoczi 
> ---

> +
> +@table @option
> +@item raw
> +
> +Raw disk image format. This format has the advantage of
> +being simple and easily exportable to all other emulators. If your
> +file system supports @emph{holes} (for example in ext2 or ext3 on
> +Linux or NTFS on Windows), then only the written sectors will reserve
> +space. Use @code{qemu-img info} to know the real size used by the
> +image or @code{ls -ls} on Unix/Linux.

Worth mentioning the security risk of using raw images under probing
without explicitly calling them out as raw, because then a guest can
impersonate other type of images?

> +
> +Supported options:
> +@table @code
> +@item preallocation
> +Preallocation mode (allowed values: @code{off}, @code{falloc}, @code{full}).
> +@code{falloc} mode preallocates space for image by calling posix_fallocate().
> +@code{full} mode preallocates space for image by writing zeros to underlying
> +storage.
> +@end table
> +
> +@item qcow2
> +QEMU image format, the most versatile format. Use it to have smaller
> +images (useful if your filesystem does not supports holes, for example

s/supports/support/

> +on Windows),

Odd, since you just stated above for raw that Windows DOES support holes
(on NTFS).  Better might be "on FAT filesystems".

> zlib based compression and support of multiple VM
> +snapshots.
> +
> +Supported options:
> +@table @code
> +@item compat
> +Determines the qcow2 version to use. @code{compat=0.10} uses the
> +traditional image format that can be read by any QEMU since 0.10.
> +@code{compat=1.1} enables image format extensions that only QEMU 1.1 and
> +newer understand (this is the default). Amongst others, this includes

I don't have any strong opinion on US vs. UK spelling, but do like to
point out that this is user-facing, if we are trying to consistently
prefer one flavo[u]r of among[st] in our docs.

Then again, much of your patch is just code motion, rather than new
documentation.  So my comments are probably worth separate patches.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH 00/10] Block layer thread-safety, part 2

2017-06-27 Thread Paolo Bonzini
This part takes care of drivers and devices, making sure that they can
accept concurrent I/O from multiple AioContext.

The following drivers are thread-safe without using any QemuMutex/CoMutex:
crypto, gluster, null, rbd, win32-aio.  NBD has already been fixed,
because the patch fixed an unrelated testcase.

The following drivers already use mutexes for everything except possibly
snapshots, which do not (yet?) need protection: bochs, cloop, dmg, qcow,
parallels, vhdx, vmdk, curl, iscsi, nfs.

The following drivers already use mutexes for _almost_ everything: vpc
(missing get_block_status), vdi (missing bitmap access), vvfat (missing
commit, not protected), qcow2 (must call CoQueue APIs under CoMutex).
They are fixed by patches 1-5.

The following drivers must be changed to use CoMutex to protect internal
data: qed (patches 6-8), sheepdog (patch 9).

The following driver must be changed to support I/O from any AioContext:
ssh.  It is fixed by patch 10.

Paolo

Paolo Bonzini (10):
  qcow2: call CoQueue APIs under CoMutex
  coroutine-lock: add qemu_co_rwlock_downgrade and
qemu_co_rwlock_upgrade
  vdi: make it thread-safe
  vpc: make it thread-safe
  vvfat: make it thread-safe
  qed: move tail of qed_aio_write_main to qed_aio_write_{cow,alloc}
  block: invoke .bdrv_drain callback in coroutine context and from
AioContext
  qed: protect table cache with CoMutex
  sheepdog: add queue_lock
  ssh: support I/O from any AioContext

 block/io.c |  42 +++---
 block/qcow2.c  |   4 +-
 block/qed-cluster.c|   4 +-
 block/qed-l2-cache.c   |   6 ++
 block/qed-table.c  |  24 --
 block/qed.c| 195 +++--
 block/qed.h|  11 ++-
 block/sheepdog.c   |  21 -
 block/ssh.c|  24 --
 block/vdi.c|  48 +--
 block/vpc.c|  20 ++---
 block/vvfat.c  |   8 +-
 include/block/block_int.h  |   2 +-
 include/qemu/coroutine.h   |  18 +
 util/qemu-coroutine-lock.c |  35 
 15 files changed, 316 insertions(+), 146 deletions(-)

-- 
2.13.0




[Qemu-block] [PATCH 01/10] qcow2: call CoQueue APIs under CoMutex

2017-06-27 Thread Paolo Bonzini
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/qcow2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2f94f0326e..70d3f4a18e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1741,8 +1741,6 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
 ret = 0;
 
 fail:
-qemu_co_mutex_unlock(&s->lock);
-
 while (l2meta != NULL) {
 QCowL2Meta *next;
 
@@ -1756,6 +1754,8 @@ fail:
 l2meta = next;
 }
 
+qemu_co_mutex_unlock(&s->lock);
+
qemu_iovec_destroy(&hd_qiov);
 qemu_vfree(cluster_data);
 trace_qcow2_writev_done_req(qemu_coroutine_self(), ret);
-- 
2.13.0
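
The rule this series enforces, as a standalone sketch (MyState and the
in_use condition are hypothetical; the CoMutex/CoQueue calls are the
real API): CoQueue wait and restart must happen with the associated
CoMutex held, so a wakeup cannot slip in between testing the condition
and going to sleep.

#include "qemu/osdep.h"
#include "qemu/coroutine.h"

typedef struct MyState {
    CoMutex lock;
    CoQueue queue;
    bool in_use;
} MyState;

static coroutine_fn void acquire_slot(MyState *s)
{
    qemu_co_mutex_lock(&s->lock);
    while (s->in_use) {
        /* drops s->lock while sleeping, re-takes it on wakeup */
        qemu_co_queue_wait(&s->queue, &s->lock);
    }
    s->in_use = true;
    qemu_co_mutex_unlock(&s->lock);
}

static coroutine_fn void release_slot(MyState *s)
{
    qemu_co_mutex_lock(&s->lock);
    s->in_use = false;
    qemu_co_queue_restart_all(&s->queue);
    qemu_co_mutex_unlock(&s->lock);
}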





[Qemu-block] [PATCH 04/10] vpc: make it thread-safe

2017-06-27 Thread Paolo Bonzini
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/vpc.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index 4240ba9d1c..0ff686540a 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -496,12 +496,6 @@ static inline int64_t get_image_offset(BlockDriverState *bs, uint64_t offset,
 return block_offset;
 }
 
-static inline int64_t get_sector_offset(BlockDriverState *bs,
-int64_t sector_num, bool write)
-{
-return get_image_offset(bs, sector_num * BDRV_SECTOR_SIZE, write);
-}
-
 /*
  * Writes the footer to the end of the image file. This is needed when the
  * file grows as it overwrites the old footer
@@ -696,6 +690,7 @@ static int64_t coroutine_fn vpc_co_get_block_status(BlockDriverState *bs,
 VHDFooter *footer = (VHDFooter*) s->footer_buf;
 int64_t start, offset;
 bool allocated;
+int64_t ret;
 int n;
 
 if (be32_to_cpu(footer->type) == VHD_FIXED) {
@@ -705,10 +700,13 @@ static int64_t coroutine_fn vpc_co_get_block_status(BlockDriverState *bs,
(sector_num << BDRV_SECTOR_BITS);
 }
 
-offset = get_sector_offset(bs, sector_num, 0);
+qemu_co_mutex_lock(&s->lock);
+
+offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false);
 start = offset;
 allocated = (offset != -1);
 *pnum = 0;
+ret = 0;
 
 do {
 /* All sectors in a block are contiguous (without using the bitmap) */
@@ -723,15 +721,17 @@ static int64_t coroutine_fn vpc_co_get_block_status(BlockDriverState *bs,
  * sectors since there is always a bitmap in between. */
 if (allocated) {
 *file = bs->file->bs;
-return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | start;
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | start;
+break;
 }
 if (nb_sectors == 0) {
 break;
 }
-offset = get_sector_offset(bs, sector_num, 0);
+offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false);
 } while (offset == -1);
 
-return 0;
+qemu_co_mutex_unlock(&s->lock);
+return ret;
 }
 
 /*
-- 
2.13.0





[Qemu-block] [PATCH 02/10] coroutine-lock: add qemu_co_rwlock_downgrade and qemu_co_rwlock_upgrade

2017-06-27 Thread Paolo Bonzini
These functions are more efficient in the presence of contention.
qemu_co_rwlock_downgrade also guarantees not to block, which may
be useful in some algorithms too.

Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 include/qemu/coroutine.h   | 18 ++
 util/qemu-coroutine-lock.c | 35 +++
 2 files changed, 53 insertions(+)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index a4509bd977..9aff9a735e 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -229,6 +229,24 @@ void qemu_co_rwlock_init(CoRwlock *lock);
 void qemu_co_rwlock_rdlock(CoRwlock *lock);
 
 /**
+ * Write Locks the CoRwlock from a reader.  This is a bit more efficient than
+ * @qemu_co_rwlock_unlock followed by a separate @qemu_co_rwlock_wrlock.
+ * However, if the lock cannot be upgraded immediately, control is transferred
+ * to the caller of the current coroutine.  Also, @qemu_co_rwlock_upgrade
+ * only overrides CoRwlock fairness if there are no concurrent readers, so
+ * another writer might run while @qemu_co_rwlock_upgrade blocks.
+ */
+void qemu_co_rwlock_upgrade(CoRwlock *lock);
+
+/**
+ * Downgrades a write-side critical section to a reader.  Downgrading with
+ * @qemu_co_rwlock_downgrade never blocks, unlike @qemu_co_rwlock_unlock
+ * followed by @qemu_co_rwlock_rdlock.  This makes it more efficient, but
+ * may also sometimes be necessary for correctness.
+ */
+void qemu_co_rwlock_downgrade(CoRwlock *lock);
+
+/**
  * Write Locks the mutex. If the lock cannot be taken immediately because
  * of a parallel reader, control is transferred to the caller of the current
  * coroutine.
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index b44b5d55eb..846ff9167f 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -402,6 +402,21 @@ void qemu_co_rwlock_unlock(CoRwlock *lock)
qemu_co_mutex_unlock(&lock->mutex);
 }
 
+void qemu_co_rwlock_downgrade(CoRwlock *lock)
+{
+Coroutine *self = qemu_coroutine_self();
+
+/* lock->mutex critical section started in qemu_co_rwlock_wrlock or
+ * qemu_co_rwlock_upgrade.
+ */
+assert(lock->reader == 0);
+lock->reader++;
+qemu_co_mutex_unlock(&lock->mutex);
+
+/* The rest of the read-side critical section is run without the mutex.  */
+self->locks_held++;
+}
+
 void qemu_co_rwlock_wrlock(CoRwlock *lock)
 {
qemu_co_mutex_lock(&lock->mutex);
@@ -416,3 +431,23 @@ void qemu_co_rwlock_wrlock(CoRwlock *lock)
  * There is no need to update self->locks_held.
  */
 }
+
+void qemu_co_rwlock_upgrade(CoRwlock *lock)
+{
+Coroutine *self = qemu_coroutine_self();
+
+qemu_co_mutex_lock(&lock->mutex);
+assert(lock->reader > 0);
+lock->reader--;
+lock->pending_writer++;
+while (lock->reader) {
+qemu_co_queue_wait(&lock->queue, &lock->mutex);
+}
+lock->pending_writer--;
+
+/* The rest of the write-side critical section is run with
+ * the mutex taken, similar to qemu_co_rwlock_wrlock.  Do
+ * not account for the lock twice in self->locks_held.
+ */
+self->locks_held--;
+}
-- 
2.13.0
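
A usage sketch for the new primitives (the Cache type, allocate_block()
and the zero-means-unallocated convention are hypothetical; the lock
calls are the API added above).  Note the re-check after the upgrade:
another writer may have run while qemu_co_rwlock_upgrade() blocked.

#include "qemu/osdep.h"
#include "qemu/coroutine.h"

typedef struct Cache {
    CoRwlock lock;
    uint32_t *table;
} Cache;

static uint32_t allocate_block(Cache *c);    /* hypothetical allocator */

static coroutine_fn uint32_t lookup_or_allocate(Cache *c, int i)
{
    uint32_t entry;

    qemu_co_rwlock_rdlock(&c->lock);
    entry = c->table[i];
    if (entry == 0) {                        /* not allocated yet */
        qemu_co_rwlock_upgrade(&c->lock);
        entry = c->table[i];                 /* re-check under the write lock */
        if (entry == 0) {
            entry = c->table[i] = allocate_block(c);
        }
        qemu_co_rwlock_downgrade(&c->lock);  /* back to shared mode, no block */
    }
    qemu_co_rwlock_unlock(&c->lock);
    return entry;
}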





[Qemu-block] [PATCH 09/10] sheepdog: add queue_lock

2017-06-27 Thread Paolo Bonzini
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/sheepdog.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 08d7b11e9d..a6013f0f17 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -390,6 +390,7 @@ struct BDRVSheepdogState {
 QLIST_HEAD(inflight_aio_head, AIOReq) inflight_aio_head;
 QLIST_HEAD(failed_aio_head, AIOReq) failed_aio_head;
 
+CoMutex queue_lock;
 CoQueue overlapping_queue;
 QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head;
 };
@@ -488,7 +489,7 @@ static void wait_for_overlapping_aiocb(BDRVSheepdogState *s, SheepdogAIOCB *acb)
 retry:
QLIST_FOREACH(cb, &s->inflight_aiocb_head, aiocb_siblings) {
 if (AIOCBOverlapping(acb, cb)) {
-qemu_co_queue_wait(&s->overlapping_queue, NULL);
+qemu_co_queue_wait(&s->overlapping_queue, &s->queue_lock);
 goto retry;
 }
 }
@@ -525,8 +526,10 @@ static void sd_aio_setup(SheepdogAIOCB *acb, BDRVSheepdogState *s,
 return;
 }
 
+qemu_co_mutex_lock(&s->queue_lock);
 wait_for_overlapping_aiocb(s, acb);
QLIST_INSERT_HEAD(&s->inflight_aiocb_head, acb, aiocb_siblings);
+qemu_co_mutex_unlock(&s->queue_lock);
 }
 
 static SocketAddress *sd_socket_address(const char *path,
@@ -785,6 +788,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
  * have to move all the inflight requests to the failed queue before
  * resend_aioreq() is called.
  */
+qemu_co_mutex_lock(&s->queue_lock);
 QLIST_FOREACH_SAFE(aio_req, &s->inflight_aio_head, aio_siblings, next) {
 QLIST_REMOVE(aio_req, aio_siblings);
 QLIST_INSERT_HEAD(&s->failed_aio_head, aio_req, aio_siblings);
@@ -794,8 +798,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 while (!QLIST_EMPTY(&s->failed_aio_head)) {
 aio_req = QLIST_FIRST(&s->failed_aio_head);
 QLIST_REMOVE(aio_req, aio_siblings);
+qemu_co_mutex_unlock(&s->queue_lock);
 resend_aioreq(s, aio_req);
+qemu_co_mutex_lock(&s->queue_lock);
 }
+qemu_co_mutex_unlock(&s->queue_lock);
 }
 
 /*
@@ -887,7 +894,10 @@ static void coroutine_fn aio_read_response(void *opaque)
 */
 s->co_recv = NULL;
 
+qemu_co_mutex_lock(&s->queue_lock);
 QLIST_REMOVE(aio_req, aio_siblings);
+qemu_co_mutex_unlock(&s->queue_lock);
+
 switch (rsp.result) {
 case SD_RES_SUCCESS:
 break;
@@ -1307,7 +1317,9 @@ static void coroutine_fn add_aio_request(BDRVSheepdogState *s, AIOReq *aio_req,
 uint64_t old_oid = aio_req->base_oid;
 bool create = aio_req->create;
 
+qemu_co_mutex_lock(&s->queue_lock);
 QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
+qemu_co_mutex_unlock(&s->queue_lock);
 
 if (!nr_copies) {
 error_report("bug");
@@ -1678,6 +1690,7 @@ static int sd_open(BlockDriverState *bs, QDict *options, int flags,
 bs->total_sectors = s->inode.vdi_size / BDRV_SECTOR_SIZE;
 pstrcpy(s->name, sizeof(s->name), vdi);
 qemu_co_mutex_init(&s->lock);
+qemu_co_mutex_init(&s->queue_lock);
 qemu_co_queue_init(&s->overlapping_queue);
 qemu_opts_del(opts);
 g_free(buf);
@@ -2431,12 +2444,16 @@ static void coroutine_fn sd_co_rw_vector(SheepdogAIOCB *acb)
 
 static void sd_aio_complete(SheepdogAIOCB *acb)
 {
+BDRVSheepdogState *s;
 if (acb->aiocb_type == AIOCB_FLUSH_CACHE) {
 return;
 }
 
+s = acb->s;
+qemu_co_mutex_lock(&s->queue_lock);
 QLIST_REMOVE(acb, aiocb_siblings);
-qemu_co_queue_restart_all(&acb->s->overlapping_queue);
+qemu_co_queue_restart_all(&s->overlapping_queue);
+qemu_co_mutex_unlock(&s->queue_lock);
 }
 
 static coroutine_fn int sd_co_writev(BlockDriverState *bs, int64_t sector_num,
-- 
2.13.0





[Qemu-block] [PATCH 03/10] vdi: make it thread-safe

2017-06-27 Thread Paolo Bonzini
The VirtualBox driver is using a mutex to order all allocating writes,
but it is not protecting accesses to the bitmap because they implicitly
happen under the AioContext mutex.  Change this to use a CoRwlock
explicitly.

Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/vdi.c | 48 
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 79af47763b..080e1562e3 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -172,7 +172,7 @@ typedef struct {
 /* VDI header (converted to host endianness). */
 VdiHeader header;
 
-CoMutex write_lock;
+CoRwlock bmap_lock;
 
 Error *migration_blocker;
 } BDRVVdiState;
@@ -485,7 +485,7 @@ static int vdi_open(BlockDriverState *bs, QDict *options, int flags,
 goto fail_free_bmap;
 }
 
-qemu_co_mutex_init(&s->write_lock);
+qemu_co_rwlock_init(&s->bmap_lock);
 
 return 0;
 
@@ -557,7 +557,9 @@ vdi_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
n_bytes, offset);
 
 /* prepare next AIO request */
+qemu_co_rwlock_rdlock(&s->bmap_lock);
 bmap_entry = le32_to_cpu(s->bmap[block_index]);
+qemu_co_rwlock_unlock(&s->bmap_lock);
 if (!VDI_IS_ALLOCATED(bmap_entry)) {
 /* Block not allocated, return zeros, no need to wait. */
 qemu_iovec_memset(qiov, bytes_done, 0, n_bytes);
@@ -595,6 +597,7 @@ vdi_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
 uint32_t block_index;
 uint32_t offset_in_block;
 uint32_t n_bytes;
+uint64_t data_offset;
 uint32_t bmap_first = VDI_UNALLOCATED;
 uint32_t bmap_last = VDI_UNALLOCATED;
 uint8_t *block = NULL;
@@ -614,10 +617,19 @@ vdi_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
n_bytes, offset);
 
 /* prepare next AIO request */
+qemu_co_rwlock_rdlock(&s->bmap_lock);
 bmap_entry = le32_to_cpu(s->bmap[block_index]);
 if (!VDI_IS_ALLOCATED(bmap_entry)) {
 /* Allocate new block and write to it. */
 uint64_t data_offset;
+qemu_co_rwlock_upgrade(&s->bmap_lock);
+bmap_entry = le32_to_cpu(s->bmap[block_index]);
+if (VDI_IS_ALLOCATED(bmap_entry)) {
+/* A concurrent allocation did the work for us.  */
+qemu_co_rwlock_downgrade(&s->bmap_lock);
+goto nonallocating_write;
+}
+
 bmap_entry = s->header.blocks_allocated;
 s->bmap[block_index] = cpu_to_le32(bmap_entry);
 s->header.blocks_allocated++;
@@ -635,30 +647,18 @@ vdi_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
 memset(block + offset_in_block + n_bytes, 0,
s->block_size - n_bytes - offset_in_block);
 
-/* Note that this coroutine does not yield anywhere from reading the
- * bmap entry until here, so in regards to all the coroutines trying
- * to write to this cluster, the one doing the allocation will
- * always be the first to try to acquire the lock.
- * Therefore, it is also the first that will actually be able to
- * acquire the lock and thus the padded cluster is written before
- * the other coroutines can write to the affected area. */
-qemu_co_mutex_lock(&s->write_lock);
+/* Write the new block under CoRwLock write-side protection,
+ * so this full-cluster write does not overlap a partial write
+ * of the same cluster, issued from the "else" branch.
+ */
 ret = bdrv_pwrite(bs->file, data_offset, block, s->block_size);
-qemu_co_mutex_unlock(&s->write_lock);
+qemu_co_rwlock_unlock(&s->bmap_lock);
 } else {
-uint64_t data_offset = s->header.offset_data +
-   (uint64_t)bmap_entry * s->block_size +
-   offset_in_block;
-qemu_co_mutex_lock(&s->write_lock);
-/* This lock is only used to make sure the following write operation
- * is executed after the write issued by the coroutine allocating
- * this cluster, therefore we do not need to keep it locked.
- * As stated above, the allocating coroutine will always try to lock
- * the mutex before all the other concurrent accesses to that
- * cluster, therefore at this point we can be absolutely certain
- * that that write operation has returned (there may be other writes
- * in flight, but they do not concern this very operation). */
-qemu_co_mutex_unlock(&s->write_lock);
+nonallocating_write:
+data_offset = s->header.offset_data +
+   

[Qemu-block] [PATCH 05/10] vvfat: make it thread-safe

2017-06-27 Thread Paolo Bonzini
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/vvfat.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index 8ab647c0c6..d2679c2ff5 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -2982,8 +2982,14 @@ static int coroutine_fn
 write_target_commit(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
 QEMUIOVector *qiov, int flags)
 {
+int ret;
+
 BDRVVVFATState* s = *((BDRVVVFATState**) bs->opaque);
-return try_commit(s);
+qemu_co_mutex_lock(&s->lock);
+ret = try_commit(s);
+qemu_co_mutex_unlock(&s->lock);
+
+return ret;
 }
 
 static void write_target_close(BlockDriverState *bs) {
-- 
2.13.0





[Qemu-block] [PATCH 10/10] ssh: support I/O from any AioContext

2017-06-27 Thread Paolo Bonzini
The coroutine may run in a different AioContext, causing the
fd handler to busy wait.  Fix this by resetting the handler
in restart_coroutine, before the coroutine is restarted.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block/ssh.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/block/ssh.c b/block/ssh.c
index 52964416da..e6a5b08de3 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -888,13 +888,22 @@ static int ssh_has_zero_init(BlockDriverState *bs)
 return has_zero_init;
 }
 
+typedef struct BDRVSSHRestart {
+BlockDriverState *bs;
+Coroutine *co;
+} BDRVSSHRestart;
+
 static void restart_coroutine(void *opaque)
 {
-Coroutine *co = opaque;
+BDRVSSHRestart *restart = opaque;
+BlockDriverState *bs = restart->bs;
+BDRVSSHState *s = bs->opaque;
+AioContext *ctx = bdrv_get_aio_context(bs);
 
-DPRINTF("co=%p", co);
+DPRINTF("co=%p", restart->co);
+aio_set_fd_handler(ctx, s->sock, false, NULL, NULL, NULL, NULL);
 
-aio_co_wake(co);
+aio_co_wake(restart->co);
 }
 
 /* A non-blocking call returned EAGAIN, so yield, ensuring the
@@ -905,7 +914,10 @@ static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
 {
 int r;
 IOHandler *rd_handler = NULL, *wr_handler = NULL;
-Coroutine *co = qemu_coroutine_self();
+BDRVSSHRestart restart = {
+.bs = bs,
+.co = qemu_coroutine_self()
+};
 
 r = libssh2_session_block_directions(s->session);
 
@@ -920,11 +932,9 @@ static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
 rd_handler, wr_handler);
 
 aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
-   false, rd_handler, wr_handler, NULL, co);
+   false, rd_handler, wr_handler, NULL, &restart);
 qemu_coroutine_yield();
 DPRINTF("s->sock=%d - back", s->sock);
-aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock, false,
-   NULL, NULL, NULL, NULL);
 }
 
 /* SFTP has a function `libssh2_sftp_seek64' which seeks to a position
-- 
2.13.0




[Qemu-block] [PATCH v2] docs: add qemu-block-drivers(7) man page

2017-06-27 Thread Stefan Hajnoczi
Block driver documentation is available in qemu-doc.html.  It would be
convenient to have documentation for formats, protocols, and filter
drivers in a man page.

Extract the relevant part of qemu-doc.html into a new file called
docs/qemu-block-drivers.texi.  This file can also be built as a
stand-alone document (man, html, etc).

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: John Snow 
---
v2:
 * Rebased onto qemu.git/master after Paolo's docs/ changes [jsnow]

 Makefile |  23 +-
 docs/qemu-block-drivers.texi | 692 +++
 qemu-doc.texi| 673 +
 3 files changed, 711 insertions(+), 677 deletions(-)
 create mode 100644 docs/qemu-block-drivers.texi

diff --git a/Makefile b/Makefile
index 16a0430..8d32d01 100644
--- a/Makefile
+++ b/Makefile
@@ -209,6 +209,9 @@ ifdef BUILD_DOCS
 DOCS=qemu-doc.html qemu-doc.txt qemu.1 qemu-img.1 qemu-nbd.8 qemu-ga.8
 DOCS+=docs/interop/qemu-qmp-ref.html docs/interop/qemu-qmp-ref.txt 
docs/interop/qemu-qmp-ref.7
 DOCS+=docs/interop/qemu-ga-ref.html docs/interop/qemu-ga-ref.txt 
docs/interop/qemu-ga-ref.7
+DOCS+=docs/qemu-block-drivers.html
+DOCS+=docs/qemu-block-drivers.txt
+DOCS+=docs/qemu-block-drivers.7
 ifdef CONFIG_VIRTFS
 DOCS+=fsdev/virtfs-proxy-helper.1
 endif
@@ -528,6 +531,8 @@ distclean: clean
rm -f docs/interop/qemu-qmp-ref.txt docs/interop/qemu-ga-ref.txt
rm -f docs/interop/qemu-qmp-ref.pdf docs/interop/qemu-ga-ref.pdf
rm -f docs/interop/qemu-qmp-ref.html docs/interop/qemu-ga-ref.html
+   rm -f docs/qemu-block-drivers.7 docs/qemu-block-drivers.txt
+   rm -f docs/qemu-block-drivers.pdf docs/qemu-block-drivers.html
for d in $(TARGET_DIRS); do \
rm -rf $$d || exit 1 ; \
 done
@@ -568,11 +573,14 @@ install-doc: $(DOCS)
$(INSTALL_DATA) qemu-doc.txt "$(DESTDIR)$(qemu_docdir)"
$(INSTALL_DATA) docs/interop/qemu-qmp-ref.html 
"$(DESTDIR)$(qemu_docdir)"
$(INSTALL_DATA) docs/interop/qemu-qmp-ref.txt "$(DESTDIR)$(qemu_docdir)"
+   $(INSTALL_DATA) docs/qemu-block-drivers.html "$(DESTDIR)$(qemu_docdir)"
+   $(INSTALL_DATA) docs/qemu-block-drivers.txt "$(DESTDIR)$(qemu_docdir)"
 ifdef CONFIG_POSIX
$(INSTALL_DIR) "$(DESTDIR)$(mandir)/man1"
$(INSTALL_DATA) qemu.1 "$(DESTDIR)$(mandir)/man1"
$(INSTALL_DIR) "$(DESTDIR)$(mandir)/man7"
$(INSTALL_DATA) docs/interop/qemu-qmp-ref.7 "$(DESTDIR)$(mandir)/man7"
+   $(INSTALL_DATA) docs/qemu-block-drivers.7 "$(DESTDIR)$(mandir)/man7"
 ifneq ($(TOOLS),)
$(INSTALL_DATA) qemu-img.1 "$(DESTDIR)$(mandir)/man1"
$(INSTALL_DIR) "$(DESTDIR)$(mandir)/man8"
@@ -719,15 +727,15 @@ fsdev/virtfs-proxy-helper.1: 
fsdev/virtfs-proxy-helper.texi
 qemu-nbd.8: qemu-nbd.texi qemu-option-trace.texi
 qemu-ga.8: qemu-ga.texi
 
-html: qemu-doc.html docs/interop/qemu-qmp-ref.html 
docs/interop/qemu-ga-ref.html
-info: qemu-doc.info docs/interop/qemu-qmp-ref.info 
docs/interop/qemu-ga-ref.info
-pdf: qemu-doc.pdf docs/interop/qemu-qmp-ref.pdf docs/interop/qemu-ga-ref.pdf
-txt: qemu-doc.txt docs/interop/qemu-qmp-ref.txt docs/interop/qemu-ga-ref.txt
+html: qemu-doc.html docs/interop/qemu-qmp-ref.html 
docs/interop/qemu-ga-ref.html docs/qemu-block-drivers.html
+info: qemu-doc.info docs/interop/qemu-qmp-ref.info 
docs/interop/qemu-ga-ref.info docs/qemu-block-drivers.info
+pdf: qemu-doc.pdf docs/interop/qemu-qmp-ref.pdf docs/interop/qemu-ga-ref.pdf 
docs/qemu-block-drivers.pdf
+txt: qemu-doc.txt docs/interop/qemu-qmp-ref.txt docs/interop/qemu-ga-ref.txt 
docs/qemu-block-drivers.txt
 
 qemu-doc.html qemu-doc.info qemu-doc.pdf qemu-doc.txt: \
qemu-img.texi qemu-nbd.texi qemu-options.texi qemu-option-trace.texi \
qemu-monitor.texi qemu-img-cmds.texi qemu-ga.texi \
-   qemu-monitor-info.texi
+   qemu-monitor-info.texi docs/qemu-block-drivers.texi
 
 docs/interop/qemu-ga-ref.dvi docs/interop/qemu-ga-ref.html \
 docs/interop/qemu-ga-ref.info docs/interop/qemu-ga-ref.pdf \
@@ -739,6 +747,11 @@ docs/interop/qemu-qmp-ref.dvi 
docs/interop/qemu-qmp-ref.html \
 docs/interop/qemu-qmp-ref.txt docs/interop/qemu-qmp-ref.7: \
docs/interop/qemu-qmp-ref.texi docs/interop/qemu-qmp-qapi.texi
 
+docs/qemu-block-drivers.dvi docs/qemu-block-drivers.html \
+docs/qemu-block-drivers.info docs/qemu-block-drivers.pdf \
+docs/qemu-block-drivers.txt docs/qemu-block-drivers.7: \
+   docs/qemu-block-drivers.texi
+
 
 ifdef CONFIG_WIN32
 
diff --git a/docs/qemu-block-drivers.texi b/docs/qemu-block-drivers.texi
new file mode 100644
index 0000000..a43e878
--- /dev/null
+++ b/docs/qemu-block-drivers.texi
@@ -0,0 +1,692 @@
+@c man begin SYNOPSIS
+QEMU block driver reference manual
+@c man end
+
+@c man begin DESCRIPTION
+
+@node disk_images_formats
+
+@subsection Disk image file formats
+
+QEMU supports many image file formats that can be used with VMs as well as with
+any of 

Re: [Qemu-block] [PATCH v4 0/4] more blkdebug tweaks

2017-06-27 Thread Eric Blake
On 06/06/2017 07:26 AM, Kevin Wolf wrote:
> Am 05.06.2017 um 22:38 hat Eric Blake geschrieben:
>> I found a crasher and some odd behavior while rebasing my
>> bdrv_get_block_status series, so I figured I'd get these things
>> fixed first.  This is based on top of Max's block branch.
>>
>> Available as a tag at:
>> git fetch git://repo.or.cz/qemu/ericb.git nbd-blkdebug-status-v4
>>
>> Since v3:
>> - check all qemu-iotests (patch 1)
> 
> Whoops, somehow I ended up applying and reviewing v4 while looking at
> the v3 thread... Looks like there really aren't any missing test case
> updates any more.
> 
> Thanks, applied to the block branch.

Did this get lost somehow?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH RFC v3 4/8] block: convert ThrottleGroup to object with QOM

2017-06-27 Thread Alberto Garcia
On Mon 26 Jun 2017 06:58:32 PM CEST, Manos Pitsidianakis wrote:
> On Mon, Jun 26, 2017 at 03:52:34PM +0100, Stefan Hajnoczi wrote:
>>On Fri, Jun 23, 2017 at 03:46:56PM +0300, Manos Pitsidianakis wrote:
>>> +static bool throttle_group_exists(const char *name)
>>> +{
>>> +ThrottleGroup *iter;
>>> +bool ret = false;
>>> +
>>> +qemu_mutex_lock(&throttle_groups_lock);
>>
>>Not sure if this lock or the throttle_groups list are necessary.

As Manos says accesses to the throttle_groups list need to be locked.

What I don't like at first sight is the code duplication in
throttle_group_incref() and throttle_group_obj_complete().

>>Have you seen iothread.c:qmp_query_iothreads()?  All objects are put
>>into a container (the parent object), so you can just iterate over its
>>children.  There's no need for a separate list because QOM already has
>>all the objects.

I haven't looked into this yet but it could be a solution.

Berto



Re: [Qemu-block] [PATCH RFC v3 4/8] block: convert ThrottleGroup to object with QOM

2017-06-27 Thread Manos Pitsidianakis

On Tue, Jun 27, 2017 at 06:05:55PM +0200, Alberto Garcia wrote:

On Mon 26 Jun 2017 06:58:32 PM CEST, Manos Pitsidianakis wrote:

On Mon, Jun 26, 2017 at 03:52:34PM +0100, Stefan Hajnoczi wrote:

On Fri, Jun 23, 2017 at 03:46:56PM +0300, Manos Pitsidianakis wrote:

+static bool throttle_group_exists(const char *name)
+{
+ThrottleGroup *iter;
+bool ret = false;
+
+qemu_mutex_lock(&throttle_groups_lock);


Not sure if this lock or the throttle_groups list are necessary.


As Manos says accesses to the throttle_groups list need to be locked.

What I don't like at first sight is the code duplication in
throttle_group_incref() and throttle_group_obj_complete().


Have you seen iothread.c:qmp_query_iothreads()?  All objects are put
into a container (the parent object), so you can just iterate over its
children.  There's no need for a separate list because QOM already has
all the objects.


I haven't looked into this yet but it could be a solution.


If throttle_groups_register_blk() uses QOM instead of calling
throttle_group_incref(), the duplication could be eliminated. Then all of
throttle-groups.c would use QOM internally. I don't see any reason not to
do this.



Berto
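
For reference, a lookup through the QOM composition tree needs neither
the list nor the lock.  A minimal sketch, assuming the groups are
created as children of the objects root (as -object does) and reusing
TYPE_THROTTLE_GROUP from this series:

#include "qemu/osdep.h"
#include "qom/object.h"

static bool throttle_group_exists(const char *name)
{
    Object *obj = object_resolve_path_component(object_get_objects_root(),
                                                name);

    /* non-NULL only if an object with that name exists and is a group */
    return obj && object_dynamic_cast(obj, TYPE_THROTTLE_GROUP);
}

The composition tree is only touched under the BQL here, which is why
the separate throttle_groups_lock becomes unnecessary for this check.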





Re: [Qemu-block] [RFC] QMP design: Fixing query-block and friends

2017-06-27 Thread Eric Blake
On 06/27/2017 11:31 AM, Kevin Wolf wrote:
> Hi,
> 
> I haven't really liked query-block for a long time, but now that
> blockdev-add and -blockdev have settled, it might finally be the time to
> actually do something about it. In fact, if used together with these
> modern interfaces, our query commands are simply broken, so we have to
> fix something.

Agreed.

> 
> However, it appears to me that I dislike so many thing about our current
> query commands that I'm tempted to say: Throw it all away and start
> over.

I'm somewhat leaning this direction as well. We have to keep the old
commands for a while longer (if we don't want to break existing
clients), but libvirt has definitely felt some of the pain of how many
commands and parsing are required in tandem to reconstruct which BDS
node name to use for setting threshold events.

> 
> If that's what we're going to do, I think I can figure out something
> nice for block nodes. That shouldn't be too hard. The only question
> would be whether we want a command to query one node or whether we would
> keep returning all of them.

The age-old filtering question. It's also plausible to have a single
command, with an optional argument, and which always returns an array:
the full array if the argument was omitted, or an array of one matching
the argument when one was provided.  Adding filtering is an easy patch
on top once it is proven to make life easier for a client, and I'm okay
with a first approach that does not filter.

> 
> I am, however, a bit less confident about BBs. As I said above, I
> consider them part of the qdev devices. As far as I know, there is no
> high-level infrastructure to query runtime state of devices and qdev
> properties are supposed to be read-only. Maybe non-qdev QOM properties
> can be used somehow? But QOM isn't really nice to use as you need to
> query each property individually.
> 
> Another option would be to have a QMP command that takes a qdev ID of a
> block device and queries its BB. Or maybe it should stay like
> query-block and return an array of all of them, but include the qdev
> device name. Actually, maybe query-block can stay as it contains only
> two fields that are useless in the new world.

Being able to query all the devices (with their BB's, and only a name
reference to the top-level BDS in use by the BB), separately from being
able to query all BDS, seems like a reasonable thing.  After all,
sometimes you care about what the guest sees (what devices have a BB),
and sometimes you care about what the host is exposing (what BDS are in
use).

> I think this has become long enough now, so any opinions? On anything I
> said above, but preferably also about what a new interface should look
> like?

Our existing interface is definitely awkward, with lots of redundancy in
some places and missing information in others, and a new interface does
seem like we can do better at designing it right up front rather than
bolting on yet more information to the existing queries (which results
in that much more noise to churn through to get to the desired information).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
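
The "single command, optional filter" shape could look like this in the
QAPI schema (command and type names here are made up, not a reviewed
proposal):

{ 'command': 'query-block-nodes',
  'data': { '*node-name': 'str' },
  'returns': ['BlockNodeInfo'] }

Without the argument it returns every node; with 'node-name' it returns
an array of at most one element, so clients parse a single shape either
way.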





[Qemu-block] [RFC] QMP design: Fixing query-block and friends

2017-06-27 Thread Kevin Wolf
Hi,

I haven't really liked query-block for a long time, but now that
blockdev-add and -blockdev have settled, it might finally be the time to
actually do something about it. In fact, if used together with these
modern interfaces, our query commands are simply broken, so we have to
fix something.

Just for reference, I'll start with a copy of the most important part of
the schema of our current QMP commands here:

> { 'command': 'query-block', 'returns': ['BlockInfo'] }
> 
> { 'command': 'query-named-block-nodes', 'returns': [ 'BlockDeviceInfo' ] }
> 
> { 'struct': 'BlockInfo',
>   'data': {'device': 'str', 'type': 'str', 'removable': 'bool',
>'locked': 'bool', '*inserted': 'BlockDeviceInfo',
>'*tray_open': 'bool', '*io-status': 'BlockDeviceIoStatus',
>'*dirty-bitmaps': ['BlockDirtyInfo'] } }
> 
> { 'struct': 'BlockDeviceInfo',
>   'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str',
> '*backing_file': 'str', 'backing_file_depth': 'int',
> 'encrypted': 'bool', 'encryption_key_missing': 'bool',
> 'detect_zeroes': 'BlockdevDetectZeroesOptions',
> 'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int',
> 'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
> 'image': 'ImageInfo',
> '*bps_max': 'int', '*bps_rd_max': 'int',
> '*bps_wr_max': 'int', '*iops_max': 'int',
> '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> '*bps_max_length': 'int', '*bps_rd_max_length': 'int',
> '*bps_wr_max_length': 'int', '*iops_max_length': 'int',
> '*iops_rd_max_length': 'int', '*iops_wr_max_length': 'int',
> '*iops_size': 'int', '*group': 'str', 'cache': 
> 'BlockdevCacheInfo',
> 'write_threshold': 'int' } }
> 
> { 'struct': 'ImageInfo',
>   'data': {'filename': 'str', 'format': 'str', '*dirty-flag': 'bool',
>'*actual-size': 'int', 'virtual-size': 'int',
> '*cluster-size': 'int', '*encrypted': 'bool', '*compressed': 'bool',
>'*backing-filename': 'str', '*full-backing-filename': 'str',
>'*backing-filename-format': 'str', '*snapshots': ['SnapshotInfo'],
>'*backing-image': 'ImageInfo',
>'*format-specific': 'ImageInfoSpecific' } }
> 
> { 'union': 'ImageInfoSpecific',
>   'data': {
>   'qcow2': 'ImageInfoSpecificQCow2',
>   'vmdk': 'ImageInfoSpecificVmdk',
>   # If we need to add block driver specific parameters for
>   # LUKS in future, then we'll subclass QCryptoBlockInfoLUKS
>   # to define a ImageInfoSpecificLUKS
>   'luks': 'QCryptoBlockInfoLUKS'
>   } }

So what's wrong with this? Quite a few things, let's do a quick review
of the commands:

* query-block only returns BlockBackends that are owned by the monitor
  (i.e. they have a name). This used to be true for all BlockBackends
  that were attached to a guest device, but it's not the case any more:
  We want to use -blockdev/-device only with node names, which means
  that the devices create the BB internally - and they aren't visible in
  query-block any more.

* Even if we included those BBs in query-block, they would be missing
  the connection to their qdev device. The BB should really be
  considered part of the device, and if we had a way to query the state
  of a device, query-block would ideally be included there.

* query-named-block-nodes is a bad name. All block nodes are named these
  days. Which also means that it returns all block nodes, so its output
  becomes rather large and includes a lot of redundant information (as
  you'll see below).

* The good news: BlockInfo contains only fields that actually belong to
  BlockBackends, apart from a @type attribute that always contains the
  string "unknown".

  The bad news: A BlockBackend has a lot more information attached, this
  is by far not a full query command for BlockBackends.

* What is BlockDeviceInfo supposed to be? query-named-block-nodes
  returns it, so you would expect that it's a description of a
  BlockDriverState. But it's not. It's a mix of BB and BDS descriptions.
  The BB part just stays zeroed in query-named-block-nodes.

  In other words, if you do query-block, you'll see I/O limits and
  whether a writeback mode is used. But do query-named-block-nodes and
  look at the root node of the same device and it will always tell you
  that the limits are 0 and writeback is on.

  So BlockDeviceInfo contains many things that should really be in
  BlockInfo. Both together appear to be relatively complete for BBs.

* BlockDeviceInfo doesn't really describe the graph. It gives you the
  backing file name as a special case, but it won't tell you anything
  about other child nodes, and even for backing files it doesn't tell
  you which node is actually used.

  Instead, we should have a description of all attached children with
  the link name, the child node name and probably also the permissions
  

Re: [Qemu-block] [PATCH v2 3/4] qcow2: add shrink image support

2017-06-27 Thread Pavel Butsykin

On 26.06.2017 20:47, Max Reitz wrote:

On 2017-06-26 17:23, Pavel Butsykin wrote:

[...]


Is there any guarantee that in the future this will not change? Because
in this case it can be a potential danger.


Since this behavior is not documented anywhere, there is no guarantee.


I can add a comment... Or add a new variable with the size of
reftable_tmp, and take min(s->refcount_table_size, reftable_tmp_size)
every time before accessing s->refcount_table[]/reftable_tmp[].


Or (1) you add an assertion that refcount_table_size doesn't change
along with a comment why that is the case, which also explains in detail
why the call to qcow2_free_clusters() should be safe: The on-disk
reftable differs from the one in memory. qcow2_free_clusters()and
update_refcount() themselves do not access the reftable, so they are
safe. However, update_refcount() calls alloc_refcount_block(), and that
function does access the reftable: Now, as long as
s->refcount_table_size does not shrink (which I can't see why it would),
refcount_table_index should always be smaller. Now we're accessing
s->refcount_table: This will always return an existing refblock because
this will either be the refblock itself (for self-referencing refblocks)
or another one that is not going to be freed by qcow2_shrink_reftable()
because this function will not free refblocks which cover other clusters
than themselves.
We will then proceed to update the refblock which is either right (if it
is not the refblock to be freed) or won't do anything (if it is the one
to be freed).
In any case, we will never write to the reftable and reading from the
basically outdated cached version will never do anything bad.


OK, SGTM.


Or (2) you copy reftable_tmp into s->refcount_table[] *before* any call
to qcow2_free_clusters(). To make this work, you would need to also
discard all refblocks from the cache in this function here (and not in
update_refcount()) and then only call qcow2_free_clusters() on refblocks
which were not self-referencing. An alternative hack would be to simply
mark the image dirty and just not do any qcow2_free_clusters() call...


The main purpose of the qcow2_shrink_reftable() function is to discard
all unnecessary refblocks from the file. If we only rewrite the
refcount_table and discard non-self-referencing refblocks (which are
actually very rare), then the point of the function is lost.


Or (3) of course it would be possible to not clean up refcount
structures at all...


Nice solution :)


Max
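
Max's option (1) in code form, as a minimal sketch (the placement inside
qcow2_shrink_reftable() and the refblock_offset variable are
hypothetical; qcow2_free_clusters() and QCOW2_DISCARD_ALWAYS are the
existing qcow2 interfaces):

uint64_t old_size = s->refcount_table_size;

qcow2_free_clusters(bs, refblock_offset, s->cluster_size,
                    QCOW2_DISCARD_ALWAYS);
/* update_refcount() may call alloc_refcount_block(), but it must never
 * grow the reftable here, or the cached s->refcount_table would go
 * stale relative to the copy already written to disk. */
assert(s->refcount_table_size == old_size);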





Re: [Qemu-block] [PATCH 1/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Eric Blake
On 06/27/2017 09:49 AM, Peter Lieven wrote:

>>
> 
> Before I continue, can you please give feedback on the following spec
> change:
> 
> diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
> index 80cdfd0..f1428e9 100644
> --- a/docs/interop/qcow2.txt
> +++ b/docs/interop/qcow2.txt
> @@ -85,7 +85,11 @@ in the description of a field.
>  be written to (unless for regaining
>  consistency).
> 
> -Bits 2-63:  Reserved (set to 0)
> +Bit 2:  Compression format bit.  Iff this bit

I know what this means, but spelling it "If and only if" or "When" might
make more sense to other readers, as "Iff" is not common in English.

> is set then
> +the compression format extension MUST
> be present
> +and MUST be parsed and checked for
> compatibility.
> +
> +Bits 3-63:  Reserved (set to 0)
> 
>   80 -  87:  compatible_features
>  Bitmask of compatible features. An implementation can
> @@ -135,6 +139,7 @@ be stored. Each extension has a structure like the
> following:
>  0xE2792ACA - Backing file format name
>  0x6803f857 - Feature name table
>  0x23852875 - Bitmaps extension
> +0xC0318300 - Compression format extension

Now that you aren't burning 256 magic numbers, it may make sense to have
the last two hex digits be non-zero.


> +== Compression format extension ==
> +
> +The compression format extension is an optional header extension. It
> provides

Inline pasting created interesting wrapping, but the actual patch will
obviously read better.

> +the ability to specify the compression algorithm and compression
> parameters
> +that are used for compressed clusters. This new header MUST be present if
> +the incompatible-feature bit "compression format bit" is set and MUST
> be absent
> +otherwise.
> +
> +The fields of the compression format extension are:
> +
> +Byte  0 - 15:  compression_format_name (padded with zeros, but not
> +   necessarily null terminated if it has full length)

Do we really want arbitrary names of formats, or do we want to specify
specific algorithms (gzip, lzo, zstd) as an enum?  Which way gives us
maximum likelihood of interoperability?

> +
> +  16:  compression_level (uint8_t)
> +   0 = default compression level
> +   1 = lowest compression level
> +   x = highest compression level (the highest compression
> +   level may vary for different compression formats)
> +
> + 17 - 23:  Reserved for future use, must be zero.

Feels pretty limited - you don't have a length field for variable-length
extension of additional parameters, but have to fit all additions in the
next 8 bytes.  Yes, all extension headers are already paired with a
length parameter outside of the struct, sent alongside the header magic
number, but embedding a length directly in the header (while redundant)
makes it easier to keep information local to the header.  See
extra_data_size under Bitmap directory, for example.  Of course, we may
turn those 8 bytes INTO a length field that then describes the rest of
the variable length parameters, but why not do it up front?

If we go with an enum mapping of supported compression formats, then you
can go into further detail on exactly what extra parameters are
supported for each algorithm; while leaving it as a free-form text string
makes it harder to interpret what any additional payload will represent.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
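
An enum with fixed on-disk values, as suggested above, might look like
this (the names and numbers are illustrative only, not an agreed spec):

typedef enum Qcow2CompressionFormat {
    QCOW2_COMPRESSION_FORMAT_ZLIB = 0,
    QCOW2_COMPRESSION_FORMAT_LZO  = 1,
    QCOW2_COMPRESSION_FORMAT_ZSTD = 2,
} Qcow2CompressionFormat;

An unrecognized value can then be reported precisely ("compression
format 2 is not supported by this build") without parsing a free-form
16-byte name field.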





Re: [Qemu-block] [Qemu-devel] [PATCH v6 0/5] Improve I/O tests coverage of LUKS driver

2017-06-27 Thread Fam Zheng
On Tue, 06/27 07:43, no-re...@patchew.org wrote:
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> Traceback (most recent call last):
>   File "./patchew-cli", line 430, in test_one
> git_clone_repo(clone, r["repo"], r["head"], logf)
>   File "./patchew-cli", line 48, in git_clone_repo
> stdout=logf, stderr=logf)
>   File "/usr/lib64/python3.5/subprocess.py", line 266, in check_call
> retcode = call(*popenargs, **kwargs)
>   File "/usr/lib64/python3.5/subprocess.py", line 249, in call
> return p.wait(timeout=timeout)
>   File "/usr/lib64/python3.5/subprocess.py", line 1389, in wait
> (pid, sts) = self._try_wait(0)
>   File "/usr/lib64/python3.5/subprocess.py", line 1339, in _try_wait
> (pid, sts) = os.waitpid(self.pid, wait_flags)
> KeyboardInterrupt

My mistake, sorry for the noise.

Fam



Re: [Qemu-block] [PATCH 1/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Peter Lieven

Am 27.06.2017 um 14:49 schrieb Eric Blake:

On 06/27/2017 07:34 AM, Peter Lieven wrote:

this patch adds a new compression_algorithm option when creating qcow2 images.
The current default for the compresison algorithm is zlib and zlib will be

s/compresison/compression/


used when this option is omitted (like before).

If the option is specified e.g. with:

  qemu-img create -f qcow2 -o compression_algorithm=zlib image.qcow2 1G

then a new compression algorithm header extension is added and an incompatible
feature bit is set. This means that if the header is present it must be parsed
by Qemu on qcow2_open and it must be validated if the specified compression
algorithm is supported by the current build of Qemu.

This means if the compression_algorithm option is specified Qemu prior to this
commit will not be able to open the created image.

Signed-off-by: Peter Lieven 
---
  block/qcow2.c | 93 ---
  block/qcow2.h | 20 +++---
  docs/interop/qcow2.txt|  8 +++-

Focusing on just the spec change first:


+++ b/docs/interop/qcow2.txt
@@ -85,7 +85,11 @@ in the description of a field.
  be written to (unless for regaining
  consistency).
  
-Bits 2-63:  Reserved (set to 0)

+Bit 2:  Compress algorithm bit.  If this bit is set 
then
+the compress algorithm extension must be parsed
+and checked for compatiblity.

s/compatiblity/compatibility/


+
+Bits 3-63:  Reserved (set to 0)
  
   80 -  87:  compatible_features

  Bitmask of compatible features. An implementation can
@@ -135,6 +139,8 @@ be stored. Each extension has a structure like the 
following:
  0xE2792ACA - Backing file format name
  0x6803f857 - Feature name table
  0x23852875 - Bitmaps extension
+0xC0318300 - Compression Algorithm
+0xC03183xx - Reserved for compression algorithm params

s/params/parameters/

You have now introduced 256 different reserved headers, without
documenting any of their formats.  You absolutely MUST include a
documentation of how the new 0xC0318300 header is laid out (see, for
example, our section on "Bitmaps extension"), along with text mentioning
that the new header MUST be present if incompatible-feature bit is set
and MUST be absent otherwise.  But I also think that with a bit of
proper design work, you only need ONE header for all possible algorithm
parameters, rather than burning an additional 255 unspecified
reservations.  That is, make sure your new header includes a common
prefix including a length field and the algorithm in use, and then the
length covers a variable-length suffix that can be parsed in a
per-algorithm-specific manner for whatever additional parameters are
needed for that algorithm.



Before I continue, can you please give feedback on the following spec change:

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 80cdfd0..f1428e9 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -85,7 +85,11 @@ in the description of a field.
 be written to (unless for regaining
 consistency).

-Bits 2-63:  Reserved (set to 0)
+Bit 2:  Compression format bit.  Iff this bit is set then
+the compression format extension MUST be present
+and MUST be parsed and checked for compatibility.
+
+Bits 3-63:  Reserved (set to 0)

  80 -  87:  compatible_features
 Bitmask of compatible features. An implementation can
@@ -135,6 +139,7 @@ be stored. Each extension has a structure like the 
following:
 0xE2792ACA - Backing file format name
 0x6803f857 - Feature name table
 0x23852875 - Bitmaps extension
+0xC0318300 - Compression format extension
 other  - Unknown header extension, can be safely
  ignored

@@ -208,6 +213,28 @@ The fields of the bitmaps extension are:
starts. Must be aligned to a cluster boundary.


+== Compression format extension ==
+
+The compression format extension is an optional header extension. It provides
+the ability to specify the compression algorithm and compression parameters
+that are used for compressed clusters. This new header MUST be present if
+the incompatible-feature bit "compression format bit" is set and MUST be absent
+otherwise.
+
+The fields of the compression format extension are:
+
+Byte  0 - 15:  compression_format_name (padded with 

Re: [Qemu-block] [Qemu-devel] [PATCH v6 0/5] Improve I/O tests coverage of LUKS driver

2017-06-27 Thread no-reply
Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 20170626123510.20134-1-berra...@redhat.com
Type: series
Subject: [Qemu-devel] [PATCH v6 0/5] Improve I/O tests coverage of LUKS driver

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Traceback (most recent call last):
  File "./patchew-cli", line 430, in test_one
git_clone_repo(clone, r["repo"], r["head"], logf)
  File "./patchew-cli", line 48, in git_clone_repo
stdout=logf, stderr=logf)
  File "/usr/lib64/python3.5/subprocess.py", line 266, in check_call
retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python3.5/subprocess.py", line 249, in call
return p.wait(timeout=timeout)
  File "/usr/lib64/python3.5/subprocess.py", line 1389, in wait
(pid, sts) = self._try_wait(0)
  File "/usr/lib64/python3.5/subprocess.py", line 1339, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt



---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-de...@freelists.org

Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] block/qcow2: add zlib-fast compression algorithm

2017-06-27 Thread Daniel P. Berrange
On Tue, Jun 27, 2017 at 03:23:19PM +0200, Peter Lieven wrote:
> Am 27.06.2017 um 15:16 schrieb Daniel P. Berrange:
> > On Tue, Jun 27, 2017 at 02:34:10PM +0200, Peter Lieven wrote:
> > > this adds support for optimized zlib settings which almost
> > > tripples the compression speed while maintaining about
> > > the same compressed size.
> > > 
> > > Signed-off-by: Peter Lieven 
> > > ---
> > >   block/qcow2-cluster.c |  3 ++-
> > >   block/qcow2.c | 11 +--
> > >   block/qcow2.h |  1 +
> > >   qemu-img.texi |  1 +
> > >   4 files changed, 13 insertions(+), 3 deletions(-)
> > > diff --git a/qemu-img.texi b/qemu-img.texi
> > > index 043c1ba..83a5db2 100644
> > > --- a/qemu-img.texi
> > > +++ b/qemu-img.texi
> > > @@ -627,6 +627,7 @@ The following options are available if support for 
> > > the respective libraries
> > >   has been enabled at compile time:
> > >  zlib        Uses standard zlib compression (default)
> > > +   zlib-fast   Uses zlib compression with optimized compression parameters
> > This looks like a poor design from a future proofing POV. I would much
> > rather see us introduce a more flexible modelling of compression at the
> > QAPI layer which lets us have tunables for each algorithm, in the same
> > way that the qcow2 built-in encryption now has ability to set tunables
> > for each algorithm.
> > 
> > eg your "zlib-fast" impl which just uses zlib with a window size of -15
> > could be expressed as
> > 
> > qemu-img create -o compress.format=zlib,compress.window=-15
> 
> It would also need a compress.level, but okay. This will make things
> more complicated as one might not know what good values for
> the algorithm are.
> 
> Wouldn't it be better to have something like compress.level=1..9 and let
> the implementation decide which parameters it chooses for the algorithm?

Yep, there's definitely a choice of approaches - exposing low level
details of each library as parameters, vs trying to come up with a
more abstracted, simplified notation and having the driver pick some
of the low level details implicitly from that. Both are valid, and I
don't have a particular opinion either way.

> Do you have a pointer where I can find out how to implement that
> in QAPI? Maybe the patches that introduce the encryption settings?

It is a little messy since we don't fully use QAPI in the block create
code path. If you want to look at what I did for encryption though, see
the block/qcow2.c file. In the qcow2_create() method I grab the
encrypt.format option. Later in the qcow2_set_up_encryption() method
I then extract & parse the format specific options from the QemuOpts
array. My code is in Max's block branch, or on block-luks-qcow2-10 branch
at https://github.com/berrange/qemu

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
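
As a concrete illustration of the pattern Daniel describes, the compress.format
handling in qcow2_create() could look roughly like this (a sketch only: the
compress.* option names are the proposal from this thread, not an existing
interface):

    /* pull the nested compress.* options out of the QemuOpts array,
     * by analogy with the encrypt.format handling */
    char *compress_fmt = qemu_opt_get_del(opts, "compress.format");
    int64_t compress_level = qemu_opt_get_number_del(opts, "compress.level", 0);

    if (compress_fmt && strcmp(compress_fmt, "zlib") != 0) {
        error_setg(errp, "Unsupported compression format '%s'", compress_fmt);
        g_free(compress_fmt);
        return -EINVAL;
    }
    /* compress_level would then feed the per-format parameter setup */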



Re: [Qemu-block] [PATCH RFC v3 3/8] block: add throttle block filter driver

2017-06-27 Thread Manos Pitsidianakis

On Tue, Jun 27, 2017 at 01:45:40PM +0100, Stefan Hajnoczi wrote:

On Mon, Jun 26, 2017 at 07:26:41PM +0300, Manos Pitsidianakis wrote:

On Mon, Jun 26, 2017 at 03:30:55PM +0100, Stefan Hajnoczi wrote:
> > +bs->file = bdrv_open_child(NULL, options, "file",
> > +bs, &child_file, false, &local_err);
> > +
> > +if (local_err) {
> > +error_propagate(errp, local_err);
> > +return -EINVAL;
> > +}
> > +
> > +qdict_flatten(options);
> > +return throttle_configure_tgm(bs, tgm, options, errp);
>
> Who destroys bs->file on error?

It is reaped by bdrv_open_inherit() on failure, if I'm not mistaken.
That's how other drivers handle this as well. Some (eg block/qcow2.c)
check if bs->file is NULL instead of the error pointer they pass, so
this is not very consistent.


Maybe I'm missing it but I don't see relevant bs->file cleanup in
bdrv_open_inherit() or bdrv_open_common().

Please post the exact line where it happens.

Stefan


Relevant commit: de234897b60e034ba94b307fc289e2dc692c9251 block: Do not 
unref bs->file on error in BD's open


bdrv_open_inherit() does this on failure:

fail:
   blk_unref(file);
   if (bs->file != NULL) {
   bdrv_unref_child(bs, bs->file);
   }

While looking into this I noticed bdrv_new_open_driver() doesn't handle 
bs->file on failure. It simply unrefs the bs, but because the child's ref 
still remains, the child is leaked.
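
One hypothetical way to plug that leak in bdrv_new_open_driver(), mirroring
the bdrv_open_inherit() failure path quoted above (an untested sketch, not a
posted patch):

    /* on bdrv_open_driver() failure: drop the child reference first so
     * the child is not leaked when the final unref destroys bs */
    if (bs->file) {
        bdrv_unref_child(bs, bs->file);
        bs->file = NULL;
    }
    bdrv_unref(bs);
    return NULL;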




Re: [Qemu-block] [Qemu-devel] [PATCH 1/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Peter Lieven

Am 27.06.2017 um 15:20 schrieb Daniel P. Berrange:

On Tue, Jun 27, 2017 at 02:34:07PM +0200, Peter Lieven wrote:

this patch adds a new compression_algorithm option when creating qcow2 images.
The current default for the compresison algorithm is zlib and zlib will be
used when this option is omitted (like before).

If the option is specified e.g. with:

  qemu-img create -f qcow2 -o compression_algorithm=zlib image.qcow2 1G

IMHO we should introduce a nested "compress" struct to hold the format
name, and any other format-specific arguments, in a way that maps nicely to
any future QAPI representation of create options. eg

{ 'enum': 'BlockdevQcow2CompressFormat',
   'data': [ 'zlib', 'lzo' ] }

{ 'union': 'BlockdevQcow2Compress',
   'base': { 'format': 'BlockdevQcow2CompressFormat' },
   'discriminator': 'format',
   'data': { 'zlib': 'BlockdevQcow2CompressZLib',
 'lzo': 'BlockdevQcow2CompressLZO'} }

so it would map to

  qemu-img create -f qcow2 -o compress.format=zlib image.qcow2 1G

and let us have other compress. options specific to each format


Or would it be possible to start with just a compress.level (int) parameter?
In fact that would be sufficient for almost all formats (or better, algorithms?).
The windowBits can default to -15 in the future. It seems the old choice of
-12 was just not optimal. We just have to use it for backwards compatibility
if the compress options are not specified.

Peter




Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] block/qcow2: add zlib-fast compression algorithm

2017-06-27 Thread Peter Lieven

Am 27.06.2017 um 15:16 schrieb Daniel P. Berrange:

On Tue, Jun 27, 2017 at 02:34:10PM +0200, Peter Lieven wrote:

this adds support for optimized zlib settings which almost
tripples the compression speed while maintaining about
the same compressed size.

Signed-off-by: Peter Lieven 
---
  block/qcow2-cluster.c |  3 ++-
  block/qcow2.c | 11 +--
  block/qcow2.h |  1 +
  qemu-img.texi |  1 +
  4 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/qemu-img.texi b/qemu-img.texi
index 043c1ba..83a5db2 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -627,6 +627,7 @@ The following options are available if support for the 
respective libraries
  has been enabled at compile time:
  
 zlib        Uses standard zlib compression (default)

+   zlib-fast   Uses zlib compression with optimized compression parameters

This looks like a poor design from a future proofing POV. I would much
rather see us introduce a more flexible modelling of compression at the
QAPI layer which lets us have tunables for each algorithm, in the same
way that the qcow2 built-in encryption now has ability to set tunables
for each algorithm.

eg your "zlib-fast" impl which just uses zlib with a window size of -15
could be expressed as

qemu-img create -o compress.format=zlib,compress.window=-15


It would also need a compress.level, but okay. This will make things
more complicated as one might not know what good values for
the algorithm are.

Wouldn't it be better to have something like compress.level=1..9 and let
the implementation decide which parameters it chooses for the algorithm?

Do you have a pointer where I can find out how to implement that
in QAPI? Maybe the patches that introduce the encryption settings?

Thanks,
Peter
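
For illustration, the level-to-parameters mapping Peter suggests could look
roughly like this inside the qcow2 driver (the table values are hypothetical,
not calibrated):

    /* translate an abstract compress.level (1..9) into zlib parameters */
    static void qcow2_zlib_params_for_level(int level, int *z_level,
                                            int *window_bits)
    {
        if (level <= 1) {
            *z_level = Z_BEST_SPEED;          /* what "zlib-fast" does */
            *window_bits = -15;
        } else if (level >= 9) {
            *z_level = Z_BEST_COMPRESSION;
            *window_bits = -15;
        } else {
            *z_level = level;
            *window_bits = -12;               /* the legacy default */
        }
    }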




Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] block/qcow2: add zlib-fast compression algorithm

2017-06-27 Thread Daniel P. Berrange
On Tue, Jun 27, 2017 at 02:34:10PM +0200, Peter Lieven wrote:
> this adds support for optimized zlib settings which almost
> tripples the compression speed while maintaining about
> the same compressed size.
> 
> Signed-off-by: Peter Lieven 
> ---
>  block/qcow2-cluster.c |  3 ++-
>  block/qcow2.c | 11 +--
>  block/qcow2.h |  1 +
>  qemu-img.texi |  1 +
>  4 files changed, 13 insertions(+), 3 deletions(-)
> diff --git a/qemu-img.texi b/qemu-img.texi
> index 043c1ba..83a5db2 100644
> --- a/qemu-img.texi
> +++ b/qemu-img.texi
> @@ -627,6 +627,7 @@ The following options are available if support for the 
> respective libraries
>  has been enabled at compile time:
>  
> zlib        Uses standard zlib compression (default)
> +   zlib-fast   Uses zlib compression with optimized compression parameters

This looks like a poor design from a future proofing POV. I would much
rather see us introduce a more flexible modelling of compression at the
QAPI layer which lets us have tunables for each algorithm, in the same
way that the qcow2 built-in encryption now has ability to set tunables
for each algorithm.

eg your "zlib-fast" impl which just uses zlib with a window size of -15
could be expressed as

   qemu-img create -o compress.format=zlib,compress.window=-15



> lzo Uses LZO1X compression


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 4/4] block/qcow2: add zlib-fast compression algorithm

2017-06-27 Thread Peter Lieven

Am 27.06.2017 um 14:53 schrieb Eric Blake:

On 06/27/2017 07:34 AM, Peter Lieven wrote:

this adds support for optimized zlib settings which almost

Start sentences with a capital.


tripples the compression speed while maintaining about

s/tripples/triples/


the same compressed size.

Signed-off-by: Peter Lieven 
---
  block/qcow2-cluster.c |  3 ++-
  block/qcow2.c | 11 +--
  block/qcow2.h |  1 +
  qemu-img.texi |  1 +
  4 files changed, 13 insertions(+), 3 deletions(-)

+++ b/block/qcow2.h
@@ -173,6 +173,7 @@ typedef struct Qcow2UnknownHeaderExtension {
  enum {
  QCOW2_COMPRESSION_ZLIB  = 0xC0318301,
  QCOW2_COMPRESSION_LZO   = 0xC0318302,
+QCOW2_COMPRESSION_ZLIB_FAST = 0xC0318303,

Back to my comments on 1/4 - we MUST first get the qcow2 specification
right, rather than adding undocumented headers in the code.  And I still
think you only need one variable-length header extension for covering
all possible algorithms, rather than one header per algorithm.  Let's
get the spec right first, before worrying about the code implementing
the spec.



Okay, I think someone came up with the idea to have an optional
header per algorithm, but you are right, one header with an optional
parameter payload will also do.

I will split the spec change to a separate patch in V2 to make it easier
to respin.

Thanks,
Peter



Re: [Qemu-block] [Qemu-devel] [PATCH RFC v3 5/8] block: add BlockDevOptionsThrottle to QAPI

2017-06-27 Thread Eric Blake
On 06/23/2017 07:46 AM, Manos Pitsidianakis wrote:
> This is needed to configure throttle filter driver nodes with QAPI.
> 
> Signed-off-by: Manos Pitsidianakis 
> ---
>  qapi/block-core.json | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index f85c2235c7..1d4afafe8c 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2119,7 +2119,7 @@
>  'host_device', 'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs',
>  'null-aio', 'null-co', 'parallels', 'qcow', 'qcow2', 'qed',
>  'quorum', 'raw', 'rbd', 'replication', 'sheepdog', 'ssh',
> -'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
> +'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }

We already have a '@vxhs: Since 2.10' comment, you should add a similar
one for @throttle.


> +
> +##
> +# @BlockdevOptionsThrottle:
> +#
> +# Driver specific block device options for Throttle
> +#
> +# @throttling-group: the name of the throttling group to use
> +#
> +# @options:BlockIOThrottle options
> +# Since: 2.9

You missed 2.9; this should be 2.10.

> +##
> +{ 'struct': 'BlockdevOptionsThrottle',
> +  'data': { 'throttling-group': 'str',
> +'file' : 'BlockdevRef',

Missing documentation for 'file'.

> +'*options' : 'BlockIOThrottle'
> + } }
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH RFC v3 7/8] block: remove legacy I/O throttling

2017-06-27 Thread Stefan Hajnoczi
On Tue, Jun 27, 2017 at 01:45:07AM +0300, Manos Pitsidianakis wrote:
> On Mon, Jun 26, 2017 at 04:44:44PM +0100, Stefan Hajnoczi wrote:
> > On Fri, Jun 23, 2017 at 03:46:59PM +0300, Manos Pitsidianakis wrote:
> > > -void blk_io_limits_disable(BlockBackend *blk)
> > > +void blk_io_limits_disable(BlockBackend *blk, Error **errp)
> > >  {
> > > -assert(blk->public.throttle_group_member.throttle_state);
> > > -bdrv_drained_begin(blk_bs(blk));
> > > -throttle_group_unregister_tgm(&blk->public.throttle_group_member);
> > > -bdrv_drained_end(blk_bs(blk));
> > > +Error *local_err = NULL;
> > > +BlockDriverState *bs, *throttle_node;
> > > +
> > > +throttle_node = blk_get_public(blk)->throttle_node;
> > > +
> > > +assert(throttle_node && throttle_node->refcnt == 1);
> > 
> > I'm not sure if we can enforce refcnt == 1.  What stops other graph
> > manipulation operations from inserting a node above or a BB that uses
> > throttle_node as the root?
> 
> Is this possible unless the user explicitly does this? I suppose going down
> the path and removing it from any place is okay. If the throttle_node has
> more than one parent then the result would be invalid anyway. I don't see
> anything in the code doing this (removing a BDS from a BB->leaf path) or
> wrapping it in some way, is that what should be done?

We cannot assume the user will keep their hands off the graph :).

In general, QEMU cannot assume that external interfaces (e.g.
command-line, monitor, VNC, ...) are only called in convenient and safe
ways.

The code must handle all possible situations.  Usually the simplest
solution is to reject requests that would put QEMU into an invalid
state.

> > 
> > > +
> > > +/* ref throttle_node's child bs so that it isn't lost when throttle_node is
> > > + * destroyed */
> > > +bdrv_ref(bs);
> > > +
> > > +/* this destroys throttle_node */
> > > +blk_remove_bs(blk);
> > 
> > This assumes that throttle_node is the top node.  How is this constraint
> > enforced?
> 
> 
> Kevin had said in a previous discussion:
> > The throttle node affects only the backing file now, but not the new
> > top-level node. With manually created filter nodes we may need to
> > provide a way so that the QMP command tells where in the tree the
> > snapshot is actually taken. With automatically created ones, we need to
> > make sure that they always stay on top.
> > 
> > Similar problems arise when block jobs manipulate the graph. For
> > example, think of a commit block job, which will remove the topmost node
> > from the backing file chain. We don't want it to remove the throttling
> > filter together with the qcow2 node. (Or do we? It might be something
> > that needs to be configurable.)
> > 
> > We also have several QMP that currently default to working on the
> > topmost image of the chain. Some of them may actually want to work on
> > the topmost node that isn't an automatically inserted filter.
> > 
> > 
> > I hope this helps you understand why I keep saying that automatic
> > insertion/removal of filter nodes is tricky and we should do it as
> > little as we possibly can.
> 
> Yes, this constraint isn't enforced. The automatically inserted filter is
> not a very clean concept. I'll have to think about it a little more, unless
> there are some ideas.

The block layer has an "op blockers" and permissions API to reject
operations that would violate the graph constraints.  Kevin has recently
worked on it and I hope he can give you some advice in this area.

Stefan




Re: [Qemu-block] [PATCH RFC v3 4/8] block: convert ThrottleGroup to object with QOM

2017-06-27 Thread Stefan Hajnoczi
On Mon, Jun 26, 2017 at 07:58:32PM +0300, Manos Pitsidianakis wrote:
> On Mon, Jun 26, 2017 at 03:52:34PM +0100, Stefan Hajnoczi wrote:
> > On Fri, Jun 23, 2017 at 03:46:56PM +0300, Manos Pitsidianakis wrote:
> > > +static bool throttle_group_exists(const char *name)
> > > +{
> > > +ThrottleGroup *iter;
> > > +bool ret = false;
> > > +
> > > +qemu_mutex_lock(&throttle_groups_lock);
> > 
> > Not sure if this lock or the throttle_groups list are necessary.
> > 
> > Have you seen iothread.c:qmp_query_iothreads()?  All objects are put
> > into a container (the parent object), so you can just iterate over its
> > children.  There's no need for a separate list because QOM already has
> > all the objects.
> > 
> 
> > > +static void throttle_group_obj_complete(UserCreatable *obj, Error **errp)
> > > +{
> > > +char *name = NULL;
> > > +Error *local_error = NULL;
> > > +ThrottleGroup *tg = THROTTLE_GROUP(obj);
> > > +
> > > +name = object_get_canonical_path_component(OBJECT(obj));
> > > +if (throttle_group_exists(name)) {
> > > +error_setg(&local_error, "A throttle group with this name already \
> > > +  exists.");
> > > +goto ret;
> > > +}
> > 
> > QOM should enforce unique id=.  I don't think this is necessary.
> > 
> > > +
> > > +qemu_mutex_lock(&throttle_groups_lock);
> > > +tg->name = name;
> > > +qemu_mutex_init(&tg->lock);
> > > +QLIST_INIT(&tg->head);
> > > +QTAILQ_INSERT_TAIL(&throttle_groups, tg, list);
> > > +tg->refcount++;
> > > +qemu_mutex_unlock(&throttle_groups_lock);
> > > +
> 
> Sorry for the multiple replies but I just remembered this.
> 
> This is necessary because throttle groups are created by other interfaces as
> well. Of course block/throttle-groups.c could use only QOM objects
> internally to eliminate the housekeeping.

I suggest all throttle group objects are visible in QOM.  Who are the
non-QOM users?

I have CCed Pradeep who has been working on virtio-9p throttling.

Pradeep: You may be interested in this series, which makes the throttle
group a QOM object (-object throttle-group,id=group0,bps=1235678).  In
other words, groups are becoming first-class objects instead of being
hidden behind the member devices that use them.

Stefan
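
To illustrate the container-based lookup Stefan refers to, a minimal sketch
along the lines of iothread.c:qmp_query_iothreads() (the helper names and the
list type here are illustrative):

    static int query_one_throttle_group(Object *object, void *opaque)
    {
        GSList **list = opaque;
        ThrottleGroup *tg = (ThrottleGroup *)
            object_dynamic_cast(object, TYPE_THROTTLE_GROUP);

        if (tg) {
            *list = g_slist_append(*list, tg);
        }
        return 0;   /* continue iterating over the container's children */
    }

    GSList *throttle_group_list(void)
    {
        GSList *list = NULL;

        object_child_foreach(object_get_objects_root(),
                             query_one_throttle_group, &list);
        return list;
    }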




Re: [Qemu-block] [PATCH RFC v3 4/8] block: convert ThrottleGroup to object with QOM

2017-06-27 Thread Stefan Hajnoczi
On Mon, Jun 26, 2017 at 06:24:09PM +0300, Manos Pitsidianakis wrote:
> On Mon, Jun 26, 2017 at 03:52:34PM +0100, Stefan Hajnoczi wrote:
> > > +static void throttle_group_set(Object *obj, Visitor *v, const char *name,
> > > +void *opaque, Error **errp)
> > > +
> > > +{
> > > +ThrottleGroup *tg = THROTTLE_GROUP(obj);
> > > +ThrottleConfig cfg = tg->ts.cfg;
> > > +Error *local_err = NULL;
> > > +ThrottleParamInfo *info = opaque;
> > > +int64_t value;
> > 
> > What happens if this property is set while QEMU is already running?
> 
> I assume you mean setting a property while a group has active members and
> requests? My best answer would be "don't do that". I couldn't figure a way
> to do this cleanly. Individual limit changes may render a ThrottleConfig
> invalid, so it should not be allowed. ThrottleGroups and throttle nodes
> should be destroyed and recreated to change limits with this version, but in
> the next it will be done through block_set_io_throttle() which is the
> existing way to change limits and check for validity. This was discussed in
> the proposal about the new syntax we had on the list.

Please ask on IRC or the mailing list if you have questions.

If you are aware of a limitation and don't know the answer then
submitting the code without any comment or question is dangerous.
Reviewers may miss the problem :) and broken code gets merged.

UserCreatableClass has a ->complete() callback.  You can use this to
perform final initialization and then "freeze" the properties by setting
a bool flag.  The property accessor functions can do:

  if (tg->init_complete) {
  error_setg(errp, "Property modification is not allowed at run-time");
  return;
  }

Something like this is used by
backends/hostmem.c:host_memory_backend_set_size(), for example.
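
Fleshed out slightly, the completing half of that pattern might look like
this (tg->init_complete is the flag from the snippet above; the rest is an
illustrative sketch):

    static void throttle_group_obj_complete(UserCreatable *obj, Error **errp)
    {
        ThrottleGroup *tg = THROTTLE_GROUP(obj);

        /* final validation of the assembled ThrottleConfig goes here */

        /* freeze the properties: from now on the setters reject writes */
        tg->init_complete = true;
    }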

> > 
> > > +goto out;
> > > +}
> > 
> > This doesn't validate inputs properly for non int64_t types.
> > 
> > I'm also worried that the command-line bps=,iops=,... options do not
> > have unsigned or double types.  Input ranges and validation should match
> > the QEMU command-line (I know this is a bit of a pain with QOM since the
> > property types are different from QEMU option types).
> 
> I don't know what should be done about this, to be honest, except for
> manually checking the limits for each datatype in the QOM setters.

I believe all throttling parameter types are int64_t in QemuOpts.  If we
want to be compatible with the command-line parameters then they should
also be int64_t here instead of unsigned int or double.

This approach makes sense from the QMP user perspective.  QMP clients
shouldn't have to deal with different data types depending on which
throttling API they use.  Let's keep it consistent - there's no real
drawback to using int64_t.




Re: [Qemu-block] [PATCH 4/4] block/qcow2: add zlib-fast compression algorithm

2017-06-27 Thread Eric Blake
On 06/27/2017 07:34 AM, Peter Lieven wrote:
> this adds support for optimized zlib settings which almost

Start sentences with a capital.

> tripples the compression speed while maintaining about

s/tripples/triples/

> the same compressed size.
> 
> Signed-off-by: Peter Lieven 
> ---
>  block/qcow2-cluster.c |  3 ++-
>  block/qcow2.c | 11 +--
>  block/qcow2.h |  1 +
>  qemu-img.texi |  1 +
>  4 files changed, 13 insertions(+), 3 deletions(-)
> 

> +++ b/block/qcow2.h
> @@ -173,6 +173,7 @@ typedef struct Qcow2UnknownHeaderExtension {
>  enum {
>  QCOW2_COMPRESSION_ZLIB  = 0xC0318301,
>  QCOW2_COMPRESSION_LZO   = 0xC0318302,
> +QCOW2_COMPRESSION_ZLIB_FAST = 0xC0318303,

Back to my comments on 1/4 - we MUST first get the qcow2 specification
right, rather than adding undocumented headers in the code.  And I still
think you only need one variable-length header extension for covering
all possible algorithms, rather than one header per algorithm.  Let's
get the spec right first, before worrying about the code implementing
the spec.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH 1/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Eric Blake
On 06/27/2017 07:34 AM, Peter Lieven wrote:
> this patch adds a new compression_algorithm option when creating qcow2 images.
> The current default for the compresison algorithm is zlib and zlib will be

s/compresison/compression/

> used when this option is omitted (like before).
> 
> If the option is specified e.g. with:
> 
>  qemu-img create -f qcow2 -o compression_algorithm=zlib image.qcow2 1G
> 
> then a new compression algorithm header extension is added and an incompatible
> feature bit is set. This means that if the header is present it must be parsed
> by Qemu on qcow2_open and it must be validated if the specified compression
> algorithm is supported by the current build of Qemu.
> 
> This means if the compression_algorithm option is specified Qemu prior to this
> commit will not be able to open the created image.
> 
> Signed-off-by: Peter Lieven 
> ---
>  block/qcow2.c | 93 
> ---
>  block/qcow2.h | 20 +++---
>  docs/interop/qcow2.txt|  8 +++-

Focusing on just the spec change first:

> +++ b/docs/interop/qcow2.txt
> @@ -85,7 +85,11 @@ in the description of a field.
>  be written to (unless for regaining
>  consistency).
>  
> -Bits 2-63:  Reserved (set to 0)
> +Bit 2:  Compress algorithm bit.  If this bit is set then
> +the compress algorithm extension must be parsed
> +and checked for compatiblity.

s/compatiblity/compatibility/

> +
> +Bits 3-63:  Reserved (set to 0)
>  
>   80 -  87:  compatible_features
>  Bitmask of compatible features. An implementation can
> @@ -135,6 +139,8 @@ be stored. Each extension has a structure like the 
> following:
>  0xE2792ACA - Backing file format name
>  0x6803f857 - Feature name table
>  0x23852875 - Bitmaps extension
> +0xC0318300 - Compression Algorithm
> +0xC03183xx - Reserved for compression algorithm params

s/params/parameters/

You have now introduced 256 different reserved headers, without
documenting any of their formats.  You absolutely MUST include a
documentation of how the new 0xC0318300 header is laid out (see, for
example, our section on "Bitmaps extension"), along with text mentioning
that the new header MUST be present if incompatible-feature bit is set
and MUST be absent otherwise.  But I also think that with a bit of
proper design work, you only need ONE header for all possible algorithm
parameters, rather than burning an additional 255 unspecified
reservations.  That is, make sure your new header includes a common
prefix including a length field and the algorithm in use, and then the
length covers a variable-length suffix that can be parsed in a
per-algorithm-specific manner for whatever additional parameters are
needed for that algorithm.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
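
As a concrete illustration of the single variable-length header Eric
describes, the extension body could be laid out like this (field names are
illustrative, not a settled spec; integers big-endian as elsewhere in qcow2):

    typedef struct QEMU_PACKED Qcow2CompressionHeaderExt {
        uint32_t format;      /* enumerated compression format, e.g. 0 = zlib */
        uint32_t extra_len;   /* length of the format-specific data below */
        /* followed by extra_len bytes of parameters, parsed according
         * to 'format' */
    } Qcow2CompressionHeaderExt;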





Re: [Qemu-block] [PATCH RFC v3 3/8] block: add throttle block filter driver

2017-06-27 Thread Stefan Hajnoczi
On Mon, Jun 26, 2017 at 07:26:41PM +0300, Manos Pitsidianakis wrote:
> On Mon, Jun 26, 2017 at 03:30:55PM +0100, Stefan Hajnoczi wrote:
> > > +bs->file = bdrv_open_child(NULL, options, "file",
> > > +bs, &child_file, false, &local_err);
> > > +
> > > +if (local_err) {
> > > +error_propagate(errp, local_err);
> > > +return -EINVAL;
> > > +}
> > > +
> > > +qdict_flatten(options);
> > > +return throttle_configure_tgm(bs, tgm, options, errp);
> > 
> > Who destroys bs->file on error?
> 
> It is reaped by bdrv_open_inherit() on failure, if I'm not mistaken.
> That's how other drivers handle this as well. Some (eg block/qcow2.c)
> check if bs->file is NULL instead of the error pointer they pass, so
> this is not very consistent.

Maybe I'm missing it but I don't see relevant bs->file cleanup in
bdrv_open_inherit() or bdrv_open_common().

Please post the exact line where it happens.

Stefan




Re: [Qemu-block] [PATCH RFC v3 3/8] block: add throttle block filter driver

2017-06-27 Thread Stefan Hajnoczi
On Mon, Jun 26, 2017 at 07:01:18PM +0300, Manos Pitsidianakis wrote:
> On Mon, Jun 26, 2017 at 03:30:55PM +0100, Stefan Hajnoczi wrote:
> > > +static BlockDriver bdrv_throttle = {
> > > +.format_name=   "throttle",
> > > +.protocol_name  =   "throttle",
> > > +.instance_size  =   sizeof(ThrottleGroupMember),
> > > +
> > > +.bdrv_file_open =   throttle_open,
> > > +.bdrv_close =   throttle_close,
> > > +.bdrv_co_flush  =   throttle_co_flush,
> > > +
> > > +.bdrv_child_perm=   bdrv_filter_default_perms,
> > > +
> > > +.bdrv_getlength =   throttle_getlength,
> > > +
> > > +.bdrv_co_preadv =   throttle_co_preadv,
> > > +.bdrv_co_pwritev=   throttle_co_pwritev,
> > > +
> > > +.bdrv_co_pwrite_zeroes  =   throttle_co_pwrite_zeroes,
> > > +.bdrv_co_pdiscard   =   throttle_co_pdiscard,
> > > +
> > > +.bdrv_recurse_is_first_non_filter   =   bdrv_recurse_is_first_non_filter,
> > > +
> > > +.bdrv_attach_aio_context=   throttle_attach_aio_context,
> > > +.bdrv_detach_aio_context=   throttle_detach_aio_context,
> > > +
> > > +.bdrv_reopen_prepare=   throttle_reopen_prepare,
> > > +.bdrv_reopen_commit =   throttle_reopen_commit,
> > > +.bdrv_reopen_abort  =   throttle_reopen_abort,
> > > +
> > > +.is_filter  =   true,
> > > +};
> > 
> > Missing:
> > bdrv_co_get_block_status()
> > bdrv_truncate()
> > bdrv_get_info()
> > bdrv_probe_blocksizes()
> > bdrv_probe_geometry()
> > bdrv_media_changed()
> > bdrv_eject()
> > bdrv_lock_medium()
> > bdrv_co_ioctl()
> > 
> > See block/raw-format.c.
> > 
> > I think most of these could be modified in block.c or block/io.c to
> > automatically call bs->file's function if drv doesn't implement them.
> > This way all block drivers would transparently pass them through by
> > default and block/raw-format.c code could be eliminated.
> 
> Are these truly necessary? Because other filter drivers (ie quorum,
> blkverify) don't implement them.

Both quorum and blkverify are rarely used.  This explains why the issue
hasn't been found yet.

These are the callbacks I identified which do not automatically forward
to bs->file.  Therefore the throttle driver will break these features
when bs->file supports them.

That's why I suggest forwarding to bs->file in block.c.  Then individual
drivers do not have to implement these callbacks just to forward to
bs->file.  And if the driver wishes to prohibit a feature, it can
implement the callback and return -ENOTSUP.

You can send this fix as a separate patch series, independent of the
throttle driver.  Once it has been merged the throttle driver will gain
support for these features.

Stefan
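
A sketch of the kind of generic forwarding Stefan suggests, using
bdrv_probe_blocksizes() as the example (illustrative only, not the current
block.c code):

    int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
    {
        BlockDriver *drv = bs->drv;

        if (drv && drv->bdrv_probe_blocksizes) {
            return drv->bdrv_probe_blocksizes(bs, bsz);
        } else if (drv && drv->is_filter && bs->file) {
            /* filter drivers without their own implementation forward
             * the request to their child by default */
            return bdrv_probe_blocksizes(bs->file->bs, bsz);
        }

        return -ENOTSUP;
    }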




Re: [Qemu-block] [PATCH RFC v3 2/8] block: Add aio_context field in ThrottleGroupMember

2017-06-27 Thread Alberto Garcia
On Fri 23 Jun 2017 02:46:54 PM CEST, Manos Pitsidianakis wrote:
> timer_cb() needs to know about the current Aio context of the throttle
> request that is woken up. In order to make ThrottleGroupMember backend
> agnostic, this information is stored in an aio_context field instead of
> accessing it from BlockBackend.
>
> Signed-off-by: Manos Pitsidianakis 

I agree with Stefan's comments, otherwise the patch looks good to me.

Berto



[Qemu-block] [PATCH 3/4] block/qcow2: add lzo compression algorithm

2017-06-27 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/qcow2-cluster.c | 65 +++---
 block/qcow2.c | 72 ++-
 block/qcow2.h |  1 +
 configure |  2 +-
 qemu-img.texi |  1 +
 5 files changed, 95 insertions(+), 46 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 3d341fd..ecb059b 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -24,6 +24,9 @@
 
 #include "qemu/osdep.h"
 #include 
+#ifdef CONFIG_LZO
+#include 
+#endif
 
 #include "qapi/error.h"
 #include "qemu-common.h"
@@ -1521,30 +1524,49 @@ again:
 }
 
 static int decompress_buffer(uint8_t *out_buf, int out_buf_size,
- const uint8_t *buf, int buf_size)
+ const uint8_t *buf, int buf_size,
+ uint32_t compression_algorithm_id)
 {
 z_stream strm1, *strm = &strm1;
-int ret, out_len;
-
-memset(strm, 0, sizeof(*strm));
-
-strm->next_in = (uint8_t *)buf;
-strm->avail_in = buf_size;
-strm->next_out = out_buf;
-strm->avail_out = out_buf_size;
-
-ret = inflateInit2(strm, -12);
-if (ret != Z_OK)
-return -1;
-ret = inflate(strm, Z_FINISH);
-out_len = strm->next_out - out_buf;
-if ((ret != Z_STREAM_END && ret != Z_BUF_ERROR) ||
-out_len != out_buf_size) {
+int ret = 0, out_len;
+
+switch (compression_algorithm_id) {
+case QCOW2_COMPRESSION_ZLIB:
+memset(strm, 0, sizeof(*strm));
+
+strm->next_in = (uint8_t *)buf;
+strm->avail_in = buf_size;
+strm->next_out = out_buf;
+strm->avail_out = out_buf_size;
+
+ret = inflateInit2(strm, -12);
+if (ret != Z_OK) {
+return -1;
+}
+ret = inflate(strm, Z_FINISH);
+out_len = strm->next_out - out_buf;
+ret = -(ret != Z_STREAM_END);
 inflateEnd(strm);
-return -1;
+break;
+#ifdef CONFIG_LZO
+case QCOW2_COMPRESSION_LZO:
+out_len = out_buf_size;
+ret = lzo1x_decompress_safe(buf, buf_size, out_buf,
+(lzo_uint *) &out_len, NULL);
+if (ret == LZO_E_INPUT_NOT_CONSUMED) {
+/* We always read up to the next sector boundary. Thus
+ * buf_size may be larger than the original compressed size. */
+ret = 0;
+}
+break;
+#endif
+default:
+abort(); /* should never reach this point */
 }
-inflateEnd(strm);
-return 0;
+if (out_len != out_buf_size) {
+ret = -1;
+}
+return ret;
 }
 
 int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
@@ -1565,7 +1587,8 @@ int qcow2_decompress_cluster(BlockDriverState *bs, 
uint64_t cluster_offset)
 return ret;
 }
 if (decompress_buffer(s->cluster_cache, s->cluster_size,
-  s->cluster_data + sector_offset, csize) < 0) {
+  s->cluster_data + sector_offset, csize,
+  s->compression_algorithm_id) < 0) {
 return -EIO;
 }
 s->cluster_cache_offset = coffset;
diff --git a/block/qcow2.c b/block/qcow2.c
index c91eb1f..bd65582 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -26,6 +26,9 @@
 #include "sysemu/block-backend.h"
 #include "qemu/module.h"
 #include 
+#ifdef CONFIG_LZO
+#include 
+#endif
 #include "block/qcow2.h"
 #include "qemu/error-report.h"
 #include "qapi/qmp/qerror.h"
@@ -84,6 +87,10 @@ static uint32_t is_compression_algorithm_supported(char 
*algorithm)
 /* no algorithm means the old default of zlib compression
  * with 12 window bits */
 return QCOW2_COMPRESSION_ZLIB;
+#ifdef CONFIG_LZO
+} else if (!strcmp(algorithm, "lzo")) {
+return QCOW2_COMPRESSION_LZO;
+#endif
 }
 return 0;
 }
@@ -2715,8 +2722,8 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 QEMUIOVector hd_qiov;
 struct iovec iov;
 z_stream strm;
-int ret, out_len;
-uint8_t *buf, *out_buf, *local_buf = NULL;
+int ret, out_len = 0;
+uint8_t *buf, *out_buf = NULL, *local_buf = NULL, *work_buf = NULL;
 uint64_t cluster_offset;
 
 if (bytes == 0) {
@@ -2741,34 +2748,50 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 buf = qiov->iov[0].iov_base;
 }
 
-out_buf = g_malloc(s->cluster_size);
+switch (s->compression_algorithm_id) {
+case QCOW2_COMPRESSION_ZLIB:
+out_buf = g_malloc(s->cluster_size);
 
-/* best compression, small window, no zlib header */
-memset(&strm, 0, sizeof(strm));
-ret = deflateInit2(&strm, Z_DEFAULT_COMPRESSION,
-   Z_DEFLATED, -12,
-   9, Z_DEFAULT_STRATEGY);
-if (ret != 0) {
-ret = -EINVAL;
-goto fail;
-}
+/* best compression, small window, no zlib header */
+

[Qemu-block] [PATCH 1/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Peter Lieven
this patch adds a new compression_algorithm option when creating qcow2 images.
The current default for the compresison algorithm is zlib and zlib will be
used when this option is omitted (like before).

If the option is specified e.g. with:

 qemu-img create -f qcow2 -o compression_algorithm=zlib image.qcow2 1G

then a new compression algorithm header extension is added and an incompatible
feature bit is set. This means that if the header is present it must be parsed
by Qemu on qcow2_open and it must be validated if the specified compression
algorithm is supported by the current build of Qemu.

This means if the compression_algorithm option is specified Qemu prior to this
commit will not be able to open the created image.

Signed-off-by: Peter Lieven 
---
 block/qcow2.c | 93 ---
 block/qcow2.h | 20 +++---
 docs/interop/qcow2.txt|  8 +++-
 include/block/block_int.h | 35 +-
 qemu-img.texi | 10 +
 5 files changed, 138 insertions(+), 28 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2f94f03..893b145 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -60,9 +60,11 @@ typedef struct {
 uint32_t len;
 } QEMU_PACKED QCowExtension;
 
-#define  QCOW2_EXT_MAGIC_END 0
-#define  QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA
-#define  QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
+#define QCOW2_EXT_MAGIC_END   0
+#define QCOW2_EXT_MAGIC_BACKING_FORMAT0xE2792ACA
+#define QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
+#define QCOW2_EXT_MAGIC_COMPRESSION_ALGORITHM 0xc0318300
+/* 0xc03183xx reserved for further use of compression algorithm parameters */
 
 static int qcow2_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
@@ -76,6 +78,15 @@ static int qcow2_probe(const uint8_t *buf, int buf_size, 
const char *filename)
 return 0;
 }
 
+static uint32_t is_compression_algorithm_supported(char *algorithm)
+{
+if (!algorithm[0] || !strcmp(algorithm, "zlib")) {
+/* no algorithm means the old default of zlib compression
+ * with 12 window bits */
+return QCOW2_COMPRESSION_ZLIB;
+}
+return 0;
+}
 
 /* 
  * read qcow2 extension and fill bs
@@ -148,6 +159,34 @@ static int qcow2_read_extensions(BlockDriverState *bs, 
uint64_t start_offset,
 #endif
 break;
 
+case QCOW2_EXT_MAGIC_COMPRESSION_ALGORITHM:
+if (ext.len >= sizeof(s->compression_algorithm)) {
+error_setg(errp, "ERROR: ext_compression_algorithm: len=%"
+   PRIu32 " too large (>=%zu)", ext.len,
+   sizeof(s->compression_algorithm));
+return 2;
+}
+ret = bdrv_pread(bs->file, offset, s->compression_algorithm,
+ ext.len);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "ERROR: ext_compression_algorithm:"
+ " Could not read algorithm name");
+return 3;
+}
+s->compression_algorithm[ext.len] = '\0';
+s->compression_algorithm_id =
+is_compression_algorithm_supported(s->compression_algorithm);
+if (!s->compression_algorithm_id) {
+error_setg(errp, "ERROR: compression algorithm '%s' is "
+   " unsupported", s->compression_algorithm);
+return 4;
+}
+#ifdef DEBUG_EXT
+printf("Qcow2: Got compression algorithm %s\n",
+   s->compression_algorithm);
+#endif
+break;
+
 case QCOW2_EXT_MAGIC_FEATURE_TABLE:
 if (p_feature_table != NULL) {
 void* feature_table = g_malloc0(ext.len + 2 * sizeof(Qcow2Feature));
@@ -1104,6 +1143,7 @@ static int qcow2_do_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 s->cluster_cache_offset = -1;
 s->flags = flags;
+s->compression_algorithm_id = QCOW2_COMPRESSION_ZLIB;
 
 ret = qcow2_refcount_init(bs);
 if (ret != 0) {
@@ -1981,6 +2021,21 @@ int qcow2_update_header(BlockDriverState *bs)
 buflen -= ret;
 }
 
+/* Compression Algorithm header extension */
+if (s->compression_algorithm[0]) {
+ret = header_ext_add(buf, QCOW2_EXT_MAGIC_COMPRESSION_ALGORITHM,
+ s->compression_algorithm,
+ strlen(s->compression_algorithm),
+ buflen);
+if (ret < 0) {
+goto fail;
+}
+buf += ret;
+buflen -= ret;
+header->incompatible_features |=
+cpu_to_be64(QCOW2_INCOMPAT_COMPRESSION);
+}
+
 /* Feature table */
 if (s->qcow_version >= 3) {
 Qcow2Feature features[] = {
@@ -1995,6 +2050,11 @@ int qcow2_update_header(BlockDriverState *bs)
 .name = "corrupt bit",
 },
 {
+

[Qemu-block] [PATCH 4/4] block/qcow2: add zlib-fast compression algorithm

2017-06-27 Thread Peter Lieven
this adds support for optimized zlib settings which almost
tripples the compression speed while maintaining about
the same compressed size.

Signed-off-by: Peter Lieven 
---
 block/qcow2-cluster.c |  3 ++-
 block/qcow2.c | 11 +--
 block/qcow2.h |  1 +
 qemu-img.texi |  1 +
 4 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ecb059b..d8e2378 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1532,6 +1532,7 @@ static int decompress_buffer(uint8_t *out_buf, int 
out_buf_size,
 
 switch (compression_algorithm_id) {
 case QCOW2_COMPRESSION_ZLIB:
+case QCOW2_COMPRESSION_ZLIB_FAST:
 memset(strm, 0, sizeof(*strm));
 
 strm->next_in = (uint8_t *)buf;
@@ -1539,7 +1540,7 @@ static int decompress_buffer(uint8_t *out_buf, int 
out_buf_size,
 strm->next_out = out_buf;
 strm->avail_out = out_buf_size;
 
-ret = inflateInit2(strm, -12);
+ret = inflateInit2(strm, -15);
 if (ret != Z_OK) {
 return -1;
 }
diff --git a/block/qcow2.c b/block/qcow2.c
index bd65582..f07d8f0 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -87,6 +87,8 @@ static uint32_t is_compression_algorithm_supported(char 
*algorithm)
 /* no algorithm means the old default of zlib compression
  * with 12 window bits */
 return QCOW2_COMPRESSION_ZLIB;
+} else if (!strcmp(algorithm, "zlib-fast")) {
+return QCOW2_COMPRESSION_ZLIB_FAST;
 #ifdef CONFIG_LZO
 } else if (!strcmp(algorithm, "lzo")) {
 return QCOW2_COMPRESSION_LZO;
@@ -2722,6 +2724,7 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 QEMUIOVector hd_qiov;
 struct iovec iov;
 z_stream strm;
+int z_level = Z_DEFAULT_COMPRESSION, z_windowBits = -12;
 int ret, out_len = 0;
 uint8_t *buf, *out_buf = NULL, *local_buf = NULL, *work_buf = NULL;
 uint64_t cluster_offset;
@@ -2749,13 +2752,17 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 }
 
 switch (s->compression_algorithm_id) {
+case QCOW2_COMPRESSION_ZLIB_FAST:
+z_level = Z_BEST_SPEED;
+z_windowBits = -15;
+/* fall-through */
 case QCOW2_COMPRESSION_ZLIB:
 out_buf = g_malloc(s->cluster_size);
 
 /* best compression, small window, no zlib header */
 memset(&strm, 0, sizeof(strm));
-ret = deflateInit2(&strm, Z_DEFAULT_COMPRESSION,
-   Z_DEFLATED, -12,
+ret = deflateInit2(&strm, z_level,
+   Z_DEFLATED, z_windowBits,
9, Z_DEFAULT_STRATEGY);
 if (ret != 0) {
 ret = -EINVAL;
diff --git a/block/qcow2.h b/block/qcow2.h
index 716012c..a89f986 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -173,6 +173,7 @@ typedef struct Qcow2UnknownHeaderExtension {
 enum {
 QCOW2_COMPRESSION_ZLIB  = 0xC0318301,
 QCOW2_COMPRESSION_LZO   = 0xC0318302,
+QCOW2_COMPRESSION_ZLIB_FAST = 0xC0318303,
 };
 
 enum {
diff --git a/qemu-img.texi b/qemu-img.texi
index 043c1ba..83a5db2 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -627,6 +627,7 @@ The following options are available if support for the 
respective libraries
 has been enabled at compile time:
 
 zlib        Uses standard zlib compression (default)
+   zlib-fast   Uses zlib compression with optimized compression parameters
lzo Uses LZO1X compression
 
 The compression algorithm can only be defined at image create time and cannot
-- 
1.9.1




[Qemu-block] [PATCH 2/4] block/qcow2: optimize qcow2_co_pwritev_compressed

2017-06-27 Thread Peter Lieven
If we specify exactly one iov of s->cluster_size bytes, we can avoid
the bounce buffer.

Signed-off-by: Peter Lieven 
---
 block/qcow2.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 893b145..c91eb1f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2716,7 +2716,7 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 struct iovec iov;
 z_stream strm;
 int ret, out_len;
-uint8_t *buf, *out_buf;
+uint8_t *buf, *out_buf, *local_buf = NULL;
 uint64_t cluster_offset;
 
 if (bytes == 0) {
@@ -2726,8 +2726,8 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 return bdrv_truncate(bs->file, cluster_offset, NULL);
 }
 
-buf = qemu_blockalign(bs, s->cluster_size);
-if (bytes != s->cluster_size) {
+if (bytes != s->cluster_size || qiov->niov != 1) {
+buf = local_buf = qemu_blockalign(bs, s->cluster_size);
 if (bytes > s->cluster_size ||
 offset + bytes != bs->total_sectors << BDRV_SECTOR_BITS)
 {
@@ -2736,8 +2736,10 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 }
 /* Zero-pad last write if image size is not cluster aligned */
 memset(buf + bytes, 0, s->cluster_size - bytes);
+qemu_iovec_to_buf(qiov, 0, buf, bytes);
+} else {
+buf = qiov->iov[0].iov_base;
 }
-qemu_iovec_to_buf(qiov, 0, buf, bytes);
 
 out_buf = g_malloc(s->cluster_size);
 
@@ -2805,7 +2807,7 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
uint64_t offset,
 success:
 ret = 0;
 fail:
-qemu_vfree(buf);
+qemu_vfree(local_buf);
 g_free(out_buf);
 return ret;
 }
-- 
1.9.1




[Qemu-block] [PATCH 0/4] block/qcow2: add compression_algorithm create option

2017-06-27 Thread Peter Lieven
This adds a create option for QCOW2 images to specify the compression algorithm
for compressed clusters. The series adds 3 algorithms to choose from:
zlib (default), zlib-fast and lzo. zlib-fast optimizes the zlib parameters
without the need for an additional compression library. lzo is chosen
as we already link against it and it's widely available. Further
libraries like zstd are in preparation.

Compression time vs. size of an uncompressed Debian 9 QCOW2 image (size 1148MB):

zlib        35.7s  339MB
zlib-fast   12.8s  348MB
lzo          4.2s  429MB

Peter Lieven (4):
  block/qcow2: add compression_algorithm create option
  block/qcow2: optimize qcow2_co_pwritev_compressed
  block/qcow2: add lzo compression algorithm
  block/qcow2: add zlib-fast compression algorithm

 block/qcow2-cluster.c |  66 ++--
 block/qcow2.c | 186 +-
 block/qcow2.h |  22 --
 configure |   2 +-
 docs/interop/qcow2.txt|   8 +-
 include/block/block_int.h |  35 -
 qemu-img.texi |  12 +++
 7 files changed, 251 insertions(+), 80 deletions(-)

-- 
1.9.1




Re: [Qemu-block] [PATCH RFC v3 1/8] block: move ThrottleGroup membership to ThrottleGroupMember

2017-06-27 Thread Manos Pitsidianakis

On Tue, Jun 27, 2017 at 02:08:54PM +0200, Alberto Garcia wrote:

On Fri 23 Jun 2017 02:46:53 PM CEST, Manos Pitsidianakis wrote:

This commit gathers ThrottleGroup membership details from
BlockBackendPublic into ThrottleGroupMember and refactors existing code
to use the structure.

Signed-off-by: Manos Pitsidianakis 


Hey Manos, thanks for the patch. It looks good to me in general, I only
have a couple of comments:


 /* If no IO are queued for scheduling on the next round robin token
- * then decide the token is the current bs because chances are
- * the current bs get the current request queued.
+ * then decide the token is the current tgm because chances are
+ * the current tgm get the current request queued.


I wonder if it's not better to use 'member' instead of 'tgm' in general.
My impression is that the former is clearer and not too long, but I
don't have a very strong opinion so you can keep it like this if you
want.


I will change it, no problem,


-/* Check if an I/O request needs to be throttled, wait and set a timer
- * if necessary, and schedule the next request using a round robin
- * algorithm.
+/* Check if an I/O request needs to be throttled, wait and set a timer if
+ * necessary, and schedule the next request using a round robin algorithm.
  *


There's a few cases like this when you reformat a paragraph but don't
actually change the text. I think it just adds unnecessary noise to the
diff...


Wiki says "It's OK to fix coding style issues in the immediate area (few 
lines) of the lines you're changing" so I left the reformats. Since they 
are noticed, they must be too noisy. I will either remove the changes 
or do a restyling patch later.



--- a/include/qemu/throttle.h
+++ b/include/qemu/throttle.h
@@ -27,6 +27,7 @@

 #include "qemu-common.h"
 #include "qemu/timer.h"
+#include "qemu/coroutine.h"

 #define THROTTLE_VALUE_MAX 1000LL

@@ -153,4 +154,29 @@ bool throttle_schedule_timer(ThrottleState *ts,

 void throttle_account(ThrottleState *ts, bool is_write, uint64_t size);

+
+/* The ThrottleGroupMember structure indicates membership in a ThrottleGroup
+ * and holds related data.
+ */
+
+typedef struct ThrottleGroupMember {
+/* throttled_reqs_lock protects the CoQueues for throttled requests.  */
+CoMutex  throttled_reqs_lock;
+CoQueue  throttled_reqs[2];
+
+/* Nonzero if the I/O limits are currently being ignored; generally
+ * it is zero.  Accessed with atomic operations.
+ */
+unsigned int io_limits_disabled;
+
+/* The following fields are protected by the ThrottleGroup lock.
+ * See the ThrottleGroup documentation for details.
+ * throttle_state tells us if I/O limits are configured. */
+ThrottleState *throttle_state;
+ThrottleTimers throttle_timers;
+unsigned   pending_reqs[2];
+QLIST_ENTRY(ThrottleGroupMember) round_robin;
+
+} ThrottleGroupMember;
+


Any reason why you add this to throttle.h (which is generic throttling
code independent from the block layer) and not to throttle-groups.h?


I put it there because it's not directly ThrottleGroup-related, but 
you're right, if throttle.h is block layer free it should remain that 
way.


Otherwise it all looks good to me, thanks!

Berto





Re: [Qemu-block] [PATCH RFC v3 1/8] block: move ThrottleGroup membership to ThrottleGroupMember

2017-06-27 Thread Alberto Garcia
On Fri 23 Jun 2017 02:46:53 PM CEST, Manos Pitsidianakis wrote:
> This commit gathers ThrottleGroup membership details from
> BlockBackendPublic into ThrottleGroupMember and refactors existing code
> to use the structure.
>
> Signed-off-by: Manos Pitsidianakis 

Hey Manos, thanks for the patch. It looks good to me in general, I only
have a couple of comments:

>  /* If no IO are queued for scheduling on the next round robin token
> - * then decide the token is the current bs because chances are
> - * the current bs get the current request queued.
> + * then decide the token is the current tgm because chances are
> + * the current tgm get the current request queued.

I wonder if it's not better to use 'member' instead of 'tgm' in general.
My impression is that the former is clearer and not too long, but I
don't have a very strong opinion so you can keep it like this if you
want.

> -/* Check if an I/O request needs to be throttled, wait and set a timer
> - * if necessary, and schedule the next request using a round robin
> - * algorithm.
> +/* Check if an I/O request needs to be throttled, wait and set a timer if
> + * necessary, and schedule the next request using a round robin algorithm.
>   *

There's a few cases like this when you reformat a paragraph but don't
actually change the text. I think it just adds unnecessary noise to the
diff...

> --- a/include/qemu/throttle.h
> +++ b/include/qemu/throttle.h
> @@ -27,6 +27,7 @@
>  
>  #include "qemu-common.h"
>  #include "qemu/timer.h"
> +#include "qemu/coroutine.h"
>  
>  #define THROTTLE_VALUE_MAX 1000LL
>  
> @@ -153,4 +154,29 @@ bool throttle_schedule_timer(ThrottleState *ts,
>  
>  void throttle_account(ThrottleState *ts, bool is_write, uint64_t size);
>  
> +
> +/* The ThrottleGroupMember structure indicates membership in a ThrottleGroup
> + * and holds related data.
> + */
> +
> +typedef struct ThrottleGroupMember {
> +/* throttled_reqs_lock protects the CoQueues for throttled requests.  */
> +CoMutex  throttled_reqs_lock;
> +CoQueue  throttled_reqs[2];
> +
> +/* Nonzero if the I/O limits are currently being ignored; generally
> + * it is zero.  Accessed with atomic operations.
> + */
> +unsigned int io_limits_disabled;
> +
> +/* The following fields are protected by the ThrottleGroup lock.
> + * See the ThrottleGroup documentation for details.
> + * throttle_state tells us if I/O limits are configured. */
> +ThrottleState *throttle_state;
> +ThrottleTimers throttle_timers;
> +unsigned   pending_reqs[2];
> +QLIST_ENTRY(ThrottleGroupMember) round_robin;
> +
> +} ThrottleGroupMember;
> +

Any reason why you add this to throttle.h (which is generic throttling
code independent from the block layer) and not to throttle-groups.h?

Otherwise it all looks good to me, thanks!

Berto



Re: [Qemu-block] [PATCH] docs: add qemu-block-drivers(7) man page

2017-06-27 Thread Stefan Hajnoczi
On Mon, Jun 26, 2017 at 04:40:58PM -0400, John Snow wrote:
> 
> 
> On 06/22/2017 08:17 AM, Stefan Hajnoczi wrote:
> > Block driver documentation is available in qemu-doc.html.  It would be
> > convenient to have documentation for formats, protocols, and filter
> > drivers in a man page.
> > 
> > Extract the relevant part of qemu-doc.html into a new file called
> > docs/qemu-block-drivers.texi.  This file can also be built as a
> > stand-alone document (man, html, etc).
> > 
> > Signed-off-by: Stefan Hajnoczi 
> 
> Thanks, this is really useful information to have. Looks good to me, but
> Paolo's changes prevent it from applying now.
> 
> So, as of last week:
> 
> Reviewed-by: John Snow 

Thanks for the reminder.  Will rebase and resend.

Stefan




Re: [Qemu-block] [Qemu-devel] [RFC PATCH 0/2] Enhance block status when crossing EOF

2017-06-27 Thread Fam Zheng
On Thu, 05/04 21:14, Eric Blake wrote:
> Thus, this is a followup to that series, but I'm also okay if we
> think it is too much maintenance compared to the potential gains,
> and decide to drop it after all.

The comments are good enough and I like how this makes the function interface a
bit more powerful. Fixed the typo as pointed out by Stefan and applied. Thanks!

Fam



Re: [Qemu-block] [PATCH v6 8/8] vmdk: Make vmdk_get_cluster_offset() return cluster offset only

2017-06-27 Thread Fam Zheng
On Mon, 06/05 13:22, Ashijeet Acharya wrote:
> vmdk_alloc_clusters() introduced earlier now handles the task of
> allocating clusters and performing COW when needed. Thus we can change
> vmdk_get_cluster_offset() to stick to the sole purpose of returning
> cluster offset using sector number. Update the changes at all call
> sites.
> 
> Signed-off-by: Ashijeet Acharya 
> ---
>  block/vmdk.c | 56 
>  1 file changed, 12 insertions(+), 44 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 9fa2414..accf1c3 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -1485,25 +1485,16 @@ static int vmdk_alloc_clusters(BlockDriverState *bs,
>   * For flat extents, the start offset as parsed from the description file is
>   * returned.
>   *
> - * For sparse extents, look up in L1, L2 table. If allocate is true, return an
> - * offset for a new cluster and update L2 cache. If there is a backing file,
> - * COW is done before returning; otherwise, zeroes are written to the allocated
> - * cluster. Both COW and zero writing skips the sector range
> - * [@skip_start_sector, @skip_end_sector) passed in by caller, because caller
> - * has new data to write there.
> + * For sparse extents, look up the L1, L2 table.
>   *
>   * Returns: VMDK_OK if cluster exists and mapped in the image.
> - *  VMDK_UNALLOC if cluster is not mapped and @allocate is false.
> - *  VMDK_ERROR if failed.
> + *  VMDK_UNALLOC if cluster is not mapped.
> + *  VMDK_ERROR if failed
>   */
>  static int vmdk_get_cluster_offset(BlockDriverState *bs,
> VmdkExtent *extent,
> -   VmdkMetaData *m_data,
> uint64_t offset,
> -   bool allocate,
> -   uint64_t *cluster_offset,
> -   uint64_t skip_start_bytes,
> -   uint64_t skip_end_bytes)
> +   uint64_t *cluster_offset)
>  {
>  int l1_index, l2_offset, l2_index;
>  uint32_t *l2_table;
> @@ -1528,31 +1519,9 @@ static int vmdk_get_cluster_offset(BlockDriverState *bs,
>  }
>  
>  if (!cluster_sector || zeroed) {
> -if (!allocate) {
> -return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
> -}
> -
> -cluster_sector = extent->next_cluster_sector;
> -extent->next_cluster_sector += extent->cluster_sectors;
> -
> -/* First of all we write grain itself, to avoid race condition
> - * that may to corrupt the image.
> - * This problem may occur because of insufficient space on host disk
> - * or inappropriate VM shutdown.
> - */
> -ret = vmdk_perform_cow(bs, extent, cluster_sector * BDRV_SECTOR_SIZE,
> -offset, skip_start_bytes, skip_end_bytes);
> -if (ret) {
> -return ret;
> -}
> -if (m_data) {
> -m_data->valid = 1;
> -m_data->l1_index = l1_index;
> -m_data->l2_index = l2_index;
> -m_data->l2_offset = l2_offset;
> -m_data->l2_cache_entry = &l2_table[l2_index];
> -}
> +return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
>  }
> +
>  *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
>  return VMDK_OK;
>  }
> @@ -1595,9 +1564,7 @@ static int64_t coroutine_fn vmdk_co_get_block_status(BlockDriverState *bs,
>  return 0;
>  }
>  qemu_co_mutex_lock(&s->lock);
> -ret = vmdk_get_cluster_offset(bs, extent, NULL,
> -  sector_num * 512, false, &offset,
> -  0, 0);
> +ret = vmdk_get_cluster_offset(bs, extent, sector_num * 512, &offset);
>  qemu_co_mutex_unlock(&s->lock);
>  
>  index_in_cluster = vmdk_find_index_in_cluster(extent, sector_num);
> @@ -1788,13 +1755,14 @@ vmdk_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
>  ret = -EIO;
>  goto fail;
>  }
> -ret = vmdk_get_cluster_offset(bs, extent, NULL,
> -  offset, false, &cluster_offset, 0, 0);
> +
>  offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
>  
>  n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
>   - offset_in_cluster);
>  
> +ret = vmdk_get_cluster_offset(bs, extent, offset, &cluster_offset);
> +
>  if (ret != VMDK_OK) {
>  /* if not allocated, try to read from parent image, if exist */
>  if (bs->backing && ret != VMDK_ZEROED) {
> @@ -2541,9 +2509,9 @@ static int vmdk_check(BlockDriverState *bs, BdrvCheckResult *result,
>  sector_num);
>  break;
>  }
> -ret = vmdk_get_cluster_offset(bs, extent, NULL,
> +ret 

Re: [Qemu-block] [PATCH v6 7/8] vmdk: Update metadata for multiple clusters

2017-06-27 Thread Fam Zheng
On Mon, 06/05 13:22, Ashijeet Acharya wrote:
> Include a next pointer in VmdkMetaData struct to point to the previous
> allocated L2 table. Modify vmdk_L2update to start updating metadata for
> allocation of multiple clusters at once.
> 
> Signed-off-by: Ashijeet Acharya 
> ---
>  block/vmdk.c | 128 ++-
>  1 file changed, 101 insertions(+), 27 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index b671dc9..9fa2414 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -137,6 +137,8 @@ typedef struct VmdkMetaData {
>  int valid;
>  uint32_t *l2_cache_entry;
>  uint32_t nb_clusters;
> +uint32_t offset;
> +struct VmdkMetaData *next;
>  } VmdkMetaData;
>  
>  typedef struct VmdkGrainMarker {
> @@ -1116,34 +1118,87 @@ exit:
>  return ret;
>  }
>  
> -static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data,
> - uint32_t offset)
> +static int vmdk_alloc_cluster_link_l2(VmdkExtent *extent,
> +  VmdkMetaData *m_data, bool zeroed)
>  {
> -offset = cpu_to_le32(offset);
> +int i;
> +uint32_t offset, temp_offset;
> +int *l2_table_array;
> +int l2_array_size;
> +
> +if (zeroed) {
> +temp_offset = VMDK_GTE_ZEROED;
> +} else {
> +temp_offset = m_data->offset;
> +}
> +
> +l2_array_size = sizeof(uint32_t) * m_data->nb_clusters;
> +l2_table_array = qemu_try_blockalign(extent->file->bs,
> + QEMU_ALIGN_UP(l2_array_size,
> +   BDRV_SECTOR_SIZE));
> +if (l2_table_array == NULL) {
> +return VMDK_ERROR;
> +}
> +memset(l2_table_array, 0, QEMU_ALIGN_UP(l2_array_size, BDRV_SECTOR_SIZE));
>  /* update L2 table */
> +offset = temp_offset;
> +for (i = 0; i < m_data->nb_clusters; i++) {
> +l2_table_array[i] = cpu_to_le32(offset);
> +if (!zeroed) {
> +offset += 128;

s/128/extent->cluster_sectors/?
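
In other words, something along these lines (just a sketch; 128 is only
correct for the default 64 KiB grain with 512-byte sectors):

    /* Advance by the extent's actual grain size in sectors rather than
     * hard-coding 128. */
    offset = temp_offset;
    for (i = 0; i < m_data->nb_clusters; i++) {
        l2_table_array[i] = cpu_to_le32(offset);
        if (!zeroed) {
            offset += extent->cluster_sectors;
        }
    }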

> +}
> +}
>  if (bdrv_pwrite_sync(extent->file,
> -((int64_t)m_data->l2_offset * 512)
> -+ (m_data->l2_index * sizeof(offset)),
> -&offset, sizeof(offset)) < 0) {
> + ((int64_t)m_data->l2_offset * 512)
> + + ((m_data->l2_index) * sizeof(offset)),
> + l2_table_array, l2_array_size) < 0) {
>  return VMDK_ERROR;
>  }
>  /* update backup L2 table */
>  if (extent->l1_backup_table_offset != 0) {
>  m_data->l2_offset = extent->l1_backup_table[m_data->l1_index];
>  if (bdrv_pwrite_sync(extent->file,
> -((int64_t)m_data->l2_offset * 512)
> -+ (m_data->l2_index * sizeof(offset)),
> -&offset, sizeof(offset)) < 0) {
> + ((int64_t)m_data->l2_offset * 512)
> + + ((m_data->l2_index) * sizeof(offset)),
> + l2_table_array, l2_array_size) < 0) {
>  return VMDK_ERROR;
>  }
>  }
> +
> +offset = temp_offset;
>  if (m_data->l2_cache_entry) {
> -*m_data->l2_cache_entry = offset;
> +for (i = 0; i < m_data->nb_clusters; i++) {
> +*m_data->l2_cache_entry = cpu_to_le32(offset);
> +m_data->l2_cache_entry++;
> +
> +if (!zeroed) {
> +offset += 128;

Ditto.

> +}
> +}
>  }
>  
> +qemu_vfree(l2_table_array);
>  return VMDK_OK;
>  }
>  
> +static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data,
> + bool zeroed)
> +{
> +int ret;
> +
> +while (m_data->next != NULL) {
> +
> +ret = vmdk_alloc_cluster_link_l2(extent, m_data, zeroed);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +m_data = m_data->next;
> + }
> +
> + return VMDK_OK;
> +}
> +
>  /*
>   * vmdk_l2load
>   *
> @@ -1261,9 +1316,10 @@ static int get_cluster_table(VmdkExtent *extent, uint64_t offset,
>   *
>   *   VMDK_ERROR:    in error cases
>   */
> +
>  static int vmdk_handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
>   uint64_t offset, uint64_t *cluster_offset,
> - int64_t *bytes, VmdkMetaData *m_data,
> + int64_t *bytes, VmdkMetaData **m_data,
>   bool allocate, uint32_t *alloc_clusters_counter)
>  {
>  int l1_index, l2_offset, l2_index;
> @@ -1272,6 +1328,7 @@ static int vmdk_handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
>  uint32_t nb_clusters;
>  bool zeroed = false;
>  uint64_t skip_start_bytes, skip_end_bytes;
> +VmdkMetaData *old_m_data;
>  int ret;
>  
>  ret = get_cluster_table(extent, offset, &l1_index, &l2_offset,
> @@ -1323,13 

Re: [Qemu-block] [PATCH v6 7/8] vmdk: Update metadata for multiple clusters

2017-06-27 Thread Fam Zheng
On Mon, 06/05 13:22, Ashijeet Acharya wrote:
> @@ -1876,6 +1942,13 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
>  offset += n_bytes;
>  bytes_done += n_bytes;
>  
> +while (m_data->next != NULL) {

If you do

   while (m_data) {

> +VmdkMetaData *next;
> +next = m_data->next;
> +g_free(m_data);
> +m_data = next;
> +}
> +
>  /* update CID on the first write every time the virtual disk is
>   * opened */
>  if (!s->cid_updated) {
> @@ -1886,6 +1959,7 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
>  s->cid_updated = true;
>  }
>  }
> +g_free(m_data);

then you can remove this line.
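
I.e. roughly (sketch only):

    /* Walk and free the whole metadata list, last node included. */
    while (m_data) {
        VmdkMetaData *next = m_data->next;
        g_free(m_data);
        m_data = next;
    }

With that shape the loop leaves m_data == NULL, so the trailing
g_free(m_data) below only ever frees NULL and can be dropped.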

>  return 0;
>  }
>  
> -- 
> 2.6.2
> 



Re: [Qemu-block] [PATCH v6 6/8] vmdk: New functions to assist allocating multiple clusters

2017-06-27 Thread Fam Zheng
On Mon, 06/05 13:22, Ashijeet Acharya wrote:
> +/*
> + * vmdk_handle_alloc
> + *
> + * Allocate new clusters for an area that either is yet unallocated or needs a
> + * copy on write. If *cluster_offset is non_zero, clusters are only allocated if
> + * the new allocation can match the specified host offset.

I don't think this matches the function body: the passed-in *cluster_offset
value is ignored.
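
Maybe restate it along these lines (just a suggestion for the wording):

    /*
     * Allocate new clusters for an area that is either not yet allocated
     * or needs a copy on write.  On success, *cluster_offset is set to
     * the host offset of the first allocated cluster; its value on entry
     * is ignored.
     */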

> + *
> + * Returns:
> + *   VMDK_OK:   if new clusters were allocated, *bytes may be decreased if
> + *  the new allocation doesn't cover all of the requested area.
> + *  *cluster_offset is updated to contain the offset of the
> + *  first newly allocated cluster.
> + *
> + *   VMDK_UNALLOC:  if no clusters could be allocated. *cluster_offset is left
> + *  unchanged.
> + *
> + *   VMDK_ERROR:    in error cases
> + */
> +static int vmdk_handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
> + uint64_t offset, uint64_t *cluster_offset,
> + int64_t *bytes, VmdkMetaData *m_data,
> + bool allocate, uint32_t *alloc_clusters_counter)
> +{
> +int l1_index, l2_offset, l2_index;
> +uint32_t *l2_table;
> +uint32_t cluster_sector;
> +uint32_t nb_clusters;
> +bool zeroed = false;
> +uint64_t skip_start_bytes, skip_end_bytes;
> +int ret;
> +
> +ret = get_cluster_table(extent, offset, &l1_index, &l2_offset,
> +&l2_index, &l2_table);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +cluster_sector = le32_to_cpu(l2_table[l2_index]);
> +
> +skip_start_bytes = vmdk_find_offset_in_cluster(extent, offset);
> +/* Calculate the number of clusters to look for. Here we truncate the last
> + * cluster, i.e. 1 less than the actual value calculated as we may need to
> + * perform COW for the last one. */
> +nb_clusters = DIV_ROUND_UP(skip_start_bytes + *bytes,
> +   extent->cluster_sectors << BDRV_SECTOR_BITS) - 1;
> +
> +nb_clusters = MIN(nb_clusters, extent->l2_size - l2_index);
> +assert(nb_clusters <= INT_MAX);
> +
> +/* update bytes according to final nb_clusters value */
> +if (nb_clusters != 0) {
> +*bytes = ((nb_clusters * extent->cluster_sectors) << BDRV_SECTOR_BITS)
> + - skip_start_bytes;
> +} else {
> +nb_clusters = 1;
> +}
> +*alloc_clusters_counter += nb_clusters;
> +skip_end_bytes = skip_start_bytes + MIN(*bytes,
> + extent->cluster_sectors * BDRV_SECTOR_SIZE
> +- skip_start_bytes);

I don't understand the MIN part: shouldn't skip_end_bytes simply be
skip_start_bytes + *bytes?
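
To put numbers on it (hypothetical values, assuming the default 64 KiB
cluster, i.e. extent->cluster_sectors * BDRV_SECTOR_SIZE == 65536):

    skip_start_bytes = 4096, *bytes = 200000 (spanning several clusters):
    skip_end_bytes = 4096 + MIN(200000, 65536 - 4096)
                   = 4096 + 61440
                   = 65536   /* end of the first cluster */

So the MIN clamps the skipped range to the end of the first cluster rather
than to skip_start_bytes + *bytes (204096 here). If that is intentional, a
comment explaining why only the first cluster is skipped would help.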

> +
> +if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> +zeroed = true;
> +}
> +
> +if (!cluster_sector || zeroed) {
> +if (!allocate) {
> +return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
> +}
> +
> +cluster_sector = extent->next_cluster_sector;
> +extent->next_cluster_sector += extent->cluster_sectors
> +* nb_clusters;
> +
> +ret = vmdk_perform_cow(bs, extent, cluster_sector * BDRV_SECTOR_SIZE,
> +   offset, skip_start_bytes,
> +   skip_end_bytes);
> +if (ret < 0) {
> +return ret;
> +}
> +if (m_data) {
> +m_data->valid = 1;
> +m_data->l1_index = l1_index;
> +m_data->l2_index = l2_index;
> +m_data->l2_offset = l2_offset;
> +m_data->l2_cache_entry = &l2_table[l2_index];
> +m_data->nb_clusters = nb_clusters;
> +}
> +}
> +*cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
> +return VMDK_OK;
> +}
> +
> +/*
> + * vmdk_alloc_clusters
> + *
> + * For a given offset on the virtual disk, find the cluster offset in vmdk
> + * file. If the offset is not found, allocate a new cluster.
> + *
> + * If the cluster is newly allocated, m_data->nb_clusters is set to the 
> number
> + * of contiguous clusters that have been allocated. In this case, the other
> + * fields of m_data are valid and contain information about the first 
> allocated
> + * cluster.
> + *
> + * Returns:
> + *
> + *   VMDK_OK:   on success and @cluster_offset was set
> + *
> + *   VMDK_UNALLOC:  if no clusters were allocated and @cluster_offset is
> + *  set to zero
> + *
> + *   VMDK_ERROR:in error cases
> + */
> +static int vmdk_alloc_clusters(BlockDriverState *bs,
> +   VmdkExtent *extent,
> +   VmdkMetaData *m_data, uint64_t offset,
> +   bool allocate, uint64_t *cluster_offset,
> +   int64_t bytes,
> +