[Qemu-block] [PATCH 2/2] ide: generic ide_data_write

2017-09-20 Thread John Snow
Signed-off-by: John Snow 
---
 hw/ide/core.c | 86 +--
 hw/ide/trace-events   |  3 +-
 include/hw/ide/internal.h |  1 +
 3 files changed, 40 insertions(+), 50 deletions(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 393f523..af49de5 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -2266,6 +2266,42 @@ static bool ide_is_pio_out(IDEState *s)
 abort();
 }
 
+void ide_data_write(void *opaque, uint32_t addr, short nbytes, uint32_t val)
+{
+    IDEBus *bus = opaque;
+    IDEState *s = idebus_active_if(bus);
+    uint8_t *p;
+
+    trace_ide_data_write(addr, nbytes, val, bus, s);
+
+    /* PIO data access allowed only when DRQ bit is set. The result of a write
+     * during PIO out is indeterminate, just ignore it. */
+    if (!(s->status & DRQ_STAT) || ide_is_pio_out(s)) {
+        return;
+    }
+
+    if (nbytes != 2 && nbytes != 4) {
+        return;
+    }
+
+    p = s->data_ptr;
+    if (p + nbytes > s->data_end) {
+        return;
+    }
+
+    if (nbytes == 2) {
+        *(uint16_t *)p = le16_to_cpu(val);
+    } else if (nbytes == 4) {
+        *(uint32_t *)p = le32_to_cpu(val);
+    }
+    p += nbytes;
+    s->data_ptr = p;
+    if (p >= s->data_end) {
+        s->status &= ~DRQ_STAT;
+        s->end_transfer_func(s);
+    }
+}
+
 uint32_t ide_data_read(void *opaque, uint32_t addr, short nbytes)
 {
 IDEBus *bus = opaque;
@@ -2311,30 +2347,7 @@ uint32_t ide_data_read(void *opaque, uint32_t addr, short nbytes)
 
 void ide_data_writew(void *opaque, uint32_t addr, uint32_t val)
 {
-    IDEBus *bus = opaque;
-    IDEState *s = idebus_active_if(bus);
-    uint8_t *p;
-
-    trace_ide_data_writew(addr, val, bus, s);
-
-    /* PIO data access allowed only when DRQ bit is set. The result of a write
-     * during PIO out is indeterminate, just ignore it. */
-    if (!(s->status & DRQ_STAT) || ide_is_pio_out(s)) {
-        return;
-    }
-
-    p = s->data_ptr;
-    if (p + 2 > s->data_end) {
-        return;
-    }
-
-    *(uint16_t *)p = le16_to_cpu(val);
-    p += 2;
-    s->data_ptr = p;
-    if (p >= s->data_end) {
-        s->status &= ~DRQ_STAT;
-        s->end_transfer_func(s);
-    }
+    return ide_data_write(opaque, addr, 2, val);
 }
 
 uint32_t ide_data_readw(void *opaque, uint32_t addr)
@@ -2344,30 +2357,7 @@ uint32_t ide_data_readw(void *opaque, uint32_t addr)
 
 void ide_data_writel(void *opaque, uint32_t addr, uint32_t val)
 {
-    IDEBus *bus = opaque;
-    IDEState *s = idebus_active_if(bus);
-    uint8_t *p;
-
-    trace_ide_data_writel(addr, val, bus, s);
-
-    /* PIO data access allowed only when DRQ bit is set. The result of a write
-     * during PIO out is indeterminate, just ignore it. */
-    if (!(s->status & DRQ_STAT) || ide_is_pio_out(s)) {
-        return;
-    }
-
-    p = s->data_ptr;
-    if (p + 4 > s->data_end) {
-        return;
-    }
-
-    *(uint32_t *)p = le32_to_cpu(val);
-    p += 4;
-    s->data_ptr = p;
-    if (p >= s->data_end) {
-        s->status &= ~DRQ_STAT;
-        s->end_transfer_func(s);
-    }
+    return ide_data_write(opaque, addr, 4, val);
 }
 
 uint32_t ide_data_readl(void *opaque, uint32_t addr)
diff --git a/hw/ide/trace-events b/hw/ide/trace-events
index e42c428..e92c0bb 100644
--- a/hw/ide/trace-events
+++ b/hw/ide/trace-events
@@ -7,9 +7,8 @@ ide_ioport_write(uint32_t addr, const char *reg, uint32_t val, void *bus, void *
 ide_status_read(uint32_t addr, uint32_t val, void *bus, void *s)   "IDE PIO rd @ 0x%"PRIx32" (Alt Status); val 0x%02"PRIx32"; bus %p; IDEState %p"
 ide_cmd_write(uint32_t addr, uint32_t val, void *bus)  "IDE PIO wr @ 0x%"PRIx32" (Device Control); val 0x%02"PRIx32"; bus %p"
 # Warning: verbose
-ide_data_writew(uint32_t addr, uint32_t val, void *bus, void *s)   "IDE PIO wr @ 0x%"PRIx32" (Data: Word); val 0x%04"PRIx32"; bus %p; IDEState %p"
-ide_data_writel(uint32_t addr, uint32_t val, void *bus, void *s)   "IDE PIO wr @ 0x%"PRIx32" (Data: Long); val 0x%08"PRIx32"; bus %p; IDEState %p"
 ide_data_read(uint32_t addr, short nbytes, uint32_t val, void *bus, void *s)   "IDE PIO rd @ 0x%"PRIx32" (Data: %d bytes); val 0x%08"PRIx32"; bus %p; IDEState %p"
+ide_data_write(uint32_t addr, short nbytes, uint32_t val, void *bus, void *s)  "IDE PIO wr @ 0x%"PRIx32" (Data: %d bytes); val 0x%08"PRIx32"; bus %p; IDEState %p"
 
 # misc
 ide_exec_cmd(void *bus, void *state, uint32_t cmd) "IDE exec cmd: bus %p; state %p; cmd 0x%02x"
diff --git a/include/hw/ide/internal.h b/include/hw/ide/internal.h
index 3159c66..deb592d 100644
--- a/include/hw/ide/internal.h
+++ b/include/hw/ide/internal.h
@@ -598,6 +598,7 @@ void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val);
 uint32_t ide_ioport_read(void *opaque, uint32_t addr1);
 uint32_t ide_status_read(void *opaque, uint32_t addr);
 void ide_cmd_write(void *opaque, uint32_t addr, uint32_t val);
+void ide_data_write(void *opaque, uint32_t addr, short nbytes, uint32_t val);

[Qemu-block] [PATCH 0/2] IDE: combine portio r/w functions

2017-09-20 Thread John Snow
Mark, here's a quick sketch for you. There are two things I don't like,
but didn't care enough to fix:

(1) Restricting nbytes to 2 or 4 means some extra boilerplate
to quiet compilers that don't know it will only ever be 2 or 4
(see the sketch below)
(2) the address value is all but ignored; it carries over from the
portio signature and is useful primarily for tracing, but it's
a little ugly/deceiving to take a parameter and not use it.
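
For (1), the sort of boilerplate I mean would look roughly like this
(untested sketch only, not part of the patch; store_le() is a made-up
helper name):

    /* Dispatch on nbytes with an explicit default so the compiler can
     * see that no other width ever reaches the store. */
    static void store_le(uint8_t *p, short nbytes, uint32_t val)
    {
        switch (nbytes) {
        case 2:
            *(uint16_t *)p = le16_to_cpu(val);
            break;
        case 4:
            *(uint32_t *)p = le32_to_cpu(val);
            break;
        default:
            abort(); /* ide_data_write() only ever passes 2 or 4 */
        }
    }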

Suggested-by: Mark Cave-Ayland 

John Snow (2):
  ide: generic ide_data_read
  ide: generic ide_data_write

 hw/ide/core.c | 185 +-
 hw/ide/trace-events   |   7 +-
 include/hw/ide/internal.h |   4 +-
 3 files changed, 89 insertions(+), 107 deletions(-)

-- 
2.9.5




Re: [Qemu-block] [PATCH 2/2] qcow2: truncate the tail of the image file after shrinking the image

2017-09-20 Thread John Snow


On 09/20/2017 09:58 AM, Pavel Butsykin wrote:
> Now after shrinking the image, at the end of the image file, there might be a
> tail that probably will never be used. So we can find the last used cluster
> and cut the tail.
> 
> Signed-off-by: Pavel Butsykin 
> ---
>  block/qcow2-refcount.c | 21 +
>  block/qcow2.c  | 19 +++
>  block/qcow2.h  |  1 +
>  3 files changed, 41 insertions(+)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 88d5a3f1ad..5e221a166c 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -3181,3 +3181,24 @@ out:
>  g_free(reftable_tmp);
>  return ret;
>  }
> +
> +int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int64_t i, last_cluster, nb_clusters = size_to_clusters(s, size);
> +    uint64_t refcount;
> +
> +    for (i = 0, last_cluster = 0; i < nb_clusters; i++) {
> +        int ret = qcow2_get_refcount(bs, i, &refcount);
> +        if (ret < 0) {
> +            fprintf(stderr, "Can't get refcount for cluster %" PRId64 ": %s\n",
> +                    i, strerror(-ret));
> +            continue;
> +        }
> +
> +        if (refcount > 0) {
> +            last_cluster = i;
> +        }
> +    }
> +    return last_cluster;
> +}
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 8a4311d338..c3b6dd44c4 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -3106,6 +3106,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
>      new_l1_size = size_to_l1(s, offset);
>  
>      if (offset < old_length) {
> +        int64_t image_end_offset, old_file_size;
>          if (prealloc != PREALLOC_MODE_OFF) {
>              error_setg(errp,
>                         "Preallocation can't be used for shrinking an image");
> @@ -3134,6 +3135,24 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
>                               "Failed to discard unused refblocks");
>              return ret;
>          }
> +
> +        old_file_size = bdrv_getlength(bs->file->bs);
> +        if (old_file_size < 0) {
> +            error_setg_errno(errp, -old_file_size,
> +                             "Failed to inquire current file length");
> +            return old_file_size;
> +        }
> +        image_end_offset = (qcow2_get_last_cluster(bs, old_file_size) + 1) *
> +                           s->cluster_size;
> +        if (image_end_offset < old_file_size) {
> +            ret = bdrv_truncate(bs->file, image_end_offset,
> +                                PREALLOC_MODE_OFF, NULL);
> +            if (ret < 0) {
> +                error_setg_errno(errp, -ret,
> +                                 "Failed to truncate the tail of the image");

I've recently become skeptical of what partial resize successes look
like, but that's an issue for another day entirely.

> +                return ret;
> +            }
> +        }
>      } else {
>          ret = qcow2_grow_l1_table(bs, new_l1_size, true);
>          if (ret < 0) {
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 5a289a81e2..782a206ecb 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -597,6 +597,7 @@ int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
>                                  BlockDriverAmendStatusCB *status_cb,
>                                  void *cb_opaque, Error **errp);
>  int qcow2_shrink_reftable(BlockDriverState *bs);
> +int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>  
>  /* qcow2-cluster.c functions */
>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
> 

Reviewed-by: John Snow 

Looks sane to me, but under which circumstances might we grow such a
tail? I assume the actual truncate call aligns to cluster boundaries as
appropriate, so is this a bit of a "quick fix" to cull unused clusters
that happened to be near the truncate boundary?

It might be worth documenting the circumstances that produce this
unused space that will never get used. My hunch is that such unused
space should likely be getting reclaimed elsewhere and not here, but
perhaps I'm misunderstanding the causal factors.

--js



Re: [Qemu-block] [PATCH 1/2] qcow2: fix return error code in qcow2_truncate()

2017-09-20 Thread John Snow


On 09/20/2017 09:58 AM, Pavel Butsykin wrote:
> Signed-off-by: Pavel Butsykin 
> ---
>  block/qcow2.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 2174a84d1f..8a4311d338 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -3166,7 +3166,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
>      if (old_file_size < 0) {
>          error_setg_errno(errp, -old_file_size,
>                           "Failed to inquire current file length");
> -        return ret;
> +        return old_file_size;
>      }
>  
>      nb_new_data_clusters = DIV_ROUND_UP(offset - old_length,
> @@ -3195,7 +3195,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
>      if (allocation_start < 0) {
>          error_setg_errno(errp, -allocation_start,
>                           "Failed to resize refcount structures");
> -        return -allocation_start;
> +        return allocation_start;
>      }
>  
>      clusters_allocated = qcow2_alloc_clusters_at(bs, allocation_start,
> 

Yikes...

Reviewed-by: John Snow 



Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] block/block-backend.c: remove blk_pread_unthrottled()

2017-09-20 Thread John Snow


On 09/20/2017 07:43 AM, Manos Pitsidianakis wrote:
> blk_pread_unthrottled was used to bypass I/O throttling on the BlockBackend in
> the case of async I/O. This is not needed anymore and we can just call
> blk_pread() directly.
> 
> Signed-off-by: Manos Pitsidianakis 
> ---
>  include/sysemu/block-backend.h |  2 --
>  block/block-backend.c  | 16 
>  hw/block/hd-geometry.c |  7 +--
>  3 files changed, 1 insertion(+), 24 deletions(-)
> 


Acked-by: John Snow 



Re: [Qemu-block] [Qemu-devel] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate

2017-09-20 Thread John Snow


On 09/20/2017 09:11 AM, Eric Blake wrote:
> On 09/19/2017 09:10 PM, Fam Zheng wrote:
> 
>>>
>>> Do you suspect that almost certainly if bdrv_truncate() fails overall
>>> that the image format driver will either unmount the image or become
>>> read-only?
> 
> Uggh - it feels like I've bitten off more than I can chew with this
> patch - I'm getting bogged down by trying to fix bad behavior in code
> that is mostly unrelated to the patch at hand, so I don't have a good
> opinion on WHAT is supposed to happen if bdrv_truncate() fails, only
> that I'm trying to avoid compounding that failure even worse.
> 

Yes, I apologize -- I realize I'm holding this series hostage. For now I
am just trying to legitimately understand the behavior. I am willing to
accept "It's sorta busted right now, but -EOUTOFSCOPE"

>>> I suppose if *not* that's a bug for callers of bdrv_truncate to allow
>>> that kind of monkey business, but if it CAN happen, hbitmap only guards
>>> against such things with an assert (which, IIRC, is not guaranteed to be
>>> on for all builds)
>>
>> It's guaranteed since a few hours ago:
>>
>> commit 262a69f4282e44426c7a132138581d400053e0a1
> 
> Indeed - but even without my patch, we would have hit the assertion
> failures when trying to resize the dirty bitmap to -1 when
> bdrv_nb_sectors() fails (which was likely if refresh_total_sectors()
> failed).
> 
>>> So the question is: "bdrv_truncate failure is NOT considered recoverable
>>> in ANY case, is it?"
>>>
>>> It may possibly be safer to, if the initial truncate request succeeds,
>>> apply a best-effort to the bitmap before returning the error.
>>
>> Like fallback "offset" (or it aligned up to bs cluster size) if
>> refresh_total_sectors() returns error? I think that is okay.
> 
> Here's my proposal for squashing in a best-effort dirty-bitmap resize no
> matter what happens in refresh_total_sectors() (but really, if you
> successfully truncate the disk but then get a failure while trying to
> read back the actual new size, which may differ from the requested size,
> you're probably doomed down the road anyways).
> 
> diff --git i/block.c w/block.c
> index 3caf6bb093..ef5af81f66 100644
> --- i/block.c
> +++ w/block.c
> @@ -3552,8 +3552,9 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "Could not refresh total sector count");
>      } else {
> -        bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE);
> +        offset = bs->total_sectors * BDRV_SECTOR_SIZE;
>      }
> +    bdrv_dirty_bitmap_truncate(bs, offset);
>      bdrv_parent_cb_resize(bs);
>      atomic_inc(&bs->write_gen);
>      return ret;
> 
> 

Don't respin on my accord, I'm trying to find out if there is a problem;
I'm not convinced of one yet. Just thinking out loud.

Two cases:

(1) Attempt to resize larger. Resize succeeds, but refresh fails.
Possibly a temporary protocol failure, but we'll assume the resize
actually worked. Bitmap does not get resized, however any caller of
truncate *must* assume that the resize did not succeed. Any calls to
write beyond previous EOF are a bug by the calling module.

(2) Attempt to resize smaller, an actual truncate. Call succeeds but
refresh doesn't. Bitmap is now larger than the drive. The bitmap itself
is perfectly capable of describing reads/writes even to the now-OOB
area, but it's unlikely the BB would submit any. Problems may arise if
the BB does not treat this as a hard failure and a user later attempts
to use this bitmap for a backup operation, as the trailing bits now
reference disk segments that may or may not physically exist. Likely to
hit EIO problems during block jobs.


If we do decide to resize the bitmap even on refresh failure, we
probably still run the risk of the bitmap being slightly bigger or
slightly smaller than the actual size due to alignment.

It sounds like the resize operation itself needs to be able to return to
the caller the actual size of the operation instead of forcing the
caller to query separately in a follow-up call to really "fix" this.
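
Just to illustrate, a hypothetical shape for that (bdrv_truncate_report()
is made up, it doesn't exist in the tree; untested sketch):

    /* Hypothetical: truncate and hand back the size the node actually
     * ended up with, so callers don't need a separate query that can
     * itself fail. */
    int bdrv_truncate_report(BdrvChild *child, int64_t offset,
                             PreallocMode prealloc, int64_t *realized,
                             Error **errp)
    {
        int ret = bdrv_truncate(child, offset, prealloc, errp);

        /* Best effort: report whatever length the node has now. */
        *realized = bdrv_getlength(child->bs);
        return ret;
    }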

Considering that either resizing or not resizing the bitmap after a
partial failure probably still leaves us with a possibly dangerous
bitmap, I don't think I'll hold you to the flames over this one.

--js




Re: [Qemu-block] Block Migration and CPU throttling

2017-09-20 Thread Peter Lieven
On 19.09.2017 at 16:41, Paolo Bonzini wrote:
> On 19/09/2017 15:36, Peter Lieven wrote:
>> Hi,
>>
>> I just noticed that CPU throttling and Block Migration don't work
>> together very well.
>> During block migration the throttling heuristic detects that we
>> obviously make no progress in ram transfer. But the reason is the
>> running block migration and not a too high dirty pages rate.
>>
>> The result is that any VM is throttled by 99% during block migration.
>>
>> I wonder what the best way to fix this would be. I came up with the
>> following ideas so far:
>>
>> - disable throttling while block migration is in bulk stage
>> - check if the absolute number of num_dirty_pages_period crosses a
>>   threshold, and not just whether it is greater than 50% of
>>   transferred bytes
>> - check if migration_dirty_pages > 0. This slows down throttling, but
>>   does not avoid it completely.
> If you can use nbd-server and drive-mirror for block migration (libvirt
> would do it), then you will use multiple sockets and be able to migrate
> block and RAM at the same time.
>
> Otherwise, disabling throttling during the bulk stage is the one that
> seems nicest and most promising.

Okay, but this can be done independently of the nbd approach.
If someone uses classic block migration and auto converge his
vserver will freeze.

I will send a patch to fix that.

Thanks,
Peter




Re: [Qemu-block] Block Migration and CPU throttling

2017-09-20 Thread Peter Lieven
On 19.09.2017 at 16:41, Dr. David Alan Gilbert wrote:
> * Peter Lieven (p...@kamp.de) wrote:
>> On 19.09.2017 at 16:38, Dr. David Alan Gilbert wrote:
>>> * Peter Lieven (p...@kamp.de) wrote:
 Hi,

 I just noticed that CPU throttling and Block Migration don't work together
 very well. During block migration the throttling heuristic detects that we
 obviously make no progress in ram transfer. But the reason is the running
 block migration and not a too high dirty pages rate.

 The result is that any VM is throttled by 99% during block migration.
>>> Hmm that's unfortunate; do you have a bandwidth set lower than your
>>> actual network connection? I'm just wondering if it's actually going
>>> between the block and RAM iterative sections or getting stuck in one.
>> It happens also if source and dest are on the same machine and speed is set 
>> to 100G.
> But does it happen if they're not and the speed is set low?

Yes, it does. I noticed it in our test environment between different nodes
with a 10G link in between. But it's totally clear why it happens. During
block migration we transfer all dirty memory pages in each round (if there
is moderate memory load), but all dirty pages are obviously more than 50%
of the transferred RAM in that round. In fact it is exactly 100%, and the
current logic triggers on this condition.

I think I will go forward and send a patch which disables auto converge during
block migration bulk stage.
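
For reference, the heuristic I mean is the auto-converge check in
migration_bitmap_sync(); the idea would be roughly this (sketch from
memory, not the exact upstream code, and blk_mig_bulk_active() is the
helper I have in mind to add):

    /* Skip the dirty-rate check while block migration is still bulk
     * copying, so block traffic isn't mistaken for a high dirty rate. */
    if (migrate_auto_converge() && !blk_mig_bulk_active()) {
        /* dirty pages outpacing transferred RAM => throttle the guest */
        if (rs->num_dirty_pages_period * TARGET_PAGE_SIZE >
            (bytes_xfer_now - rs->bytes_xfer_prev) / 2) {
            mig_throttle_guest_down();
        }
    }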

Thanks for your feedback,
Peter




Re: [Qemu-block] [PATCH 1/3] block: add bdrv_co_drain_end callback

2017-09-20 Thread Manos Pitsidianakis

On Wed, Sep 20, 2017 at 03:26:32PM +0100, Stefan Hajnoczi wrote:

On Wed, Sep 20, 2017 at 01:23:09PM +0300, Manos Pitsidianakis wrote:

@@ -188,7 +194,7 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
     waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
 
     /* Ensure any pending metadata writes are submitted to bs->file.  */
-    bdrv_drain_invoke(bs);
+    bdrv_drain_invoke(bs, begin);


Do you need to move bdrv_drain_invoke(bs, begin) before
BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0)?

This will ensure that throttling is disabled and the TGM restarted
before we wait for requests to complete.



Hm yes. Before, the order was irrelevant because BlockBackend issued the 
drain first by restarting the tgm in blk_root_drained_begin.
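
Something like this, I guess (sketch of the reordering, untested):

    /* In bdrv_drain_recurse(): run the driver's drain callback first so
     * throttling is disabled and queued requests are kicked, and only
     * then poll for the remaining in-flight requests. */
    bdrv_drain_invoke(bs, begin);
    waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);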




Re: [Qemu-block] [PATCH v3 0/3] nbd client refactoring and fixing

2017-09-20 Thread Eric Blake
On 09/20/2017 07:45 AM, Vladimir Sementsov-Ogievskiy wrote:
> v3: dropped all controversial things. I'll try to implement minimal
> structured reply and block status over this small residue.
> 
> Vladimir Sementsov-Ogievskiy (3):
>   block/nbd-client: refactor nbd_co_receive_reply
>   block/nbd-client: simplify check in nbd_co_receive_reply
>   block/nbd-client: nbd_co_send_request: fix return code

Thanks; queued on my NBD tree, will send pull request after a bit more
time to see if it collects further reviews:
git://repo.or.cz/qemu/ericb.git nbd

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v3 2/3] block/nbd-client: simplify check in nbd_co_receive_reply

2017-09-20 Thread Eric Blake
On 09/20/2017 07:45 AM, Vladimir Sementsov-Ogievskiy wrote:
> If we are woken up from the while() loop in nbd_read_reply_entry,
> the handles must be equal. If we are woken up from
> nbd_recv_coroutines_wake_all, s->quit must be true, so we do
> not need to check the handles for equality.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake 

> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index acd8e5d007..486bfff9f7 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -189,9 +189,10 @@ static int nbd_co_receive_reply(NBDClientSession *s,
>      s->requests[i].receiving = true;
>      qemu_coroutine_yield();
>      s->requests[i].receiving = false;
> -    if (s->reply.handle != request->handle || !s->ioc || s->quit) {
> +    if (!s->ioc || s->quit) {
>          ret = -EIO;
>      } else {
> +        assert(s->reply.handle == request->handle);
>          ret = -s->reply.error;
>          if (qiov && s->reply.error == 0) {
>              assert(request->len == iov_size(qiov->iov, qiov->niov));
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH 1/3] block: add bdrv_co_drain_end callback

2017-09-20 Thread Stefan Hajnoczi
On Wed, Sep 20, 2017 at 01:23:09PM +0300, Manos Pitsidianakis wrote:
> @@ -188,7 +194,7 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
>      waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
>  
>      /* Ensure any pending metadata writes are submitted to bs->file.  */
> -    bdrv_drain_invoke(bs);
> +    bdrv_drain_invoke(bs, begin);

Do you need to move bdrv_drain_invoke(bs, begin) before
BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0)?

This will ensure that throttling is disabled and the TGM restarted
before we wait for requests to complete.

Stefan



Re: [Qemu-block] [PATCH v3 3/3] block/nbd-client: nbd_co_send_request: fix return code

2017-09-20 Thread Eric Blake
On 09/20/2017 07:45 AM, Vladimir Sementsov-Ogievskiy wrote:
> It's incorrect to return a success rc >= 0 if we skip the
> qio_channel_writev_all() call due to s->quit.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Eric Blake 

> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 486bfff9f7..9d1e154feb 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -161,6 +161,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
>                                         NULL) < 0) {
>                  rc = -EIO;
>              }
> +        } else if (rc >= 0) {
> +            rc = -EIO;
>          }
>          qio_channel_set_cork(s->ioc, false);
>      } else {
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring

2017-09-20 Thread Stefan Hajnoczi
On Tue, Sep 19, 2017 at 10:57:50AM +0100, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 10:44:16AM +0100, Stefan Hajnoczi wrote:
> > On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote:
> > > On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> > > > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> > > >> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> > > >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> > >  This patch implements active synchronous mirroring.  In active mode, 
> > >  the
> > >  passive mechanism will still be in place and is used to copy all
> > >  initially dirty clusters off the source disk; but every write request
> > >  will write data both to the source and the target disk, so the source
> > >  cannot be dirtied faster than data is mirrored to the target.  Also,
> > >  once the block job has converged (BLOCK_JOB_READY sent), source and
> > >  target are guaranteed to stay in sync (unless an error occurs).
> > > 
> > >  Optionally, dirty data can be copied to the target disk on read
> > >  operations, too.
> > > 
> > >  Active mode is completely optional and currently disabled at 
> > >  runtime.  A
> > >  later patch will add a way for users to enable it.
> > > 
> > >  Signed-off-by: Max Reitz 
> > >  ---
> > >   qapi/block-core.json |  23 +++
> > >   block/mirror.c   | 187 
> > >  +--
> > >   2 files changed, 205 insertions(+), 5 deletions(-)
> > > 
> > >  diff --git a/qapi/block-core.json b/qapi/block-core.json
> > >  index bb11815608..e072cfa67c 100644
> > >  --- a/qapi/block-core.json
> > >  +++ b/qapi/block-core.json
> > >  @@ -938,6 +938,29 @@
> > > 'data': ['top', 'full', 'none', 'incremental'] }
> > >   
> > >   ##
> > >  +# @MirrorCopyMode:
> > >  +#
> > >  +# An enumeration whose values tell the mirror block job when to
> > >  +# trigger writes to the target.
> > >  +#
> > >  +# @passive: copy data in background only.
> > >  +#
> > >  +# @active-write: when data is written to the source, write it
> > >  +#(synchronously) to the target as well.  In 
> > >  addition,
> > >  +#data is copied in background just like in @passive
> > >  +#mode.
> > >  +#
> > >  +# @active-read-write: write data to the target (synchronously) both
> > >  +# when it is read from and written to the 
> > >  source.
> > >  +# In addition, data is copied in background just
> > >  +# like in @passive mode.
> > > >>>
> > > >>> I'm not sure the terms "active"/"passive" are helpful.  "Active 
> > > >>> commit"
> > > >>> means committing the top-most BDS while the guest is accessing it.  
> > > >>> The
> > > >>> "passive" mirror block still works on the top-most BDS while the guest
> > > >>> is accessing it.
> > > >>>
> > > >>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's 
> > > >>> also
> > > >>> the terminology used in disk replication (e.g. DRBD).
> > > >>
> > > >> I'd be OK with that, too, but I think I remember that in the past at
> > > >> least Kevin made a clear distinction between active/passive and
> > > >> sync/async when it comes to mirroring.
> > > >>
> > > >>> Ideally the user wouldn't have to worry about async vs sync because 
> > > >>> QEMU
> > > >>> would switch modes as appropriate in order to converge.  That way
> > > >>> libvirt also doesn't have to worry about this.
> > > >>
> > > >> So here you mean async/sync in the way I meant it, i.e., whether the
> > > >> mirror operations themselves are async/sync?
> > > > 
> > > > The meaning I had in mind is:
> > > > 
> > > > Sync mirroring means a guest write waits until the target write
> > > > completes.
> > > 
> > > I.e. active-sync, ...
> > > 
> > > > Async mirroring means guest writes completes independently of target
> > > > writes.
> > > 
> > > ... i.e. passive or active-async in the future.
> > > 
> > > So you really want qemu to decide whether to use active or passive mode
> > > depending on what's enough to let the block job converge and not
> > > introduce any switch for the user?
> > > 
> > > I'm not sure whether I like this too much, mostly because "libvirt
> > > doesn't have to worry" doesn't feel quite right to me.  If we don't make
> > > libvirt worry about this, then qemu has to worry.  I'm not sure whether
> > > that's any better.
> > > 
> > > I think this really does get into policy territory.  Just switching to
> > > active mode the instant target writes are slower than source writes may
> > > not be what the user wants: Maybe it's OK for a short duration because
> > > they don't care about hard convergence too much.  Maybe they want to
> > > switch to active mode already when "only" twice as much 

Re: [Qemu-block] [PATCH 1/2] block/block-backend.c: add blk_check_byte_request call to blk_pread/blk_pwrite

2017-09-20 Thread Manos Pitsidianakis

On Wed, Sep 20, 2017 at 02:24:21PM +0200, Kevin Wolf wrote:

On 20.09.2017 at 13:43, Manos Pitsidianakis wrote:

blk_check_byte_request() is called from blk_co_pwritev/blk_co_preadv to
check if the request offset and request bytes parameters are valid for the
given BlockBackend. Let's do that in blk_pread/blk_pwrite too.

Signed-off-by: Manos Pitsidianakis 


I don't think this is necessary, blk_pread/pwrite() are only wrappers
around blk_co_preadv/pwritev(), so we're already checking the
parameters.

Kevin



Alright, this can be dropped then. Thanks!
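
For reference, the call chain Kevin describes, roughly (simplified from
block/block-backend.c, not verbatim):

    /* blk_pread() is a synchronous wrapper: blk_prw() runs a coroutine
     * through blk_read_entry() into blk_co_preadv(), and that is where
     * blk_check_byte_request() already validates offset/count. */
    int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
    {
        int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
        if (ret < 0) {
            return ret;
        }
        return count;
    }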




Re: [Qemu-block] [PATCH 1/2] qcow2: fix return error code in qcow2_truncate()

2017-09-20 Thread Eric Blake
On 09/20/2017 08:58 AM, Pavel Butsykin wrote:
> Signed-off-by: Pavel Butsykin 
> ---
>  block/qcow2.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Wow, bdrv_truncate() has had problems!  You found fixes unrelated to my
current struggles to fix the dirty-bitmap failure.

Reviewed-by: Eric Blake 
CC: qemu-sta...@nongnu.org

> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 2174a84d1f..8a4311d338 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -3166,7 +3166,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
>      if (old_file_size < 0) {
>          error_setg_errno(errp, -old_file_size,
>                           "Failed to inquire current file length");
> -        return ret;
> +        return old_file_size;
>      }
>  
>      nb_new_data_clusters = DIV_ROUND_UP(offset - old_length,
> @@ -3195,7 +3195,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
>      if (allocation_start < 0) {
>          error_setg_errno(errp, -allocation_start,
>                           "Failed to resize refcount structures");
> -        return -allocation_start;
> +        return allocation_start;
>      }
>  
>      clusters_allocated = qcow2_alloc_clusters_at(bs, allocation_start,
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH 3/3] block/throttle.c: add bdrv_co_drain_begin/end callbacks

2017-09-20 Thread Stefan Hajnoczi
On Wed, Sep 20, 2017 at 01:23:11PM +0300, Manos Pitsidianakis wrote:
> Signed-off-by: Manos Pitsidianakis 
> ---
>  block/throttle.c | 18 ++
>  1 file changed, 18 insertions(+)

Reviewed-by: Stefan Hajnoczi 



Re: [Qemu-block] [PATCH v2] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-20 Thread Stefan Hajnoczi
On Wed, Sep 20, 2017 at 02:17:51PM +0200, Alberto Garcia wrote:
> On Wed 20 Sep 2017 01:39:02 PM CEST, Manos Pitsidianakis wrote:
> >>>  void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
> >>>  {
> >>>      ThrottleTimers *tt = &tgm->throttle_timers;
> >>> +    ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
> >>> +
> >>> +    qemu_mutex_lock(&tg->lock);
> >>> +    if (timer_pending(tt->timers[0])) {
> >>> +        tg->any_timer_armed[0] = false;
> >>> +    }
> >>> +    if (timer_pending(tt->timers[1])) {
> >>> +        tg->any_timer_armed[1] = false;
> >>> +    }
> >>> +    qemu_mutex_unlock(&tg->lock);
> >>> +
> >>>      throttle_timers_detach_aio_context(tt);
> >>>      tgm->aio_context = NULL;
> >>>  }
> >>
> >>I'm sorry that I didn't noticed this in my previous e-mail, but after
> >>this call you might have destroyed the timer that was set for that
> >>throttling group, so if there are pending requests waiting it can
> >>happen that no one wakes them up.
> >>
> >>I think that the queue needs to be restarted after this, probably
> >>after having reattached the context (or actually after detaching it
> >>already, but then what happens if you try to restart the queue while
> >>aio_context is NULL?).
> >
> > aio_co_enter in the restart queue function requires that aio_context
> > is non-NULL. Perhaps calling throttle_group_restart_tgm in
> > throttle_group_attach_aio_context will suffice.
> 
> But can we guarantee that everything is safe between the _detach() and
> _attach() calls?

Here is a second patch that I didn't send because I was unsure whether it's
needed.

The idea is to move the throttle_group_detach/attach_aio_context() into
bdrv_set_aio_context() so it becomes part of the bdrv_drained_begin/end()
region.

The guarantees we have inside the drained region:

1. Throttling has been temporarily disabled.
2. throttle_group_restart_tgm() has been called to kick throttled reqs.
3. All requests are completed.

This is a nice, controlled environment to do the detach/attach.  That said,
Berto's point still stands: what about other ThrottleGroupMembers who have
throttled requests queued?

The main reason I didn't publish this patch is because Manos' "block: remove
legacy I/O throttling" series ought to remove this code anyway soon.

diff --git a/block/block-backend.c b/block/block-backend.c
index 45d9101be3..624eb3dc15 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1747,10 +1747,6 @@ void blk_set_aio_context(BlockBackend *blk, AioContext *new_context)
     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
     if (bs) {
-        if (tgm->throttle_state) {
-            throttle_group_detach_aio_context(tgm);
-            throttle_group_attach_aio_context(tgm, new_context);
-        }
         bdrv_set_aio_context(bs, new_context);
     }
 }
@@ -2029,6 +2025,13 @@ static void blk_root_drained_end(BdrvChild *child)
     BlockBackend *blk = child->opaque;
     assert(blk->quiesce_counter);
 
+    if (tgm->throttle_state) {
+        AioContext *new_context = bdrv_aio_context(blk_bs(blk));
+
+        throttle_group_detach_aio_context(tgm);
+        throttle_group_attach_aio_context(tgm, new_context);
+    }
+
     assert(blk->public.throttle_group_member.io_limits_disabled);
     atomic_dec(&blk->public.throttle_group_member.io_limits_disabled);



Re: [Qemu-block] [PATCH 2/3] block: rename bdrv_co_drain to bdrv_co_drain_begin

2017-09-20 Thread Stefan Hajnoczi
On Wed, Sep 20, 2017 at 01:23:10PM +0300, Manos Pitsidianakis wrote:
> Signed-off-by: Manos Pitsidianakis 
> ---
>  include/block/block_int.h | 2 +-
>  block/io.c| 4 ++--
>  block/qed.c   | 6 +++---
>  3 files changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [Qemu-block] [PATCH 1/3] block: add bdrv_co_drain_end callback

2017-09-20 Thread Stefan Hajnoczi
On Wed, Sep 20, 2017 at 01:23:09PM +0300, Manos Pitsidianakis wrote:
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index ba4c383393..ea1326e3c7 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -359,6 +359,8 @@ struct BlockDriver {
>       */
>      void coroutine_fn (*bdrv_co_drain)(BlockDriverState *bs);
>  
> +    void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);

Please update the doc comment to describe the environment under which
begin() and end() are invoked.  It should be clear when to implement
begin() and/or end().



Re: [Qemu-block] ping Re: [PATCH v7 03/16] migration: split common postcopy out of ram postcopy

2017-09-20 Thread Juan Quintela
"Dr. David Alan Gilbert"  wrote:
> * Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:
>> ping for 1-3
>> Can we merge them?
>
> I see all of them have R-b's; so lets try and put them in the next
> migration merge.
>
> Quintela: Sound good?

Yeap.

Later, Juan.



Re: [Qemu-block] [PATCH] qemu-iotests/195: Change multiple nodes in one run

2017-09-20 Thread Eric Blake
On 09/20/2017 07:15 AM, Kevin Wolf wrote:
> This additional check for change-backing-file makes sure that actions on
> one part of the graph don't interfere with later actions elsewhere.
> 
> Suggested-by: Eric Blake 
> Signed-off-by: Kevin Wolf 
> ---
>  tests/qemu-iotests/195 | 17 
>  tests/qemu-iotests/195.out | 51 ++
>  2 files changed, 68 insertions(+)
> 

> +run_qemu -drive if=none,file="$TEST_IMG",node-name=top,backing.node-name=mid <<EOF
> +{"execute":"qmp_capabilities"}
> +{"execute":"change-backing-file", "arguments":{"device":"none0","image-node-name":"mid","backing-file":"/dev/null"}}
> +{"execute":"change-backing-file", "arguments":{"device":"none0","image-node-name":"top","backing-file":"/dev/null"}}

I guess 'mid' is still accessible even though 'top' now no longer points
to mid in the header, because we rely on our cache of the relationships
rather than what is written into the header (of course, as soon as qemu
quits, we've broken the chain, but that doesn't matter for this test).

> +{"execute":"change-backing-file", 
> "arguments":{"device":"none0","image-node-name":"mid","backing-file":"/dev/null"}}

Should the second modification to 'mid' be distinct from the first (as
in, back to $TEST_IMG.mid), to make sure we are actually making another
change, rather than triggering some short-cut that might treat a no-op
change as something that can be skipped?  Even bouncing between
"$TEST_IMG.mid", "./$TEST_IMG.mid", and "././$TEST_IMG.mid" will access
the same file (and thus not break the chain), while still making it
obvious which changes took effect, when compared to your change to
/dev/null.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH 0/6] block: Fix permissions after ro/rw reopen

2017-09-20 Thread Kevin Wolf
On 15.09.2017 at 12:10, Kevin Wolf wrote:
> bdrv_reopen() can switch nodes between read-only and read-write modes.
> This has implications for the required permissions on their child nodes.
> For example, a qcow2 node requests write permissions on bs->file only if
> it is writable itself.
> 
> This means that during bdrv_reopen(), the permissions need to be
> recalculated in order to prevent failures where the bs->file
> permissions don't match its actual read-only state (e.g. bs->file is a
> read-write node, but the permission still enforces read-only access).

Applied to the block branch.

Kevin



[Qemu-block] [PATCH 1/2] qcow2: fix return error code in qcow2_truncate()

2017-09-20 Thread Pavel Butsykin
Signed-off-by: Pavel Butsykin 
---
 block/qcow2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2174a84d1f..8a4311d338 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3166,7 +3166,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
     if (old_file_size < 0) {
         error_setg_errno(errp, -old_file_size,
                          "Failed to inquire current file length");
-        return ret;
+        return old_file_size;
     }
 
     nb_new_data_clusters = DIV_ROUND_UP(offset - old_length,
@@ -3195,7 +3195,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
     if (allocation_start < 0) {
         error_setg_errno(errp, -allocation_start,
                          "Failed to resize refcount structures");
-        return -allocation_start;
+        return allocation_start;
     }
 
     clusters_allocated = qcow2_alloc_clusters_at(bs, allocation_start,
-- 
2.14.1




[Qemu-block] [PATCH 0/2] Truncate the tail of the image file in qcow2 shrinking

2017-09-20 Thread Pavel Butsykin
Now after shrinking the qcow2 image, at the end of the image file, there might
be a tail that probably will never be used. Although it will not bring any
tangible benefit, we can cut the tail off if there is one. Yes, it will not
free up disk space, but if the blocks were allocated sequentially and the
image is not heavily fragmented, then the virtual size of the image file will
be commensurate with the real size. It also doesn't look like a great plus...
Well, at least we can discuss it.

Pavel Butsykin (2):
  qcow2: fix return error code in qcow2_truncate()
  qcow2: truncate the tail of the image file after shrinking the image

 block/qcow2-refcount.c | 21 +
 block/qcow2.c  | 23 +--
 block/qcow2.h  |  1 +
 3 files changed, 43 insertions(+), 2 deletions(-)

-- 
2.14.1



[Qemu-block] [PATCH 2/2] qcow2: truncate the tail of the image file after shrinking the image

2017-09-20 Thread Pavel Butsykin
Now after shrinking the image, at the end of the image file, there might be a
tail that probably will never be used. So we can find the last used cluster and
cut the tail.

Signed-off-by: Pavel Butsykin 
---
 block/qcow2-refcount.c | 21 +
 block/qcow2.c  | 19 +++
 block/qcow2.h  |  1 +
 3 files changed, 41 insertions(+)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 88d5a3f1ad..5e221a166c 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -3181,3 +3181,24 @@ out:
 g_free(reftable_tmp);
 return ret;
 }
+
+int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t i, last_cluster, nb_clusters = size_to_clusters(s, size);
+    uint64_t refcount;
+
+    for (i = 0, last_cluster = 0; i < nb_clusters; i++) {
+        int ret = qcow2_get_refcount(bs, i, &refcount);
+        if (ret < 0) {
+            fprintf(stderr, "Can't get refcount for cluster %" PRId64 ": %s\n",
+                    i, strerror(-ret));
+            continue;
+        }
+
+        if (refcount > 0) {
+            last_cluster = i;
+        }
+    }
+    return last_cluster;
+}
diff --git a/block/qcow2.c b/block/qcow2.c
index 8a4311d338..c3b6dd44c4 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3106,6 +3106,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
     new_l1_size = size_to_l1(s, offset);
 
     if (offset < old_length) {
+        int64_t image_end_offset, old_file_size;
         if (prealloc != PREALLOC_MODE_OFF) {
             error_setg(errp,
                        "Preallocation can't be used for shrinking an image");
@@ -3134,6 +3135,24 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
                              "Failed to discard unused refblocks");
             return ret;
         }
+
+        old_file_size = bdrv_getlength(bs->file->bs);
+        if (old_file_size < 0) {
+            error_setg_errno(errp, -old_file_size,
+                             "Failed to inquire current file length");
+            return old_file_size;
+        }
+        image_end_offset = (qcow2_get_last_cluster(bs, old_file_size) + 1) *
+                           s->cluster_size;
+        if (image_end_offset < old_file_size) {
+            ret = bdrv_truncate(bs->file, image_end_offset,
+                                PREALLOC_MODE_OFF, NULL);
+            if (ret < 0) {
+                error_setg_errno(errp, -ret,
+                                 "Failed to truncate the tail of the image");
+                return ret;
+            }
+        }
     } else {
         ret = qcow2_grow_l1_table(bs, new_l1_size, true);
         if (ret < 0) {
diff --git a/block/qcow2.h b/block/qcow2.h
index 5a289a81e2..782a206ecb 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -597,6 +597,7 @@ int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
                                 BlockDriverAmendStatusCB *status_cb,
                                 void *cb_opaque, Error **errp);
 int qcow2_shrink_reftable(BlockDriverState *bs);
+int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
 
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
-- 
2.14.1




[Qemu-block] [PATCH v2] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-20 Thread Stefan Hajnoczi
Clear tg->any_timer_armed[] when throttling timers are destroyed during
AioContext attach/detach.  Failure to do so causes throttling to hang
because we believe the timer is already scheduled!

The following was broken at least since QEMU 2.10.0 with -drive
iops=100:

  $ dd if=/dev/zero of=/dev/vdb oflag=direct count=1000
  (qemu) stop
  (qemu) cont
  ...I/O is stuck...

Reported-by: Yongxue Hong 
Signed-off-by: Stefan Hajnoczi 
---
v2:
 * Use timer_pending(tt->timers[i]) instead of token [Berto]
 * Fix s/destroy/destroyed/ typo [Eric]

 block/throttle-groups.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index 6ba992c8d7..0a49f3c409 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -592,6 +592,17 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
 void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
 {
     ThrottleTimers *tt = &tgm->throttle_timers;
+    ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
+
+    qemu_mutex_lock(&tg->lock);
+    if (timer_pending(tt->timers[0])) {
+        tg->any_timer_armed[0] = false;
+    }
+    if (timer_pending(tt->timers[1])) {
+        tg->any_timer_armed[1] = false;
+    }
+    qemu_mutex_unlock(&tg->lock);
+
     throttle_timers_detach_aio_context(tt);
     tgm->aio_context = NULL;
 }
-- 
2.13.5




Re: [Qemu-block] [PATCH 1/2] block/block-backend.c: add blk_check_byte_request call to blk_pread/blk_pwrite

2017-09-20 Thread Kevin Wolf
On 20.09.2017 at 13:43, Manos Pitsidianakis wrote:
> blk_check_byte_request() is called from blk_co_pwritev/blk_co_preadv to
> check if the request offset and request bytes parameters are valid for the
> given BlockBackend. Let's do that in blk_pread/blk_pwrite too.
> 
> Signed-off-by: Manos Pitsidianakis 

I don't think this is necessary, blk_pread/pwrite() are only wrappers
around blk_co_preadv/pwritev(), so we're already checking the
parameters.

Kevin



Re: [Qemu-block] [PATCH v2] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-20 Thread Alberto Garcia
On Wed 20 Sep 2017 01:39:02 PM CEST, Manos Pitsidianakis wrote:
> >>>  void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
> >>>  {
> >>>      ThrottleTimers *tt = &tgm->throttle_timers;
> >>> +    ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
> >>> +
> >>> +    qemu_mutex_lock(&tg->lock);
> >>> +    if (timer_pending(tt->timers[0])) {
> >>> +        tg->any_timer_armed[0] = false;
> >>> +    }
> >>> +    if (timer_pending(tt->timers[1])) {
> >>> +        tg->any_timer_armed[1] = false;
> >>> +    }
> >>> +    qemu_mutex_unlock(&tg->lock);
> >>> +
> >>>      throttle_timers_detach_aio_context(tt);
> >>>      tgm->aio_context = NULL;
> >>>  }
>>
>>I'm sorry that I didn't noticed this in my previous e-mail, but after
>>this call you might have destroyed the timer that was set for that
>>throttling group, so if there are pending requests waiting it can
>>happen that no one wakes them up.
>>
>>I think that the queue needs to be restarted after this, probably
>>after having reattached the context (or actually after detaching it
>>already, but then what happens if you try to restart the queue while
>>aio_context is NULL?).
>
> aio_co_enter in the restart queue function requires that aio_context
> is non-NULL. Perhaps calling throttle_group_restart_tgm in
> throttle_group_attach_aio_context will suffice.

But can we guarantee that everything is safe between the _detach() and
_attach() calls?
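
A sketch of what that could look like on the attach side (untested;
throttle_group_restart_tgm() is the existing queue-restart helper):

    void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
                                           AioContext *new_context)
    {
        ThrottleTimers *tt = &tgm->throttle_timers;

        throttle_timers_attach_aio_context(tt, new_context);
        tgm->aio_context = new_context;
        /* Wake any requests that were left queued while detached. */
        throttle_group_restart_tgm(tgm);
    }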

Berto



Re: [Qemu-block] [PATCH v2] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-20 Thread Alberto Garcia
On Wed 20 Sep 2017 12:17:40 PM CEST, Stefan Hajnoczi wrote:
> @@ -592,6 +592,17 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
>  void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
>  {
>      ThrottleTimers *tt = &tgm->throttle_timers;
> +    ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
> +
> +    qemu_mutex_lock(&tg->lock);
> +    if (timer_pending(tt->timers[0])) {
> +        tg->any_timer_armed[0] = false;
> +    }
> +    if (timer_pending(tt->timers[1])) {
> +        tg->any_timer_armed[1] = false;
> +    }
> +    qemu_mutex_unlock(&tg->lock);
> +
>      throttle_timers_detach_aio_context(tt);
>      tgm->aio_context = NULL;
>  }

I'm sorry that I didn't noticed this in my previous e-mail, but after
this call you might have destroyed the timer that was set for that
throttling group, so if there are pending requests waiting it can happen
that no one wakes them up.

I think that the queue needs to be restarted after this, probably after
having reattached the context (or actually after detaching it already,
but then what happens if you try to restart the queue while aio_context
is NULL?).

Berto



Re: [Qemu-block] [PATCH] block/throttle-groups.c: allocate RestartData on the heap

2017-09-20 Thread Alberto Garcia
On Mon 18 Sep 2017 10:25:29 PM CEST, Manos Pitsidianakis wrote:
> RestartData is the opaque data of the throttle_group_restart_queue_entry
> coroutine. By being stack allocated, it isn't available anymore if
> aio_co_enter schedules the coroutine with a bottom half and it runs after
> throttle_group_restart_queue returns.
>
> Signed-off-by: Manos Pitsidianakis 

Reviewed-by: Alberto Garcia 

Berto



Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-20 Thread Alberto Garcia
On Wed 20 Sep 2017 09:06:20 AM CEST, Kevin Wolf wrote:
>> |---+--+-+---+--|
>> | Disk size | Cluster size | L2 cache| Standard QEMU | Patched QEMU |
>> |---+--+-+---+--|
>> | 16 GB | 64 KB| 1 MB [8 GB] | 5000 IOPS | 12700 IOPS   |
>> |  2 TB |  2 MB| 4 MB [1 TB] |  576 IOPS | 11000 IOPS   |
>> |---+--+-+---+--|
>> 
>> The improvements are clearly visible, but it's important to point out
>> a couple of things:
>> 
>>- L2 cache size is always < total L2 metadata on disk (otherwise
>>  this wouldn't make sense). Increasing the L2 cache size improves
>>  performance a lot (and makes the effect of these patches
>>  disappear), but it requires more RAM.
>
> Do you have the numbers for the two cases above if the L2 tables
> covered the whole image?

Yeah, sorry, it's around 6 IOPS in both cases (more or less what I
also get with a raw image).

>>- Doing random reads over the whole disk is probably not a very
>>  realistic scenario. During normal usage only certain areas of the
>>  disk need to be accessed, so performance should be much better
>>  with the same amount of cache.
>>- I wrote a best-case scenario test (several I/O jobs each accesing
>>  a part of the disk that requires loading its own L2 table) and my
>>  patched version is 20x faster even with 64KB clusters.
>
> I suppose you choose the scenario so that the number of jobs is larger
> than the number of cached L2 tables without the patch, but smaller than
> the number of cache entries with the patch?

Exactly, I should have made that explicit :) I had 32 jobs, each one of
them limited to a small area (32MB), so with 4K pages you only need
128KB of cache memory (vs 2MB with the current code).

> We will probably need to do some more benchmarking to find a good
> default value for the cached chunks. 4k is nice and small, so we can
> cover many parallel jobs without using too much memory. But if we have
> a single sequential job, we may end up doing the metadata updates in
> small 4k chunks instead of doing a single larger write.

Right, although a 4K table can already hold pointers to 512 data
clusters, so even if you do sequential I/O you don't need to update the
metadata so often, do you?

I guess the default value should probably depend on the cluster size.
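
To put rough numbers on that (each qcow2 L2 entry is 8 bytes):

    /* Guest data covered by one cached chunk of an L2 table. A 4K chunk
     * with 64K clusters covers (4096 / 8) * 64K = 32M, which is why the
     * 32MB-per-job test above only needs 128KB of cache for 32 jobs. */
    static uint64_t l2_chunk_coverage(uint64_t chunk_size, uint64_t cluster_size)
    {
        return (chunk_size / 8) * cluster_size;
    }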

>>- We need a proper name for these sub-tables that we are loading
>>  now. I'm actually still struggling with this :-) I can't think of
>>  any name that is clear enough and not too cumbersome to use (L2
>>  subtables? => Confusing. L3 tables? => they're not really that).
>
> L2 table chunk? Or just L2 cache entry?

Yeah, something like that, but let's see how variables end up being
named :)

Berto



Re: [Qemu-block] [PATCH v3 03/20] file-posix: Switch to .bdrv_co_block_status()

2017-09-20 Thread Fam Zheng
On Thu, 09/14 09:40, Eric Blake wrote:
> We are gradually moving away from sector-based interfaces, towards
> byte-based.  Update the file protocol driver accordingly.  In mapping
> mode, note that the entire file is reported as allocated, so we can
> take a shortcut and skip lseek().
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v2: tweak comment, add mapping support
> ---
>  block/file-posix.c | 57 ++
>  1 file changed, 32 insertions(+), 25 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 72ecfbb0e0..6813059867 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -2107,24 +2107,25 @@ static int find_allocation(BlockDriverState *bs, off_t start,
>  }
> 
>  /*
> - * Returns the allocation status of the specified sectors.
> + * Returns the allocation status of the specified offset.
>   *
> - * If 'sector_num' is beyond the end of the disk image the return value is 0
> + * If 'offset' is beyond the end of the disk image the return value is 0
>   * and 'pnum' is set to 0.
>   *
> - * 'pnum' is set to the number of sectors (including and immediately following
> - * the specified sector) that are known to be in the same
> + * 'pnum' is set to the number of bytes (including and immediately following
> + * the specified offset) that are known to be in the same
>   * allocated/unallocated state.
>   *
> - * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
> + * 'bytes' is the max value 'pnum' should be set to.  If bytes goes
>   * beyond the end of the disk image it will be clamped.
>   */
> -static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
> -int64_t sector_num,
> -int nb_sectors, int *pnum,
> -BlockDriverState **file)
> +static int64_t coroutine_fn raw_co_block_status(BlockDriverState *bs,
> +bool mapping,
> +int64_t offset,
> +int64_t bytes, int64_t *pnum,
> +BlockDriverState **file)
>  {
> -off_t start, data = 0, hole = 0;
> +off_t data = 0, hole = 0;
>  int64_t total_size;
>  int ret;
> 
> @@ -2133,39 +2134,45 @@ static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
>  return ret;
>  }
> 
> -start = sector_num * BDRV_SECTOR_SIZE;
>  total_size = bdrv_getlength(bs);
>  if (total_size < 0) {
>  return total_size;
> -} else if (start >= total_size) {
> +} else if (offset >= total_size) {
>  *pnum = 0;
>  return 0;
> -} else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) {
> -nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
> +} else if (offset + bytes > total_size) {
> +bytes = total_size - offset;
>  }
> 
> -ret = find_allocation(bs, start, &data, &hole);
> +if (!mapping) {
> +*pnum = bytes;
> +*file = bs;
> +return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID |
> +(offset & BDRV_BLOCK_OFFSET_MASK);
> +}

I may be missing something, because the last time I tried to understand the
rationale behind "mapping" was already some time ago: shouldn't we still
distinguish hole and data? How does omitting BDRV_BLOCK_ZERO help?

Fam

> +
> +ret = find_allocation(bs, offset, &data, &hole);
>  if (ret == -ENXIO) {
>  /* Trailing hole */
> -*pnum = nb_sectors;
> +*pnum = bytes;
>  ret = BDRV_BLOCK_ZERO;
>  } else if (ret < 0) {
>  /* No info available, so pretend there are no holes */
> -*pnum = nb_sectors;
> +*pnum = bytes;
>  ret = BDRV_BLOCK_DATA;
> -} else if (data == start) {
> -/* On a data extent, compute sectors to the end of the extent,
> +} else if (data == offset) {
> +/* On a data extent, compute bytes to the end of the extent,
>   * possibly including a partial sector at EOF. */
> -*pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, BDRV_SECTOR_SIZE));
> +*pnum = MIN(bytes, hole - offset);
>  ret = BDRV_BLOCK_DATA;
>  } else {
> -/* On a hole, compute sectors to the beginning of the next extent.  */
> -assert(hole == start);
> -*pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE);
> +/* On a hole, compute bytes to the beginning of the next extent.  */
> +assert(hole == offset);
> +*pnum = MIN(bytes, data - offset);
>  ret = BDRV_BLOCK_ZERO;
>  }
>  *file = bs;
> -return ret | BDRV_BLOCK_OFFSET_VALID | start;
> +return ret | BDRV_BLOCK_OFFSET_VALID | (offset & BDRV_BLOCK_OFFSET_MASK);
>  }
> 
>  static coroutine_fn 

[Qemu-block] [PATCH 2/3] block: rename bdrv_co_drain to bdrv_co_drain_begin

2017-09-20 Thread Manos Pitsidianakis
Signed-off-by: Manos Pitsidianakis 
---
 include/block/block_int.h | 2 +-
 block/io.c| 4 ++--
 block/qed.c   | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index ea1326e3c7..fd10f53b43 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -357,7 +357,7 @@ struct BlockDriver {
      * Drain and stop any internal sources of requests in the driver, and
      * remain so until next I/O callback (e.g. bdrv_co_writev) is called.
      */
-    void coroutine_fn (*bdrv_co_drain)(BlockDriverState *bs);
+    void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
 
     void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
 
diff --git a/block/io.c b/block/io.c
index 465345289d..6b4b13b777 100644
--- a/block/io.c
+++ b/block/io.c
@@ -162,7 +162,7 @@ static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
     BlockDriverState *bs = data->bs;
 
     if (data->begin) {
-        bs->drv->bdrv_co_drain(bs);
+        bs->drv->bdrv_co_drain_begin(bs);
     } else {
         bs->drv->bdrv_co_drain_end(bs);
     }
@@ -176,7 +176,7 @@ static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
 {
     BdrvCoDrainData data = { .bs = bs, .done = false, .begin = begin};
 
-    if (!bs->drv || (begin && !bs->drv->bdrv_co_drain) ||
+    if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) ||
         (!begin && !bs->drv->bdrv_co_drain_end)) {
         return;
     }
 }
diff --git a/block/qed.c b/block/qed.c
index 28e2ec89e8..821dcaa055 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -265,7 +265,7 @@ static bool qed_plug_allocating_write_reqs(BDRVQEDState *s)
     assert(!s->allocating_write_reqs_plugged);
     if (s->allocating_acb != NULL) {
         /* Another allocating write came concurrently.  This cannot happen
-         * from bdrv_qed_co_drain, but it can happen when the timer runs.
+         * from bdrv_qed_co_drain_begin, but it can happen when the timer runs.
          */
         qemu_co_mutex_unlock(&s->table_lock);
         return false;
@@ -358,7 +358,7 @@ static void bdrv_qed_attach_aio_context(BlockDriverState *bs,
     }
 }
 
-static void coroutine_fn bdrv_qed_co_drain(BlockDriverState *bs)
+static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
 {
     BDRVQEDState *s = bs->opaque;
 
@@ -1608,7 +1608,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_check               = bdrv_qed_check,
     .bdrv_detach_aio_context  = bdrv_qed_detach_aio_context,
     .bdrv_attach_aio_context  = bdrv_qed_attach_aio_context,
-    .bdrv_co_drain            = bdrv_qed_co_drain,
+    .bdrv_co_drain_begin      = bdrv_qed_co_drain_begin,
 };
 
 static void bdrv_qed_init(void)
-- 
2.11.0




Re: [Qemu-block] [PATCH v2] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-20 Thread Manos Pitsidianakis

On Wed, Sep 20, 2017 at 11:17:40AM +0100, Stefan Hajnoczi wrote:

Clear tg->any_timer_armed[] when throttling timers are destroyed during
AioContext attach/detach.  Failure to do so causes throttling to hang
because we believe the timer is already scheduled!

The following was broken at least since QEMU 2.10.0 with -drive
iops=100:

 $ dd if=/dev/zero of=/dev/vdb oflag=direct count=1000
 (qemu) stop
 (qemu) cont
 ...I/O is stuck...
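A paraphrased sketch of the failure mode (this condenses the
block/throttle-groups.c scheduling logic; the real code differs in
detail):

    static bool schedule_next_request_sketch(ThrottleGroup *tg, bool is_write)
    {
        if (tg->any_timer_armed[is_write]) {
            /* "Some member's timer is armed and will wake the queue."
             * If that timer was destroyed during AioContext detach while
             * the flag stayed true, nothing ever fires: I/O hangs. */
            return true;    /* request queues up and waits */
        }
        return false;       /* caller arms a fresh timer */
    }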

Reported-by: Yongxue Hong 
Signed-off-by: Stefan Hajnoczi 


Reviewed-by: Manos Pitsidianakis 



v2:
* Use timer_pending(tt->timers[i]) instead of token [Berto]
* Fix s/destroy/destroyed/ typo [Eric]

block/throttle-groups.c | 11 +++
1 file changed, 11 insertions(+)

diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index 6ba992c8d7..0a49f3c409 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -592,6 +592,17 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
{
ThrottleTimers *tt = &tgm->throttle_timers;
+ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
+
+qemu_mutex_lock(&tg->lock);
+if (timer_pending(tt->timers[0])) {
+tg->any_timer_armed[0] = false;
+}
+if (timer_pending(tt->timers[1])) {
+tg->any_timer_armed[1] = false;
+}
+qemu_mutex_unlock(&tg->lock);
+
throttle_timers_detach_aio_context(tt);
tgm->aio_context = NULL;
}
--
2.13.5






[Qemu-block] [PATCH 1/2] block/block-backend.c: add blk_check_byte_request call to blk_pread/blk_pwrite

2017-09-20 Thread Manos Pitsidianakis
blk_check_byte_request() is called from blk_co_pwritev/blk_co_preadv to
check whether the request offset and byte count are valid for the given
BlockBackend. Let's do that in blk_pread/blk_pwrite too.
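As a usage sketch (hypothetical caller, not part of this patch):
blk_pread() still returns 'count' on success and a negative errno on
failure, so existing callers are unaffected; invalid requests now fail
the explicit check before any I/O is dispatched.

    uint8_t buf[BDRV_SECTOR_SIZE];
    int ret = blk_pread(blk, offset, buf, sizeof(buf));
    if (ret < 0) {
        return ret;   /* e.g. the error from blk_check_byte_request()
                       * for an out-of-range offset/count */
    }
    assert(ret == sizeof(buf));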

Signed-off-by: Manos Pitsidianakis 
---
 block/block-backend.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 45d9101be3..110a52d5b1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1288,22 +1288,23 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
 {
-int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
+int ret = blk_check_byte_request(blk, offset, count);
 if (ret < 0) {
 return ret;
 }
-return count;
+ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
+return ret < 0 ? ret : count;
 }
 
 int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int count,
BdrvRequestFlags flags)
 {
-int ret = blk_prw(blk, offset, (void *) buf, count, blk_write_entry,
-  flags);
+int ret = blk_check_byte_request(blk, offset, count);
 if (ret < 0) {
 return ret;
 }
-return count;
+ret = blk_prw(blk, offset, (void *) buf, count, blk_write_entry, flags);
+return ret < 0 ? ret : count;
 }
 
 int64_t blk_getlength(BlockBackend *blk)
-- 
2.11.0




[Qemu-block] [PATCH 1/3] block: add bdrv_co_drain_end callback

2017-09-20 Thread Manos Pitsidianakis
BlockDriverState has a bdrv_co_drain() callback but no equivalent for the end
of the drain. The throttle driver (block/throttle.c) needs a way to mark the
end of the drain in order to toggle io_limits_disabled correctly, thus
bdrv_co_drain_end is needed.

Signed-off-by: Manos Pitsidianakis 
---
 include/block/block_int.h |  2 ++
 block/io.c| 43 +++
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index ba4c383393..ea1326e3c7 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -359,6 +359,8 @@ struct BlockDriver {
  */
 void coroutine_fn (*bdrv_co_drain)(BlockDriverState *bs);
 
+void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
+
 void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
Error **errp);
 void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
diff --git a/block/io.c b/block/io.c
index 4378ae4c7d..465345289d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -153,6 +153,7 @@ typedef struct {
 Coroutine *co;
 BlockDriverState *bs;
 bool done;
+bool begin;
 } BdrvCoDrainData;
 
 static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
@@ -160,18 +161,23 @@ static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
 BdrvCoDrainData *data = opaque;
 BlockDriverState *bs = data->bs;
 
-bs->drv->bdrv_co_drain(bs);
+if (data->begin) {
+bs->drv->bdrv_co_drain(bs);
+} else {
+bs->drv->bdrv_co_drain_end(bs);
+}
 
 /* Set data->done before reading bs->wakeup.  */
 atomic_mb_set(&data->done, true);
 bdrv_wakeup(bs);
 }
 
-static void bdrv_drain_invoke(BlockDriverState *bs)
+static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
 {
-BdrvCoDrainData data = { .bs = bs, .done = false };
+BdrvCoDrainData data = { .bs = bs, .done = false, .begin = begin};
 
-if (!bs->drv || !bs->drv->bdrv_co_drain) {
+if (!bs->drv || (begin && !bs->drv->bdrv_co_drain) ||
+(!begin && !bs->drv->bdrv_co_drain_end)) {
 return;
 }
 
@@ -180,7 +186,7 @@ static void bdrv_drain_invoke(BlockDriverState *bs)
 BDRV_POLL_WHILE(bs, !data.done);
 }
 
-static bool bdrv_drain_recurse(BlockDriverState *bs)
+static bool bdrv_drain_recurse(BlockDriverState *bs, bool begin)
 {
 BdrvChild *child, *tmp;
 bool waited;
@@ -188,7 +194,7 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
 waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
 
 /* Ensure any pending metadata writes are submitted to bs->file.  */
-bdrv_drain_invoke(bs);
+bdrv_drain_invoke(bs, begin);
 
 QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
 BlockDriverState *bs = child->bs;
@@ -205,7 +211,7 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
  */
 bdrv_ref(bs);
 }
-waited |= bdrv_drain_recurse(bs);
+waited |= bdrv_drain_recurse(bs, begin);
 if (in_main_loop) {
 bdrv_unref(bs);
 }
@@ -221,12 +227,18 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 BlockDriverState *bs = data->bs;
 
 bdrv_dec_in_flight(bs);
-bdrv_drained_begin(bs);
+if (data->begin) {
+bdrv_drained_begin(bs);
+} else {
+bdrv_drained_end(bs);
+}
+
 data->done = true;
 aio_co_wake(co);
 }
 
-static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
+static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
+bool begin)
 {
 BdrvCoDrainData data;
 
@@ -239,6 +251,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
 .co = qemu_coroutine_self(),
 .bs = bs,
 .done = false,
+.begin = begin,
 };
 bdrv_inc_in_flight(bs);
 aio_bh_schedule_oneshot(bdrv_get_aio_context(bs),
@@ -253,7 +266,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
 void bdrv_drained_begin(BlockDriverState *bs)
 {
 if (qemu_in_coroutine()) {
-bdrv_co_yield_to_drain(bs);
+bdrv_co_yield_to_drain(bs, true);
 return;
 }
 
@@ -262,17 +275,22 @@ void bdrv_drained_begin(BlockDriverState *bs)
 bdrv_parent_drained_begin(bs);
 }
 
-bdrv_drain_recurse(bs);
+bdrv_drain_recurse(bs, true);
 }
 
 void bdrv_drained_end(BlockDriverState *bs)
 {
+if (qemu_in_coroutine()) {
+bdrv_co_yield_to_drain(bs, false);
+return;
+}
 assert(bs->quiesce_counter > 0);
 if (atomic_fetch_dec(&bs->quiesce_counter) > 1) {
 return;
 }
 
 bdrv_parent_drained_end(bs);
+bdrv_drain_recurse(bs, false);
 aio_enable_external(bdrv_get_aio_context(bs));
 }
 
@@ -350,7 +368,7 @@ void bdrv_drain_all_begin(void)
 aio_context_acquire(aio_context);
 for (bs = 

Re: [Qemu-block] [PATCH v3 03/20] file-posix: Switch to .bdrv_co_block_status()

2017-09-20 Thread Eric Blake
On 09/20/2017 04:57 AM, Fam Zheng wrote:
> On Thu, 09/14 09:40, Eric Blake wrote:
>> We are gradually moving away from sector-based interfaces, towards
>> byte-based.  Update the file protocol driver accordingly.  In mapping
>> mode, note that the entire file is reported as allocated, so we can
>> take a shortcut and skip lseek().
>>
>> Signed-off-by: Eric Blake 
>>

>>
>> -ret = find_allocation(bs, start, &data, &hole);
>> +if (!mapping) {
>> +*pnum = bytes;
>> +*file = bs;
>> +return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID |
>> +(offset & BDRV_BLOCK_OFFSET_MASK);
>> +}
> 
> I may be missing something, because the last time I tried to understand the
> rationale behind "mapping" was already some time ago: shouldn't we still
> distinguish hole and data? What does omitting BDRV_BLOCK_ZERO buy us?

Hmm, the commit message is slightly off (in part, because I switched the
sense of the bool flag between series revisions, but did not properly
update the commit text to match).  In mapping mode, we want to return as
much information as possible (the client is something like 'qemu-img
map'), including where the holes lie.  But when we are NOT in mapping
mode, we care more about learning which portions of the file are
described in the current layer of the backing chain, rather than
delegating to another layer, regardless of whether the read will see
data or zeroes.  By the time we are at the POSIX file protocol layer, we
know that every byte in the file system/block device has a 1:1 mapping
to the bytes that the guest will read (we do not delegate to any backing
file), so we can simply report the entire remainder of the file as
allocated without worrying about holes.
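To make the two modes concrete, a hypothetical layout (invented for
illustration, not taken from the patch): a raw file with 64k of written
data followed by a 64k lseek()-visible hole.

    /*
     * mapping=true : [0,  64k) -> BDRV_BLOCK_DATA
     *                [64k,128k) -> BDRV_BLOCK_ZERO  (hole, via lseek)
     * mapping=false: [0, 128k) -> BDRV_BLOCK_DATA   (one answer, no lseek)
     *
     * The guest reads the same bytes either way; only the amount of
     * mapping detail reported (and the size of *pnum) differs.
     */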

Here's where the mapping flag was added and semantics documented (in
series 3; whereas the current email is series 4):
https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03542.html

So what I really need to do is fix the commit message to read:

In mapping mode, we must report as much information as possible about
where holes can be found; but when we don't care about mapping, the user
is more interested in how much of the guest view will come from the
current layer rather than delegating to some other BDS, and we can
assume that all of the remainder of the file fits that description, and
therefore take a shortcut and skip lseek() for a larger *pnum result.

(the same comment probably applies to several other patches in the series)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH 0/2] remove blk_pread_unthrottled()

2017-09-20 Thread Manos Pitsidianakis
Cleanups for minor stuff I noticed while looking around blk_root_drained_*

Manos Pitsidianakis (2):
  block/block-backend.c: add blk_check_byte_request call to
blk_pread/blk_pwrite
  block/block-backend.c: remove blk_pread_unthrottled()

 include/sysemu/block-backend.h |  2 --
 block/block-backend.c  | 27 ++-
 hw/block/hd-geometry.c |  7 +--
 3 files changed, 7 insertions(+), 29 deletions(-)

-- 
2.11.0




[Qemu-block] [PATCH 3/3] block/throttle.c: add bdrv_co_drain_begin/end callbacks

2017-09-20 Thread Manos Pitsidianakis
Signed-off-by: Manos Pitsidianakis 
---
 block/throttle.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block/throttle.c b/block/throttle.c
index 5bca76300f..833175ac77 100644
--- a/block/throttle.c
+++ b/block/throttle.c
@@ -197,6 +197,21 @@ static bool throttle_recurse_is_first_non_filter(BlockDriverState *bs,
 return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
 }
 
+static void coroutine_fn throttle_co_drain_begin(BlockDriverState *bs)
+{
+ThrottleGroupMember *tgm = bs->opaque;
+if (atomic_fetch_inc(&tgm->io_limits_disabled) == 0) {
+throttle_group_restart_tgm(tgm);
+}
+}
+
+static void coroutine_fn throttle_co_drain_end(BlockDriverState *bs)
+{
+ThrottleGroupMember *tgm = bs->opaque;
+assert(tgm->io_limits_disabled);
+atomic_dec(&tgm->io_limits_disabled);
+}
+
 static BlockDriver bdrv_throttle = {
 .format_name=   "throttle",
 .protocol_name  =   "throttle",
@@ -226,6 +241,9 @@ static BlockDriver bdrv_throttle = {
 .bdrv_reopen_abort  =   throttle_reopen_abort,
 .bdrv_co_get_block_status   =   bdrv_co_get_block_status_from_file,
 
+.bdrv_co_drain_begin=   throttle_co_drain_begin,
+.bdrv_co_drain_end  =   throttle_co_drain_end,
+
 .is_filter  =   true,
 };
 
-- 
2.11.0




Re: [Qemu-block] [PATCH v2 5/7] block/nbd-client: nbd_co_send_request: return -EIO if s->quit was set in parallel

2017-09-20 Thread Vladimir Sementsov-Ogievskiy

18.09.2017 19:01, Eric Blake wrote:

On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:

It's incorrect to return a success code (rc >= 0) if we skip the
qio_channel_writev_all() call due to s->quit.

Does this need to cc: qemu-stable for 2.10.1 (or put another way, can we
come up with some scenario of EAGAIN or other handling that would
actually set s->quit in a parallel coroutine when a client sends out
multiple requests at once)?


Intuitively, I think it should not be possible: nbd_send_request should
not do anything serious (should not yield) after set_cork...




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/nbd-client.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index f331f08a9a..280147e6a7 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -189,6 +189,9 @@ static int nbd_co_send_request(BlockDriverState *bs,
  }
  
  err:

+if (rc >= 0 && s->quit) {
+rc = -EIO;
+}

I'm still not convinced this is in the right place.  This fails the
send_request regardless of whether we skipped qio_channel_writev_all();
shouldn't the rc be set only in the case that we actually skipped
writing the full command because s->quit was detected at that point in time?


  if (rc < 0) {
  s->quit = true;
  s->requests[i].coroutine = NULL;
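For reference, the v3 of this patch moves the override into the branch
that is taken only when the payload write was skipped; condensed from
that hunk (whitespace and comments mine):

    if (qiov) {
        qio_channel_set_cork(s->ioc, true);
        rc = nbd_send_request(s->ioc, request);
        if (rc >= 0 && !s->quit) {
            if (qio_channel_writev_all(s->ioc, qiov->iov, qiov->niov,
                                       NULL) < 0) {
                rc = -EIO;
            }
        } else if (rc >= 0) {
            rc = -EIO;      /* send skipped because s->quit was set */
        }
        qio_channel_set_cork(s->ioc, false);
    } else {
        rc = nbd_send_request(s->ioc, request);
    }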



--
Best regards,
Vladimir



[Qemu-block] [PATCH v3 0/3] nbd client refactoring and fixing

2017-09-20 Thread Vladimir Sementsov-Ogievskiy
v3: dropped all the controversial things. I'll try to implement minimal
structured reply and block status on top of this small remainder.

Vladimir Sementsov-Ogievskiy (3):
  block/nbd-client: refactor nbd_co_receive_reply
  block/nbd-client: simplify check in nbd_co_receive_reply
  block/nbd-client: nbd_co_send_request: fix return code

 block/nbd-client.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

-- 
2.11.1




Re: [Qemu-block] [PATCH v2] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-20 Thread Manos Pitsidianakis

On Wed, Sep 20, 2017 at 01:08:52PM +0200, Alberto Garcia wrote:

On Wed 20 Sep 2017 12:17:40 PM CEST, Stefan Hajnoczi wrote:

@@ -592,6 +592,17 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
 void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
 {
 ThrottleTimers *tt = &tgm->throttle_timers;
+ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
+
+qemu_mutex_lock(&tg->lock);
+if (timer_pending(tt->timers[0])) {
+tg->any_timer_armed[0] = false;
+}
+if (timer_pending(tt->timers[1])) {
+tg->any_timer_armed[1] = false;
+}
+qemu_mutex_unlock(&tg->lock);
+
 throttle_timers_detach_aio_context(tt);
 tgm->aio_context = NULL;
 }


I'm sorry that I didn't notice this in my previous e-mail, but after
this call you might have destroyed the timer that was set for that
throttling group, so if there are pending requests waiting, it can
happen that no one wakes them up.

I think that the queue needs to be restarted after this, probably after
having reattached the context (or actually after detaching it already,
but then what happens if you try to restart the queue while aio_context
is NULL?).


aio_co_enter in the restart queue function requires that aio_context is 
non-NULL. Perhaps calling throttle_group_restart_tgm in 
throttle_group_attach_aio_context will suffice.
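A sketch of that suggestion (my reading of it, not a posted patch; the
existing attach logic is kept and a restart is appended):

    void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
                                           AioContext *new_context)
    {
        ThrottleTimers *tt = &tgm->throttle_timers;

        throttle_timers_attach_aio_context(tt, new_context);
        tgm->aio_context = new_context;
        throttle_group_restart_tgm(tgm);   /* wake requests stranded by
                                            * the destroyed timers */
    }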




[Qemu-block] [PATCH 0/3] add bdrv_co_drain_begin/end BlockDriver callbacks

2017-09-20 Thread Manos Pitsidianakis
This patch series renames bdrv_co_drain to bdrv_co_drain_begin and adds a new 
bdrv_co_drain_end callback to match bdrv_drained_begin/end and 
drained_begin/end of BdrvChild. This is needed because the throttle driver 
(block/throttle.c) needs a way to mark the end of the drain in order to toggle 
io_limits_disabled correctly.

Based-on: <20170918202529.28379-1-el13...@mail.ntua.gr>
"block/throttle-groups.c: allocate RestartData on the heap"
Which fixes a coroutine crash in block/throttle-groups.c

Manos Pitsidianakis (3):
  block: add bdrv_co_drain_end callback
  block: rename bdrv_co_drain to bdrv_co_drain_begin
  block/throttle.c: add bdrv_co_drain_begin/end callbacks

 include/block/block_int.h |  4 +++-
 block/io.c| 42 ++
 block/qed.c   |  6 +++---
 block/throttle.c  | 18 ++
 4 files changed, 54 insertions(+), 16 deletions(-)

-- 
2.11.0




[Qemu-block] [PATCH 2/2] block/block-backend.c: remove blk_pread_unthrottled()

2017-09-20 Thread Manos Pitsidianakis
blk_pread_unthrottled was used to bypass I/O throttling on the BlockBackend in
the case of async I/O. This is not needed anymore and we can just call
blk_pread() directly.

Signed-off-by: Manos Pitsidianakis 
---
 include/sysemu/block-backend.h |  2 --
 block/block-backend.c  | 16 
 hw/block/hd-geometry.c |  7 +--
 3 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index c4e52a5fa3..c881bfdfe4 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -117,8 +117,6 @@ char *blk_get_attached_dev_id(BlockBackend *blk);
 BlockBackend *blk_by_dev(void *dev);
 BlockBackend *blk_by_qdev_id(const char *id, Error **errp);
 void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
-int blk_pread_unthrottled(BlockBackend *blk, int64_t offset, uint8_t *buf,
-  int bytes);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
unsigned int bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index 110a52d5b1..83329fce15 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1145,22 +1145,6 @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
 return rwco.ret;
 }
 
-int blk_pread_unthrottled(BlockBackend *blk, int64_t offset, uint8_t *buf,
-  int count)
-{
-int ret;
-
-ret = blk_check_byte_request(blk, offset, count);
-if (ret < 0) {
-return ret;
-}
-
-blk_root_drained_begin(blk->root);
-ret = blk_pread(blk, offset, buf, count);
-blk_root_drained_end(blk->root);
-return ret;
-}
-
 int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
   int bytes, BdrvRequestFlags flags)
 {
diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c
index 57ad5012a7..211307f73f 100644
--- a/hw/block/hd-geometry.c
+++ b/hw/block/hd-geometry.c
@@ -62,12 +62,7 @@ static int guess_disk_lchs(BlockBackend *blk,
 
 blk_get_geometry(blk, &nb_sectors);
 
-/**
- * The function will be invoked during startup not only in sync I/O mode,
- * but also in async I/O mode. So the I/O throttling function has to
- * be disabled temporarily here, not permanently.
- */
-if (blk_pread_unthrottled(blk, 0, buf, BDRV_SECTOR_SIZE) < 0) {
+if (blk_pread(blk, 0, buf, BDRV_SECTOR_SIZE) < 0) {
 return -1;
 }
 /* test msdos magic */
-- 
2.11.0




[Qemu-block] [PATCH] qemu-iotests/195: Change multiple nodes in one run

2017-09-20 Thread Kevin Wolf
This additional check for change-backing-file makes sure that actions on
one part of the graph don't interfere with later actions elsewhere.

Suggested-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/195 | 17 
 tests/qemu-iotests/195.out | 51 ++
 2 files changed, 68 insertions(+)

diff --git a/tests/qemu-iotests/195 b/tests/qemu-iotests/195
index 05a239cbf5..eb2861380d 100755
--- a/tests/qemu-iotests/195
+++ b/tests/qemu-iotests/195
@@ -86,6 +86,23 @@ EOF
 
 _img_info
 
+echo
+echo "Change backing file of multiple nodes in a single run"
+echo
+
+_make_test_img -b "$TEST_IMG.mid"
+
+run_qemu -drive if=none,file="$TEST_IMG",node-name=top,backing.node-name=mid 
<

[Qemu-block] [PATCH v3 1/3] block/nbd-client: refactor nbd_co_receive_reply

2017-09-20 Thread Vladimir Sementsov-Ogievskiy
"NBDReply *reply" parameter of nbd_co_receive_reply is used only
to pass return value for nbd_co_request (reply.error). Remove it
and use function return value instead.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 block/nbd-client.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index ee7f758e68..acd8e5d007 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -178,26 +178,26 @@ err:
 return rc;
 }
 
-static void nbd_co_receive_reply(NBDClientSession *s,
- NBDRequest *request,
- NBDReply *reply,
- QEMUIOVector *qiov)
+static int nbd_co_receive_reply(NBDClientSession *s,
+NBDRequest *request,
+QEMUIOVector *qiov)
 {
+int ret;
 int i = HANDLE_TO_INDEX(s, request->handle);
 
 /* Wait until we're woken up by nbd_read_reply_entry.  */
 s->requests[i].receiving = true;
 qemu_coroutine_yield();
 s->requests[i].receiving = false;
-*reply = s->reply;
-if (reply->handle != request->handle || !s->ioc || s->quit) {
-reply->error = EIO;
+if (s->reply.handle != request->handle || !s->ioc || s->quit) {
+ret = -EIO;
 } else {
-if (qiov && reply->error == 0) {
+ret = -s->reply.error;
+if (qiov && s->reply.error == 0) {
 assert(request->len == iov_size(qiov->iov, qiov->niov));
 if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
   NULL) < 0) {
-reply->error = EIO;
+ret = -EIO;
 s->quit = true;
 }
 }
@@ -217,6 +217,8 @@ static void nbd_co_receive_reply(NBDClientSession *s,
 s->in_flight--;
 qemu_co_queue_next(&s->free_sema);
 qemu_co_mutex_unlock(&s->send_mutex);
+
+return ret;
 }
 
 static int nbd_co_request(BlockDriverState *bs,
@@ -224,7 +226,6 @@ static int nbd_co_request(BlockDriverState *bs,
   QEMUIOVector *qiov)
 {
 NBDClientSession *client = nbd_get_client_session(bs);
-NBDReply reply;
 int ret;
 
 assert(!qiov || request->type == NBD_CMD_WRITE ||
@@ -232,12 +233,11 @@ static int nbd_co_request(BlockDriverState *bs,
 ret = nbd_co_send_request(bs, request,
   request->type == NBD_CMD_WRITE ? qiov : NULL);
 if (ret < 0) {
-reply.error = -ret;
-} else {
-nbd_co_receive_reply(client, request, &reply,
- request->type == NBD_CMD_READ ? qiov : NULL);
+return ret;
 }
-return -reply.error;
+
+return nbd_co_receive_reply(client, request,
+request->type == NBD_CMD_READ ? qiov : NULL);
 }
 
 int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
-- 
2.11.1




[Qemu-block] [PATCH v3 3/3] block/nbd-client: nbd_co_send_request: fix return code

2017-09-20 Thread Vladimir Sementsov-Ogievskiy
It's incorrect to return a success code (rc >= 0) if we skip the
qio_channel_writev_all() call due to s->quit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 486bfff9f7..9d1e154feb 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -161,6 +161,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
NULL) < 0) {
 rc = -EIO;
 }
+} else if (rc >= 0) {
+rc = -EIO;
 }
 qio_channel_set_cork(s->ioc, false);
 } else {
-- 
2.11.1




[Qemu-block] [PATCH v3 2/3] block/nbd-client: simplify check in nbd_co_receive_reply

2017-09-20 Thread Vladimir Sementsov-Ogievskiy
If we are woken up from the while() loop in nbd_read_reply_entry,
the handles must be equal. If we are woken up from
nbd_recv_coroutines_wake_all, s->quit must be true, so we do
not need to check handle equality.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index acd8e5d007..486bfff9f7 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -189,9 +189,10 @@ static int nbd_co_receive_reply(NBDClientSession *s,
 s->requests[i].receiving = true;
 qemu_coroutine_yield();
 s->requests[i].receiving = false;
-if (s->reply.handle != request->handle || !s->ioc || s->quit) {
+if (!s->ioc || s->quit) {
 ret = -EIO;
 } else {
+assert(s->reply.handle == request->handle);
 ret = -s->reply.error;
 if (qiov && s->reply.error == 0) {
 assert(request->len == iov_size(qiov->iov, qiov->niov));
-- 
2.11.1




Re: [Qemu-block] [Qemu-devel] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate

2017-09-20 Thread Eric Blake
On 09/19/2017 09:10 PM, Fam Zheng wrote:

>>
>> Do you suspect that almost certainly if bdrv_truncate() fails overall
>> that the image format driver will either unmount the image or become
>> read-only?

Uggh - it feels like I've bitten off more than I can chew with this
patch - I'm getting bogged down by trying to fix bad behavior in code
that is mostly unrelated to the patch at hand, so I don't have a good
opinion on WHAT is supposed to happen if bdrv_truncate() fails, only
that I'm trying to avoid compounding that failure even worse.

>> I suppose if *not* that's a bug for callers of bdrv_truncate to allow
>> that kind of monkey business, but if it CAN happen, hbitmap only guards
>> against such things with an assert (which, IIRC, is not guaranteed to be
>> on for all builds)
> 
> It's guaranteed since a few hours ago:
> 
> commit 262a69f4282e44426c7a132138581d400053e0a1

Indeed - but even without my patch, we would have hit the assertion
failures when trying to resize the dirty bitmap to -1 when
bdrv_nb_sectors() fails (which was likely if refresh_total_sectors()
failed).

>> So the question is: "bdrv_truncate failure is NOT considered recoverable
>> in ANY case, is it?"
>>
>> It may possibly be safer to, if the initial truncate request succeeds,
>> apply a best-effort to the bitmap before returning the error.
> 
> Like fallback "offset" (or it aligned up to bs cluster size) if
> refresh_total_sectors() returns error? I think that is okay.

Here's my proposal for squashing in a best-effort dirty-bitmap resize no
matter what happens in refresh_total_sectors() (but really, if you
successfully truncate the disk but then get a failure while trying to
read back the actual new size, which may differ from the requested size,
you're probably doomed down the road anyways).

diff --git i/block.c w/block.c
index 3caf6bb093..ef5af81f66 100644
--- i/block.c
+++ w/block.c
@@ -3552,8 +3552,9 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not refresh total sector count");
 } else {
-bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE);
+offset = bs->total_sectors * BDRV_SECTOR_SIZE;
 }
+bdrv_dirty_bitmap_truncate(bs, offset);
 bdrv_parent_cb_resize(bs);
 atomic_inc(&bs->write_gen);
 return ret;


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH 2/2] block/block-backend.c: remove blk_pread_unthrottled()

2017-09-20 Thread Kevin Wolf
On 20.09.2017 at 13:43, Manos Pitsidianakis wrote:
> blk_pread_unthrottled was used to bypass I/O throttling on the BlockBackend in
> the case of async I/O. This is not needed anymore and we can just call
> blk_pread() directly.
> 
> Signed-off-by: Manos Pitsidianakis 

We already did a related commit to the same effect in block/io.c:

commit 90c78624f157ba41c3761c1a54864de03a7ec350
Author: Kevin Wolf 
Date:   Thu Apr 7 18:33:29 2016 +0200

block: Don't disable I/O throttling on sync requests

We had to disable I/O throttling with synchronous requests because we
didn't use to run timers in nested event loops when the code was
introduced. This isn't true any more, and throttling works just fine
even when using the synchronous API.

The removed code is in fact dead code since commit a8823a3b ('block:
Use blk_co_pwritev() for blk_write()') because I/O throttling can only
be set on the top layer, but BlockBackend always uses the coroutine
interface now instead of using the sync API emulation in block.c.

Signed-off-by: Kevin Wolf 
Message-Id: <1458660792-3035-2-git-send-email-kw...@redhat.com>
Signed-off-by: Paolo Bonzini 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 

The same reasoning (at least the first part) applies to this patch.

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-20 Thread Alberto Garcia
On Tue 19 Sep 2017 05:50:25 PM CEST, Stefan Hajnoczi wrote:
> Clear tg->any_timer_armed[] when throttling timers are destroy during
> AioContext attach/detach.  Failure to do so causes throttling to hang
> because we believe the timer is already scheduled!
>
> The following was broken at least since QEMU 2.10.0 with -drive
> iops=100:
>
>   $ dd if=/dev/zero of=/dev/vdb oflag=direct count=1000
>   (qemu) stop
>   (qemu) cont
>   ...I/O is stuck...
>
> Reported-by: Yongxue Hong 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/throttle-groups.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/block/throttle-groups.c b/block/throttle-groups.c
> index 6ba992c8d7..2bfd03faa0 100644
> --- a/block/throttle-groups.c
> +++ b/block/throttle-groups.c
> @@ -592,7 +592,20 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
>  void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
>  {
>  ThrottleTimers *tt = >throttle_timers;
> +ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
> +
>  throttle_timers_detach_aio_context(tt);
> +
> +/* Forget about these timers, they have been destroyed */
> +qemu_mutex_lock(>lock);
> +if (tg->tokens[0] == tgm) {
> +tg->any_timer_armed[0] = false;
> +}
> +if (tg->tokens[1] == tgm) {
> +tg->any_timer_armed[1] = false;
> +}
> +qemu_mutex_unlock(>lock);

I think I'd rather check the timers directly using timer_pending().

See https://github.com/qemu/qemu/blob/dbe824cc5/block/throttle-groups.c#L426

Berto



Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-20 Thread Kevin Wolf
On 19.09.2017 at 17:07, Alberto Garcia wrote:
> Hi everyone,
> 
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
> [...]

Thanks for working on this, Berto! I think this is essential for large
cluster sizes and have been meaning to make a change like this for a
long time, but I never found the time for it.

> Some results from my tests (using an SSD drive and random 4K reads):
> 
> |-----------+--------------+-------------+---------------+--------------|
> | Disk size | Cluster size | L2 cache    | Standard QEMU | Patched QEMU |
> |-----------+--------------+-------------+---------------+--------------|
> | 16 GB     | 64 KB        | 1 MB [8 GB] | 5000 IOPS     | 12700 IOPS   |
> |  2 TB     |  2 MB        | 4 MB [1 TB] |  576 IOPS     | 11000 IOPS   |
> |-----------+--------------+-------------+---------------+--------------|
> 
> The improvements are clearly visible, but it's important to point out
> a couple of things:
> 
>- L2 cache size is always < total L2 metadata on disk (otherwise
>  this wouldn't make sense). Increasing the L2 cache size improves
>  performance a lot (and makes the effect of these patches
>  disappear), but it requires more RAM.

Do you have the numbers for the two cases above if the L2 tables covered
the whole image?

>- Doing random reads over the whole disk is probably not a very
>  realistic scenario. During normal usage only certain areas of the
>  disk need to be accessed, so performance should be much better
>  with the same amount of cache.
>- I wrote a best-case scenario test (several I/O jobs each accessing
>  a part of the disk that requires loading its own L2 table) and my
>  patched version is 20x faster even with 64KB clusters.
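As a back-of-the-envelope check of the bracketed coverage figures in
the table above (assuming qcow2's 8-byte L2 entries):

    #include <stdio.h>

    int main(void)
    {
        struct { unsigned long long cluster, cache; } cfg[] = {
            { 64ULL << 10, 1ULL << 20 },  /* 64 KB clusters, 1 MB cache */
            {  2ULL << 20, 4ULL << 20 },  /*  2 MB clusters, 4 MB cache */
        };
        for (int i = 0; i < 2; i++) {
            /* one L2 table (one cluster of 8-byte entries) maps
             * (cluster / 8) guest clusters */
            unsigned long long per_table = (cfg[i].cluster / 8) * cfg[i].cluster;
            unsigned long long tables = cfg[i].cache / cfg[i].cluster;
            printf("%llu GB\n", tables * per_table >> 30);  /* 8, then 1024 */
        }
        return 0;
    }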

I suppose you choose the scenario so that the number of jobs is larger
than the number of cached L2 tables without the patch, but smaller than
the number of cache entries with the patch?

We will probably need to do some more benchmarking to find a good
default value for the cached chunks. 4k is nice and small, so we can
cover many parallel jobs without using too much memory. But if we have a
single sequential job, we may end up doing the metadata updates in
small 4k chunks instead of doing a single larger write.

Of course, if this starts becoming a problem (maybe unlikely?), we can
always change the cache code to gather any adjacent dirty chunks in the
cache when writing something out. The same goes for readahead, if we can
find a policy for when to evict entries that were read ahead.

>- We need a proper name for these sub-tables that we are loading
>  now. I'm actually still struggling with this :-) I can't think of
>  any name that is clear enough and not too cumbersome to use (L2
>  subtables? => Confusing. L3 tables? => they're not really that).

L2 table chunk? Or just L2 cache entry?

> I think I haven't forgotten anything. As I said I have a working
> prototype of this and if you like the idea I'd like to publish it
> soon. Any questions or comments will be appreciated.

Please do post it!

Kevin



Re: [Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration

2017-09-20 Thread Kevin Wolf
On 19.09.2017 at 21:42, Eric Blake wrote:
> However...
> 
> >> -sbc = limit >> BDRV_SECTOR_BITS;
> >>  assert(DIV_ROUND_UP(bm_size, limit) == tb_size);
> >>
> >> -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
> >> -uint64_t cluster = sector / sbc;
> >> +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
> >> +uint64_t cluster = offset / limit;
> 
> bdrv_dirty_iter_next() returns the next dirty bit (which is not
> necessarily the first bit in the cluster).  For the purposes of
> serialization, we want to serialize the entire cluster in one go, even
> though we will be serializing 0 bits up until the first dirty bit.  So
> offset at this point may be unaligned,

Ok, this is the part that I was missing. It makes a lot more sense now.

Also, I think 'cluster' meaning bitmap clusters and not qcow2 clusters
here confused me a bit.

> > The part that I'm missing yet is why we need to do it. The bitmap
> > granularity is also the granularity of bdrv_dirty_iter_next(), so isn't
> > offset already aligned and we could even assert that instead of aligning
> > down? (As long we enforce our restriction, which we seem to do in
> > bitmap_list_load().)
> 
> Sadly, a quick:
> [...]
> does NOT fail iotests 165, which appears to be the only test that
> actually hammers on qcow2 bitmaps (changing it to an 'assert(false)'
> only shows an effect on 165) - which means our test is NOT exercising
> all possible alignments.  And it's python-based, with lame output, which
> makes debugging it painful.  But never fear, for v9 I will improve the
> test to actually affect the bitmap at a point that would fail with my
> temporary assertion in place, and thus proving that we DO need to align
> down.  Note that test 165 is testing only a 1G image, but I just showed
> that 64k clusters with 64k granularity covers up to 32G of image space
> in one cluster of the bitmap, so the test is only covering one cluster
> of serialization in the first place, and as written, the test is
> dirtying byte 0, which explains why it happens to get an offset aligned
> to limit, even though that is not a valid assertion.
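For concreteness, a minimal sketch of the align-down being discussed
(names follow the quoted hunk; serialize_one_cluster() is a stand-in
and the exact v9 shape is my assumption):

    while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
        uint64_t cluster = offset / limit;

        /* The iterator yields the first dirty *bit*, which need not sit
         * at the start of its serialization cluster; align down so the
         * whole cluster is serialized in one go. */
        offset = QEMU_ALIGN_DOWN(offset, limit);
        serialize_one_cluster(bitmap, offset, MIN(bm_size - offset, limit));

        /* jump past this cluster so it is not serialized twice */
        bdrv_set_dirty_iter(dbi, (cluster + 1) * limit);
    }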

More tests are always welcome and a good argument for getting a series
merged. :-)

Kevin

