Re: [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-06 Thread Peter Lieven
Am 06.03.2018 um 17:35 schrieb Peter Lieven:
> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
>> On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
>>> * Peter Lieven (p...@kamp.de) wrote:
 Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015
>> and was curious what the reason was to choose 512MB as readahead. The
>> reason I ask is that I found that the source VM gets very unresponsive
>> I/O-wise while the initial 512MB are read, and furthermore seems to stay
>> unresponsive if we choose a high migration speed and have fast storage
>> on the destination VM.
>>
>> In our environment I modified this value to 16MB, which seems to work
>> much more smoothly. I wonder if we should make this a user-configurable
>> value, or define a different rate limit for the block transfer in the
>> bulk stage at least?
> I don't know if benchmarks were run when choosing the value.  From the
> commit description it sounds like the main purpose was to limit the
> amount of memory that can be consumed.
>
> 16 MB also fulfills that criteria :), but why is the source VM more
> responsive with a lower value?
>
> Perhaps the issue is queue depth on the storage device - the block
> migration code enqueues up to 512 MB worth of reads, and guest I/O has
> to wait?
 That is my guess. Especially if the destination storage is faster, we
 basically always have 512 I/Os in flight on the source storage.

 Does anyone mind if we reduce that value to 16MB, or do we need a better
 mechanism?
>>> We've got migration-parameters these days; you could connect it to one
>>> of those fairly easily I think.
>>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>>> already there.
>>> Then you can set it to whatever you like.
>> It would be nice to solve the performance problem without adding a
>> tuneable.
>>
>> On the other hand, QEMU has no idea what the queue depth of the device
>> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>>
>> 512 parallel requests is much too high.  Most parallel I/O benchmarking
>> is done at 32-64 queue depth.
>>
>> I think that 16 parallel requests is a reasonable maximum number for a
>> background job.
>>
>> We need to be clear though that the purpose of this change is unrelated
>> to the original 512 MB memory footprint goal.  It just happens to touch
>> the same constant but the goal is now to submit at most 16 I/O requests
>> in parallel to avoid monopolizing the I/O device.
> I think we should really look at this. The variables that control whether
> we stay in the while loop or not are incremented and decremented at the
> following places:
>
> mig_save_device_dirty:
> mig_save_device_bulk:
> block_mig_state.submitted++;
>
> blk_mig_read_cb:
> block_mig_state.submitted--;
> block_mig_state.read_done++;
>
> flush_blks:
> block_mig_state.read_done--;
>
> The condition of the while loop is:
>     ((block_mig_state.submitted + block_mig_state.read_done) * BLOCK_SIZE <
>          qemu_file_get_rate_limit(f) &&
>      (block_mig_state.submitted + block_mig_state.read_done) <
>          MAX_INFLIGHT_IO)
>
> At first I wonder if we ever reach the rate-limit because we put the read 
> buffers onto f AFTER we exit the while loop?
>
> And even if we reach the limit we constantly maintain 512 I/Os in parallel 
> because we immediately decrement read_done
> when we put the buffers to f in flush_blks. In the next iteration of the 
> while loop we then read again until we have 512 in-flight I/Os.
>
> And shouldn't we have a time limit to limit the time we stay in the while 
> loop? I think we artificially delay sending data to f?

Thinking about it for a while I would propose the following:

a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
b) add MAX_PARALLEL_IO with a value of 16
c) compare qemu_file_get_rate_limit only with block_mig_state.read_done

This would yield the following condition for the while loop:

(block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
 (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
 block_mig_state.submitted < MAX_PARALLEL_IO)
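
In migration/block.c the loop head would then look roughly like this (an
untested sketch of the proposal above; whether MAX_IO_BUFFERS keeps the old
value of 512 is of course up for discussion):

    #define MAX_IO_BUFFERS   512   /* renamed from MAX_INFLIGHT_IO */
    #define MAX_PARALLEL_IO   16

        while (block_mig_state.read_done * BLOCK_SIZE <
                   qemu_file_get_rate_limit(f) &&
               (block_mig_state.submitted + block_mig_state.read_done) <
                   MAX_IO_BUFFERS &&
               block_mig_state.submitted < MAX_PARALLEL_IO) {
            /* read one more block and queue it for sending */
            ...
        }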

Does that sound like a plan?

Peter





Re: [Qemu-block] [PATCH 2/2] iotests: add 208 nbd-server + blockdev-snapshot-sync test case

2018-03-06 Thread Stefano Panella
I have applied this patch and when I run the following QMP commands I do
not see the crash anymore, but there is still something wrong because only
/root/a is opened by qemu. It looks like nbd-server-stop is also getting
rid of the nodes added with blockdev-snapshot-sync; therefore it is then not
possible to do blockdev-del on /root/d because the node-name is not found.

{ "execute": "qmp_capabilities" }
{
"execute": "blockdev-add",
"arguments": {
"driver": "qcow2",
"node-name": "/root/a",
"discard": "unmap",
"cache": {
"direct": true
},
"file": {
"driver": "file",
"filename": "/root/a"
}
}
}

{
"execute": "nbd-server-start",
"arguments": {
"addr": {
"type": "unix",
"data": {
"path": "/tmp/nbd.test1"
}
}
}
}

{
"execute": "nbd-server-add",
"arguments": {
"device": "/root/a",
"writable": true
}
}


{
"execute": "blockdev-snapshot-sync",
"arguments": {
"node-name": "/root/a",
"snapshot-node-name": "/root/b",
"snapshot-file": "/root/b"
}
}
{
"execute": "blockdev-snapshot-sync",
"arguments": {
"node-name": "/root/b",
"snapshot-node-name": "/root/c",
"snapshot-file": "/root/c"
}
}
{
"execute": "blockdev-snapshot-sync",
"arguments": {
"node-name": "/root/c",
"snapshot-node-name": "/root/d",
"snapshot-file": "/root/d"
}
}

{
"execute": "nbd-server-stop"
}


On Tue, Mar 6, 2018 at 8:48 PM, Stefan Hajnoczi  wrote:
>
> This test case adds an NBD server export and then invokes
> blockdev-snapshot-sync, which changes the BlockDriverState node that the
> NBD server's BlockBackend points to.  This is an interesting scenario to
> test and exercises the code path fixed by the previous commit.
>
> Signed-off-by: Stefan Hajnoczi 
> ---
>  tests/qemu-iotests/208 | 55 ++
>  tests/qemu-iotests/208.out |  9 
>  tests/qemu-iotests/group   |  1 +
>  3 files changed, 65 insertions(+)
>  create mode 100755 tests/qemu-iotests/208
>  create mode 100644 tests/qemu-iotests/208.out
>
> diff --git a/tests/qemu-iotests/208 b/tests/qemu-iotests/208
> new file mode 100755
> index 00..4e82b96c82
> --- /dev/null
> +++ b/tests/qemu-iotests/208
> @@ -0,0 +1,55 @@
> +#!/usr/bin/env python
> +#
> +# Copyright (C) 2018 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +#
> +# Creator/Owner: Stefan Hajnoczi 
> +#
> +# Check that the runtime NBD server does not crash when stopped after
> +# blockdev-snapshot-sync.
> +
> +import iotests
> +
> +with iotests.FilePath('disk.img') as disk_img_path, \
> + iotests.FilePath('disk-snapshot.img') as disk_snapshot_img_path, \
> + iotests.FilePath('nbd.sock') as nbd_sock_path, \
> + iotests.VM() as vm:
> +
> +    img_size = '10M'
> +    iotests.qemu_img_pipe('create', '-f', iotests.imgfmt, disk_img_path, img_size)
> +
> +    iotests.log('Launching VM...')
> +    (vm.add_drive(disk_img_path, 'node-name=drive0-node', interface='none')
> +       .launch())
> +
> +    iotests.log('Starting NBD server...')
> +    iotests.log(vm.qmp('nbd-server-start', addr={
> +        "type": "unix",
> +        "data": {
> +            "path": nbd_sock_path,
> +        }
> +    }))
> +
> +    iotests.log('Adding NBD export...')
> +    iotests.log(vm.qmp('nbd-server-add', device='drive0-node', writable=True))
> +
> +    iotests.log('Creating external snapshot...')
> +    iotests.log(vm.qmp('blockdev-snapshot-sync',
> +                       node_name='drive0-node',
> +                       snapshot_node_name='drive0-snapshot-node',
> +                       snapshot_file=disk_snapshot_img_path))
> +
> +    iotests.log('Stopping NBD server...')
> +    iotests.log(vm.qmp('nbd-server-stop'))
> diff --git a/tests/qemu-iotests/208.out b/tests/qemu-iotests/208.out
> new file mode 100644
> index 00..3687e9d0dd
> --- /dev/null
> +++ b/tests/qemu-iotests/208.out
> @@ -0,0 +1,9 @@
> +Launching VM...
> +Starting NBD server...
> +{u'return': {}}
> +Adding NBD export...
> +{u'return': {}}
> +Creating external snapshot...
> +{u'return': {}}
> +Stopping NBD server...
> +{u'return': {}}
> diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
> index a2dfe79d86..01c03019dd 100644
> --- a/tests/qemu-iotests/group
> +++ b/tests/qemu-iotests/group
> @@ -202,3 +202,4 @@
>  203 rw auto
>  204 rw auto quick
>  205 rw auto quick
> +208 rw auto quick
> --
> 2.14.3
>


Re: [Qemu-block] [Qemu-devel] [RFC v4 15/21] blockjobs: add prepare callback

2018-03-06 Thread John Snow


On 02/28/2018 12:04 PM, Kevin Wolf wrote:
> Am 24.02.2018 um 00:51 hat John Snow geschrieben:
>> Some jobs upon finalization may need to perform some work that can
>> still fail. If these jobs are part of a transaction, it's important
>> that these callbacks fail the entire transaction.
>>
>> We allow for a new callback in addition to commit/abort/clean that
>> allows us the opportunity to have fairly late-breaking failures
>> in the transactional process.
>>
>> The expected flow is:
>>
>> - All jobs in a transaction converge to the WAITING state
>>   (added in a forthcoming commit)
>> - All jobs prepare to call either commit/abort
>> - If any job fails, is canceled, or fails preparation, all jobs
>>   call their .abort callback.
>> - All jobs enter the PENDING state, awaiting manual intervention
>>   (also added in a forthcoming commit)
>> - block-job-finalize is issued by the user/management layer
>> - All jobs call their commit callbacks.
>>
>> Signed-off-by: John Snow 
> 
> You almost made me believe the scary thought that we need transactional
> graph modifications, but after writing half of the reply, I think it's
> just that your order here is wrong.
> 

Sorry, yes, this blurb was outdated. I regret that it wasted your time.

> So .prepare is the last thing in the whole process that is allowed to
> fail. Graph manipulations such as bdrv_replace_node() can fail. Graph
> manipulations can also only be made in response to block-job-finalize
> because the management layer must be aware of them. Take them together
> and you have a problem.
> 
> Didn't we already establish earlier that .prepare/.commit/.abort must be
> called together and cannot be separated by waiting for a QMP command
> because of locking and things?
> 

Right; so what really happens is that in response to the FINALIZE verb,
the prepare loop is done first to check for success, and then commit or
abort are dispatched as appropriate.

> So if you go to PENDING first, then wait for block-job-finalize and only
> then call .prepare/.commit/.abort, we should be okay for both problems.
> 
> And taking a look at the final state, that seems to be what you do, so
> in the end, it's probably just the commit message that needs a fix.

Yep, sorry.

> 
>>  blockjob.c   | 34 +++---
>>  include/block/blockjob_int.h | 10 ++
>>  2 files changed, 41 insertions(+), 3 deletions(-)
>>
>> diff --git a/blockjob.c b/blockjob.c
>> index 8f02c03880..1c010ec100 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -394,6 +394,18 @@ static void block_job_update_rc(BlockJob *job)
>>  }
>>  }
>>  
>> +static int block_job_prepare(BlockJob *job)
>> +{
>> +    if (job->ret) {
>> +        goto out;
>> +    }
>> +    if (job->driver->prepare) {
>> +        job->ret = job->driver->prepare(job);
>> +    }
>> + out:
>> +    return job->ret;
>> +}
> 
> Why not just if (job->ret == 0 && job->driver->prepare) and save the
> goto?
> 

Churn. ¯\_(ツ)_/¯
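
(For reference, the goto-less variant Kevin is suggesting would be roughly:

    static int block_job_prepare(BlockJob *job)
    {
        if (job->ret == 0 && job->driver->prepare) {
            job->ret = job->driver->prepare(job);
        }
        return job->ret;
    }

same behaviour, just without the label.)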

> Kevin
> 



Re: [Qemu-block] [PATCH] block: include original filename when reporting invalid URIs

2018-03-06 Thread Jeff Cody
On Tue, Feb 06, 2018 at 10:52:04AM +, Daniel P. Berrangé wrote:
> Consider passing a JSON based block driver to "qemu-img commit"
> 
> $ qemu-img commit 'json:{"driver":"qcow2","file":{"driver":"gluster",\
>   "volume":"gv0","path":"sn1.qcow2",
>   "server":[{"type":\
> "tcp","host":"10.73.199.197","port":"24007"}]},}'
> 
> Currently it will commit the content and then report an incredibly
> useless error message when trying to re-open the committed image:
> 
>   qemu-img: invalid URI
>   Usage: file=gluster[+transport]://[host[:port]]volume/path[?socket=...][,file.debug=N][,file.logfile=/path/filename.log]
> 
> With this fix we get:
> 
>   qemu-img: invalid URI json:{"server.0.host": "10.73.199.197",
>   "driver": "gluster", "path": "luks.qcow2", "server.0.type":
>   "tcp", "server.0.port": "24007", "volume": "gv0"}
> 
> Of course the root cause problem still exists, but now we know
> what actually needs fixing.
> 
> Signed-off-by: Daniel P. Berrangé 
> ---
>  block/gluster.c  | 2 +-
>  block/sheepdog.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/gluster.c b/block/gluster.c
> index 0f4265a3a4..0215e19087 100644
> --- a/block/gluster.c
> +++ b/block/gluster.c
> @@ -660,7 +660,7 @@ static struct glfs *qemu_gluster_init(BlockdevOptionsGluster *gconf,
>  if (filename) {
>  ret = qemu_gluster_parse_uri(gconf, filename);
>  if (ret < 0) {
> -error_setg(errp, "invalid URI");
> +error_setg(errp, "invalid URI %s", filename);
>  error_append_hint(errp, "Usage: file=gluster[+transport]://"
>  "[host[:port]]volume/path[?socket=...]"
>  "[,file.debug=N]"
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index f684477328..c847ab6c98 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -1050,7 +1050,7 @@ static void sd_parse_uri(SheepdogConfig *cfg, const char *filename,
>  
>  cfg->uri = uri = uri_parse(filename);
>  if (!uri) {
> -        error_setg(&err, "invalid URI");
> +        error_setg(&err, "invalid URI '%s'", filename);
>  goto out;
>  }
>  
> -- 
> 2.14.3
> 

Thanks,

Applied to my block branch:

git://github.com/codyprime/qemu-kvm-jtc block

-Jeff



Re: [Qemu-block] [PATCH 0/2] block/ssh: Implement .bdrv_refresh_filename()

2018-03-06 Thread John Snow


On 02/05/2018 03:22 PM, Max Reitz wrote:
> This series implements .bdrv_refresh_filename() for the ssh block
> driver, along with an appropriate .bdrv_dirname() so we don't chop off
> query strings for backing files with relative filenames.
> 
> This series depends on my “block: Fix some filename generation issues”
> series and on Pino's “ssh: switch from libssh2 to libssh” patch.
> 
> Based-on: 20180205151835.20812-1-mre...@redhat.com
> Based-on: 20180118164439.2120-1-ptosc...@redhat.com
> 
> 
> Max Reitz (2):
>   block/ssh: Implement .bdrv_refresh_filename()
>   block/ssh: Implement .bdrv_dirname()
> 
>  block/ssh.c | 72 +++--
>  1 file changed, 65 insertions(+), 7 deletions(-)
> 

Did this one rot on the vine?

>1 month old.



[Qemu-block] [PATCH 1/2] block: let blk_add/remove_aio_context_notifier() tolerate BDS changes

2018-03-06 Thread Stefan Hajnoczi
Commit 2019ba0a0197 ("block: Add AioContextNotifier functions to BB")
added blk_add/remove_aio_context_notifier() and implemented them by
passing through the bdrv_*() equivalent.

This doesn't work across bdrv_append(), which detaches child->bs and
re-attaches it to a new BlockDriverState.  When
blk_remove_aio_context_notifier() is called we will access the new BDS
instead of the one where the notifier was added!

From the point of view of the blk_*() API user, changes to the root BDS
should be transparent.

This patch maintains a list of AioContext notifiers in BlockBackend and
adds/removes them from the BlockDriverState as needed.

Reported-by: Stefano Panella 
Cc: Max Reitz 
Signed-off-by: Stefan Hajnoczi 
---
 block/block-backend.c | 63 +++
 block/trace-events|  2 ++
 2 files changed, 65 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 94ffbb6a60..aa27698820 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -31,6 +31,13 @@
 
 static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 
+typedef struct BlockBackendAioNotifier {
+    void (*attached_aio_context)(AioContext *new_context, void *opaque);
+    void (*detach_aio_context)(void *opaque);
+    void *opaque;
+    QLIST_ENTRY(BlockBackendAioNotifier) list;
+} BlockBackendAioNotifier;
+
 struct BlockBackend {
 char *name;
 int refcnt;
@@ -69,6 +76,7 @@ struct BlockBackend {
 bool allow_write_beyond_eof;
 
 NotifierList remove_bs_notifiers, insert_bs_notifiers;
+QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
 
 int quiesce_counter;
 VMChangeStateEntry *vmsh;
@@ -239,6 +247,36 @@ static int blk_root_inactivate(BdrvChild *child)
 return 0;
 }
 
+static void blk_root_attach(BdrvChild *child)
+{
+    BlockBackend *blk = child->opaque;
+    BlockBackendAioNotifier *notifier;
+
+    trace_blk_root_attach(child, blk, child->bs);
+
+    QLIST_FOREACH(notifier, &blk->aio_notifiers, list) {
+        bdrv_add_aio_context_notifier(child->bs,
+                notifier->attached_aio_context,
+                notifier->detach_aio_context,
+                notifier->opaque);
+    }
+}
+
+static void blk_root_detach(BdrvChild *child)
+{
+    BlockBackend *blk = child->opaque;
+    BlockBackendAioNotifier *notifier;
+
+    trace_blk_root_detach(child, blk, child->bs);
+
+    QLIST_FOREACH(notifier, &blk->aio_notifiers, list) {
+        bdrv_remove_aio_context_notifier(child->bs,
+                notifier->attached_aio_context,
+                notifier->detach_aio_context,
+                notifier->opaque);
+    }
+}
+
 static const BdrvChildRole child_root = {
 .inherit_options= blk_root_inherit_options,
 
@@ -252,6 +290,9 @@ static const BdrvChildRole child_root = {
 
 .activate   = blk_root_activate,
 .inactivate = blk_root_inactivate,
+
+.attach = blk_root_attach,
+.detach = blk_root_detach,
 };
 
 /*
@@ -279,6 +320,7 @@ BlockBackend *blk_new(uint64_t perm, uint64_t shared_perm)
 
     notifier_list_init(&blk->remove_bs_notifiers);
     notifier_list_init(&blk->insert_bs_notifiers);
+    QLIST_INIT(&blk->aio_notifiers);
 
     QTAILQ_INSERT_TAIL(&block_backends, blk, link);
     return blk;
@@ -356,6 +398,7 @@ static void blk_delete(BlockBackend *blk)
     }
     assert(QLIST_EMPTY(&blk->remove_bs_notifiers.notifiers));
     assert(QLIST_EMPTY(&blk->insert_bs_notifiers.notifiers));
+    assert(QLIST_EMPTY(&blk->aio_notifiers));
     QTAILQ_REMOVE(&block_backends, blk, link);
     drive_info_del(blk->legacy_dinfo);
     block_acct_cleanup(&blk->stats);
@@ -1813,8 +1856,15 @@ void blk_add_aio_context_notifier(BlockBackend *blk,
 void (*attached_aio_context)(AioContext *new_context, void *opaque),
 void (*detach_aio_context)(void *opaque), void *opaque)
 {
+    BlockBackendAioNotifier *notifier;
     BlockDriverState *bs = blk_bs(blk);
 
+    notifier = g_new(BlockBackendAioNotifier, 1);
+    notifier->attached_aio_context = attached_aio_context;
+    notifier->detach_aio_context = detach_aio_context;
+    notifier->opaque = opaque;
+    QLIST_INSERT_HEAD(&blk->aio_notifiers, notifier, list);
+
     if (bs) {
         bdrv_add_aio_context_notifier(bs, attached_aio_context,
                                       detach_aio_context, opaque);
@@ -1827,12 +1877,25 @@ void blk_remove_aio_context_notifier(BlockBackend *blk,
  void (*detach_aio_context)(void *),
  void *opaque)
 {
+    BlockBackendAioNotifier *notifier;
     BlockDriverState *bs = blk_bs(blk);
 
     if (bs) {
         bdrv_remove_aio_context_notifier(bs, attached_aio_context,
                                          detach_aio_context, opaque);
     }
+
+    QLIST_FOREACH(notifier, &blk->aio_notifiers, list) {
+        if (notifier->attached_aio_context == attached_aio_context &&
+            notifier->detach_aio_context 

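For context, a caller of this API looks roughly like the sketch below; the
my_* names are invented for illustration, only the
blk_add/remove_aio_context_notifier() prototypes shown above are real. The
point of the patch is that such a registration now keeps following the
BlockBackend's root node across bdrv_append():

    #include "sysemu/block-backend.h"

    static void my_attached_aio_context(AioContext *new_context, void *opaque)
    {
        /* re-create timers/bottom halves in new_context */
    }

    static void my_detach_aio_context(void *opaque)
    {
        /* drop anything still tied to the old AioContext */
    }

    static void my_device_init(BlockBackend *blk, void *dev)
    {
        blk_add_aio_context_notifier(blk, my_attached_aio_context,
                                     my_detach_aio_context, dev);
    }

    static void my_device_cleanup(BlockBackend *blk, void *dev)
    {
        blk_remove_aio_context_notifier(blk, my_attached_aio_context,
                                        my_detach_aio_context, dev);
    }
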
[Qemu-block] [PATCH 2/2] iotests: add 208 nbd-server + blockdev-snapshot-sync test case

2018-03-06 Thread Stefan Hajnoczi
This test case adds an NBD server export and then invokes
blockdev-snapshot-sync, which changes the BlockDriverState node that the
NBD server's BlockBackend points to.  This is an interesting scenario to
test and exercises the code path fixed by the previous commit.

Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/208 | 55 ++
 tests/qemu-iotests/208.out |  9 
 tests/qemu-iotests/group   |  1 +
 3 files changed, 65 insertions(+)
 create mode 100755 tests/qemu-iotests/208
 create mode 100644 tests/qemu-iotests/208.out

diff --git a/tests/qemu-iotests/208 b/tests/qemu-iotests/208
new file mode 100755
index 00..4e82b96c82
--- /dev/null
+++ b/tests/qemu-iotests/208
@@ -0,0 +1,55 @@
+#!/usr/bin/env python
+#
+# Copyright (C) 2018 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# Creator/Owner: Stefan Hajnoczi 
+#
+# Check that the runtime NBD server does not crash when stopped after
+# blockdev-snapshot-sync.
+
+import iotests
+
+with iotests.FilePath('disk.img') as disk_img_path, \
+ iotests.FilePath('disk-snapshot.img') as disk_snapshot_img_path, \
+ iotests.FilePath('nbd.sock') as nbd_sock_path, \
+ iotests.VM() as vm:
+
+    img_size = '10M'
+    iotests.qemu_img_pipe('create', '-f', iotests.imgfmt, disk_img_path, img_size)
+
+    iotests.log('Launching VM...')
+    (vm.add_drive(disk_img_path, 'node-name=drive0-node', interface='none')
+       .launch())
+
+    iotests.log('Starting NBD server...')
+    iotests.log(vm.qmp('nbd-server-start', addr={
+        "type": "unix",
+        "data": {
+            "path": nbd_sock_path,
+        }
+    }))
+
+    iotests.log('Adding NBD export...')
+    iotests.log(vm.qmp('nbd-server-add', device='drive0-node', writable=True))
+
+    iotests.log('Creating external snapshot...')
+    iotests.log(vm.qmp('blockdev-snapshot-sync',
+                       node_name='drive0-node',
+                       snapshot_node_name='drive0-snapshot-node',
+                       snapshot_file=disk_snapshot_img_path))
+
+    iotests.log('Stopping NBD server...')
+    iotests.log(vm.qmp('nbd-server-stop'))
diff --git a/tests/qemu-iotests/208.out b/tests/qemu-iotests/208.out
new file mode 100644
index 00..3687e9d0dd
--- /dev/null
+++ b/tests/qemu-iotests/208.out
@@ -0,0 +1,9 @@
+Launching VM...
+Starting NBD server...
+{u'return': {}}
+Adding NBD export...
+{u'return': {}}
+Creating external snapshot...
+{u'return': {}}
+Stopping NBD server...
+{u'return': {}}
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index a2dfe79d86..01c03019dd 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -202,3 +202,4 @@
 203 rw auto
 204 rw auto quick
 205 rw auto quick
+208 rw auto quick
-- 
2.14.3




[Qemu-block] [PATCH 0/2] block: fix nbd-server-stop crash after blockdev-snapshot-sync

2018-03-06 Thread Stefan Hajnoczi
The blockdev-snapshot-sync command uses bdrv_append() to update all parents to
point at the external snapshot node.  This breaks BlockBackend's
blk_add/remove_aio_context_notifier(), which doesn't expect a BDS change.

Patch 1 fixes this by tracking AioContext notifiers in BlockBackend.

See the test case in Patch 2 for a reproducer.

Stefan Hajnoczi (2):
  block: let blk_add/remove_aio_context_notifier() tolerate BDS changes
  iotests: add 208 nbd-server + blockdev-snapshot-sync test case

 block/block-backend.c  | 63 ++
 block/trace-events |  2 ++
 tests/qemu-iotests/208 | 55 
 tests/qemu-iotests/208.out |  9 +++
 tests/qemu-iotests/group   |  1 +
 5 files changed, 130 insertions(+)
 create mode 100755 tests/qemu-iotests/208
 create mode 100644 tests/qemu-iotests/208.out

-- 
2.14.3




Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/2] block latency histogram

2018-03-06 Thread Emilio G. Cota
On Tue, Mar 06, 2018 at 16:00:17 +, Stefan Hajnoczi wrote:
> On Wed, Feb 07, 2018 at 03:50:35PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > Vladimir Sementsov-Ogievskiy (2):
> >   block/accounting: introduce latency histogram
> >   qapi: add block latency histogram interface
(snip)
> >  5 files changed, 228 insertions(+), 1 deletion(-)
> 
> The feature looks good.  I posted documentation and code readability
> suggestions.

Hi Vladimir,

Did you consider using qdist? For reference, see commit bf3afd5f419
and/or grep 'qdist'.
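
From memory the interface is roughly the following (just a sketch, please
double-check include/qemu/qdist.h for the exact signatures):

    #include "qemu/osdep.h"
    #include "qemu/qdist.h"

    static struct qdist latency_hist;

    static void latency_hist_init(void)
    {
        qdist_init(&latency_hist);
    }

    static void latency_hist_account(int64_t latency_ns)
    {
        /* add one sample with the given value */
        qdist_inc(&latency_hist, latency_ns);
    }

    static void latency_hist_report(void)
    {
        printf("avg latency: %f ns\n", qdist_avg(&latency_hist));
    }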

Thanks,

Emilio



Re: [Qemu-block] [Qemu-devel] [RFC] qemu-img: Drop BLK_ZERO from convert

2018-03-06 Thread Kevin Wolf
Am 06.03.2018 um 14:47 hat Stefan Hajnoczi geschrieben:
> On Wed, Feb 28, 2018 at 09:11:32PM +0100, Max Reitz wrote:
> > On 2018-02-28 19:08, Max Reitz wrote:
> > > On 2018-02-27 17:17, Stefan Hajnoczi wrote:
> > >> On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
> > >>> There are filesystems (among which is tmpfs) that have a hard time
> > >>> reporting allocation status.  That is definitely a bug in them.
> > >>>
> > >>> However, there is no good reason why qemu-img convert should query the
> > >>> allocation status in the first place.  It does zero detection by itself
> > >>> anyway, so we can detect unallocated areas ourselves.
> > >>>
> > >>> Furthermore, if a filesystem driver has any sense, reading unallocated
> > >>> data should take just as much time as lseek(SEEK_DATA) + memset().  So
> > >>> the only overhead we introduce by dropping the manual lseek() call is a
> > >>> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
> > >>> should be relatively quick.
> > >>
> > >> This makes sense.  Which file systems did you test this patch on?
> > > 
> > > On tmpfs and xfs, so far.
> > > 
> > >> XFS, ext4, and tmpfs would be a good minimal test set to prove the
> > >> patch.  Perhaps with two input files:
> > >> 1. A file that is mostly filled with data.
> > >> 2. A file that is only sparsely populated with data.
> > > 
> > > And probably with vmdk, which (by default) forbids querying any areas
> > > larger than 64 kB.
> > > 
> > >> The time taken should be comparable with the time before this patch.
> > > 
> > > Yep, I'll do some benchmarks.
> > 
> > And the results are in.  I've created 2 GB images on various filesystems
> > in various formats, then I've either written 64 kB every 32 MB to them
> > ("sparse"), or left out 64 kB every 32 MB ("full").  Then I've converted
> > them to null-co:// and took the (real) time through "time". (Script is
> > attached.)
> > 
> > I've attached the raw results before and after this patch.  Usually, I
> > did six runs for each case and dropped the most extreme outlier --
> > except for full vmdk images, where I've only done one run for each case
> > because creating these images can take a very long time.
> > 
> > Here are the differences from before to after:
> > 
> > sparse raw on tmpfs:+ 19 % (436 ms to 520 ms)
> > sparse qcow2 on tmpfs:  - 31 % (435 ms to 301 ms)
> > sparse vmdk on tmpfs:   + 37 % (214 ms to 294 ms)
> > 
> > sparse raw on xfs:  + 69 % (452 ms to 762 ms)
> > sparse qcow2 on xfs:- 34 % (462 ms to 304 ms)
> > sparse vmdk on xfs: + 42 % (210 ms to 298 ms)
> > 
> > sparse raw on ext4: +360 % (144 ms to 655 ms)
> > sparse qcow2 on ext4:   +120 % (147 ms to 330 ms)
> > sparse vmdk on ext4:+ 16 % (253 ms to 293 ms)
> > 
> > 
> > full raw on tmpfs:  -  9 % (437 ms to 398 ms)
> > full qcow2 on tmpfs:- 75 % (1.63 s to 403 ms)
> > full vmdk on tmpfs: -100 % (10 min to 767 ms)
> > 
> > full raw on xfs:-  1 % (407 ms to 404 ms, insignificant)
> > full qcow2 on xfs:  -  1 % (410 ms to 404 ms, insignificant)
> > full vmdk on xfs:   - 33 % (1.05 s to 695 ms)
> > 
> > 
> > 
> > 
> > full raw on ext4:   -  2 % (308 ms to 301 ms, insignificant)
> > full qcow2 on ext4: +  2 % (307 ms to 312 ms, insignificant)
> > full vmdk on ext4:  - 74 % (3.53 s to 839 ms)
> > 
> > 
> > So...  It's more extreme than I had hoped, that's for sure.  What I
> > conclude from this is:
> > 
> > (1) This patch is generally good for nearly fully allocated images.  In
> > the worst case (on well-behaving filesystems with well-optimized image
> > formats) it changes nothing.  In the best case, conversion time is
> > reduced drastically.

This makes sense. Asking the kernel whether a block is zero only helps
the performance if the result is yes occasionally, otherwise it's just
wasted work.

Maybe we could try to guess the ratio by comparing the number of
allocated blocks in the image file and the virtual disk size or
something? Then we could only call lseek() when we can actually expect
an improvement from it.
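
Something along these lines, as a rough standalone sketch (the 10% threshold
and the helper name are invented; inside qemu we would of course use
bdrv_get_allocated_file_size()/bdrv_getlength() rather than stat()):

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/stat.h>

    /* Only ask the kernel for block status if enough of the file is
     * unallocated for lseek(SEEK_DATA/SEEK_HOLE) to be likely to pay off. */
    static bool want_block_status(const char *filename, uint64_t virtual_size)
    {
        struct stat st;
        uint64_t allocated;

        if (stat(filename, &st) < 0) {
            return true;    /* unknown, keep the current behaviour */
        }

        /* st_blocks is in 512-byte units */
        allocated = (uint64_t)st.st_blocks * 512;

        /* require at least ~10% of the virtual size to be unallocated */
        return allocated + virtual_size / 10 < virtual_size;
    }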

> > (2) For sparse raw images, this is absolutely devastating.  Reading them
> > now takes more than (ext4) or nearly (xfs) twice as much time as reading
> > a fully allocated image.  So much for "if a filesystem driver has any
> > sense".

Are you sure that only the filesystem is the problem? Checking for every
single byte of an image whether it is zero has to cost some performance.
The fully allocated image doesn't suffer from this because (a) it only
has to check the first byte in each block and (b) the cost was already
there before this patch.

In fact, if null-co supported .bdrv_co_pwrite_zeroes, I think you would
get even worse results for your patch because then the pre-patch version
doesn't even have to do the memset().

> > (2a) It might be worth noting that on xfs, reading the sparse file took
> > longer even before this patch...
> > 
> > (3) qcow2 is different: It benefits from 

Re: [Qemu-block] [PATCH v2 0/7] Add checks for corruption in the snapshot table

2018-03-06 Thread Kevin Wolf
Am 06.03.2018 um 17:14 hat Alberto Garcia geschrieben:
> Hey,
> 
> here's the new version of this series. It fixes a leak reported by
> Kevin and adds a couple of error_report_err() to make use of the
> message returned by qcow2_validate_table().

Thanks, applied to the block branch.

Kevin



Re: [Qemu-block] [PATCH] iotests: Update output of 051 and 186 after commit 1454509726719e0933c

2018-03-06 Thread Thomas Huth
On 06.03.2018 17:45, Alberto Garcia wrote:
> Signed-off-by: Alberto Garcia 
> ---
>  tests/qemu-iotests/051.pc.out | 20 
>  tests/qemu-iotests/186.out| 22 +++---
>  2 files changed, 3 insertions(+), 39 deletions(-)
> 
> diff --git a/tests/qemu-iotests/051.pc.out b/tests/qemu-iotests/051.pc.out
> index 830c11880a..b01f9a90d7 100644
> --- a/tests/qemu-iotests/051.pc.out
> +++ b/tests/qemu-iotests/051.pc.out
> @@ -117,20 +117,10 @@ Testing: -drive if=ide,media=cdrom
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) quit
>  
> -Testing: -drive if=scsi,media=cdrom
> -QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) QEMU_PROG: -drive if=scsi,media=cdrom: warning: bus=0,unit=0 is 
> deprecated with this machine type
> -quit
> -
>  Testing: -drive if=ide
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) QEMU_PROG: Initialization of device ide-hd failed: Device needs 
> media, but drive is empty
>  
> -Testing: -drive if=scsi
> -QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) QEMU_PROG: -drive if=scsi: warning: bus=0,unit=0 is deprecated with 
> this machine type
> -QEMU_PROG: -drive if=scsi: Device needs media, but drive is empty
> -
>  Testing: -drive if=virtio
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) QEMU_PROG: -drive if=virtio: Device needs media, but drive is empty
> @@ -170,20 +160,10 @@ Testing: -drive 
> file=TEST_DIR/t.qcow2,if=ide,media=cdrom,readonly=on
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) quit
>  
> -Testing: -drive file=TEST_DIR/t.qcow2,if=scsi,media=cdrom,readonly=on
> -QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) QEMU_PROG: -drive 
> file=TEST_DIR/t.qcow2,if=scsi,media=cdrom,readonly=on: warning: bus=0,unit=0 
> is deprecated with this machine type
> -quit
> -
>  Testing: -drive file=TEST_DIR/t.qcow2,if=ide,readonly=on
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) QEMU_PROG: Initialization of device ide-hd failed: Block node is 
> read-only
>  
> -Testing: -drive file=TEST_DIR/t.qcow2,if=scsi,readonly=on
> -QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) QEMU_PROG: -drive file=TEST_DIR/t.qcow2,if=scsi,readonly=on: warning: 
> bus=0,unit=0 is deprecated with this machine type
> -quit
> -
>  Testing: -drive file=TEST_DIR/t.qcow2,if=virtio,readonly=on
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) quit

Ack for that part.

> diff --git a/tests/qemu-iotests/186.out b/tests/qemu-iotests/186.out
> index c8377fe146..d83bba1a88 100644
> --- a/tests/qemu-iotests/186.out
> +++ b/tests/qemu-iotests/186.out
> @@ -444,31 +444,15 @@ ide0-cd0 (NODE_NAME): null-co:// (null-co, read-only)
>  
>  Testing: -drive if=scsi,driver=null-co
>  QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) QEMU_PROG: -drive if=scsi,driver=null-co: warning: bus=0,unit=0 is 
> deprecated with this machine type
> -info block
> -scsi0-hd0 (NODE_NAME): null-co:// (null-co)
> -Attached to:  /machine/unattached/device[27]/scsi.0/legacy[0]
> -Cache mode:   writeback
> -(qemu) quit
> +(qemu) QEMU_PROG: -drive if=scsi,driver=null-co: machine type does not 
> support if=scsi,bus=0,unit=0
>  
>  Testing: -drive if=scsi,media=cdrom
>  QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) QEMU_PROG: -drive if=scsi,media=cdrom: warning: bus=0,unit=0 is 
> deprecated with this machine type
> -info block
> -scsi0-cd0: [not inserted]
> -Attached to:  /machine/unattached/device[27]/scsi.0/legacy[0]
> -Removable device: not locked, tray closed
> -(qemu) quit
> +(qemu) QEMU_PROG: -drive if=scsi,media=cdrom: machine type does not support 
> if=scsi,bus=0,unit=0
>  
>  Testing: -drive if=scsi,driver=null-co,media=cdrom
>  QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) QEMU_PROG: -drive if=scsi,driver=null-co,media=cdrom: warning: 
> bus=0,unit=0 is deprecated with this machine type
> -info block
> -scsi0-cd0 (NODE_NAME): null-co:// (null-co, read-only)
> -Attached to:  /machine/unattached/device[27]/scsi.0/legacy[0]
> -Removable device: not locked, tray closed
> -Cache mode:   writeback
> -(qemu) quit
> +(qemu) QEMU_PROG: -drive if=scsi,driver=null-co,media=cdrom: machine type 
> does not support if=scsi,bus=0,unit=0

That rather sounds like this "if=scsi" test should be removed now?

 Thomas



Re: [Qemu-block] [PATCH v4 0/8] Call check and invalidate_cache from coroutine context

2018-03-06 Thread Kevin Wolf
Am 01.03.2018 um 17:36 hat Paolo Bonzini geschrieben:
> Check and invalidate_cache share some parts of the implementation
> with the regular I/O path.  This is sometimes complicated because the
> I/O path wants to use a CoMutex but that is not possible outside coroutine
> context.  By moving things to coroutine context, we can remove special
> cases.  In fact, invalidate_cache is already called from coroutine context
> because incoming migration is placed in a coroutine.
> 
> I'm including two patches from Stefan to rename the bdrv_create callback
> to bdrv_co_create, because it is already called from coroutine context.
> The name is now bdrv_co_create_opts, with bdrv_co_create reserved for
> the QAPI-based version that Kevin is working on.
> 
> qcow2 still has cache flushing in non-coroutine context, coming from
> qcow2_reopen_prepare->qcow2_update_options_prepare,
> qcow2_close->qcow2_inactivate and several dirty bitmap functions.

Thanks, applied to the block branch.

Kevin



Re: [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-06 Thread Peter Lieven
Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
> On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
>> * Peter Lieven (p...@kamp.de) wrote:
>>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
 On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015
> and was curious what the reason was to choose 512MB as readahead. The
> reason I ask is that I found that the source VM gets very unresponsive
> I/O-wise while the initial 512MB are read, and furthermore seems to stay
> unresponsive if we choose a high migration speed and have fast storage
> on the destination VM.
>
> In our environment I modified this value to 16MB, which seems to work
> much more smoothly. I wonder if we should make this a user-configurable
> value, or define a different rate limit for the block transfer in the
> bulk stage at least?
 I don't know if benchmarks were run when choosing the value.  From the
 commit description it sounds like the main purpose was to limit the
 amount of memory that can be consumed.

 16 MB also fulfills that criteria :), but why is the source VM more
 responsive with a lower value?

 Perhaps the issue is queue depth on the storage device - the block
 migration code enqueues up to 512 MB worth of reads, and guest I/O has
 to wait?
>>> That is my guess. Especially if the destination storage is faster, we
>>> basically always have 512 I/Os in flight on the source storage.
>>>
>>> Does anyone mind if we reduce that value to 16MB, or do we need a better
>>> mechanism?
>> We've got migration-parameters these days; you could connect it to one
>> of those fairly easily I think.
>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>> already there.
>> Then you can set it to whatever you like.
> It would be nice to solve the performance problem without adding a
> tuneable.
>
> On the other hand, QEMU has no idea what the queue depth of the device
> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>
> 512 parallel requests is much too high.  Most parallel I/O benchmarking
> is done at 32-64 queue depth.
>
> I think that 16 parallel requests is a reasonable maximum number for a
> background job.
>
> We need to be clear though that the purpose of this change is unrelated
> to the original 512 MB memory footprint goal.  It just happens to touch
> the same constant but the goal is now to submit at most 16 I/O requests
> in parallel to avoid monopolizing the I/O device.

I think we should really look at this. The variables that control whether we
stay in the while loop or not are incremented and decremented at the
following places:

mig_save_device_dirty:
mig_save_device_bulk:
block_mig_state.submitted++;

blk_mig_read_cb:
block_mig_state.submitted--;
block_mig_state.read_done++;

flush_blks:
block_mig_state.read_done--;

The condition of the while loop is:
    ((block_mig_state.submitted + block_mig_state.read_done) * BLOCK_SIZE <
         qemu_file_get_rate_limit(f) &&
     (block_mig_state.submitted + block_mig_state.read_done) <
         MAX_INFLIGHT_IO)

At first I wonder if we ever reach the rate-limit because we put the read 
buffers onto f AFTER we exit the while loop?

And even if we reach the limit we constantly maintain 512 I/Os in parallel 
because we immediately decrement read_done
when we put the buffers to f in flush_blks. In the next iteration of the while 
loop we then read again until we have 512 in-flight I/Os.

And shouldn't we have a time limit to limit the time we stay in the while loop? 
I think we artificially delay sending data to f?

Peter





Re: [Qemu-block] [PATCH v3] iotests: Test abnormally large size in compressed cluster descriptor

2018-03-06 Thread Alberto Garcia
ping

On Mon 26 Feb 2018 03:36:23 PM CET, Alberto Garcia wrote:
> L2 entries for compressed clusters have a field that indicates the
> number of sectors used to store the data in the image.
>
> That's however not the size of the compressed data itself, just the
> number of sectors where that data is located. The actual data size is
> usually not a multiple of the sector size, and therefore cannot be
> represented with this field.
>
> The way it works is that QEMU reads all the specified sectors and
> starts decompressing the data until there's enough to recover the
> original uncompressed cluster. If there are any bytes left that
> haven't been decompressed they are simply ignored.
>
> One consequence of this is that even if the size field is larger than
> it needs to be QEMU can handle it just fine: it will read more data
> from disk but it will ignore the extra bytes.
>
> This test creates an image with two compressed clusters that use 5
> sectors (2.5 KB) each, increases the size field to the maximum (8192
> sectors, or 4 MB) and verifies that the data can be read without
> problems.
>
> This test is important because while the decompressed data takes
> exactly one cluster, the maximum value allowed in the compressed size
> field is twice the cluster size. So although QEMU won't produce images
> with such large values we need to make sure that it can handle them.
>
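(A quick sanity check of those numbers, as an aside: with 2 MB clusters we
have cluster_bits = 21, and the compressed-cluster descriptor reserves
cluster_bits - 8 bits for the sector count, so

    21 - 8 = 13 bits  ->  up to 2^13 = 8192 sectors = 8192 * 512 B = 4 MB,

which is exactly twice the 2 MB cluster size mentioned above.)
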
> Another effect of increasing the size field is that it can make
> it include data from the following host cluster(s). In this case
> 'qemu-img check' will detect that the refcounts are not correct, and
> we'll need to rebuild them.
>
> Additionally, this patch also tests that decreasing the size corrupts
> the image since the original data can no longer be recovered. In this
> case QEMU returns an error when trying to read the compressed data,
> but 'qemu-img check' doesn't see anything wrong if the refcounts are
> consistent.
>
> One possible task for the future is to make 'qemu-img check' verify
> the sizes of the compressed clusters, by trying to decompress the data
> and checking that the size stored in the L2 entry is correct.
>
> Signed-off-by: Alberto Garcia 
> ---
>
> v3: Add TODO comment, as suggested by Eric.
>
> Corrupt the length of the second compressed cluster as well so the
> uncompressed data would span three host clusters.
>
> v2: We now have two scenarios where we make QEMU read data from the
> next host cluster and from beyond the end of the image. This
> version also runs qemu-img check on the corrupted image.
>
> If the size field is too small, reading fails but qemu-img check
> succeeds.
>
> If the size field is too large, reading succeeds but qemu-img
> check fails (this can be repaired, though).
>
> ---
>  tests/qemu-iotests/122 | 45 +
>  tests/qemu-iotests/122.out | 31 +++
>  2 files changed, 76 insertions(+)
>
> diff --git a/tests/qemu-iotests/122 b/tests/qemu-iotests/122
> index 45b359c2ba..5b9593016c 100755
> --- a/tests/qemu-iotests/122
> +++ b/tests/qemu-iotests/122
> @@ -130,6 +130,51 @@ $QEMU_IO -c "read -P 01024k 1022k" "$TEST_IMG" 2>&1 
> | _filter_qemu_io | _fil
>  
>  
>  echo
> +echo "=== Corrupted size field in compressed cluster descriptor ==="
> +echo
> +# Create an empty image, fill half of it with data and compress it.
> +# The L2 entries of the two compressed clusters are located at
> +# 0x80 and 0x88, their original values are 0x400800a0
> +# and 0x400800a00802 (5 sectors for compressed data each).
> +TEST_IMG="$TEST_IMG".1 _make_test_img 8M
> +$QEMU_IO -c "write -P 0x11 0 4M" "$TEST_IMG".1 2>&1 | _filter_qemu_io | 
> _filter_testdir
> +$QEMU_IMG convert -c -O qcow2 -o cluster_size=2M "$TEST_IMG".1 "$TEST_IMG"
> +
> +# Reduce size of compressed data to 4 sectors: this corrupts the image.
> +poke_file "$TEST_IMG" $((0x80)) "\x40\x06"
> +$QEMU_IO -c "read  -P 0x11 0 4M" "$TEST_IMG" 2>&1 | _filter_qemu_io | 
> _filter_testdir
> +
> +# 'qemu-img check' however doesn't see anything wrong because it
> +# doesn't try to decompress the data and the refcounts are consistent.
> +# TODO: update qemu-img so this can be detected
> +_check_test_img
> +
> +# Increase size of compressed data to the maximum (8192 sectors).
> +# This makes QEMU read more data (8192 sectors instead of 5, host
> +# addresses [0xa0, 0xdf]), but the decompression algorithm
> +# stops once we have enough to restore the uncompressed cluster, so
> +# the rest of the data is ignored.
> +poke_file "$TEST_IMG" $((0x80)) "\x7f\xfe"
> +# Do it also for the second compressed cluster (L2 entry at 0x88).
> +# In this case the compressed data would span 3 host clusters
> +# (host addresses: [0xa00802, 0xe00801])
> +poke_file "$TEST_IMG" $((0x88)) "\x7f\xfe"
> +
> +# Here the image is too small so we're asking QEMU to read beyond the
> +# end of the image.
> +$QEMU_IO -c "read 

Re: [Qemu-block] [PATCH] qemu-iotests: fix 203 migration completion race

2018-03-06 Thread Stefan Hajnoczi
On Mon, Mar 05, 2018 at 05:04:52PM +0100, Max Reitz wrote:
> On 2018-03-05 16:59, Stefan Hajnoczi wrote:
> > There is a race between the test's 'query-migrate' QMP command after the
> > QMP 'STOP' event and completing the migration:
> > 
> > The test case invokes 'query-migrate' upon receiving 'STOP'.  At this
> > point the migration thread may still be in the process of completing.
> > Therefore 'query-migrate' can return 'status': 'active' for a brief
> > window of time instead of 'status': 'completed'.  This results in
> > qemu-iotests 203 hanging.
> > 
> > Solve the race by enabling the 'events' migration capability, which
> > causes QEMU to emit migration-specific QMP events that do not suffer
> > from this race condition.  Wait for the QMP 'MIGRATION' event with
> > 'status': 'completed'.
> > 
> > Reported-by: Max Reitz 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  tests/qemu-iotests/203 | 15 +++
> >  tests/qemu-iotests/203.out |  5 +
> >  2 files changed, 16 insertions(+), 4 deletions(-)
> 
> So much for "the ppoll() dungeon"...

It was still a pain to debug :).

I put a ring buffer into the QMP monitor input/output code.  Then it was
possible to figure out the issue via GDB on a hung QEMU:

  (gdb) p current_run_state
  RUN_STATE_POSTMIGRATE
  (gdb) p current_migration->status
  MIGRATION_STATUS_COMPLETED
  (gdb) p monitor_out_ring
  ...'STOP' event...
  (gdb) p monitor_in_ring
  ...query-migrate...  <-- okay, the test checked if migration finished

Then looking at the code:

  static void migration_completion(MigrationState *s)
  {
  ...
  if (s->state == MIGRATION_STATUS_ACTIVE) {
  qemu_mutex_lock_iothread();
  s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
  qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
  s->vm_was_running = runstate_is_running();
  ret = global_state_store();

  if (!ret) {
  bool inactivate = !migrate_colo_enabled();

v The stop event comes from here
  ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
  ...
  }
  qemu_mutex_unlock_iothread(); <--- oh, no!
  ...
  if (!migrate_colo_enabled()) {
           migrate_set_state(&s->state, current_active_state,
MIGRATION_STATUS_COMPLETED); <-- too late!
  }

  return;


signature.asc
Description: PGP signature


[Qemu-block] [PATCH v2 7/7] qcow2: Make qemu-img check detect corrupted L1 tables in snapshots

2018-03-06 Thread Alberto Garcia
'qemu-img check' cannot detect if a snapshot's L1 table is corrupted.
This patch checks the table's offset and size and reports corruption
if the values are not valid.

This patch doesn't add code to fix that corruption yet, only to detect
and report it.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-refcount.c | 14 ++
 tests/qemu-iotests/080 |  2 ++
 tests/qemu-iotests/080.out | 20 
 3 files changed, 36 insertions(+)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 443ec0a6e5..9d4b4c29d6 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2019,6 +2019,20 @@ static int calculate_refcounts(BlockDriverState *bs, 
BdrvCheckResult *res,
 /* snapshots */
 for (i = 0; i < s->nb_snapshots; i++) {
 sn = s->snapshots + i;
+if (offset_into_cluster(s, sn->l1_table_offset)) {
+fprintf(stderr, "ERROR snapshot %s (%s) l1_offset=%#" PRIx64 ": "
+"L1 table is not cluster aligned; snapshot table entry "
+"corrupted\n", sn->id_str, sn->name, sn->l1_table_offset);
+res->corruptions++;
+continue;
+}
+if (sn->l1_size > QCOW_MAX_L1_SIZE / sizeof(uint64_t)) {
+fprintf(stderr, "ERROR snapshot %s (%s) l1_size=%#" PRIx32 ": "
+"L1 table is too large; snapshot table entry corrupted\n",
+sn->id_str, sn->name, sn->l1_size);
+res->corruptions++;
+continue;
+}
 ret = check_refcounts_l1(bs, res, refcount_table, nb_clusters,
  sn->l1_table_offset, sn->l1_size, 0, fix);
 if (ret < 0) {
diff --git a/tests/qemu-iotests/080 b/tests/qemu-iotests/080
index f8e7d6f4df..4dbe68e950 100755
--- a/tests/qemu-iotests/080
+++ b/tests/qemu-iotests/080
@@ -182,6 +182,7 @@ poke_file "$TEST_IMG" "$offset_snap1_l1_offset" 
"\x00\x00\x00\x00\x00\x40\x02\x0
-c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
 { $QEMU_IMG snapshot -a test $TEST_IMG; } 2>&1 | _filter_testdir
 { $QEMU_IMG snapshot -d test $TEST_IMG; } 2>&1 | _filter_testdir
+_check_test_img
 
 echo
 echo "== Invalid snapshot L1 table size =="
@@ -195,6 +196,7 @@ poke_file "$TEST_IMG" "$offset_snap1_l1_size" 
"\x10\x00\x00\x00"
-c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
 { $QEMU_IMG snapshot -a test $TEST_IMG; } 2>&1 | _filter_testdir
 { $QEMU_IMG snapshot -d test $TEST_IMG; } 2>&1 | _filter_testdir
+_check_test_img
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/080.out b/tests/qemu-iotests/080.out
index 89bcd27172..4e0f7f7b92 100644
--- a/tests/qemu-iotests/080.out
+++ b/tests/qemu-iotests/080.out
@@ -71,6 +71,16 @@ write failed: Invalid argument
 qemu-img: Snapshot L1 table offset invalid
 qemu-img: Could not apply snapshot 'test': Failed to load snapshot: Invalid 
argument
 qemu-img: Could not delete snapshot 'test': Snapshot L1 table offset invalid
+ERROR snapshot 1 (test) l1_offset=0x400200: L1 table is not cluster aligned; 
snapshot table entry corrupted
+Leaked cluster 4 refcount=2 reference=1
+Leaked cluster 5 refcount=2 reference=1
+Leaked cluster 6 refcount=1 reference=0
+
+1 errors were found on the image.
+Data may be corrupted, or further writes to the image may corrupt it.
+
+3 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 
 == Invalid snapshot L1 table size ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
@@ -84,4 +94,14 @@ write failed: File too large
 qemu-img: Snapshot L1 table too large
 qemu-img: Could not apply snapshot 'test': Failed to load snapshot: File too 
large
 qemu-img: Could not delete snapshot 'test': Snapshot L1 table too large
+ERROR snapshot 1 (test) l1_size=0x1000: L1 table is too large; snapshot 
table entry corrupted
+Leaked cluster 4 refcount=2 reference=1
+Leaked cluster 5 refcount=2 reference=1
+Leaked cluster 6 refcount=1 reference=0
+
+1 errors were found on the image.
+Data may be corrupted, or further writes to the image may corrupt it.
+
+3 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 *** done
-- 
2.11.0




[Qemu-block] [PATCH v2 4/7] qcow2: Check snapshot L1 tables in qcow2_check_metadata_overlap()

2018-03-06 Thread Alberto Garcia
The inactive-l2 overlap check iterates uses the L1 tables from all
snapshots, but it does not validate them first.

We now have a function to take care of this, so let's use it.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-refcount.c | 10 +-
 tests/qemu-iotests/080 |  4 
 tests/qemu-iotests/080.out |  4 
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 126cca3276..443ec0a6e5 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2614,9 +2614,17 @@ int qcow2_check_metadata_overlap(BlockDriverState *bs, 
int ign, int64_t offset,
 uint64_t l1_ofs = s->snapshots[i].l1_table_offset;
 uint32_t l1_sz  = s->snapshots[i].l1_size;
 uint64_t l1_sz2 = l1_sz * sizeof(uint64_t);
-uint64_t *l1 = g_try_malloc(l1_sz2);
+uint64_t *l1;
 int ret;
 
+ret = qcow2_validate_table(bs, l1_ofs, l1_sz, sizeof(uint64_t),
+   QCOW_MAX_L1_SIZE, "", NULL);
+if (ret < 0) {
+return ret;
+}
+
+l1 = g_try_malloc(l1_sz2);
+
 if (l1_sz2 && l1 == NULL) {
 return -ENOMEM;
 }
diff --git a/tests/qemu-iotests/080 b/tests/qemu-iotests/080
index 5622604f83..018f815697 100755
--- a/tests/qemu-iotests/080
+++ b/tests/qemu-iotests/080
@@ -178,6 +178,8 @@ _make_test_img 64M
 poke_file "$TEST_IMG" "$offset_snap1_l1_offset" 
"\x00\x00\x00\x00\x00\x40\x02\x00"
 { $QEMU_IMG convert -s test $TEST_IMG $TEST_IMG.snap; } 2>&1 | _filter_testdir
 { $QEMU_IMG amend -o compat=0.10 $TEST_IMG; } 2>&1 | _filter_testdir
+{ $QEMU_IO -c "open -o overlap-check.inactive-l2=on $TEST_IMG" \
+   -c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
 
 echo
 echo "== Invalid snapshot L1 table size =="
@@ -187,6 +189,8 @@ _make_test_img 64M
 poke_file "$TEST_IMG" "$offset_snap1_l1_size" "\x10\x00\x00\x00"
 { $QEMU_IMG convert -s test $TEST_IMG $TEST_IMG.snap; } 2>&1 | _filter_testdir
 { $QEMU_IMG amend -o compat=0.10 $TEST_IMG; } 2>&1 | _filter_testdir
+{ $QEMU_IO -c "open -o overlap-check.inactive-l2=on $TEST_IMG" \
+   -c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/080.out b/tests/qemu-iotests/080.out
index 315edbc26f..d622ac06c0 100644
--- a/tests/qemu-iotests/080.out
+++ b/tests/qemu-iotests/080.out
@@ -66,6 +66,8 @@ wrote 512/512 bytes at offset 0
 qemu-img: Failed to load snapshot: Snapshot L1 table offset invalid
 qemu-img: Snapshot L1 table offset invalid
 qemu-img: Error while amending options: Invalid argument
+Failed to flush the refcount block cache: Invalid argument
+write failed: Invalid argument
 
 == Invalid snapshot L1 table size ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
@@ -74,4 +76,6 @@ wrote 512/512 bytes at offset 0
 qemu-img: Failed to load snapshot: Snapshot L1 table too large
 qemu-img: Snapshot L1 table too large
 qemu-img: Error while amending options: File too large
+Failed to flush the refcount block cache: File too large
+write failed: File too large
 *** done
-- 
2.11.0




[Qemu-block] [PATCH v2 6/7] qcow2: Check snapshot L1 table in qcow2_snapshot_delete()

2018-03-06 Thread Alberto Garcia
This function deletes a snapshot from disk, removing its entry from
the snapshot table, freeing its L1 table and decreasing the refcounts
of all clusters.

The L1 table offset and size are however not validated. If we use
invalid values in this function we'll probably corrupt the image even
more, so we should return an error instead.

We now have a function to take care of this, so let's use it.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-snapshot.c | 7 +++
 tests/qemu-iotests/080 | 2 ++
 tests/qemu-iotests/080.out | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 0faf728dc4..74293be470 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -611,6 +611,13 @@ int qcow2_snapshot_delete(BlockDriverState *bs,
 }
 sn = s->snapshots[snapshot_index];
 
+ret = qcow2_validate_table(bs, sn.l1_table_offset, sn.l1_size,
+   sizeof(uint64_t), QCOW_MAX_L1_SIZE,
+   "Snapshot L1 table", errp);
+if (ret < 0) {
+return ret;
+}
+
 /* Remove it from the snapshot list */
 memmove(s->snapshots + snapshot_index,
 s->snapshots + snapshot_index + 1,
diff --git a/tests/qemu-iotests/080 b/tests/qemu-iotests/080
index 538857310f..f8e7d6f4df 100755
--- a/tests/qemu-iotests/080
+++ b/tests/qemu-iotests/080
@@ -181,6 +181,7 @@ poke_file "$TEST_IMG" "$offset_snap1_l1_offset" 
"\x00\x00\x00\x00\x00\x40\x02\x0
 { $QEMU_IO -c "open -o overlap-check.inactive-l2=on $TEST_IMG" \
-c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
 { $QEMU_IMG snapshot -a test $TEST_IMG; } 2>&1 | _filter_testdir
+{ $QEMU_IMG snapshot -d test $TEST_IMG; } 2>&1 | _filter_testdir
 
 echo
 echo "== Invalid snapshot L1 table size =="
@@ -193,6 +194,7 @@ poke_file "$TEST_IMG" "$offset_snap1_l1_size" 
"\x10\x00\x00\x00"
 { $QEMU_IO -c "open -o overlap-check.inactive-l2=on $TEST_IMG" \
-c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
 { $QEMU_IMG snapshot -a test $TEST_IMG; } 2>&1 | _filter_testdir
+{ $QEMU_IMG snapshot -d test $TEST_IMG; } 2>&1 | _filter_testdir
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/080.out b/tests/qemu-iotests/080.out
index 1525e1b087..89bcd27172 100644
--- a/tests/qemu-iotests/080.out
+++ b/tests/qemu-iotests/080.out
@@ -70,6 +70,7 @@ Failed to flush the refcount block cache: Invalid argument
 write failed: Invalid argument
 qemu-img: Snapshot L1 table offset invalid
 qemu-img: Could not apply snapshot 'test': Failed to load snapshot: Invalid 
argument
+qemu-img: Could not delete snapshot 'test': Snapshot L1 table offset invalid
 
 == Invalid snapshot L1 table size ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
@@ -82,4 +83,5 @@ Failed to flush the refcount block cache: File too large
 write failed: File too large
 qemu-img: Snapshot L1 table too large
 qemu-img: Could not apply snapshot 'test': Failed to load snapshot: File too 
large
+qemu-img: Could not delete snapshot 'test': Snapshot L1 table too large
 *** done
-- 
2.11.0




[Qemu-block] [PATCH v2 1/7] qcow2: Generalize validate_table_offset() into qcow2_validate_table()

2018-03-06 Thread Alberto Garcia
This function checks that the offset and size of a table are valid.

While the offset checks are fine, the size check is too generic, since
it only verifies that the total size in bytes fits in a 64-bit
integer. In practice all tables used in qcow2 have much smaller size
limits, so the size needs to be checked again for each table using its
actual limit.

This patch generalizes this function by allowing the caller to specify
the maximum size for that table. In addition to that it allows passing
an Error variable.

The function is also renamed and made public since we're going to use
it in other parts of the code.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2.c  | 77 ++
 block/qcow2.h  | 10 +++---
 tests/qemu-iotests/080.out | 16 +-
 3 files changed, 43 insertions(+), 60 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 071dc4d608..1ab6abf3b5 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -559,26 +559,23 @@ static int qcow2_check(BlockDriverState *bs, 
BdrvCheckResult *result,
 return ret;
 }
 
-static int validate_table_offset(BlockDriverState *bs, uint64_t offset,
- uint64_t entries, size_t entry_len)
+int qcow2_validate_table(BlockDriverState *bs, uint64_t offset,
+ uint64_t entries, size_t entry_len,
+ int64_t max_size_bytes, const char *table_name,
+ Error **errp)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t size;
+
+if (entries > max_size_bytes / entry_len) {
+error_setg(errp, "%s too large", table_name);
+return -EFBIG;
+}
 
 /* Use signed INT64_MAX as the maximum even for uint64_t header fields,
  * because values will be passed to qemu functions taking int64_t. */
-if (entries > INT64_MAX / entry_len) {
-return -EINVAL;
-}
-
-size = entries * entry_len;
-
-if (INT64_MAX - size < offset) {
-return -EINVAL;
-}
-
-/* Tables must be cluster aligned */
-if (offset_into_cluster(s, offset) != 0) {
+if ((INT64_MAX - entries * entry_len < offset) ||
+(offset_into_cluster(s, offset) != 0)) {
+error_setg(errp, "%s offset invalid", table_name);
 return -EINVAL;
 }
 
@@ -1308,47 +1305,42 @@ static int qcow2_do_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->refcount_table_size =
 header.refcount_table_clusters << (s->cluster_bits - 3);
 
-if (header.refcount_table_clusters > qcow2_max_refcount_clusters(s)) {
-error_setg(errp, "Reference count table too large");
-ret = -EINVAL;
-goto fail;
-}
-
 if (header.refcount_table_clusters == 0 && !(flags & BDRV_O_CHECK)) {
 error_setg(errp, "Image does not contain a reference count table");
 ret = -EINVAL;
 goto fail;
 }
 
-ret = validate_table_offset(bs, s->refcount_table_offset,
-s->refcount_table_size, sizeof(uint64_t));
+ret = qcow2_validate_table(bs, s->refcount_table_offset,
+   header.refcount_table_clusters,
+   s->cluster_size, QCOW_MAX_REFTABLE_SIZE,
+   "Reference count table", errp);
 if (ret < 0) {
-error_setg(errp, "Invalid reference count table offset");
 goto fail;
 }
 
-/* Snapshot table offset/length */
-if (header.nb_snapshots > QCOW_MAX_SNAPSHOTS) {
-error_setg(errp, "Too many snapshots");
-ret = -EINVAL;
-goto fail;
-}
-
-ret = validate_table_offset(bs, header.snapshots_offset,
-header.nb_snapshots,
-sizeof(QCowSnapshotHeader));
+/* The total size in bytes of the snapshot table is checked in
+ * qcow2_read_snapshots() because the size of each snapshot is
+ * variable and we don't know it yet.
+ * Here we only check the offset and number of snapshots. */
+ret = qcow2_validate_table(bs, header.snapshots_offset,
+   header.nb_snapshots,
+   sizeof(QCowSnapshotHeader),
+   sizeof(QCowSnapshotHeader) * QCOW_MAX_SNAPSHOTS,
+   "Snapshot table", errp);
 if (ret < 0) {
-error_setg(errp, "Invalid snapshot table offset");
 goto fail;
 }
 
 /* read the level 1 table */
-if (header.l1_size > QCOW_MAX_L1_SIZE / sizeof(uint64_t)) {
-error_setg(errp, "Active L1 table too large");
-ret = -EFBIG;
+ret = qcow2_validate_table(bs, header.l1_table_offset,
+   header.l1_size, sizeof(uint64_t),
+   QCOW_MAX_L1_SIZE, "Active L1 table", errp);
+if (ret < 0) {
 goto fail;
 }
 s->l1_size = header.l1_size;
+

[Qemu-block] [PATCH v2 0/7] Add checks for corruption in the snapshot table

2018-03-06 Thread Alberto Garcia
Hey,

here's the new version of this series. It fixes a leak reported by
Kevin and adds a couple of error_report_err() to make use of the
message returned by qcow2_validate_table().

Regards,

Berto

Changes:

v2:
- Patch 3: Don't leak l1_table and report the error returned by
  qcow2_validate_table()
- Patch 5: Report the error returned by qcow2_validate_table().

v1: https://lists.gnu.org/archive/html/qemu-block/2018-03/msg00030.html
- Initial version

Output of backport-diff against v1:

Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/7:[] [-C] 'qcow2: Generalize validate_table_offset() into 
qcow2_validate_table()'
002/7:[] [--] 'qcow2: Check L1 table offset in qcow2_snapshot_load_tmp()'
003/7:[0010] [FC] 'qcow2: Check L1 table parameters in 
qcow2_expand_zero_clusters()'
004/7:[] [-C] 'qcow2: Check snapshot L1 tables in 
qcow2_check_metadata_overlap()'
005/7:[0007] [FC] 'qcow2: Check snapshot L1 table in qcow2_snapshot_goto()'
006/7:[] [-C] 'qcow2: Check snapshot L1 table in qcow2_snapshot_delete()'
007/7:[] [-C] 'qcow2: Make qemu-img check detect corrupted L1 tables in 
snapshots'

Alberto Garcia (7):
  qcow2: Generalize validate_table_offset() into qcow2_validate_table()
  qcow2: Check L1 table offset in qcow2_snapshot_load_tmp()
  qcow2: Check L1 table parameters in qcow2_expand_zero_clusters()
  qcow2: Check snapshot L1 tables in qcow2_check_metadata_overlap()
  qcow2: Check snapshot L1 table in qcow2_snapshot_goto()
  qcow2: Check snapshot L1 table in qcow2_snapshot_delete()
  qcow2: Make qemu-img check detect corrupted L1 tables in snapshots

 block/qcow2-cluster.c  | 24 ++-
 block/qcow2-refcount.c | 24 ++-
 block/qcow2-snapshot.c | 24 +--
 block/qcow2.c  | 77 ++
 block/qcow2.h  | 10 +++---
 tests/qemu-iotests/080 | 22 -
 tests/qemu-iotests/080.out | 58 --
 7 files changed, 166 insertions(+), 73 deletions(-)

-- 
2.11.0




[Qemu-block] [PATCH v2 5/7] qcow2: Check snapshot L1 table in qcow2_snapshot_goto()

2018-03-06 Thread Alberto Garcia
This function copies a snapshot's L1 table into the active one without
validating it first.

We now have a function to take care of this, so let's use it.

Signed-off-by: Alberto Garcia 
Cc: Eric Blake 
---
 block/qcow2-snapshot.c | 9 +
 tests/qemu-iotests/080 | 2 ++
 tests/qemu-iotests/080.out | 4 
 3 files changed, 15 insertions(+)

diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index bf94f4f434..0faf728dc4 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -465,6 +465,7 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const char 
*snapshot_id)
 {
 BDRVQcow2State *s = bs->opaque;
 QCowSnapshot *sn;
+Error *local_err = NULL;
 int i, snapshot_index;
 int cur_l1_bytes, sn_l1_bytes;
 int ret;
@@ -477,6 +478,14 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const char 
*snapshot_id)
 }
 sn = &s->snapshots[snapshot_index];
 
+ret = qcow2_validate_table(bs, sn->l1_table_offset, sn->l1_size,
+   sizeof(uint64_t), QCOW_MAX_L1_SIZE,
+   "Snapshot L1 table", &local_err);
+if (ret < 0) {
+error_report_err(local_err);
+goto fail;
+}
+
 if (sn->disk_size != bs->total_sectors * BDRV_SECTOR_SIZE) {
 error_report("qcow2: Loading snapshots with different disk "
 "size is not implemented");
diff --git a/tests/qemu-iotests/080 b/tests/qemu-iotests/080
index 018f815697..538857310f 100755
--- a/tests/qemu-iotests/080
+++ b/tests/qemu-iotests/080
@@ -180,6 +180,7 @@ poke_file "$TEST_IMG" "$offset_snap1_l1_offset" 
"\x00\x00\x00\x00\x00\x40\x02\x0
 { $QEMU_IMG amend -o compat=0.10 $TEST_IMG; } 2>&1 | _filter_testdir
 { $QEMU_IO -c "open -o overlap-check.inactive-l2=on $TEST_IMG" \
-c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG snapshot -a test $TEST_IMG; } 2>&1 | _filter_testdir
 
 echo
 echo "== Invalid snapshot L1 table size =="
@@ -191,6 +192,7 @@ poke_file "$TEST_IMG" "$offset_snap1_l1_size" 
"\x10\x00\x00\x00"
 { $QEMU_IMG amend -o compat=0.10 $TEST_IMG; } 2>&1 | _filter_testdir
 { $QEMU_IO -c "open -o overlap-check.inactive-l2=on $TEST_IMG" \
-c 'write 0 4k'; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG snapshot -a test $TEST_IMG; } 2>&1 | _filter_testdir
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/080.out b/tests/qemu-iotests/080.out
index d622ac06c0..1525e1b087 100644
--- a/tests/qemu-iotests/080.out
+++ b/tests/qemu-iotests/080.out
@@ -68,6 +68,8 @@ qemu-img: Snapshot L1 table offset invalid
 qemu-img: Error while amending options: Invalid argument
 Failed to flush the refcount block cache: Invalid argument
 write failed: Invalid argument
+qemu-img: Snapshot L1 table offset invalid
+qemu-img: Could not apply snapshot 'test': Failed to load snapshot: Invalid 
argument
 
 == Invalid snapshot L1 table size ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
@@ -78,4 +80,6 @@ qemu-img: Snapshot L1 table too large
 qemu-img: Error while amending options: File too large
 Failed to flush the refcount block cache: File too large
 write failed: File too large
+qemu-img: Snapshot L1 table too large
+qemu-img: Could not apply snapshot 'test': Failed to load snapshot: File too 
large
 *** done
-- 
2.11.0




[Qemu-block] [PATCH v2 3/7] qcow2: Check L1 table parameters in qcow2_expand_zero_clusters()

2018-03-06 Thread Alberto Garcia
This function iterates over all snapshots of a qcow2 file in order to
expand all zero clusters, but it does not validate the snapshots' L1
tables first.

We now have a function to take care of this, so let's use it.

We can also take the opportunity to replace the sector-based
bdrv_read() with bdrv_pread().

Signed-off-by: Alberto Garcia 
Cc: Eric Blake 
---
 block/qcow2-cluster.c  | 24 +---
 tests/qemu-iotests/080 |  2 ++
 tests/qemu-iotests/080.out |  4 
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 98908c4264..1aee726c6a 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -25,6 +25,7 @@
 #include "qemu/osdep.h"
 #include 
 
+#include "qapi/error.h"
 #include "qemu-common.h"
 #include "block/block_int.h"
 #include "block/qcow2.h"
@@ -2092,11 +2093,21 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
 }
 
 for (i = 0; i < s->nb_snapshots; i++) {
-int l1_sectors = DIV_ROUND_UP(s->snapshots[i].l1_size *
-  sizeof(uint64_t), BDRV_SECTOR_SIZE);
+int l1_size2;
+uint64_t *new_l1_table;
+Error *local_err = NULL;
 
-uint64_t *new_l1_table =
-g_try_realloc(l1_table, l1_sectors * BDRV_SECTOR_SIZE);
+ret = qcow2_validate_table(bs, s->snapshots[i].l1_table_offset,
+   s->snapshots[i].l1_size, sizeof(uint64_t),
+   QCOW_MAX_L1_SIZE, "Snapshot L1 table",
+   &local_err);
+if (ret < 0) {
+error_report_err(local_err);
+goto fail;
+}
+
+l1_size2 = s->snapshots[i].l1_size * sizeof(uint64_t);
+new_l1_table = g_try_realloc(l1_table, l1_size2);
 
 if (!new_l1_table) {
 ret = -ENOMEM;
@@ -2105,9 +2116,8 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
 
 l1_table = new_l1_table;
 
-ret = bdrv_read(bs->file,
-s->snapshots[i].l1_table_offset / BDRV_SECTOR_SIZE,
-(void *)l1_table, l1_sectors);
+ret = bdrv_pread(bs->file, s->snapshots[i].l1_table_offset,
+ l1_table, l1_size2);
 if (ret < 0) {
 goto fail;
 }
diff --git a/tests/qemu-iotests/080 b/tests/qemu-iotests/080
index 6a10e7defa..5622604f83 100755
--- a/tests/qemu-iotests/080
+++ b/tests/qemu-iotests/080
@@ -177,6 +177,7 @@ _make_test_img 64M
 { $QEMU_IMG snapshot -c test $TEST_IMG; } 2>&1 | _filter_testdir
 poke_file "$TEST_IMG" "$offset_snap1_l1_offset" 
"\x00\x00\x00\x00\x00\x40\x02\x00"
 { $QEMU_IMG convert -s test $TEST_IMG $TEST_IMG.snap; } 2>&1 | _filter_testdir
+{ $QEMU_IMG amend -o compat=0.10 $TEST_IMG; } 2>&1 | _filter_testdir
 
 echo
 echo "== Invalid snapshot L1 table size =="
@@ -185,6 +186,7 @@ _make_test_img 64M
 { $QEMU_IMG snapshot -c test $TEST_IMG; } 2>&1 | _filter_testdir
 poke_file "$TEST_IMG" "$offset_snap1_l1_size" "\x10\x00\x00\x00"
 { $QEMU_IMG convert -s test $TEST_IMG $TEST_IMG.snap; } 2>&1 | _filter_testdir
+{ $QEMU_IMG amend -o compat=0.10 $TEST_IMG; } 2>&1 | _filter_testdir
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/080.out b/tests/qemu-iotests/080.out
index f0d9038d55..315edbc26f 100644
--- a/tests/qemu-iotests/080.out
+++ b/tests/qemu-iotests/080.out
@@ -64,10 +64,14 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-img: Failed to load snapshot: Snapshot L1 table offset invalid
+qemu-img: Snapshot L1 table offset invalid
+qemu-img: Error while amending options: Invalid argument
 
 == Invalid snapshot L1 table size ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-img: Failed to load snapshot: Snapshot L1 table too large
+qemu-img: Snapshot L1 table too large
+qemu-img: Error while amending options: File too large
 *** done
-- 
2.11.0




[Qemu-block] [PATCH v2 2/7] qcow2: Check L1 table offset in qcow2_snapshot_load_tmp()

2018-03-06 Thread Alberto Garcia
This function checks that the size of a snapshot's L1 table is not too
large, but it doesn't validate the offset.

We now have a function to take care of this, so let's use it.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-snapshot.c |  8 +---
 tests/qemu-iotests/080 | 10 +-
 tests/qemu-iotests/080.out |  8 +++-
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index cee25f582b..bf94f4f434 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -704,9 +704,11 @@ int qcow2_snapshot_load_tmp(BlockDriverState *bs,
 sn = &s->snapshots[snapshot_index];
 
 /* Allocate and read in the snapshot's L1 table */
-if (sn->l1_size > QCOW_MAX_L1_SIZE / sizeof(uint64_t)) {
-error_setg(errp, "Snapshot L1 table too large");
-return -EFBIG;
+ret = qcow2_validate_table(bs, sn->l1_table_offset, sn->l1_size,
+   sizeof(uint64_t), QCOW_MAX_L1_SIZE,
+   "Snapshot L1 table", errp);
+if (ret < 0) {
+return ret;
 }
 new_l1_bytes = sn->l1_size * sizeof(uint64_t);
 new_l1_table = qemu_try_blockalign(bs->file->bs,
diff --git a/tests/qemu-iotests/080 b/tests/qemu-iotests/080
index 1c2bd85742..6a10e7defa 100755
--- a/tests/qemu-iotests/080
+++ b/tests/qemu-iotests/080
@@ -171,7 +171,15 @@ poke_file "$TEST_IMG" "$offset_l2_table_0" 
"\x80\x00\x00\xff\xff\xff\x00\x00"
 { $QEMU_IMG snapshot -c test $TEST_IMG; } 2>&1 | _filter_qemu_io | 
_filter_testdir
 
 echo
-echo "== Invalid snapshot L1 table =="
+echo "== Invalid snapshot L1 table offset =="
+_make_test_img 64M
+{ $QEMU_IO -c "write 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | 
_filter_testdir
+{ $QEMU_IMG snapshot -c test $TEST_IMG; } 2>&1 | _filter_testdir
+poke_file "$TEST_IMG" "$offset_snap1_l1_offset" 
"\x00\x00\x00\x00\x00\x40\x02\x00"
+{ $QEMU_IMG convert -s test $TEST_IMG $TEST_IMG.snap; } 2>&1 | _filter_testdir
+
+echo
+echo "== Invalid snapshot L1 table size =="
 _make_test_img 64M
 { $QEMU_IO -c "write 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | 
_filter_testdir
 { $QEMU_IMG snapshot -c test $TEST_IMG; } 2>&1 | _filter_testdir
diff --git a/tests/qemu-iotests/080.out b/tests/qemu-iotests/080.out
index 4c7790c9ee..f0d9038d55 100644
--- a/tests/qemu-iotests/080.out
+++ b/tests/qemu-iotests/080.out
@@ -59,7 +59,13 @@ wrote 512/512 bytes at offset 0
 qemu-img: Could not create snapshot 'test': -27 (File too large)
 qemu-img: Could not create snapshot 'test': -11 (Resource temporarily 
unavailable)
 
-== Invalid snapshot L1 table ==
+== Invalid snapshot L1 table offset ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+qemu-img: Failed to load snapshot: Snapshot L1 table offset invalid
+
+== Invalid snapshot L1 table size ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.11.0




Re: [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-06 Thread Peter Lieven
Am 05.03.2018 um 15:52 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (p...@kamp.de) wrote:
>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
 I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 
 and was curious what was the reason
 to choose 512MB as readahead? The question is that I found that the source 
 VM gets very unresponsive I/O wise
 while the initial 512MB are read and furthermore seems to stay 
 unresponsive if we choose a high migration speed
 and have a fast storage on the destination VM.

 In our environment I modified this value to 16MB which seems to work much 
 smoother. I wonder if we should make
 this a user configurable value or define a different rate limit for the 
 block transfer in bulk stage at least?
>>> I don't know if benchmarks were run when choosing the value.  From the
>>> commit description it sounds like the main purpose was to limit the
>>> amount of memory that can be consumed.
>>>
>>> 16 MB also fulfills that criteria :), but why is the source VM more
>>> responsive with a lower value?
>>>
>>> Perhaps the issue is queue depth on the storage device - the block
>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>> to wait?
>> That is my guess. Especially if the destination storage is faster we 
>> basically always have
>> 512 I/Os in flight on the source storage.
>>
>> Does anyone mind if we reduce that value to 16MB or do we need a better 
>> mechanism?
> We've got migration-parameters these days; you could connect it to one
> of those fairly easily I think.
> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
> already there.
> Then you can set it to whatever you like.

I will have a look at this.

Thank you,
Peter

>
> Dave
>
>> Peter
>>
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK






Re: [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-06 Thread Stefan Hajnoczi
On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
> * Peter Lieven (p...@kamp.de) wrote:
> > Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
> > > On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
> > >> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 
> > >> and was curious what was the reason
> > >> to choose 512MB as readahead? The question is that I found that the 
> > >> source VM gets very unresponsive I/O wise
> > >> while the initial 512MB are read and furthermore seems to stay 
> > >> unresponsive if we choose a high migration speed
> > >> and have a fast storage on the destination VM.
> > >>
> > >> In our environment I modified this value to 16MB which seems to work 
> > >> much smoother. I wonder if we should make
> > >> this a user configurable value or define a different rate limit for the 
> > >> block transfer in bulk stage at least?
> > > I don't know if benchmarks were run when choosing the value.  From the
> > > commit description it sounds like the main purpose was to limit the
> > > amount of memory that can be consumed.
> > >
> > > 16 MB also fulfills that criteria :), but why is the source VM more
> > > responsive with a lower value?
> > >
> > > Perhaps the issue is queue depth on the storage device - the block
> > > migration code enqueues up to 512 MB worth of reads, and guest I/O has
> > > to wait?
> > 
> > That is my guess. Especially if the destination storage is faster we 
> > basically always have
> > 512 I/Os in flight on the source storage.
> > 
> > Does anyone mind if we reduce that value to 16MB or do we need a better 
> > mechanism?
> 
> We've got migration-parameters these days; you could connect it to one
> of those fairly easily I think.
> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
> already there.
> Then you can set it to whatever you like.

It would be nice to solve the performance problem without adding a
tuneable.

On the other hand, QEMU has no idea what the queue depth of the device
is.  Therefore it cannot prioritize guest I/O over block migration I/O.

512 parallel requests is much too high.  Most parallel I/O benchmarking
is done at 32-64 queue depth.

I think that 16 parallel requests is a reasonable maximum number for a
background job.

We need to be clear though that the purpose of this change is unrelated
to the original 512 MB memory footprint goal.  It just happens to touch
the same constant but the goal is now to submit at most 16 I/O requests
in parallel to avoid monopolizing the I/O device.
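
As a toy illustration of that cap (plain C with made-up constants, not the
actual block-migration code), the point is just to keep a bounded submission
window:

    #include <stdio.h>

    #define MAX_PARALLEL_REQUESTS 16   /* proposed cap for a background job */
    #define TOTAL_REQUESTS        512  /* e.g. 512 chunks in the bulk stage */

    int main(void)
    {
        int submitted = 0;  /* requests handed to the device */
        int completed = 0;  /* requests already finished */

        while (completed < TOTAL_REQUESTS) {
            /* Top up the window, but never exceed the cap. */
            while (submitted < TOTAL_REQUESTS &&
                   submitted - completed < MAX_PARALLEL_REQUESTS) {
                submitted++;
            }
            /* Stand-in for "wait for one completion"; a real job would block
             * here, leaving the rest of the queue depth to guest I/O. */
            completed++;
        }
        printf("done; in-flight never exceeded %d\n", MAX_PARALLEL_REQUESTS);
        return 0;
    }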

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/2] block latency histogram

2018-03-06 Thread Stefan Hajnoczi
On Wed, Feb 07, 2018 at 03:50:35PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> v2:
> 
> 01: add block_latency_histogram_clear()
> 02: fix spelling (sorry =()
> some rewordings
> remove histogram if latency parameter unspecified
> 
> Vladimir Sementsov-Ogievskiy (2):
>   block/accounting: introduce latency histogram
>   qapi: add block latency histogram interface
> 
>  qapi/block-core.json   | 73 +-
>  include/block/accounting.h |  9 +
>  block/accounting.c | 97 
> ++
>  block/qapi.c   | 31 +++
>  blockdev.c | 19 +
>  5 files changed, 228 insertions(+), 1 deletion(-)

The feature looks good.  I posted documentation and code readability
suggestions.

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [Qemu-devel] [PATCH v2 2/2] qapi: add block latency histogram interface

2018-03-06 Thread Stefan Hajnoczi
On Wed, Feb 07, 2018 at 03:50:37PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 8225308904..74fe3fe9c4 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -451,6 +451,74 @@
> 'status': 'DirtyBitmapStatus'} }
>  
>  ##
> +# @BlockLatencyHistogramInfo:
> +#
> +# Block latency histogram.
> +#
> +# @latency: list of latency points in microseconds. Matches the @latency

The meaning of "point" is unclear here.  I think the term "interval
boundary" is clearer, but an example may help to illustrate this.  The
field could also be renamed to "intervals".

> +#   parameter from the last call to block-latency-histogram-set for 
> the
> +#   given device.

The fastest NVMe devices are specified at 10 microsecond latency or
less.  The software stack adds latency on top of that, but using
microseconds as units here may not be future-proof.

I suggest using nanoseconds, just in case.

> +#
> +# @read: list of read-request counts corresponding to latency region.
> +#len(@read) = len(@latency) + 1
> +#@read[0] corresponds to latency region [0, @latency[0])

"latency region" == "bin" or "interval" in histogram terminology

> +#for 0 < i < len(@latency): @read[i] corresponds to latency region
> +#[@latency[i-1], @latency[i])
> +#@read.last_element corresponds to latency region
> +#[@latency.last_element, +inf)
> +#
> +# @write: list of write-request counts (see @read semantics)
> +#
> +# @flush: list of flush-request counts (see @read semantics)
> +#
> +# Since: 2.12
> +##
> +{ 'struct': 'BlockLatencyHistogramInfo',
> +  'data': {'latency': ['uint64'],
> +   'read': ['uint64'],
> +   'write': ['uint64'],
> +   'flush': ['uint64'] } }
> +
> +##
> +# @block-latency-histogram-set:
> +#
> +# Remove old latency histogram (if exist) and (optionally) create a new one.

s/if exist/if present/

> +# Add a latency histogram to block device. If a latency histogram already
> +# exists for the device it will be removed and a new one created. The latency
> +# histogram may be queried through query-blockstats.
> +# Set, restart or clear latency histogram.
> +#
> +# @device: device name to set latency histogram for.

query-blockstats also works on nodes, so this command must work not just
on devices but on named nodes too.

> +# @latency: If unspecified, the latency histogram will be removed (if 
> exists).
> +#   Otherwise it should be list of latency points in microseconds. 
> Old
> +#   latency histogram will be removed (if exists) and a new one will 
> be
> +#   created. The sequence must be ascending, elements must be greater
> +#   than zero. Histogram latency regions would be
> +#   [0, @latency[0]), ..., [@latency[i], @latency[i+1]), ...,
> +#   [@latency.last_element, +inf)

Reworded:

  @intervals: An optional list of interval boundary values in
  nanoseconds, all greater than zero and in ascending order.
  The list [10, 50, 100] produces the following histogram
  bins: [0, 10), [10, 50), [50, 100), [100, +inf).  If the
  field is absent the existing latency histogram is removed
  from the device (if present).


signature.asc
Description: PGP signature


Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] block/accounting: introduce latency histogram

2018-03-06 Thread Stefan Hajnoczi
On Wed, Feb 07, 2018 at 03:50:36PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Introduce latency histogram statistics for block devices.
> For each accounted operation type latency region [0, +inf) is
> divided into subregions by several points. Then, calculate
> hits for each subregion.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/accounting.h |  9 +
>  block/accounting.c | 97 
> ++
>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/block/accounting.h b/include/block/accounting.h
> index b833d26d6c..9679020f64 100644
> --- a/include/block/accounting.h
> +++ b/include/block/accounting.h
> @@ -45,6 +45,12 @@ struct BlockAcctTimedStats {
>  QSLIST_ENTRY(BlockAcctTimedStats) entries;
>  };
>  
> +typedef struct BlockLatencyHistogram {
> +int size;
> +uint64_t *points; /* @size-1 points here (all points, except 0 and +inf) 
> */
> +uint64_t *histogram[BLOCK_MAX_IOTYPE]; /* @size elements for each type */

The semantics of these fields is not obvious.  Please include a comment
with an example or ASCII-art diagram so anyone reading the code
understands the purpose of the fields without reading all the code.

According to Wikipedia and Mathworld, "intervals" and "bins" are
commonly used terms:
https://en.wikipedia.org/wiki/Histogram
http://mathworld.wolfram.com/Histogram.html

I suggest:

  typedef struct {
  /* The following histogram is represented like this:
   *
   * 5|   *
   * 4|   *
   * 3| * *
   * 2| * **
   * 1| ****
   *  +--
   *  10   50   100
   *
   * BlockLatencyHistogram histogram = {
   * .nbins = 4,
   * .intervals = {10, 50, 100},
   * .bins = {3, 1, 5, 2},
   * };
   */
  size_t nbins;

  /* Interval boundaries (there are @nbins - 1 elements) */
  uint64_t *intervals;

  /* Frequency data (there are @nbins elements) */
  uint64_t *bins[BLOCK_MAX_IOTYPE];
  } BlockLatencyHistogram;

> +/* block_latency_histogram_compare_func
> + * Compare @key with interval [@el, @el+1), where @el+1 is a next array 
> element
> + * after @el.
> + * Return: -1 if @key < @el
> + *  0 if @key in [@el, @el+1)
> + * +1 if @key >= @el+1

"@el+1" is confusing, usually x+1 means "the value of x plus 1", not
"y" (a different variable!).

I suggest:

  /* block_latency_histogram_compare_func:
   * Compare @key with interval [@it[0], @it[1]).
   * Return: -1 if @key < @it[0]
   *  0 if @key in [@it[0], @it[1])
   * +1 if @key >= @it[1]
   */
  static int block_latency_histogram_compare_func(const void *key, const void 
*it)
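
For illustration only (a sketch, not code from the patch): the same bin lookup
can also be written as an explicit binary search over the boundaries, with the
interval semantics discussed above, i.e. bins [0, b[0]), [b[0], b[1]), ...,
[b[n-2], +inf):

    #include <stddef.h>
    #include <stdint.h>

    /* Return the bin index for @value, given @nbins bins separated by
     * @boundaries (nbins - 1 ascending values): bin 0 is [0, boundaries[0]),
     * bin i is [boundaries[i-1], boundaries[i]), and the last bin is
     * [boundaries[nbins-2], +inf). */
    static size_t histogram_bin_index(uint64_t value,
                                      const uint64_t *boundaries, size_t nbins)
    {
        size_t lo = 0, hi = nbins - 1;   /* candidate bin indices */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (value < boundaries[mid]) {
                hi = mid;      /* value lies below boundary mid */
            } else {
                lo = mid + 1;  /* value is at or above boundary mid */
            }
        }
        return lo;
    }

With boundaries {10, 50, 100} and nbins = 4, a value of 60 lands in bin 2,
i.e. the [50, 100) interval, matching the scheme above.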

> +int block_latency_histogram_set(BlockAcctStats *stats, uint64List *latency)
> +{
> +BlockLatencyHistogram *hist = &stats->latency_histogram;
> +uint64List *entry;
> +uint64_t *ptr;
> +int i;
> +uint64_t prev = 0;
> +
> +hist->size = 1;
> +
> +for (entry = latency; entry; entry = entry->next) {
> +if (entry->value <= prev) {
> +return -EINVAL;
> +}
> +hist->size++;
> +prev = entry->value;
> +}
> +
> +hist->points = g_renew(uint64_t, hist->points, hist->size - 1);
> +for (entry = latency, ptr = hist->points; entry;
> + entry = entry->next, ptr++)
> +{
> +*ptr = entry->value;
> +}
> +
> +for (i = 0; i < BLOCK_MAX_IOTYPE; i++) {
> +hist->histogram[i] = g_renew(uint64_t, hist->histogram[i], 
> hist->size);
> +memset(hist->histogram[i], 0, hist->size * sizeof(uint64_t));

Is there a reason for using g_renew() and then clearing everything?
This is more concise:

  g_free(hist->histogram[i]);
  hist->histogram[i] = g_new0(uint64_t, hist->size);


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH 3/7] qcow2: Check L1 table parameters in qcow2_expand_zero_clusters()

2018-03-06 Thread Alberto Garcia
On Tue 06 Mar 2018 04:11:17 PM CET, Kevin Wolf wrote:
> I've finished the review now, the rest looks correct.
>
> The only other thing I wondered is about the cases where you pass a
> NULL errp because the callers don't get an Error parameter, so they
> can't pass it on. Some of these callers already use error_report(), so
> it would be okay to use error_report_err() for an error returned by
> qcow2_validate_table(), too. I think that would improve the messages.

Good idea, I'll change that and resend the series.

Berto



Re: [Qemu-block] [PATCH 3/7] qcow2: Check L1 table parameters in qcow2_expand_zero_clusters()

2018-03-06 Thread Kevin Wolf
Am 06.03.2018 um 16:01 hat Alberto Garcia geschrieben:
> On Tue 06 Mar 2018 03:54:26 PM CET, Kevin Wolf wrote:
> >> @@ -2092,11 +2092,18 @@ int qcow2_expand_zero_clusters(BlockDriverState 
> >> *bs,
> >>  }
> >>  
> >>  for (i = 0; i < s->nb_snapshots; i++) {
> >> -int l1_sectors = DIV_ROUND_UP(s->snapshots[i].l1_size *
> >> -  sizeof(uint64_t), BDRV_SECTOR_SIZE);
> >> +int l1_size2;
> >> +uint64_t *new_l1_table;
> >>  
> >> -uint64_t *new_l1_table =
> >> -g_try_realloc(l1_table, l1_sectors * BDRV_SECTOR_SIZE);
> >> +ret = qcow2_validate_table(bs, s->snapshots[i].l1_table_offset,
> >> +   s->snapshots[i].l1_size, 
> >> sizeof(uint64_t),
> >> +   QCOW_MAX_L1_SIZE, "", NULL);
> >> +if (ret < 0) {
> >> +return ret;
> >
> > Shouldn't this be goto fail?
> 
> You're right, this is a loop, and l1_table could have been initialized
> in previous iterations.
> 
> I'll send a corrected version with this change, but first I'll wait a
> bit in case you see anything else in the series.

I've finished the review now, the rest looks correct.

The only other thing I wondered is about the cases where you pass a
NULL errp because the callers don't get an Error parameter, so they
can't pass it on. Some of these callers already use error_report(), so
it would be okay to use error_report_err() for an error returned by
qcow2_validate_table(), too. I think that would improve the messages.

Kevin



Re: [Qemu-block] [PATCH 3/7] qcow2: Check L1 table parameters in qcow2_expand_zero_clusters()

2018-03-06 Thread Alberto Garcia
On Tue 06 Mar 2018 03:54:26 PM CET, Kevin Wolf wrote:
>> @@ -2092,11 +2092,18 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
>>  }
>>  
>>  for (i = 0; i < s->nb_snapshots; i++) {
>> -int l1_sectors = DIV_ROUND_UP(s->snapshots[i].l1_size *
>> -  sizeof(uint64_t), BDRV_SECTOR_SIZE);
>> +int l1_size2;
>> +uint64_t *new_l1_table;
>>  
>> -uint64_t *new_l1_table =
>> -g_try_realloc(l1_table, l1_sectors * BDRV_SECTOR_SIZE);
>> +ret = qcow2_validate_table(bs, s->snapshots[i].l1_table_offset,
>> +   s->snapshots[i].l1_size, 
>> sizeof(uint64_t),
>> +   QCOW_MAX_L1_SIZE, "", NULL);
>> +if (ret < 0) {
>> +return ret;
>
> Shouldn't this be goto fail?

You're right, this is a loop, and l1_table could have been initialized
in previous iterations.

I'll send a corrected version with this change, but first I'll wait a
bit in case you see anything else in the series.

Berto



Re: [Qemu-block] [PATCH 3/7] qcow2: Check L1 table parameters in qcow2_expand_zero_clusters()

2018-03-06 Thread Kevin Wolf
Am 01.03.2018 um 17:27 hat Alberto Garcia geschrieben:
> This function iterates over all snapshots of a qcow2 file in order to
> expand all zero clusters, but it does not validate the snapshots' L1
> tables first.
> 
> We now have a function to take care of this, so let's use it.
> 
> We can also take the opportunity to replace the sector-based
> bdrv_read() with bdrv_pread().
> 
> Signed-off-by: Alberto Garcia 
> ---
>  block/qcow2-cluster.c  | 20 +---
>  tests/qemu-iotests/080 |  2 ++
>  tests/qemu-iotests/080.out |  2 ++
>  3 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index e406b0f3b9..40167ac09c 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -2092,11 +2092,18 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
>  }
>  
>  for (i = 0; i < s->nb_snapshots; i++) {
> -int l1_sectors = DIV_ROUND_UP(s->snapshots[i].l1_size *
> -  sizeof(uint64_t), BDRV_SECTOR_SIZE);
> +int l1_size2;
> +uint64_t *new_l1_table;
>  
> -uint64_t *new_l1_table =
> -g_try_realloc(l1_table, l1_sectors * BDRV_SECTOR_SIZE);
> +ret = qcow2_validate_table(bs, s->snapshots[i].l1_table_offset,
> +   s->snapshots[i].l1_size, sizeof(uint64_t),
> +   QCOW_MAX_L1_SIZE, "", NULL);
> +if (ret < 0) {
> +return ret;

Shouldn't this be goto fail?

Kevin



Re: [Qemu-block] [PATCH v4 0/8] Call check and invalidate_cache from coroutine context

2018-03-06 Thread Paolo Bonzini
On 06/03/2018 14:59, Kevin Wolf wrote:
> Am 01.03.2018 um 17:36 hat Paolo Bonzini geschrieben:
>> Check and invalidate_cache share some parts of the implementation
>> with the regular I/O path.  This is sometimes complicated because the
>> I/O path wants to use a CoMutex but that is not possible outside coroutine
>> context.  By moving things to coroutine context, we can remove special
>> cases.  In fact, invalidate_cache is already called from coroutine context
>> because incoming migration is placed in a coroutine.
>>
>> I'm including two patches from Stefan to rename the bdrv_create callback
>> to bdrv_co_create, because it is already called from coroutine context.
>> The name is now bdrv_co_create_opts, with bdrv_co_create reserved for
>> the QAPI-based version that Kevin is working on.
>>
>> qcow2 still has cache flushing in non-coroutine context, coming from
>> qcow2_reopen_prepare->qcow2_update_options_prepare,
>> qcow2_close->qcow2_inactivate and several dirty bitmap functions.
> 
> Hmm... Which commit is this based on? I can't seem to find one where it
> applies cleanly.

It's on top of bec9c64ef7be8063f1192608b83877bc5c9ea217, I'll respin.

Paolo



Re: [Qemu-block] [PATCH 0/7] Add checks for corruption in the snapshot table

2018-03-06 Thread Alberto Garcia
On Tue 06 Mar 2018 03:06:46 PM CET, Kevin Wolf wrote:
>> So although having a way to repair this kind of corruption would be
>> nice, we should still allow the user to try to open the image and
>> read data from it without requiring a potentially destructive
>> operation first.
>
> Generally speaking, there is also always the option to refuse opening
> a corrupted image read-write, but to allow opening it read-only. This
> would probably still require some checks on use, though (or maybe just
> hiding all snapshots).

You're right. The checks are in this series anyway.

Berto



Re: [Qemu-block] [PATCH 0/7] Add checks for corruption in the snapshot table

2018-03-06 Thread Kevin Wolf
Am 01.03.2018 um 17:27 hat Alberto Garcia geschrieben:
> Hey ho!
> 
> As we have already discussed in the mailing list, the offset and size
> values of snapshots' L1 tables are almost never validated in the QEMU
> code.
> 
> One way to deal with this is to validate them when the image is being
> opened (in qcow2_read_snapshots()) and return an error if there's
> anything wrong. However this would render the image unusable because
> we wouldn't be able to try to recover data using the active table as
> we can do now.
> 
> Another possibility is to refuse opening the image but add a way to
> fix it using 'qemu-img check'. But this would be a destructive
> operation, and knowing that the image is already corrupted we can't
> guarantee that we wouldn't be corrupting it even more.
> 
> So although having a way to repair this kind of corruption would be
> nice, we should still allow the user to try to open the image and read
> data from it without requiring a potentially destructive operation
> first.

Generally speaking, there is also always the option to refuse opening a
corrupted image read-write, but to allow opening it read-only. This
would probably still require some checks on use, though (or maybe just
hiding all snapshots).

Kevin

> So this series simply validates the snapshots' L1 tables in the places
> where they are actually used. We already have a function to do this
> kind of checks so we only need to call it.
> 
> Regards,
> 
> Berto
> 
> Alberto Garcia (7):
>   qcow2: Generalize validate_table_offset() into qcow2_validate_table()
>   qcow2: Check L1 table offset in qcow2_snapshot_load_tmp()
>   qcow2: Check L1 table parameters in qcow2_expand_zero_clusters()
>   qcow2: Check snapshot L1 tables in qcow2_check_metadata_overlap()
>   qcow2: Check snapshot L1 table in qcow2_snapshot_goto()
>   qcow2: Check snapshot L1 table in qcow2_snapshot_delete()
>   qcow2: Make qemu-img check detect corrupted L1 tables in snapshots
> 
>  block/qcow2-cluster.c  | 20 +++-
>  block/qcow2-refcount.c | 24 ++-
>  block/qcow2-snapshot.c | 21 +++--
>  block/qcow2.c  | 77 
> ++
>  block/qcow2.h  | 10 +++---
>  tests/qemu-iotests/080 | 22 -
>  tests/qemu-iotests/080.out | 54 ++--
>  7 files changed, 155 insertions(+), 73 deletions(-)
> 
> -- 
> 2.11.0
> 



Re: [Qemu-block] [PATCH v4 0/8] Call check and invalidate_cache from coroutine context

2018-03-06 Thread Kevin Wolf
Am 01.03.2018 um 17:36 hat Paolo Bonzini geschrieben:
> Check and invalidate_cache share some parts of the implementation
> with the regular I/O path.  This is sometimes complicated because the
> I/O path wants to use a CoMutex but that is not possible outside coroutine
> context.  By moving things to coroutine context, we can remove special
> cases.  In fact, invalidate_cache is already called from coroutine context
> because incoming migration is placed in a coroutine.
> 
> I'm including two patches from Stefan to rename the bdrv_create callback
> to bdrv_co_create, because it is already called from coroutine context.
> The name is now bdrv_co_create_opts, with bdrv_co_create reserved for
> the QAPI-based version that Kevin is working on.
> 
> qcow2 still has cache flushing in non-coroutine context, coming from
> qcow2_reopen_prepare->qcow2_update_options_prepare,
> qcow2_close->qcow2_inactivate and several dirty bitmap functions.

Hmm... Which commit is this based on? I can't seem to find one where it
applies cleanly.

Kevin



Re: [Qemu-block] [Qemu-devel] [RFC] qemu-img: Drop BLK_ZERO from convert

2018-03-06 Thread Stefan Hajnoczi
On Wed, Feb 28, 2018 at 09:11:32PM +0100, Max Reitz wrote:
> On 2018-02-28 19:08, Max Reitz wrote:
> > On 2018-02-27 17:17, Stefan Hajnoczi wrote:
> >> On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
> >>> There are filesystems (among which is tmpfs) that have a hard time
> >>> reporting allocation status.  That is definitely a bug in them.
> >>>
> >>> However, there is no good reason why qemu-img convert should query the
> >>> allocation status in the first place.  It does zero detection by itself
> >>> anyway, so we can detect unallocated areas ourselves.
> >>>
> >>> Furthermore, if a filesystem driver has any sense, reading unallocated
> >>> data should take just as much time as lseek(SEEK_DATA) + memset().  So
> >>> the only overhead we introduce by dropping the manual lseek() call is a
> >>> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
> >>> should be relatively quick.
> >>
> >> This makes sense.  Which file systems did you test this patch on?
> > 
> > On tmpfs and xfs, so far.
> > 
> >> XFS, ext4, and tmpfs would be a good minimal test set to prove the
> >> patch.  Perhaps with two input files:
> >> 1. A file that is mostly filled with data.
> >> 2. A file that is only sparsely populated with data.
> > 
> > And probably with vmdk, which (by default) forbids querying any areas
> > larger than 64 kB.
> > 
> >> The time taken should be comparable with the time before this patch.
> > 
> > Yep, I'll do some benchmarks.
> 
> And the results are in.  I've created 2 GB images on various filesystems
> in various formats, then I've either written 64 kB every 32 MB to them
> ("sparse"), or left out 64 kB every 32 MB ("full").  Then I've converted
> them to null-co:// and took the (real) time through "time". (Script is
> attached.)
> 
> I've attached the raw results before and after this patch.  Usually, I
> did six runs for each case and dropped the most extreme outlier --
> except for full vmdk images, where I've only done one run for each case
> because creating these images can take a very long time.
> 
> Here are the differences from before to after:
> 
> sparse raw on tmpfs:   + 19 % (436 ms to 520 ms)
> sparse qcow2 on tmpfs: - 31 % (435 ms to 301 ms)
> sparse vmdk on tmpfs:  + 37 % (214 ms to 294 ms)
> 
> sparse raw on xfs:     + 69 % (452 ms to 762 ms)
> sparse qcow2 on xfs:   - 34 % (462 ms to 304 ms)
> sparse vmdk on xfs:    + 42 % (210 ms to 298 ms)
> 
> sparse raw on ext4:    +360 % (144 ms to 655 ms)
> sparse qcow2 on ext4:  +120 % (147 ms to 330 ms)
> sparse vmdk on ext4:   + 16 % (253 ms to 293 ms)
> 
> 
> full raw on tmpfs:     -  9 % (437 ms to 398 ms)
> full qcow2 on tmpfs:   - 75 % (1.63 s to 403 ms)
> full vmdk on tmpfs:    -100 % (10 min to 767 ms)
> 
> full raw on xfs:       -  1 % (407 ms to 404 ms, insignificant)
> full qcow2 on xfs:     -  1 % (410 ms to 404 ms, insignificant)
> full vmdk on xfs:      - 33 % (1.05 s to 695 ms)
> 
> 
> full raw on ext4:      -  2 % (308 ms to 301 ms, insignificant)
> full qcow2 on ext4:    +  2 % (307 ms to 312 ms, insignificant)
> full vmdk on ext4:     - 74 % (3.53 s to 839 ms)
> 
> 
> So...  It's more extreme than I had hoped, that's for sure.  What I
> conclude from this is:
> 
> (1) This patch is generally good for nearly fully allocated images.  In
> the worst case (on well-behaving filesystems with well-optimized image
> formats) it changes nothing.  In the best case, conversion time is
> reduced drastically.
> 
> (2) For sparse raw images, this is absolutely devastating.  Reading them
> now takes more than (ext4) or nearly (xfs) twice as much time as reading
> a fully allocated image.  So much for "if a filesystem driver has any
> sense".
> 
> (2a) It might be worth noting that on xfs, reading the sparse file took
> longer even before this patch...
> 
> (3) qcow2 is different: It benefits from this patch on tmpfs and xfs
> (note that reading a sparse qcow2 file took longer than reading a full
> qcow2 file before this patch!), but it gets pretty much destroyed on
> ext4, too.
> 
> (4) As for sparse vmdk images...  Reading them takes longer, but it's
> still faster than reading full vmdk images, so that's not horrible.
> 
> 
> So there we are.  I was wrong about filesystem drivers having any sense,
> so this patch can indeed have a hugely negative impact.
> 
> I would argue that conversion time for full images is more important,
> because that's probably the main use case; but the thing is that here
> this patch only helps for tmpfs and vmdk.  We don't care too much about
> vmdk, and the fact that tmpfs takes so long simply is a bug.
> 
> I guess the xfs/tmpfs results would still be in a range where they are
> barely acceptable (because it's mainly a qcow2 vs. raw tradeoff), but
> the ext4 horrors probably make this patch a no-go in its current form.
> 
> In any case it's interesting to see that even the current qemu-img
> convert takes longer to read sparsely allocated qcow2/raw 

[Qemu-block] [PATCH v4] iotests: Tweak 030 in order to trigger a race condition with parallel jobs

2018-03-06 Thread Alberto Garcia
This patch tweaks TestParallelOps in iotest 030 so it allocates data
in smaller regions (256KB/512KB instead of 512KB/1MB) and the
block-stream job in test_stream_commit() only needs to copy data that
is at the very end of the image.

This way when the block-stream job is awakened it will finish right
away without any chance of being stopped by block_job_sleep_ns(). This
triggers the bug that was fixed by 3d5d319e1221082974711af1d09d82f and
1a63a907507fbbcfaee3f622907ec24 and is therefore a more useful test
case for parallel block jobs.

After this patch the aforementioned bug can also be reproduced with the
test_stream_parallel() test case.

Since with this change the stream job in test_stream_commit() finishes
early, this patch introduces a similar test case where both jobs are
slowed down so they can actually run in parallel.

Signed-off-by: Alberto Garcia 
Cc: John Snow 
---

v4: Mention that commit 1a63a907507fbbcfaee3f622907ec24 also
contributes to solving the original bug (both commits need to be
reverted in order to reproduce this bug reliably).

Rewrite the loop that writes data into the images to make it more
readable.

---
 tests/qemu-iotests/030 | 52 ++
 tests/qemu-iotests/030.out |  4 ++--
 2 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 457984b8e9..b5f88959aa 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -156,7 +156,7 @@ class TestSingleDrive(iotests.QMPTestCase):
 class TestParallelOps(iotests.QMPTestCase):
 num_ops = 4 # Number of parallel block-stream operations
 num_imgs = num_ops * 2 + 1
-image_len = num_ops * 1024 * 1024
+image_len = num_ops * 512 * 1024
 imgs = []
 
 def setUp(self):
@@ -176,14 +176,14 @@ class TestParallelOps(iotests.QMPTestCase):
  '-o', 'backing_file=%s' % self.imgs[i-1], self.imgs[i])
 
 # Put data into the images we are copying data from
-for i in range(self.num_imgs / 2):
-img_index = i * 2 + 1
-# Alternate between 512k and 1M.
+odd_img_indexes = [x for x in reversed(range(self.num_imgs)) if x % 2 
== 1]
+for i in range(len(odd_img_indexes)):
+# Alternate between 256KB and 512KB.
 # This way jobs will not finish in the same order they were created
-num_kb = 512 + 512 * (i % 2)
+num_kb = 256 + 256 * (i % 2)
 qemu_io('-f', iotests.imgfmt,
-'-c', 'write -P %d %d %d' % (i, i*1024*1024, num_kb * 
1024),
-self.imgs[img_index])
+'-c', 'write -P 0xFF %dk %dk' % (i * 512, num_kb),
+self.imgs[odd_img_indexes[i]])
 
 # Attach the drive to the VM
 self.vm = iotests.VM()
@@ -318,12 +318,14 @@ class TestParallelOps(iotests.QMPTestCase):
 self.wait_until_completed(drive='commit-drive0')
 
 # Test a block-stream and a block-commit job in parallel
-def test_stream_commit(self):
+# Here the stream job is supposed to finish quickly in order to reproduce
+# the scenario that triggers the bug fixed in 3d5d319e1221 and 1a63a907507
+def test_stream_commit_1(self):
 self.assertLessEqual(8, self.num_imgs)
 self.assert_no_active_block_jobs()
 
 # Stream from node0 into node2
-result = self.vm.qmp('block-stream', device='node2', job_id='node2')
+result = self.vm.qmp('block-stream', device='node2', 
base_node='node0', job_id='node2')
 self.assert_qmp(result, 'return', {})
 
 # Commit from the active layer into node3
@@ -348,6 +350,38 @@ class TestParallelOps(iotests.QMPTestCase):
 
 self.assert_no_active_block_jobs()
 
+# This is similar to test_stream_commit_1 but both jobs are slowed
+# down so they can run in parallel for a little while.
+def test_stream_commit_2(self):
+self.assertLessEqual(8, self.num_imgs)
+self.assert_no_active_block_jobs()
+
+# Stream from node0 into node4
+result = self.vm.qmp('block-stream', device='node4', 
base_node='node0', job_id='node4', speed=1024*1024)
+self.assert_qmp(result, 'return', {})
+
+# Commit from the active layer into node5
+result = self.vm.qmp('block-commit', device='drive0', 
base=self.imgs[5], speed=1024*1024)
+self.assert_qmp(result, 'return', {})
+
+# Wait for all jobs to be finished.
+pending_jobs = ['node4', 'drive0']
+while len(pending_jobs) > 0:
+for event in self.vm.get_qmp_events(wait=True):
+if event['event'] == 'BLOCK_JOB_COMPLETED':
+node_name = self.dictpath(event, 'data/device')
+self.assertTrue(node_name in pending_jobs)
+self.assert_qmp_absent(event, 'data/error')
+pending_jobs.remove(node_name)
+  

Re: [Qemu-block] Limiting coroutine stack usage

2018-03-06 Thread Stefan Hajnoczi
On Tue, Feb 20, 2018 at 06:04:02PM +0100, Peter Lieven wrote:
> I remember we discussed a long time ago limiting the stack usage of all 
> functions that are executed in a coroutine
> context to a very low value to be able to safely limit the coroutine stack 
> size as well.
> 
> I checked through all functions in block/, migration/ and nbd/ and there are 
> only very few larger or unbounded stack
> allocations that can easily be fixed.
> 
> Now my question: Is there an easy way to add a cflag like -Wstack-usage=2048 
> to all objects in a given directory only?
> I tried to add a limit to the whole project, but fixing this will be a 
> larger task.

2KB is fine for QEMU code but actual coroutine stack sizes will have to
be at least 8KB, I guess, in order for third-party libraries to work
(e.g. curl, rbd).  PATH_MAX is 4KB on Linux.

Nested event loops in QEMU code can also result in deep call stacks.
This happens when aio_poll() invokes an fd handler or BH that also
invokes aio_poll().

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH] block: implement the bdrv_reopen_prepare helper for LUKS driver

2018-03-06 Thread Kevin Wolf
Am 18.01.2018 um 11:31 hat Daniel P. Berrange geschrieben:
> If the bdrv_reopen_prepare helper isn't provided, the qemu-img commit
> command fails to re-open the base layer after committing changes into
> it. Provide a no-op implementation for the LUKS driver, since there
> is not any custom work that needs doing to re-open it.
> 
> Signed-off-by: Daniel P. Berrange 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-block] [PATCH] block: make BDRV_POLL_WHILE() re-entrancy safe

2018-03-06 Thread Kevin Wolf
Am 06.03.2018 um 11:53 hat Stefan Hajnoczi geschrieben:
> Nested BDRV_POLL_WHILE() calls can occur.  Currently
> assert(!bs_->wakeup) will fail when this happens.
> 
> This patch converts bs->wakeup from bool to a counter.
> 
> Nesting works correctly because outer BDRV_POLL_WHILE() callers evaluate
> the condition again after the inner caller completes (invoking the inner
> caller counts as aio_poll() progress).
> 
> Reported-by: "fuweiwei (C)" 
> Cc: Paolo Bonzini 
> Signed-off-by: Stefan Hajnoczi 

Doesn't this conflict with your own AIO_WAIT_WHILE() patch?

Kevin



Re: [Qemu-block] [PATCH] block: make BDRV_POLL_WHILE() re-entrancy safe

2018-03-06 Thread Paolo Bonzini
On 06/03/2018 11:53, Stefan Hajnoczi wrote:
> Nested BDRV_POLL_WHILE() calls can occur.  Currently
> assert(!bs_->wakeup) will fail when this happens.
> 
> This patch converts bs->wakeup from bool to a counter.
> 
> Nesting works correctly because outer BDRV_POLL_WHILE() callers evaluate
> the condition again after the inner caller completes (invoking the inner
> caller counts as aio_poll() progress).
> 
> Reported-by: "fuweiwei (C)" 
> Cc: Paolo Bonzini 
> Signed-off-by: Stefan Hajnoczi 
> ---
> Fu Weiwei: Please retest and let us know if this fixes the assertion
> failure you hit.  Thanks!
> ---
>  include/block/block.h | 7 +++
>  include/block/block_int.h | 2 +-
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index fac401ba3e..990b97f0ad 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -385,9 +385,8 @@ void bdrv_drain_all(void);
>   * other I/O threads' AioContexts (see for example \
>   * block_job_defer_to_main_loop for how to do it). \
>   */\
> -assert(!bs_->wakeup);  \
> -/* Set bs->wakeup before evaluating cond.  */  \
> -atomic_mb_set(&bs_->wakeup, true); \
> +/* Increment bs->wakeup before evaluating cond. */ \
> +atomic_inc(&bs_->wakeup);  \
>  while (busy_) {\
>  if ((cond)) {  \
>  waited_ = busy_ = true;\
> @@ -399,7 +398,7 @@ void bdrv_drain_all(void);
>  waited_ |= busy_;  \
>  }  \
>  }  \
> -atomic_set(&bs_->wakeup, false);   \
> +atomic_dec(&bs_->wakeup);  \
>  }  \
>  waited_; })
>  
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 5ea63f8fa8..0f360c0ed5 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -712,7 +712,7 @@ struct BlockDriverState {
>  /* Internal to BDRV_POLL_WHILE and bdrv_wakeup.  Accessed with atomic
>   * ops.
>   */
> -bool wakeup;
> +unsigned wakeup;
>  
>  /* counter for nested bdrv_io_plug.
>   * Accessed with atomic ops.
> 

Reviewed-by: Paolo Bonzini 

At least, the assertion made it fail fast. :)

Paolo



[Qemu-block] [PATCH] block: make BDRV_POLL_WHILE() re-entrancy safe

2018-03-06 Thread Stefan Hajnoczi
Nested BDRV_POLL_WHILE() calls can occur.  Currently
assert(!bs_->wakeup) will fail when this happens.

This patch converts bs->wakeup from bool to a counter.

Nesting works correctly because outer BDRV_POLL_WHILE() callers evaluate
the condition again after the inner caller completes (invoking the inner
caller counts as aio_poll() progress).
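
A minimal sketch of why a counter nests where the bool did not (illustration
only, not the actual macro):

    #include <assert.h>

    static unsigned wakeup;

    static void poll_while(int depth)
    {
        wakeup++;                    /* was: assert(!wakeup); wakeup = true; */
        if (depth > 0) {
            poll_while(depth - 1);   /* a nested BDRV_POLL_WHILE()-style call */
        }
        wakeup--;                    /* was: wakeup = false; */
    }

    int main(void)
    {
        poll_while(2);               /* with a bool flag, the inner call would
                                      * have tripped the assertion */
        assert(wakeup == 0);
        return 0;
    }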

Reported-by: "fuweiwei (C)" 
Cc: Paolo Bonzini 
Signed-off-by: Stefan Hajnoczi 
---
Fu Weiwei: Please retest and let us know if this fixes the assertion
failure you hit.  Thanks!
---
 include/block/block.h | 7 +++
 include/block/block_int.h | 2 +-
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index fac401ba3e..990b97f0ad 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -385,9 +385,8 @@ void bdrv_drain_all(void);
  * other I/O threads' AioContexts (see for example \
  * block_job_defer_to_main_loop for how to do it). \
  */\
-assert(!bs_->wakeup);  \
-/* Set bs->wakeup before evaluating cond.  */  \
-atomic_mb_set(&bs_->wakeup, true); \
+/* Increment bs->wakeup before evaluating cond. */ \
+atomic_inc(&bs_->wakeup);  \
 while (busy_) {\
 if ((cond)) {  \
 waited_ = busy_ = true;\
@@ -399,7 +398,7 @@ void bdrv_drain_all(void);
 waited_ |= busy_;  \
 }  \
 }  \
-atomic_set(&bs_->wakeup, false);   \
+atomic_dec(&bs_->wakeup);  \
 }  \
 waited_; })
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 5ea63f8fa8..0f360c0ed5 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -712,7 +712,7 @@ struct BlockDriverState {
 /* Internal to BDRV_POLL_WHILE and bdrv_wakeup.  Accessed with atomic
  * ops.
  */
-bool wakeup;
+unsigned wakeup;
 
 /* counter for nested bdrv_io_plug.
  * Accessed with atomic ops.
-- 
2.14.3




Re: [Qemu-block] [Qemu-devel] [PATCH] block: implement the bdrv_reopen_prepare helper for LUKS driver

2018-03-06 Thread Daniel P . Berrangé
Ping 

On Fri, Feb 16, 2018 at 01:53:12PM +, Daniel P. Berrangé wrote:
> Ping, can this be queued in the block tree, since it appears the no-op impl
> is ok ?
> 
> On Thu, Jan 18, 2018 at 10:31:43AM +, Daniel P. Berrange wrote:
> > If the bdrv_reopen_prepare helper isn't provided, the qemu-img commit
> > command fails to re-open the base layer after committing changes into
> > it. Provide a no-op implementation for the LUKS driver, since there
> > is not any custom work that needs doing to re-open it.
> > 
> > Signed-off-by: Daniel P. Berrange 
> > ---
> >  block/crypto.c | 7 +++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/block/crypto.c b/block/crypto.c
> > index 60ddf8623e..bb9a8f5376 100644
> > --- a/block/crypto.c
> > +++ b/block/crypto.c
> > @@ -382,6 +382,12 @@ static void block_crypto_close(BlockDriverState *bs)
> >  qcrypto_block_free(crypto->block);
> >  }
> >  
> > +static int block_crypto_reopen_prepare(BDRVReopenState *state,
> > +   BlockReopenQueue *queue, Error 
> > **errp)
> > +{
> > +/* nothing needs checking */
> > +return 0;
> > +}
> >  
> >  /*
> >   * 1 MB bounce buffer gives good performance / memory tradeoff
> > @@ -620,6 +626,7 @@ BlockDriver bdrv_crypto_luks = {
> >  .bdrv_truncate  = block_crypto_truncate,
> >  .create_opts= &block_crypto_create_opts_luks,
> >  
> > +.bdrv_reopen_prepare = block_crypto_reopen_prepare,
> >  .bdrv_refresh_limits = block_crypto_refresh_limits,
> >  .bdrv_co_preadv = block_crypto_co_preadv,
> >  .bdrv_co_pwritev= block_crypto_co_pwritev,
> > -- 
> > 2.14.3
> > 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
> 

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
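
The reason the no-op helper matters, as the commit message above explains,
is that the generic reopen path refuses any driver that does not provide a
.bdrv_reopen_prepare callback at all.  A pared-down, illustrative sketch of
that dispatch pattern (not the actual block.c code; the types and error
reporting here are simplified stand-ins):

    #include <stdio.h>

    /* Simplified stand-ins for the QEMU structures; illustration only. */
    typedef struct BDRVReopenState BDRVReopenState;
    typedef struct BlockReopenQueue BlockReopenQueue;

    typedef struct BlockDriver {
        const char *format_name;
        int (*bdrv_reopen_prepare)(BDRVReopenState *state,
                                   BlockReopenQueue *queue);
    } BlockDriver;

    /* Generic reopen step: a driver without the callback cannot be
     * reopened, which is what broke "qemu-img commit" for LUKS before
     * the no-op implementation was added. */
    static int generic_reopen_prepare(BlockDriver *drv, BDRVReopenState *st,
                                      BlockReopenQueue *queue)
    {
        if (!drv->bdrv_reopen_prepare) {
            fprintf(stderr, "format '%s' does not support reopening\n",
                    drv->format_name);
            return -1;
        }
        return drv->bdrv_reopen_prepare(st, queue);
    }

    /* LUKS-style no-op: nothing needs checking, so simply succeed. */
    static int noop_reopen_prepare(BDRVReopenState *st, BlockReopenQueue *q)
    {
        (void)st;
        (void)q;
        return 0;
    }

    int main(void)
    {
        BlockDriver before = { .format_name = "luks (no helper)" };
        BlockDriver after  = { .format_name = "luks (no-op helper)",
                               .bdrv_reopen_prepare = noop_reopen_prepare };

        printf("%d\n", generic_reopen_prepare(&before, NULL, NULL)); /* -1 */
        printf("%d\n", generic_reopen_prepare(&after,  NULL, NULL)); /*  0 */
        return 0;
    }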



Re: [Qemu-block] [PULL v2 00/38] Block layer patches

2018-03-06 Thread Peter Maydell
On 6 March 2018 at 10:24, Kevin Wolf  wrote:
> Am 06.03.2018 um 11:09 hat Peter Maydell geschrieben:
>> On 5 March 2018 at 17:51, Kevin Wolf  wrote:
>> > The following changes since commit 
>> > 86f4c7e05b1c44dbe1b329a51f311f10aef6ff34:
>> >
>> >   Merge remote-tracking branch 
>> > 'remotes/pmaydell/tags/pull-target-arm-20180302' into staging (2018-03-02 
>> > 14:37:10 +)
>> >
>> > are available in the git repository at:
>> >
>> >   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>> >
>> > for you to fetch changes up to bfe1a14c180ec44c033be12b9151252ffda69292:
>> >
>> >   block: Fix NULL dereference on empty drive error (2018-03-05 18:45:32 
>> > +0100)
>> >
>> > 
>> > Block layer patches
>> >
>> > 
>>
>> This pull turns out to add a bunch of tests whose names end in "error",
>> which triggers my buildlog scanning scripts that look for "error:".
>> Could we rename those?
>
> Wouldn't it make more sense to change your script so that it considers
> "error" only if it's a whole word rather than part of an identifier?
> These test cases concern error paths, so it's kind of expected that
> their name contains "error" as a substring.

Mmm, yes, that turns out not to be too difficult. I've done that.

thanks
-- PMM



Re: [Qemu-block] [PATCH v3] iotests: Tweak 030 in order to trigger a race condition with parallel jobs

2018-03-06 Thread Alberto Garcia
On Mon 05 Mar 2018 05:59:47 PM CET, Max Reitz wrote:
> Secondly, I've reverted 3d5d319e1221082 to test this, and I could
> reproduce failure exactly once.  Since then, no luck (in like 20
> attempts, I think)...

Oh, I see. I was bisecting this and it seems that 1a63a907507fbbcf made
this problem more difficult to reproduce. If you revert that one too (in
addition to 3d5d319e1221082, that is) then it crashes very easily.

I don't think it makes sense to tweak the test any further; I would
simply mention that "this triggers the bug that was fixed by 3d5d319e122
and 1a63a907507fbbcf" or something like that.

Berto



Re: [Qemu-block] [PULL v2 00/38] Block layer patches

2018-03-06 Thread Kevin Wolf
Am 06.03.2018 um 11:09 hat Peter Maydell geschrieben:
> On 5 March 2018 at 17:51, Kevin Wolf  wrote:
> > The following changes since commit 86f4c7e05b1c44dbe1b329a51f311f10aef6ff34:
> >
> >   Merge remote-tracking branch 
> > 'remotes/pmaydell/tags/pull-target-arm-20180302' into staging (2018-03-02 
> > 14:37:10 +)
> >
> > are available in the git repository at:
> >
> >   git://repo.or.cz/qemu/kevin.git tags/for-upstream
> >
> > for you to fetch changes up to bfe1a14c180ec44c033be12b9151252ffda69292:
> >
> >   block: Fix NULL dereference on empty drive error (2018-03-05 18:45:32 
> > +0100)
> >
> > 
> > Block layer patches
> >
> > 
> 
> This pull turns out to add a bunch of tests whose names end in "error",
> which triggers my buildlog scanning scripts that look for "error:".
> Could we rename those?

Wouldn't it make more sense to change your script so that it considers
"error" only if it's a whole word rather than part of an identifier?
These test cases concern error paths, so it's kind of expected that
their name contains "error" as a substring.

Kevin



Re: [Qemu-block] [PULL v2 00/38] Block layer patches

2018-03-06 Thread Peter Maydell
On 5 March 2018 at 17:51, Kevin Wolf  wrote:
> The following changes since commit 86f4c7e05b1c44dbe1b329a51f311f10aef6ff34:
>
>   Merge remote-tracking branch 
> 'remotes/pmaydell/tags/pull-target-arm-20180302' into staging (2018-03-02 
> 14:37:10 +)
>
> are available in the git repository at:
>
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to bfe1a14c180ec44c033be12b9151252ffda69292:
>
>   block: Fix NULL dereference on empty drive error (2018-03-05 18:45:32 +0100)
>
> 
> Block layer patches
>
> 

This pull turns out to add a bunch of tests whose names end in "error",
which triggers my buildlog scanning scripts that look for "error:".
Could we rename those?

thanks
-- PMM



Re: [Qemu-block] [PATCH] block: include original filename when reporting invalid URIs

2018-03-06 Thread Stefan Hajnoczi
On Tue, Feb 06, 2018 at 10:52:04AM +, Daniel P. Berrangé wrote:
> Consider passing a JSON based block driver to "qemu-img commit"
> 
> $ qemu-img commit 'json:{"driver":"qcow2","file":{"driver":"gluster",\
>   "volume":"gv0","path":"sn1.qcow2",
>   "server":[{"type":\
> "tcp","host":"10.73.199.197","port":"24007"}]},}'
> 
> Currently it will commit the content and then report an incredibly
> useless error message when trying to re-open the committed image:
> 
>   qemu-img: invalid URI
>   Usage: 
> file=gluster[+transport]://[host[:port]]volume/path[?socket=...][,file.debug=N][,file.logfile=/path/filename.log]
> 
> With this fix we get:
> 
>   qemu-img: invalid URI json:{"server.0.host": "10.73.199.197",
>   "driver": "gluster", "path": "luks.qcow2", "server.0.type":
>   "tcp", "server.0.port": "24007", "volume": "gv0"}
> 
> Of course the root cause problem still exists, but now we know
> what actually needs fixing.
> 
> Signed-off-by: Daniel P. Berrangé 
> ---
>  block/gluster.c  | 2 +-
>  block/sheepdog.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

Jeff: Ping

> diff --git a/block/gluster.c b/block/gluster.c
> index 0f4265a3a4..0215e19087 100644
> --- a/block/gluster.c
> +++ b/block/gluster.c
> @@ -660,7 +660,7 @@ static struct glfs *qemu_gluster_init(BlockdevOptionsGluster *gconf,
>  if (filename) {
>  ret = qemu_gluster_parse_uri(gconf, filename);
>  if (ret < 0) {
> -error_setg(errp, "invalid URI");
> +error_setg(errp, "invalid URI %s", filename);
>  error_append_hint(errp, "Usage: file=gluster[+transport]://"
>  "[host[:port]]volume/path[?socket=...]"
>  "[,file.debug=N]"
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index f684477328..c847ab6c98 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -1050,7 +1050,7 @@ static void sd_parse_uri(SheepdogConfig *cfg, const char *filename,
>  
>  cfg->uri = uri = uri_parse(filename);
>  if (!uri) {
> -error_setg(&err, "invalid URI");
> +error_setg(&err, "invalid URI '%s'", filename);
>  goto out;
>  }
>  
> -- 
> 2.14.3
> 
> 

