Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-02 Thread Francesco Romani


- Original Message -
> From: "Stefan Hajnoczi" 
> To: "Francesco Romani" 
> Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, qemu-devel@nongnu.org, 
> lcapitul...@redhat.com
> Sent: Tuesday, December 2, 2014 4:01:17 PM
> Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds  
> threshold
> 
> On Fri, Nov 28, 2014 at 01:31:07PM +0100, Francesco Romani wrote:
> > @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState
> > *bs)
> >  info->iops_size = cfg.op_size;
> >  }
> >  
> > +info->write_threshold = bdrv_usage_threshold_get(bs);
> 
> Overall looks good but I notice that "write_threshold" and
> "usage_threshold" are both used.  Please use just one consistently (I
> think "write_threshold" is clearer).

Agreed. Will use "write_threshold" everywhere, file names included.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-02 Thread Stefan Hajnoczi
On Fri, Nov 28, 2014 at 01:31:07PM +0100, Francesco Romani wrote:
> @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState 
> *bs)
>  info->iops_size = cfg.op_size;
>  }
>  
> +info->write_threshold = bdrv_usage_threshold_get(bs);

Overall looks good but I notice that "write_threshold" and
"usage_threshold" are both used.  Please use just one consistently (I
think "write_threshold" is clearer).


pgpoo5lthMtKa.pgp
Description: PGP signature


Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-02 Thread Francesco Romani
- Original Message -
> From: "Francesco Romani" 
> To: "Eric Blake" 
> Cc: kw...@redhat.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, 
> stefa...@redhat.com, mdr...@linux.vnet.ibm.com
> Sent: Tuesday, December 2, 2014 8:47:45 AM
> Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage  exceeds 
> threshold
> 
> Thanks for the quick review!

> > Missing the change to the 'query-block' and 'query-named-block-nodes'
> > examples to show the new always-present output field.
> 

Sorry, I missed the *examples* keyword. Will fix.

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-01 Thread Francesco Romani
Thanks for the quick review!

- Original Message -
> From: "Eric Blake" 
> To: "Francesco Romani" , qemu-devel@nongnu.org
> Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, 
> lcapitul...@redhat.com
> Sent: Monday, December 1, 2014 10:07:38 PM
> Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds  
> threshold
> 
> On 11/28/2014 05:31 AM, Francesco Romani wrote:
> > Managing applications, like oVirt (http://www.ovirt.org), make extensive
> > use of thin-provisioned disk images.
> > To let the guest run smoothly and be not unnecessarily paused, oVirt sets
> > a disk usage threshold (so called 'high water mark') based on the
> > occupation
> > of the device,  and automatically extends the image once the threshold
> > is reached or exceeded.
> > 
> > In order to detect the crossing of the threshold, oVirt has no choice but
> > aggressively polling the QEMU monitor using the query-blockstats command.
> > This lead to unnecessary system load, and is made even worse under scale:
> > deployments with hundreds of VMs are no longer rare.
> > 
> > To fix this, this patch adds:
> > * A new monitor command to set a mark for a given block device.
> > * A new event to report if a block device usage exceeds the threshold.
> > 
> > This will allow the managing application to drop the polling
> > altogether and just wait for a watermark crossing event.
> 
> I like the idea!
> 
> Question - what happens if management misses the event (for example, if
> libvirtd is restarted)?  Does the existing 'query-blockstats' and/or
> 'query-named-block-nodes' still work to query the current threshold and
> whether it has been exceeded, as a poll-once command executed when
> reconnecting to the monitor?

Indeed oVirt wants to keep the existing polling and to use it as fallback,
to make sure no events are missed. oVirt will not rely on the new notification
*alone*.
The plan is to "just" poll *much less* frequently. Today's default poll
rate is every 2 (two) seconds, so there is a lot of room for improvement.

> 
> > 
> > Signed-off-by: Francesco Romani 
> > ---
> 
> No need for a 0/1 cover letter on a 1-patch series; you have the option
> of just putting the side-band information here and sending it as a
> single mail.  But the cover letter approach doesn't hurt either, and I
> can see how it can be easier for some workflows to always send a cover
> letter than to special-case a 1-patch series.

Also, I found that a separate context/introduction/room for comments
would helped, especially on first two revisions of the patch which were
tagged as RFC. I have zero problems in dropping the cover letter now that
consensus is forming and patch is taking shape, just let me know.

[... snip spelling: thanks! will fix]
> > +++ b/qapi/block-core.json
> > @@ -239,6 +239,9 @@
> >  #
> >  # @iops_size: #optional an I/O size in bytes (Since 1.7)
> >  #
> > +# @write-threshold: configured write threshold for the device.
> > +#   0 if disabled. (Since 2.3)
> > +#
> >  # Since: 0.14.0
> >  #
> >  ##
> > @@ -253,7 +256,7 @@
> >  '*bps_max': 'int', '*bps_rd_max': 'int',
> >  '*bps_wr_max': 'int', '*iops_max': 'int',
> >  '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> > -'*iops_size': 'int' } }
> > +'*iops_size': 'int', 'write-threshold': 'uint64' } }
> 
> In QMP specs, 'uint64' and 'int' are practically synonyms.  I can live
> with either spelling, although 'int' is more common.
> 
> Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
> we also favor consistency, and BlockDeviceInfo is one of those dinosaur
> commands that uses _ everywhere until your addition.  So naming this
> field 'write_threshold' would be more consistent.

Agreed. Will fix.

> > +#
> > +# @node-name: graph node name on which the threshold was exceeded.
> > +#
> > +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
> > +#
> > +# @offset-threshold: last configured threshold, in bytes.
> > +#
> 
> Might want to mention that this event is one-shot; after it triggers, a
> user must re-register a threshold to get the event again.

Good point. Will fix.

> 
> > +# Since: 2.3
> > +##
> > +{ 'event': 'BLOCK_U

Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-01 Thread Eric Blake
On 11/28/2014 05:31 AM, Francesco Romani wrote:
> Managing applications, like oVirt (http://www.ovirt.org), make extensive
> use of thin-provisioned disk images.
> To let the guest run smoothly and be not unnecessarily paused, oVirt sets
> a disk usage threshold (so called 'high water mark') based on the occupation
> of the device,  and automatically extends the image once the threshold
> is reached or exceeded.
> 
> In order to detect the crossing of the threshold, oVirt has no choice but
> aggressively polling the QEMU monitor using the query-blockstats command.
> This lead to unnecessary system load, and is made even worse under scale:
> deployments with hundreds of VMs are no longer rare.
> 
> To fix this, this patch adds:
> * A new monitor command to set a mark for a given block device.
> * A new event to report if a block device usage exceeds the threshold.
> 
> This will allow the managing application to drop the polling
> altogether and just wait for a watermark crossing event.

I like the idea!

Question - what happens if management misses the event (for example, if
libvirtd is restarted)?  Does the existing 'query-blockstats' and/or
'query-named-block-nodes' still work to query the current threshold and
whether it has been exceeded, as a poll-once command executed when
reconnecting to the monitor?

> 
> Signed-off-by: Francesco Romani 
> ---

No need for a 0/1 cover letter on a 1-patch series; you have the option
of just putting the side-band information here and sending it as a
single mail.  But the cover letter approach doesn't hurt either, and I
can see how it can be easier for some workflows to always send a cover
letter than to special-case a 1-patch series.

> +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> +void *opaque)
> +{
> +BdrvTrackedRequest *req = opaque;
> +BlockDriverState *bs = req->bs;
> +uint64_t amount = 0;
> +
> +amount = bdrv_usage_threshold_exceeded(bs, req);
> +if (amount > 0) {
> +qapi_event_send_block_usage_threshold(
> +bs->node_name,
> +amount,
> +bs->write_threshold_offset,
> +&error_abort);
> +
> +/* autodisable to avoid to flood the monitor */

s/to flood/flooding/

> +/*
> + * bdrv_usage_threshold_is_set
> + *
> + * Tell if an usage threshold is set for a given BDS.

s/an usage/a usage/

(in English, the difference between "a" and "an" is whether the leading
sound of the next word is pronounced or not; in this case, "usage" is
pronounced with a hard "yoo-sage".  It may help to remember "an umbrella
for a unicorn")

> +++ b/qapi/block-core.json
> @@ -239,6 +239,9 @@
>  #
>  # @iops_size: #optional an I/O size in bytes (Since 1.7)
>  #
> +# @write-threshold: configured write threshold for the device.
> +#   0 if disabled. (Since 2.3)
> +#
>  # Since: 0.14.0
>  #
>  ##
> @@ -253,7 +256,7 @@
>  '*bps_max': 'int', '*bps_rd_max': 'int',
>  '*bps_wr_max': 'int', '*iops_max': 'int',
>  '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> -'*iops_size': 'int' } }
> +'*iops_size': 'int', 'write-threshold': 'uint64' } }

In QMP specs, 'uint64' and 'int' are practically synonyms.  I can live
with either spelling, although 'int' is more common.

Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
we also favor consistency, and BlockDeviceInfo is one of those dinosaur
commands that uses _ everywhere until your addition.  So naming this
field 'write_threshold' would be more consistent.

> +##
> +# @BLOCK_USAGE_THRESHOLD
> +#
> +# Emitted when writes on block device reaches or exceeds the
> +# configured threshold. For thin-provisioned devices, this
> +# means the device should be extended to avoid pausing for
> +# disk exaustion.

s/exaustion/exhaustion/

> +#
> +# @node-name: graph node name on which the threshold was exceeded.
> +#
> +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
> +#
> +# @offset-threshold: last configured threshold, in bytes.
> +#

Might want to mention that this event is one-shot; after it triggers, a
user must re-register a threshold to get the event again.

> +# Since: 2.3
> +##
> +{ 'event': 'BLOCK_USAGE_THRESHOLD',
> +  'data': { 'node-name': 'str',
> + 'amount-exceeded': 'uint64',

TAB damage.  Please use spaces.  ./scripts/checkpatch.pl will catch some
offenders (although I didn't test if it will catch this one).

However, here you are correct in using '-' for naming :)

> + 'threshold': 'uint64' } }
> +
> +##
> +# @block-set-threshold
> +#
> +# Change usage threshold for a block drive. An event will be delivered
> +# if a write to this block drive crosses the configured threshold.
> +# This is useful to transparently resize thin-provisioned drives without
> +# the guest OS noticing.
> +#
> +# @node-name: graph node name on which the threshold must be set.
> +#
> +#

[Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-11-28 Thread Francesco Romani
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani 
---
 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/usage-threshold.c | 124 
 include/block/block_int.h   |   4 ++
 include/block/usage-threshold.h |  62 
 qapi/block-core.json|  48 +++-
 qmp-commands.hx |  28 +
 tests/Makefile  |   3 +
 tests/test-usage-threshold.c| 101 
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index a87a34a..c85abb0 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/usage-threshold.h"
 #include "qmp-commands.h"
 #include "qapi-visit.h"
 #include "qapi/qmp-output-visitor.h"
@@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
 info->iops_size = cfg.op_size;
 }
 
+info->write_threshold = bdrv_usage_threshold_get(bs);
+
 return info;
 }
 
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 000..9faf5e6
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "block/block_int.h"
+#include "block/coroutine.h"
+#include "block/usage-threshold.h"
+#include "qemu/notify.h"
+#include "qapi-event.h"
+#include "qmp-commands.h"
+
+
+uint64_t bdrv_usage_threshold_get(const BlockDriverState *bs)
+{
+return bs->write_threshold_offset;
+}
+
+bool bdrv_usage_threshold_is_set(const BlockDriverState *bs)
+{
+return !!(bs->write_threshold_offset > 0);
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+if (bdrv_usage_threshold_is_set(bs)) {
+notifier_with_return_remove(&bs->write_threshold_notifier);
+bs->write_threshold_offset = 0;
+}
+}
+
+uint64_t bdrv_usage_threshold_exceeded(const BlockDriverState *bs,
+   const BdrvTrackedRequest *req)
+{
+if (bdrv_usage_threshold_is_set(bs)) {
+if (req->offset > bs->write_threshold_offset) {
+return (req->offset - bs->write_threshold_offset) + req->bytes;
+}
+if ((req->offset + req->bytes) > bs->write_threshold_offset) {
+return (req->offset + req->bytes) - bs->write_threshold_offset;
+}
+}
+return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+BlockDriverState *bs = req->bs;
+uint64_t amount = 0;
+
+amount = bdrv_usage_threshold_exceeded(bs, req);
+if (amount > 0) {
+qapi_event_send_block_usage_threshold(
+bs->node_name,
+amount,
+bs->write_threshold_offset,
+&error_abort);
+
+/* autodisable to avoid to flood the monitor */
+usage_threshold_disable(bs);
+}
+
+return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+bs->write_threshold_notifier.notify = before_write_notify;
+notifier_with_return_list_add(&bs-