Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-21 Thread Francesco Romani
- Original Message -
 From: Kevin Wolf kw...@redhat.com
 To: Stefan Hajnoczi stefa...@redhat.com
 Cc: mdr...@linux.vnet.ibm.com, Stefan Hajnoczi stefa...@gmail.com, 
 lcapitul...@redhat.com, qemu-devel@nongnu.org,
 Francesco Romani from...@redhat.com
 Sent: Thursday, November 20, 2014 12:34:28 PM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices

One way to solve this is to require that the management tool tells QEMU
which exact BlockDriverState node the threshold applies to.  Then QEMU
doesn't need any hardcoded policy.  But I'm not sure how realistic that
is at the moment (whether management tools are using node names for each
node yet), so it may be best to hardcode the bs->file traversal that
I've suggested.

Kevin: Do you agree?
   
   I have a feeling that we would regret this in the long run because it
   would allow only one special case of a general problem (watching a BDS).
   This means that we'll get inconsistent APIs.
   
   We're only talking about an optimisation here, even though a very
   useful one, so I wouldn't easily make compromises here. We should
   probably insist on using the node-name. Management tools need new code
   anyway to make use of the new functionality, so they can implement
   node-name support as well while they're at it.
  
  Using node-name is the best thing to do.
  
  My concern is just whether libvirt and other management tools are
  actually using node-name yet.
 
 I don't think so. They also don't use blockdev-add yet.
 
 But that's not a reason for us to add hacks that allow libvirt and other
 management tools to avoid the proper APIs even in the future. They just
 need to add support for node-names if they want to use new qemu features.
 New features require support for new infrastructure, I think that's fair.
 
 If they feel that representing complete BDS graphs in their code is too
 much work for now, they can still keep temporary hacks with hardcoded
 assumptions in their management code (like setting file.node-name and
 ignoring other setups). At least it would be temporary hacks there; if
 we did them in qemu, they would be a permanent API.

I'm fine with using node_name in my patch; it even looks much simpler and cleaner.

I'd love to take this chance to learn more about the topic, because
I'm very near the limit of my knowledge in that area.
I joined the discussion quite late, so my sources are actually
pretty sparse. Mostly:
- staring at the sources and git history
- googling for specific bits
- presentations like 
http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf

Are there some sources I'm missing? Hopefully a nice wiki page I somehow missed :)

A couple more specific questions, mostly to make sure I can do meaningful
tests for my next submission:

1. I'm running a simple test using the attached script,
which is a qemu command line adapted from libvirt output driven
by oVirt. Is there a way to attach a name at this stage, using a QMP command?

2. (related to the former) from a not-so-deep look, it seems that the blessed
(only?) way to set a proper node_name is using blockdev-add.
If so, I'm not sure I follow what the qemu boot flow would look like.
It will no longer be as simple as crafting a command line and running qemu,
right?
IIUC some interaction with QMP will be needed (sorry for asking silly questions,
I'm trying to fill gaps in my knowledge).

Thanks for the great feedback!


-- 
Francesco Romani
RedHat Engineering Virtualization R&D
Phone: 8261328
IRC: fromani


qemu.sh
Description: application/shellscript


Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-21 Thread Kevin Wolf
Am 21.11.2014 um 09:43 hat Francesco Romani geschrieben:
 - Original Message -
  From: Kevin Wolf kw...@redhat.com
  To: Stefan Hajnoczi stefa...@redhat.com
  Cc: mdr...@linux.vnet.ibm.com, Stefan Hajnoczi stefa...@gmail.com, 
  lcapitul...@redhat.com, qemu-devel@nongnu.org,
  Francesco Romani from...@redhat.com
  Sent: Thursday, November 20, 2014 12:34:28 PM
  Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
  reporting for block devices
 
 One way to solve this is to require that the management tool tells QEMU
 which exact BlockDriverState node the threshold applies to.  Then QEMU
 doesn't need any hardcoded policy.  But I'm not sure how realistic that
 is at the moment (whether management tools are using node names for each
 node yet), so it may be best to hardcode the bs->file traversal that
 I've suggested.
 
 Kevin: Do you agree?

I have a feeling that we would regret this in the long run because it
would allow only one special case of a general problem (watching a BDS).
This means that we'll get inconsistent APIs.

We're only talking about an optimisation here, even though a very
useful one, so I wouldn't easily make compromises here. We should
probably insist on using the node-name. Management tools need new code
anyway to make use of the new functionality, so they can implement
node-name support as well while they're at it.
   
   Using node-name is the best thing to do.
   
   My concern is just whether libvirt and other management tools are
   actually using node-name yet.
  
  I don't think so. They also don't use blockdev-add yet.
  
  But that's not a reason for us to add hacks that allow libvirt and other
  management tools to avoid the proper APIs even in the future. They just
  need to add support for node-names if they want to use new qemu features.
  New features require support for new infrastructure, I think that's fair.
  
  If they feel that representing complete BDS graphs in their code is too
  much work for now, they can still keep temporary hacks with hardcoded
  assumptions in their management code (like setting file.node-name and
  ignoring other setups). At least it would be temporary hacks there; if
  we did them in qemu, they would be a permanent API.
 
 I'm fine with using node_name in my patch; it even looks much simpler and cleaner.
 
 I'd love to take this chance to learn more about the topic, because
 I'm very near the limit of my knowledge in that area.
 I joined the discussion quite late, so my sources are actually
 pretty sparse. Mostly:
 - staring at the sources and git history
 - googling for specific bits
 - presentations like 
 http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf

The video recording of that presentation wasn't really good quality,
unfortunately. But you can watch the recording of this year's presentation
by Max and myself (you're probably mostly interested in Max's part); the
title was "More Block Device Configuration".

 Are there some sources I'm missing? Hopefully a nice wiki page I somehow missed :)

I'm afraid that a single wiki page with a well-structured
presentation of all the information doesn't exist.

 A couple more specific questions, mostly to make sure I can do meaningful
 tests for my next submission:
 
 1. I'm running a simple test using the attached script,
 which is a qemu command line adapted from libvirt output driven
 by oVirt. Is there a way to attach a name at this stage, using a QMP command?

No, node-name is assigned at the BlockDriverState (BDS) creation and
can't be changed later on.

 2. (related to the former) from a not-so-deep look, it seems that the blessed
 (only?) way to set a proper node_name is using blockdev-add.
 If so, I'm not sure I follow what the qemu boot flow would look like.
 It will no longer be as simple as crafting a command line and running qemu,
 right?
 IIUC some interaction with QMP will be needed (sorry for asking silly
 questions, I'm trying to fill gaps in my knowledge).

-drive on the command line can do everything that blockdev-add can do.
So let's assume you have a qcow2 image on a filesystem. Then you end up
with two BDSes, one for the format driver and one for accessing the
filesystem:

BlockBackend (virtual device) -> qcow2 BDS -> file BDS (raw-posix.c)

For assigning a node name to the qcow2 BDS, you simply specify it in the
obvious way:

-drive file=test.qcow2,node-name=foo

Now if you want to assign a node name to the file BDS as well, you would
get nested dicts in the blockdev-add call. In -drive a dot syntax is
used to represent this:

-drive file=test.qcow2,node-name=foo,file.node-name=bar
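
As a rough illustration of the dot syntax (this is not QEMU's actual option
parser), the sketch below splits a dotted key such as file.node-name into its
path components, which is how a flat -drive option can address an entry in the
nested dicts of a blockdev-add call. The helper name split_dotted_key and the
fixed size limits are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARTS 8
#define MAX_PART_LEN 32

/*
 * Hypothetical helper: split "file.node-name" into {"file", "node-name"}.
 * Returns the number of components, or -1 if a component is empty or the
 * key exceeds the fixed limits above.
 */
static int split_dotted_key(const char *key,
                            char parts[MAX_PARTS][MAX_PART_LEN])
{
    int count = 0;
    const char *start = key;

    assert(key != NULL);

    for (;;) {
        const char *dot = strchr(start, '.');
        size_t len = dot ? (size_t)(dot - start) : strlen(start);

        if (len == 0 || len >= MAX_PART_LEN || count == MAX_PARTS) {
            return -1;
        }
        memcpy(parts[count], start, len);
        parts[count][len] = '\0';
        count++;

        if (!dot) {
            return count;   /* last component consumed */
        }
        start = dot + 1;    /* continue after the dot */
    }
}
```

With this helper, "node-name" yields a single component (an option of the
top-level qcow2 node), while "file.node-name" yields two, meaning "the
node-name option inside the file child".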

Are things a bit clearer with this?

Kevin



Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-21 Thread Francesco Romani
- Original Message -
 From: Kevin Wolf kw...@redhat.com
 To: Francesco Romani from...@redhat.com
 Cc: qemu-devel@nongnu.org, Stefan Hajnoczi stefa...@gmail.com, 
 mdr...@linux.vnet.ibm.com, Luiz Capitulino
 lcapitul...@redhat.com, Stefan Hajnoczi stefa...@redhat.com
 Sent: Friday, November 21, 2014 11:11:26 AM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices
[...]
  1. I'm running a simple test using the attached script,
  which is a qemu command line adapted from libvirt output driven
  by oVirt. Is there a way to attach a name at this stage, using a QMP
  command?
 
 No, node-name is assigned at the BlockDriverState (BDS) creation and
 can't be changed later on.

Makes sense to me.

  2. (related to the former) from a not-so-deep look, it seems that the
  blessed (only?) way to set a proper node_name is using blockdev-add.
  If so, I'm not sure I follow what the qemu boot flow would look like.
  It will no longer be as simple as crafting a command line and running
  qemu, right?
  IIUC some interaction with QMP will be needed (sorry for asking silly
  questions, I'm trying to fill gaps in my knowledge).
 
 -drive on the command line can do everything that blockdev-add can do.
 So let's assume you have a qcow2 image on a filesystem. Then you end up
 with two BDSes, one for the format driver and one for accessing the
 filesystem:
 
 BlockBackend (virtual device) -> qcow2 BDS -> file BDS (raw-posix.c)
 
 For assigning a node name to the qcow2 BDS, you simply specify it in the
 obvious way:
 
 -drive file=test.qcow2,node-name=foo
 
 Now if you want to assign a node name to the file BDS as well, you would
 get nested dicts in the blockdev-add call. In -drive a dot syntax is
 used to represent this:
 
 -drive file=test.qcow2,node-name=foo,file.node-name=bar
 
 Are things a bit clearer with this?

Yes, thanks a lot. I was a bit misled by the lack of any reference to this
in the man page (after a very quick look).
Maybe the man page is out of date, but that is a different story, and maybe a
different patch :)

New revision will come in a few days.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R&D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-21 Thread Eric Blake
On 11/21/2014 01:43 AM, Francesco Romani wrote:

 A couple more specific questions, mostly to make sure I can do meaningful
 tests for my next submission:
 
 1. I'm running a simple test using the attached script,
 which is a qemu command line adapted from libvirt output driven
 by oVirt. Is there a way to attach a name at this stage, using a QMP command?

Libvirt isn't yet attaching node names, and right now, there is no way
to retroactively attach a node name (only at creation).  Jeff Cody
proposed a patch prior to 2.1 that would give ALL nodes a generated name
if one was not supplied, but we still haven't taken that patch in, and
by now it probably needs rebasing...

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-20 Thread Francesco Romani
- Original Message -
 From: Stefan Hajnoczi stefa...@redhat.com
 To: Francesco Romani from...@redhat.com
 Cc: kw...@redhat.com, Stefan Hajnoczi stefa...@gmail.com, 
 mdr...@linux.vnet.ibm.com, qemu-devel@nongnu.org,
 lcapitul...@redhat.com
 Sent: Wednesday, November 19, 2014 4:52:51 PM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices
 
 On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote:
    +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
    +                                            void *opaque)
    +{
    +    BdrvTrackedRequest *req = opaque;
    +    BlockDriverState *bs = req->bs;
    +    int64_t amount = 0;
    +
    +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
    +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
   
   Does the code still make these assumptions or are the asserts left over
   from previous versions of the patch?
  
  It's a leftover.
  I understood they don't hurt and add a bit of safety, but if they are
  confusing
  I'll remove them.
 
 Yes, it made me wonder why.  Probably best to remove them.

Will do

[...]
  At the risk of being redundant, in oVirt the use case is QCOW2 over an LVM
  block device, and we'd like to be notified about the allocation of the LVM
  block device, which (IIUC) is the last bs->file.
  
  This is a simple topology (unless I'm missing something), and that's
  the reason why I go down just one level.
  
  Of course I want a general solution for this change, so...
 
 There is a block driver for error injection called blkdebug (see
 docs/blkdebug.txt).  Here is an example of the following topology:
 
   raw_bsd (drive0) -> blkdebug -> raw-posix (test.img)
 
 qemu-system-x86_64 -drive
 if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img
 
 The blkdebug driver is interposing between the raw_bsd (drive0) root and
 the raw-posix leaf node.

Thanks, I'll have a look

[...]
 The management tool should not need to inspect the graph because the
 graph can change at runtime (e.g. imagine I/O throttling is implemented
 as a BlockDriverState node then it could appear/disappear when the
 feature is activated/deactivated).  Instead the management tool should
 name the nodes it knows about and then use those node names.

Agreed, and it is indeed simpler for us (oVirt), which doesn't hurt :)

  If we descend the bs->file chain, AFAIU the easiest mapping, the device
  name, is easily lost because only the outermost BlockDriverState has a
  device name attached, so when the notification triggers,
  bdrv_get_device_name() will return NULL.
 
 In the worst case a name string can be passed in along with the
 threshold values.

OK, I guess keeping a copy of the string with g_strdup() could be a good enough
start, at least for further discussion.

Thanks for your review and for the information.
I'll submit a new revision of the patch in a couple of days,
to give other reviewers some time to jump in.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R&D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-20 Thread Kevin Wolf
Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben:
 On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
  +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
  +{
  +    BlockDriverState *target_bs = bs;
  +    if (bs->file) {
  +        target_bs = bs->file;
  +    }
 
 Hmm...I think now I understand why you are trying to use bs->file.  This
 is an attempt to make image formats work with the threshold.
 
 Unfortunately the BlockDriverState topology can be more complicated than
 just 1 level.
 
 If we hardcode a strategy to traverse bs->file then it will work in most
 cases:
 
   while (bs->file) {
       bs = bs->file;
   }
 
 But there are cases like VMDK extent files where a BlockDriverState
 actually has multiple children.
 
 One way to solve this is to require that the management tool tells QEMU
 which exact BlockDriverState node the threshold applies to.  Then QEMU
 doesn't need any hardcoded policy.  But I'm not sure how realistic that
 is at the moment (whether management tools are using node names for each
 node yet), so it may be best to hardcode the bs->file traversal that
 I've suggested.
 
 Kevin: Do you agree?

I have a feeling that we would regret this in the long run because it
would allow only one special case of a general problem (watching a BDS).
This means that we'll get inconsistent APIs.

We're only talking about an optimisation here, even though a very
useful one, so I wouldn't easily make compromises here. We should
probably insist on using the node-name. Management tools need new code
anyway to make use of the new functionality, so they can implement
node-name support as well while they're at it.
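
To make the "one special case" concrete, here is a minimal sketch using a
simplified stand-in for BlockDriverState (the Node type, its extents array and
both helpers are hypothetical, not QEMU code). Following ->file reaches a
single leaf, while a node with several children has leaves that the hardcoded
traversal never visits.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for BlockDriverState; not the real QEMU struct. */
typedef struct Node {
    const char *node_name;
    struct Node *file;        /* the single "file" child, if any */
    struct Node *extents[4];  /* hypothetical extra children (VMDK-like) */
    size_t n_extents;
} Node;

/* The hardcoded policy under discussion: follow ->file until it ends. */
static Node *descend_file_chain(Node *bs)
{
    assert(bs != NULL);
    while (bs->file) {
        bs = bs->file;
    }
    return bs;
}

/* Count every node reachable from bs through ->file and the extra children. */
static size_t count_reachable(const Node *bs)
{
    size_t total = 1;

    if (bs->file) {
        total += count_reachable(bs->file);
    }
    for (size_t i = 0; i < bs->n_extents; i++) {
        total += count_reachable(bs->extents[i]);
    }
    return total;
}
```

For a qcow2-over-LVM pair the traversal lands on the LVM node, which is what
oVirt wants. For a node with two extra children, count_reachable() sees four
nodes but descend_file_chain() can only ever return one of them, so a policy
baked into QEMU would silently pick a node the user may not care about.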

Kevin




Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-20 Thread Stefan Hajnoczi
On Thu, Nov 20, 2014 at 11:30:53AM +0100, Kevin Wolf wrote:
 Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben:
  On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
   +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
   +{
   +    BlockDriverState *target_bs = bs;
   +    if (bs->file) {
   +        target_bs = bs->file;
   +    }
  
  Hmm...I think now I understand why you are trying to use bs->file.  This
  is an attempt to make image formats work with the threshold.
  
  Unfortunately the BlockDriverState topology can be more complicated than
  just 1 level.
  
  If we hardcode a strategy to traverse bs->file then it will work in most
  cases:
  
    while (bs->file) {
        bs = bs->file;
    }
  
  But there are cases like VMDK extent files where a BlockDriverState
  actually has multiple children.
  
  One way to solve this is to require that the management tool tells QEMU
  which exact BlockDriverState node the threshold applies to.  Then QEMU
  doesn't need any hardcoded policy.  But I'm not sure how realistic that
  is at the moment (whether management tools are using node names for each
  node yet), so it may be best to hardcode the bs->file traversal that
  I've suggested.
  
  Kevin: Do you agree?
 
 I have a feeling that we would regret this in the long run because it
 would allow only one special case of a general problem (watching a BDS).
 This means that we'll get inconsistent APIs.
 
 We're only talking about an optimisation here, even though a very
 useful one, so I wouldn't easily make compromises here. We should
 probably insist on using the node-name. Management tools need new code
 anyway to make use of the new functionality, so they can implement
 node-name support as well while they're at it.

Using node-name is the best thing to do.

My concern is just whether libvirt and other management tools are
actually using node-name yet.

Stefan




Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-20 Thread Kevin Wolf
Am 20.11.2014 um 12:04 hat Stefan Hajnoczi geschrieben:
 On Thu, Nov 20, 2014 at 11:30:53AM +0100, Kevin Wolf wrote:
  Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben:
   On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
    +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
    +{
    +    BlockDriverState *target_bs = bs;
    +    if (bs->file) {
    +        target_bs = bs->file;
    +    }
   
   Hmm...I think now I understand why you are trying to use bs->file.  This
   is an attempt to make image formats work with the threshold.
   
   Unfortunately the BlockDriverState topology can be more complicated than
   just 1 level.
   
   If we hardcode a strategy to traverse bs->file then it will work in most
   cases:
   
     while (bs->file) {
         bs = bs->file;
     }
   
   But there are cases like VMDK extent files where a BlockDriverState
   actually has multiple children.
   
   One way to solve this is to require that the management tool tells QEMU
   which exact BlockDriverState node the threshold applies to.  Then QEMU
   doesn't need any hardcoded policy.  But I'm not sure how realistic that
   is at the moment (whether management tools are using node names for each
   node yet), so it may be best to hardcode the bs->file traversal that
   I've suggested.
   
   Kevin: Do you agree?
  
  I have a feeling that we would regret this in the long run because it
  would allow only one special case of a general problem (watching a BDS).
  This means that we'll get inconsistent APIs.
  
  We're only talking about an optimisation here, even though a very
  useful one, so I wouldn't easily make compromises here. We should
  probably insist on using the node-name. Management tools need new code
  anyway to make use of the new functionality, so they can implement
  node-name support as well while they're at it.
 
 Using node-name is the best thing to do.
 
 My concern is just whether libvirt and other management tools are
 actually using node-name yet.

I don't think so. They also don't use blockdev-add yet.

But that's not a reason for us to add hacks that allow libvirt and other
management tools to avoid the proper APIs even in the future. They just
need to add support for node-names if they want to use new qemu features.
New features require support for new infrastructure, I think that's fair.

If they feel that representing complete BDS graphs in their code is too
much work for now, they can still keep temporary hacks with hardcoded
assumptions in their management code (like setting file.node-name and
ignoring other setups). At least it would be temporary hacks there; if
we did them in qemu, they would be a permanent API.

Kevin




Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-20 Thread Eric Blake
On 11/20/2014 04:04 AM, Stefan Hajnoczi wrote:

 We're only talking about an optimisation here, even though a very
 useful one, so I wouldn't easily make compromises here. We should
 probably insist on using the node-name. Management tools need new code
 anyway to make use of the new functionality, so they can implement
 node-name support as well while they're at it.
 
 Using node-name is the best thing to do.
 
 My concern is just whether libvirt and other management tools are
 actually using node-name yet.

Libvirt is not yet using it, but the more compelling we make it, the
more libvirt will accelerate the efforts needed to start using
node-name.  I'm okay with requiring node-name here.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-19 Thread Stefan Hajnoczi
On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote:
   +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
   +                                            void *opaque)
   +{
   +    BdrvTrackedRequest *req = opaque;
   +    BlockDriverState *bs = req->bs;
   +    int64_t amount = 0;
   +
   +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
   +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
  
  Does the code still make these assumptions or are the asserts left over
  from previous versions of the patch?
 
 It's a leftover.
 I understood they don't hurt and add a bit of safety, but if they are 
 confusing
 I'll remove them.

Yes, it made me wonder why.  Probably best to remove them.

   +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
   +{
   +    BlockDriverState *target_bs = bs;
   +    if (bs->file) {
   +        target_bs = bs->file;
   +    }
  
  Hmm...I think now I understand why you are trying to use bs->file.  This
  is an attempt to make image formats work with the threshold.
  
  Unfortunately the BlockDriverState topology can be more complicated than
  just 1 level.
 
 I thought so, but I can't yet reproduce more complex topologies.
 Disclosure: I'm testing against the topology I know to be used in oVirt,
 lacking immediate availability of others: suggestions welcome.
 
 At the risk of being redundant, in oVirt the use case is QCOW2 over an LVM
 block device, and we'd like to be notified about the allocation of the LVM
 block device, which (IIUC) is the last bs->file.
 
 This is a simple topology (unless I'm missing something), and that's
 the reason why I go down just one level.
 
 Of course I want a general solution for this change, so...

There is a block driver for error injection called blkdebug (see
docs/blkdebug.txt).  Here is an example of the following topology:

  raw_bsd (drive0) -> blkdebug -> raw-posix (test.img)

qemu-system-x86_64 -drive 
if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img

The blkdebug driver is interposing between the raw_bsd (drive0) root and
the raw-posix leaf node.

  If we hardcode a strategy to traverse bs->file then it will work in most
  cases:
  
    while (bs->file) {
        bs = bs->file;
    }
  
  But there are cases like VMDK extent files where a BlockDriverState
  actually has multiple children.
  
  One way to solve this is to require that the management tool tells QEMU
  which exact BlockDriverState node the threshold applies to.  Then QEMU
  doesn't need any hardcoded policy.  But I'm not sure how realistic that
  is at the moment (whether management tools are using node names for each
  node yet), so it may be best to hardcode the bs->file traversal that
  I've suggested.
 
 oVirt relies on libvirt[1], and uses device node (e.g. 'vda').
 
 BTW, I haven't found a way to inspect the BlockDriverState topology from the
 outside (e.g. with a monitor command); is there a way I'm missing?

You can get the BlockDriverState and ->backing_hd chain using the
query-block QMP command.  I'm not aware of a command that returns the
full graph of BlockDriverState nodes.

The management tool should not need to inspect the graph because the
graph can change at runtime (e.g. imagine I/O throttling is implemented
as a BlockDriverState node then it could appear/disappear when the
feature is activated/deactivated).  Instead the management tool should
name the nodes it knows about and then use those node names.
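
A small sketch of why names survive graph changes, again with a simplified
stand-in type (NamedNode and find_by_node_name are hypothetical, not QEMU
code): the lookup goes by name, so it does not matter whether an unnamed
filter node currently sits between the root and the node of interest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a named graph node; not the real BlockDriverState. */
typedef struct NamedNode {
    const char *node_name;           /* NULL if nobody assigned a name */
    struct NamedNode *children[4];
    size_t n_children;
} NamedNode;

/* Depth-first search for the node carrying the given name. */
static NamedNode *find_by_node_name(NamedNode *root, const char *name)
{
    assert(name != NULL);

    if (!root) {
        return NULL;
    }
    if (root->node_name && strcmp(root->node_name, name) == 0) {
        return root;
    }
    for (size_t i = 0; i < root->n_children; i++) {
        NamedNode *hit = find_by_node_name(root->children[i], name);
        if (hit) {
            return hit;
        }
    }
    return NULL;
}
```

If a throttling node is inserted or removed at runtime, the position of the
named node in the chain changes, but find_by_node_name(root, "bar") keeps
returning the same node.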

 Another issue I don't yet have a proper solution for is related to this one.
 
 The management app will have to deal with VMs with more than one block
 device, so we need a way to map the notification to the corresponding device.
 
 If we descend the bs->file chain, AFAIU the easiest mapping, the device name,
 is easily lost because only the outermost BlockDriverState has a device name
 attached, so when the notification triggers,
 bdrv_get_device_name() will return NULL.

In the worst case a name string can be passed in along with the
threshold values.

 I believe we don't necessarily need a device name in the notification; for
 example, something like an opaque cookie set together with the threshold and
 returned with the notification will suffice. Of course the device name is
 nicer :)

Agreed.




Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-18 Thread Francesco Romani
- Original Message -
 From: Stefan Hajnoczi stefa...@gmail.com
 To: Francesco Romani from...@redhat.com
 Cc: kw...@redhat.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, 
 stefa...@redhat.com, mdr...@linux.vnet.ibm.com
 Sent: Monday, November 17, 2014 5:49:36 PM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices
 
 On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
 
 Sorry for the long review delay.  Looks pretty good, just one real issue
 to think about at the bottom.

Hi Stefan, thanks for the review and no problem for the delay :)

 
  +static void usage_threshold_disable(BlockDriverState *bs)
  +{
 
 It would be safest to make this idempotent:
 
 if (!usage_threshold_is_set(bs)) {
     return;
 }
 
 That way it can be called multiple times.

Will change.

 
  +    notifier_with_return_remove(&bs->wr_usage_threshold_notifier);
  +    bs->wr_offset_threshold = 0;
  +}
  +
  +static int usage_threshold_is_set(const BlockDriverState *bs)
  +{
  +    return !!(bs->wr_offset_threshold > 0);
  +}
 
 Please use the bool type instead of an int return value.

Sure, will fix.

  +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
  +                                            void *opaque)
  +{
  +    BdrvTrackedRequest *req = opaque;
  +    BlockDriverState *bs = req->bs;
  +    int64_t amount = 0;
  +
  +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
  +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
 Does the code still make these assumptions or are the asserts left over
 from previous versions of the patch?

It's a leftover.
I understood they don't hurt and add a bit of safety, but if they are confusing
I'll remove them.

  +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
  +{
  +    BlockDriverState *target_bs = bs;
  +    if (bs->file) {
  +        target_bs = bs->file;
  +    }
 
 Hmm...I think now I understand why you are trying to use bs->file.  This
 is an attempt to make image formats work with the threshold.
 
 Unfortunately the BlockDriverState topology can be more complicated than
 just 1 level.

I thought so, but I can't yet reproduce more complex topologies.
Disclosure: I'm testing against the topology I know to be used in oVirt, lacking
immediate availability of others: suggestions welcome.

At the risk of being redundant, in oVirt the use case is QCOW2 over an LVM block
device, and we'd like to be notified about the allocation of the LVM block
device, which (IIUC) is the last bs->file.

This is a simple topology (unless I'm missing something), and that's
the reason why I go down just one level.

Of course I want a general solution for this change, so...

 If we hardcode a strategy to traverse bs->file then it will work in most
 cases:
 
   while (bs->file) {
       bs = bs->file;
   }
 
 But there are cases like VMDK extent files where a BlockDriverState
 actually has multiple children.
 
 One way to solve this is to require that the management tool tells QEMU
 which exact BlockDriverState node the threshold applies to.  Then QEMU
 doesn't need any hardcoded policy.  But I'm not sure how realistic that
 is at the moment (whether management tools are using node names for each
 node yet), so it may be best to hardcode the bs->file traversal that
 I've suggested.

oVirt relies on libvirt[1], and uses device node (e.g. 'vda').

BTW, I haven't found a way to inspect the BlockDriverState topology from the
outside (e.g. with a monitor command); is there a way I'm missing?

Another issue I don't yet have a proper solution for is related to this one.

The management app will have to deal with VMs with more than one block device,
so we need a way to map the notification to the corresponding device.

If we descend the bs->file chain, AFAIU the easiest mapping, the device name,
is easily lost because only the outermost BlockDriverState has a device name
attached, so when the notification triggers,
bdrv_get_device_name() will return NULL.

I believe we don't necessarily need a device name in the notification; for
example, something like an opaque cookie set together with the threshold and
returned with the notification will suffice. Of course the device name is nicer :)
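
A minimal sketch of the cookie idea, under stated assumptions: the
ThresholdState type and the threshold_* helpers are hypothetical names, and
plain malloc()/free() are used where QEMU code would more likely use the GLib
allocators. The caller's identifier is copied when the threshold is set and
handed back when a write crosses it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-node threshold state; field names are illustrative. */
typedef struct ThresholdState {
    uint64_t threshold_bytes;  /* 0 means "not set" */
    char *cookie;              /* private copy of the caller's identifier */
} ThresholdState;

/* Store the threshold plus a copy of the cookie. Returns 0, or -1 on OOM. */
static int threshold_set(ThresholdState *st, uint64_t threshold_bytes,
                         const char *cookie)
{
    size_t len;
    char *copy;

    assert(st != NULL && cookie != NULL);

    len = strlen(cookie) + 1;
    copy = malloc(len);
    if (!copy) {
        return -1;
    }
    memcpy(copy, cookie, len);

    free(st->cookie);          /* drop any previous cookie */
    st->cookie = copy;
    st->threshold_bytes = threshold_bytes;
    return 0;
}

/* Cookie to report if a write ending at write_end crosses the threshold. */
static const char *threshold_check(const ThresholdState *st, uint64_t write_end)
{
    if (st->threshold_bytes > 0 && write_end > st->threshold_bytes) {
        return st->cookie;
    }
    return NULL;
}

/* Release the cookie and mark the threshold as unset. */
static void threshold_clear(ThresholdState *st)
{
    free(st->cookie);
    st->cookie = NULL;
    st->threshold_bytes = 0;
}
```

The state must start zero-initialized so that the first free() sees a NULL
cookie. The management application can then match the returned string against
whatever identifier it chose, without QEMU having to know a device name.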

+++

[1] in case it helps to understand the use case further, these are the libvirt
APIs being used by oVirt:
http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockStatsFlags
http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetBlockInfo
both rely on the output[2] of the 'query-blockstats' monitor command.

[2] AFAIU (but this is just my guesswork!) it also assumes a quite simple
topology, like I did

Thanks and bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R&D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-17 Thread Stefan Hajnoczi
On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:

Sorry for the long review delay.  Looks pretty good, just one real issue
to think about at the bottom.

 +static void usage_threshold_disable(BlockDriverState *bs)
 +{

It would be safest to make this idempotent:

    if (!usage_threshold_is_set(bs)) {
        return;
    }

That way it can be called multiple times.

 +    notifier_with_return_remove(&bs->wr_usage_threshold_notifier);
 +    bs->wr_offset_threshold = 0;
 +}
 +
 +static int usage_threshold_is_set(const BlockDriverState *bs)
 +{
 +    return !!(bs->wr_offset_threshold > 0);
 +}

Please use the bool type instead of an int return value.
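
Putting the two comments above together, a rough standalone sketch could
look like the following.  ToyBDS and its notifier_removals counter are
hypothetical stand-ins for BlockDriverState and the real
notifier_with_return_remove() call, so this compiles outside the QEMU
tree; only the function names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the relevant BlockDriverState fields (hypothetical). */
typedef struct ToyBDS {
    int64_t wr_offset_threshold; /* 0 means "no threshold set" */
    int notifier_removals;       /* counts removals, for illustration only */
} ToyBDS;

static bool usage_threshold_is_set(const ToyBDS *bs)
{
    assert(bs != NULL);
    return bs->wr_offset_threshold > 0;
}

static void usage_threshold_disable(ToyBDS *bs)
{
    if (!usage_threshold_is_set(bs)) {
        return; /* already disabled, nothing to remove */
    }
    bs->notifier_removals++; /* stands in for notifier_with_return_remove() */
    bs->wr_offset_threshold = 0;
}
```

Note that in the patch usage_threshold_disable() is defined before
usage_threshold_is_set(), so the real code would need to reorder them or
add a forward declaration.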

 +
 +static int64_t usage_threshold_exceeded(const BlockDriverState *bs,
 +                                        const BdrvTrackedRequest *req)
 +{
 +    if (usage_threshold_is_set(bs)) {
 +        int64_t amount = req->offset + req->bytes - bs->wr_offset_threshold;
 +        if (amount > 0) {
 +            return amount;
 +        }
 +    }
 +    return 0;
 +}
 +
 +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
 +                                            void *opaque)
 +{
 +    BdrvTrackedRequest *req = opaque;
 +    BlockDriverState *bs = req->bs;
 +    int64_t amount = 0;
 +
 +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
 +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);

Does the code still make these assumptions or are the asserts left over
from previous versions of the patch?

 +
 +    amount = usage_threshold_exceeded(bs, req);
 +    if (amount > 0) {
 +        qapi_event_send_block_usage_threshold(
 +            bdrv_get_device_name(bs), /* FIXME: this does not work */
 +            amount,
 +            bs->wr_offset_threshold,
 +            &error_abort);
 +
 +        /* autodisable to avoid to flood the monitor */
 +        usage_threshold_disable(bs);
 +    }
 +
 +    return 0; /* should always let other notifiers run */
 +}
 +
 +static void usage_threshold_register_notifier(BlockDriverState *bs)
 +{
 +    bs->wr_usage_threshold_notifier.notify = before_write_notify;
 +    notifier_with_return_list_add(&bs->before_write_notifiers,
 +                                  &bs->wr_usage_threshold_notifier);
 +}
 +
 +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
 +{
 +    BlockDriverState *target_bs = bs;
 +    if (bs->file) {
 +        target_bs = bs->file;
 +    }

Hmm...I think now I understand why you are trying to use bs->file.  This
is an attempt to make image formats work with the threshold.

Unfortunately the BlockDriverState topology can be more complicated than
just 1 level.

If we hardcode a strategy to traverse bs->file then it will work in most
cases:

  while (bs->file) {
      bs = bs->file;
  }

But there are cases like VMDK extent files where a BlockDriverState
actually has multiple children.
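
To illustrate the difference, here is a small standalone sketch.  ToyNode
is a made-up structure, only loosely modelled on BlockDriverState; the
extents array is an assumption used to mimic a format driver with several
children, not the actual VMDK data layout.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_EXTENTS 4

/* Toy node, loosely modelled on BlockDriverState (hypothetical layout). */
typedef struct ToyNode {
    const char *name;
    struct ToyNode *file;                     /* the single bs->file child */
    struct ToyNode *extents[TOY_MAX_EXTENTS]; /* extra children, e.g. extents */
    size_t n_extents;
} ToyNode;

/* Follow the file chain to the bottom; extents are never visited. */
static ToyNode *toy_descend_file(ToyNode *bs)
{
    assert(bs != NULL);
    while (bs->file) {
        bs = bs->file;
    }
    return bs;
}

/* Count the leaf nodes reachable through file or extents. */
static size_t toy_count_leaves(const ToyNode *bs)
{
    size_t leaves = 0;
    size_t i;

    if (bs->file == NULL && bs->n_extents == 0) {
        return 1;
    }
    if (bs->file) {
        leaves += toy_count_leaves(bs->file);
    }
    for (i = 0; i < bs->n_extents; i++) {
        leaves += toy_count_leaves(bs->extents[i]);
    }
    return leaves;
}
```

For a plain format-over-file chain both functions agree on the single
leaf.  As soon as a node has extents, toy_descend_file() still returns
exactly one node, while toy_count_leaves() reports several places where
writes can land, so a threshold on the node found by the traversal would
miss the others.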

One way to solve this is to require that the management tool tells QEMU
which exact BlockDriverState node the threshold applies to.  Then QEMU
doesn't need any hardcoded policy.  But I'm not sure how realistic that
is at the moment (whether management tools use node names for each
node yet), so it may be best to hardcode the bs->file traversal that
I've suggested.

Kevin: Do you agree?


[Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-07 Thread Francesco Romani
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and not be unnecessarily paused, oVirt sets
a disk usage threshold (a so-called 'high water mark') based on the occupation
of the device, and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
to aggressively poll the QEMU monitor using the query-blockstats command.
This leads to unnecessary system load, and gets even worse at scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani from...@redhat.com
---
 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  39 +
 qapi/block-core.json            |  46 ++-
 qmp-commands.hx                 |  26 +
 7 files changed, 242 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index 1301144..3bb0bc7 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/usage-threshold.h"
 #include "qmp-commands.h"
 #include "qapi-visit.h"
 #include "qapi/qmp-output-visitor.h"
@@ -315,6 +316,8 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo **p_info,
 }
 }
 
+    info->write_threshold = bdrv_get_usage_threshold(bs);
+
     *p_info = info;
     return;
 
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 0000000..31a587d
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "block/block_int.h"
+#include "block/coroutine.h"
+#include "block/usage-threshold.h"
+#include "qemu/notify.h"
+#include "qapi-event.h"
+#include "qmp-commands.h"
+
+
+int64_t bdrv_get_usage_threshold(const BlockDriverState *bs)
+{
+    if (bs == NULL) {
+        return 0;
+    }
+    if (bs->file) {
+        return bs->file->wr_offset_threshold;
+    }
+    return bs->wr_offset_threshold;
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+    notifier_with_return_remove(&bs->wr_usage_threshold_notifier);
+    bs->wr_offset_threshold = 0;
+}
+
+static int usage_threshold_is_set(const BlockDriverState *bs)
+{
+    return !!(bs->wr_offset_threshold > 0);
+}
+
+static int64_t usage_threshold_exceeded(const BlockDriverState *bs,
+                                        const BdrvTrackedRequest *req)
+{
+    if (usage_threshold_is_set(bs)) {
+        int64_t amount = req->offset + req->bytes - bs->wr_offset_threshold;
+        if (amount > 0) {
+            return amount;
+        }
+    }
+    return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+                                            void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+    BlockDriverState *bs = req->bs;
+    int64_t amount = 0;
+
+    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    amount = usage_threshold_exceeded(bs, req);
+    if (amount > 0) {
+        qapi_event_send_block_usage_threshold(
+            bdrv_get_device_name(bs), /* FIXME: this does not work */
+            amount,
+            bs->wr_offset_threshold,
+            &error_abort);
+
+        /* autodisable to avoid to flood the monitor */
+        usage_threshold_disable(bs);
+    }
+
+    return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+    bs->wr_usage_threshold_notifier.notify = before_write_notify;
+    notifier_with_return_list_add(&bs->before_write_notifiers,
+                                  &bs->wr_usage_threshold_notifier);
+}
+
+void