Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-30 Thread Paolo Bonzini


On 29/05/2015 13:11, Andrey Korolyov wrote:
 Sorry for a potential thread hijack, but I'm curious about the reasons
 for not making the advertised queue depth for non-passthrough backends an
 independent tunable; are there any concerns behind that?

It certainly can be made tunable.  Usually it is bounded actually, for
example PATA has a queue depth of 1 (!) while AHCI has 32.  For virtio
(blk and scsi) it actually is already somewhat tunable, since the queue
depth is limited by the size of the virtqueue.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-29 Thread Andrey Korolyov
On Thu, May 28, 2015 at 3:05 PM, Fam Zheng f...@redhat.com wrote:
 On Thu, 05/28 13:19, Paolo Bonzini wrote:


 On 28/05/2015 13:11, Fam Zheng wrote:
   Whoever uses ioeventfd needs to implement pause/resume, yes---not just
   dataplane, also regular virtio-blk/virtio-scsi.
  
   However, everyone else should be okay, because the bottom half runs
   immediately and the big QEMU lock is not released in the meanwhile.  So
   the CPUs have no occasion to run.  This needs a comment!
 
  I'm not sure. It seems timer callbacks also do I/O, for example
  nvme_process_sq().

 Right, that's also true for USB devices. :(

 Perhaps we can skip block_job_defer_to_main_loop if not necessary
 (bs->aio_context == qemu_get_aio_context()).

 I think so. It will make dataplane even more specialized but that seems the
 only way to fix the problem at the moment.

 Fam


Sorry for a potential thread hijack, but I`m curious about the reasons
to not making advertised queue depth for non-passthrough backends an
independent tunable, is there any concerns behind that?



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Paolo Bonzini


On 28/05/2015 04:49, Fam Zheng wrote:
 
 On second thought, although in this series we only need to modify
 virtio-{scsi,blk}, Wen pointed out that the mirror job also wants this during
 completion (because bdrv_swap is in a BH after the job coroutine returns).

Mirror needs to pause/resume the source.  It doesn't need to handle
pause/resume of the target, does it?

 So, in order for the solution to be general and complete, this notifier list
 approach relies on the listening devices to do the right thing, which requires
 modifying all of them, and is harder to maintain.

Isn't it only devices that use aio_set_event_notifier for their ioeventfd?

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Fam Zheng
On Thu, 05/28 11:40, Kevin Wolf wrote:
 Indeed. blk_pause/resume would handle everything in one central place
 in the block layer instead of spreading the logic across all the block
 layer users.

Sorry, I'm confused. Do you mean there is a way to implement blk_pause
completely in block layer, without the necessity of various notifier handlers
in device models?

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Paolo Bonzini


On 28/05/2015 13:11, Fam Zheng wrote:
  Whoever uses ioeventfd needs to implement pause/resume, yes---not just
  dataplane, also regular virtio-blk/virtio-scsi.
  
  However, everyone else should be okay, because the bottom half runs
  immediately and the big QEMU lock is not released in the meanwhile.  So
  the CPUs have no occasion to run.  This needs a comment!
 
 I'm not sure. It seems timer callbacks also do I/O, for example
 nvme_process_sq().

Right, that's also true for USB devices. :(

Perhaps we can skip block_job_defer_to_main_loop if not necessary
(bs->aio_context == qemu_get_aio_context()).

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Paolo Bonzini


On 28/05/2015 13:44, Fam Zheng wrote:
  The reason for doing it in the block layer is that it's in one place and
  we can be sure that it's applied. We can still in addition modify
  specific users to avoid even trying to send requests, but I think it's
  good to have the single place that always ensures correct functionality
  of the drain instead of making it dependent on the user.
 
 How to do that for the synchronous blk_write callers that don't run in
 a coroutine?

They would be completely oblivious to it.

Their call to blk_co_write would queue the request.  Then blk_write
calls aio_poll, which ultimately would result in blk_resume and unqueue
the request.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Paolo Bonzini


On 28/05/2015 12:46, Fam Zheng wrote:
  
  Mirror needs to pause/resume the source.  It doesn't need to handle
  pause/resume of the target, does it?
  
   So, in order for the solution to be general and complete, this notifier
   list approach relies on the listening devices to do the right thing,
   which requires modifying all of them, and is harder to maintain.
  
  Isn't it only devices that use aio_set_event_notifier for their ioeventfd?
 I think it's only the case for qmp_transaction, mirror job needs all block
 devices to implement pause: mirror_run guarantees source and target in sync
 when it returns; but bdrv_swap is deferred to a main loop bottom half -  what
 if there is a guest write to source in between?

Whoever uses ioeventfd needs to implement pause/resume, yes---not just
dataplane, also regular virtio-blk/virtio-scsi.

However, everyone else should be okay, because the bottom half runs
immediately and the big QEMU lock is not released in the meanwhile.  So
the CPUs have no occasion to run.  This needs a comment!

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Fam Zheng
On Thu, 05/28 13:24, Kevin Wolf wrote:
 Am 28.05.2015 um 13:00 hat Paolo Bonzini geschrieben:
  On 28/05/2015 12:55, Fam Zheng wrote:
Indeed. blk_pause/resume would handle everything in one central place
in the block layer instead of spreading the logic across all the block
layer users.
  
   Sorry, I'm confused. Do you mean there is a way to implement blk_pause
   completely in block layer, without the necessity of various notifier
   handlers in device models?
  
  How would you do that?  Do you have to keep a queue of pending requests
  in the BlockBackend?  Since bdrv_drain_all may never return (e.g. stuck
  NFS connection with nfs=hard), the guest can force unbounded allocation
  in the host, which is bad.
 
 We already queue requests for things like I/O throttling or
 serialisation. Why would this be any worse?
 
  In addition, the BDS doesn't have a list of BlockBackends attached to
  it.  So you need the BlockBackends to register themselves for
  pause/resume in some way---for example with a notifier list.
  
  Then it's irrelevant whether it's the device model or the BB that
  attaches itself to the notifier list.  You can start with doing it in
  the device models (those that use ioeventfd), and later it can be moved
  to the BB.  The low-level implementation remains the same.
 
 The reason for doing it in the block layer is that it's in one place and
 we can be sure that it's applied. We can still in addition modify
 specific users to avoid even trying to send requests, but I think it's
 good to have the single place that always ensures correct functionality
 of the drain instead of making it dependent on the user.
 

How to do that for the synchronous blk_write callers that don't run in
a coroutine?

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Paolo Bonzini


On 28/05/2015 12:55, Fam Zheng wrote:
  Indeed. blk_pause/resume would handle everything in one central place
  in the block layer instead of spreading the logic across all the block
  layer users.

 Sorry, I'm confused. Do you mean there is a way to implement blk_pause
 completely in block layer, without the necessity of various notifier handlers
 in device models?

How would you do that?  Do you have to keep a queue of pending requests
in the BlockBackend?  Since bdrv_drain_all may never return (e.g. stuck
NFS connection with nfs=hard), the guest can force unbounded allocation
in the host, which is bad.

In addition, the BDS doesn't have a list of BlockBackends attached to
it.  So you need the BlockBackends to register themselves for
pause/resume in some way---for example with a notifier list.

Then it's irrelevant whether it's the device model or the BB that
attaches itself to the notifier list.  You can start with doing it in
the device models (those that use ioeventfd), and later it can be moved
to the BB.  The low-level implementation remains the same.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Paolo Bonzini


On 28/05/2015 13:24, Kevin Wolf wrote:
 Am 28.05.2015 um 13:00 hat Paolo Bonzini geschrieben:
 On 28/05/2015 12:55, Fam Zheng wrote:
 Indeed. blk_pause/resume would handle everything in one central place
 in the block layer instead of spreading the logic across all the block
 layer users.

 Sorry, I'm confused. Do you mean there is a way to implement blk_pause
 completely in block layer, without the necessity of various notifier
 handlers in device models?

 How would you do that?  Do you have to keep a queue of pending requests
 in the BlockBackend?  Since bdrv_drain_all may never return (e.g. stuck
 NFS connection with nfs=hard), the guest can force unbounded allocation
 in the host, which is bad.
 
 We already queue requests for things like I/O throttling or
 serialisation. Why would this be any worse?

The fact that it's potentially unbounded makes me nervous.  But you're
right that we sort of expect the guest to not have too high a queue
depth.  And serialization is also potentially unbounded.

So it may indeed be the best way.  Just to double check, is it correct
that the API would still be BDS-based, i.e.
bdrv_pause_backends/bdrv_resume_backends, and the BlockBackends would
attach themselves to pause/resume notifier lists?

Paolo

 In addition, the BDS doesn't have a list of BlockBackends attached to
 it.  So you need the BlockBackends to register themselves for
 pause/resume in some way---for example with a notifier list.

 Then it's irrelevant whether it's the device model or the BB that
 attaches itself to the notifier list.  You can start with doing it in
 the device models (those that use ioeventfd), and later it can be moved
 to the BB.  The low-level implementation remains the same.
 
 The reason for doing it in the block layer is that it's in one place and
 we can be sure that it's applied. We can still in addition modify
 specific users to avoid even trying to send requests, but I think it's
 good to have the single place that always ensures correct functionality
 of the drain instead of making it dependent on the user.
 
 Kevin
 



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Kevin Wolf
Am 28.05.2015 um 04:49 hat Fam Zheng geschrieben:
 On Wed, 05/27 12:43, Paolo Bonzini wrote:
  
  
  On 27/05/2015 12:10, Kevin Wolf wrote:
   Am 27.05.2015 um 11:50 hat Paolo Bonzini geschrieben:
  
  
   On 27/05/2015 11:07, Kevin Wolf wrote:
   This is the first part of what's troubling me with this series, as it
   makes me doubtful if op blockers are the right tool to implement what
    the commit message says (block device I/O). This is the "are we doing
    the thing right?" question.
  
   The second part should actually come first, though: Are we doing the
   right thing? I'm also doubtful whether blocking device I/O is even what
   we should do.
  
   Why is device I/O special compared to block job I/O or NBD server I/O?
  
   Because block job I/O doesn't modify the source disk.  For the target
   disk, block jobs should definitely treat themselves as device I/O and
   register notifiers that pause themselves on bdrv_drain.
   
   Block jobs don't generally have a source and a target; in fact, the
   design didn't even consider that such jobs exist (otherwise, we wouldn't
    have bs->job).
  
  Mirror and backup have targets.
  
   There are some jobs that use a BDS read-only (source for backup, mirror,
   commit) just like there are guest devices that use the BDS read-only
   (CD-ROMs). And others write to it (streaming, hard disks).
  
  Streaming doesn't write to the BDS.  It reads it while copy-on-read is
  active.  It's subtle, but it means that streaming can proceed without
  changing the contents of the disk.  As far as safety of atomic
  transactions is concerned, streaming need not be paused.
  
  However, you're right that there is no particular reason why a more
  generic pause/resume wouldn't pause jobs as well.  It wouldn't be a
  problem either way as far as the job is concerned, and a better-defined
  API is a better API.
  
    This is suspiciously similar to the first idea that Stefan and I had,
    which was a blk_pause/blk_resume API, implemented through a notifier
    list.
   
   Any problems with it because of which Fam decided against it?
 
 I am not against it.
 
 Initially I started with writing blk_pause/resume, and soon I thought it would
 be useful if blk_aio_read and friends can check the pause state before issuing
 IO:
 
 https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg02608.html
 
 So I reused op blockers. But the idea was later abandoned, so it's now very
 similar to the blk_pause/resume idea.
 
 BTW, I remember we once proposed the new op blocker model should have a
 category for I/O, no?

Yes, but I was always thinking of this on a much higher level: As soon
as a device is attached to a backend, it announces that it uses the
category, and it only drops it again when it's detached.

The use case for this is starting background jobs that don't work e.g.
while the image is written to. In such cases you want to block if the
attached device _can_ submit write request, not just if a write request
is already in flight.

Mutual exclusion on a per-request basis is something different.

 On second thought, although in this series we only need to modify
 virtio-{scsi,blk}, Wen pointed out that the mirror job also wants this during
 completion (because bdrv_swap is in a BH after the job coroutine returns).
 So, in order for the solution to be general and complete, this notifier list
 approach relies on the listening devices to do the right thing, which requires
 modifying all of them, and is harder to maintain.

Indeed. blk_pause/resume would handle everything in one central place
in the block layer instead of spreading the logic across all the block
layer users.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Fam Zheng
On Thu, 05/28 13:19, Paolo Bonzini wrote:
 
 
 On 28/05/2015 13:11, Fam Zheng wrote:
   Whoever uses ioeventfd needs to implement pause/resume, yes---not just
   dataplane, also regular virtio-blk/virtio-scsi.
   
   However, everyone else should be okay, because the bottom half runs
   immediately and the big QEMU lock is not released in the meanwhile.  So
   the CPUs have no occasion to run.  This needs a comment!
  
  I'm not sure. It seems timer callbacks also do I/O, for example
  nvme_process_sq().
 
 Right, that's also true for USB devices. :(
 
 Perhaps we can skip block_job_defer_to_main_loop if not necessary
 (bs->aio_context == qemu_get_aio_context()).

I think so. It will make dataplane even more specialized but that seems the
only way to fix the problem at the moment.

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-28 Thread Fam Zheng
On Thu, 05/28 13:47, Paolo Bonzini wrote:
 
 
 On 28/05/2015 13:44, Fam Zheng wrote:
   The reason for doing it in the block layer is that it's in one place and
   we can be sure that it's applied. We can still in addition modify
   specific users to avoid even trying to send requests, but I think it's
   good to have the single place that always ensures correct functionality
   of the drain instead of making it dependent on the user.
  
  How to do that for the synchronous blk_write callers that don't run in
  a coroutine?
 
 They would be completely oblivious to it.
 
 Their call to blk_co_write would queue the request.  Then blk_write
 calls aio_poll, which ultimately would result in blk_resume and unqueue
 the request.

OK. But then we have to make a rule that a blk_pause must be resolved by an
aio_poll() loop.  This is somewhat tricky: for example, I assume blk_write()
would only call aio_poll() of its own AioContext, but the pairing
blk_resume() will be in the main context.  To make it worse, what if there is
no BDS in the main context? Should that be special-cased in the block layer?

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO

2015-05-23 Thread Max Reitz

On 22.05.2015 06:54, Fam Zheng wrote:

On Thu, 05/21 15:32, Fam Zheng wrote:

On Thu, 05/21 15:06, Wen Congyang wrote:

On 05/21/2015 02:42 PM, Fam Zheng wrote:

It blocks device IO.

All bdrv_op_block_all/blk_op_block_all callers are taken care of:

- virtio_blk_data_plane_create
- virtio_scsi_hotplug

   Device creation, unblock it.

- bdrv_set_backing_hd

   Backing hd is not used by device, so blocking is OK.

- backup_start

   Blocking target when backup is running, unblock it.

Do you forget it?

Oh, I think the commit log is wrong: the target image is only written to by
the block job, there cannot be a device on it, so it's similar to
bdrv_set_backing_hd.

Correction: if it's blockdev-backup, the target could have a device, in that
sense it should be unblocked like block_job_create(). I'll fix it.


Really? I think it makes sense not to allow I/O on a backup target. At 
least I can't imagine a use case where you'd want to do that... But that 
doesn't necessarily mean anything, of course.


Max