Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 29/05/2015 13:11, Andrey Korolyov wrote:
> Sorry for a potential thread hijack, but I'm curious about the reasons
> for not making the advertised queue depth for non-passthrough backends
> an independent tunable. Are there any concerns behind that?

It certainly can be made tunable. Usually it is bounded, actually; for example, PATA has a queue depth of 1 (!) while AHCI has 32. For virtio (blk and scsi) it actually is already somewhat tunable, since the queue depth is limited by the size of the virtqueue.

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On Thu, May 28, 2015 at 3:05 PM, Fam Zheng f...@redhat.com wrote:
> On Thu, 05/28 13:19, Paolo Bonzini wrote:
>> On 28/05/2015 13:11, Fam Zheng wrote:
>>>> Whoever uses ioeventfd needs to implement pause/resume, yes---not
>>>> just dataplane, also regular virtio-blk/virtio-scsi. However,
>>>> everyone else should be okay, because the bottom half runs
>>>> immediately and the big QEMU lock is not released in the meanwhile.
>>>> So the CPUs have no occasion to run. This needs a comment!
>>>
>>> I'm not sure. It seems timer callbacks also do I/O, for example
>>> nvme_process_sq().
>>
>> Right, that's also true for USB devices. :(
>>
>> Perhaps we can skip block_job_defer_to_main_loop if not necessary
>> (bs->aio_context == qemu_get_aio_context()).
>
> I think so. It will make dataplane even more specialized, but that
> seems the only way to fix the problem at the moment.
>
> Fam

Sorry for a potential thread hijack, but I'm curious about the reasons for not making the advertised queue depth for non-passthrough backends an independent tunable. Are there any concerns behind that?
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 28/05/2015 04:49, Fam Zheng wrote:
> On a second thought, although in this series we only need to modify
> virtio-{scsi,blk}, Wen pointed out that mirror job also wants this
> during completion (because bdrv_swap is in a BH after the job
> coroutine returns).

Mirror needs to pause/resume the source. It doesn't need to handle pause/resume of the target, does it?

> So, in order for the solution to be general and complete, this
> notifier list approach relies on the listening devices to do the right
> thing, which requires modifying all of them, and is harder to
> maintain.

Isn't it only devices that use aio_set_event_notifier for their ioeventfd?

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On Thu, 05/28 11:40, Kevin Wolf wrote:
> Indeed. blk_pause/resume would handle everything in one central place
> in the block layer instead of spreading the logic across all the block
> layer users.

Sorry, I'm confused. Do you mean there is a way to implement blk_pause completely in the block layer, without the necessity of various notifier handlers in device models?

Fam
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 28/05/2015 13:11, Fam Zheng wrote:
>> Whoever uses ioeventfd needs to implement pause/resume, yes---not
>> just dataplane, also regular virtio-blk/virtio-scsi. However,
>> everyone else should be okay, because the bottom half runs
>> immediately and the big QEMU lock is not released in the meanwhile.
>> So the CPUs have no occasion to run. This needs a comment!
>
> I'm not sure. It seems timer callbacks also do I/O, for example
> nvme_process_sq().

Right, that's also true for USB devices. :(

Perhaps we can skip block_job_defer_to_main_loop if not necessary (bs->aio_context == qemu_get_aio_context()).

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 28/05/2015 13:44, Fam Zheng wrote:
>> The reason for doing it in the block layer is that it's in one place
>> and we can be sure that it's applied. We can still in addition modify
>> specific users to avoid even trying to send requests, but I think
>> it's good to have the single place that always ensures correct
>> functionality of the drain instead of making it dependent on the
>> user.
>
> How to do that for the synchronous blk_write callers that don't run in
> a coroutine?

They would be completely oblivious to it. Their call to blk_co_write would queue the request. Then blk_write calls aio_poll, which ultimately would result in blk_resume and unqueue the request.

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 28/05/2015 12:46, Fam Zheng wrote:
>> Mirror needs to pause/resume the source. It doesn't need to handle
>> pause/resume of the target, does it?
>>
>>> So, in order for the solution to be general and complete, this
>>> notifier list approach relies on the listening devices to do the
>>> right thing, which requires modifying all of them, and is harder to
>>> maintain.
>>
>> Isn't it only devices that use aio_set_event_notifier for their
>> ioeventfd?
>
> I think it's only the case for qmp_transaction; the mirror job needs
> all block devices to implement pause: mirror_run guarantees source and
> target are in sync when it returns, but bdrv_swap is deferred to a
> main loop bottom half - what if there is a guest write to the source
> in between?

Whoever uses ioeventfd needs to implement pause/resume, yes---not just dataplane, also regular virtio-blk/virtio-scsi. However, everyone else should be okay, because the bottom half runs immediately and the big QEMU lock is not released in the meanwhile. So the CPUs have no occasion to run. This needs a comment!

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On Thu, 05/28 13:24, Kevin Wolf wrote:
> Am 28.05.2015 um 13:00 hat Paolo Bonzini geschrieben:
>> On 28/05/2015 12:55, Fam Zheng wrote:
>>>> Indeed. blk_pause/resume would handle everything in one central
>>>> place in the block layer instead of spreading the logic across all
>>>> the block layer users.
>>>
>>> Sorry, I'm confused. Do you mean there is a way to implement
>>> blk_pause completely in the block layer, without the necessity of
>>> various notifier handlers in device models?
>>
>> How would you do that? Do you have to keep a queue of pending
>> requests in the BlockBackend? Since bdrv_drain_all may never return
>> (e.g. stuck NFS connection with nfs=hard), the guest can force
>> unbounded allocation in the host, which is bad.
>
> We already queue requests for things like I/O throttling or
> serialisation. Why would this be any worse?
>
>> In addition, the BDS doesn't have a list of BlockBackends attached
>> to it. So you need the BlockBackends to register themselves for
>> pause/resume in some way---for example with a notifier list. Then
>> it's irrelevant whether it's the device model or the BB that
>> attaches itself to the notifier list. You can start with doing it in
>> the device models (those that use ioeventfd), and later it can be
>> moved to the BB. The low-level implementation remains the same.
>
> The reason for doing it in the block layer is that it's in one place
> and we can be sure that it's applied. We can still in addition modify
> specific users to avoid even trying to send requests, but I think
> it's good to have the single place that always ensures correct
> functionality of the drain instead of making it dependent on the
> user.

How to do that for the synchronous blk_write callers that don't run in a coroutine?

Fam
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 28/05/2015 12:55, Fam Zheng wrote:
>> Indeed. blk_pause/resume would handle everything in one central place
>> in the block layer instead of spreading the logic across all the
>> block layer users.
>
> Sorry, I'm confused. Do you mean there is a way to implement blk_pause
> completely in the block layer, without the necessity of various
> notifier handlers in device models?

How would you do that? Do you have to keep a queue of pending requests in the BlockBackend? Since bdrv_drain_all may never return (e.g. stuck NFS connection with nfs=hard), the guest can force unbounded allocation in the host, which is bad.

In addition, the BDS doesn't have a list of BlockBackends attached to it. So you need the BlockBackends to register themselves for pause/resume in some way---for example with a notifier list. Then it's irrelevant whether it's the device model or the BB that attaches itself to the notifier list. You can start with doing it in the device models (those that use ioeventfd), and later it can be moved to the BB. The low-level implementation remains the same.

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 28/05/2015 13:24, Kevin Wolf wrote:
> Am 28.05.2015 um 13:00 hat Paolo Bonzini geschrieben:
>> On 28/05/2015 12:55, Fam Zheng wrote:
>>>> Indeed. blk_pause/resume would handle everything in one central
>>>> place in the block layer instead of spreading the logic across all
>>>> the block layer users.
>>>
>>> Sorry, I'm confused. Do you mean there is a way to implement
>>> blk_pause completely in the block layer, without the necessity of
>>> various notifier handlers in device models?
>>
>> How would you do that? Do you have to keep a queue of pending
>> requests in the BlockBackend? Since bdrv_drain_all may never return
>> (e.g. stuck NFS connection with nfs=hard), the guest can force
>> unbounded allocation in the host, which is bad.
>
> We already queue requests for things like I/O throttling or
> serialisation. Why would this be any worse?

The fact that it's potentially unbounded makes me nervous. But you're right that we sort of expect the guest to not have too high a queue depth, and serialization is also potentially unbounded. So it may indeed be the best way.

Just to double check, is it correct that the API would still be BDS-based, i.e. bdrv_pause_backends/bdrv_resume_backends, and the BlockBackends would attach themselves to pause/resume notifier lists?

Paolo

>> In addition, the BDS doesn't have a list of BlockBackends attached
>> to it. So you need the BlockBackends to register themselves for
>> pause/resume in some way---for example with a notifier list. Then
>> it's irrelevant whether it's the device model or the BB that
>> attaches itself to the notifier list. You can start with doing it in
>> the device models (those that use ioeventfd), and later it can be
>> moved to the BB. The low-level implementation remains the same.
>
> The reason for doing it in the block layer is that it's in one place
> and we can be sure that it's applied. We can still in addition modify
> specific users to avoid even trying to send requests, but I think
> it's good to have the single place that always ensures correct
> functionality of the drain instead of making it dependent on the
> user.
>
> Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
Am 28.05.2015 um 04:49 hat Fam Zheng geschrieben:
> On Wed, 05/27 12:43, Paolo Bonzini wrote:
>> On 27/05/2015 12:10, Kevin Wolf wrote:
>>> Am 27.05.2015 um 11:50 hat Paolo Bonzini geschrieben:
>>>> On 27/05/2015 11:07, Kevin Wolf wrote:
>>>>> This is the first part of what's troubling me with this series,
>>>>> as it makes me doubtful if op blockers are the right tool to
>>>>> implement what the commit message says (block device I/O).
>>>>
>>>> This is "are we doing the thing right?" The second part should
>>>> actually come first, though:
>>>>
>>>>> Are we doing the right thing? I'm also doubtful whether blocking
>>>>> device I/O is even what we should do. Why is device I/O special
>>>>> compared to block job I/O or NBD server I/O?
>>>>
>>>> Because block job I/O doesn't modify the source disk. For the
>>>> target disk, block jobs should definitely treat themselves as
>>>> device I/O and register notifiers that pause themselves on
>>>> bdrv_drain.
>>>
>>> Block jobs don't generally have a source and a target; in fact, the
>>> design didn't even consider that such jobs exist (otherwise, we
>>> wouldn't have bs->job).
>>
>> Mirror and backup have targets.
>>
>>> There are some jobs that use a BDS read-only (source for backup,
>>> mirror, commit) just like there are guest devices that use the BDS
>>> read-only (CD-ROMs). And others write to it (streaming, hard
>>> disks).
>>
>> Streaming doesn't write to the BDS. It reads it while copy-on-read
>> is active. It's subtle, but it means that streaming can proceed
>> without changing the contents of the disk. As far as safety of
>> atomic transactions is concerned, streaming need not be paused.
>>
>> However, you're right that there is no particular reason why a more
>> generic pause/resume wouldn't pause jobs as well. It wouldn't be a
>> problem either way as far as the job is concerned, and a
>> better-defined API is a better API.
>>
>> This is suspiciously similar to the first idea that I and Stefan
>> had, which was a blk_pause/blk_resume API, implemented through a
>> notifier list. Any problems with it because of which Fam decided
>> against it?
>
> I am not against it.
>
> Initially I started with writing blk_pause/resume, and soon I thought
> it would be useful if blk_aio_read and friends could check the pause
> state before issuing I/O:
>
> https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg02608.html
>
> So I reused op blockers. But the idea was later abandoned, so it's
> now very similar to the blk_pause/resume idea.
>
> BTW, I remember we once proposed that the new op blocker model should
> have a category for I/O, no?

Yes, but I was always thinking of this on a much higher level: As soon as a device is attached to a backend, it announces that it uses the category, and it only drops it again when it's detached. The use case for this is starting background jobs that don't work e.g. while the image is written to. In such cases you want to block if the attached device _can_ submit write requests, not just if a write request is already in flight. Mutual exclusion on a per-request basis is something different.

> On a second thought, although in this series we only need to modify
> virtio-{scsi,blk}, Wen pointed out that mirror job also wants this
> during completion (because bdrv_swap is in a BH after the job
> coroutine returns).
>
> So, in order for the solution to be general and complete, this
> notifier list approach relies on the listening devices to do the
> right thing, which requires modifying all of them, and is harder to
> maintain.

Indeed. blk_pause/resume would handle everything in one central place in the block layer instead of spreading the logic across all the block layer users.

Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On Thu, 05/28 13:19, Paolo Bonzini wrote:
> On 28/05/2015 13:11, Fam Zheng wrote:
>>> Whoever uses ioeventfd needs to implement pause/resume, yes---not
>>> just dataplane, also regular virtio-blk/virtio-scsi. However,
>>> everyone else should be okay, because the bottom half runs
>>> immediately and the big QEMU lock is not released in the meanwhile.
>>> So the CPUs have no occasion to run. This needs a comment!
>>
>> I'm not sure. It seems timer callbacks also do I/O, for example
>> nvme_process_sq().
>
> Right, that's also true for USB devices. :(
>
> Perhaps we can skip block_job_defer_to_main_loop if not necessary
> (bs->aio_context == qemu_get_aio_context()).

I think so. It will make dataplane even more specialized, but that seems the only way to fix the problem at the moment.

Fam
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On Thu, 05/28 13:47, Paolo Bonzini wrote:
> On 28/05/2015 13:44, Fam Zheng wrote:
>>> The reason for doing it in the block layer is that it's in one place
>>> and we can be sure that it's applied. We can still in addition
>>> modify specific users to avoid even trying to send requests, but I
>>> think it's good to have the single place that always ensures correct
>>> functionality of the drain instead of making it dependent on the
>>> user.
>>
>> How to do that for the synchronous blk_write callers that don't run
>> in a coroutine?
>
> They would be completely oblivious to it. Their call to blk_co_write
> would queue the request. Then blk_write calls aio_poll, which
> ultimately would result in blk_resume and unqueue the request.

OK... I thought that, but then we have to make a rule that a blk_pause must be resolved by an aio_poll() loop. This is somewhat tricky; for example, I assume blk_write() would only call aio_poll() of its own AioContext, but the pairing blk_resume() will be in the main context. To make it worse, what if there is no BDS in the main context - should it be special-cased in the block layer?

Fam
Re: [Qemu-block] [Qemu-devel] [PATCH v6 01/13] block: Add op blocker type device IO
On 22.05.2015 06:54, Fam Zheng wrote:
> On Thu, 05/21 15:32, Fam Zheng wrote:
>> On Thu, 05/21 15:06, Wen Congyang wrote:
>>> On 05/21/2015 02:42 PM, Fam Zheng wrote:
>>>> It blocks device IO.
>>>>
>>>> All bdrv_op_block_all/blk_op_block_all callers are taken care of:
>>>>
>>>> - virtio_blk_data_plane_create
>>>> - virtio_scsi_hotplug
>>>>   Device creation, unblock it.
>>>> - bdrv_set_backing_hd
>>>>   Backing hd is not used by device, so blocking is OK.
>>>> - backup_start
>>>>   Blocking target when backup is running, unblock it.
>>>
>>> Do you forget it?
>>
>> Oh, I think the commit log is wrong: the target image is only written
>> to by the block job; there cannot be a device on it, so it's similar
>> to bdrv_set_backing_hd.
>
> Correction: if it's blockdev-backup, the target could have a device;
> in that sense it should be unblocked like block_job_create(). I'll fix
> it.

Really? I think it makes sense not to allow I/O on a backup target. At least I can't imagine a use case where you'd want to do that... But that doesn't necessarily mean anything, of course.

Max