Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM
On Wed, Jun 24, 2015 at 11:08:43AM, Li, Liang Z wrote:
> > > Right now, we don't have an interface to detect those cases and get
> > > back to the iterative stage.
> >
> > How about going back to the iterative stage when we detect that
> > pending_size is larger than max_size, like this:
> >
> > +        /* The flush here aims to shorten the VM downtime;
> > +         * bdrv_flush_all() is a time-consuming operation
> > +         * when the guest has done some file writing */
> > +        bdrv_flush_all();
> > +        pending_size = qemu_savevm_state_pending(s->file, max_size);
> > +        if (pending_size && pending_size >= max_size) {
> > +            qemu_mutex_unlock_iothread();
> > +            continue;
> > +        }
> >          ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> >          if (ret >= 0) {
> >              qemu_file_set_rate_limit(s->file, INT64_MAX);
> >
> > and this is quite simple.
> >
> > > Yes, but it is too simple. If you hold all the locks during
> > > bdrv_flush_all(), your VM will effectively stop as soon as it
> > > performs the next I/O access, so you don't win much. And you still
> > > don't have a timeout for cases where the flush takes really long.
> >
> > This is probably better than what we have now (basically we are
> > measuring, after bdrv_flush_all(), how much the amount of dirty
> > memory has changed, and returning to the iterative stage if it
> > changed too much). A timeout would be better anyway, and an interface
> > to start the synchronization sooner, asynchronously, would also be
> > good. Notice that my understanding is that any proper fix for this is
> > 2.4 material.
> >
> > Then, how to deal with this issue in 2.3? Leave it as it is, or make
> > an incomplete fix like the one above?
> >
> > I think it is better to leave it as it is for 2.3. With a patch like
> > this one, we improve one workload and get worse on a different one
> > (it depends a lot on the ratio of memory dirtying vs. disk writes). I
> > have no data on which workload is more common, so I prefer to be
> > conservative this late in the cycle. What do you think?
> >
> > I agree, it's too late in the release cycle for such a change.
> >
> > Kevin
>
> Hi Juan & Kevin,
>
> I have not found the related patches to fix the issue which leads to
> long VM downtime. How is it going?
Kevin is on vacation and QEMU is currently in 2.4 soft freeze. Unless patches have been posted/merged that I'm not aware of, it is unlikely that anything will happen before QEMU 2.4 is released.

Stefan
Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM
> > Right now, we don't have an interface to detect those cases and get
> > back to the iterative stage.
>
> How about going back to the iterative stage when we detect that
> pending_size is larger than max_size, like this:
>
> +        /* The flush here aims to shorten the VM downtime;
> +         * bdrv_flush_all() is a time-consuming operation
> +         * when the guest has done some file writing */
> +        bdrv_flush_all();
> +        pending_size = qemu_savevm_state_pending(s->file, max_size);
> +        if (pending_size && pending_size >= max_size) {
> +            qemu_mutex_unlock_iothread();
> +            continue;
> +        }
>          ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>          if (ret >= 0) {
>              qemu_file_set_rate_limit(s->file, INT64_MAX);
>
> and this is quite simple.
>
> > Yes, but it is too simple. If you hold all the locks during
> > bdrv_flush_all(), your VM will effectively stop as soon as it
> > performs the next I/O access, so you don't win much. And you still
> > don't have a timeout for cases where the flush takes really long.
>
> This is probably better than what we have now (basically we are
> measuring, after bdrv_flush_all(), how much the amount of dirty memory
> has changed, and returning to the iterative stage if it changed too
> much). A timeout would be better anyway, and an interface to start the
> synchronization sooner, asynchronously, would also be good. Notice
> that my understanding is that any proper fix for this is 2.4 material.
>
> Then, how to deal with this issue in 2.3? Leave it as it is, or make
> an incomplete fix like the one above?
>
> I think it is better to leave it as it is for 2.3. With a patch like
> this one, we improve one workload and get worse on a different one (it
> depends a lot on the ratio of memory dirtying vs. disk writes). I have
> no data on which workload is more common, so I prefer to be
> conservative this late in the cycle. What do you think?
>
> I agree, it's too late in the release cycle for such a change.
>
> Kevin

Hi Juan & Kevin,

I have not found the related patches to fix the issue which leads to
long VM downtime. How is it going?

Liang
Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM
Li, Liang Z <liang.z...@intel.com> wrote:
> > Right now, we don't have an interface to detect those cases and get
> > back to the iterative stage.
>
> How about going back to the iterative stage when we detect that
> pending_size is larger than max_size, like this:
>
> +        /* The flush here aims to shorten the VM downtime;
> +         * bdrv_flush_all() is a time-consuming operation
> +         * when the guest has done some file writing */
> +        bdrv_flush_all();
> +        pending_size = qemu_savevm_state_pending(s->file, max_size);
> +        if (pending_size && pending_size >= max_size) {
> +            qemu_mutex_unlock_iothread();
> +            continue;
> +        }
>          ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>          if (ret >= 0) {
>              qemu_file_set_rate_limit(s->file, INT64_MAX);
>
> and this is quite simple.
>
> > Yes, but it is too simple. If you hold all the locks during
> > bdrv_flush_all(), your VM will effectively stop as soon as it
> > performs the next I/O access, so you don't win much. And you still
> > don't have a timeout for cases where the flush takes really long.

This is probably better than what we have now (basically we are
measuring, after bdrv_flush_all(), how much the amount of dirty memory
has changed, and returning to the iterative stage if it changed too
much). A timeout would be better anyway, and an interface to start the
synchronization sooner, asynchronously, would also be good. Notice that
my understanding is that any proper fix for this is 2.4 material.

> Then, how to deal with this issue in 2.3? Leave it as it is, or make
> an incomplete fix like the one above?

I think it is better to leave it as it is for 2.3. With a patch like
this one, we improve one workload and get worse on a different one (it
depends a lot on the ratio of memory dirtying vs. disk writes). I have
no data on which workload is more common, so I prefer to be conservative
this late in the cycle. What do you think?

Later, Juan.

> Liang

Thanks, Juan.
Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM
Am 18.03.2015 um 13:36 hat Juan Quintela geschrieben:
> Kevin Wolf <kw...@redhat.com> wrote:
> > The problem is that the block layer really doesn't have an option to
> > control what is getting synced once the data is cached outside of
> > qemu. Essentially we can do an fdatasync() or we can leave it;
> > that's the only choice we have.
>
> See my explanation: if all that qemu is doing is an fdatasync(), just
> spawn a thread that does the fdatasync(), and if the thread doesn't
> finish within the timeout, return an error. You can implement that
> behaviour however you want.

Yes, and if you use bdrv_co_flush(), this is done by one of the worker
threads that raw-posix uses. So no new code to write for that.

> > Now, doing an asynchronous fdatasync() in the background is
> > completely reasonable in order to reduce the amount of data to be
> > flushed later. But the patch is doing it while holding both the BQL
> > and the AioContext lock of the device, which doesn't feel right.
> > Maybe it should schedule a BH in the AioContext instead and flush
> > from there asynchronously.
>
> The position is wrong, definitely. We want to start the asynchronous
> fdatasync() at the start of migration, or every X milliseconds. At
> this point, we *think* that we can finish the migration within
> max_downtime (basically we are ignoring the time that it is going to
> take to migrate the device state and do the block layer flush, but
> normally this takes on the order of 100-200 ms, so it doesn't matter
> at all).
>
> > The other thing is that flushing once doesn't mean that new data
> > isn't accumulating in the cache, especially if you decide to do the
> > background flush at the start of the migration.
>
> But potentially we would arrive at this point with less data cached in
> the system.

> > The obvious way to avoid that would be to switch to a writethrough
> > mode, so any write request goes directly to the disk. This will,
> > however, impact performance so heavily that it's probably not a
> > realistic option.
> > An alternative approach could be to repeat the background flush
> > periodically, either time-based or after every x bytes that are
> > written to a device.
>
> Time-based should actually be quite easy to implement. We can do it
> periodically in the migration thread, if the call is thread-safe. We
> already have a loop there, and some kind of time control; what we are
> missing is an interface.

Don't do it from the migration thread. You'd have to take locks and
stop the guest from doing I/O while the flush is running. Instead,
create a coroutine and use a BH to enter it from the main loop of the
AioContext of the block device (i.e. the qemu main loop thread in most
cases, or the dataplane thread if it exists). Then your flush will run
in the background without blocking anyone else. Time control in
coroutines is as easy as calling co_aio_sleep_ns().

> > Once we have the flushing in the background, the migration code can
> > apply any timeouts it wants. If the timeout is exceeded, the flush
> > wouldn't be aborted but would keep running in the background, and
> > migration can go back to the iterative stage anyway.
>
> Yeap, that is what we really want/need.

Cool. I don't think it's very hard to get there.

> Notice that my understanding is that any proper fix for this is 2.4
> material.

Yes, I agree.

Kevin