Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM

2015-06-25 Thread Stefan Hajnoczi
On Wed, Jun 24, 2015 at 11:08:43AM +, Li, Liang Z wrote:
  Right now, we don't have an interface to detect those cases and
  go back to the iterative stage.

 How about going back to the iterative stage when we detect that the
 pending_size is larger than max_size, like this:

 +/* The flush here is aimed at shortening the VM downtime;
 + * bdrv_flush_all() is a time-consuming operation
 + * when the guest has done some file writing */
 +bdrv_flush_all();
 +pending_size = qemu_savevm_state_pending(s->file, max_size);
 +if (pending_size && pending_size >= max_size) {
 +    qemu_mutex_unlock_iothread();
 +    continue;
 +}
  ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
  if (ret >= 0) {
      qemu_file_set_rate_limit(s->file, INT64_MAX);

 and this is quite simple.

 Yes, but it is too simple. If you hold all the locks during
 bdrv_flush_all(), your VM will effectively stop as soon as it
 performs the next I/O access, so you don't win much. And you
 still don't have a timeout for cases where the flush takes really 
 long.
   
    This is probably better than what we have now (basically we measure,
    after bdrv_flush_all, how much the amount of dirty memory has
    changed, and return to the iterative stage if too much has changed).
    A timeout would be better anyway.  And an interface to start the
    synchronization sooner, asynchronously, would also be good.
   
    Notice that my understanding is that any proper fix for this is 2.4 material.
   
    Then, how should we deal with this issue in 2.3: leave it as it is, or
    make an incomplete fix like the one above?
  
   I think it is better to leave it as it is for 2.3. With a patch like this
   one, we improve one load and get worse on a different load (it depends
   a lot on the ratio of dirtying memory vs. disk).  I have no data on
   which load is more common, so I prefer to be conservative this late
   in the cycle.  What do you think?
  
  I agree, it's too late in the release cycle for such a change.
  
  Kevin
 
 Hi Juan & Kevin,
 
 I have not found the related patches to fix the issue which leads to long
 VM downtime.  How is it going?

Kevin is on vacation and QEMU is currently in 2.4 soft freeze.  Unless
patches have been posted/merged that I'm not aware of, it is unlikely
that anything will happen before QEMU 2.4 is released.

Stefan




Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM

2015-06-24 Thread Li, Liang Z
 Right now, we don't have an interface to detect those cases and
 go back to the iterative stage.
   
How about going back to the iterative stage when we detect that the
pending_size is larger than max_size, like this:
   
+/* The flush here is aimed at shortening the VM downtime;
+ * bdrv_flush_all() is a time-consuming operation
+ * when the guest has done some file writing */
+bdrv_flush_all();
+pending_size = qemu_savevm_state_pending(s->file, max_size);
+if (pending_size && pending_size >= max_size) {
+    qemu_mutex_unlock_iothread();
+    continue;
+}
 ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
 if (ret >= 0) {
     qemu_file_set_rate_limit(s->file, INT64_MAX);
   
and this is quite simple.
   
Yes, but it is too simple. If you hold all the locks during
bdrv_flush_all(), your VM will effectively stop as soon as it
performs the next I/O access, so you don't win much. And you
still don't have a timeout for cases where the flush takes really long.
  
   This is probably better than what we have now (basically we measure,
   after bdrv_flush_all, how much the amount of dirty memory has
   changed, and return to the iterative stage if too much has changed).
   A timeout would be better anyway.  And an interface to start the
   synchronization sooner, asynchronously, would also be good.
  
   Notice that my understanding is that any proper fix for this is 2.4 material.
  
   Then, how should we deal with this issue in 2.3: leave it as it is, or
   make an incomplete fix like the one above?
 
  I think it is better to leave it as it is for 2.3. With a patch like this
  one, we improve one load and get worse on a different load (it depends
  a lot on the ratio of dirtying memory vs. disk).  I have no data on
  which load is more common, so I prefer to be conservative this late
  in the cycle.  What do you think?
 
 I agree, it's too late in the release cycle for such a change.
 
 Kevin

Hi Juan & Kevin,

I have not found the related patches to fix the issue which leads to long
VM downtime.  How is it going?

Liang



Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM

2015-03-25 Thread Juan Quintela
Li, Liang Z <liang.z...@intel.com> wrote:
   Right now, we don't have an interface to detect those cases and go
   back to the iterative stage.
 
  How about going back to the iterative stage when we detect that the
  pending_size is larger than max_size, like this:
 
  +/* The flush here is aimed at shortening the VM downtime;
  + * bdrv_flush_all() is a time-consuming operation
  + * when the guest has done some file writing */
  +bdrv_flush_all();
  +pending_size = qemu_savevm_state_pending(s->file, max_size);
  +if (pending_size && pending_size >= max_size) {
  +    qemu_mutex_unlock_iothread();
  +    continue;
  +}
   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
   if (ret >= 0) {
       qemu_file_set_rate_limit(s->file, INT64_MAX);
 
  and this is quite simple.
 
  Yes, but it is too simple. If you hold all the locks during
  bdrv_flush_all(), your VM will effectively stop as soon as it performs
  the next I/O access, so you don't win much. And you still don't have a
  timeout for cases where the flush takes really long.
 
 This is probably better than what we have now (basically we measure,
 after bdrv_flush_all, how much the amount of dirty memory has changed,
 and return to the iterative stage if too much has changed).  A timeout
 would be better anyway.  And an interface to start the synchronization
 sooner, asynchronously, would also be good.
 
 Notice that my understanding is that any proper fix for this is 2.4 material.

 Then, how should we deal with this issue in 2.3: leave it as it is, or
 make an incomplete fix like the one above?

I think it is better to leave it as it is for 2.3. With a patch like this
one, we improve one load and get worse on a different load (it depends a
lot on the ratio of dirtying memory vs. disk).  I have no data on which
load is more common, so I prefer to be conservative this late in the
cycle.  What do you think?

Later, Juan.



 Liang

 Thanks, Juan.



Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM

2015-03-18 Thread Kevin Wolf
On 18.03.2015 at 13:36, Juan Quintela wrote:
 Kevin Wolf <kw...@redhat.com> wrote:
  The problem is that the block layer really doesn't have an option to
  control what is getting synced once the data is cached outside of qemu.
  Essentially we can do an fdatasync() or we can leave it, that's the only
  choice we have.
 
 See my explanation: if all that qemu is doing is an fdatasync(), just
 spawn a thread that does the fdatasync(), and if the thread doesn't finish
 within the timeout, just return an error.  You can implement that
 behaviour however you want.

Yes, and if you use bdrv_co_flush(), this is one of the worker threads
that raw-posix uses. So no new code to write for that.
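
To make that concrete, here is a minimal sketch (not from the original
thread) of such a coroutine flush. It assumes QEMU-internal headers from a
2.3/2.4-era tree; declarations may differ in other versions.

    /* Hypothetical sketch: flush one BlockDriverState from a coroutine.
     * With the raw-posix driver, bdrv_co_flush() hands the actual
     * fdatasync() to one of the existing worker threads, and this
     * coroutine simply yields until the flush completes. */
    static void coroutine_fn background_flush_co(void *opaque)
    {
        BlockDriverState *bs = opaque;
        int ret = bdrv_co_flush(bs);    /* yields while the flush runs */

        if (ret < 0) {
            error_report("background flush failed: %s", strerror(-ret));
        }
    }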

  Now doing an asynchronous fdatasync() in the background is completely
  reasonable in order to reduce the amount of data to be flushed later.
  But the patch is doing it while holding both the BQL and the AIOContext
  lock of the device, which doesn't feel right. Maybe it should schedule a
  BH in the AIOContext instead and flush from there asynchronously.
 
 Position is wrong, definitely.  We want to start the asynchronous
 fdatasync() at the start of migration, or every X milliseconds.  At this
 point, we *think* that we can finish the migration within max_downtime
 (basically we are ignoring the time it is going to take to migrate the
 device state and do the block layer flush, but normally this takes on
 the order of 100-200ms, so it doesn't matter at all).
 
  The other thing is that flushing once doesn't mean that new data isn't
  accumulating in the cache, especially if you decide to do the background
  flush at the start of the migration.
 
 But potentially, we would arrive at this point with less cached
 data everywhere on the system.
 
  The obvious way to avoid that would be to switch to a writethrough mode,
  so any write request goes directly to the disk. This will, however,
  impact performance so heavily that it's probably not a realistic option.
 
  An alternative approach could be to repeat the background flush
  periodically, either time based or after every x bytes that are written
  to a device. Time based should actually be quite easy to implement.
 
 We can do it periodically on the migration thread, if the call is
 thread-safe.  We already have a loop there, and some kind of time control;
 what we are missing is an interface.

Don't do it from the migration thread. You'd have to take locks and stop
the guest from doing I/O while the flush is running.

Instead, create a coroutine and use a BH to enter it from the main loop
of the AIOContext of the block device (i.e. the qemu main loop thread in
most cases, or the dataplane thread if it exists). Then your flush will
run in the background without blocking anyone else.

Time control in coroutines is as easy as calling co_aio_sleep_ns().
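
A rough sketch of how that might be wired together (again, not from the
thread): the names FlushJob, periodic_flush_co, flush_job_bh and
start_background_flush are illustrative, the 500 ms interval is arbitrary,
and the BH/coroutine calls use their QEMU 2.3-era signatures, which have
changed in later releases, so treat this as a sketch rather than a ready
patch.

    #include "qemu-common.h"
    #include "block/block.h"
    #include "qemu/main-loop.h"

    /* Illustrative state for one background-flush job. */
    typedef struct FlushJob {
        BlockDriverState *bs;
        QEMUBH *bh;
        bool cancel;        /* set by the migration code to stop the loop */
    } FlushJob;

    /* Runs in the AioContext of the device: flush, sleep, repeat.
     * Locking details (e.g. aio_context_acquire) are deliberately elided. */
    static void coroutine_fn periodic_flush_co(void *opaque)
    {
        FlushJob *job = opaque;
        AioContext *ctx = bdrv_get_aio_context(job->bs);

        while (!job->cancel) {
            bdrv_co_flush(job->bs);                   /* background flush */
            co_aio_sleep_ns(ctx, QEMU_CLOCK_REALTIME, /* time control */
                            500 * 1000000LL);         /* every 500 ms */
        }
    }

    /* BH entered from the main loop of the device's AioContext. */
    static void flush_job_bh(void *opaque)
    {
        FlushJob *job = opaque;
        Coroutine *co = qemu_coroutine_create(periodic_flush_co);

        qemu_bh_delete(job->bh);
        qemu_coroutine_enter(co, job);   /* 2.3-era signature: (co, opaque) */
    }

    /* Called by the migration code, e.g. when migration starts. */
    static void start_background_flush(FlushJob *job)
    {
        job->bh = aio_bh_new(bdrv_get_aio_context(job->bs),
                             flush_job_bh, job);
        qemu_bh_schedule(job->bh);
    }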

  Once we have the flushing in the background, the migration code can
  apply any timeouts it wants. If the timeout is exceeded, the flush
  wouldn't be aborted but keep running in the background, but migration
  can go back to the iterative state anyway.
 
 Yeap, that is what we really want/need.

Cool. I don't think it's very hard to get there.
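
On the migration side, the timeout policy described above could reduce to a
small helper along these lines (illustrative only; flush_done,
flush_start_ms and timeout_ms stand for state the migration code would have
to track):

    /* Hypothetical helper: with the flush running in the background, decide
     * whether migration should go back to the iterative stage instead of
     * stopping the VM now. */
    static bool flush_timed_out(bool flush_done, int64_t flush_start_ms,
                                int64_t timeout_ms)
    {
        int64_t now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

        /* If the timeout is exceeded, the flush is not aborted; it keeps
         * running in the background while migration returns to copying
         * dirty pages instead of stopping the guest indefinitely. */
        return !flush_done && (now - flush_start_ms) > timeout_ms;
    }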

 Notice that my understanding is that any proper fix for this is 2.4 material.

Yes, I agree.

Kevin