Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-19 Thread Peter Xu
On Mon, Sep 18, 2017 at 11:00:15AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Fri, Sep 15, 2017 at 10:22:43AM +0100, Dr. David Alan Gilbert wrote:
> > > * Fam Zheng (f...@redhat.com) wrote:
> > > > On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > > > > * Fam Zheng (f...@redhat.com) wrote:
> > > > > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > > > > bdrv_close_all() would abort() due to op blockers added by 
> > > > > > > > BMDS, clean
> > > > > > > > up migration states when main loop quits to avoid that.
> > > > > > > > 
> > > > > > > > Signed-off-by: Fam Zheng 
> > > > > > > > ---
> > > > > > > >  include/migration/misc.h | 1 +
> > > > > > > >  migration/migration.c| 7 ++-
> > > > > > > >  vl.c | 3 +++
> > > > > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > > > > index c079b7771b..b9a26b0898 100644
> > > > > > > > --- a/include/migration/misc.h
> > > > > > > > +++ b/include/migration/misc.h
> > > > > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > > > > >  /* ...and after the device transmission */
> > > > > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > > > > >  void migration_global_dump(Monitor *mon);
> > > > > > > > +void migrate_cancel(void);
> > > > > > > >  
> > > > > > > >  #endif
> > > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > > index 959e8ec88e..2c844945c7 100644
> > > > > > > > --- a/migration/migration.c
> > > > > > > > +++ b/migration/migration.c
> > > > > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool 
> > > > > > > > has_blk, bool blk,
> > > > > > > >  }
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > > > > +void migrate_cancel(void)
> > > > > > > >  {
> > > > > > > >  migrate_fd_cancel(migrate_get_current());
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > > > > +{
> > > > > > > > +migrate_cancel();
> > > > > > > > +}
> > > > > > > > +
> > > > > > > 
> > > > > > > Nit: I would prefer just call migrate_fd_cancel() below, since I 
> > > > > > > don't
> > > > > > > see much point to introduce migrate_cancel() which only calls
> > > > > > > migrate_fd_cancel()...
> > > > > > 
> > > > > > migrate_get_current() is a migration internal IMHO. But that can be 
> > > > > > moved to
> > > > > > migrate_fd_cancel() so the parameter is dropped.
> > > > > > 
> > > > > > > 
> > > > > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > > > > >  {
> > > > > > > >  MigrationState *s = migrate_get_current();
> > > > > > > > diff --git a/vl.c b/vl.c
> > > > > > > > index fb1f05b937..abbe61f40b 100644
> > > > > > > > --- a/vl.c
> > > > > > > > +++ b/vl.c
> > > > > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > > > > >  #include "sysemu/blockdev.h"
> > > > > > > >  #include "hw/block/block.h"
> > > > > > > >  #include "migration/misc.h"
> > > > > > > > +#include "migration/savevm.h"
> > > > > > > >  #include "migration/snapshot.h"
> > > > > > > >  #include "migration/global_state.h"
> > > > > > > >  #include "sysemu/tpm.h"
> > > > > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char 
> > > > > > > > **envp)
> > > > > > > >  iothread_stop_all();
> > > > > > > >  
> > > > > > > >  pause_all_vcpus();
> > > > > > > > +migrate_cancel();
> > > > > > > 
> > > > > > > IIUC this is an async cancel, so when reach here the migration 
> > > > > > > thread
> > > > > > > can still be alive.  Then...
> > > > > > > 
> > > > > > > > +qemu_savevm_state_cleanup();
> > > > > > > 
> > > > > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > > > > migration thread has not yet quitted.
> > > > > > > 
> > > > > > > I'm thinking whether we should make migrate_fd_cancel() wait 
> > > > > > > until the
> > > > > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > > > > migration thread will do the cleanup, and here we can avoid 
> > > > > > > calling
> > > > > > > qemu_savevm_state_cleanup() as well.
> > > > > > 
> > > > > > But if the migration thread is stuck and CANCELLED is never 
> > > > > > reached, we'll hang
> > > > > > here?
> > > > > 
> > > > > I'm not sure I see an easy fix; I agree with Peter that calling
> > > > > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > > > > because the cancel just forces the state to CANCELLING before coming
> > > > > back to you, and the migration thread asynchronously starts to
> > > > > fail/cleanup.
> > > > > 
> > > > > migrate_cancel() can forcibly unblock some cases because it calls
> > > > > shutdown(2) on the network fd, but 

Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-18 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> On Fri, Sep 15, 2017 at 10:22:43AM +0100, Dr. David Alan Gilbert wrote:
> > * Fam Zheng (f...@redhat.com) wrote:
> > > On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > > > * Fam Zheng (f...@redhat.com) wrote:
> > > > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > > > bdrv_close_all() would abort() due to op blockers added by BMDS, 
> > > > > > > clean
> > > > > > > up migration states when main loop quits to avoid that.
> > > > > > > 
> > > > > > > Signed-off-by: Fam Zheng 
> > > > > > > ---
> > > > > > >  include/migration/misc.h | 1 +
> > > > > > >  migration/migration.c| 7 ++-
> > > > > > >  vl.c | 3 +++
> > > > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > > > index c079b7771b..b9a26b0898 100644
> > > > > > > --- a/include/migration/misc.h
> > > > > > > +++ b/include/migration/misc.h
> > > > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > > > >  /* ...and after the device transmission */
> > > > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > > > >  void migration_global_dump(Monitor *mon);
> > > > > > > +void migrate_cancel(void);
> > > > > > >  
> > > > > > >  #endif
> > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > index 959e8ec88e..2c844945c7 100644
> > > > > > > --- a/migration/migration.c
> > > > > > > +++ b/migration/migration.c
> > > > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool 
> > > > > > > has_blk, bool blk,
> > > > > > >  }
> > > > > > >  }
> > > > > > >  
> > > > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > > > +void migrate_cancel(void)
> > > > > > >  {
> > > > > > >  migrate_fd_cancel(migrate_get_current());
> > > > > > >  }
> > > > > > >  
> > > > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > > > +{
> > > > > > > +migrate_cancel();
> > > > > > > +}
> > > > > > > +
> > > > > > 
> > > > > > Nit: I would prefer just call migrate_fd_cancel() below, since I 
> > > > > > don't
> > > > > > see much point to introduce migrate_cancel() which only calls
> > > > > > migrate_fd_cancel()...
> > > > > 
> > > > > migrate_get_current() is a migration internal IMHO. But that can be 
> > > > > moved to
> > > > > migrate_fd_cancel() so the parameter is dropped.
> > > > > 
> > > > > > 
> > > > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > > > >  {
> > > > > > >  MigrationState *s = migrate_get_current();
> > > > > > > diff --git a/vl.c b/vl.c
> > > > > > > index fb1f05b937..abbe61f40b 100644
> > > > > > > --- a/vl.c
> > > > > > > +++ b/vl.c
> > > > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > > > >  #include "sysemu/blockdev.h"
> > > > > > >  #include "hw/block/block.h"
> > > > > > >  #include "migration/misc.h"
> > > > > > > +#include "migration/savevm.h"
> > > > > > >  #include "migration/snapshot.h"
> > > > > > >  #include "migration/global_state.h"
> > > > > > >  #include "sysemu/tpm.h"
> > > > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > > > > >  iothread_stop_all();
> > > > > > >  
> > > > > > >  pause_all_vcpus();
> > > > > > > +migrate_cancel();
> > > > > > 
> > > > > > IIUC this is an async cancel, so when reach here the migration 
> > > > > > thread
> > > > > > can still be alive.  Then...
> > > > > > 
> > > > > > > +qemu_savevm_state_cleanup();
> > > > > > 
> > > > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > > > migration thread has not yet quitted.
> > > > > > 
> > > > > > I'm thinking whether we should make migrate_fd_cancel() wait until 
> > > > > > the
> > > > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > > > migration thread will do the cleanup, and here we can avoid calling
> > > > > > qemu_savevm_state_cleanup() as well.
> > > > > 
> > > > > But if the migration thread is stuck and CANCELLED is never reached, 
> > > > > we'll hang
> > > > > here?
> > > > 
> > > > I'm not sure I see an easy fix; I agree with Peter that calling
> > > > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > > > because the cancel just forces the state to CANCELLING before coming
> > > > back to you, and the migration thread asynchronously starts to
> > > > fail/cleanup.
> > > > 
> > > > migrate_cancel() can forcibly unblock some cases because it calls
> > > > shutdown(2) on the network fd, but there are other ways for a migration
> > > > to hang.
> > > > 
> > > > Having said that,  the migration thread does it's calls
> > > > to qemu_savevm_state_cleanup under the lock_iothread;
> > > > Do we have that lock at this point?
> > > 
> > > Yes we do.  Main loop releases the lock only during poll(), other parts 

Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-18 Thread Peter Xu
On Fri, Sep 15, 2017 at 10:22:43AM +0100, Dr. David Alan Gilbert wrote:
> * Fam Zheng (f...@redhat.com) wrote:
> > On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > > * Fam Zheng (f...@redhat.com) wrote:
> > > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > > bdrv_close_all() would abort() due to op blockers added by BMDS, 
> > > > > > clean
> > > > > > up migration states when main loop quits to avoid that.
> > > > > > 
> > > > > > Signed-off-by: Fam Zheng 
> > > > > > ---
> > > > > >  include/migration/misc.h | 1 +
> > > > > >  migration/migration.c| 7 ++-
> > > > > >  vl.c | 3 +++
> > > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > > index c079b7771b..b9a26b0898 100644
> > > > > > --- a/include/migration/misc.h
> > > > > > +++ b/include/migration/misc.h
> > > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > > >  /* ...and after the device transmission */
> > > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > > >  void migration_global_dump(Monitor *mon);
> > > > > > +void migrate_cancel(void);
> > > > > >  
> > > > > >  #endif
> > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > index 959e8ec88e..2c844945c7 100644
> > > > > > --- a/migration/migration.c
> > > > > > +++ b/migration/migration.c
> > > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool 
> > > > > > has_blk, bool blk,
> > > > > >  }
> > > > > >  }
> > > > > >  
> > > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > > +void migrate_cancel(void)
> > > > > >  {
> > > > > >  migrate_fd_cancel(migrate_get_current());
> > > > > >  }
> > > > > >  
> > > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > > +{
> > > > > > +migrate_cancel();
> > > > > > +}
> > > > > > +
> > > > > 
> > > > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > > > see much point to introduce migrate_cancel() which only calls
> > > > > migrate_fd_cancel()...
> > > > 
> > > > migrate_get_current() is a migration internal IMHO. But that can be 
> > > > moved to
> > > > migrate_fd_cancel() so the parameter is dropped.
> > > > 
> > > > > 
> > > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > > >  {
> > > > > >  MigrationState *s = migrate_get_current();
> > > > > > diff --git a/vl.c b/vl.c
> > > > > > index fb1f05b937..abbe61f40b 100644
> > > > > > --- a/vl.c
> > > > > > +++ b/vl.c
> > > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > > >  #include "sysemu/blockdev.h"
> > > > > >  #include "hw/block/block.h"
> > > > > >  #include "migration/misc.h"
> > > > > > +#include "migration/savevm.h"
> > > > > >  #include "migration/snapshot.h"
> > > > > >  #include "migration/global_state.h"
> > > > > >  #include "sysemu/tpm.h"
> > > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > > > >  iothread_stop_all();
> > > > > >  
> > > > > >  pause_all_vcpus();
> > > > > > +migrate_cancel();
> > > > > 
> > > > > IIUC this is an async cancel, so when reach here the migration thread
> > > > > can still be alive.  Then...
> > > > > 
> > > > > > +qemu_savevm_state_cleanup();
> > > > > 
> > > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > > migration thread has not yet quitted.
> > > > > 
> > > > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > > migration thread will do the cleanup, and here we can avoid calling
> > > > > qemu_savevm_state_cleanup() as well.
> > > > 
> > > > But if the migration thread is stuck and CANCELLED is never reached, 
> > > > we'll hang
> > > > here?
> > > 
> > > I'm not sure I see an easy fix; I agree with Peter that calling
> > > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > > because the cancel just forces the state to CANCELLING before coming
> > > back to you, and the migration thread asynchronously starts to
> > > fail/cleanup.
> > > 
> > > migrate_cancel() can forcibly unblock some cases because it calls
> > > shutdown(2) on the network fd, but there are other ways for a migration
> > > to hang.
> > > 
> > > Having said that,  the migration thread does it's calls
> > > to qemu_savevm_state_cleanup under the lock_iothread;
> > > Do we have that lock at this point?
> > 
> > Yes we do.  Main loop releases the lock only during poll(), other parts of
> > main() all have the lock.
> 
> Having said that though this is pretty confusing; because at this point
> we're after main_loop has exited.
> I don't think the migration thread will exit without having taken the
> iothread lock; so if we've got it at this point then the migration
> thread will 

Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-15 Thread Dr. David Alan Gilbert
* Fam Zheng (f...@redhat.com) wrote:
> On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > * Fam Zheng (f...@redhat.com) wrote:
> > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > > > up migration states when main loop quits to avoid that.
> > > > > 
> > > > > Signed-off-by: Fam Zheng 
> > > > > ---
> > > > >  include/migration/misc.h | 1 +
> > > > >  migration/migration.c| 7 ++-
> > > > >  vl.c | 3 +++
> > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > index c079b7771b..b9a26b0898 100644
> > > > > --- a/include/migration/misc.h
> > > > > +++ b/include/migration/misc.h
> > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > >  /* ...and after the device transmission */
> > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > >  void migration_global_dump(Monitor *mon);
> > > > > +void migrate_cancel(void);
> > > > >  
> > > > >  #endif
> > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > index 959e8ec88e..2c844945c7 100644
> > > > > --- a/migration/migration.c
> > > > > +++ b/migration/migration.c
> > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool 
> > > > > has_blk, bool blk,
> > > > >  }
> > > > >  }
> > > > >  
> > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > +void migrate_cancel(void)
> > > > >  {
> > > > >  migrate_fd_cancel(migrate_get_current());
> > > > >  }
> > > > >  
> > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > +{
> > > > > +migrate_cancel();
> > > > > +}
> > > > > +
> > > > 
> > > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > > see much point to introduce migrate_cancel() which only calls
> > > > migrate_fd_cancel()...
> > > 
> > > migrate_get_current() is a migration internal IMHO. But that can be moved 
> > > to
> > > migrate_fd_cancel() so the parameter is dropped.
> > > 
> > > > 
> > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > >  {
> > > > >  MigrationState *s = migrate_get_current();
> > > > > diff --git a/vl.c b/vl.c
> > > > > index fb1f05b937..abbe61f40b 100644
> > > > > --- a/vl.c
> > > > > +++ b/vl.c
> > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > >  #include "sysemu/blockdev.h"
> > > > >  #include "hw/block/block.h"
> > > > >  #include "migration/misc.h"
> > > > > +#include "migration/savevm.h"
> > > > >  #include "migration/snapshot.h"
> > > > >  #include "migration/global_state.h"
> > > > >  #include "sysemu/tpm.h"
> > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > > >  iothread_stop_all();
> > > > >  
> > > > >  pause_all_vcpus();
> > > > > +migrate_cancel();
> > > > 
> > > > IIUC this is an async cancel, so when reach here the migration thread
> > > > can still be alive.  Then...
> > > > 
> > > > > +qemu_savevm_state_cleanup();
> > > > 
> > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > migration thread has not yet quitted.
> > > > 
> > > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > migration thread will do the cleanup, and here we can avoid calling
> > > > qemu_savevm_state_cleanup() as well.
> > > 
> > > But if the migration thread is stuck and CANCELLED is never reached, 
> > > we'll hang
> > > here?
> > 
> > I'm not sure I see an easy fix; I agree with Peter that calling
> > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > because the cancel just forces the state to CANCELLING before coming
> > back to you, and the migration thread asynchronously starts to
> > fail/cleanup.
> > 
> > migrate_cancel() can forcibly unblock some cases because it calls
> > shutdown(2) on the network fd, but there are other ways for a migration
> > to hang.
> > 
> > Having said that,  the migration thread does it's calls
> > to qemu_savevm_state_cleanup under the lock_iothread;
> > Do we have that lock at this point?
> 
> Yes we do.  Main loop releases the lock only during poll(), other parts of
> main() all have the lock.

Having said that though this is pretty confusing; because at this point
we're after main_loop has exited.
I don't think the migration thread will exit without having taken the
iothread lock; so if we've got it at this point then the migration
thread will never exit, and it will never call qemu_savevm_state_cleanup
itself - so that race might not exist?

However, assuming the migration is at an earlier point, it might be
calling one of the state handlers that you're cleaning up, and that's
racy in individual devices.

If we have the lock I don't think we can wait 

Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-15 Thread Dr. David Alan Gilbert
* Fam Zheng (f...@redhat.com) wrote:
> On Fri, 09/15 16:03, Peter Xu wrote:
> > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > up migration states when main loop quits to avoid that.
> > > 
> > > Signed-off-by: Fam Zheng 
> > > ---
> > >  include/migration/misc.h | 1 +
> > >  migration/migration.c| 7 ++-
> > >  vl.c | 3 +++
> > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > index c079b7771b..b9a26b0898 100644
> > > --- a/include/migration/misc.h
> > > +++ b/include/migration/misc.h
> > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > >  /* ...and after the device transmission */
> > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > >  void migration_global_dump(Monitor *mon);
> > > +void migrate_cancel(void);
> > >  
> > >  #endif
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 959e8ec88e..2c844945c7 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, 
> > > bool blk,
> > >  }
> > >  }
> > >  
> > > -void qmp_migrate_cancel(Error **errp)
> > > +void migrate_cancel(void)
> > >  {
> > >  migrate_fd_cancel(migrate_get_current());
> > >  }
> > >  
> > > +void qmp_migrate_cancel(Error **errp)
> > > +{
> > > +migrate_cancel();
> > > +}
> > > +
> > 
> > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > see much point to introduce migrate_cancel() which only calls
> > migrate_fd_cancel()...
> 
> migrate_get_current() is a migration internal IMHO. But that can be moved to
> migrate_fd_cancel() so the parameter is dropped.
> 
> > 
> > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > >  {
> > >  MigrationState *s = migrate_get_current();
> > > diff --git a/vl.c b/vl.c
> > > index fb1f05b937..abbe61f40b 100644
> > > --- a/vl.c
> > > +++ b/vl.c
> > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > >  #include "sysemu/blockdev.h"
> > >  #include "hw/block/block.h"
> > >  #include "migration/misc.h"
> > > +#include "migration/savevm.h"
> > >  #include "migration/snapshot.h"
> > >  #include "migration/global_state.h"
> > >  #include "sysemu/tpm.h"
> > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > >  iothread_stop_all();
> > >  
> > >  pause_all_vcpus();
> > > +migrate_cancel();
> > 
> > IIUC this is an async cancel, so when reach here the migration thread
> > can still be alive.  Then...
> > 
> > > +qemu_savevm_state_cleanup();
> > 
> > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > migration thread has not yet quitted.
> > 
> > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > migration thread finishes (state change to CANCELLED).  Then the
> > migration thread will do the cleanup, and here we can avoid calling
> > qemu_savevm_state_cleanup() as well.
> 
> But if the migration thread is stuck and CANCELLED is never reached, we'll 
> hang
> here?

I'm not sure I see an easy fix; I agree with Peter that calling
migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
because the cancel just forces the state to CANCELLING before coming
back to you, and the migration thread asynchronously starts to
fail/cleanup.

migrate_cancel() can forcibly unblock some cases because it calls
shutdown(2) on the network fd, but there are other ways for a migration
to hang.

Having said that,  the migration thread does it's calls
to qemu_savevm_state_cleanup under the lock_iothread;
Do we have that lock at this point?

Dave

> Fam
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-15 Thread Fam Zheng
On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> * Fam Zheng (f...@redhat.com) wrote:
> > On Fri, 09/15 16:03, Peter Xu wrote:
> > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > > up migration states when main loop quits to avoid that.
> > > > 
> > > > Signed-off-by: Fam Zheng 
> > > > ---
> > > >  include/migration/misc.h | 1 +
> > > >  migration/migration.c| 7 ++-
> > > >  vl.c | 3 +++
> > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > index c079b7771b..b9a26b0898 100644
> > > > --- a/include/migration/misc.h
> > > > +++ b/include/migration/misc.h
> > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > >  /* ...and after the device transmission */
> > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > >  void migration_global_dump(Monitor *mon);
> > > > +void migrate_cancel(void);
> > > >  
> > > >  #endif
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 959e8ec88e..2c844945c7 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, 
> > > > bool blk,
> > > >  }
> > > >  }
> > > >  
> > > > -void qmp_migrate_cancel(Error **errp)
> > > > +void migrate_cancel(void)
> > > >  {
> > > >  migrate_fd_cancel(migrate_get_current());
> > > >  }
> > > >  
> > > > +void qmp_migrate_cancel(Error **errp)
> > > > +{
> > > > +migrate_cancel();
> > > > +}
> > > > +
> > > 
> > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > see much point to introduce migrate_cancel() which only calls
> > > migrate_fd_cancel()...
> > 
> > migrate_get_current() is a migration internal IMHO. But that can be moved to
> > migrate_fd_cancel() so the parameter is dropped.
> > 
> > > 
> > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > >  {
> > > >  MigrationState *s = migrate_get_current();
> > > > diff --git a/vl.c b/vl.c
> > > > index fb1f05b937..abbe61f40b 100644
> > > > --- a/vl.c
> > > > +++ b/vl.c
> > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > >  #include "sysemu/blockdev.h"
> > > >  #include "hw/block/block.h"
> > > >  #include "migration/misc.h"
> > > > +#include "migration/savevm.h"
> > > >  #include "migration/snapshot.h"
> > > >  #include "migration/global_state.h"
> > > >  #include "sysemu/tpm.h"
> > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > >  iothread_stop_all();
> > > >  
> > > >  pause_all_vcpus();
> > > > +migrate_cancel();
> > > 
> > > IIUC this is an async cancel, so when reach here the migration thread
> > > can still be alive.  Then...
> > > 
> > > > +qemu_savevm_state_cleanup();
> > > 
> > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > migration thread has not yet quitted.
> > > 
> > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > migration thread finishes (state change to CANCELLED).  Then the
> > > migration thread will do the cleanup, and here we can avoid calling
> > > qemu_savevm_state_cleanup() as well.
> > 
> > But if the migration thread is stuck and CANCELLED is never reached, we'll 
> > hang
> > here?
> 
> I'm not sure I see an easy fix; I agree with Peter that calling
> migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> because the cancel just forces the state to CANCELLING before coming
> back to you, and the migration thread asynchronously starts to
> fail/cleanup.
> 
> migrate_cancel() can forcibly unblock some cases because it calls
> shutdown(2) on the network fd, but there are other ways for a migration
> to hang.
> 
> Having said that,  the migration thread does it's calls
> to qemu_savevm_state_cleanup under the lock_iothread;
> Do we have that lock at this point?

Yes we do.  Main loop releases the lock only during poll(), other parts of
main() all have the lock.

Fam



Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-15 Thread Peter Xu
On Fri, Sep 15, 2017 at 04:20:26PM +0800, Fam Zheng wrote:

[...]

> > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > >  {
> > >  MigrationState *s = migrate_get_current();
> > > diff --git a/vl.c b/vl.c
> > > index fb1f05b937..abbe61f40b 100644
> > > --- a/vl.c
> > > +++ b/vl.c
> > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > >  #include "sysemu/blockdev.h"
> > >  #include "hw/block/block.h"
> > >  #include "migration/misc.h"
> > > +#include "migration/savevm.h"
> > >  #include "migration/snapshot.h"
> > >  #include "migration/global_state.h"
> > >  #include "sysemu/tpm.h"
> > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > >  iothread_stop_all();
> > >  
> > >  pause_all_vcpus();
> > > +migrate_cancel();
> > 
> > IIUC this is an async cancel, so when reach here the migration thread
> > can still be alive.  Then...
> > 
> > > +qemu_savevm_state_cleanup();
> > 
> > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > migration thread has not yet quitted.
> > 
> > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > migration thread finishes (state change to CANCELLED).  Then the
> > migration thread will do the cleanup, and here we can avoid calling
> > qemu_savevm_state_cleanup() as well.
> 
> But if the migration thread is stuck and CANCELLED is never reached, we'll 
> hang
> here?

Maybe we can add timeout. Anyway, if it gets stuck, I do see it a
migration bug, and in that case I'm not sure force return after
timeout would be anything wiser.

Dave/Juan may have better idea though.

-- 
Peter Xu



Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-15 Thread Fam Zheng
On Fri, 09/15 16:03, Peter Xu wrote:
> On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > up migration states when main loop quits to avoid that.
> > 
> > Signed-off-by: Fam Zheng 
> > ---
> >  include/migration/misc.h | 1 +
> >  migration/migration.c| 7 ++-
> >  vl.c | 3 +++
> >  3 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > index c079b7771b..b9a26b0898 100644
> > --- a/include/migration/misc.h
> > +++ b/include/migration/misc.h
> > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> >  /* ...and after the device transmission */
> >  bool migration_in_postcopy_after_devices(MigrationState *);
> >  void migration_global_dump(Monitor *mon);
> > +void migrate_cancel(void);
> >  
> >  #endif
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 959e8ec88e..2c844945c7 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, 
> > bool blk,
> >  }
> >  }
> >  
> > -void qmp_migrate_cancel(Error **errp)
> > +void migrate_cancel(void)
> >  {
> >  migrate_fd_cancel(migrate_get_current());
> >  }
> >  
> > +void qmp_migrate_cancel(Error **errp)
> > +{
> > +migrate_cancel();
> > +}
> > +
> 
> Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> see much point to introduce migrate_cancel() which only calls
> migrate_fd_cancel()...

migrate_get_current() is a migration internal IMHO. But that can be moved to
migrate_fd_cancel() so the parameter is dropped.

> 
> >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> >  {
> >  MigrationState *s = migrate_get_current();
> > diff --git a/vl.c b/vl.c
> > index fb1f05b937..abbe61f40b 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> >  #include "sysemu/blockdev.h"
> >  #include "hw/block/block.h"
> >  #include "migration/misc.h"
> > +#include "migration/savevm.h"
> >  #include "migration/snapshot.h"
> >  #include "migration/global_state.h"
> >  #include "sysemu/tpm.h"
> > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> >  iothread_stop_all();
> >  
> >  pause_all_vcpus();
> > +migrate_cancel();
> 
> IIUC this is an async cancel, so when reach here the migration thread
> can still be alive.  Then...
> 
> > +qemu_savevm_state_cleanup();
> 
> ... Here calling qemu_savevm_state_cleanup() may be problematic if
> migration thread has not yet quitted.
> 
> I'm thinking whether we should make migrate_fd_cancel() wait until the
> migration thread finishes (state change to CANCELLED).  Then the
> migration thread will do the cleanup, and here we can avoid calling
> qemu_savevm_state_cleanup() as well.

But if the migration thread is stuck and CANCELLED is never reached, we'll hang
here?

Fam



Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit

2017-09-15 Thread Peter Xu
On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> up migration states when main loop quits to avoid that.
> 
> Signed-off-by: Fam Zheng 
> ---
>  include/migration/misc.h | 1 +
>  migration/migration.c| 7 ++-
>  vl.c | 3 +++
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index c079b7771b..b9a26b0898 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
>  /* ...and after the device transmission */
>  bool migration_in_postcopy_after_devices(MigrationState *);
>  void migration_global_dump(Monitor *mon);
> +void migrate_cancel(void);
>  
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 959e8ec88e..2c844945c7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool 
> blk,
>  }
>  }
>  
> -void qmp_migrate_cancel(Error **errp)
> +void migrate_cancel(void)
>  {
>  migrate_fd_cancel(migrate_get_current());
>  }
>  
> +void qmp_migrate_cancel(Error **errp)
> +{
> +migrate_cancel();
> +}
> +

Nit: I would prefer just call migrate_fd_cancel() below, since I don't
see much point to introduce migrate_cancel() which only calls
migrate_fd_cancel()...

>  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
>  {
>  MigrationState *s = migrate_get_current();
> diff --git a/vl.c b/vl.c
> index fb1f05b937..abbe61f40b 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -87,6 +87,7 @@ int main(int argc, char **argv)
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
>  #include "migration/misc.h"
> +#include "migration/savevm.h"
>  #include "migration/snapshot.h"
>  #include "migration/global_state.h"
>  #include "sysemu/tpm.h"
> @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
>  iothread_stop_all();
>  
>  pause_all_vcpus();
> +migrate_cancel();

IIUC this is an async cancel, so when reach here the migration thread
can still be alive.  Then...

> +qemu_savevm_state_cleanup();

... Here calling qemu_savevm_state_cleanup() may be problematic if
migration thread has not yet quitted.

I'm thinking whether we should make migrate_fd_cancel() wait until the
migration thread finishes (state change to CANCELLED).  Then the
migration thread will do the cleanup, and here we can avoid calling
qemu_savevm_state_cleanup() as well.

>  bdrv_close_all();
>  res_free();
>  
> -- 
> 2.13.5
> 

-- 
Peter Xu