Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2017-01-10 Thread Hailiang Zhang

ping .. ?

Any comments? Or should I send a formal patch?

On 2016/12/22 10:56, Hailiang Zhang wrote:

On 2016/12/9 4:02, Dr. David Alan Gilbert wrote:

* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:

Hi,

On 2016/12/6 23:24, Dr. David Alan Gilbert wrote:

* Kevin Wolf (kw...@redhat.com) wrote:

Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:

commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
migration aborted QEMU because it didn't regain control of the images
when some errors happened.

Actually, we have another case in that error path where QEMU aborts for
the same reason:
  migration_thread()
    migration_completion()
      bdrv_inactivate_all() ---------------> inactivate images
      qemu_savevm_state_complete_precopy()
        socket_writev_buffer() ------------> error because destination fails
          qemu_fflush() -------------------> set error on migration stream
      qemu_mutex_unlock_iothread() --------> unlock
  qmp_migrate_cancel() --------------------> user cancelled migration
    migrate_set_state() -------------------> set migrate CANCELLING


Important to note here: qmp_migrate_cancel() is executed by a concurrent
thread, it doesn't depend on any code paths in migration_completion().


  migration_completion() ------------------> go on to fail_invalidate
    if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
  migration_thread() ----------------------> break migration loop
    vm_start() ----------------------------> restart guest with inactive images
We failed to regain control of the images because we only regain it while
the migration state is "active", but here the user cancelled the migration
after noticing that some errors had happened (for example, the libvirtd
daemon was shut down unexpectedly on the destination).

Signed-off-by: zhanghailiang 
---
migration/migration.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index f498ab8..0c1ee6d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1752,7 +1752,8 @@ fail_invalidate:
/* If not doing postcopy, vm_start() will be called: let's regain
 * control on images.
 */
-if (s->state == MIGRATION_STATUS_ACTIVE) {


This if condition tries to check whether we ran the code path that
called bdrv_inactivate_all(), so that we only try to reactivate images
if we really inactivated them first.

The problem with it is that it ignores a possible concurrent
modification of s->state.


+if (s->state == MIGRATION_STATUS_ACTIVE ||
+s->state == MIGRATION_STATUS_CANCELLING) {


This adds another state that we could end up in with a concurrent
modification, so that even in this case we undo the inactivation.

However, it is no longer limited to the cases where we inactivated the
image. It also applies to other code paths (like the postcopy one) where
we didn't inactivate images.

What saves the patch is that bdrv_invalidate_cache() is a no-op for
block devices that aren't inactivated, so calling it more often than
necessary is okay.

But then, if we're going to rely on this, it would be much better to
just remove the if altogether. I can't say whether there are any other
possible values of s->state that we should consider, and by removing the
if we would be guaranteed to catch all of them.

If we don't want to rely on it, just keep a local bool that remembers
whether we inactivated images and check that here.


Error *local_err = NULL;

bdrv_invalidate_cache_all(&local_err);


So in summary, this is a horrible patch because it checks the wrong
thing, and I can't really say whether it covers everything it needs to
cover, but arguably it happens to correctly fix the outcome of a
previously failing case.

Normally I would reject such a patch and require a clean solution, but
then we're on the day of -rc3, so if you can't send v2 right away, we
might not have the time for it.

Tough call...


Hmm, this case is messy; I created this function having split it out
of the main loop a couple of years back but it did get more messy
with more s->state checks; as far as I can tell it's always
done the transition to COMPLETED at the end well after the locked
section, so there's always been that chance that cancellation sneaks
in just before or just after the locked section.

Some of the bad cases that can happen:
  a) A cancel sneaks in after the ACTIVE check but before or after
     the locked section; should we reactivate the disks? Well, that
     depends on whether the destination actually got the full migration
     stream - we don't know!
     If the destination actually starts running we must not reactivate
     the disk on the source even if the CPU is stopped.



Yes, we didn't have a mechanism to know exactly whether or not the VM on the destination was received successfully.

Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-27 Thread Hailiang Zhang

On 2016/12/22 10:56, Hailiang Zhang wrote:

On 2016/12/9 4:02, Dr. David Alan Gilbert wrote:

* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:

Hi,

On 2016/12/6 23:24, Dr. David Alan Gilbert wrote:

* Kevin Wolf (kw...@redhat.com) wrote:

Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:

commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
migration aborted QEMU because it didn't regain control of the images
when some errors happened.

Actually, we have another case in that error path where QEMU aborts for
the same reason:
  migration_thread()
    migration_completion()
      bdrv_inactivate_all() ---------------> inactivate images
      qemu_savevm_state_complete_precopy()
        socket_writev_buffer() ------------> error because destination fails
          qemu_fflush() -------------------> set error on migration stream
      qemu_mutex_unlock_iothread() --------> unlock
  qmp_migrate_cancel() --------------------> user cancelled migration
    migrate_set_state() -------------------> set migrate CANCELLING


Important to note here: qmp_migrate_cancel() is executed by a concurrent
thread, it doesn't depend on any code paths in migration_completion().


  migration_completion() ------------------> go on to fail_invalidate
    if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
  migration_thread() ----------------------> break migration loop
    vm_start() ----------------------------> restart guest with inactive images
We failed to regain control of the images because we only regain it while
the migration state is "active", but here the user cancelled the migration
after noticing that some errors had happened (for example, the libvirtd
daemon was shut down unexpectedly on the destination).

Signed-off-by: zhanghailiang 
---
migration/migration.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index f498ab8..0c1ee6d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1752,7 +1752,8 @@ fail_invalidate:
/* If not doing postcopy, vm_start() will be called: let's regain
 * control on images.
 */
-if (s->state == MIGRATION_STATUS_ACTIVE) {


This if condition tries to check whether we ran the code path that
called bdrv_inactivate_all(), so that we only try to reactivate images
if we really inactivated them first.

The problem with it is that it ignores a possible concurrent
modification of s->state.


+if (s->state == MIGRATION_STATUS_ACTIVE ||
+s->state == MIGRATION_STATUS_CANCELLING) {


This adds another state that we could end up in with a concurrent
modification, so that even in this case we undo the inactivation.

However, it is no longer limited to the cases where we inactivated the
image. It also applies to other code paths (like the postcopy one) where
we didn't inactivate images.

What saves the patch is that bdrv_invalidate_cache() is a no-op for
block devices that aren't inactivated, so calling it more often than
necessary is okay.

But then, if we're going to rely on this, it would be much better to
just remove the if altogether. I can't say whether there are any other
possible values of s->state that we should consider, and by removing the
if we would be guaranteed to catch all of them.

If we don't want to rely on it, just keep a local bool that remembers
whether we inactivated images and check that here.


Error *local_err = NULL;

bdrv_invalidate_cache_all(&local_err);


So in summary, this is a horrible patch because it checks the wrong
thing, and I can't really say whether it covers everything it needs to
cover, but arguably it happens to correctly fix the outcome of a
previously failing case.

Normally I would reject such a patch and require a clean solution, but
then we're on the day of -rc3, so if you can't send v2 right away, we
might not have the time for it.

Tough call...


Hmm, this case is messy; I created this function having split it out
of the main loop a couple of years back but it did get more messy
with more s->state checks; as far as I can tell it's always
done the transition to COMPLETED at the end well after the locked
section, so there's always been that chance that cancellation sneaks
in just before or just after the locked section.

Some of the bad cases that can happen:
  a) A cancel sneaks in after the ACTIVE check but before or after
     the locked section; should we reactivate the disks? Well, that
     depends on whether the destination actually got the full migration
     stream - we don't know!
     If the destination actually starts running we must not reactivate
     the disk on the source even if the CPU is stopped.



Yes, we didn't have a mechanism to know exactly whether or not the VM on the
destination was received successfully.

Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-21 Thread Hailiang Zhang

On 2016/12/9 4:02, Dr. David Alan Gilbert wrote:

* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:

Hi,

On 2016/12/6 23:24, Dr. David Alan Gilbert wrote:

* Kevin Wolf (kw...@redhat.com) wrote:

Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:

commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
migration aborted QEMU because it didn't regain control of the images
when some errors happened.

Actually, we have another case in that error path where QEMU aborts for
the same reason:
  migration_thread()
    migration_completion()
      bdrv_inactivate_all() ---------------> inactivate images
      qemu_savevm_state_complete_precopy()
        socket_writev_buffer() ------------> error because destination fails
          qemu_fflush() -------------------> set error on migration stream
      qemu_mutex_unlock_iothread() --------> unlock
  qmp_migrate_cancel() --------------------> user cancelled migration
    migrate_set_state() -------------------> set migrate CANCELLING


Important to note here: qmp_migrate_cancel() is executed by a concurrent
thread, it doesn't depend on any code paths in migration_completion().


  migration_completion() ------------------> go on to fail_invalidate
    if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
  migration_thread() ----------------------> break migration loop
    vm_start() ----------------------------> restart guest with inactive images
We failed to regain control of the images because we only regain it while
the migration state is "active", but here the user cancelled the migration
after noticing that some errors had happened (for example, the libvirtd
daemon was shut down unexpectedly on the destination).

Signed-off-by: zhanghailiang 
---
   migration/migration.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index f498ab8..0c1ee6d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1752,7 +1752,8 @@ fail_invalidate:
   /* If not doing postcopy, vm_start() will be called: let's regain
* control on images.
*/
-if (s->state == MIGRATION_STATUS_ACTIVE) {


This if condition tries to check whether we ran the code path that
called bdrv_inactivate_all(), so that we only try to reactivate images
if we really inactivated them first.

The problem with it is that it ignores a possible concurrent
modification of s->state.


+if (s->state == MIGRATION_STATUS_ACTIVE ||
+s->state == MIGRATION_STATUS_CANCELLING) {


This adds another state that we could end up in with a concurrent
modification, so that even in this case we undo the inactivation.

However, it is no longer limited to the cases where we inactivated the
image. It also applies to other code paths (like the postcopy one) where
we didn't inactivate images.

What saves the patch is that bdrv_invalidate_cache() is a no-op for
block devices that aren't inactivated, so calling it more often than
necessary is okay.

But then, if we're going to rely on this, it would be much better to
just remove the if altogether. I can't say whether there are any other
possible values of s->state that we should consider, and by removing the
if we would be guaranteed to catch all of them.

If we don't want to rely on it, just keep a local bool that remembers
whether we inactivated images and check that here.


   Error *local_err = NULL;

   bdrv_invalidate_cache_all(&local_err);


So in summary, this is a horrible patch because it checks the wrong
thing, and I can't really say whether it covers everything it needs to
cover, but arguably it happens to correctly fix the outcome of a
previously failing case.

Normally I would reject such a patch and require a clean solution, but
then we're on the day of -rc3, so if you can't send v2 right away, we
might not have the time for it.

Tough call...


Hmm, this case is messy; I created this function having split it out
of the main loop a couple of years back but it did get more messy
with more s->state checks; as far as I can tell it's always
done the transition to COMPLETED at the end well after the locked
section, so there's always been that chance that cancellation sneaks
in just before or just after the locked section.

Some of the bad cases that can happen:
  a) A cancel sneaks in after the ACTIVE check but before or after
     the locked section; should we reactivate the disks? Well, that
     depends on whether the destination actually got the full migration
     stream - we don't know!
     If the destination actually starts running we must not reactivate
     the disk on the source even if the CPU is stopped.



Yes, we didn't have a mechanism to know exactly whether or not the VM on the
destination was received successfully.


 b) If the bdrv_inactivate_all fails for one 

Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-08 Thread Dr. David Alan Gilbert
* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:
> Hi,
> 
> On 2016/12/6 23:24, Dr. David Alan Gilbert wrote:
> > * Kevin Wolf (kw...@redhat.com) wrote:
> > > Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
> > > > commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
> > > > migration aborted QEMU because it didn't regain control of the images
> > > > when some errors happened.
> > > >
> > > > Actually, we have another case in that error path where QEMU aborts for
> > > > the same reason:
> > > >   migration_thread()
> > > >     migration_completion()
> > > >       bdrv_inactivate_all() ---------------> inactivate images
> > > >       qemu_savevm_state_complete_precopy()
> > > >         socket_writev_buffer() ------------> error because destination fails
> > > >           qemu_fflush() -------------------> set error on migration stream
> > > >       qemu_mutex_unlock_iothread() --------> unlock
> > > >   qmp_migrate_cancel() --------------------> user cancelled migration
> > > >     migrate_set_state() -------------------> set migrate CANCELLING
> > > 
> > > Important to note here: qmp_migrate_cancel() is executed by a concurrent
> > > thread, it doesn't depend on any code paths in migration_completion().
> > > 
> > > >   migration_completion() ------------------> go on to fail_invalidate
> > > >     if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
> > > >   migration_thread() ----------------------> break migration loop
> > > >     vm_start() ----------------------------> restart guest with inactive images
> > > > We failed to regain control of the images because we only regain it while
> > > > the migration state is "active", but here the user cancelled the migration
> > > > after noticing that some errors had happened (for example, the libvirtd
> > > > daemon was shut down unexpectedly on the destination).
> > > > 
> > > > Signed-off-by: zhanghailiang 
> > > > ---
> > > >   migration/migration.c | 3 ++-
> > > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index f498ab8..0c1ee6d 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -1752,7 +1752,8 @@ fail_invalidate:
> > > >   /* If not doing postcopy, vm_start() will be called: let's regain
> > > >* control on images.
> > > >*/
> > > > -if (s->state == MIGRATION_STATUS_ACTIVE) {
> > > 
> > > This if condition tries to check whether we ran the code path that
> > > called bdrv_inactivate_all(), so that we only try to reactivate images
> > > if we really inactivated them first.
> > > 
> > > The problem with it is that it ignores a possible concurrent
> > > modification of s->state.
> > > 
> > > > +if (s->state == MIGRATION_STATUS_ACTIVE ||
> > > > +s->state == MIGRATION_STATUS_CANCELLING) {
> > > 
> > > This adds another state that we could end up in with a concurrent
> > > modification, so that even in this case we undo the inactivation.
> > > 
> > > However, it is no longer limited to the cases where we inactivated the
> > > image. It also applies to other code paths (like the postcopy one) where
> > > we didn't inactivate images.
> > > 
> > > What saves the patch is that bdrv_invalidate_cache() is a no-op for
> > > block devices that aren't inactivated, so calling it more often than
> > > necessary is okay.
> > > 
> > > But then, if we're going to rely on this, it would be much better to
> > > just remove the if altogether. I can't say whether there are any other
> > > possible values of s->state that we should consider, and by removing the
> > > if we would be guaranteed to catch all of them.
> > > 
> > > If we don't want to rely on it, just keep a local bool that remembers
> > > whether we inactivated images and check that here.
> > > 
> > > >   Error *local_err = NULL;
> > > > 
> > > >   bdrv_invalidate_cache_all(&local_err);
> > > 
> > > So in summary, this is a horrible patch because it checks the wrong
> > > thing, and I can't really say whether it covers everything it needs to
> > > cover, but arguably it happens to correctly fix the outcome of a
> > > previously failing case.
> > > 
> > > Normally I would reject such a patch and require a clean solution, but
> > > then we're on the day of -rc3, so if you can't send v2 right away, we
> > > might not have the time for it.
> > > 
> > > Tough call...
> > 
> > Hmm, this case is messy; I created this function having split it out
> > of the main loop a couple of years back but it did get more messy
> > with more s->state checks; as far as I can tell it's always
> > done the transition to COMPLETED at the end well after the locked
> > section, so there's always been that chance that cancellation sneaks
> > in just before or just after the locked section.

Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-07 Thread Hailiang Zhang

Hi,

On 2016/12/6 23:24, Dr. David Alan Gilbert wrote:

* Kevin Wolf (kw...@redhat.com) wrote:

Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:

commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
migration aborted QEMU because it didn't regain control of the images
when some errors happened.

Actually, we have another case in that error path where QEMU aborts for
the same reason:
  migration_thread()
    migration_completion()
      bdrv_inactivate_all() ---------------> inactivate images
      qemu_savevm_state_complete_precopy()
        socket_writev_buffer() ------------> error because destination fails
          qemu_fflush() -------------------> set error on migration stream
      qemu_mutex_unlock_iothread() --------> unlock
  qmp_migrate_cancel() --------------------> user cancelled migration
    migrate_set_state() -------------------> set migrate CANCELLING


Important to note here: qmp_migrate_cancel() is executed by a concurrent
thread, it doesn't depend on any code paths in migration_completion().


  migration_completion() ------------------> go on to fail_invalidate
    if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
  migration_thread() ----------------------> break migration loop
    vm_start() ----------------------------> restart guest with inactive images
We failed to regain control of the images because we only regain it while
the migration state is "active", but here the user cancelled the migration
after noticing that some errors had happened (for example, the libvirtd
daemon was shut down unexpectedly on the destination).

Signed-off-by: zhanghailiang 
---
  migration/migration.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index f498ab8..0c1ee6d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1752,7 +1752,8 @@ fail_invalidate:
  /* If not doing postcopy, vm_start() will be called: let's regain
   * control on images.
   */
-if (s->state == MIGRATION_STATUS_ACTIVE) {


This if condition tries to check whether we ran the code path that
called bdrv_inactivate_all(), so that we only try to reactivate images
if we really inactivated them first.

The problem with it is that it ignores a possible concurrent
modification of s->state.


+if (s->state == MIGRATION_STATUS_ACTIVE ||
+s->state == MIGRATION_STATUS_CANCELLING) {


This adds another state that we could end up in with a concurrent
modification, so that even in this case we undo the inactivation.

However, it is no longer limited to the cases where we inactivated the
image. It also applies to other code paths (like the postcopy one) where
we didn't inactivate images.

What saves the patch is that bdrv_invalidate_cache() is a no-op for
block devices that aren't inactivated, so calling it more often than
necessary is okay.

But then, if we're going to rely on this, it would be much better to
just remove the if altogether. I can't say whether there are any other
possible values of s->state that we should consider, and by removing the
if we would be guaranteed to catch all of them.

If we don't want to rely on it, just keep a local bool that remembers
whether we inactivated images and check that here.


  Error *local_err = NULL;

  bdrv_invalidate_cache_all(&local_err);


So in summary, this is a horrible patch because it checks the wrong
thing, and I can't really say whether it covers everything it needs to
cover, but arguably it happens to correctly fix the outcome of a
previously failing case.

Normally I would reject such a patch and require a clean solution, but
then we're on the day of -rc3, so if you can't send v2 right away, we
might not have the time for it.

Tough call...


Hmm, this case is messy; I created this function having split it out
of the main loop a couple of years back but it did get more messy
with more s->state checks; as far as I can tell it's always
done the transition to COMPLETED at the end well after the locked
section, so there's always been that chance that cancellation sneaks
in just before or just after the locked section.

Some of the bad cases that can happen:
  a) A cancel sneaks in after the ACTIVE check but before or after
     the locked section; should we reactivate the disks? Well, that
     depends on whether the destination actually got the full migration
     stream - we don't know!
     If the destination actually starts running we must not reactivate
     the disk on the source even if the CPU is stopped.



Yes, we didn't have a mechanism to know exactly whether or not the VM on the
destination was received successfully.


b) If the bdrv_inactivate_all fails for one device, but the others
   are fine, we go down the fail: label and don't reactivate, so
   the source dies even though it might have 

Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-06 Thread Dr. David Alan Gilbert
* Kevin Wolf (kw...@redhat.com) wrote:
> Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
> > commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
> > migration aborted QEMU because it didn't regain control of the images
> > when some errors happened.
> >
> > Actually, we have another case in that error path where QEMU aborts for
> > the same reason:
> >   migration_thread()
> >     migration_completion()
> >       bdrv_inactivate_all() ---------------> inactivate images
> >       qemu_savevm_state_complete_precopy()
> >         socket_writev_buffer() ------------> error because destination fails
> >           qemu_fflush() -------------------> set error on migration stream
> >       qemu_mutex_unlock_iothread() --------> unlock
> >   qmp_migrate_cancel() --------------------> user cancelled migration
> >     migrate_set_state() -------------------> set migrate CANCELLING
> 
> Important to note here: qmp_migrate_cancel() is executed by a concurrent
> thread, it doesn't depend on any code paths in migration_completion().
> 
> >   migration_completion() ------------------> go on to fail_invalidate
> >     if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
> >   migration_thread() ----------------------> break migration loop
> >     vm_start() ----------------------------> restart guest with inactive images
> > We failed to regain control of the images because we only regain it while
> > the migration state is "active", but here the user cancelled the migration
> > after noticing that some errors had happened (for example, the libvirtd
> > daemon was shut down unexpectedly on the destination).
> > 
> > Signed-off-by: zhanghailiang 
> > ---
> >  migration/migration.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index f498ab8..0c1ee6d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1752,7 +1752,8 @@ fail_invalidate:
> >  /* If not doing postcopy, vm_start() will be called: let's regain
> >   * control on images.
> >   */
> > -if (s->state == MIGRATION_STATUS_ACTIVE) {
> 
> This if condition tries to check whether we ran the code path that
> called bdrv_inactivate_all(), so that we only try to reactivate images
> if we really inactivated them first.
> 
> The problem with it is that it ignores a possible concurrent
> modification of s->state.
> 
> > +if (s->state == MIGRATION_STATUS_ACTIVE ||
> > +s->state == MIGRATION_STATUS_CANCELLING) {
> 
> This adds another state that we could end up in with a concurrent
> modification, so that even in this case we undo the inactivation.
> 
> However, it is no longer limited to the cases where we inactivated the
> image. It also applies to other code paths (like the postcopy one) where
> we didn't inactivate images.
> 
> What saves the patch is that bdrv_invalidate_cache() is a no-op for
> block devices that aren't inactivated, so calling it more often than
> necessary is okay.
> 
> But then, if we're going to rely on this, it would be much better to
> just remove the if altogether. I can't say whether there are any other
> possible values of s->state that we should consider, and by removing the
> if we would be guaranteed to catch all of them.
> 
> If we don't want to rely on it, just keep a local bool that remembers
> whether we inactivated images and check that here.
> 
> >  Error *local_err = NULL;
> >  
> >  bdrv_invalidate_cache_all(&local_err);
> 
> So in summary, this is a horrible patch because it checks the wrong
> thing, and I can't really say whether it covers everything it needs to
> cover, but arguably it happens to correctly fix the outcome of a
> previously failing case.
> 
> Normally I would reject such a patch and require a clean solution, but
> then we're on the day of -rc3, so if you can't send v2 right away, we
> might not have the time for it.
> 
> Tough call...

Hmm, this case is messy; I created this function having split it out
of the main loop a couple of years back but it did get more messy
with more s->state checks; as far as I can tell it's always
done the transition to COMPLETED at the end well after the locked
section, so there's always been that chance that cancellation sneaks
in just before or just after the locked section.
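
To make the window concrete, the tail of that path looks roughly like this
(a simplified sketch from memory, not the exact code of this tree):

    qemu_mutex_lock_iothread();                  /* locked section begins  */
    ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    if (ret >= 0) {
        ret = bdrv_inactivate_all();             /* images become inactive */
    }
    if (ret >= 0) {
        qemu_savevm_state_complete_precopy(s->to_dst_file, false);
    }
    qemu_mutex_unlock_iothread();                /* locked section ends    */

    if (ret < 0 || qemu_file_get_error(s->to_dst_file)) {
        goto fail;                               /* a cancel may already   */
    }                                            /* have flipped s->state  */

    /* ...window where qmp_migrate_cancel() can still sneak in... */

    migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
                      MIGRATION_STATUS_COMPLETED);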

Some of the bad cases that can happen:
  a) A cancel sneaks in after the ACTIVE check but before or after
     the locked section; should we reactivate the disks? Well, that
     depends on whether the destination actually got the full migration
     stream - we don't know!
     If the destination actually starts running we must not reactivate
     the disk on the source even if the CPU is stopped.

   b) If the bdrv_inactive_all fails for one device, but the others 
  are fine, we go down the 

Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-06 Thread Kevin Wolf
Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
> commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
> migration aborted QEMU because it didn't regain control of the images
> when some errors happened.
>
> Actually, we have another case in that error path where QEMU aborts for
> the same reason:
>   migration_thread()
>     migration_completion()
>       bdrv_inactivate_all() ---------------> inactivate images
>       qemu_savevm_state_complete_precopy()
>         socket_writev_buffer() ------------> error because destination fails
>           qemu_fflush() -------------------> set error on migration stream
>       qemu_mutex_unlock_iothread() --------> unlock
>   qmp_migrate_cancel() --------------------> user cancelled migration
>     migrate_set_state() -------------------> set migrate CANCELLING

Important to note here: qmp_migrate_cancel() is executed by a concurrent
thread, it doesn't depend on any code paths in migration_completion().
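
Roughly, the cancel side boils down to this (a paraphrase to illustrate the
concurrency; the compare-and-swap detail is from memory, not a verbatim quote
of the source):

    /* QMP/monitor thread, running in parallel to migration_thread() */
    void qmp_migrate_cancel(Error **errp)
    {
        MigrationState *s = migrate_get_current();
        int old_state;

        do {
            old_state = s->state;
            if (old_state != MIGRATION_STATUS_SETUP &&
                old_state != MIGRATION_STATUS_ACTIVE &&
                old_state != MIGRATION_STATUS_POSTCOPY_ACTIVE) {
                break;                      /* nothing to cancel */
            }
            /* migrate_set_state() only switches the state if it still
             * equals old_state, i.e. an atomic compare-and-swap. */
            migrate_set_state(&s->state, old_state,
                              MIGRATION_STATUS_CANCELLING);
        } while (s->state != MIGRATION_STATUS_CANCELLING);
    }

So the migration thread can observe s->state changing between any two of its
own checks.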

>   migration_completion() ------------------> go on to fail_invalidate
>     if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
>   migration_thread() ----------------------> break migration loop
>     vm_start() ----------------------------> restart guest with inactive images
> We failed to regain control of the images because we only regain it while
> the migration state is "active", but here the user cancelled the migration
> after noticing that some errors had happened (for example, the libvirtd
> daemon was shut down unexpectedly on the destination).
> 
> Signed-off-by: zhanghailiang 
> ---
>  migration/migration.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index f498ab8..0c1ee6d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1752,7 +1752,8 @@ fail_invalidate:
>  /* If not doing postcopy, vm_start() will be called: let's regain
>   * control on images.
>   */
> -if (s->state == MIGRATION_STATUS_ACTIVE) {

This if condition tries to check whether we ran the code path that
called bdrv_inactivate_all(), so that we only try to reactivate images
if we really inactivated them first.

The problem with it is that it ignores a possible concurrent
modification of s->state.

> +if (s->state == MIGRATION_STATUS_ACTIVE ||
> +s->state == MIGRATION_STATUS_CANCELLING) {

This adds another state that we could end up in with a concurrent
modification, so that even in this case we undo the inactivation.

However, it is no longer limited to the cases where we inactivated the
image. It also applies to other code paths (like the postcopy one) where
we didn't inactivate images.

What saves the patch is that bdrv_invalidate_cache() is a no-op for
block devices that aren't inactivated, so calling it more often than
necessary is okay.
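
(That no-op behaviour comes from an early return in the block layer, roughly
of this shape - a sketch of the idea, not the exact function body:)

    void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
    {
        if (!(bs->open_flags & BDRV_O_INACTIVE)) {
            return;              /* never inactivated: nothing to undo */
        }

        /* otherwise clear BDRV_O_INACTIVE and let the driver re-read the
         * metadata it dropped when the image was inactivated */
    }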

But then, if we're going to rely on this, it would be much better to
just remove the if altogether. I can't say whether there are any other
possible values of s->state that we should consider, and by removing the
if we would be guaranteed to catch all of them.

If we don't want to rely on it, just keep a local bool that remembers
whether we inactivated images and check that here.
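
Something along these lines, as a sketch of that alternative (illustrative
only, not an actual patch; the exact placement would need care):

    bool inactivated = false;                 /* our own record */

    /* ...inside the BQL-protected completion path... */
    ret = bdrv_inactivate_all();
    if (ret >= 0) {
        inactivated = true;                   /* remember what we did */
    }

    /* ...later, in the error path... */
fail_invalidate:
    /* Regain control of the images based on what we actually did, not on
     * s->state, which a concurrent cancel may have changed meanwhile. */
    if (inactivated) {
        Error *local_err = NULL;

        bdrv_invalidate_cache_all(&local_err);
        if (local_err) {
            error_report_err(local_err);
        }
    }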

>  Error *local_err = NULL;
>  
>  bdrv_invalidate_cache_all(&local_err);

So in summary, this is a horrible patch because it checks the wrong
thing, and I can't really say whether it covers everything it needs to
cover, but arguably it happens to correctly fix the outcome of a
previously failing case.

Normally I would reject such a patch and require a clean solution, but
then we're on the day of -rc3, so if you can't send v2 right away, we
might not have the time for it.

Tough call...

Kevin



Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-01 Thread Kevin Wolf
Forwarding to qemu-block so I won't forget to have a look.

Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
> commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
> migration aborted QEMU because it didn't regain control of the images
> when some errors happened.
>
> Actually, we have another case in that error path where QEMU aborts for
> the same reason:
>   migration_thread()
>     migration_completion()
>       bdrv_inactivate_all() ---------------> inactivate images
>       qemu_savevm_state_complete_precopy()
>         socket_writev_buffer() ------------> error because destination fails
>           qemu_fflush() -------------------> set error on migration stream
>       qemu_mutex_unlock_iothread() --------> unlock
>   qmp_migrate_cancel() --------------------> user cancelled migration
>     migrate_set_state() -------------------> set migrate CANCELLING
>   migration_completion() ------------------> go on to fail_invalidate
>     if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
>   migration_thread() ----------------------> break migration loop
>     vm_start() ----------------------------> restart guest with inactive images
> We failed to regain control of the images because we only regain it while
> the migration state is "active", but here the user cancelled the migration
> after noticing that some errors had happened (for example, the libvirtd
> daemon was shut down unexpectedly on the destination).
> 
> Signed-off-by: zhanghailiang 
> ---
>  migration/migration.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index f498ab8..0c1ee6d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1752,7 +1752,8 @@ fail_invalidate:
>      /* If not doing postcopy, vm_start() will be called: let's regain
>       * control on images.
>       */
> -    if (s->state == MIGRATION_STATUS_ACTIVE) {
> +    if (s->state == MIGRATION_STATUS_ACTIVE ||
> +        s->state == MIGRATION_STATUS_CANCELLING) {
>          Error *local_err = NULL;
>  
>          bdrv_invalidate_cache_all(&local_err);
> -- 
> 1.8.3.1
> 
> 
> 



Re: [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-11-21 Thread Hailiang Zhang

Cc: qemu-sta...@nongnu.org

On 2016/11/19 19:43, zhanghailiang wrote:

commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
migration aborted QEMU because it didn't regain control of the images
when some errors happened.

Actually, we have another case in that error path where QEMU aborts for
the same reason:
  migration_thread()
    migration_completion()
      bdrv_inactivate_all() ---------------> inactivate images
      qemu_savevm_state_complete_precopy()
        socket_writev_buffer() ------------> error because destination fails
          qemu_fflush() -------------------> set error on migration stream
      qemu_mutex_unlock_iothread() --------> unlock
  qmp_migrate_cancel() --------------------> user cancelled migration
    migrate_set_state() -------------------> set migrate CANCELLING
  migration_completion() ------------------> go on to fail_invalidate
    if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
  migration_thread() ----------------------> break migration loop
    vm_start() ----------------------------> restart guest with inactive images
We failed to regain control of the images because we only regain it while
the migration state is "active", but here the user cancelled the migration
after noticing that some errors had happened (for example, the libvirtd
daemon was shut down unexpectedly on the destination).

Signed-off-by: zhanghailiang 
---
  migration/migration.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index f498ab8..0c1ee6d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1752,7 +1752,8 @@ fail_invalidate:
     /* If not doing postcopy, vm_start() will be called: let's regain
      * control on images.
      */
-    if (s->state == MIGRATION_STATUS_ACTIVE) {
+    if (s->state == MIGRATION_STATUS_ACTIVE ||
+        s->state == MIGRATION_STATUS_CANCELLING) {
         Error *local_err = NULL;
 
         bdrv_invalidate_cache_all(&local_err);






[Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-11-19 Thread zhanghailiang
commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
migration aborted QEMU because it didn't regain control of the images
when some errors happened.

Actually, we have another case in that error path where QEMU aborts for
the same reason:
  migration_thread()
    migration_completion()
      bdrv_inactivate_all() ---------------> inactivate images
      qemu_savevm_state_complete_precopy()
        socket_writev_buffer() ------------> error because destination fails
          qemu_fflush() -------------------> set error on migration stream
      qemu_mutex_unlock_iothread() --------> unlock
  qmp_migrate_cancel() --------------------> user cancelled migration
    migrate_set_state() -------------------> set migrate CANCELLING
  migration_completion() ------------------> go on to fail_invalidate
    if (s->state == MIGRATION_STATUS_ACTIVE) -> this branch is skipped
  migration_thread() ----------------------> break migration loop
    vm_start() ----------------------------> restart guest with inactive images
We failed to regain control of the images because we only regain it while
the migration state is "active", but here the user cancelled the migration
after noticing that some errors had happened (for example, the libvirtd
daemon was shut down unexpectedly on the destination).

Signed-off-by: zhanghailiang 
---
 migration/migration.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index f498ab8..0c1ee6d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1752,7 +1752,8 @@ fail_invalidate:
     /* If not doing postcopy, vm_start() will be called: let's regain
      * control on images.
      */
-    if (s->state == MIGRATION_STATUS_ACTIVE) {
+    if (s->state == MIGRATION_STATUS_ACTIVE ||
+        s->state == MIGRATION_STATUS_CANCELLING) {
         Error *local_err = NULL;
 
         bdrv_invalidate_cache_all(&local_err);
-- 
1.8.3.1