Public bug reported:

There is a fairly serious bug in VM state handling during live
migration, with the result that if libvirt raises an error *after* the
VM has successfully live migrated to the target host, Nova can end up
thinking the VM is shutoff everywhere, despite it still being active.
The consequences are quite dire, as the user can then manually start
the VM again and corrupt any data in shared volumes and the like.

The fun starts in the _live_migration method in
nova.virt.libvirt.driver, if the 'migrateToURI2' method fails *after*
the guest has completed migration.

At the start of migration, Nova receives an event for the new QEMU
process starting on the target host:

2015-01-23 15:39:57.743 DEBUG nova.compute.manager [-] [instance:
12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state
after lifecycle event "Started"; current vm_state: active, current
task_state: migrating, current DB power_state: 1, VM power_state: 1 from
(pid=19494) handle_lifecycle_event
/home/berrange/src/cloud/nova/nova/compute/manager.py:1134


Upon migration completion, we see the CPUs start running on the target host:

2015-01-23 15:40:02.794 DEBUG nova.compute.manager [-] [instance:
12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state
after lifecycle event "Resumed"; current vm_state: active, current
task_state: migrating, current DB power_state: 1, VM power_state: 1 from
(pid=19494) handle_lifecycle_event
/home/berrange/src/cloud/nova/nova/compute/manager.py:1134

And finally, an event saying that the QEMU process on the source host has stopped:

2015-01-23 15:40:03.629 DEBUG nova.compute.manager [-] [instance:
12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state
after lifecycle event "Stopped"; current vm_state: active, current
task_state: migrating, current DB power_state: 1, VM power_state: 4 from
(pid=23081) handle_lifecycle_event
/home/berrange/src/cloud/nova/nova/compute/manager.py:1134


It is this last event that causes the trouble: it makes Nova mark the
VM as shutoff at this point.

Normally the '_live_migration' method succeeds, and Nova then
immediately and explicitly marks the guest as running on the target
host. If an exception occurs, however, this explicit update of VM state
never happens, so Nova considers the guest shutoff even though it is
still running.
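A minimal Python sketch of the failure mode, using stand-in names and
power-state values (this is a simulation of the flow described above,
not Nova's actual code):

```python
# Stand-in power states mirroring Nova's nova.compute.power_state values.
RUNNING = 1
SHUTDOWN = 4


class Instance(object):
    def __init__(self):
        self.power_state = RUNNING   # guest starts out running


def handle_lifecycle_stopped(instance):
    # The "Stopped" event for the source QEMU resyncs the DB power state.
    instance.power_state = SHUTDOWN


def live_migrate(instance, migrate):
    """Simplified flow: if migrate() raises *after* the guest has already
    moved, the explicit 'running on target' update never executes."""
    try:
        migrate()                       # may raise after migration completed
        instance.power_state = RUNNING  # explicit post-migration update
    except Exception:
        # No state correction here: the earlier "Stopped" resync wins.
        raise


def failing_migrate():
    # Simulate libvirt raising an error even though the guest has
    # already completed migration to the target host.
    raise RuntimeError("migrateToURI2 failed after completion")


inst = Instance()
handle_lifecycle_stopped(inst)          # "Stopped" event from source host
try:
    live_migrate(inst, failing_migrate)
except RuntimeError:
    pass
assert inst.power_state == SHUTDOWN     # Nova now thinks the VM is off
```

The sequencing is collapsed for clarity: in reality the "Stopped" event
arrives while the migration call is still in flight, but the end state
is the same.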


The lifecycle events from libvirt carry an associated "reason", so we
could detect that the shutoff event from libvirt corresponds to a
completed migration and avoid marking the VM as shutoff in Nova. We
would also have to make sure the target host processes the 'Resumed'
event upon migration completion.
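A sketch of that idea. The constant values below are hard-coded
stand-ins so the snippet is self-contained; real code would use the
libvirt Python bindings' VIR_DOMAIN_EVENT_STOPPED_* detail codes:

```python
# Stand-ins for libvirt's event detail codes; in real code these would
# come from the libvirt module (e.g. libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED).
VIR_DOMAIN_EVENT_STOPPED_SHUTDOWN = 0
VIR_DOMAIN_EVENT_STOPPED_MIGRATED = 3

EVENT_STOPPED = "Stopped"
EVENT_IGNORED = None


def translate_stopped_event(detail):
    """Map a libvirt 'stopped' event to a Nova lifecycle event, using the
    event's reason (detail code) to filter out migration shutoffs."""
    if detail == VIR_DOMAIN_EVENT_STOPPED_MIGRATED:
        # The guest stopped on the source because it migrated away, not
        # because it shut down: don't mark the VM as shutoff.
        return EVENT_IGNORED
    return EVENT_STOPPED


# A genuine shutdown still produces a "Stopped" lifecycle event...
assert translate_stopped_event(VIR_DOMAIN_EVENT_STOPPED_SHUTDOWN) == "Stopped"
# ...but the migration-triggered stop on the source host is ignored.
assert translate_stopped_event(VIR_DOMAIN_EVENT_STOPPED_MIGRATED) is None
```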

A safer approach, though, might be to simply mark the VM as being in an
ERROR state if any exception occurs during migration.
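A sketch of that fallback, with illustrative names rather than Nova's
actual API:

```python
ACTIVE = "active"
ERROR = "error"


class Instance(object):
    def __init__(self):
        self.vm_state = ACTIVE


def live_migrate(instance, migrate):
    try:
        migrate()
    except Exception:
        # Rather than guessing which host the guest ended up on, flag the
        # VM as ERROR so it cannot simply be started again by the user
        # without operator intervention.
        instance.vm_state = ERROR
        raise


def failing_migrate():
    raise RuntimeError("migration error after completion")


inst = Instance()
try:
    live_migrate(inst, failing_migrate)
except RuntimeError:
    pass
assert inst.vm_state == ERROR   # user can no longer just restart the VM
```

This trades some operator inconvenience for safety: an ERROR state
forces a human to check where the guest is actually running before
anything destructive can happen.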

** Affects: nova
     Importance: High
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1414065

Title:
  Nova can lose track of running VM if live migration raises an
  exception

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1414065/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
