Reviewed:  https://review.opendev.org/c/openstack/nova/+/764435
Committed: https://opendev.org/openstack/nova/commit/39f0af5d18d6bea34fa15b8f7778115b25432749
Submitter: "Zuul (22348)"
Branch:    master

commit 39f0af5d18d6bea34fa15b8f7778115b25432749
Author: Alexandre Arents <[email protected]>
Date:   Thu Nov 26 15:24:19 2020 +0000

    libvirt: Abort live-migration job when monitoring fails

    During the live migration process, a _live_migration_monitor thread
    checks the progress of the migration on the source host. If for any
    reason we hit an infrastructure issue involving a DB/RPC/libvirt-timeout
    failure, an Exception is raised to the nova-compute service and the
    instance/migration is set to ERROR state.

    The issue is that we may leave the live-migration job running outside
    of nova's control. At the end of the job, the guest is resumed on the
    target host while nova still reports it on the source host; this may
    lead to a split-brain situation if the instance is restarted.

    This change proposes to abort the live-migration job if an issue occurs
    during _live_migration_monitor.

    Change-Id: Ia593b500425c81e54eb401e38264db5cc5fc1f93
    Closes-Bug: #1905944

** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1905944

Title:
  live-migration job not aborted when live_monitor thread fails

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  During live migration, a monitoring thread polls the libvirt job
  progress every 0.5s and updates the DB with the job stats. If there is
  a control plane issue such as a DB/RPC failure or an unexpected libvirt
  exception (timeout), the exception handling does not properly interrupt
  the libvirt job.

  Steps to reproduce
  ==================
  On a multinode devstack master.

  # spawn an instance on source_host
  1) openstack server create --flavor m1.small --image cirros-0.5.1-x86_64-disk \
     --nic net-id=private inst

  # start a live block migration to dest_host, wait a bit (to be in the
  # monitoring thread), and trigger an issue on the DB, for example:
  2) nova live-migration inst ; sleep 6 ; sudo service mysql restart

  3) On the source host you can watch the libvirt job progress until it
     completes and disappears, because libvirt resumes the guest on the
     target host (which starts writing data to the target disk):
     source_host$ watch -n 1 virsh domjobinfo instance-0000000d

  4) On the dest host you will find the instance active:
     dest_host$ virsh list
      Id   Name                State
     -----------------------------------
      20   instance-0000000d   running

  5) nova show inst shows the instance still on the source host:
     $ nova show inst | grep host
     | OS-EXT-SRV-ATTR:host | source_host

  If an admin tries to recover the instance on the source host, since
  that is where the nova DB says it is, we can fall into a split-brain
  where two qemu processes are running on two different disks on two
  hosts (true story..)

  Expected result
  ===============
  If an issue happens, we must at least ensure that the libvirt job is
  interrupted, preventing the guest from resuming on the target host.

  Actual result
  =============
  If an issue happens, the libvirt job continues and brings up the guest
  on the target host, while nova still considers it to be on the source.
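
For illustration only, the behaviour asked for under "Expected result" (abort the job when monitoring fails, then let the original error propagate) could be sketched roughly as below. This is not the code from the merged change: FakeGuest, run_monitored_migration and failing_monitor are hypothetical names made up for this example, and abort_job() here only flips a flag, where a real driver would ask libvirt to cancel the migration job.

```python
import logging

LOG = logging.getLogger(__name__)


class FakeGuest:
    """Hypothetical stand-in for a guest with a running migration job."""

    def __init__(self):
        self.job_running = True

    def abort_job(self):
        # Assumption: a real driver would ask libvirt to cancel the job here.
        self.job_running = False


def run_monitored_migration(guest, monitor):
    """Run monitor(guest); if it raises, abort the job, then re-raise."""
    try:
        monitor(guest)
    except Exception:
        LOG.warning("Error monitoring migration, aborting job", exc_info=True)
        try:
            guest.abort_job()
        except Exception:
            # The abort itself may fail (for example if the job already
            # finished); log it and still propagate the original error.
            LOG.warning("Failed to abort migration job", exc_info=True)
        raise


def failing_monitor(guest):
    # Simulates a DB/RPC failure while polling the job progress.
    raise RuntimeError("DB connection lost")


if __name__ == "__main__":
    guest = FakeGuest()
    try:
        run_monitored_migration(guest, failing_monitor)
    except RuntimeError as exc:
        print("monitor failed:", exc)
    print("job still running:", guest.job_running)
```

The point of the sketch is the ordering: the job is cancelled before the exception reaches the caller, so the guest is not left to finish migrating and resume on the target while the instance is being marked as ERROR on the source.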

