Reviewed:  https://review.opendev.org/594527
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa87b9c288d316b85079e681e0df24354ec1912c
Submitter: Zuul
Branch:    master
commit aa87b9c288d316b85079e681e0df24354ec1912c
Author: Matt Riedemann <[email protected]>
Date:   Tue Aug 21 15:32:13 2018 -0400

    libvirt: check job status for VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event

    Change Ic5cab99944df9e501ba2032eb96911c36304494d added handling for the
    VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event during live migration but
    failed to distinguish between the live migration actually succeeding or
    failing before queueing the EVENT_LIFECYCLE_MIGRATION_COMPLETED event up
    into the ComputeManager.handle_lifecycle_event method.

    As a result, failed live migrations would inadvertently trigger
    activation of the port bindings on the destination host, which
    deactivates the source host port bindings; _rollback_live_migration
    would then delete those activated dest host port bindings and leave
    the source host port bindings deactivated.

    With this change, when we get the VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED
    event we attempt to get the job status to determine the course to take,
    and we only queue the EVENT_LIFECYCLE_MIGRATION_COMPLETED event, which
    triggers the dest host port activation, if we can determine that the
    live migration job completed successfully. Otherwise we simply report
    the guest as paused, the same as before Ic5cab9994.

    Change-Id: I6a4252b0c12c41c233299f30ce8294fef21c7b40
    Closes-Bug: #1788014

** Changed in: nova
   Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1788014

Title:
  When live migration fails due to an internal error, rollback is not
  handled correctly
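The fix above hinges on a single decision: after VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED fires, inspect the libvirt job status and only emit the "migration completed" lifecycle event if the job actually succeeded. A minimal sketch of that decision, not nova's actual implementation (the constant values mirror libvirt's virDomainJobType enum; the event names are illustrative):

```python
# Hedged sketch of the event-handling logic described in the commit
# message above; not the real nova code. Constants follow libvirt's
# virDomainJobType enum values.
VIR_DOMAIN_JOB_NONE = 0       # no job, or the job was already reaped
VIR_DOMAIN_JOB_COMPLETED = 3  # the migration job finished successfully
VIR_DOMAIN_JOB_FAILED = 4     # the migration job failed

def lifecycle_event_for_suspended_migrated(job_type):
    """Map the domain's current libvirt job type to the lifecycle event
    that should be queued for VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED."""
    if job_type == VIR_DOMAIN_JOB_COMPLETED:
        # Migration definitely succeeded: safe to trigger activation of
        # the destination-host port bindings.
        return 'EVENT_LIFECYCLE_MIGRATION_COMPLETED'
    # Failed, aborted, or indeterminate job: just report the guest as
    # paused (the pre-Ic5cab9994 behaviour), so a failed migration never
    # flips the port bindings to the destination host.
    return 'EVENT_LIFECYCLE_PAUSED'
```

The key design point is the conservative default: anything other than a provably completed job is treated as "paused", so _rollback_live_migration can clean up without the destination bindings ever having been activated.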
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) rocky series:
  Triaged
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged

Bug description:
  Description
  ===========
  While testing live migration between ovs and ovs-dpdk on CentOS 7.5, I
  encountered a qemu internal error that resulted in the qemu monitor
  closing. When the migration failed during virDomainMigrateToURI3, nova
  rolled back the migration, leaving the vm running on the source node.
  The vm, however, no longer had network connectivity: the vm port on
  the source node had no vif type because the host-id in the port was no
  longer set. This is likely due to the source node binding not being
  reactivated when the migration fails. If the vm is hard-rebooted in
  this state, it enters the error state because the vif bindings are
  broken.

  Note that the qemu error is out of scope for this bug. This bug
  focuses only on the fact that nova does not correctly roll back the
  migration, leaving the vm running correctly, with networking, on the
  source node when a qemu error is thrown.

  Steps to reproduce
  ==================
  A chronological list of steps which will reproduce the issue:
  1. Deploy a two-node devstack with ovs on the controller and ovs-dpdk
     on the second node.
  2. Configure nova-compute with the libvirt/kvm virt type on both
     nodes.
  3. Allocate hugepages on both nodes and boot a vm on either node.
  4. Attempt to live migrate the vm to the other node.

  Expected result
  ===============
  If qemu has a bug that results in migration failure, the vif bindings
  should be restored to their pre-migration state, and if networking
  cannot be restored, the vm should go directly to the error state.

  Actual result
  =============
  The vm remained running on the source node with state active. The vm
  interface no longer had a host_id set in the neutron
  vif:binding-details, and the vif type was None.
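The expected rollback behaviour can be modelled with a tiny sketch. This is an illustrative model only (the dict shape and function name are hypothetical, not neutron's real port-binding API): each port carries one binding per host, and after a failed migration the destination binding must be deleted and the source binding re-activated, otherwise the port is left with no active binding and its vif_type/binding:host_id end up unset, which is exactly the symptom reported here.

```python
# Hypothetical model of the rollback invariant; not nova/neutron code.
# `bindings` maps host name -> 'ACTIVE' or 'INACTIVE' for one port.
def rollback_bindings(bindings, source_host, dest_host):
    """Restore a port's bindings to their pre-migration state after a
    failed live migration."""
    # Drop the (possibly half-created) destination-host binding first.
    bindings.pop(dest_host, None)
    # Re-activate the source-host binding so the port keeps its
    # binding:host_id and vif_type on the source node.
    bindings[source_host] = 'ACTIVE'
    return bindings

# The buggy flow left the source binding INACTIVE with no dest binding;
# a correct rollback ends with exactly one ACTIVE binding on the source:
state = rollback_bindings({'src': 'INACTIVE'}, 'src', 'dst')
# state == {'src': 'ACTIVE'}
```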
  On hard reboot, the vm entered the error state with the following
  message:

  Error: Failed to perform requested operation on instance "dpdk", the
  instance has an error status: Please try again later [Error: vif_type
  parameter must be present for this vif_driver implementation].

  Environment
  ===========
  1. openstack rocky RC1
     nova sha: afe4512bf66c89a061b1a7ccd3e7ac8e3b1b284d
     neutron sha: 1dda2bca862b1268c0f5ae39b7508f1b1cab6f15
  2. Which hypervisor did you use?
     libvirt + kvm
     libvirtd (libvirt) 3.9.0
     QEMU emulator version 2.10.0 (qemu-kvm-ev-2.10.0-21.el7_5.4.1)
     centos 7.5
  3. Which storage type did you use?
     boot from volume using the default lvm backend deployed by
     devstack.
  4. Which networking type did you use?
     neutron ml2/ovs with ovs and ovs-dpdk (not really relevant to this
     bug).

  Logs & Configs
  ==============
  n-cpu log:

  Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.057596Z qemu-kvm: load of migration failed: Invalid argument: libvirtError: internal error: qemu unexpectedly closed the monitor: 2018-08-16T11:19:00.055479Z qemu-kvm: get_pci_config_device: Bad config data: i=0x10 read: 61 device: 1 cmask: ff wmask: c0 w1cmask:0
  Aug 16 12:19:00 devstack5 nova-compute[14766]: DEBUG nova.virt.libvirt.driver [-] [instance: 05a564c2-2e25-4b07-911f-2ded60fb6756] Migration operation thread notification {{(pid=14766) thread_finished /opt/stack/nova/nova/virt/libvirt/driver.py:7478}}
  Aug 16 12:19:00 devstack5 nova-compute[14766]: Traceback (most recent call last):
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     timer()
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     cb(*args, **kw)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 168, in _do_send
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     waiter.switch(result)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     result = function(*args, **kwargs)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/opt/stack/nova/nova/utils.py", line 810, in context_wrapper
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     return func(*args, **kwargs)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 7134, in _live_migration_operation
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     LOG.error("Live Migration failure: %s", e, instance=instance)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     self.force_reraise()
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     six.reraise(self.type_, self.value, self.tb)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 7127, in _live_migration_operation
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     bandwidth=CONF.libvirt.live_migration_bandwidth)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 673, in migrate
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     destination, params=params, flags=flags)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     result = proxy_call(self._autowrap, f, *args, **kwargs)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     rv = execute(f, *args, **kwargs)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     six.reraise(c, e, tb)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     rv = meth(*args, **kwargs)
  Aug 16 12:19:00 devstack5 nova-compute[14766]:   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1745, in migrateToURI3
  Aug 16 12:19:00 devstack5 nova-compute[14766]:     if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
  Aug 16 12:19:00 devstack5 nova-compute[14766]: libvirtError: internal error: qemu unexpectedly closed the monitor: 2018-08-16T11:19:00.055479Z qemu-kvm: get_pci_config_device: Bad config data: i=0x10 read: 61 device: 1 cmask: ff wmask: c0 w1cmask:0
  Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055541Z qemu-kvm: Failed to load PCIDevice:config
  Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055555Z qemu-kvm: Failed to load virtio-net:virtio
  Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055568Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
  Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055761Z qemu-kvm: warning: TSC frequency mismatch between VM (2712005 kHz) and host (2712000 kHz), and TSC scaling unavailable
  Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.057082Z qemu-kvm: warning: TSC frequency mismatch between VM (2712005 kHz) and host (2712000 kHz), and TSC scaling unavailable

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1788014/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

