Public bug reported:

Recently we live migrated an entire cell to new hardware and we hit the following problem several times...

During a live migration Nova monitors the state of the migration by querying libvirt every 0.5s:
https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452

If libvirt times out, the instance is left in a very bad state. The instance goes to error state, and as far as Nova is concerned it is still on the source compute node. However, libvirt continues with the live migration, so the instance eventually ends up on the destination compute node.

I'm using the Stein release, but looking at the current release the code path seems to be the same.

Here's the Stein trace:

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, in _do_live_migration
    block_migration, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7581, in live_migration
    migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8068, in _live_migration
    finish_event, disk_paths)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7873, in _live_migration_monitor
    info = guest.get_job_info()
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 705, in get_job_info
    stats = self._domain.jobStats()
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMemoryStats)
```

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1924585

Title:
  Live Migration - if libvirt timeout the instance goes to error state but the live migration continues

Status in OpenStack Compute (nova):
  New
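
To illustrate the failure mode reported here, the following is a minimal sketch of a polling loop that tolerates a few transient timeouts instead of treating the first one as fatal. It is not Nova's code and not a proposed patch: `JobInfoTimeout`, `monitor_migration`, `make_fake_job_info` and the `max_consecutive_timeouts` parameter are hypothetical names invented for this example, and the libvirt call is replaced by a plain callable so the sketch runs without libvirt. Only the 0.5s polling interval comes from the report.

```python
import time


class JobInfoTimeout(Exception):
    """Hypothetical stand-in for the libvirtError timeout in the trace."""


def monitor_migration(get_job_info, max_consecutive_timeouts=3,
                      interval=0.5, sleep=time.sleep):
    """Poll get_job_info() until the job completes.

    A single timeout is treated as transient. Only after
    max_consecutive_timeouts failures in a row does the loop give up,
    and then it reports "unknown" rather than assuming the migration
    has stopped (the hypervisor may still be migrating the guest).
    """
    consecutive = 0
    while True:
        try:
            info = get_job_info()
        except JobInfoTimeout:
            consecutive += 1
            if consecutive >= max_consecutive_timeouts:
                return "unknown"
        else:
            consecutive = 0  # a successful poll resets the counter
            if info == "completed":
                return "completed"
        sleep(interval)


def make_fake_job_info(responses):
    """Build a fake poller that replays responses (assumption: test aid only).

    Passing the JobInfoTimeout class as an item makes that poll raise.
    """
    it = iter(responses)

    def fake():
        item = next(it)
        if item is JobInfoTimeout:
            raise JobInfoTimeout("cannot acquire state change lock")
        return item

    return fake


if __name__ == "__main__":
    poller = make_fake_job_info(
        ["running", JobInfoTimeout, "running", "completed"])
    print(monitor_migration(poller, sleep=lambda _: None))
```

The design point is the "unknown" result: when the monitor cannot reach libvirt, the real state of the migration is not known, so a caller would probably need to re-check where the guest is running before marking the instance as errored on the source.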

