** Description changed:

  [Impact]
  
  Nova uses eventlet.tpool.Proxy to defer actions/commands that would
  otherwise risk starving eventlets. This patch fixes the issue where
  virNodeDevice objects returned from libvirt were not wrapped by the
  proxy, so calls made on them executed outside the proxied native
  thread, which leads to starvation. There are two patches required to
  fix this issue: the first is the one in this bug and the second is to
  fix a regression subsequently identified by the first patch
  (bug 2098892).
  
  [Test Plan]
  
-  * Deploy Openstack Yoga with SRIOV enabled. Create and delete lots of vms 
over a period of several hours if not days
-  * ensure that the amount of time nova.compute.resource_tracker takes to run 
does not continuously increase (can use 
https://github.com/dosaboy/openstack-analysis to determine this)
+  * Deploy OpenStack Yoga with SR-IOV enabled. Create and delete lots of VMs 
over a period of several hours, if not days.
+  * Ensure that the amount of time nova.compute.resource_tracker takes to run 
does not continuously increase (https://github.com/dosaboy/openstack-analysis 
can be used to determine this).
  
  [Regression Potential]
  
-  * no regression potential is expected as a result of this set of
- patches.
- 
+ This patch is in fact addressing a large performance regression. Some time 
back, libvirt calls were autowrapped so that they were proxied through a native 
thread, but the objects returned by those calls did not receive the same 
treatment, i.e. calls made using those objects were not also being proxied. The 
effect is that when a large number of such calls are made, they starve the 
current thread of time to run eventlet threads. This patch pushes those calls 
into the proxied thread, where they should have been running alongside the other 
libvirt calls, thus freeing up the main thread. Since this is fixing a gap in 
existing functionality I do not anticipate it causing any regressions. The calls 
should be using the same thread as the other libvirt calls, so we should not be 
at risk of running out of threads.
  --------------------------------------------------------------------------
  
  tl;dr This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.
  
  Discovered during investigation of a downstream bug [1] where a live
  migration was dirtying memory faster than it could be transferred and
  nova-compute became frozen, unable to perform any other operations,
  not even logging, for hours.
  
  The freezing was tracked down to an un-proxied libvirt call,
  listAllDevices(), which could block all other greenthreads. The
  listAllDevices() call occurs during the update_available_resource()
  periodic task in the libvirt driver in _get_pci_passthrough_devices().
  In a GMR collected during a repro of the issue, a traceback showing this
  was present in the report [2]:
  
  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F     `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F     `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`
  
  The listAllDevices() function returned a list of unwrapped virNodeDevice
  objects and so calling listCaps() on such an unwrapped device could
  cause a freeze.
  
  Based on the above, the bug reporter was able to test a patch [3] to
  wrap listAllDevices() list items in tpool.Proxy and the result showed
  nova-compute no longer freezing [4] in the aforementioned scenario.
  
  During investigation it was also noticed that the listDevices() call
  list items were not tpool.Proxy wrapped, so this is fixed as well in the
  patch.
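
  To illustrate the gap described above, here is a minimal stdlib-only
  sketch. It is not the nova patch and does not use eventlet or libvirt:
  ThreadProxy is a hypothetical stand-in for tpool.Proxy, and
  FakeConnection, FakeNodeDevice and list_all_devices_wrapped() are
  illustrative names invented for this example. It only shows that
  proxying the call that returns a list does not proxy the items in that
  list, and that wrapping each item moves listCaps() off the calling
  thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Single worker standing in for the native thread(s) behind the proxy.
_POOL = ThreadPoolExecutor(max_workers=1, thread_name_prefix="native")


class ThreadProxy:
    """Hypothetical stand-in for a thread-pool proxy: method calls on the
    wrapped object are executed in a worker thread."""

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            # Run the real method in the worker and wait for its result.
            return _POOL.submit(attr, *args, **kwargs).result()

        return call


class FakeNodeDevice:
    """Illustrative device object that records which thread ran listCaps()."""

    def __init__(self, caps):
        self._caps = caps
        self.called_in = None

    def listCaps(self):
        self.called_in = threading.current_thread().name
        return list(self._caps)


class FakeConnection:
    """Illustrative connection whose listAllDevices() returns plain objects."""

    def __init__(self):
        self._devices = [FakeNodeDevice(["pci"]), FakeNodeDevice(["net"])]

    def listAllDevices(self, flags=0):
        return list(self._devices)


def list_all_devices_wrapped(conn, flags=0):
    """Wrap every returned list item so later calls on it are proxied too."""
    return [ThreadProxy(dev) for dev in conn.listAllDevices(flags)]
```

  With conn = ThreadProxy(FakeConnection()), items from
  conn.listAllDevices() run listCaps() in the caller's thread, whereas
  items from list_all_devices_wrapped(conn) run it in the worker thread.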
  
  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2091033

Title:
  Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
  freeze for hours

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2091033/+subscriptions

