** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Also affects: nova (Ubuntu)
   Importance: Undecided
       Status: New

** Description changed:

+ [Impact]
+ 
+ Nova uses eventlet.tpool.Proxy to defer actions/commands that would
+ otherwise risk starving eventlets. This patch fixes the issue where
+ virNodeDevice objects returned from libvirt were not wrapped by the
+ proxy and were therefore executed outside the thread pool, which leads
+ to starvation. Two patches are required to fix this issue: the first is
+ the one in this bug and the second fixes a regression subsequently
+ identified in the first patch (bug 2098892).
+ 
+ [Test Plan]
+ 
+  * Deploy OpenStack Yoga with SR-IOV enabled. Create and delete lots of
+    VMs over a period of several hours, if not days.
+  * Ensure that the amount of time nova.compute.resource_tracker takes to
+    run does not continuously increase (can use
+    https://github.com/dosaboy/openstack-analysis to determine this).
+ 
+ [Regression Potential]
+ 
+  * no regression potential is expected as a result of this set of
+ patches.
+ 
+ --------------------------------------------------------------------------
+ 
  tl;dr This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.
  
  Discovered during investigation of a downstream bug [1] where a live
  migration was dirtying memory faster than it could be transferred and
  nova-compute froze, unable to perform any other operations, not even
  logging, for hours.
  
  The freezing was tracked down to the un-proxied libvirt call
  listAllDevices(), which could block all other greenthreads. The
  listAllDevices() call occurs during the update_available_resource()
  periodic task in the libvirt driver in _get_pci_passthrough_devices().
  In a GMR collected during a repro of the issue, a traceback showing this
  was present in the report [2]:
  
  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F     `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F     `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`
  
  The listAllDevices() function returned a list of unwrapped virNodeDevice
  objects and so calling listCaps() on such an unwrapped device could
  cause a freeze.
  
  Based on the above, the bug reporter was able to test a patch [3] to
  wrap listAllDevices() list items in tpool.Proxy and the result showed
  nova-compute no longer freezing [4] in the aforementioned scenario.
  
  During investigation it was also noticed that the list items returned
  by listDevices() were not wrapped in tpool.Proxy either, so the patch
  fixes this as well.
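
  The idea of wrapping each list item, and not only the object that
  returned the list, can be sketched as follows. This is a minimal
  illustration and not the code from the nova patch [3]: ThreadProxy is
  a simplified stand-in for eventlet.tpool.Proxy built on the standard
  library, FakeNodeDevice is a hypothetical stand-in for libvirt's
  virNodeDevice, and wrap_list_items() is a hypothetical helper name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadProxy:
    """Simplified stand-in for eventlet.tpool.Proxy (an assumption).

    Method calls on the wrapped object are executed in a worker thread.
    Unlike the real tpool.Proxy, waiting for the result here simply
    blocks the caller instead of yielding to other greenthreads.
    """

    _pool = ThreadPoolExecutor(max_workers=4)

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself.
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call_in_worker(*args, **kwargs):
            return self._pool.submit(attr, *args, **kwargs).result()

        return call_in_worker


class FakeNodeDevice:
    """Hypothetical stand-in for libvirt's virNodeDevice."""

    def __init__(self, caps):
        self._caps = caps
        self.called_from = None

    def listCaps(self):
        # Record which thread actually executed the call.
        self.called_from = threading.current_thread().name
        return list(self._caps)


def wrap_list_items(items, proxy_cls=ThreadProxy):
    """Wrap every item of a returned list in the given proxy class."""
    return [proxy_cls(item) for item in items]


# Pretend this list came back from an already-proxied listAllDevices().
raw_devices = [FakeNodeDevice(["pci"]), FakeNodeDevice(["net"])]
devices = wrap_list_items(raw_devices)

# Same filter shape as in the traceback: listCaps() now runs in a worker.
pci_devices = [dev for dev in devices if "pci" in dev.listCaps()]
```

  Without the wrapping step, listCaps() would run in the calling thread,
  which is the situation the traceback above shows.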
  
  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21

** Also affects: nova (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Also affects: nova (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/caracal
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/antelope
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/bobcat
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2091033

Title:
  Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
  freeze for hours

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2091033/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
