libvirt: Fix regression of listDevices() return type :
https://review.opendev.org/q/Ib5befdd3c13367daa208ff969f66cba693ae2c76

libvirt: Wrap un-proxied listDevices() and listAllDevices() :
https://review.opendev.org/q/I60d6f04d374e9ede5895a43b7a75e955b0fea3c5

stable/2024.2:
  c4f4ae784f2078b652d07053d5a69a81dba1a8f5 tag=none
  22981123dc199ff9889dfe357195c5d6d1c203f8 tag=none
stable/2024.1:
  b8bfa1efbb71be1909897b39ecc12e687f177e89 tag=29.2.1
  c20ed18dd23b759bc37be6344d121fd58a1cc728 tag=29.2.1
stable/2023.2:
  c4467364647387a3b2bae06d61c1e8c7b363ea5f tag=28.3.1
  021ea3f9d640bc2438ce28de70f1b556b49fb8c2 tag=28.3.1
unmaintained/2023.1:
  not backported
  b834c628ff5ce1eaec3cce719a5ccbf9066bc960 tag=none

None of these point releases or commits are in Ubuntu packages yet, so
we need to SRU all of them.

** Also affects: cloud-archive/dalmatian
   Importance: Undecided
       Status: New

** Also affects: nova (Ubuntu Oracular)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2091033

Title:
  Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
  freeze for hours

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  New
Status in Ubuntu Cloud Archive dalmatian series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
  Fix Committed
Status in OpenStack Compute (nova) 2024.2 series:
  Fix Committed
Status in OpenStack Compute (nova) antelope series:
  Fix Committed
Status in OpenStack Compute (nova) bobcat series:
  Fix Released
Status in nova package in Ubuntu:
  New
Status in nova source package in Jammy:
  New
Status in nova source package in Noble:
  New
Status in nova source package in Oracular:
  New

Bug description:
  [Impact]

  Nova uses eventlet.tpool.Proxy to defer actions/commands that would
  otherwise risk starving eventlets.
  This patch fixes the issue where virNodeDevice objects returned from
  libvirt were not wrapped by the proxy, so calls on them executed
  outside the native thread pool, which leads to starvation. Two patches
  are required to fix this issue: the first is the one in this bug and
  the second fixes a regression subsequently identified in the first
  patch (bug 2098892).

  [Test Plan]

   * Deploy OpenStack Yoga with SRIOV enabled. Create and delete lots of
     VMs over a period of several hours, if not days.
   * Ensure that the amount of time nova.compute.resource_tracker takes
     to run does not continuously increase (can use
     https://github.com/dosaboy/openstack-analysis to determine this).

  [Regression Potential]

   * No regression potential is expected as a result of this set of
     patches.

  --------------------------------------------------------------------------

  tl;dr This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.

  Discovered during investigation of a downstream bug [1] where a live
  migration was dirtying memory faster than the transfer and
  nova-compute became frozen, unable to perform any other operations,
  not even logging, for hours.

  The freezing was tracked down to the un-proxied libvirt call
  listAllDevices(), which could block all other greenthreads.

  The listAllDevices() call occurs during the
  update_available_resource() periodic task in the libvirt driver in
  _get_pci_passthrough_devices().

  In a GMR collected during a repro of the issue, a traceback showing
  this was present in the report [2]:

  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F     `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F     `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`

  The listAllDevices() function returned a list of unwrapped
  virNodeDevice objects, so calling listCaps() on such an unwrapped
  device could cause a freeze.

  Based on the above, the bug reporter was able to test a patch [3] that
  wraps the listAllDevices() list items in tpool.Proxy, and the result
  showed nova-compute no longer freezing [4] in the aforementioned
  scenario.

  During investigation it was also noticed that the list items returned
  by listDevices() were not wrapped in tpool.Proxy either, so this is
  fixed in the patch as well.

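  The wrapping problem described above can be illustrated with a small
  standalone sketch. This is not the nova patch: ThreadProxy is a
  hypothetical, stdlib-only stand-in for eventlet.tpool.Proxy, and
  FakeConnection, FakeNodeDevice and list_all_devices are made-up names
  used only for illustration. The sketch assumes, as the report says,
  that proxying the connection does not automatically proxy the items of
  a list it returns, so each item has to be wrapped explicitly.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Worker pool standing in for eventlet's native thread pool (assumption).
_POOL = ThreadPoolExecutor(max_workers=2)


class ThreadProxy:
    """Hypothetical stand-in for eventlet.tpool.Proxy.

    Method calls on the wrapped object run in a worker thread; plain
    attributes are passed through unchanged.
    """

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            # Run the real method in the pool and wait for its result.
            return _POOL.submit(attr, *args, **kwargs).result()

        return call


class FakeNodeDevice:
    """Made-up device object that records which thread ran listCaps()."""

    def __init__(self, name, caps):
        self._name = name
        self._caps = caps
        self.called_from = None

    def listCaps(self):
        self.called_from = threading.current_thread().name
        return list(self._caps)


class FakeConnection:
    """Made-up connection whose listAllDevices() returns plain objects."""

    def listAllDevices(self, flags=0):
        return [
            FakeNodeDevice("pci_0000_00_1f_2", ["pci"]),
            FakeNodeDevice("net_eth0", ["net"]),
        ]


def list_all_devices(conn, flags=0):
    """Call listAllDevices() through a proxy and wrap every list item."""
    devices = ThreadProxy(conn).listAllDevices(flags)
    # The returned list holds unwrapped objects; without this step any
    # later dev.listCaps() would run in the caller's own thread.
    return [ThreadProxy(dev) for dev in devices]
```

  With this sketch, calling listCaps() on an item taken straight from
  FakeConnection.listAllDevices() records the caller's thread, while the
  same call on an item from list_all_devices() records a pool worker
  thread, which is the behaviour the fix aims for.
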
  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2091033/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp