** Description changed:

  [Impact]

  Nova uses eventlet.tpool.Proxy to defer actions/commands that would
  otherwise risk starving eventlets. This patch fixes the issue where
  virNodeDevice objects returned from libvirt were not wrapped by the
  proxy and were therefore executed outside the native thread, which
  leads to starvation. There are two patches required to fix this issue:
  the first is the one in this bug and the second fixes a regression
  subsequently identified in the first patch (bug 2098892).

  [Test Plan]

-  * Deploy OpenStack Yoga with SRIOV enabled. Create and delete lots of
     VMs over a period of several hours, if not days.
-  * Ensure that the amount of time nova.compute.resource_tracker takes
     to run does not continuously increase (can use
     https://github.com/dosaboy/openstack-analysis to determine this).
+   * Deploy OpenStack Yoga with SRIOV enabled. Create and delete lots of
     VMs over a period of several hours, if not days.
+   * Ensure that the amount of time nova.compute.resource_tracker takes
     to run does not continuously increase (can use
     https://github.com/dosaboy/openstack-analysis to determine this).

  [Regression Potential]

-   * no regression potential is expected as a result of this set of
- patches.
- 
+ This patch is in fact addressing a large performance regression. Some
  time back, libvirt calls were autowrapped so that they were proxied
  through a native thread, but the objects returned by those calls did
  not receive the same treatment, i.e. calls made using those objects
  were not also being proxied. The effect is that a large number of such
  calls starves the current thread of time to run eventlet threads. This
  patch moves those libvirt calls to where they should have run alongside
  the other libvirt calls, i.e. in the proxied thread, thus freeing up
  the main thread. Since this is fixing a gap in existing functionality I
  do not anticipate it causing any regressions.
  The calls should be using the same thread as the other libvirt calls,
  so we should not be at risk of running out of threads.

  --------------------------------------------------------------------------

  tl;dr

  This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.

  Discovered during investigation of a downstream bug [1] where a live
  migration was dirtying memory faster than it could be transferred and
  nova-compute became frozen, unable to perform any other operations, not
  even logging, for hours.

  The freezing was tracked down to the un-proxied libvirt call
  listAllDevices(), which could block all other greenthreads.

  The listAllDevices() call occurs during the update_available_resource()
  periodic task in the libvirt driver, in _get_pci_passthrough_devices().

  A GMR (Guru Meditation Report) collected during a repro of the issue
  contained a traceback showing this [2]:

  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F `ret = libvirtmod.virNodeDeviceListCaps(self._o)`

  The listAllDevices() function returned a list of unwrapped
  virNodeDevice objects, so calling listCaps() on such an unwrapped
  device could cause a freeze.

  Based on the above, the bug reporter was able to test a patch [3] that
  wraps the listAllDevices() list items in tpool.Proxy, and the result
  showed nova-compute no longer freezing [4] in the aforementioned
  scenario.

  During investigation it was also noticed that the list items returned
  by listDevices() were not tpool.Proxy wrapped, so this is fixed in the
  patch as well.

  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21
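
  The gap described in the tl;dr (a proxied call returning a list whose
  items are not themselves proxied) can be illustrated with a minimal,
  stdlib-only sketch. This is not the nova patch: ThreadProxy,
  FakeNodeDevice and wrap_list_items are hypothetical names invented
  here, and ThreadProxy only imitates the idea behind
  eventlet.tpool.Proxy (run method calls in a native worker thread).
  Unlike tpool, it blocks the caller while waiting, so it shows only
  which thread executes the call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadProxy:
    """Hypothetical stand-in for eventlet.tpool.Proxy.

    Method calls on the wrapped object run in a native worker thread.
    Unlike tpool, the caller blocks on .result() instead of yielding.
    """

    _pool = ThreadPoolExecutor(max_workers=1)

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call_in_worker(*args, **kwargs):
            return self._pool.submit(attr, *args, **kwargs).result()

        return call_in_worker


class FakeNodeDevice:
    """Hypothetical stand-in for a libvirt virNodeDevice."""

    def __init__(self, caps):
        self._caps = caps

    def listCaps(self):
        # Return the capabilities and the name of the executing thread.
        return list(self._caps), threading.current_thread().name


def wrap_list_items(items, proxy_factory=ThreadProxy):
    """Wrap each item of a returned list, not just the call producing it."""
    return [proxy_factory(item) for item in items]


if __name__ == "__main__":
    devices = [FakeNodeDevice(["pci"]), FakeNodeDevice(["usb"])]
    # Unwrapped item: listCaps() runs on the calling thread.
    print(devices[0].listCaps()[1])
    # Wrapped item: listCaps() runs on the worker thread.
    print(wrap_list_items(devices)[0].listCaps()[1])
```

  In nova itself the proxy factory would be built on tpool.Proxy rather
  than on a thread pool executor; the point of the sketch is only that
  wrapping must be applied per list item for calls such as listCaps()
  to leave the calling thread.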
-- 
https://bugs.launchpad.net/bugs/2091033

Title:
  Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
  freeze for hours
