** Description changed:

+ [Impact]
+ 
+  * Systems passing through plenty of PCI host devices can have the QEMU
+    process already gone from /proc/<pid> while the kernel still holds
+    the PID for quite a while to clean up resources.
+    That leads to libvirtd giving up on the process and keeping the
+    guest forever in the state "in shutdown", as it cannot know what is
+    going on.
+ 
+  * It turns out the kernel can need quite some time, especially if it
+    has to wait for the settling period of a PCI sub-bus reset per
+    device.
+ 
+  * We have tested and upstreamed changes that adapt the time libvirt
+    waits in those cases. Backporting those to Cosmic/Bionic would make
+    Ubuntu ready for that many PCI device passthroughs. Further back the
+    backport noise and regression risk are no longer worth the tradeoff,
+    and the requests I received were only for 18.04 anyway.
+ 
+ [Test Case]
+ 
+  * Whether a certain PCI device needs the bus reset depends on its
+    kernel driver. Devices that I could prove to have this issue are
+    NVIDIA 3D controllers of the type "3D controller: NVIDIA Corporation
+    Device 1db8". It is a general concept, so other PCI devices might
+    also reproduce the case, but there is no guarantee as drivers can
+    implement smarter reset code. Keep that in mind when following the
+    repro below.
+ 
+  * The former default delay was ~15 seconds, with the guest itself
+    needing ~1-2 seconds. So we need at least 14, better 16+, affected
+    PCI devices. Pass those devices to the guest with libvirt snippets
+    like the following (see the sketch below for generating many such
+    entries):
+     <hostdev mode='subsystem' type='pci' managed='yes'>
+       <source>
+         <address domain='0x0000' bus='0x34' slot='0x0' function='0x0'/>
+       </source>
+     </hostdev>
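+
+    A minimal sketch (Python, not part of the actual test tooling) to
+    generate many such hostdev entries; the bus range below is a
+    hypothetical placeholder, adapt it to the addresses lspci reports
+    on your system:
+
+      #!/usr/bin/env python3
+      # Print <hostdev> snippets for a range of PCI bus numbers.
+      # The bus numbers are hypothetical examples; replace them with
+      # the devices you actually want to pass through.
+      TEMPLATE = """\
+      <hostdev mode='subsystem' type='pci' managed='yes'>
+        <source>
+          <address domain='0x0000' bus='{bus:#x}' slot='0x0' function='0x0'/>
+        </source>
+      </hostdev>"""
+
+      for bus in range(0x34, 0x44):   # e.g. 16 devices on buses 0x34..0x43
+          print(TEMPLATE.format(bus=bus))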
+ 
+  * Let the guest start up as you normally would.
+ 
+  * Now run "virsh shutdown <guestname>" (see the timing sketch below if
+    you want to measure the overall shutdown duration).
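+
+    A minimal sketch, assuming the libvirt Python bindings
+    (python3-libvirt) are available and "guestname" is replaced with
+    your guest, to time how long the shutdown takes until libvirt
+    considers the domain gone:
+
+      #!/usr/bin/env python3
+      # Time a guest shutdown from the API side; roughly equivalent to
+      # "virsh shutdown <guestname>" plus waiting for the domain to stop.
+      import time
+      import libvirt
+
+      conn = libvirt.open("qemu:///system")
+      dom = conn.lookupByName("guestname")    # placeholder guest name
+
+      start = time.time()
+      dom.shutdown()
+      while True:
+          try:
+              if not dom.isActive():
+                  break
+          except libvirt.libvirtError:        # transient domains vanish
+              break
+          time.sleep(1)
+      print("guest down after ~%.0f seconds" % (time.time() - start))
+      conn.close()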
+ 
+  * Shortly after that you will see that the qemu process is gone from
+    /proc/<pid>, but libvirt will still spin on its shutdown. That is
+    because signal 0 can still reach the process, as it only confirms
+    that the PID exists, not that the process behind it is still there
+    (see the sketch below).
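+
+    A small sketch (not libvirt code) of the two checks involved; pass
+    it the PID of the qemu process. During the kernel's cleanup window
+    the /proc check already fails while the signal-0 check still
+    succeeds:
+
+      #!/usr/bin/env python3
+      # Compare "is /proc/<pid> present" with "does signal 0 reach the
+      # PID". During hostdev cleanup the first can already be False
+      # while the second still succeeds.
+      import os
+      import sys
+
+      pid = int(sys.argv[1])      # PID of the qemu process to check
+
+      in_proc = os.path.isdir("/proc/%d" % pid)
+      try:
+          os.kill(pid, 0)         # signal 0: existence check only
+          reachable = True
+      except ProcessLookupError:
+          reachable = False
+      except PermissionError:
+          reachable = True        # exists, but owned by another user
+
+      print("in /proc: %s, signal 0: %s" % (in_proc, reachable))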
+ 
+  * After the timeout expires libvirt gives up and leaves the guest in
+    a half-dead state, logging for example:
+       libvirtd: virProcessKillPainfully:401 : Failed to terminate process
+       5114 with SIGKILL: Device or resource busy
+ 
+  * With the fix the maximum delay is scaled up as needed and the
+    shutdown works; with e.g. 16 devices it takes about ~18 seconds.
+    To verify, log entries at debug level contain the extra delay
+    scaling (a rough model follows below). For example, with 40 devices
+    attached you get the proper extra 80 seconds:
+       debug : virProcessKillPainfullyDelay:361 : vpid=34896 force=1 extradelay=80
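+
+    A rough model of that scaling, assuming a linear 2 seconds of extra
+    delay per assigned hostdev as the "40 devices -> extradelay=80"
+    debug line above suggests; this is a simplification for
+    illustration, not the actual libvirt code:
+
+      #!/usr/bin/env python3
+      # Approximate upper bound libvirt will now wait before declaring
+      # the process unkillable (assumption: ~15s base plus 2s per
+      # assigned hostdev, derived from the debug output above).
+      BASE_TIMEOUT = 15       # former fixed delay in seconds (approx.)
+      PER_HOSTDEV = 2         # assumed extra settling time per device
+
+      def max_wait(num_hostdevs):
+          return BASE_TIMEOUT + PER_HOSTDEV * num_hostdevs
+
+      for n in (0, 16, 40):
+          print("%2d hostdevs -> waits up to ~%ds" % (n, max_wait(n)))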
+ 
+ [Regression Potential]
+ 
+  * I can think of two potential regressions, but neither is severe
+    enough to outweigh the improvement this fix brings to the affected
+    cases.
+ 
+  * R1: For forced shutdowns (at the end of the life of a guest) the
+    worst-case timeout after delivering SIGKILL, for a process that did
+    not go down on SIGTERM, got bumped. That means truly hanging cases
+    will need a few seconds more until libvirt gives up on them.
+    Those cases would never recover anyway, but the longer time until
+    giving up can be visible to e.g. higher-level tools and admins.
+ 
+  * R2: The scaling per hostdev was derived from explanations by kernel
+    developers that there is some base work plus a one-second settling
+    period per device. As a consequence the increased maximum timeout
+    can get really high if one day hundreds of hostdevs are passed
+    through. But libvirt follows the lower layer here; if or once the
+    kernel no longer scales linearly, we can make this extra delay less
+    linear as well. So far nothing like that is on the horizon.
+ 
+  * Please keep in mind that both risks spelled out above are only a
+    problem if the guest process does not go away on SIGTERM/SIGKILL.
+    That means it is already in a really bad spot and code path; we just
+    wait longer to give it a chance to clean up. In 99+% of the cases
+    guests go away as they should, and those do not wait for the full,
+    now extended, delay.
+ 
+ [Other Info]
+  
+  * n/a
+ 
+ ---
+ 
  I had an upstream discussion at [1]
  
  The TL;DR is that when using more than a few hostdevs as passthrough,
  there can be an odd period of time while shutting down the guest.
  
  During that time the qemu process is already gone from /proc/<pid> but
  is still reachable via signal 0.
  
  That makes libvirt believe the process will not die (think of zombie
  processes caused by e.g. failing NFS).
  
  But what really happens is that, depending on the hostdev, the kernel
  might need up to 1-2 seconds of extra time per device to release all
  the PCI resources.
  
  We came up with patches that scale the allowed time with the number of
  hostdevs, as well as a general bump for the bad case (SIGKILL),
  following the recommendation of kernel engineers who said that we are
  on a bad path anyway, so we might as well provide a bit more time to
  let it clean up.
  
  We should add these changes at least back into Bionic IMHO.
  
  [1]: https://www.redhat.com/archives/libvir-list/2018-August/msg01295.html
