Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
I think we can close this as #notabug. I did a fresh install and it's working on 5.7. It worked 100% fine on qemu, but when I imported it into virt-manager it crashed until I removed all of the clock bits, so I think maybe there's a bug in the clock/pit parts of libvirt that qemu doesn't seem to use.

Regards.
-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
Ah no, turns out I had disabled the GPU passthrough. When I re-enabled it I just got a kernel panic from 5.8.0-rc5.

On 19/07/2020 00:34, Simon John wrote:
> I tried mainline 5.8.0-rc5 and I couldn't even get into gnome shell after the gdm3 password prompt, just a black screen!
>
> I tried running virsh start which worked but my monitor never got a signal. It also didn't crash the host or qemu though, although virsh destroy finished instantly, so i assume the guest didn't even start.

Regards.
-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
I tried mainline 5.8.0-rc5 and I couldn't even get into gnome shell after the gdm3 password prompt, just a black screen!

I tried running virsh start, which worked, but my monitor never got a signal. It didn't crash the host or qemu either, but virsh destroy finished instantly, so I assume the guest didn't even start.

Regards.
-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
Where would I look for some follow-up on this - upstream kernel.org? -- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
Done some more digging.

I can boot my macOS VM using QXL instead of a passthrough GPU, even if I pass through the USB device (not the PCI controller). So it's definitely only passing through PCI devices like graphics cards that breaks things.

So this works in qemu terms:

    -usb -device usb-host,hostbus=1,hostaddr=3 \

This breaks things:

    -device vfio-pci,host=02:00.0,multifunction=on \
    -device vfio-pci,host=02:00.1

Oddly enough an Ubuntu 20.04 VM works fine with the GPU and USB passed through.

Added the qemu shell scripts to the gist: https://gist.github.com/sej7278/766043a69c76308f84cfa14b3f3a924f

Regards.
-- Simon John
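For anyone reproducing this, it may help to confirm which host driver each of the two GPU functions from the message above (02:00.0 and 02:00.1) is bound to before starting the guest. The following is only a minimal sketch: it assumes the usual sysfs layout, where /sys/bus/pci/devices/<domain:bus:dev.fn>/driver is a symlink to the bound driver, and it assumes PCI domain 0000 because the qemu host= argument omits the domain. The function name bound_driver is made up for illustration.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI function, or None.

    pci_addr is written as in the qemu option, e.g. "02:00.0".
    """
    # Assumption: qemu's host=02:00.0 form omits the PCI domain, so use 0000.
    if pci_addr.count(":") == 1:
        pci_addr = "0000:" + pci_addr
    link = os.path.join(sysfs_root, pci_addr, "driver")
    # No "driver" symlink means no driver is currently bound (or no such device).
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # The two addresses are the ones from the qemu command line in this report.
    for addr in ("02:00.0", "02:00.1"):
        print(addr, bound_driver(addr))
```

On a host set up for passthrough one would expect both functions to report vfio-pci; anything else suggests the device is still claimed by a host driver.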
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
On Sun, 28 Jun 2020 10:41:41 +0200 Salvatore Bonaccorso wrote:
> Hi Simon,

Hi Salvatore, thanks for looking into this.

> On Sun, Jun 28, 2020 at 01:01:01AM +0100, Simon John wrote:
> > This looks a likely culprit:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=207489
>
> The issue you are seeing now seems different, though afaics. The fixing commit from the above reference, 8be8f932e3db ("kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts") was applied in v5.7-rc3, which was as well backported to v5.6.13.

Yes, I was going to patch the kernel but noticed it's already got that patch. I also tried passing through only one PCI device and not the USB device, but it didn't help, so it probably isn't that issue.

> Now your kernel is tainted, can you check if you see the issue as well when not loading the modules which taint the kernel?

I removed virtualbox, which was tainting the kernel, and confirmed that cat /proc/sys/kernel/tainted returned 0 after a reboot. I still managed to trigger the issue though, which then tainted the kernel itself. Output from kernel-chktaint:

    Kernel is "tainted" for the following reasons:
     * kernel died recently, i.e. there was an OOPS or BUG (#7)
     * kernel issued warning (#9)
     * soft lockup occurred (#14)
    For a more detailed explanation of the various taint flags see
     Documentation/admin-guide/tainted-kernels.rst in the the Linux kernel sources
     or https://kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
    Raw taint value as int/string: 17024/'G D WL '

The kvm-pit process was using 100% of a CPU core. I've never even noticed that process before; looks like the culprit now?

    watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kvm-pit/2888:2913]
    CPU: 7 PID: 2913 Comm: kvm-pit/2888 Tainted: G D W 5.7.0-1-amd64 #1 Debian 5.7.6-1

I've added a strace and dmesg to the gist here: https://gist.github.com/sej7278/766043a69c76308f84cfa14b3f3a924f

Any other diagnostics I can run?

> Next, if it still does show up, does it show up as well in current mainline?

Not sure what you mean there. Do you mean the upstream kernel.org kernel? I'm not sure how I'd run that.

Regards.
-- Simon John
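As a cross-check of the raw taint value in the message above, the integer in /proc/sys/kernel/tainted is a bitmask, and 17024 decomposes into exactly the three flags kernel-chktaint listed (128 + 512 + 16384, i.e. bits 7, 9 and 14). A minimal sketch of that decoding follows; the table covers only the three flags quoted in this report, not the full list from tainted-kernels.rst, and the function names are made up for illustration.

```python
# Only the flags that appear in the kernel-chktaint output quoted in this
# report; the kernel defines more (see tainted-kernels.rst).
TAINT_REASONS = {
    7: "kernel died recently, i.e. there was an OOPS or BUG",
    9: "kernel issued warning",
    14: "soft lockup occurred",
}


def set_taint_bits(value):
    """Return the sorted list of bit numbers set in a raw taint value."""
    return [bit for bit in range(value.bit_length()) if value & (1 << bit)]


def describe_taint(value):
    """Map each set bit to a reason, or a placeholder if not in the table."""
    return [
        (bit, TAINT_REASONS.get(bit, "not in this sketch's table"))
        for bit in set_taint_bits(value)
    ]


if __name__ == "__main__":
    # 17024 is the raw value reported by kernel-chktaint in this thread.
    for bit, reason in describe_taint(17024):
        print(f"#{bit}: {reason}")
```

A value of 0, as seen after removing virtualbox and rebooting, yields an empty list.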
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
Hi Simon,

On Sun, Jun 28, 2020 at 01:01:01AM +0100, Simon John wrote:
> This looks a likely culprit:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=207489

The issue you are seeing now seems different, though afaics. The fixing commit from the above reference, 8be8f932e3db ("kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts") was applied in v5.7-rc3, which was as well backported to v5.6.13.

Now your kernel is tainted, can you check if you see the issue as well when not loading the modules which taint the kernel?

Next, if it still does show up, does it show up as well in current mainline?

Regards,
Salvatore
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
This looks a likely culprit:

https://bugzilla.kernel.org/show_bug.cgi?id=207489

-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
Bug still exists in today's 5.7 kernel, however it took long enough to trigger that it managed to write some logs: https://gist.github.com/sej7278/766043a69c76308f84cfa14b3f3a924f

"BUG: soft lockup - CPU#0 stuck for 22s!" seems to go back to 2013 or earlier, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1038929

kvm-pit seemed to be eating a lot of CPU, qemu went defunct, and the macOS guest got a little further into its boot sequence.

Can we upgrade this bug so it gets some visibility? That's two major kernel versions it's not been fixed in now, let alone commented on.

-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
I tried to get some sort of useful logging today; the best I could do was running strace. I noticed that virsh exited 0 but the host didn't crash instantly, it took 2-3 seconds more.

Photo of screen with strace output here: https://i.imgur.com/kOSEXXA.jpg

Managed to install 5.5.0-2 from snapshot.debian.org, which works fine.

Regards.
-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
The new linux-image-5.6.0-2-amd64 (5.6.14-2) from today is no better; the only change is that it's signed, I guess?

Also I noticed the 5.3/5.4/5.5 kernels have all been removed from my system and the servers, so I have no way to boot back into a kernel that doesn't break VFIO.

Has this issue been looked at, or is 5.7 due soon?

Regards.
-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
On Wed, 27 May 2020 21:31:46 +0200 Salvatore Bonaccorso wrote:
> Source: linux
> Version: 5.6.7-1
> Control: submitter -1 deb...@the-jedi.co.uk
>
> Forwarding this report as bug in the BTS.
>
> On Mon, May 25, 2020 at 09:03:43PM +0100, Simon John wrote:
> > Sorry to email directly but I've tried reportbug but it doesn't seem to work with the kernel packages.
>
> What exactly did not work?

I tried to report a bug in the 5.5 kernel regarding HDMI audio not working after a while until you kill pulseaudio, but reportbug did not create a valid report, I guess, as the bug never got created. I tried using the email template too. Same problem when trying to report this bug.

Anyway, at least this bug has been created now; hopefully it will get looked into.

> > The 5.6.0-1 kernel in Sid when used in conjunction with VFIO (PCI passthrough) guests in Qemu-KVM causes KVM to crash with no useful logs - the guest partially starts. Eventually the host slows down and needs to be powered off. Doesn't affect non-VFIO guests using spice/headless, only when passing through a PCIe graphics card or USB keyboard/mouse.
> >
> > 5.6.0-2 that just landed makes this infinitely worse - as soon as you start a VFIO guest the host hard crashes.
> >
> > 5.3/5.4/5.5 kernels do not have this problem with the same version of Qemu and reverting from qemu 5.0 to 4.2 doesn't help, so it must be the 5.6 kernel causing the issue.
> >
> > Not sure what logs or any useful information I can supply you, there's some interesting comments on this reddit thread suggesting a couple of bugs that have fixes upstream:
> >
> > https://www.reddit.com/r/VFIO/comments/glfgqs/qemu_5_vfio_no_longer_works_on_debian/
> >
> > I'm using macos catalina and win10 guests to reproduce the issue, and using Intel not AMD chips.
> >
> > Cheers.
> >
> > --
> > Simon John
>
> Regards,
> Salvatore

Best regards.
-- Simon John
Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests
Source: linux
Version: 5.6.7-1
Control: submitter -1 deb...@the-jedi.co.uk

Forwarding this report as bug in the BTS.

On Mon, May 25, 2020 at 09:03:43PM +0100, Simon John wrote:
> Sorry to email directly but I've tried reportbug but it doesn't seem to work with the kernel packages.

What exactly did not work?

> The 5.6.0-1 kernel in Sid when used in conjunction with VFIO (PCI passthrough) guests in Qemu-KVM causes KVM to crash with no useful logs - the guest partially starts. Eventually the host slows down and needs to be powered off. Doesn't affect non-VFIO guests using spice/headless, only when passing through a PCIe graphics card or USB keyboard/mouse.
>
> 5.6.0-2 that just landed makes this infinitely worse - as soon as you start a VFIO guest the host hard crashes.
>
> 5.3/5.4/5.5 kernels do not have this problem with the same version of Qemu and reverting from qemu 5.0 to 4.2 doesn't help, so it must be the 5.6 kernel causing the issue.
>
> Not sure what logs or any useful information I can supply you, there's some interesting comments on this reddit thread suggesting a couple of bugs that have fixes upstream:
>
> https://www.reddit.com/r/VFIO/comments/glfgqs/qemu_5_vfio_no_longer_works_on_debian/
>
> I'm using macos catalina and win10 guests to reproduce the issue, and using Intel not AMD chips.
>
> Cheers.
>
> --
> Simon John

Regards,
Salvatore