On Thu, 27 Feb 2020, at 11:35 AM, Bronek Kozicki wrote:
> On Mon, 24 Feb 2020, at 6:58 PM, Bronek Kozicki wrote:
> > On Mon, 24 Feb 2020, at 5:23 PM, Alex Williamson wrote:
> > > On Mon, 24 Feb 2020 10:40:39 +0000
> > > "Bronek Kozicki" <b...@spamcop.net> wrote:
> > >
> > > > Heads up to anyone running the latest vanilla kernels - after upgrading
> > > > from 5.4.21 to 5.4.22, one of my VMs lost access to a vfio
> > > > passed-through GPU. Access was restored when I downgraded to 5.4.21, so
> > > > the problem seems related to some patch in version 5.4.22.
> > > >
> > > > Also, when starting the VM, I noticed the hypervisor log flooded with
> > > > "BAR 3: can't reserve" messages like:
> > > >
> > > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
> > > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
> > > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: No more image in the PCI ROM
> > > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > >
> > > > journalctl -b-2 | grep "vfio-pci 0000:03:00.0: BAR 3: can't reserve" | wc -l
> > > > 2609
> > > >
> > > > Finally, when shutting down the VM, I observed a kernel panic on the
> > > > hypervisor:
> > > >
> > > > [  873.831301] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
> > > > [  874.874008] Shutting down cpus with NMI
> > > > [  874.888189] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > > [  875.074319] Rebooting in 30 seconds..
> > >
> > > Tried v5.4.22, not getting anything similar. Potentially there's a
> > > driver activated in this kernel that wasn't previously on your system
> > > and it's attached itself to part of your device. Look in /proc/iomem
> > > to see what it might be and disable it. Thanks,
> > >
> > > Alex
> >
> > Thank you Alex. One more thing which might be relevant: my system has
> > two identical GPUs (Quadro M5000), each in its own IOMMU group, and
> > two VMs, each using one of these GPUs. One of the VMs is Windows 10,
> > and I think it is configured for MSI-X; the other is Ubuntu Bionic
> > with the stable NVIDIA drivers.
> >
> > I will try to find more debugging information when I get home, but
> > perhaps the above will allow you to reproduce.
>
> Some more information.
>
> My system has 2 Xeon E5-2667 v2 CPUs, each with 8 cores and 16 threads
> (32 threads total across 2 sockets). The motherboard is a Supermicro
> X9DA7. Despite the 2 GPUs attached, the machine is headless, with ttyS0
> for control - both GPUs are dedicated to virtual machines.
>
> There is 128GB of ECC RAM, shared between a small number of VMs and ZFS
> filesystems. 80GB is reserved in hugepages for the VMs, and 20GB is
> reserved for the ZFS cache. The kernel options are my own and unlikely
> to be very good (happy to take feedback); I use the same kernel package
> both for the hypervisor and for one of the virtual machines, so some of
> the enabled kernel options only make sense in a VM.
> I need CONFIG_PREEMPT_VOLUNTARY and CONFIG_TREE_RCU for ZFS; I do not
> care about Xen, legacy hardware or some kernel debugging options
> (although perhaps I should).
>
> I did gather more data on my main computer (including dmesg logs, PCIe
> topology etc.), but because of a kernel panic (the same one as seen
> earlier, hit while trying to reproduce the bug) its root filesystem is
> currently not in a good state and I am unfortunately too busy at the
> moment to fix it and access this data. I will send more over the
> weekend, assuming that fixing my computer won't take very long.
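Before the followup: for anyone else chasing the same "can't reserve" message, Alex's /proc/iomem suggestion boils down to something like the sketch below. The address range is the one from my log above - substitute the range from your own "BAR 3: can't reserve" line, and note that /proc/iomem shows zeroed addresses unless you run it as root.

```shell
# Show which kernel driver (if any) currently claims the region that
# vfio-pci failed to reserve; the range here is taken from the log above.
grep -B1 -i 'c0000000-c1ffffff' /proc/iomem || echo "region not listed"

# List every resource owner named in /proc/iomem. A driver that appears
# here on the new kernel but not on the old one is the prime suspect.
sed -n 's/.* : //p' /proc/iomem | sort -u
```

If a driver other than vfio-pci shows up holding part of the GPU's range, blacklisting it (or unbinding it via sysfs) before starting the VM should make the reservation succeed again.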
A followup after a long pause (long story short, my computer would not
boot anymore, not even to BIOS, so I had to rebuild it with a new
motherboard). I found the following reported by nvidia-smi on one of the
cards:

WARNING: infoROM is corrupted at gpu 0000:06:00.0

According to NVIDIA, that's either bad drivers (I am using the LTS
version, 440) or, more likely, a bad card. I guess it is the latter in
my case.

B.

--
Bronek Kozicki b...@incorrekt.com

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users