On Mon, 24 Feb 2020, at 6:58 PM, Bronek Kozicki wrote:
> On Mon, 24 Feb 2020, at 5:23 PM, Alex Williamson wrote:
> > On Mon, 24 Feb 2020 10:40:39 +0000
> > "Bronek Kozicki" <b...@spamcop.net> wrote:
> > 
> > > Heads up to anyone running the latest vanilla kernels - after upgrading
> > > from 5.4.21 to 5.4.22, one of my VMs lost access to a vfio
> > > passed-through GPU. This was restored when I downgraded to 5.4.21, so
> > > the problem seems related to some patch in version 5.4.22.
> > > 
> > > Also, when starting the VM, I noticed the hypervisor log flooded with
> > > "BAR 3: can't reserve" messages like:
> > > 
> > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
> > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
> > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > Feb 24 09:49:38 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: No more image in the PCI ROM
> > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > Feb 24 09:51:43 gdansk.lan.incorrekt.net kernel: vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xc0000000-0xc1ffffff 64bit pref]
> > > 
> > > journalctl -b-2 | grep "vfio-pci 0000:03:00.0: BAR 3: can't reserve" | wc -l
> > > 2609
> > > 
> > > Finally, when shutting down the VM I observed a kernel panic on the
> > > hypervisor:
> > > 
> > > [  873.831301] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
> > > [  874.874008] Shutting down cpus with NMI
> > > [  874.888189] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > [  875.074319] Rebooting in 30 seconds..
> > 
> > Tried v5.4.22, not getting anything similar.  Potentially there's a
> > driver activated in this kernel that wasn't previously on your system
> > and it's attached itself to part of your device.  Look in /proc/iomem
> > to see what it might be and disable it.  Thanks,
> > 
> > Alex 
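
(For reference, checking what has claimed that region should just be a matter
of grepping /proc/iomem for the address range from the "can't reserve"
messages - something along these lines, run as root since non-root reads show
zeroed addresses:)

  # show the owner of the BAR 3 range 0xc0000000-0xc1ffffff and its neighbours
  grep -B2 -A2 "c0000000" /proc/iomem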
> 
> 
> Thank you, Alex. One more thing which might be relevant: my system has 
> two identical GPUs (Quadro M5000), each in its own IOMMU group, and two 
> VMs, each using one of these GPUs. One of the VMs is Windows 10, which I 
> think is configured for MSI-X; the other is Ubuntu Bionic with stable 
> nvidia drivers.
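
(Side note: an easy way to confirm whether the Windows guest really ends up
using MSI-X on that GPU is to look at the host side while the VM is running,
e.g. something like the following, run as root so the capability list is
readable:)

  # "MSI-X: Enable+" vs "Enable-" in the device's capability list
  lspci -vv -s 03:00.0 | grep -i msi
  # vfio-msi/vfio-msix vectors for the device, if any
  grep vfio /proc/interrupts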
> 
> I will try to find more debugging information when I get home, but 
> perhaps the above will allow you to reproduce.

Some more information

My system has two Xeon E5-2667 v2 CPUs, each with 8 cores and 16 threads (32 
threads in total across 2 sockets). The motherboard is a Supermicro X9DA7. 
Despite the two GPUs attached, the machine is headless, with ttyS0 for control 
- both GPUs are dedicated to virtual machines.

There is 128GB of ECC RAM, shared between a small number of VMs and ZFS 
filesystems. 80GB is reserved in hugepages for the VMs and 20GB is reserved for 
the ZFS cache. The kernel configuration is my own and unlikely to be very good 
(happy to take feedback); I use the same kernel package for both the hypervisor 
and one of the virtual machines, so some of the enabled options only make sense 
inside a VM. I need CONFIG_PREEMPT_VOLUNTARY and CONFIG_TREE_RCU for ZFS; I do 
not care about Xen, legacy hardware or some kernel debugging options (although 
perhaps I should).
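
(For completeness, the hugepage reservation is the usual boot-time mechanism on 
the kernel command line, roughly along these lines - illustrative values, not a 
verbatim copy of my cmdline:)

  # reserve 80x 1GiB hugepages at boot for the VMs (illustrative values)
  default_hugepagesz=1G hugepagesz=1G hugepages=80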

I did gather more data on my main computer (including dmesg logs, PCIe 
topology etc.), but because of a kernel panic (the same as seen earlier, while 
trying to reproduce the bug) its root filesystem is currently not in a good 
state, and I am unfortunately too busy at the moment to fix it and access this 
data. I will send more over the weekend, assuming that fixing my computer 
won't take very long.
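
(Roughly what I intend to collect once the machine is usable again, for 
reference:)

  # kernel log from the boot that showed the problem (two boots back)
  journalctl -k -b -2
  # PCIe topology and the driver bound to the GPU
  lspci -tv
  lspci -nnk -s 03:00.0
  # IOMMU group membership
  find /sys/kernel/iommu_groups/ -type l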


B.

-- 
  Bronek Kozicki
  b...@spamcop.net


_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users
