Running VMs with an eGPU and VFIO: from flaky (<= 5.12.x) to broken (5.13.x)

Andrej Podzimek via Virtualization Sun, 11 Jul 2021 05:21:20 -0700

Dear virtualization mailing list,

My question may well be misplaced, because it's Thunderbolt-, eGPU- as well as 
NVidia-related, but I'm out of ideas where else to ask. (Should I ask in a 
qemu- or libvirt-specific list instead? If so, please give me a hint.)


First, here's the configuration of the physical (host) machine:

        Command line: pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G mem_encrypt=on
           lspci -tv: https://pastebin.com/raw/usBudC1y
         Motherboard: ASRock X570 Creator with BIOS 3.50
                 CPU: AMD Ryzen 9 3950X
              System: ArchLinux with kernel 5.12.15 / 5.13.1
      eGPU enclosure: Razer Core X Chroma
            eGPU GPU: NVidia Quadro P5000
       UEFI settings: Above 4G decoding, IOMMU and SR-IOV all *enabled*
    Other PCIe slots:
                     GPU: AMD Radeon Pro W5700
                      M2: Two Seagate FireCuda 520 (ZP2000GM30002)
                    WiFi: Intel AX200 (factory-default)

The eGPU is configured like this in libvirt:

    <hostdev mode="subsystem" type="pci" managed="yes">
      <source><address domain="0x0000" bus="0x3d" slot="0x00" function="0x0"/></source>
      <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
    </hostdev>

Now the problem: Forwarding of the NVidia card inside the eGPU into virtual 
machines was flaky up to 5.12.x (i.e., sometimes worked, sometimes didn't) and 
stopped working entirely in 5.13:

    virsh # start FreeBSD
    error: Failed to start domain 'FreeBSD'
    error: internal error: qemu unexpectedly closed the monitor: 2021-07-11T10:34:09.102381Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:3d:00.0: error getting device from group 49: Invalid argument
    Verify all devices in group 49 are bound to vfio-<bus> or pci-stub and not already in use

    virsh # start Windows
    error: Failed to start domain 'Windows'
    error: internal error: qemu unexpectedly closed the monitor: qxl_send_events: spice-server bug: guest stopped, ignoring
    2021-07-11T10:34:36.163549Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio_listener_region_add received unaligned region
    2021-07-11T10:34:39.432499Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio_listener_region_del received unaligned region
    2021-07-11T10:34:39.567039Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio 0000:3d:00.0: error getting device from group 49: Invalid argument
    Verify all devices in group 49 are bound to vfio-<bus> or pci-stub and not already in use
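
The last line of both errors asks to verify the driver bindings in IOMMU group 
49. A minimal sketch of that check follows, assuming the usual sysfs layout 
under /sys/kernel/iommu_groups. The function name list_group_drivers and its 
optional sysfs-root argument are made up for illustration and are not part of 
any tool.

```shell
#!/bin/sh
# List every device in an IOMMU group together with the driver bound to it.
# Usage: list_group_drivers GROUP [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys; overriding it is only meant for dry runs
# against a copied or mocked directory tree (an assumption of this sketch).
list_group_drivers() {
    group=$1
    root=${2:-/sys}
    for dev in "$root/kernel/iommu_groups/$group/devices"/*; do
        # Skip the unexpanded glob when the group is empty or missing.
        [ -e "$dev" ] || continue
        if [ -L "$dev/driver" ]; then
            # 'driver' is a symlink to the driver directory;
            # its last path component is the driver name.
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv="(none)"
        fi
        printf '%s %s\n' "$(basename "$dev")" "$drv"
    done
}
```

Calling "list_group_drivers 49" prints one "address driver" pair per line. Any 
entry showing a host driver other than vfio-pci or pci-stub would match the 
condition named in the error message.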

============
With 5.12.x:

There were "lucky" and "unlucky" boots/uptimes. VMs could be started and restarted arbitrarily during the 
"lucky" uptimes and the NVidia eGPU worked flawlessly. During an "unlucky" uptime, the errors above popped up 
every single time and no VMs using the eGPU could be started. Restarts of the eGPU did not help. The likelihood of a 
"lucky" uptime was roughly 1:3, so it took quite a few reboots to get there. :-( /o\
============

============
With 5.13.x:

After boot, the eGPU on Thunderbolt initially doesn't work at all. It won't 
show up in lspci, the nvidia module is not loaded etc. Switching the eGPU 
off/on won't help. Surprisingly, the only way to make it initialize (that I've 
discovered thus far) is:
    modprobe -r thunderbolt
    modprobe thunderbolt

After that^^^ the eGPU and the NVidia GPU are detected, modules are loaded, 
nvidia-smi works and shows information etc., but attempts at VM startup 
_always_ produce the errors above. I have not seen a "lucky" uptime in >50 
boots. :-( Also, before unloading and reloading thunderbolt, there is simply no 
device 3d:00.0 anywhere on PCI (and no trace of NVidia elsewhere), so that 
machine state is a (VM) non-starter.
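
The reload workaround above can be sketched as a small helper that reloads the 
module and then polls sysfs until the device appears. This is only a sketch: 
the function names, the optional sysfs-root argument and the 30-second default 
timeout are assumptions chosen for illustration, and the reload itself needs 
root.

```shell
#!/bin/sh
# Wait until a PCI device shows up in sysfs, polling once per second.
# Usage: wait_for_pci_device BDF TIMEOUT_SECONDS [SYSFS_ROOT]
# Returns 0 once the device directory exists, 1 if the timeout expires.
wait_for_pci_device() {
    bdf=$1
    remaining=$2
    root=${3:-/sys}
    while [ ! -e "$root/bus/pci/devices/$bdf" ]; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Reload the thunderbolt module, then wait for the given device.
# Usage: reload_thunderbolt_and_wait BDF [TIMEOUT_SECONDS]
# The 30-second default is an arbitrary choice for this sketch.
reload_thunderbolt_and_wait() {
    modprobe -r thunderbolt && modprobe thunderbolt &&
        wait_for_pci_device "$1" "${2:-30}"
}
```

For the setup described here the call would be 
"reload_thunderbolt_and_wait 0000:3d:00.0". A non-zero exit status means 
either the reload failed or the device did not appear in time.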

What else I tried:
    * options thunderbolt start_icm=1  -- no change (plus admittedly I have no 
      clue what the internal connection manager means/does)
    * options vfio_iommu_type1 disable_hugepages=1  -- "What if the 'unaligned 
      region' is related to huge pages?" No change here either. /o\
    * lots of reboots, Thunderbolt disconnects/reconnects etc. Nope. It won't 
      work.
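
To rule out that the two module options above were silently ignored, their 
runtime values can be read back from sysfs. A minimal sketch, assuming both 
parameters are exported under /sys/module/<module>/parameters/ on the running 
kernel; the helper name module_param and its optional sysfs-root argument are 
hypothetical.

```shell
#!/bin/sh
# Print the runtime value of a kernel module parameter as exposed in sysfs.
# Usage: module_param MODULE PARAM [SYSFS_ROOT]
# Prints "unavailable" and returns 1 when the module is not loaded or the
# parameter is not exported.
module_param() {
    file="${3:-/sys}/module/$1/parameters/$2"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unavailable"
        return 1
    fi
}
```

Example calls would be "module_param thunderbolt start_icm" and 
"module_param vfio_iommu_type1 disable_hugepages". Boolean parameters are 
normally reported as Y or N.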
============

Final note: Without the extra command line tokens, namely pcie_ports=native 
pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G, 
the NVidia eGPU just won't work on either 5.12.x or 5.13.x. Way more details 
about that are here:
    https://egpu.io/forums/postid/90608/
    https://bbs.archlinux.org/viewtopic.php?id=261303
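
Since the setup depends on those tokens, it can be worth confirming after each 
kernel update that the booted kernel actually received them. A minimal sketch, 
with a made-up helper name (missing_cmdline_tokens), that compares whole 
whitespace-separated tokens against a command line file such as /proc/cmdline:

```shell
#!/bin/sh
# Report which expected tokens are absent from a kernel command line.
# Usage: missing_cmdline_tokens CMDLINE_FILE TOKEN...
# Prints each missing token on its own line; prints nothing if all are present.
# Tokens are compared as whole words, so a pci=... list must match exactly.
missing_cmdline_tokens() {
    # Pad with spaces so first and last tokens match the same pattern.
    cmdline=" $(cat "$1") "
    shift
    for token in "$@"; do
        case $cmdline in
            *" $token "*) ;;                   # token present, nothing to report
            *) printf '%s\n' "$token" ;;       # token missing
        esac
    done
}
```

For example, "missing_cmdline_tokens /proc/cmdline pcie_ports=native 
mem_encrypt=on" prints nothing when both tokens are present.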

What should I try next to debug the issue and, importantly, to keep my VMs 
working on 5.13.x and beyond? Any ideas, tips, magic kernel command line tokens 
etc. would be very helpful.

Cheers!
Andrej


_______________________________________________
Virtualization mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
