Hello,

I am building a proof-of-concept server for some GPGPU-related
calculations. The idea is to pass all GPUs through to VMs for easier
testing of code and better isolation between the programs and the host (if
anything goes wrong, in the best case only the VM has to be stopped, or at
least the host does not hang and can be rebooted without having to press a
physical reset button).

The base specs of the machine:
- Motherboard: Supermicro X9DRi-LN4F+
- CPU: 2x Intel Xeon E5-2670
- RAM: 96GB DDR3
- GPU: 6x AMD RX570 4GB (Sapphire Nitro+)
- SSDs + ZFS + Kernel 5.4.101 (LTS) + VFIO modules + ACS patch

As the motherboard has only 6 PCIe slots, they are populated like this:
- CPU0, slot 1: GPU
- CPU0, slot 2: GPU
- CPU0, slot 3: GPU
- CPU1, slot 4: NVMe SSD
- CPU1, slot 5: NVMe SSD
- CPU1, slot 6: ASM1184e PCIe switch
  (https://www.amazon.com/XT-XINTE-PCI-express-External-Adapter-Multiplier/dp/B07CWPWDF8)
     -> port 1: GPU
     -> port 2: GPU
     -> port 3: GPU

IOMMU groups: https://pastebin.com/SvuWtGcz
GPU1 PCI details (same for GPU2-3 except the addresses):
https://pastebin.com/6wh4Hz8v
GPU6 PCI details (slightly different from GPU1-3; same for GPU4-5 except
the addresses): https://pastebin.com/aDRfiLXA
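
For anyone who wants to reproduce those listings, standard lspci/sysfs
commands are enough, roughly:

# PCIe topology as a tree - the ASM1184e switch and its downstream GPUs
# show up under one of CPU1's root ports
lspci -tvnn

# list every device per IOMMU group
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done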

I have successfully set up VFIO for the GPUs in slots 1-3 (CPU0) with
...

/etc/modprobe.d/vfio-pci.conf:
options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1

/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=id:1b21:1184"

... and then a VM (Windows or Linux) with GPU1-3 assigned works perfectly.
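
(For reference, the binding can be verified like this - assuming a
Debian-style update-grub / update-initramfs -u workflow was used to apply
the config above; both GPU functions should report vfio-pci as the kernel
driver in use:)

lspci -nnk -d 1002:67df
lspci -nnk -d 1002:aaf0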

The problem appears when I also try to pass through GPU4-6 (CPU1), which
sit behind the PCIe switch. It doesn't matter if I try to pass through only
one of those GPUs; the result is the same. When I start the VM, I see this
repeated multiple times in dmesg:
DMAR: DRHD: handling fault status reg 40

This line is also present at machine boot, before: DMAR-IR: Enabled IRQ
remapping in x2apic mode
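
In case the neighbouring lines are useful, this is an easy way to pull all
DMAR-related messages out of dmesg together with a bit of context:

dmesg | grep -iA3 'DMAR'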

However, some time after the VM starts, I also begin to get kernel errors,
which I pasted here: https://pastebin.com/zRVJ63yZ (the host hangs at that
point and just keeps throwing those errors via ssh / dmesg -wH; they vary
slightly, but I have only caught what is in the paste).

I have tried a lot of different configurations: changing intel_iommu
options (igfx_off, sp_off, ...), allowing unsafe interrupts, changing VM
args/settings, and I still can't figure out what is going wrong. I don't
know enough about the kernel and its internals to understand what the
errors it throws actually mean.
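
(By "allowing unsafe interrupts" I mean this module option - mentioning it
explicitly in case I applied it incorrectly:)

/etc/modprobe.d/vfio_iommu_type1.conf:
options vfio_iommu_type1 allow_unsafe_interrupts=1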

I found a partial solution to the problem which unfortunately works only
for Linux guests - but I'd like Windows to work as well. The trick is to
remove pcie_acs_override and add pci=nommconf. The PCIe switch devices then
end up in a single IOMMU group, and when assigning them to a Linux VM there
is no error at all - everything is recognized. In Windows, on the other
hand, I always get an exclamation mark in Device Manager saying there are
not enough resources for the device to work properly, and I first have to
disable the PCIe option in the VM settings, boot, shut down and re-add the
PCIe option, because otherwise the VM doesn't boot at all.
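
To be explicit, the kernel command line for this partial solution looks
like this (the rest of the grub config is unchanged):

/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pci=nommconf"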

I have now spent almost a week trying to figure out what is causing the
issues, and it seems I don't have enough skill to find it out by myself. I
would be very happy for any help/tip/... to get this resolved, if that is
even possible, or at least for confirmation that there is no way this will
work correctly. Passing through the complete PCIe switch with all the GPUs
on it would also be fine (I already tried to do that, but the GPUs are
enumerated at boot time and I can't unbind/remove them later on, as the
pcieport module claims the switch ports at boot).
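
(What I tried for the manual unbind/rebind was along these lines - the
address below is a placeholder, not one of my actual ones; the switch's
upstream/downstream ports themselves stay claimed by pcieport, which is
where I got stuck:)

DEV=0000:83:00.0                                       # placeholder address
echo "$DEV"   > /sys/bus/pci/devices/$DEV/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/$DEV/driver_override
echo "$DEV"   > /sys/bus/pci/drivers_probe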


Thank you very much!

Best Regards,
Marjan Novi
_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://listman.redhat.com/mailman/listinfo/vfio-users
