Hi Andrew,

On Tue, 1 May 2018 19:30:58 +0000
Andrew Zimmerman <a...@rincon.com> wrote:
> Alex,
>
> Thank you for your reply and all of your ideas. You are right that
> the SXM uses NVLink - I had not thought of that as a potential
> culprit. I do not have any PCIe GPUs in this cluster, but I may be
> able to set up a standalone test on an older box.

I tried it on a RHEL7.5 host, RHEL7.5 guest, assigned (PCIe) Tesla P4,
driver 390.46, cuda-samples-9-1:

# ./simpleAtomicIntrinsics
simpleAtomicIntrinsics starting...
GPU Device 0: "Tesla P4" with compute capability 6.1

> GPU device has 20 Multi-Processors, SM 6.1 compute capabilities

Processing time: 114.939003 (ms)

simpleAtomicIntrinsics completed, returned OK

> I have not seen a specific mention from NVIDIA regarding VFIO support
> for this form factor of the Tesla V100, but there were talks at GTC
> regarding using Tesla cards with VFIO.

Yes, we (RH & NVIDIA) support assignment of Tesla, GRID, and
sufficiently expensive Quadro cards with vfio, and the vGPU framework
for KVM is built on vfio, but all of this is only for PCIe-based
devices AFAIK.

> Do you know of a better guide you could point me to for getting up
> and running with VFIO? I was thinking that it felt like a
> permissions issue (as I can query the device, but not write to it),
> so it could be an issue with how it had me set up the ACLs...

Those ACLs apply only on the host. You can't do device assignment
without the guest having full access to the device, so if you can
assign the device at all, those ACLs are not the problem.

If you started with just RHEL/CentOS 7.4 installed as a hypervisor and
you have somewhere you can run virt-manager (i.e. a Linux desktop), the
key steps for a compute GPU are:

 - make sure the IOMMU is enabled on the host (intel_iommu=on on the
   host kernel command line, assuming an x86_64 system),
 - blacklist nouveau on the host, just as if you were going to install
   the nvidia driver on the host,
 - create and install a VM with virt-manager,
 - blacklist nouveau in the VM as well, because you are going to
   install the nvidia driver there, and
 - use virt-manager to add the Tesla to the VM, then install the
   driver and CUDA dev kit in the guest.

There are command-line tools for all of this too (virt-install,
virt-viewer, virsh), but virt-manager just makes it easier; a rough
sketch of the equivalent host-side commands is in the P.S. below.

I'm interested in your experience, but I'll be rather surprised if an
NVLink setup "just works", and I'm perhaps a bit dubious about whether
it should just work, given the likely lack of isolation in such a mesh
environment.

Thanks,
Alex
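
P.S. In case a cheat sheet helps, here's a minimal sketch of the
host-side steps above done from a shell. This assumes a RHEL/CentOS 7
host booting via grub2, a hypothetical Tesla at PCI address 01:00.0,
and a hypothetical guest named "rhel7-cuda" - substitute the address
lspci reports on your system and your own domain name:

  # enable the IOMMU on the host kernel command line (x86_64/Intel)
  grubby --update-kernel=ALL --args="intel_iommu=on"

  # blacklist nouveau on the host and rebuild the initramfs
  echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
  echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
  dracut --force
  reboot

  # find the GPU's PCI address (-nn includes [vendor:device] IDs)
  lspci -nn | grep -i nvidia

  # describe the device in a libvirt hostdev snippet, e.g.
  # /tmp/tesla.xml for the hypothetical 01:00.0 above:
  #
  #   <hostdev mode='subsystem' type='pci' managed='yes'>
  #     <source>
  #       <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  #     </source>
  #   </hostdev>

  # attach it to the guest's persistent config; managed='yes' has
  # libvirt handle the vfio-pci binding for you at VM startup
  virsh attach-device rhel7-cuda /tmp/tesla.xml --config

After that, boot the guest, blacklist nouveau there as well, and
install the nvidia driver and CUDA toolkit inside the guest as usual.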