Re: [Qemu-devel] [PATCH qemu v5] spapr: Support NVIDIA V100 GPU with NVLink2
On Fri, 8 Mar 2019 12:44:20 +1100 Alexey Kardashevskiy wrote:
> [...]
Re: [Qemu-devel] [PATCH qemu v5] spapr: Support NVIDIA V100 GPU with NVLink2
On 08/03/2019 15:30, David Gibson wrote:
> On Fri, Mar 08, 2019 at 12:44:20PM +1100, Alexey Kardashevskiy wrote:
>> [...]
Re: [Qemu-devel] [PATCH qemu v5] spapr: Support NVIDIA V100 GPU with NVLink2
On Fri, Mar 08, 2019 at 12:44:20PM +1100, Alexey Kardashevskiy wrote:
> [...]
[Qemu-devel] [PATCH qemu v5] spapr: Support NVIDIA V100 GPU with NVLink2
NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which include an ATS shootdown (ATSD) register exported via the NVLink
bridge device.

This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.

This adds additional steps to sPAPR PHB setup:

1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;

2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;

3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;

4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses them for link discovery.

This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.

This puts new memory nodes in a separate NUMA node to replicate the host
system setup as the GPU driver relies on this.

This adds a requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPUs in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.
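
The "one IOMMU group per vPHB" rule above can be illustrated with a
minimal, self-contained sketch. This is not the code from the patch:
the PassthroughDev type and the vphb_atsd_allowed() helper are
hypothetical names made up for illustration, and treating an empty vPHB
as "no ATSD" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for a device passed through to a
 * vPHB. Only the host IOMMU group id matters for this check.
 */
typedef struct {
    int iommu_group;
} PassthroughDev;

/*
 * Decide whether ATSD may be advertised for a vPHB.
 *
 * ATSD registers belong to one physical NPU, so the capability is only
 * meaningful if every device on the vPHB sits in the same IOMMU group.
 * Returns false for an empty vPHB (assumption: nothing to advertise)
 * and false as soon as a second IOMMU group is seen.
 */
bool vphb_atsd_allowed(const PassthroughDev *devs, size_t n)
{
    if (n == 0) {
        return false;
    }
    for (size_t i = 1; i < n; i++) {
        if (devs[i].iommu_group != devs[0].iommu_group) {
            /* More than one group: the caller would warn and skip ATSD. */
            return false;
        }
    }
    return true;
}
```

A caller would run this once per vPHB after collecting its devices and
skip adding "mmio-atsd" when it returns false.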

Signed-off-by: Alexey Kardashevskiy
---

This is based on David's ppc-for-4.0 +
applied but not pushed "iommu replay": https://patchwork.ozlabs.org/patch/1052644/
acked "vfio_info_cap public": https://patchwork.ozlabs.org/patch/1052645/


Changes:
v5:
* converted MRs to VFIOQuirk - this fixed leaks

v4:
* fixed ATSD placement
* fixed spapr_phb_unrealize() to do nvgpu cleanup
* replaced warn_report() with Error*

v3:
* moved GPU RAM above PCI MMIO limit
* renamed QOM property to nvlink2-tgt
* moved nvlink2 code to its own file

---

The example command line for redbud system:

pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
-enable-kvm -m 384G \
-chardev socket,id=SOCKET0,server,nowait,host=localhost,port=4 \
-mon chardev=SOCKET0,mode=control \
-smp 80,sockets=1,threads=4 \
-netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
-device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
img/vdisk0.img \
-device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
-device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
-device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
-device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
-device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
-device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
-device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
-device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
-device spapr-pci-host-bridge,id=phb1,index=1 \
-device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
-device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
-device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
-device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
-device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
-device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
-device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
-device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
-machine pseries \
-L /home/aik/t/qemu-ppc64-bios/ -d guest_errors

Note that QEMU attaches PCI devices to the last added vPHB so first
8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
---
 hw/ppc/Makefile.objs        |   2 +-
 hw/vfio/pci.h               |   2 +
 include/hw/pci-host/spapr.h |  45
 include/hw/ppc/spapr.h      |   3 +-
 hw/ppc/spapr.c              |  29 ++-
 hw/ppc/spapr_pci.c          |  19 ++
 hw/ppc/spapr_pci_nvlink2.c  | 441
 hw/vfio/pci-quirks.c        | 132 +++
 hw/vfio/pci.c               |  14 ++
 hw/vfio/trace-events        |   4 +
 10 files changed, 686 insertions(+), 5 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_nvlink2.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index b218a048..636e717f207c 100644
--- a/hw/ppc/Makefile.objs