Re: [PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>> Ankit Agrawal (3):
>>   vfio/pci: rename and export do_io_rw()
>>   vfio/pci: rename and export range_intersect_range
>>   vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>>
>>  MAINTAINERS                           |  16 +-
>>  drivers/vfio/pci/Kconfig              |   2 +
>>  drivers/vfio/pci/Makefile             |   2 +
>>  drivers/vfio/pci/nvgrace-gpu/Kconfig  |  10 +
>>  drivers/vfio/pci/nvgrace-gpu/Makefile |   3 +
>>  drivers/vfio/pci/nvgrace-gpu/main.c   | 879 ++
>>  drivers/vfio/pci/vfio_pci_config.c    |  42 ++
>>  drivers/vfio/pci/vfio_pci_rdwr.c      |  16 +-
>>  drivers/vfio/pci/virtio/main.c        |  72 +--
>>  include/linux/vfio_pci_core.h         |  10 +-
>>  10 files changed, 993 insertions(+), 59 deletions(-)
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/Kconfig
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/Makefile
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/main.c
>>
>
> Applied to vfio next branch for v6.9. Thanks,
>
> Alex

Thanks Alex! Appreciate this along with your guidance and help in the reviews.
Re: [PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Tue, 20 Feb 2024 17:20:52 +0530 wrote:
> From: Ankit Agrawal
>
> [...]
[PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst
there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
device, it is not used to access device memory once the faster
chip-to-chip interconnect is initialized (occurs at the time of host
system boot). The device memory is accessed instead using the
chip-to-chip interconnect that is exposed as a contiguous physically
addressable region on the host. Since the device memory is cache
coherent with the CPU, it can be mmapped into the user VMA with a
cacheable mapping and used like regular RAM. The device memory is
not added to the host kernel, but mapped directly as this reduces
memory wastage due to struct pages.

There is also a requirement for a minimum reserved 1G uncached region
(termed resmem) to support the Multi-Instance GPU (MIG) feature [1].
This is to work around a HW defect. Based on [2], the requisite properties
(uncached, unaligned access) can be achieved through a VM mapping (S1)
of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
a different non-cached property to the reserved 1G region, it needs to
be carved out from the device memory and mapped as a separate region
in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets
the Qemu VMA page properties (pgprot) as NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory
representation into a more standard PCI representation facing userspace.

The variant driver exposes these two regions - the non-cached reserved
(resmem) and the cached rest of the device memory (termed usemem) - as
separate VFIO 64b BAR regions. This is divergent from the baremetal
approach, where the device memory is exposed as a device memory region.
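
For illustration only, the carve-out of the reserved region described
above could be sketched in plain C as below. This is not code from the
patch: struct mem_region and split_devmem() are hypothetical names, and
placing resmem at the end of the device memory range is an assumption of
this sketch. Only the 1G reserved size is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G        (1ULL << 30)
#define RESMEM_SIZE  SZ_1G  /* minimum reserved uncached size, per the text */

/* Hypothetical container for one sub-region of the device memory. */
struct mem_region {
	uint64_t memphys;    /* host physical base address */
	uint64_t memlength;  /* length in bytes */
};

/*
 * Split [memphys, memphys + memlength) into a cached part (usemem)
 * followed by an uncached part of RESMEM_SIZE bytes (resmem).
 * Returns 0 on success, or -1 if the device memory is too small to
 * leave any usable memory after the reservation.
 */
static int split_devmem(uint64_t memphys, uint64_t memlength,
			struct mem_region *usemem, struct mem_region *resmem)
{
	assert(usemem && resmem);

	if (memlength <= RESMEM_SIZE)
		return -1;

	usemem->memphys = memphys;
	usemem->memlength = memlength - RESMEM_SIZE;

	/* Assumption: resmem directly follows usemem. */
	resmem->memphys = memphys + usemem->memlength;
	resmem->memlength = RESMEM_SIZE;
	return 0;
}
```

With a 96G device memory range, this sketch yields a 95G usemem at the
start of the range and a 1G resmem right after it. Each of the two would
then back its own 64b BAR.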
The decision for a different approach was taken because the baremetal
approach would necessitate additional code in Qemu to discover and insert
those regions in the VM IPA, along with additional VM ACPI DSDT changes to
communicate the device memory region IPA to the VM workloads. Moreover,
this behavior would have to be added to a variety of emulators (beyond
top of tree Qemu) out there desiring Grace Hopper support.

Since the device implements 64-bit BAR0, the VFIO PCI variant driver
maps the uncached carved out region to the next available PCI BAR (i.e.
comprising regions 2 and 3). The cached device memory aperture is
assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI
device in the VM with the uncached aperture reported as the BAR2 region
and the cacheable one as BAR4. The variant driver provides emulation for
these fake BARs' PCI config space offset registers.

The hardware ensures that the system does not crash when the memory
is accessed with the memory enable turned off. It synthesizes ~0 reads
and dropped writes on such access. So there is no need to support the
disablement/enablement of BAR through the PCI_COMMAND config space
register.

The memory layout on the host looks like the following:

               devmem (memlength)
|--------------------------------------------------|
|-------------cached------------------------|--NC--|
|                                           |
usemem.memphys                              resmem.memphys

PCI BARs need to be aligned to a power of 2, but the actual memory on the
device may not be. A read or write access to the physical address from the
last device PFN up to the next power-of-2 aligned physical address
results in reading ~0 and dropped writes. Note that the GPU device
driver [6] is capable of knowing the exact device memory size through
separate means. The device memory size is primarily kept in the system
ACPI tables for use by the VFIO PCI variant module.

Note that the usemem memory is added by the VM Nvidia device driver [5]
to the VM kernel as memblocks. Hence make the usable memory size memblock
(MEMBLK_SIZE) aligned.
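
As a rough sketch of the two size adjustments described above (memblock
alignment of the usable size, and power-of-2 sizing of the BAR that
reports it), assuming nothing about the actual implementation: the helper
names are hypothetical, and the MEMBLK_SIZE value of 512M is a placeholder
chosen for the example, since this letter does not state it.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder value for illustration; the real value is not given here. */
#define MEMBLK_SIZE (512ULL << 20)

/* Round v down to a multiple of align; align must be a power of two. */
static uint64_t round_down_pow2(uint64_t v, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return v & ~(align - 1);
}

/* Smallest power of two >= v, valid for 1 <= v <= 2^63. */
static uint64_t roundup_pow_of_two_u64(uint64_t v)
{
	uint64_t p = 1;

	assert(v != 0 && v <= (1ULL << 63));
	while (p < v)
		p <<= 1;
	return p;
}

/* Usable memory size after trimming to a MEMBLK_SIZE boundary. */
static uint64_t usable_size(uint64_t raw_length)
{
	return round_down_pow2(raw_length, MEMBLK_SIZE);
}

/* Size to report for a BAR that covers memlength bytes of real memory. */
static uint64_t bar_size(uint64_t memlength)
{
	return roundup_pow_of_two_u64(memlength);
}

/*
 * Model of a byte read through such a BAR: offsets past the real
 * memory (the padding up to the power-of-2 size) read back as ~0.
 */
static uint8_t bar_read8(const uint8_t *mem, uint64_t memlength,
			 uint64_t offset)
{
	return offset < memlength ? mem[offset] : 0xff;
}
```

For example, a raw usable length of 95G plus 100M would be trimmed to 95G,
and a BAR covering 95G would be sized as 128G, with the top 33G behaving
as padding.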
MEMBLK_SIZE is a hardwired ABI value between the GPU FW and the VFIO
driver. The VM device driver makes use of the same value for its
calculation to determine the USEMEM size.

Currently there is no provision in KVM for an S2 mapping with
MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
As previously mentioned, resmem is mapped with pgprot_writecombine(), which
sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
proposed changes in [3] and [4], KVM marks the region with
MemAttr[2:0]=0b101 in S2.

If the device memory properties are not present, the driver registers the
vfio-pci-core function pointers. Since there are no ACPI memory properties
generated for the VM, the variant driver inside the VM will only use
the vfio-pci-core ops and hence try to map the BARs as non-cached. This is
not a problem as the CPUs have FWB enabled, which blocks the VM mapping's
ability to override the