Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Fri, 9 Feb 2024 13:19:03 -0400 Jason Gunthorpe wrote:
> On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> > I think Kevin's point is also relative to this latter scenario, in the
> > L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> > cachable, but in the L2 instance of the driver where we only use the
> > vfio-pci-core ops nothing maintains that cachable mapping. Is that a
> > problem? An uncached mapping on top of a cachable mapping is often
> > prone to problems.
>
> On these CPUs the ARM architecture won't permit it, the L0 level
> blocks uncachable using FWB and page table attributes. The VM, no
> matter what it does, cannot make the cachable memory uncachable.

Great, thanks,
Alex
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> I think Kevin's point is also relative to this latter scenario, in the
> L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> cachable, but in the L2 instance of the driver where we only use the
> vfio-pci-core ops nothing maintains that cachable mapping. Is that a
> problem? An uncached mapping on top of a cachable mapping is often
> prone to problems.

On these CPUs the ARM architecture won't permit it, the L0 level
blocks uncachable using FWB and page table attributes. The VM, no
matter what it does, cannot make the cachable memory uncachable.

Jason
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Fri, 9 Feb 2024 09:20:22 +0000 Ankit Agrawal wrote:
> Thanks Kevin for the review. Comments inline.
>
> >>
> >> Note that the usemem memory is added by the VM Nvidia device driver [5]
> >> to the VM kernel as memblocks. Hence make the usable memory size
> >> memblock aligned.
> >
> > Is memblock size defined in spec or purely a guest implementation choice?
>
> The MEMBLOCK value is a hardwired and constant ABI value between the GPU
> FW and the VFIO driver.
>
> >>
> >> If the bare metal properties are not present, the driver registers the
> >> vfio-pci-core function pointers.
> >
> > so if qemu doesn't generate such property the variant driver running
> > inside guest will always go to use core functions and guest vfio userspace
> > will observe both resmem and usemem bars. But then there is nothing
> > in field to prohibit mapping resmem bar as cacheable.
> >
> > should this driver check the presence of either ACPI property or
> > resmem/usemem bars to enable variant function pointers?
>
> Maybe I am missing something here; but if the ACPI property is absent,
> the real physical BARs present on the device will be exposed by the
> vfio-pci-core functions to the VM. So I think if the variant driver is run
> within the VM, it should not see the fake usemem and resmem BARs.

There are two possibilities here, either we're assigning the pure
physical device from a host that does not have the ACPI properties or
we're performing a nested assignment. In the former case we're simply
passing along the unmodified physical BARs. In the latter case we're
actually passing through the fake BARs, the virtualization of the
device has already happened in the level 1 assignment.

I think Kevin's point is also relative to this latter scenario, in the
L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
cachable, but in the L2 instance of the driver where we only use the
vfio-pci-core ops nothing maintains that cachable mapping. Is that a
problem? An uncached mapping on top of a cachable mapping is often
prone to problems.

Thanks,
Alex
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>>> IMO, this seems like adding too much code to reduce the call length for a
>>> very specific case. If there aren't any strong opinion on this, I'm
>>> planning to leave this code as it is.
>>
>> a slight difference. if mem_count==0 the result should always succeed
>> no matter nvgrace_gpu_map_device_mem() succeeds or not. Of course
>> if it fails it's already a big problem probably nobody cares about the subtle
>> difference when reading non-exist range.
>>
>> but regarding to readability it's still clearer:
>>
>> if (mem_count)
>> 	nvgrace_gpu_map_and_read();
>
> The below has better flow imo vs conditionalizing the call to
> map_and_read/write and subsequent error handling, but I don't think
> either adds too much code. Thanks,
>
> Alex
>
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -429,6 +429,9 @@ nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>  	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>  	int ret;
>  
> +	if (!mem_count)
> +		return 0;
> +
>  	/*
>  	 * Handle read on the BAR regions. Map to the target device memory
>  	 * physical address and copy to the request read buffer.
> @@ -547,6 +550,9 @@ nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>  	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>  	int ret;
>  
> +	if (!mem_count)
> +		return 0;
> +
>  	ret = nvgrace_gpu_map_device_mem(index, nvdev);
>  	if (ret)
>  		return ret;

Sure, will update it as mentioned above.
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
Thanks Kevin for the review. Comments inline.

>>
>> Note that the usemem memory is added by the VM Nvidia device driver [5]
>> to the VM kernel as memblocks. Hence make the usable memory size
>> memblock aligned.
>
> Is memblock size defined in spec or purely a guest implementation choice?

The MEMBLOCK value is a hardwired and constant ABI value between the GPU
FW and the VFIO driver.

>>
>> If the bare metal properties are not present, the driver registers the
>> vfio-pci-core function pointers.
>
> so if qemu doesn't generate such property the variant driver running
> inside guest will always go to use core functions and guest vfio userspace
> will observe both resmem and usemem bars. But then there is nothing
> in field to prohibit mapping resmem bar as cacheable.
>
> should this driver check the presence of either ACPI property or
> resmem/usemem bars to enable variant function pointers?

Maybe I am missing something here; but if the ACPI property is absent,
the real physical BARs present on the device will be exposed by the
vfio-pci-core functions to the VM. So I think if the variant driver is run
within the VM, it should not see the fake usemem and resmem BARs.

>> +config NVGRACE_GPU_VFIO_PCI
>> +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
>> +	depends on ARM64 || (COMPILE_TEST && 64BIT)
>> +	select VFIO_PCI_CORE
>> +	help
>> +	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
>> +	  required to assign the GPU device using KVM/qemu/etc.
>
> "assign the GPU device to userspace"

Ack, will make the change.

>> +
>> +/* Memory size expected as non cached and reserved by the VM driver */
>> +#define RESMEM_SIZE 0x4000
>> +#define MEMBLK_SIZE 0x2000
>
> also add a comment for MEMBLK_SIZE

Yes.

>> +
>> +struct nvgrace_gpu_vfio_pci_core_device {
>
> will nvgrace refer to a non-gpu device? if not probably all prefixes with
> 'nvgrace_gpu' can be simplified to 'nvgrace'.

nvgrace_gpu indicates the GPU associated with the Grace CPU here. That is
one of the reasons behind keeping the nomenclature as nvgrace_gpu.
Calling it nvgrace may not be apt as it is a CPU.

> btw following other variant drivers 'vfio' can be removed too.

Ack.

>> +
>> +/*
>> + * Both the usable (usemem) and the reserved (resmem) device memory region
>> + * are 64b fake BARs in the VM. These fake BARs must respond
>
> s/VM/device/

I can make it clearer to something like "64b fake device BARs in the VM".

>> +	int ret;
>> +
>> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> +	if (ret < 0)
>> +		return ret;
>
> here if core_read succeeds *ppos has been updated...

Thanks for pointing that out, will fix the code handling *ppos.

>> + * Read the data from the device memory (mapped either through ioremap
>> + * or memremap) into the user buffer.
>> + */
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> +			 char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	int ret;
>> +
>> +	/*
>> +	 * Handle read on the BAR regions. Map to the target device memory
>> +	 * physical address and copy to the request read buffer.
>> +	 */
>
> duplicate with the earlier comment for the function.

Ack.

>> +/*
>> + * Read count bytes from the device memory at an offset. The actual device
>> + * memory size (available) may not be a power-of-2. So the driver fakes
>> + * the size to a power-of-2 (reported) when exposing to a user space driver.
>> + *
>> + * Reads extending beyond the reported size are truncated; reads starting
>> + * beyond the reported size generate -EINVAL; reads extending beyond the
>> + * actual device size are filled with ~0.
>
> slightly clearer to order the description: read starting beyond reported
> size, then read extending beyond device size, and read extending beyond
> reported size.

Yes, will make the change.

>> + * exposing the rest (termed as usable memory and represented using usemem)
>> + * as cacheable 64b BAR (region 4 and 5).
>> + *
>> + *               devmem (memlength)
>> + * |-------------------------------------------------|
>> + * |                                           |
>> + * usemem.phys/memphys                         resmem.phys
>
> there is no usemem.phys and resmem.phys

Silly miss. Will fix that.

>> + */
>> +	nvdev->usemem.memphys = memphys;
>> +
>> +	/*
>> +	 * The device memory exposed to the VM is added to the kernel by the
>> +	 * VM driver module in chunks of memory block size. Only the usable
>> +	 * memory (usemem) is added to the kernel for usage by the VM
>> +	 * workloads. Make the usable memory size memblock aligned.
>> +	 */
>
> If memblock size is defined by hw spec then say so.
> otherwise this
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Thu, 8 Feb 2024 07:21:40 +0000 "Tian, Kevin" wrote:
> > From: Ankit Agrawal
> > Sent: Thursday, February 8, 2024 3:13 PM
> > >> > +	 * Determine how many bytes to be actually read from the
> > >> > +	 * device memory. Read request beyond the actual device
> > >> > +	 * memory size is filled with ~0, while those beyond the
> > >> > +	 * actual reported size is skipped.
> > >> > +	 */
> > >> > +	if (offset >= memregion->memlength)
> > >> > +		mem_count = 0;
> > >>
> > >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> > >> necessary.
> > >
> > > Harmless, other than the possibly unnecessary call through to
> > > nvgrace_gpu_map_device_mem(). Maybe both nvgrace_gpu_map_and_read()
> > > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > > first operation when !mem_count. Thanks,
> > >
> > > Alex
> >
> > IMO, this seems like adding too much code to reduce the call length for a
> > very specific case. If there aren't any strong opinion on this, I'm
> > planning to leave this code as it is.
>
> a slight difference. if mem_count==0 the result should always succeed
> no matter nvgrace_gpu_map_device_mem() succeeds or not. Of course
> if it fails it's already a big problem probably nobody cares about the subtle
> difference when reading non-exist range.
>
> but regarding to readability it's still clearer:
>
> if (mem_count)
> 	nvgrace_gpu_map_and_read();

The below has better flow imo vs conditionalizing the call to
map_and_read/write and subsequent error handling, but I don't think
either adds too much code. Thanks,

Alex

--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -429,6 +429,9 @@ nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
 	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
 	int ret;
 
+	if (!mem_count)
+		return 0;
+
 	/*
 	 * Handle read on the BAR regions. Map to the target device memory
 	 * physical address and copy to the request read buffer.
@@ -547,6 +550,9 @@ nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
 	int ret;
 
+	if (!mem_count)
+		return 0;
+
 	ret = nvgrace_gpu_map_device_mem(index, nvdev);
 	if (ret)
 		return ret;
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>>>
>>> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
>>> necessary.
>>
>> Harmless, other than the possibly unnecessary call through to
>> nvgrace_gpu_map_device_mem(). Maybe both nvgrace_gpu_map_and_read()
>> and nvgrace_gpu_map_and_write() could conditionally return 0 as their
>> first operation when !mem_count. Thanks,
>>
>> Alex
>>
>> IMO, this seems like adding too much code to reduce the call length for a
>> very specific case. If there aren't any strong opinion on this, I'm
>> planning to leave this code as it is.
>
> a slight difference. if mem_count==0 the result should always succeed
> no matter nvgrace_gpu_map_device_mem() succeeds or not. Of course
> if it fails it's already a big problem probably nobody cares about the subtle
> difference when reading non-exist range.
>
> but regarding to readability it's still clearer:
>
> if (mem_count)
> 	nvgrace_gpu_map_and_read();

Makes sense. I'll change it.
RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
> From: Ankit Agrawal
> Sent: Thursday, February 8, 2024 3:13 PM
> >> > +	 * Determine how many bytes to be actually read from the
> >> > +	 * device memory. Read request beyond the actual device
> >> > +	 * memory size is filled with ~0, while those beyond the
> >> > +	 * actual reported size is skipped.
> >> > +	 */
> >> > +	if (offset >= memregion->memlength)
> >> > +		mem_count = 0;
> >>
> >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> >> necessary.
> >
> > Harmless, other than the possibly unnecessary call through to
> > nvgrace_gpu_map_device_mem(). Maybe both nvgrace_gpu_map_and_read()
> > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > first operation when !mem_count. Thanks,
> >
> > Alex
>
> IMO, this seems like adding too much code to reduce the call length for a
> very specific case. If there aren't any strong opinion on this, I'm
> planning to leave this code as it is.

a slight difference. if mem_count==0 the result should always succeed
no matter nvgrace_gpu_map_device_mem() succeeds or not. Of course
if it fails it's already a big problem probably nobody cares about the subtle
difference when reading non-exist range.

but regarding to readability it's still clearer:

if (mem_count)
	nvgrace_gpu_map_and_read();
RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
> From: ank...@nvidia.com
> Sent: Tuesday, February 6, 2024 7:01 AM
>
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size
> memblock aligned.

Is memblock size defined in spec or purely a guest implementation choice?

> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.

so if qemu doesn't generate such property the variant driver running
inside guest will always go to use core functions and guest vfio userspace
will observe both resmem and usemem bars. But then there is nothing
in field to prohibit mapping resmem bar as cacheable.

should this driver check the presence of either ACPI property or
resmem/usemem bars to enable variant function pointers?

> +config NVGRACE_GPU_VFIO_PCI
> +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> +	select VFIO_PCI_CORE
> +	help
> +	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> +	  required to assign the GPU device using KVM/qemu/etc.

"assign the GPU device to userspace"

> +
> +/* Memory size expected as non cached and reserved by the VM driver */
> +#define RESMEM_SIZE 0x4000
> +#define MEMBLK_SIZE 0x2000

also add a comment for MEMBLK_SIZE

> +
> +struct nvgrace_gpu_vfio_pci_core_device {

will nvgrace refer to a non-gpu device? if not probably all prefixes with
'nvgrace_gpu' can be simplified to 'nvgrace'.

btw following other variant drivers 'vfio' can be removed too.

> +
> +/*
> + * Both the usable (usemem) and the reserved (resmem) device memory region
> + * are exposed as a 64b fake BARs in the VM. These fake BARs must respond

s/VM/device/

> + * to the accesses on their respective PCI config space offsets.
> + *
> + * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
> + * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
> + */
> +static ssize_t
> +nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
> +			    char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev =
> +		container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> +			     core_device.vdev);
> +	struct mem_region *memregion = NULL;
> +	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	__le64 val64;
> +	size_t register_offset;
> +	loff_t copy_offset;
> +	size_t copy_count;
> +	int ret;
> +
> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> +	if (ret < 0)
> +		return ret;

here if core_read succeeds *ppos has been updated...

> +
> +	if (vfio_pci_core_range_intersect_range(pos, count, PCI_BASE_ADDRESS_2,
> +						sizeof(val64),
> +						&copy_offset, &copy_count,
> +						&register_offset))
> +		memregion = nvgrace_gpu_memregion(RESMEM_REGION_INDEX, nvdev);
> +	else if (vfio_pci_core_range_intersect_range(pos, count,
> +						     PCI_BASE_ADDRESS_4,
> +						     sizeof(val64),
> +						     &copy_offset, &copy_count,
> +						     &register_offset))
> +		memregion = nvgrace_gpu_memregion(USEMEM_REGION_INDEX, nvdev);
> +
> +	if (memregion) {
> +		val64 = nvgrace_gpu_get_read_value(memregion->bar_size,
> +						   PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +						   PCI_BASE_ADDRESS_MEM_PREFETCH,
> +						   memregion->bar_val);
> +		if (copy_to_user(buf + copy_offset,
> +				 (void *)&val64 + register_offset, copy_count))
> +			return -EFAULT;

...but here it's not adjusted back upon error.

> +
> +/*
> + * Read the data from the device memory (mapped either through ioremap
> + * or memremap) into the user buffer.
> + */
> +static int
> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> +			 char __user *buf, size_t mem_count, loff_t *ppos)
> +{
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> +	int ret;
> +
> +	/*
> +	 * Handle read on the BAR regions. Map to the target device memory
> +	 * physical address and copy to the request read buffer.
> +	 */

duplicate with the earlier comment for the function.

> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Reads extending beyond the reported size are truncated; reads starting
> + * beyond the reported size generate -EINVAL.
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>> > +/* Memory size expected as non cached and reserved by the VM driver */
>> > +#define RESMEM_SIZE 0x4000
>> > +#define MEMBLK_SIZE 0x2000
>> > +
>>
>> Maybe use SZ_* definitions in linux/size.h
>
> Good suggestion. Ack.

>>
>> Better move this part to the place between vfio_pci_core_enable() and
>> vfio_pci_core_finish_enable() like others for respecting the expected
>> device initialization sequence of life cycle.
>>
>> It doesn't bite something right now, but think about when someone
>> changes the behavior of vfio_pci_core_finish_enable() in the future,
>> they have to propose a patch for this.
>
> Agree. Good point, will move it.

>>
>> Wouldn't it be better to do the map in the open path?
>
> AIUI the device would typically be used exclusively through the mmap of
> these ranges, these mappings are only for pread/pwrite type accesses,
> so I think it makes sense to map them on demand.

That's right, agree with Alex to keep it on-demand.

>> > +	 * Determine how many bytes to be actually read from the
>> > +	 * device memory. Read request beyond the actual device
>> > +	 * memory size is filled with ~0, while those beyond the
>> > +	 * actual reported size is skipped.
>> > +	 */
>> > +	if (offset >= memregion->memlength)
>> > +		mem_count = 0;
>>
>> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
>> necessary.
>
> Harmless, other than the possibly unnecessary call through to
> nvgrace_gpu_map_device_mem(). Maybe both nvgrace_gpu_map_and_read()
> and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> first operation when !mem_count. Thanks,
>
> Alex

IMO, this seems like adding too much code to reduce the call length for a
very specific case. If there aren't any strong opinion on this, I'm
planning to leave this code as it is.
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 8999497011a2..529ec8966f58 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -23103,6 +23103,12 @@ L:	k...@vger.kernel.org
>>  S:	Maintained
>>  F:	drivers/vfio/platform/
>>
>> +VFIO NVIDIA GRACE GPU DRIVER
>> +M:	Ankit Agrawal
>> +L:	k...@vger.kernel.org
>> +S:	Supported
>> +F:	drivers/vfio/pci/nvgrace-gpu/
>> +
>
> Entries should be alphabetical. This will end up colliding with [1] so
> I'll plan to fix it either way.

I will make the change to put it at the right place.

> Otherwise just a couple optional comments from me below. I see Zhi also
> has a few good comments. I'd suggest soliciting a review from the other
> variant driver reviewers for this version and maybe we can make v18 the
> final version. Thanks,
>
> Alex
>
> [1] https://lore.kernel.org/all/20240205235427.2103714-1-alex.william...@redhat.com/

Great!

>> +static ssize_t
>> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
>> +			     const char __user *buf, size_t count, loff_t *ppos)
>> +{
>> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev =
>> +		container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
>> +			     core_device.vdev);
>> +	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	size_t register_offset;
>> +	loff_t copy_offset;
>> +	size_t copy_count;
>> +	struct mem_region *memregion = NULL;
>
> Nit, consistency and reverse Christmas tree variable declaration would
> suggest pushing this up in the list, but it's not strictly required.

Ack, I'll make the change.

>> +	if (index == USEMEM_REGION_INDEX && !memregion->memaddr) {
>> +		memregion->memaddr = memremap(memregion->memphys,
>> +					      memregion->memlength,
>> +					      MEMREMAP_WB);
>> +		if (!memregion->memaddr)
>> +			ret = -ENOMEM;
>> +	} else if (index == RESMEM_REGION_INDEX && !memregion->ioaddr) {
>> +		memregion->ioaddr = ioremap_wc(memregion->memphys,
>> +					       memregion->memlength);
>> +		if (!memregion->ioaddr)
>> +			ret = -ENOMEM;
>> +	}
>
> As .memaddr and .ioaddr are a union we can consolidate the NULL test.

Ok, will do that.
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Tue, 6 Feb 2024 04:31:23 +0530 wrote:
> From: Ankit Agrawal
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
>
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the
> chip-to-chip interconnect that is exposed as a contiguous physically
> addressable region on the host. This device memory aperture can be
> obtained from the host ACPI table using device_property_read_u64(),
> according to the FW specification. Since the device memory is cache
> coherent with the CPU, it can be mmapped into the user VMA with a
> cacheable mapping using remap_pfn_range() and used like regular RAM.
> The device memory is not added to the host kernel, but mapped
> directly as this reduces memory wastage due to struct pages.
>
> There is also a requirement of a reserved 1G uncached region (termed as
> resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is
> to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> Qemu VMA page properties (pgprot) as NORMAL_NC.
>
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
>
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> The decision for a different approach was taken in view of the fact that
> it would necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring grace hopper support.
>
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR (i.e.
> comprising of region 2 and 3). The cached device memory aperture is
> assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
>
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and drops writes on such access. So there is no need to support the
> disablement/enablement of BAR through the PCI_COMMAND config space
> register.
>
> The memory layout on the host looks like the following:
>                devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> |                                           |
> usemem.phys/memphys                         resmem.phys
>
> PCI BARs need to be aligned to a power-of-2, but the actual memory on
> the device may not be. A read or write access to the physical address
> from the last device PFN up to the next power-of-2 aligned physical
> address results in reading ~0 and dropped writes. Note that the GPU
> device driver [6] is capable of knowing the exact device memory size
> through separate means. The device memory size is primarily kept in
> the system ACPI tables for use by the VFIO PCI variant module.
>
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size
> memblock aligned.
>
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same
> [3]. As previously mentioned, resmem is mapped with
> pgprot_writecombine(), which sets the Qemu VMA page properties (pgprot)
> as NORMAL_NC. Using the proposed changes in [4] and [3], KVM marks the
> region with MemAttr[2:0]=0b101 in S2.
>
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.
>
> This goes along with a qemu series [6] to provide the necessary
> implementation of the Grace Hopper
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Thu, 8 Feb 2024 00:32:10 +0200 Zhi Wang wrote:
> On Tue, 6 Feb 2024 04:31:23 +0530 wrote:
>
> > From: Ankit Agrawal
> >
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary chip-to-chip cache coherent interconnect.
> >
> > The device is peculiar compared to a real PCI device in that whilst
> > there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> > device, it is not used to access device memory once the faster
> > chip-to-chip interconnect is initialized (occurs at the time of host
> > system boot). The device memory is accessed instead using the
> > chip-to-chip interconnect that is exposed as a contiguous physically
> > addressable region on the host. This device memory aperture can be
> > obtained from host ACPI table using device_property_read_u64(),
> > according to the FW specification. Since the device memory is cache
> > coherent with the CPU, it can be mmap into the user VMA with a
> > cacheable mapping using remap_pfn_range() and used like a regular
> > RAM. The device memory is not added to the host kernel, but mapped
> > directly as this reduces memory wastage due to struct pages.
> >
> > There is also a requirement of a reserved 1G uncached region (termed
> > as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This
> > is to work around a HW defect. Based on [2], the requisite properties
> > (uncached, unaligned access) can be achieved through a VM mapping (S1)
> > of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> > a different non-cached property to the reserved 1G region, it needs to
> > be carved out from the device memory and mapped as a separate region
> > in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> > Qemu VMA page properties (pgprot) as NORMAL_NC.
> >
> > Provide a VFIO PCI variant driver that adapts the unique device memory
> > representation into a more standard PCI representation facing
> > userspace.
> >
> > The variant driver exposes these two regions - the non-cached reserved
> > (resmem) and the cached rest of the device memory (termed as usemem)
> > as separate VFIO 64b BAR regions. This is divergent from the baremetal
> > approach, where the device memory is exposed as a device memory
> > region. The decision for a different approach was taken in view of
> > the fact that it would necessitate additional code in Qemu to discover
> > and insert those regions in the VM IPA, along with the additional VM
> > ACPI DSDT changes to communicate the device memory region IPA to the
> > VM workloads. Moreover, this behavior would have to be added to a
> > variety of emulators (beyond top of tree Qemu) out there desiring
> > grace hopper support.
> >
> > Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> > maps the uncached carved out region to the next available PCI BAR
> > (i.e. comprising of region 2 and 3). The cached device memory
> > aperture is assigned BAR region 4 and 5. Qemu will then naturally
> > generate a PCI device in the VM with the uncached aperture reported
> > as BAR2 region, the cacheable as BAR4. The variant driver provides
> > emulation for these fake BARs' PCI config space offset registers.
> >
> > The hardware ensures that the system does not crash when the memory
> > is accessed with the memory enable turned off. It synthesizes ~0 reads
> > and drops writes on such access. So there is no need to support the
> > disablement/enablement of BAR through PCI_COMMAND config space
> > register.
> >
> > The memory layout on the host looks like the following:
> >                devmem (memlength)
> > |--------------------------------------------------|
> > |-------------cached------------------------|--NC--|
> > |                                           |
> > usemem.phys/memphys                         resmem.phys
> >
> > PCI BARs need to be aligned to the power-of-2, but the actual memory
> > on the device may not. A read or write access to the physical address
> > from the last device PFN up to the next power-of-2 aligned physical
> > address results in reading ~0 and dropped writes. Note that the GPU
> > device driver [6] is capable of knowing the exact device memory size
> > through separate means. The device memory size is primarily kept in
> > the system ACPI tables for use by the VFIO PCI variant module.
> >
> > Note that the usemem memory is added by the VM Nvidia device driver
> > [5] to the VM kernel as memblocks. Hence make the usable memory size
> > memblock aligned.
> >
> > Currently there is no provision in KVM for a S2 mapping with
> > MemAttr[2:0]=0b101, but there is an ongoing effort to provide the
> > same [3]. As previously mentioned, resmem is mapped
> > pgprot_writecombine(), that sets the Qemu VMA page properties
> > (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM
> > marks the region with
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Tue, 6 Feb 2024 04:31:23 +0530 wrote:
[PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (this occurs at host system boot time). The device memory is instead accessed through the chip-to-chip interconnect, which is exposed as a contiguous physically addressable region on the host. This device memory aperture can be obtained from the host ACPI table using device_property_read_u64(), according to the FW specification. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping using remap_pfn_range() and used like regular RAM. The device memory is not added to the host kernel, but mapped directly, as this reduces memory wastage due to struct pages.

There is also a requirement for a reserved 1G uncached region (termed resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and a host (S2) mapping with MemAttr[2:0]=0b101. To give the reserved 1G region a different non-cached property, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) to NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace.

The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed usemem) - as separate VFIO 64b BAR regions.
This is divergent from the baremetal approach, where the device memory is exposed as a device memory region. The decision for a different approach was taken because it would necessitate additional code in Qemu to discover and insert those regions in the VM IPA, along with additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) desiring grace hopper support.

Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved-out region to the next available PCI BAR (i.e. comprising regions 2 and 3). The cached device memory aperture is assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as the BAR2 region and the cacheable one as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers.

The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesizes ~0 reads and drops writes on such access. So there is no need to support disabling/enabling the BAR through the PCI_COMMAND config space register.

The memory layout on the host looks like the following:

              devmem (memlength)
 |----------------------------------------|
 |--------------cached----------|---NC----|
 |                              |
 usemem.phys/memphys            resmem.phys

PCI BARs need to be aligned to a power of 2, but the actual memory on the device may not be. A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes. Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module.

Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks.
Hence make the usable memory size memblock aligned.

Currently there is no provision in KVM for an S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped with pgprot_writecombine(), which sets the Qemu VMA page properties (pgprot) to NORMAL_NC. Using the proposed changes in [4] and [3], KVM marks the region with MemAttr[2:0]=0b101 in S2.

If the bare metal properties are not present, the driver registers the vfio-pci-core function pointers.

This goes along with a qemu series [6] that provides the necessary implementation of the Grace Hopper Superchip firmware specification so that the guest operating system can see the correct ACPI modeling for the coherent GPU device. Verified with the CUDA workload in the VM.

[1]