Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Alex Williamson
On Fri, 9 Feb 2024 13:19:03 -0400
Jason Gunthorpe  wrote:

> On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> > I think Kevin's point is also relevant to this latter scenario: in the
> > L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> > cacheable, but in the L2 instance of the driver, where we only use the
> > vfio-pci-core ops, nothing maintains that cacheable mapping.  Is that a
> > problem?  An uncached mapping on top of a cacheable mapping is often
> > prone to problems.
> 
> On these CPUs the ARM architecture won't permit it; the L0 level
> blocks uncacheable mappings using FWB and page table attributes. The VM, no
> matter what it does, cannot make the cacheable memory uncacheable.

Great, thanks,

Alex




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Jason Gunthorpe
On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> I think Kevin's point is also relevant to this latter scenario: in the
> L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> cacheable, but in the L2 instance of the driver, where we only use the
> vfio-pci-core ops, nothing maintains that cacheable mapping.  Is that a
> problem?  An uncached mapping on top of a cacheable mapping is often
> prone to problems.

On these CPUs the ARM architecture won't permit it; the L0 level
blocks uncacheable mappings using FWB and page table attributes. The VM, no
matter what it does, cannot make the cacheable memory uncacheable.

Jason



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Alex Williamson
On Fri, 9 Feb 2024 09:20:22 +
Ankit Agrawal  wrote:

> Thanks Kevin for the review. Comments inline.
> 
> >>
> >> Note that the usemem memory is added by the VM Nvidia device driver [5]
> >> to the VM kernel as memblocks. Hence make the usable memory size
> >> memblock
> >> aligned.  
> >
> > Is memblock size defined in spec or purely a guest implementation choice?  
> 
> The MEMBLOCK value is a hardwired, constant ABI value between the GPU
> FW and the VFIO driver.
> 
> >>
> >> If the bare metal properties are not present, the driver registers the
> >> vfio-pci-core function pointers.  
> >
> > so if QEMU doesn't generate such a property, the variant driver running
> > inside the guest will always fall back to the core functions and guest vfio
> > userspace will observe both the resmem and usemem BARs. But then there is
> > nothing in the field to prohibit mapping the resmem BAR as cacheable.
> >
> > should this driver check the presence of either ACPI property or
> > resmem/usemem bars to enable variant function pointers?  
> 
> Maybe I am missing something here, but if the ACPI property is absent,
> the real physical BARs present on the device will be exposed by the
> vfio-pci-core functions to the VM. So I think if the variant driver is run
> within the VM, it should not see the fake usemem and resmem BARs.

There are two possibilities here: either we're assigning the pure
physical device from a host that does not have the ACPI properties, or
we're performing a nested assignment.  In the former case we're simply
passing along the unmodified physical BARs.  In the latter case we're
actually passing through the fake BARs; the virtualization of the
device has already happened in the level 1 assignment.

I think Kevin's point is also relevant to this latter scenario: in the
L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
cacheable, but in the L2 instance of the driver, where we only use the
vfio-pci-core ops, nothing maintains that cacheable mapping.  Is that a
problem?  An uncached mapping on top of a cacheable mapping is often
prone to problems.  Thanks,

Alex




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Ankit Agrawal
>> >
>> > IMO, this seems like adding too much code to reduce the call length for a
>> > very specific case. If there aren't any strong opinions on this, I'm
>> > planning to leave this code as it is.
>>
>> A slight difference: if mem_count == 0 the result should always succeed
>> whether or not nvgrace_gpu_map_device_mem() succeeds. Of course, if it
>> fails that is already a big problem and probably nobody cares about the
>> subtle difference when reading a non-existent range.
>>
>> But regarding readability it's still clearer:
>>
>> if (mem_count)
>>   nvgrace_gpu_map_and_read();
>>
>
> The below has better flow imo vs conditionalizing the call to
> map_and_read/write and subsequent error handling, but I don't think
> either adds too much code.  Thanks,
>
> Alex
>
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -429,6 +429,9 @@ nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>  	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>  	int ret;
>  
> +	if (!mem_count)
> +		return 0;
> +
>  	/*
>  	 * Handle read on the BAR regions. Map to the target device memory
>  	 * physical address and copy to the request read buffer.
> @@ -547,6 +550,9 @@ nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>  	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>  	int ret;
>  
> +	if (!mem_count)
> +		return 0;
> +
>  	ret = nvgrace_gpu_map_device_mem(index, nvdev);
>  	if (ret)
>  		return ret;

Sure, will update it as mentioned above.


Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Ankit Agrawal
Thanks Kevin for the review. Comments inline.

>>
>> Note that the usemem memory is added by the VM Nvidia device driver [5]
>> to the VM kernel as memblocks. Hence make the usable memory size
>> memblock
>> aligned.
>
> Is memblock size defined in spec or purely a guest implementation choice?

The MEMBLOCK value is a hardwired, constant ABI value between the GPU
FW and the VFIO driver.
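
For illustration, a minimal sketch of what "memblock aligned" means for the
usable size, assuming the driver simply carves the reserved region off the end
of the aperture and rounds the usable length down to MEMBLK_SIZE (the helper
macros are standard kernel ones; this function itself is illustrative, not the
patch code):

#include <linux/math.h>
#include <linux/overflow.h>

/* Illustrative only: split the aperture into usemem and resmem lengths. */
static int nvgrace_gpu_size_usemem(u64 memlength, u64 *usemem_len, u64 *resmem_len)
{
	/* Reserve RESMEM_SIZE at the end of the device memory for resmem. */
	if (check_sub_overflow(memlength, (u64)RESMEM_SIZE, usemem_len))
		return -EOVERFLOW;

	/*
	 * The VM driver hot-adds usemem to the VM kernel in MEMBLK_SIZE
	 * chunks, so keep the usable length memblock aligned.
	 */
	*usemem_len = round_down(*usemem_len, MEMBLK_SIZE);
	*resmem_len = memlength - *usemem_len;
	return 0;
}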

>>
>> If the bare metal properties are not present, the driver registers the
>> vfio-pci-core function pointers.
>
> so if QEMU doesn't generate such a property, the variant driver running
> inside the guest will always fall back to the core functions and guest vfio
> userspace will observe both the resmem and usemem BARs. But then there is
> nothing in the field to prohibit mapping the resmem BAR as cacheable.
>
> should this driver check the presence of either ACPI property or
> resmem/usemem bars to enable variant function pointers?

Maybe I am missing something here, but if the ACPI property is absent,
the real physical BARs present on the device will be exposed by the
vfio-pci-core functions to the VM. So I think if the variant driver is run
within the VM, it should not see the fake usemem and resmem BARs.
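
For illustration, a minimal sketch of that probe-time decision, assuming the
aperture is published through two firmware properties read with
device_property_read_u64(); the property names, ops symbols and helper below
are placeholders, not the patch's actual identifiers:

#include <linux/pci.h>
#include <linux/property.h>
#include <linux/vfio.h>

/* Placeholder ops tables standing in for the real variant/core ops. */
extern const struct vfio_device_ops nvgrace_gpu_variant_ops;
extern const struct vfio_device_ops nvgrace_gpu_core_only_ops;

static const struct vfio_device_ops *
nvgrace_gpu_select_ops(struct pci_dev *pdev, u64 *memphys, u64 *memlength)
{
	/*
	 * Bare-metal firmware advertises the coherent device-memory aperture
	 * via ACPI properties. If they are absent (e.g. the driver is bound
	 * inside a VM), fall back to plain vfio-pci-core behaviour so only
	 * the real physical BARs are exposed.
	 */
	if (device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-phys", memphys) ||
	    device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-size", memlength))
		return &nvgrace_gpu_core_only_ops;

	return &nvgrace_gpu_variant_ops;
}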

>> +config NVGRACE_GPU_VFIO_PCI
>> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper
>> Superchip"
>> + depends on ARM64 || (COMPILE_TEST && 64BIT)
>> + select VFIO_PCI_CORE
>> + help
>> +   VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
>> +   required to assign the GPU device using KVM/qemu/etc.
>
> "assign the GPU device to userspace"

Ack, will make the change.

>> +
>> +/* Memory size expected as non cached and reserved by the VM driver */
>> +#define RESMEM_SIZE 0x4000
>> +#define MEMBLK_SIZE 0x2000
>
> also add a comment for MEMBLK_SIZE

Yes.

>> +
>> +struct nvgrace_gpu_vfio_pci_core_device {
>
> will nvgrace refer to a non-gpu device? if not probably all prefixes with
> 'nvgrace_gpu' can be simplified to 'nvgrace'.

nvgrace_gpu indicates the GPU associated with the Grace CPU here. That is
one of the reasons behind keeping the nomenclature as nvgrace_gpu. Calling
it nvgrace may not be apt since Grace itself is the CPU.

> btw following other variant drivers 'vfio' can be removed too.

Ack.

>> +
>> +/*
>> + * Both the usable (usemem) and the reserved (resmem) device memory
>> region
>> + * are 64b fake BARs in the VM. These fake BARs must respond
>
> s/VM/device/

I can make it clearer with something like "64b fake device BARs in the VM".
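
As an aside, the way such a fake 64b memory BAR register can be synthesized
for config-space reads is a standard PCI trick; a minimal sketch follows
(the function name is illustrative, the flag macros are the usual PCI ones
quoted later in this thread):

#include <linux/pci.h>
#include <linux/types.h>

/* Illustrative: compose the value returned for an emulated memory BAR. */
static __le64 fake_bar_read_value(u64 bar_size, u32 flags, u64 bar_val)
{
	/*
	 * The low bits of a memory BAR are read-only flags (64b type,
	 * prefetchable); the address bits above the power-of-2 size simply
	 * echo whatever the guest last wrote.
	 */
	return cpu_to_le64((bar_val & ~(bar_size - 1)) | flags);
}

Here flags would be PCI_BASE_ADDRESS_MEM_TYPE_64 | PCI_BASE_ADDRESS_MEM_PREFETCH
for these prefetchable 64b regions.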

>> + int ret;
>> +
>> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> + if (ret < 0)
>> + return ret;
>
> here if core_read succeeds *ppos has been updated...

Thanks for pointing that out, will fix the code handling *ppos.
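
One possible shape of that fix, sketched here for illustration only (the
emulate_fake_bars() helper is a placeholder for the register emulation quoted
above, not a function in the patch): record how far vfio_pci_core_read()
advanced *ppos and roll it back if the emulation afterwards fails.

static int emulate_fake_bars(struct vfio_device *core_vdev, char __user *buf,
			     size_t count, loff_t pos); /* placeholder */

static ssize_t
nvgrace_gpu_read_config_emu_sketch(struct vfio_device *core_vdev,
				   char __user *buf, size_t count, loff_t *ppos)
{
	ssize_t done;
	int ret;

	done = vfio_pci_core_read(core_vdev, buf, count, ppos);
	if (done < 0)
		return done;

	ret = emulate_fake_bars(core_vdev, buf, count, *ppos - done);
	if (ret) {
		/* Undo the advance made by the successful core read. */
		*ppos -= done;
		return ret;
	}
	return done;
}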

>> + * Read the data from the device memory (mapped either through ioremap
>> + * or memremap) into the user buffer.
>> + */
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device
>> *nvdev,
>> +  char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> + int ret;
>> +
>> + /*
>> +  * Handle read on the BAR regions. Map to the target device memory
>> +  * physical address and copy to the request read buffer.
>> +  */
>
> duplicate with the earlier comment for the function.

Ack.

>> +/*
>> + * Read count bytes from the device memory at an offset. The actual device
>> + * memory size (available) may not be a power-of-2. So the driver fakes
>> + * the size to a power-of-2 (reported) when exposing to a user space driver.
>> + *
>> + * Reads extending beyond the reported size are truncated; reads starting
>> + * beyond the reported size generate -EINVAL; reads extending beyond the
>> + * actual device size are filled with ~0.
>
> slightly clearer to order the description: read starting beyond reported
> size, then read extending beyond device size, and read extending beyond
> reported size.

Yes, will make the change.
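
For illustration, those bounds can be captured in a small helper (illustrative
only, not the patch code): given the reported power-of-2 BAR size and the
actual device memory length, it returns how many bytes the caller should
serve and how many of those are backed by real memory (the rest are filled
with ~0).

#include <linux/errno.h>
#include <linux/minmax.h>

/* Illustrative: clamp a read at 'offset' of 'count' bytes into a fake BAR. */
static ssize_t fake_bar_read_bounds(loff_t offset, size_t count,
				    size_t memlength, size_t bar_size,
				    size_t *mem_count)
{
	/* Reads starting beyond the reported (power-of-2) size: -EINVAL. */
	if (offset >= bar_size)
		return -EINVAL;

	/* Reads extending beyond the reported size are truncated. */
	count = min_t(size_t, count, bar_size - offset);

	/* Only bytes below memlength are backed by device memory. */
	if (offset >= memlength)
		*mem_count = 0;
	else
		*mem_count = min_t(size_t, count, memlength - offset);

	return count;
}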

>> +  * exposing the rest (termed as usable memory and represented
>> using usemem)
>> +  * as cacheable 64b BAR (region 4 and 5).
>> +  *
>> +  *               devmem (memlength)
>> +  * |-----------------------------------------------|
>> +  * |                                        |
>> +  * usemem.phys/memphys                      resmem.phys
>
> there is no usemem.phys and resmem.phys

Silly miss. Will fix that.

>> +  */
>> + nvdev->usemem.memphys = memphys;
>> +
>> + /*
>> +  * The device memory exposed to the VM is added to the kernel by
>> the
>> +  * VM driver module in chunks of memory block size. Only the usable
>> +  * memory (usemem) is added to the kernel for usage by the VM
>> +  * workloads. Make the usable memory size memblock aligned.
>> +  */
>
> If memblock size is defined by hw spec then say so.
> otherwise this 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-08 Thread Alex Williamson
On Thu, 8 Feb 2024 07:21:40 +
"Tian, Kevin"  wrote:

> > From: Ankit Agrawal 
> > Sent: Thursday, February 8, 2024 3:13 PM  
> > >> > +    * Determine how many bytes to be actually read from the
> > >> > device memory.
> > >> > +    * Read request beyond the actual device memory size is
> > >> > filled with ~0,
> > >> > +    * while those beyond the actual reported size is skipped.
> > >> > +    */
> > >> > +   if (offset >= memregion->memlength)
> > >> > +   mem_count = 0;  
> > >>
> > >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> > >> necessary.  
> > >
> > > Harmless, other than the possibly unnecessary call through to
> > > nvgrace_gpu_map_device_mem().  Maybe both  
> > nvgrace_gpu_map_and_read()  
> > > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > > first operation when !mem_count.  Thanks,
> > >
> > >Alex  
> > 
> > IMO, this seems like adding too much code to reduce the call length for a
> > very specific case. If there aren't any strong opinions on this, I'm
> > planning to leave this code as it is.
> 
> A slight difference: if mem_count == 0 the result should always succeed
> whether or not nvgrace_gpu_map_device_mem() succeeds. Of course, if it
> fails that is already a big problem and probably nobody cares about the
> subtle difference when reading a non-existent range.
> 
> But regarding readability it's still clearer:
> 
> if (mem_count)
>   nvgrace_gpu_map_and_read();
> 

The below has better flow imo vs conditionalizing the call to
map_and_read/write and subsequent error handling, but I don't think
either adds too much code.  Thanks,

Alex

--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -429,6 +429,9 @@ nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
 	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
 	int ret;
 
+	if (!mem_count)
+		return 0;
+
 	/*
 	 * Handle read on the BAR regions. Map to the target device memory
 	 * physical address and copy to the request read buffer.
@@ -547,6 +550,9 @@ nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
 	int ret;
 
+	if (!mem_count)
+		return 0;
+
 	ret = nvgrace_gpu_map_device_mem(index, nvdev);
 	if (ret)
 		return ret;




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-08 Thread Ankit Agrawal
>> >>
>> >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
>> >> necessary.
>> >
>> > Harmless, other than the possibly unnecessary call through to
>> > nvgrace_gpu_map_device_mem().  Maybe both
>> nvgrace_gpu_map_and_read()
>> > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
>> > first operation when !mem_count.  Thanks,
>> >
>> >Alex
>>
>> IMO, this seems like adding too much code to reduce the call length for a
>> very specific case. If there aren't any strong opinions on this, I'm
>> planning to leave this code as it is.
>
> A slight difference: if mem_count == 0 the result should always succeed
> whether or not nvgrace_gpu_map_device_mem() succeeds. Of course, if it
> fails that is already a big problem and probably nobody cares about the
> subtle difference when reading a non-existent range.
>
> But regarding readability it's still clearer:
>
> if (mem_count)
>    nvgrace_gpu_map_and_read();

Makes sense. I'll change it.


RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Tian, Kevin
> From: Ankit Agrawal 
> Sent: Thursday, February 8, 2024 3:13 PM
> >> > +    * Determine how many bytes to be actually read from the
> >> > device memory.
> >> > +    * Read request beyond the actual device memory size is
> >> > filled with ~0,
> >> > +    * while those beyond the actual reported size is skipped.
> >> > +    */
> >> > +   if (offset >= memregion->memlength)
> >> > +   mem_count = 0;
> >>
> >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> >> necessary.
> >
> > Harmless, other than the possibly unnecessary call through to
> > nvgrace_gpu_map_device_mem().  Maybe both
> nvgrace_gpu_map_and_read()
> > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > first operation when !mem_count.  Thanks,
> >
> >Alex
> 
> IMO, this seems like adding too much code to reduce the call length for a
> very specific case. If there aren't any strong opinions on this, I'm
> planning to leave this code as it is.

A slight difference: if mem_count == 0 the result should always succeed
whether or not nvgrace_gpu_map_device_mem() succeeds. Of course, if it
fails that is already a big problem and probably nobody cares about the
subtle difference when reading a non-existent range.

But regarding readability it's still clearer:

if (mem_count)
	nvgrace_gpu_map_and_read();



RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Tuesday, February 6, 2024 7:01 AM
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size
> memblock
> aligned.

Is memblock size defined in spec or purely a guest implementation choice?

> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.

so if QEMU doesn't generate such a property, the variant driver running
inside the guest will always fall back to the core functions and guest vfio
userspace will observe both the resmem and usemem BARs. But then there is
nothing in the field to prohibit mapping the resmem BAR as cacheable.

should this driver check the presence of either ACPI property or 
resmem/usemem bars to enable variant function pointers?

> +config NVGRACE_GPU_VFIO_PCI
> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper
> Superchip"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
> + select VFIO_PCI_CORE
> + help
> +   VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> +   required to assign the GPU device using KVM/qemu/etc.

"assign the GPU device to userspace"

> +
> +/* Memory size expected as non cached and reserved by the VM driver */
> +#define RESMEM_SIZE 0x4000
> +#define MEMBLK_SIZE 0x2000

also add a comment for MEMBLK_SIZE

> +
> +struct nvgrace_gpu_vfio_pci_core_device {

will nvgrace refer to a non-gpu device? if not probably all prefixes with
'nvgrace_gpu' can be simplified to 'nvgrace'.

btw following other variant drivers 'vfio' can be removed too.

> +
> +/*
> + * Both the usable (usemem) and the reserved (resmem) device memory
> region
> + * are exposed as 64b fake BARs in the VM. These fake BARs must respond

s/VM/device/

> + * to the accesses on their respective PCI config space offsets.
> + *
> + * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
> + * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
> + */
> +static ssize_t
> +nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev =
> +		container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> +			     core_device.vdev);
> + struct mem_region *memregion = NULL;
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;

here if core_read succeeds *ppos has been updated...

> +
> +	if (vfio_pci_core_range_intersect_range(pos, count, PCI_BASE_ADDRESS_2,
> +						sizeof(val64),
> +						&copy_offset, &copy_count,
> +						&register_offset))
> +		memregion = nvgrace_gpu_memregion(RESMEM_REGION_INDEX, nvdev);
> +	else if (vfio_pci_core_range_intersect_range(pos, count,
> +						     PCI_BASE_ADDRESS_4,
> +						     sizeof(val64),
> +						     &copy_offset, &copy_count,
> +						     &register_offset))
> +		memregion = nvgrace_gpu_memregion(USEMEM_REGION_INDEX, nvdev);
> +
> +	if (memregion) {
> +		val64 = nvgrace_gpu_get_read_value(memregion->bar_size,
> +						   PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +						   PCI_BASE_ADDRESS_MEM_PREFETCH,
> +						   memregion->bar_val);
> +		if (copy_to_user(buf + copy_offset,
> +				 (void *)&val64 + register_offset, copy_count))
> +			return -EFAULT;

...but here it's not adjusted back upon error.

> +
> +/*
> + * Read the data from the device memory (mapped either through ioremap
> + * or memremap) into the user buffer.
> + */
> +static int
> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device
> *nvdev,
> +  char __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + /*
> +  * Handle read on the BAR regions. Map to the target device memory
> +  * physical address and copy to the request read buffer.
> +  */

duplicate with the earlier comment for the function. 

> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Reads extending beyond the reported size are truncated; reads starting
> + * beyond the reported size generate 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Ankit Agrawal
>> > +/* Memory size expected as non cached and reserved by the VM driver */
>> > +#define RESMEM_SIZE 0x4000
>> > +#define MEMBLK_SIZE 0x2000
>> > +
>>
>> Maybe use SZ_* definitions in linux/size.h
>
> Good suggestion.

Ack.
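
For reference, a sketch of what the SZ_* form could look like, with the
comment Kevin asked for on MEMBLK_SIZE. The 1G reserved size comes from the
commit message; the memblock chunk size shown here is an assumption for
illustration, since the hex values above are truncated in the archive.

#include <linux/sizes.h>

/* Memory reserved (uncached) by the VM driver to support the MIG workaround. */
#define RESMEM_SIZE	SZ_1G
/*
 * Size of the chunks in which the VM driver hot-adds usemem to the VM kernel;
 * a hardwired ABI value between the GPU FW and this driver.
 * (Assumed value, for illustration only.)
 */
#define MEMBLK_SIZE	SZ_512M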

>>
>> Better move this part to the place between vfio_pci_core_enable() and
>> vfio_pci_core_finish_enable() like others for respecting the expected
>> device initialization sequence of life cycle.
>>
>> It doesn't bite something right now, but think about when someone
>> changes the behavior of vfio_pci_core_finish_enable() in the future,
>> they have to propose a patch for this.
>
> Agree.

Good point, will move it.

>>
>> Wouldn't it be better to do the map in the open path?
>
> AIUI the device would typically be used exclusively through the mmap of
> these ranges; these mappings are only for pread/pwrite-type accesses,
> so I think it makes sense to map them on demand.

That's right, agree with Alex to keep it on-demand.


>> > +    * Determine how many bytes to be actually read from the
>> > device memory.
>> > +    * Read request beyond the actual device memory size is
>> > filled with ~0,
>> > +    * while those beyond the actual reported size is skipped.
>> > +    */
>> > +   if (offset >= memregion->memlength)
>> > +   mem_count = 0;
>>
>> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
>> necessary.
>
> Harmless, other than the possibly unnecessary call through to
> nvgrace_gpu_map_device_mem().  Maybe both nvgrace_gpu_map_and_read()
> and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> first operation when !mem_count.  Thanks,
>
>Alex

IMO, this seems like adding too much code to reduce the call length for a
very specific case. If there aren't any strong opinions on this, I'm planning to
leave this code as it is.



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Ankit Agrawal
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 8999497011a2..529ec8966f58 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -23103,6 +23103,12 @@ L:   k...@vger.kernel.org
>>  S:   Maintained
>>  F:   drivers/vfio/platform/
>>
>> +VFIO NVIDIA GRACE GPU DRIVER
>> +M:   Ankit Agrawal 
>> +L:   k...@vger.kernel.org
>> +S:   Supported
>> +F:   drivers/vfio/pci/nvgrace-gpu/
>> +
>
> Entries should be alphabetical.  This will end up colliding with [1] so
> I'll plan to fix it either way.

I will make the change to put it at the right place.


> Otherwise just a couple optional comments from me below.  I see Zhi also
> has a few good comments.  I'd suggest soliciting a review from the other
> variant driver reviewers for this version and maybe we can make v18 the
> final version.  Thanks,
>
> Alex
> 
> [1]https://lore.kernel.org/all/20240205235427.2103714-1-alex.william...@redhat.com/

Great!


>> +static ssize_t
>> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
>> +			     const char __user *buf, size_t count, loff_t *ppos)
>> +{
>> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev =
>> +		container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
>> +			     core_device.vdev);
>> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> + size_t register_offset;
>> + loff_t copy_offset;
>> + size_t copy_count;
>> + struct mem_region *memregion = NULL;
>
> Nit, consistency and reverse Christmas tree variable declaration would
> suggest pushing this up in the list, but it's not strictly required.

Ack, I'll make the change.


>> + if (index == USEMEM_REGION_INDEX && !memregion->memaddr) {
>> + memregion->memaddr = memremap(memregion->memphys,
>> +   memregion->memlength,
>> +   MEMREMAP_WB);
>> + if (!memregion->memaddr)
>> + ret = -ENOMEM;
>> + } else if (index == RESMEM_REGION_INDEX && !memregion->ioaddr) {
>> + memregion->ioaddr = ioremap_wc(memregion->memphys,
>> +    memregion->memlength);
>> + if (!memregion->ioaddr)
>> + ret = -ENOMEM;
>> + }
>
> As .memaddr and .ioaddr are a union we can consolidate the NULL test.

Ok, will do that.
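
For reference, a minimal sketch of the consolidated test, relying on .memaddr
and .ioaddr sharing storage in the union (struct, field and index names as
quoted above; the wrapper function itself is illustrative, not the patch code):

static int nvgrace_gpu_map_region(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
				  unsigned int index)
{
	struct mem_region *memregion = nvgrace_gpu_memregion(index, nvdev);

	/*
	 * .memaddr and .ioaddr alias each other in the union, so a single
	 * NULL test covers both the memremap()'d usemem and the
	 * ioremap_wc()'d resmem cases.
	 */
	if (memregion->memaddr)
		return 0;

	if (index == USEMEM_REGION_INDEX)
		memregion->memaddr = memremap(memregion->memphys,
					      memregion->memlength,
					      MEMREMAP_WB);
	else
		memregion->ioaddr = ioremap_wc(memregion->memphys,
					       memregion->memlength);

	return memregion->memaddr ? 0 : -ENOMEM;
}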



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Alex Williamson
On Tue, 6 Feb 2024 04:31:23 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from host
> ACPI table using device_property_read_u64(), according to the FW
> specification. Since the device memory is cache coherent with the CPU,
> it can be mmapped into the user VMA with a cacheable mapping using
> remap_pfn_range() and used like regular RAM. The device memory
> is not added to the host kernel, but mapped directly as this reduces
> memory wastage due to struct pages.
> 
> There is also a requirement of a reserved 1G uncached region (termed as
> resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is
> to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> Qemu VMA page properties (pgprot) as NORMAL_NC.
> 
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> The decision for a different approach was taken in view of the fact that
> it would necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR (i.e.
> comprising of region 2 and 3). The cached device memory aperture is
> assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
> 
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and dropped writes on such access. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space register.
> 
> The memory layout on the host looks like the following:
>                devmem (memlength)
> |-------------------------------------------------|
> |-------------cached-------------------------|-NC-|
> |                                             |
> usemem.phys/memphys                           resmem.phys
> 
> PCI BARs need to be aligned to a power-of-2, but the actual memory on the
> device may not. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size memblock
> aligned.
> 
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped pgprot_writecombine(), that
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [4] and [3], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.
> 
> This goes along with a qemu series [6] that provides the necessary
> implementation of the Grace Hopper 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Alex Williamson
On Thu, 8 Feb 2024 00:32:10 +0200
Zhi Wang  wrote:

> On Tue, 6 Feb 2024 04:31:23 +0530
>  wrote:
> 
> > From: Ankit Agrawal 
> > 
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary chip-to-chip cache coherent interconnect.
> > 
> > The device is peculiar compared to a real PCI device in that whilst
> > there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> > device, it is not used to access device memory once the faster
> > chip-to-chip interconnect is initialized (occurs at the time of host
> > system boot). The device memory is accessed instead using the
> > chip-to-chip interconnect that is exposed as a contiguous physically
> > addressable region on the host. This device memory aperture can be
> > obtained from host ACPI table using device_property_read_u64(),
> > according to the FW specification. Since the device memory is cache
> > coherent with the CPU, it can be mmapped into the user VMA with a
> > cacheable mapping using remap_pfn_range() and used like regular
> > RAM. The device memory is not added to the host kernel, but mapped
> > directly as this reduces memory wastage due to struct pages.
> > 
> > There is also a requirement of a reserved 1G uncached region (termed
> > as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This
> > is to work around a HW defect. Based on [2], the requisite properties
> > (uncached, unaligned access) can be achieved through a VM mapping (S1)
> > of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> > a different non-cached property to the reserved 1G region, it needs to
> > be carved out from the device memory and mapped as a separate region
> > in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> > Qemu VMA page properties (pgprot) as NORMAL_NC.
> > 
> > Provide a VFIO PCI variant driver that adapts the unique device memory
> > representation into a more standard PCI representation facing
> > userspace.
> > 
> > The variant driver exposes these two regions - the non-cached reserved
> > (resmem) and the cached rest of the device memory (termed as usemem)
> > as separate VFIO 64b BAR regions. This is divergent from the baremetal
> > approach, where the device memory is exposed as a device memory
> > region. The decision for a different approach was taken in view of
> > the fact that it would necessitate additional code in Qemu to discover
> > and insert those regions in the VM IPA, along with the additional VM
> > ACPI DSDT changes to communicate the device memory region IPA to the
> > VM workloads. Moreover, this behavior would have to be added to a
> > variety of emulators (beyond top of tree Qemu) out there desiring
> > grace hopper support.
> > 
> > Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> > maps the uncached carved out region to the next available PCI BAR
> > (i.e. comprising of region 2 and 3). The cached device memory
> > aperture is assigned BAR region 4 and 5. Qemu will then naturally
> > generate a PCI device in the VM with the uncached aperture reported
> > as BAR2 region, the cacheable as BAR4. The variant driver provides
> > emulation for these fake BARs' PCI config space offset registers.
> > 
> > The hardware ensures that the system does not crash when the memory
> > is accessed with the memory enable turned off. It synthesizes ~0 reads
> > and dropped writes on such access. So there is no need to support the
> > disablement/enablement of BAR through PCI_COMMAND config space
> > register.
> > 
> > The memory layout on the host looks like the following:
> >                devmem (memlength)
> > |-------------------------------------------------|
> > |-------------cached-------------------------|-NC-|
> > |                                             |
> > usemem.phys/memphys                           resmem.phys
> > 
> > PCI BARs need to be aligned to a power-of-2, but the actual memory
> > on the device may not. A read or write access to the physical address
> > from the last device PFN up to the next power-of-2 aligned physical
> > address results in reading ~0 and dropped writes. Note that the GPU
> > device driver [6] is capable of knowing the exact device memory size
> > through separate means. The device memory size is primarily kept in
> > the system ACPI tables for use by the VFIO PCI variant module.
> > 
> > Note that the usemem memory is added by the VM Nvidia device driver
> > [5] to the VM kernel as memblocks. Hence make the usable memory size
> > memblock aligned.
> > 
> > Currently there is no provision in KVM for a S2 mapping with
> > MemAttr[2:0]=0b101, but there is an ongoing effort to provide the
> > same [3]. As previously mentioned, resmem is mapped
> > pgprot_writecombine(), that sets the Qemu VMA page properties
> > (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM
> > marks the region with 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Zhi Wang
On Tue, 6 Feb 2024 04:31:23 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the
> chip-to-chip interconnect that is exposed as a contiguous physically
> addressable region on the host. This device memory aperture can be
> obtained from host ACPI table using device_property_read_u64(),
> according to the FW specification. Since the device memory is cache
> coherent with the CPU, it can be mmapped into the user VMA with a
> cacheable mapping using remap_pfn_range() and used like regular
> RAM. The device memory is not added to the host kernel, but mapped
> directly as this reduces memory wastage due to struct pages.
> 
> There is also a requirement of a reserved 1G uncached region (termed
> as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This
> is to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> Qemu VMA page properties (pgprot) as NORMAL_NC.
> 
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing
> userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem)
> as separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory
> region. The decision for a different approach was taken in view of
> the fact that it would necessitate additional code in Qemu to discover
> and insert those regions in the VM IPA, along with the additional VM
> ACPI DSDT changes to communicate the device memory region IPA to the
> VM workloads. Moreover, this behavior would have to be added to a
> variety of emulators (beyond top of tree Qemu) out there desiring
> grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR
> (i.e. comprising of region 2 and 3). The cached device memory
> aperture is assigned BAR region 4 and 5. Qemu will then naturally
> generate a PCI device in the VM with the uncached aperture reported
> as BAR2 region, the cacheable as BAR4. The variant driver provides
> emulation for these fake BARs' PCI config space offset registers.
> 
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and dropped writes on such access. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space
> register.
> 
> The memory layout on the host looks like the following:
>                devmem (memlength)
> |-------------------------------------------------|
> |-------------cached-------------------------|-NC-|
> |                                             |
> usemem.phys/memphys                           resmem.phys
> 
> PCI BARs need to be aligned to a power-of-2, but the actual memory
> on the device may not. A read or write access to the physical address
> from the last device PFN up to the next power-of-2 aligned physical
> address results in reading ~0 and dropped writes. Note that the GPU
> device driver [6] is capable of knowing the exact device memory size
> through separate means. The device memory size is primarily kept in
> the system ACPI tables for use by the VFIO PCI variant module.
> 
> Note that the usemem memory is added by the VM Nvidia device driver
> [5] to the VM kernel as memblocks. Hence make the usable memory size
> memblock aligned.
> 
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the
> same [3]. As previously mentioned, resmem is mapped
> pgprot_writecombine(), that sets the Qemu VMA page properties
> (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM
> marks the region with MemAttr[2:0]=0b101 in S2.
> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.
> 
> This goes along with a qemu series [6] that provides the necessary
> implementation of the Grace 

[PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-05 Thread ankita
From: Ankit Agrawal 

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst
there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
device, it is not used to access device memory once the faster
chip-to-chip interconnect is initialized (occurs at the time of host
system boot). The device memory is accessed instead using the chip-to-chip
interconnect that is exposed as a contiguous physically addressable
region on the host. This device memory aperture can be obtained from host
ACPI table using device_property_read_u64(), according to the FW
specification. Since the device memory is cache coherent with the CPU,
it can be mmapped into the user VMA with a cacheable mapping using
remap_pfn_range() and used like regular RAM. The device memory
is not added to the host kernel, but mapped directly as this reduces
memory wastage due to struct pages.

There is also a requirement of a reserved 1G uncached region (termed as
resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is
to work around a HW defect. Based on [2], the requisite properties
(uncached, unaligned access) can be achieved through a VM mapping (S1)
of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
a different non-cached property to the reserved 1G region, it needs to
be carved out from the device memory and mapped as a separate region
in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
Qemu VMA page properties (pgprot) as NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory
representation into a more standard PCI representation facing userspace.

The variant driver exposes these two regions - the non-cached reserved
(resmem) and the cached rest of the device memory (termed as usemem) as
separate VFIO 64b BAR regions. This is divergent from the baremetal
approach, where the device memory is exposed as a device memory region.
The decision for a different approach was taken in view of the fact that
it would necessitate additional code in Qemu to discover and insert those
regions in the VM IPA, along with the additional VM ACPI DSDT changes to
communicate the device memory region IPA to the VM workloads. Moreover,
this behavior would have to be added to a variety of emulators (beyond
top of tree Qemu) out there desiring grace hopper support.

Since the device implements 64-bit BAR0, the VFIO PCI variant driver
maps the uncached carved out region to the next available PCI BAR (i.e.
comprising of region 2 and 3). The cached device memory aperture is
assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
device in the VM with the uncached aperture reported as BAR2 region,
the cacheable as BAR4. The variant driver provides emulation for these
fake BARs' PCI config space offset registers.

The hardware ensures that the system does not crash when the memory
is accessed with the memory enable turned off. It synthesizes ~0 reads
and dropped writes on such access. So there is no need to support the
disablement/enablement of BAR through PCI_COMMAND config space register.

The memory layout on the host looks like the following:
               devmem (memlength)
|-------------------------------------------------|
|-------------cached-------------------------|-NC-|
|                                             |
usemem.phys/memphys                           resmem.phys

PCI BARs need to be aligned to a power-of-2, but the actual memory on the
device may not. A read or write access to the physical address from the
last device PFN up to the next power-of-2 aligned physical address
results in reading ~0 and dropped writes. Note that the GPU device
driver [6] is capable of knowing the exact device memory size through
separate means. The device memory size is primarily kept in the system
ACPI tables for use by the VFIO PCI variant module.

Note that the usemem memory is added by the VM Nvidia device driver [5]
to the VM kernel as memblocks. Hence make the usable memory size memblock
aligned.

Currently there is no provision in KVM for a S2 mapping with
MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
As previously mentioned, resmem is mapped pgprot_writecombine(), that
sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
proposed changes in [4] and [3], KVM marks the region with
MemAttr[2:0]=0b101 in S2.

If the bare metal properties are not present, the driver registers the
vfio-pci-core function pointers.

This goes along with a qemu series [6] that provides the necessary
implementation of the Grace Hopper Superchip firmware specification so
that the guest operating system can see the correct ACPI modeling for
the coherent GPU device. Verified with the CUDA workload in the VM.

[1]