Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy  wrote:  
> >> On 8/6/18 3:04 am, Alex Williamson wrote:  
> >>> On Thu,  7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy  wrote:  
>  diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>  index 7bddf1e..38c9475 100644
>  --- a/drivers/vfio/pci/vfio_pci.c
>  +++ b/drivers/vfio/pci/vfio_pci.c
>  @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
>  *vdev)
>   }
>   }
>   
>  +if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>  +pdev->device == 0x1db1 &&
>  +IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> >>>
> >>> Can't we do better than check this based on device ID?  Perhaps PCIe
> >>> capability hints at this?
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs :03:00.0
> >> :03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >>Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >>Flags: fast devsel, IRQ 497
> >>Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
> >>Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
> >>Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> >> 
> >>Capabilities: [900] #19
> >>
> >>
> >> This is an NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >>Subsystem: NVIDIA Corporation Device 116b
> >>Flags: bus master, fast devsel, latency 0, IRQ 457
> >>Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
> >>Memory at 2600 (64-bit, prefetchable) [size=16G]
> >>Memory at 2604 (64-bit, prefetchable) [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [250] Latency Tolerance Reporting
> >>Capabilities: [258] L1 PM Substates
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> >> 
> >>Capabilities: [900] #19
> >>Kernel driver in use: nvidia
> >>Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >>Subsystem: NVIDIA Corporation Device 1212
> >>Flags: fast devsel, IRQ 82, NUMA node 8
> >>Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >>Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
> >>Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [250] Latency Tolerance Reporting
> >>Capabilities: [258] L1 PM Substates
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> >> 
> >>Capabilities: [900] #19
> >>Capabilities: [ac0] #23
> >>Kernel driver in use: vfio-pci
> >>
> >>
> >> I can only see a new capability #23 which I have no idea about what it
> >> actually does - my latest PCIe spec is
> >> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> >> till #21, do you have any better spec? Does not seem promising anyway...  
> > 
> > You could just look in include/uapi/linux/pci_regs.h and see that 23
> > (0x17) is a TPH Requester capability and google for that...  It's a TLP
> > processing hint related to cache processing for requests from
> > system-specific interconnects.  Sounds rather promising.  Of course
> > there's also the vendor-specific capability that might be probed if
> > NVIDIA will tell you what to look for, and the init function you've
> > implemented looks for specific devicetree nodes, which I imagine you
> > could test for in a probe as well.  
> 
> 
> This 23 is 

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alexey Kardashevskiy
On 8/6/18 1:35 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:09:13 +1000
> Alexey Kardashevskiy  wrote:
>> On 8/6/18 3:04 am, Alex Williamson wrote:
>>> On Thu,  7 Jun 2018 18:44:20 +1000
>>> Alexey Kardashevskiy  wrote:
 diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
 index 7bddf1e..38c9475 100644
 --- a/drivers/vfio/pci/vfio_pci.c
 +++ b/drivers/vfio/pci/vfio_pci.c
 @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
 *vdev)
}
}
  
 +  if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
 +  pdev->device == 0x1db1 &&
 +  IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
>>>
>>> Can't we do better than check this based on device ID?  Perhaps PCIe
>>> capability hints at this?  
>>
>> A normal PCI pluggable device looks like this:
>>
>> root@fstn3:~# sudo lspci -vs :03:00.0
>> :03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>>  Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
>>  Flags: fast devsel, IRQ 497
>>  Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
>>  Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
>>  Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
>>  Capabilities: [60] Power Management version 3
>>  Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>>  Capabilities: [78] Express Endpoint, MSI 00
>>  Capabilities: [100] Virtual Channel
>>  Capabilities: [128] Power Budgeting 
>>  Capabilities: [420] Advanced Error Reporting
>>  Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
>> 
>>  Capabilities: [900] #19
>>
>>
>> This is an NVLink v1 machine:
>>
>> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
>> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
>>  Subsystem: NVIDIA Corporation Device 116b
>>  Flags: bus master, fast devsel, latency 0, IRQ 457
>>  Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
>>  Memory at 2600 (64-bit, prefetchable) [size=16G]
>>  Memory at 2604 (64-bit, prefetchable) [size=32M]
>>  Capabilities: [60] Power Management version 3
>>  Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>>  Capabilities: [78] Express Endpoint, MSI 00
>>  Capabilities: [100] Virtual Channel
>>  Capabilities: [250] Latency Tolerance Reporting
>>  Capabilities: [258] L1 PM Substates
>>  Capabilities: [128] Power Budgeting 
>>  Capabilities: [420] Advanced Error Reporting
>>  Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
>> 
>>  Capabilities: [900] #19
>>  Kernel driver in use: nvidia
>>  Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
>>
>>
>> This is the one the patch is for:
>>
>> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
>> (rev a1)
>>  Subsystem: NVIDIA Corporation Device 1212
>>  Flags: fast devsel, IRQ 82, NUMA node 8
>>  Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
>>  Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
>>  Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
>>  Capabilities: [60] Power Management version 3
>>  Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>>  Capabilities: [78] Express Endpoint, MSI 00
>>  Capabilities: [100] Virtual Channel
>>  Capabilities: [250] Latency Tolerance Reporting
>>  Capabilities: [258] L1 PM Substates
>>  Capabilities: [128] Power Budgeting 
>>  Capabilities: [420] Advanced Error Reporting
>>  Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
>> 
>>  Capabilities: [900] #19
>>  Capabilities: [ac0] #23
>>  Kernel driver in use: vfio-pci
>>
>>
>> I can only see a new capability #23 which I have no idea about what it
>> actually does - my latest PCIe spec is
>> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
>> till #21, do you have any better spec? Does not seem promising anyway...
> 
> You could just look in include/uapi/linux/pci_regs.h and see that 23
> (0x17) is a TPH Requester capability and google for that...  It's a TLP
> processing hint related to cache processing for requests from
> system-specific interconnects.  Sounds rather promising.  Of course
> there's also the vendor-specific capability that might be probed if
> NVIDIA will tell you what to look for, and the init function you've
> implemented looks for specific devicetree nodes, which I imagine you
> could test for in a probe as well.


This 23 is in hex:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
Subsystem: NVIDIA Corporation Device 1212
Flags: fast devsel, IRQ 82, NUMA 

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 13:09:13 +1000
Alexey Kardashevskiy  wrote:
> On 8/6/18 3:04 am, Alex Williamson wrote:
> > On Thu,  7 Jun 2018 18:44:20 +1000
> > Alexey Kardashevskiy  wrote:
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 7bddf1e..38c9475 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
> >> *vdev)
> >>}
> >>}
> >>  
> >> +  if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >> +  pdev->device == 0x1db1 &&
> >> +  IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
> > 
> > Can't we do better than check this based on device ID?  Perhaps PCIe
> > capability hints at this?  
> 
> A normal PCI pluggable device looks like this:
> 
> root@fstn3:~# sudo lspci -vs :03:00.0
> :03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>   Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
>   Flags: fast devsel, IRQ 497
>   Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
>   Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
>   Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> 
>   Capabilities: [900] #19
> 
> 
> This is an NVLink v1 machine:
> 
> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
>   Subsystem: NVIDIA Corporation Device 116b
>   Flags: bus master, fast devsel, latency 0, IRQ 457
>   Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
>   Memory at 2600 (64-bit, prefetchable) [size=16G]
>   Memory at 2604 (64-bit, prefetchable) [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [250] Latency Tolerance Reporting
>   Capabilities: [258] L1 PM Substates
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> 
>   Capabilities: [900] #19
>   Kernel driver in use: nvidia
>   Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> 
> 
> This is the one the patch is for:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
>   Subsystem: NVIDIA Corporation Device 1212
>   Flags: fast devsel, IRQ 82, NUMA node 8
>   Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
>   Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
>   Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [250] Latency Tolerance Reporting
>   Capabilities: [258] L1 PM Substates
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> 
>   Capabilities: [900] #19
>   Capabilities: [ac0] #23
>   Kernel driver in use: vfio-pci
> 
> 
> I can only see a new capability #23 which I have no idea about what it
> actually does - my latest PCIe spec is
> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> till #21, do you have any better spec? Does not seem promising anyway...

You could just look in include/uapi/linux/pci_regs.h and see that 23
(0x17) is a TPH Requester capability and google for that...  It's a TLP
processing hint related to cache processing for requests from
system-specific interconnects.  Sounds rather promising.  Of course
there's also the vendor-specific capability that might be probed if
NVIDIA will tell you what to look for, and the init function you've
implemented looks for specific devicetree nodes, which I imagine you
could test for in a probe as well.
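
For illustration only, a rough sketch of what a probe along those lines could
look like instead of matching on 0x1db1.  The vendor-specific capability check
and the "memory-region" device-tree property name are assumptions taken from
this thread, not anything verified against the hardware or firmware:

#include <linux/of.h>
#include <linux/pci.h>

/* Sketch: detect an NVLink2-attached GPU without hardcoding the device ID. */
static bool vfio_pci_is_nvlink2_gpu(struct pci_dev *pdev)
{
	struct device_node *np = pci_device_to_OF_node(pdev);

	/* All of the GPUs quoted above expose a vendor-specific extended
	 * capability at 0x600, so on its own this only narrows things down. */
	if (!pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_VNDR))
		return false;

	/* Assumed: the platform firmware describes the GPU RAM through a
	 * property on the GPU's device-tree node; the property name is a guess. */
	return np && of_find_property(np, "memory-region", NULL);
}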

> > Is it worthwhile to continue with assigning the device in the !ENABLED
> > case?  For instance, maybe it would be better to provide a weak
> > definition of vfio_pci_nvlink2_init() that would cause us to fail here
> > if we don't have this device specific support enabled.  I realize
> > you're following the example 

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alexey Kardashevskiy
On 8/6/18 3:04 am, Alex Williamson wrote:
> On Thu,  7 Jun 2018 18:44:20 +1000
> Alexey Kardashevskiy  wrote:
> 
>> Some POWER9 chips come with special NVLink2 links which provide
>> cacheable memory access to the RAM physically located on the NVIDIA GPU.
>> This memory is presented to the host via the device tree but remains
>> offline until the NVIDIA driver onlines it.
>>
>> This patch exports that RAM to userspace as a new region so that
>> the NVIDIA driver in the guest can train these links and online the GPU RAM.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  drivers/vfio/pci/Makefile   |   1 +
>>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>>  include/uapi/linux/vfio.h   |   3 +
>>  drivers/vfio/pci/vfio_pci.c |   9 ++
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 
>> 
>>  drivers/vfio/pci/Kconfig|   4 +
>>  6 files changed, 215 insertions(+)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>>  
>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
>> b/drivers/vfio/pci/vfio_pci_private.h
>> index 86aab05..7115b9b 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct 
>> vfio_pci_device *vdev)
>>  return -ENODEV;
>>  }
>>  #endif
>> +#ifdef CONFIG_VFIO_PCI_NVLINK2
>> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
>> +#else
>> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +return -ENODEV;
>> +}
>> +#endif
>>  #endif /* VFIO_PCI_PRIVATE_H */
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 1aa7b82..2fe8227 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG  (2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG   (3)
>>  
>> +/* NVIDIA GPU NV2 */
>> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2  (4)
> 
> You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
> subtype.  Each vendor has their own address space of sub-types.


True, I'll update. I just like unique numbers better :)
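
For reference, a minimal sketch of what the updated define could look like;
the value here is only a placeholder and the final numbering is up to the
maintainers:

/* Sketch only - not the final numbering.  With the region type set to
 * (VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_NVIDIA), the sub-type
 * can restart at 1 in NVIDIA's own namespace. */
#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM	(1)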

> 
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
>> mmapped
>>   * which allows direct access to non-MSIX registers which happened to be 
>> within
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 7bddf1e..38c9475 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  }
>>  }
>>  
>> +if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>> +pdev->device == 0x1db1 &&
>> +IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> 
> Can't we do better than check this based on device ID?  Perhaps PCIe
> capability hints at this?

A normal PCI pluggable device looks like this:

root@fstn3:~# sudo lspci -vs :03:00.0
:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Flags: fast devsel, IRQ 497
Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting 
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 

Capabilities: [900] #19


This is an NVLink v1 machine:

aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
Subsystem: NVIDIA Corporation Device 116b
Flags: bus master, fast devsel, latency 0, IRQ 457
Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
Memory at 2600 (64-bit, prefetchable) [size=16G]
Memory at 2604 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual 

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson
On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy  wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on the NVIDIA GPU.
> This memory is presented to the host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This patch exports that RAM to userspace as a new region so that
> the NVIDIA driver in the guest can train these links and online the GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h   |   3 +
>  drivers/vfio/pci/vfio_pci.c |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 
> 
>  drivers/vfio/pci/Kconfig|   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct 
> vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> + return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2   (4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be 
> within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   }
>   }
>  
> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> + pdev->device == 0x1db1 &&
> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.
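
One untested sketch of that idea against this RFC: keep the -ENODEV stub from
vfio_pci_private.h, drop the IS_ENABLED() check, and treat a failure as fatal
in the same way the IGD block above bails out, so the device is never handed
to userspace without its NVLink2 region:

	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA && pdev->device == 0x1db1) {
		ret = vfio_pci_nvlink2_init(vdev);
		if (ret) {
			/* -ENODEV from the !CONFIG_VFIO_PCI_NVLINK2 stub
			 * ends up here as well */
			dev_warn(&vdev->pdev->dev,
				 "Failed to setup NVIDIA NV2 RAM region\n");
			vfio_pci_disable(vdev);
			return ret;
		}
	}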

> + ret = vfio_pci_nvlink2_init(vdev);
> + if (ret)
> + dev_warn(&vdev->pdev->dev,
> +  "Failed to setup NVIDIA NV2 RAM region\n");
> + }
> +
>   vfio_pci_probe_mmaps(vdev);
>  
>   return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *   Author: Alex Williamson 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> + unsigned long gpu_hpa;
> + unsigned long useraddr;
> + unsigned long size;
> + struct mm_struct 

[RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alexey Kardashevskiy
Some POWER9 chips come with special NVLink2 links which provide
cacheable memory access to the RAM physically located on the NVIDIA GPU.
This memory is presented to the host via the device tree but remains
offline until the NVIDIA driver onlines it.

This patch exports that RAM to userspace as a new region so that
the NVIDIA driver in the guest can train these links and online the GPU RAM.
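
As background only, a rough userspace-side sketch (not part of this patch) of
how a consumer such as QEMU might locate a device-specific region like this
one through the region-info capability chain.  The sub-type value 4 is the one
used in this RFC and may well change; num_regions would come from
VFIO_DEVICE_GET_INFO:

#include <linux/vfio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Sketch: return the region index of the NVLink2 RAM region, or -1. */
static int find_nvlink2_ram_region(int device_fd, unsigned int num_regions)
{
	for (unsigned int i = VFIO_PCI_NUM_REGIONS; i < num_regions; i++) {
		struct vfio_region_info probe = { .argsz = sizeof(probe), .index = i };
		struct vfio_region_info *info;
		struct vfio_info_cap_header *hdr;
		int found = 0;

		/* The first call reports the argsz needed for the capability chain */
		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &probe) ||
		    !(probe.flags & VFIO_REGION_INFO_FLAG_CAPS))
			continue;

		info = calloc(1, probe.argsz);
		if (!info)
			return -1;
		info->argsz = probe.argsz;
		info->index = i;
		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info)) {
			free(info);
			continue;
		}

		/* Walk the capability chain looking for the type/sub-type capability */
		for (hdr = (struct vfio_info_cap_header *)((char *)info + info->cap_offset);
		     ;
		     hdr = (struct vfio_info_cap_header *)((char *)info + hdr->next)) {
			if (hdr->id == VFIO_REGION_INFO_CAP_TYPE) {
				struct vfio_region_info_cap_type *t = (void *)hdr;

				found = (t->type & VFIO_REGION_TYPE_PCI_VENDOR_MASK) == 0x10de &&
					t->subtype == 4; /* value used in this RFC */
				break;
			}
			if (!hdr->next)
				break;
		}
		free(info);
		if (found)
			return (int)i;
	}
	return -1;
}

The matching region would then be read, written or mmapped through the size
and offset reported for that index.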

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/pci/Makefile   |   1 +
 drivers/vfio/pci/vfio_pci_private.h |   8 ++
 include/uapi/linux/vfio.h   |   3 +
 drivers/vfio/pci/vfio_pci.c |   9 ++
 drivers/vfio/pci/vfio_pci_nvlink2.c | 190 
 drivers/vfio/pci/Kconfig|   4 +
 6 files changed, 215 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 76d8ec0..9662c06 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,6 @@
 
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 86aab05..7115b9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device 
*vdev)
return -ENODEV;
 }
 #endif
+#ifdef CONFIG_VFIO_PCI_NVLINK2
+extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
+{
+   return -ENODEV;
+}
+#endif
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 1aa7b82..2fe8227 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
 
+/* NVIDIA GPU NV2 */
+#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2 (4)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7bddf1e..38c9475 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
}
}
 
+   if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+   pdev->device == 0x1db1 &&
+   IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+   ret = vfio_pci_nvlink2_init(vdev);
+   if (ret)
+   dev_warn(&vdev->pdev->dev,
+"Failed to setup NVIDIA NV2 RAM region\n");
+   }
+
vfio_pci_probe_mmaps(vdev);
 
return 0;
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
b/drivers/vfio/pci/vfio_pci_nvlink2.c
new file mode 100644
index 000..451c5cb
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ * Author: Alexey Kardashevskiy 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Register an on-GPU RAM region for cacheable access.
+ *
+ * Derived from original vfio_pci_igd.c:
+ * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
+ * Author: Alex Williamson 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vfio_pci_private.h"
+
+struct vfio_pci_nvlink2_data {
+   unsigned long gpu_hpa;
+   unsigned long useraddr;
+   unsigned long size;
+   struct mm_struct *mm;
+   struct mm_iommu_table_group_mem_t *mem;
+};
+
+static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
+   char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+   unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+   void *base = vdev->region[i].data;
+   loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+   if (pos >= vdev->region[i].size)
+   return -EINVAL;
+
+   count = min(count, (size_t)(vdev->region[i].size - pos));
+
+   if (iswrite) {
+   if (copy_from_user(base + pos, buf, count))
+   return -EFAULT;
+   } else {
+   if (copy_to_user(buf, base + pos, count))
+   return -EFAULT;
+   }
+   *ppos += count;
+
+   return count;
+}
+
+static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
+