Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-09 Thread Alex Williamson
On Thu, 9 Aug 2018 14:21:29 +1000
Alexey Kardashevskiy  wrote:

> On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 02/08/2018 02:16, Alex Williamson wrote:  
> >> On Wed, 1 Aug 2018 18:37:35 +1000
> >> Alexey Kardashevskiy  wrote:
> >>  
> >>> On 01/08/2018 00:29, Alex Williamson wrote:  
>  On Tue, 31 Jul 2018 14:03:35 +1000
>  Alexey Kardashevskiy  wrote:
>  
> > On 31/07/2018 02:29, Alex Williamson wrote:
> >> On Mon, 30 Jul 2018 18:58:49 +1000
> >> Alexey Kardashevskiy  wrote:
> >>> After some local discussions, it was pointed out that force-disabling
> >>> nvlinks won't bring us much: for an nvlink to work, both sides need to
> >>> enable it, so a malicious guest cannot penetrate a good one (or the
> >>> host) unless the good guest enables the link, which won't happen with
> >>> a well-behaving guest. And if two guests become malicious, they can
> >>> still only harm each other, and they can already do that by other
> >>> means such as the network. This is different from PCIe: once a PCIe
> >>> link is unavoidably enabled, a well-behaving device cannot firewall
> >>> itself from peers as it is up to the upstream bridge(s) to decide the
> >>> routing; with nvlink2, a GPU still has means to protect itself, just
> >>> like a guest can run "firewalld" for the network.
> >>>
> >>> Although it would be a nice feature to have an extra barrier between
> >>> GPUs, is the inability to block the links in the hypervisor still a
> >>> blocker for V100 pass through?
> >>
> >> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >> specific routes configured?   
> >
> > The GPU-GPU links need not be blocked, but they need to be enabled
> > (==trained) by a driver in the guest. There are no routes between GPUs
> > in the NVLink fabric; these are direct links with just a switch on
> > each side, and both switches need to be on for a link to work.
> 
>  Ok, but there is at least the possibility of multiple direct links per
>  GPU, the very first diagram I find of NVlink shows 8 interconnected
>  GPUs:
> 
>  https://www.nvidia.com/en-us/data-center/nvlink/
> >>>
> >>> Our design is like the left part of the picture, but that is just a detail.
> >>
> >> Unless we can specifically identify a direct link vs a mesh link, we
> >> shouldn't be making assumptions about the degree of interconnect.
> >>
>  So if each switch enables one direct, point to point link, how does the
>  guest know which links to open for which peer device?
> >>>
> >>> It uses PCI config space on GPUs to discover the topology.  
> >>
> >> So do we need to virtualize this config space if we're going to
> >> virtualize the topology?
> >>  
>  And of course
>  since we can't see the spec, a security audit is at best hearsay :-\
> >>>
> >>> Yup, the exact discovery protocol is hidden.  
> >>
> >> It could be reverse engineered...
> >>  
> > The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> > is controlled via the emulated PCI bridges which I pass through together
> > with the GPU.
> 
>  So there's a special emulated switch, is that how the guest knows which
>  GPUs it can enable NVLinks to?
> >>>
> >>> Since it only has PCI config space (there is nothing relevant in the
> >>> device tree at all), I assume (double checking with the NVIDIA folks
> >>> now) the guest driver enables them all, tests which pair works and
> >>> disables the ones which do not. This gives a malicious guest a tiny
> >>> window of opportunity to break into a good guest. Hm :-/  
> >>
> >> Let's not minimize that window, that seems like a prime candidate for
> >> an exploit.
> >>  
> >> If the former, then isn't a non-malicious
> >> guest still susceptible to a malicious guest?  
> >
> > A non-malicious guest needs to turn its switch on for a link to a GPU
> > which belongs to a malicious guest.
> 
>  Actual security, or obfuscation, will we ever know...
> >>> If the latter, how is
> >> routing configured by the guest given that the guest view of the
> >> topology doesn't match physical hardware?  Are these routes
> >> deconfigured by device reset?  Are they part of the save/restore
> >> state?  Thanks,  
> 
>  Still curious what happens to these routes on reset.  Can a later user
>  of a GPU inherit a device where the links are already enabled?  Thanks,  
>    
> >>>
> >>> I am told that the GPU reset disables the links. As a side effect, we
> >>> get an HMI (a hardware fault which resets the host machine) when
> >>> trying to access the GPU RAM, which indicates that the link is down as
> >>> the memory is only accessible via the nvlink. We have special fencing
> >>> code in our host firmware (skiboot) to fence this memory on PCI reset
> >>> so reading from it returns zeroes instead of HMIs.

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-08 Thread Alexey Kardashevskiy



On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> 
> 
> On 02/08/2018 02:16, Alex Williamson wrote:
>> On Wed, 1 Aug 2018 18:37:35 +1000
>> Alexey Kardashevskiy  wrote:
>>
>>> On 01/08/2018 00:29, Alex Williamson wrote:
 On Tue, 31 Jul 2018 14:03:35 +1000
 Alexey Kardashevskiy  wrote:
   
> On 31/07/2018 02:29, Alex Williamson wrote:  
>> On Mon, 30 Jul 2018 18:58:49 +1000
>> Alexey Kardashevskiy  wrote:  
>>> After some local discussions, it was pointed out that force-disabling
>>> nvlinks won't bring us much: for an nvlink to work, both sides need to
>>> enable it, so a malicious guest cannot penetrate a good one (or the
>>> host) unless the good guest enables the link, which won't happen with
>>> a well-behaving guest. And if two guests become malicious, they can
>>> still only harm each other, and they can already do that by other
>>> means such as the network. This is different from PCIe: once a PCIe
>>> link is unavoidably enabled, a well-behaving device cannot firewall
>>> itself from peers as it is up to the upstream bridge(s) to decide the
>>> routing; with nvlink2, a GPU still has means to protect itself, just
>>> like a guest can run "firewalld" for the network.
>>>
>>> Although it would be a nice feature to have an extra barrier between
>>> GPUs, is the inability to block the links in the hypervisor still a
>>> blocker for V100 pass through?
>>
>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>> specific routes configured? 
>
> The GPU-GPU links need not be blocked, but they need to be enabled
> (==trained) by a driver in the guest. There are no routes between GPUs
> in the NVLink fabric; these are direct links with just a switch on
> each side, and both switches need to be on for a link to work.

 Ok, but there is at least the possibility of multiple direct links per
 GPU, the very first diagram I find of NVlink shows 8 interconnected
 GPUs:

 https://www.nvidia.com/en-us/data-center/nvlink/  
>>>
>>> Our design is like the left part of the picture, but that is just a detail.
>>
>> Unless we can specifically identify a direct link vs a mesh link, we
>> shouldn't be making assumptions about the degree of interconnect.
>>  
 So if each switch enables one direct, point to point link, how does the
 guest know which links to open for which peer device?  
>>>
>>> It uses PCI config space on GPUs to discover the topology.
>>
>> So do we need to virtualize this config space if we're going to
>> virtualize the topology?
>>
 And of course
 since we can't see the spec, a security audit is at best hearsay :-\  
>>>
>>> Yup, the exact discovery protocol is hidden.
>>
>> It could be reverse engineered...
>>
> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> is controlled via the emulated PCI bridges which I pass through together
> with the GPU.  

 So there's a special emulated switch, is that how the guest knows which
 GPUs it can enable NVLinks to?  
>>>
>>> Since it only has PCI config space (there is nothing relevant in the
>>> device tree at all), I assume (double checking with the NVIDIA folks
>>> now) the guest driver enables them all, tests which pair works and
>>> disables the ones which do not. This gives a malicious guest a tiny
>>> window of opportunity to break into a good guest. Hm :-/
>>
>> Let's not minimize that window, that seems like a prime candidate for
>> an exploit.
>>
>> If the former, then isn't a non-malicious
>> guest still susceptible to a malicious guest?
>
> A non-malicious guest needs to turn its switch on for a link to a GPU
> which belongs to a malicious guest.  

 Actual security, or obfuscation, will we ever know...  
>>> If the latter, how is  
>> routing configured by the guest given that the guest view of the
>> topology doesn't match physical hardware?  Are these routes
>> deconfigured by device reset?  Are they part of the save/restore
>> state?  Thanks,

 Still curious what happens to these routes on reset.  Can a later user
 of a GPU inherit a device where the links are already enabled?  Thanks,  
>>>
>>> I am told that the GPU reset disables the links. As a side effect, we
>>> get an HMI (a hardware fault which resets the host machine) when
>>> trying to access the GPU RAM, which indicates that the link is down as
>>> the memory is only accessible via the nvlink. We have special fencing
>>> code in our host firmware (skiboot) to fence this memory on PCI reset
>>> so reading from it returns zeroes instead of HMIs.
>>
>> What sort of reset is required for this?  Typically we rely on
>> secondary bus reset for GPUs, but it would be a problem if GPUs were to
>> start implementing FLR and nobody had a spec to learn that FLR maybe
>> didn't disable the link.  The better approach to me still seems to be
>> virtualizing these NVLink config registers to an extent that the user
>> can only enable links where they have ownership of both ends of the
>> connection.  Thanks,

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-08 Thread Alexey Kardashevskiy



On 02/08/2018 02:16, Alex Williamson wrote:
> On Wed, 1 Aug 2018 18:37:35 +1000
> Alexey Kardashevskiy  wrote:
> 
>> On 01/08/2018 00:29, Alex Williamson wrote:
>>> On Tue, 31 Jul 2018 14:03:35 +1000
>>> Alexey Kardashevskiy  wrote:
>>>   
 On 31/07/2018 02:29, Alex Williamson wrote:  
> On Mon, 30 Jul 2018 18:58:49 +1000
> Alexey Kardashevskiy  wrote:  
>> After some local discussions, it was pointed out that force-disabling
>> nvlinks won't bring us much: for an nvlink to work, both sides need to
>> enable it, so a malicious guest cannot penetrate a good one (or the
>> host) unless the good guest enables the link, which won't happen with
>> a well-behaving guest. And if two guests become malicious, they can
>> still only harm each other, and they can already do that by other
>> means such as the network. This is different from PCIe: once a PCIe
>> link is unavoidably enabled, a well-behaving device cannot firewall
>> itself from peers as it is up to the upstream bridge(s) to decide the
>> routing; with nvlink2, a GPU still has means to protect itself, just
>> like a guest can run "firewalld" for the network.
>>
>> Although it would be a nice feature to have an extra barrier between
>> GPUs, is the inability to block the links in the hypervisor still a
>> blocker for V100 pass through?
>
> How is the NVLink configured by the guest, is it 'on'/'off' or are
> specific routes configured? 

 The GPU-GPU links need not be blocked, but they need to be enabled
 (==trained) by a driver in the guest. There are no routes between GPUs
 in the NVLink fabric; these are direct links with just a switch on
 each side, and both switches need to be on for a link to work.
>>>
>>> Ok, but there is at least the possibility of multiple direct links per
>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
>>> GPUs:
>>>
>>> https://www.nvidia.com/en-us/data-center/nvlink/  
>>
>> Our design is like the left part of the picture, but that is just a detail.
> 
> Unless we can specifically identify a direct link vs a mesh link, we
> shouldn't be making assumptions about the degree of interconnect.
>  
>>> So if each switch enables one direct, point to point link, how does the
>>> guest know which links to open for which peer device?  
>>
>> It uses PCI config space on GPUs to discover the topology.
> 
> So do we need to virtualize this config space if we're going to
> virtualize the topology?
> 
>>> And of course
>>> since we can't see the spec, a security audit is at best hearsay :-\  
>>
>> Yup, the exact discovery protocol is hidden.
> 
> It could be reverse engineered...
> 
 The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
 is controlled via the emulated PCI bridges which I pass through together
 with the GPU.  
>>>
>>> So there's a special emulated switch, is that how the guest knows which
>>> GPUs it can enable NVLinks to?  
>>
>> Since it only has PCI config space (there is nothing relevant in the
>> device tree at all), I assume (double checking with the NVIDIA folks
>> now) the guest driver enables them all, tests which pair works and
>> disables the ones which do not. This gives a malicious guest a tiny
>> window of opportunity to break into a good guest. Hm :-/
> 
> Let's not minimize that window, that seems like a prime candidate for
> an exploit.
> 
> If the former, then isn't a non-malicious
> guest still susceptible to a malicious guest?

 A non-malicious guest needs to turn its switch on for a link to a GPU
 which belongs to a malicious guest.  
>>>
>>> Actual security, or obfuscation, will we ever know...  
>> If the latter, how is  
> routing configured by the guest given that the guest view of the
> topology doesn't match physical hardware?  Are these routes
> deconfigured by device reset?  Are they part of the save/restore
> state?  Thanks,
>>>
>>> Still curious what happens to these routes on reset.  Can a later user
>>> of a GPU inherit a device where the links are already enabled?  Thanks,  
>>
>> I am told that the GPU reset disables the links. As a side effect, we
>> get an HMI (a hardware fault which resets the host machine) when
>> trying to access the GPU RAM, which indicates that the link is down as
>> the memory is only accessible via the nvlink. We have special fencing
>> code in our host firmware (skiboot) to fence this memory on PCI reset
>> so reading from it returns zeroes instead of HMIs.
> 
> What sort of reset is required for this?  Typically we rely on
> secondary bus reset for GPUs, but it would be a problem if GPUs were to
> start implementing FLR and nobody had a spec to learn that FLR maybe
> didn't disable the link.  The better approach to me still seems to be
> virtualizing these NVLink config registers to an extent that the user
> can only enable links where they have ownership of both ends of the
> connection.  Thanks,



Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-01 Thread Alex Williamson
On Wed, 1 Aug 2018 18:37:35 +1000
Alexey Kardashevskiy  wrote:

> On 01/08/2018 00:29, Alex Williamson wrote:
> > On Tue, 31 Jul 2018 14:03:35 +1000
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 31/07/2018 02:29, Alex Williamson wrote:  
> >>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>> Alexey Kardashevskiy  wrote:  
>  After some local discussions, it was pointed out that force-disabling
>  nvlinks won't bring us much: for an nvlink to work, both sides need to
>  enable it, so a malicious guest cannot penetrate a good one (or the
>  host) unless the good guest enables the link, which won't happen with
>  a well-behaving guest. And if two guests become malicious, they can
>  still only harm each other, and they can already do that by other
>  means such as the network. This is different from PCIe: once a PCIe
>  link is unavoidably enabled, a well-behaving device cannot firewall
>  itself from peers as it is up to the upstream bridge(s) to decide the
>  routing; with nvlink2, a GPU still has means to protect itself, just
>  like a guest can run "firewalld" for the network.
> 
>  Although it would be a nice feature to have an extra barrier between
>  GPUs, is the inability to block the links in the hypervisor still a
>  blocker for V100 pass through?
> >>>
> >>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>> specific routes configured? 
> >>
> >> The GPU-GPU links need not be blocked, but they need to be enabled
> >> (==trained) by a driver in the guest. There are no routes between GPUs
> >> in the NVLink fabric; these are direct links with just a switch on
> >> each side, and both switches need to be on for a link to work.
> > 
> > Ok, but there is at least the possibility of multiple direct links per
> > GPU, the very first diagram I find of NVlink shows 8 interconnected
> > GPUs:
> > 
> > https://www.nvidia.com/en-us/data-center/nvlink/  
> 
> Our design is like the left part of the picture, but that is just a detail.

Unless we can specifically identify a direct link vs a mesh link, we
shouldn't be making assumptions about the degree of interconnect.
 
> > So if each switch enables one direct, point to point link, how does the
> > guest know which links to open for which peer device?  
> 
> It uses PCI config space on GPUs to discover the topology.

So do we need to virtualize this config space if we're going to
virtualize the topology?

> > And of course
> > since we can't see the spec, a security audit is at best hearsay :-\  
> 
> Yup, the exact discovery protocol is hidden.

It could be reverse engineered...

> >> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >> is controlled via the emulated PCI bridges which I pass through together
> >> with the GPU.  
> > 
> > So there's a special emulated switch, is that how the guest knows which
> > GPUs it can enable NVLinks to?  
> 
> Since it only has PCI config space (there is nothing relevant in the
> device tree at all), I assume (double checking with the NVIDIA folks
> now) the guest driver enables them all, tests which pair works and
> disables the ones which do not. This gives a malicious guest a tiny
> window of opportunity to break into a good guest. Hm :-/

Let's not minimize that window, that seems like a prime candidate for
an exploit.

> >>> If the former, then isn't a non-malicious
> >>> guest still susceptible to a malicious guest?
> >>
> >> A non-malicious guest needs to turn its switch on for a link to a GPU
> >> which belongs to a malicious guest.  
> > 
> > Actual security, or obfuscation, will we ever know...  
>  If the latter, how is  
> >>> routing configured by the guest given that the guest view of the
> >>> topology doesn't match physical hardware?  Are these routes
> >>> deconfigured by device reset?  Are they part of the save/restore
> >>> state?  Thanks,
> > 
> > Still curious what happens to these routes on reset.  Can a later user
> > of a GPU inherit a device where the links are already enabled?  Thanks,  
> 
> I am told that the GPU reset disables the links. As a side effect, we
> get an HMI (a hardware fault which resets the host machine) when
> trying to access the GPU RAM, which indicates that the link is down as
> the memory is only accessible via the nvlink. We have special fencing
> code in our host firmware (skiboot) to fence this memory on PCI reset
> so reading from it returns zeroes instead of HMIs.

What sort of reset is required for this?  Typically we rely on
secondary bus reset for GPUs, but it would be a problem if GPUs were to
start implementing FLR and nobody had a spec to learn that FLR maybe
didn't disable the link.  The better approach to me still seems to be
virtualizing these NVLink config registers to an extent that the user
can only enable links where they have ownership of both ends of the
connection.  Thanks,

Alex
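
To make the suggestion above concrete: the idea is that the hypervisor traps
guest writes to whatever register enables a link and only lets an enable bit
through when the same user owns the device at the other end. A minimal sketch
of that filtering step follows; since the NVLink2 register layout is not
public, the offset, bit layout and helper below are entirely hypothetical
placeholders, not the real interface.

/* Sketch only: NVLINK_CTRL_OFFSET, NVLINK_EN_BIT() and
 * peer_owned_by_same_user() are hypothetical names; the real NVLink2
 * config registers are undocumented.
 */
#include <stdbool.h>
#include <stdint.h>

#define NVLINK_CTRL_OFFSET   0x700           /* hypothetical register offset */
#define NVLINK_MAX_LINKS     6
#define NVLINK_EN_BIT(link)  (1u << (link))  /* hypothetical bit layout */

/* Would be answered by the hypervisor's ownership tracking; stubbed
 * here so the sketch compiles. */
static bool peer_owned_by_same_user(unsigned int link)
{
	(void)link;
	return false;
}

/* Called when the guest writes @val to the (hypothetical) link-control
 * register; returns the value actually forwarded to hardware. */
uint32_t filter_nvlink_ctrl_write(uint32_t val)
{
	unsigned int link;

	for (link = 0; link < NVLINK_MAX_LINKS; link++)
		if ((val & NVLINK_EN_BIT(link)) &&
		    !peer_owned_by_same_user(link))
			val &= ~NVLINK_EN_BIT(link); /* refuse to enable */

	return val;
}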


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-01 Thread Alexey Kardashevskiy



On 01/08/2018 00:29, Alex Williamson wrote:
> On Tue, 31 Jul 2018 14:03:35 +1000
> Alexey Kardashevskiy  wrote:
> 
>> On 31/07/2018 02:29, Alex Williamson wrote:
>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>> Alexey Kardashevskiy  wrote:
 After some local discussions, it was pointed out that force-disabling
 nvlinks won't bring us much: for an nvlink to work, both sides need to
 enable it, so a malicious guest cannot penetrate a good one (or the
 host) unless the good guest enables the link, which won't happen with
 a well-behaving guest. And if two guests become malicious, they can
 still only harm each other, and they can already do that by other
 means such as the network. This is different from PCIe: once a PCIe
 link is unavoidably enabled, a well-behaving device cannot firewall
 itself from peers as it is up to the upstream bridge(s) to decide the
 routing; with nvlink2, a GPU still has means to protect itself, just
 like a guest can run "firewalld" for the network.

 Although it would be a nice feature to have an extra barrier between
 GPUs, is the inability to block the links in the hypervisor still a
 blocker for V100 pass through?
>>>
>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>> specific routes configured?   
>>
>> The GPU-GPU links need not be blocked, but they need to be enabled
>> (==trained) by a driver in the guest. There are no routes between GPUs
>> in the NVLink fabric; these are direct links with just a switch on
>> each side, and both switches need to be on for a link to work.
> 
> Ok, but there is at least the possibility of multiple direct links per
> GPU, the very first diagram I find of NVlink shows 8 interconnected
> GPUs:
> 
> https://www.nvidia.com/en-us/data-center/nvlink/

Our design is like the left part of the picture, but that is just a detail.

> So if each switch enables one direct, point to point link, how does the
> guest know which links to open for which peer device?

It uses PCI config space on GPUs to discover the topology.

> And of course
> since we can't see the spec, a security audit is at best hearsay :-\

Yup, the exact discovery protocol is hidden.


>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>> is controlled via the emulated PCI bridges which I pass through together
>> with the GPU.
> 
> So there's a special emulated switch, is that how the guest knows which
> GPUs it can enable NVLinks to?

Since it only has PCI config space (there is nothing relevant in the
device tree at all), I assume (double checking with the NVIDIA folks
now) the guest driver enables them all, tests which pair works and
disables the ones which do not. This gives a malicious guest a tiny
window of opportunity to break into a good guest. Hm :-/


>>> If the former, then isn't a non-malicious
>>> guest still susceptible to a malicious guest?  
>>
>> A non-malicious guest needs to turn its switch on for a link to a GPU
>> which belongs to a malicious guest.
> 
> Actual security, or obfuscation, will we ever know...
 If the latter, how is
>>> routing configured by the guest given that the guest view of the
>>> topology doesn't match physical hardware?  Are these routes
>>> deconfigured by device reset?  Are they part of the save/restore
>>> state?  Thanks,  
> 
> Still curious what happens to these routes on reset.  Can a later user
> of a GPU inherit a device where the links are already enabled?  Thanks,

I am told that the GPU reset disables the links. As a side effect, we
get an HMI (a hardware fault which resets the host machine) when
trying to access the GPU RAM, which indicates that the link is down as
the memory is only accessible via the nvlink. We have special fencing
code in our host firmware (skiboot) to fence this memory on PCI reset
so reading from it returns zeroes instead of HMIs.



-- 
Alexey
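
For reference, the config space the guest driver pokes at for this discovery
is the same config space userspace already sees through VFIO. A minimal
sketch, assuming the device fd was obtained through the usual VFIO
group/container setup, that locates the config region and reads the
vendor/device IDs:

#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Read the PCI vendor/device IDs of a VFIO-assigned device.
 * 'device' is a device fd returned by VFIO_GROUP_GET_DEVICE_FD. */
int dump_config_ids(int device)
{
	struct vfio_region_info reg = {
		.argsz = sizeof(reg),
		.index = VFIO_PCI_CONFIG_REGION_INDEX,
	};
	uint16_t vendor, devid;

	if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg))
		return -1;

	/* Config space is accessed with plain read/write at reg.offset */
	if (pread(device, &vendor, sizeof(vendor), reg.offset + 0x00) != sizeof(vendor) ||
	    pread(device, &devid,  sizeof(devid),  reg.offset + 0x02) != sizeof(devid))
		return -1;

	printf("vendor %04x device %04x\n", vendor, devid);
	return 0;
}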


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-31 Thread Alex Williamson
On Tue, 31 Jul 2018 14:03:35 +1000
Alexey Kardashevskiy  wrote:

> On 31/07/2018 02:29, Alex Williamson wrote:
> > On Mon, 30 Jul 2018 18:58:49 +1000
> > Alexey Kardashevskiy  wrote:
> >> After some local discussions, it was pointed out that force-disabling
> >> nvlinks won't bring us much: for an nvlink to work, both sides need to
> >> enable it, so a malicious guest cannot penetrate a good one (or the
> >> host) unless the good guest enables the link, which won't happen with
> >> a well-behaving guest. And if two guests become malicious, they can
> >> still only harm each other, and they can already do that by other
> >> means such as the network. This is different from PCIe: once a PCIe
> >> link is unavoidably enabled, a well-behaving device cannot firewall
> >> itself from peers as it is up to the upstream bridge(s) to decide the
> >> routing; with nvlink2, a GPU still has means to protect itself, just
> >> like a guest can run "firewalld" for the network.
> >>
> >> Although it would be a nice feature to have an extra barrier between
> >> GPUs, is the inability to block the links in the hypervisor still a
> >> blocker for V100 pass through?
> > 
> > How is the NVLink configured by the guest, is it 'on'/'off' or are
> > specific routes configured?   
> 
> The GPU-GPU links need not be blocked, but they need to be enabled
> (==trained) by a driver in the guest. There are no routes between GPUs
> in the NVLink fabric; these are direct links with just a switch on
> each side, and both switches need to be on for a link to work.

Ok, but there is at least the possibility of multiple direct links per
GPU, the very first diagram I find of NVlink shows 8 interconnected
GPUs:

https://www.nvidia.com/en-us/data-center/nvlink/

So if each switch enables one direct, point to point link, how does the
guest know which links to open for which peer device?  And of course
since we can't see the spec, a security audit is at best hearsay :-\
 
> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> is controlled via the emulated PCI bridges which I pass through together
> with the GPU.

So there's a special emulated switch, is that how the guest knows which
GPUs it can enable NVLinks to?

> > If the former, then isn't a non-malicious
> > guest still susceptible to a malicious guest?  
> 
> A non-malicious guest needs to turn its switch on for a link to a GPU
> which belongs to a malicious guest.

Actual security, or obfuscation, will we ever know...

> > If the latter, how is
> > routing configured by the guest given that the guest view of the
> > topology doesn't match physical hardware?  Are these routes
> > deconfigured by device reset?  Are they part of the save/restore
> > state?  Thanks,  

Still curious what happens to these routes on reset.  Can a later user
of a GPU inherit a device where the links are already enabled?  Thanks,

Alex
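
On the reset question, all a VFIO user can request itself is the generic
reset below; whether vfio-pci backs that with FLR, a secondary bus reset or
something else, and whether that also drops the NVLinks, is exactly the part
that cannot be audited without a spec. A minimal sketch using the documented
ioctls, assuming 'device' is a VFIO device fd:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Ask VFIO to reset an assigned device, if the kernel reports that a
 * reset mechanism is available for it. */
int reset_device(int device)
{
	struct vfio_device_info info = { .argsz = sizeof(info) };

	if (ioctl(device, VFIO_DEVICE_GET_INFO, &info))
		return -1;

	if (!(info.flags & VFIO_DEVICE_FLAGS_RESET)) {
		fprintf(stderr, "no reset mechanism exposed for this device\n");
		return -1;
	}

	/* Which underlying reset is used (FLR, bus reset, ...) is decided
	 * by vfio-pci; that choice is what matters for the link state. */
	return ioctl(device, VFIO_DEVICE_RESET);
}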


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-30 Thread Alexey Kardashevskiy



On 31/07/2018 02:29, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:58:49 +1000
> Alexey Kardashevskiy  wrote:
> 
>> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
>>> On Tue, 10 Jul 2018 16:37:15 -0600
>>> Alex Williamson  wrote:
>>>   
 On Tue, 10 Jul 2018 14:10:20 +1000
 Alexey Kardashevskiy  wrote:
  
> On Thu, 7 Jun 2018 23:03:23 -0600
> Alex Williamson  wrote:
> 
>> On Fri, 8 Jun 2018 14:14:23 +1000
>> Alexey Kardashevskiy  wrote:
>>   
>>> On 8/6/18 1:44 pm, Alex Williamson wrote:
 On Fri, 8 Jun 2018 13:08:54 +1000
 Alexey Kardashevskiy  wrote:
   
> On 8/6/18 8:15 am, Alex Williamson wrote:  
>> On Fri, 08 Jun 2018 07:54:02 +1000
>> Benjamin Herrenschmidt  wrote:
>> 
>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:   
>>>  

 Can we back up and discuss whether the IOMMU grouping of NVLink
 connected devices makes sense?  AIUI we have a PCI view of these
 devices and from that perspective they're isolated.  That's the 
 view of
 the device used to generate the grouping.  However, not visible to 
 us,
 these devices are interconnected via NVLink.  What isolation 
 properties
 does NVLink provide given that its entire purpose for existing 
 seems to
 be to provide a high performance link for p2p between devices? 
  
>>>
>>> Not entire. On POWER chips, we also have an nvlink between the 
>>> device
>>> and the CPU which is running significantly faster than PCIe.
>>>
>>> But yes, there are cross-links and those should probably be 
>>> accounted
>>> for in the grouping.
>>
>> Then after we fix the grouping, can we just let the host driver 
>> manage
>> this coherent memory range and expose vGPUs to guests?  The use case 
>> of
>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>> convince NVIDIA to support more than a single vGPU per VM though)
>> 
>
> These are physical GPUs, not virtual sriov-alike things they are
> implementing as well elsewhere.  

 vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
 either.  That's why we have mdev devices now to implement software
 defined devices.  I don't have first hand experience with V-series, but
 I would absolutely expect a PCIe-based Tesla V100 to support vGPU. 
  
>>>
>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset 
>>> and
>>> using mediated vGPUs instead, correct?
>>
>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>> account for lack of isolation on the NVLink side and we correct that,
>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>> whether they choose to support vGPU on these GPUs or whether they can
>> be convinced to support multiple vGPUs per VM.
>>   
> My current understanding is that every P9 chip in that box has some 
> NVLink2
> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 
> links
> as well.
>
> From small bits of information I have it seems that a GPU can 
> perfectly
> work alone and if the NVIDIA driver does not see these interconnects
> (because we do not pass the rest of the big 3xGPU group to this 
> guest), it
> continues with a single GPU. There is an "nvidia-smi -r" big reset 
> hammer
> which simply refuses to work until all 3 GPUs are passed so there is 
> some
> distinction between passing 1 or 3 GPUs, and I am trying (as we 
> speak) to
> get a confirmation from NVIDIA that it is ok to pass just a single 
> GPU.
>
> So we will either have 6 groups (one per GPU) or 2 groups (one per
> interconnected group).  

 I'm not gaining much confidence that we can rely on isolation between
 NVLink connected GPUs, it sounds like you're simply expecting that
 proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
 is going to play nice and nobody will figure out how to do bad things
 because... obfuscation?  Thanks,  
>>>
>>> Well, we already believe that a proprietary firmware of a sriov-capable
> >>> adapter like Mellanox ConnectX is not doing bad things, how is this
> >>> different in principle?

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-30 Thread Alex Williamson
On Mon, 30 Jul 2018 18:58:49 +1000
Alexey Kardashevskiy  wrote:

> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
> > On Tue, 10 Jul 2018 16:37:15 -0600
> > Alex Williamson  wrote:
> >   
> >> On Tue, 10 Jul 2018 14:10:20 +1000
> >> Alexey Kardashevskiy  wrote:
> >>  
> >>> On Thu, 7 Jun 2018 23:03:23 -0600
> >>> Alex Williamson  wrote:
> >>> 
>  On Fri, 8 Jun 2018 14:14:23 +1000
>  Alexey Kardashevskiy  wrote:
>    
> > On 8/6/18 1:44 pm, Alex Williamson wrote:
> >> On Fri, 8 Jun 2018 13:08:54 +1000
> >> Alexey Kardashevskiy  wrote:
> >>   
> >>> On 8/6/18 8:15 am, Alex Williamson wrote:  
>  On Fri, 08 Jun 2018 07:54:02 +1000
>  Benjamin Herrenschmidt  wrote:
>  
> > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:   
> >  
> >>
> >> Can we back up and discuss whether the IOMMU grouping of NVLink
> >> connected devices makes sense?  AIUI we have a PCI view of these
> >> devices and from that perspective they're isolated.  That's the 
> >> view of
> >> the device used to generate the grouping.  However, not visible to 
> >> us,
> >> these devices are interconnected via NVLink.  What isolation 
> >> properties
> >> does NVLink provide given that its entire purpose for existing 
> >> seems to
> >> be to provide a high performance link for p2p between devices? 
> >>  
> >
> > Not entire. On POWER chips, we also have an nvlink between the 
> > device
> > and the CPU which is running significantly faster than PCIe.
> >
> > But yes, there are cross-links and those should probably be 
> > accounted
> > for in the grouping.
> 
>  Then after we fix the grouping, can we just let the host driver 
>  manage
>  this coherent memory range and expose vGPUs to guests?  The use case 
>  of
>  assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>  convince NVIDIA to support more than a single vGPU per VM though)
>  
> >>>
> >>> These are physical GPUs, not virtual sriov-alike things they are
> >>> implementing as well elsewhere.  
> >>
> >> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> >> either.  That's why we have mdev devices now to implement software
> >> defined devices.  I don't have first hand experience with V-series, but
> >> I would absolutely expect a PCIe-based Tesla V100 to support vGPU. 
> >>  
> >
> > So assuming V100 can do vGPU, you are suggesting ditching this patchset 
> > and
> > using mediated vGPUs instead, correct?
> 
>  If it turns out that our PCIe-only-based IOMMU grouping doesn't
>  account for lack of isolation on the NVLink side and we correct that,
>  limiting assignment to sets of 3 interconnected GPUs, is that still a
>  useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>  whether they choose to support vGPU on these GPUs or whether they can
>  be convinced to support multiple vGPUs per VM.
>    
> >>> My current understanding is that every P9 chip in that box has some 
> >>> NVLink2
> >>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 
> >>> links
> >>> as well.
> >>>
> >>> From small bits of information I have it seems that a GPU can 
> >>> perfectly
> >>> work alone and if the NVIDIA driver does not see these interconnects
> >>> (because we do not pass the rest of the big 3xGPU group to this 
> >>> guest), it
> >>> continues with a single GPU. There is an "nvidia-smi -r" big reset 
> >>> hammer
> >>> which simply refuses to work until all 3 GPUs are passed so there is 
> >>> some
> >>> distinction between passing 1 or 3 GPUs, and I am trying (as we 
> >>> speak) to
> >>> get a confirmation from NVIDIA that it is ok to pass just a single 
> >>> GPU.
> >>>
> >>> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >>> interconnected group).  
> >>
> >> I'm not gaining much confidence that we can rely on isolation between
> >> NVLink connected GPUs, it sounds like you're simply expecting that
> >> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> >> is going to play nice and nobody will figure out how to do bad things
> >> because... obfuscation?  Thanks,  
> >
> > Well, we already believe that a proprietary firmware of a sriov-capable
> > adapter like Mellanox ConnectX is not doing bad things, how is this
> > different in principle?
> 
> 

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-30 Thread Alexey Kardashevskiy



On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
> On Tue, 10 Jul 2018 16:37:15 -0600
> Alex Williamson  wrote:
> 
>> On Tue, 10 Jul 2018 14:10:20 +1000
>> Alexey Kardashevskiy  wrote:
>>
>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>> Alex Williamson  wrote:
>>>   
 On Fri, 8 Jun 2018 14:14:23 +1000
 Alexey Kardashevskiy  wrote:
 
> On 8/6/18 1:44 pm, Alex Williamson wrote:  
>> On Fri, 8 Jun 2018 13:08:54 +1000
>> Alexey Kardashevskiy  wrote:
>> 
>>> On 8/6/18 8:15 am, Alex Williamson wrote:
 On Fri, 08 Jun 2018 07:54:02 +1000
 Benjamin Herrenschmidt  wrote:
   
> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
>>
>> Can we back up and discuss whether the IOMMU grouping of NVLink
>> connected devices makes sense?  AIUI we have a PCI view of these
>> devices and from that perspective they're isolated.  That's the view 
>> of
>> the device used to generate the grouping.  However, not visible to 
>> us,
>> these devices are interconnected via NVLink.  What isolation 
>> properties
>> does NVLink provide given that its entire purpose for existing seems 
>> to
>> be to provide a high performance link for p2p between devices?   
>>  
>
> Not entire. On POWER chips, we also have an nvlink between the device
> and the CPU which is running significantly faster than PCIe.
>
> But yes, there are cross-links and those should probably be accounted
> for in the grouping.  

 Then after we fix the grouping, can we just let the host driver manage
 this coherent memory range and expose vGPUs to guests?  The use case of
 assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
 convince NVIDIA to support more than a single vGPU per VM though)  
 
>>>
>>> These are physical GPUs, not virtual sriov-alike things they are
>>> implementing as well elsewhere.
>>
>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>> either.  That's why we have mdev devices now to implement software
>> defined devices.  I don't have first hand experience with V-series, but
>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.   
>>  
>
> So assuming V100 can do vGPU, you are suggesting ditching this patchset 
> and
> using mediated vGPUs instead, correct?  

 If it turns out that our PCIe-only-based IOMMU grouping doesn't
 account for lack of isolation on the NVLink side and we correct that,
 limiting assignment to sets of 3 interconnected GPUs, is that still a
 useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
 whether they choose to support vGPU on these GPUs or whether they can
 be convinced to support multiple vGPUs per VM.
 
>>> My current understanding is that every P9 chip in that box has some 
>>> NVLink2
>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 
>>> links
>>> as well.
>>>
>>> From small bits of information I have it seems that a GPU can perfectly
>>> work alone and if the NVIDIA driver does not see these interconnects
>>> (because we do not pass the rest of the big 3xGPU group to this guest), 
>>> it
>>> continues with a single GPU. There is an "nvidia-smi -r" big reset 
>>> hammer
>>> which simply refuses to work until all 3 GPUs are passed so there is 
>>> some
>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) 
>>> to
>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>>
>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>> interconnected group).
>>
>> I'm not gaining much confidence that we can rely on isolation between
>> NVLink connected GPUs, it sounds like you're simply expecting that
>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
>> is going to play nice and nobody will figure out how to do bad things
>> because... obfuscation?  Thanks,
>
> Well, we already believe that a proprietary firmware of a sriov-capable
> > adapter like Mellanox ConnectX is not doing bad things, how is this
> different in principle?  

 It seems like the scope and hierarchy are different.  Here we're
 talking about exposing big discrete devices, which are peers of one
 another (and have history of being reverse engineered), to userspace
 drivers.  Once handed to userspace, each of those devices needs to be
 considered untrusted.  In the case of SR-IOV, we typically have a
 trusted host driver for the PF managing untrusted VFs.

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-11 Thread Alexey Kardashevskiy
On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson  wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy  wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson  wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy  wrote:
> > > 
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:  
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy  wrote:
> > > > > 
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt  wrote:
> > > > >>>   
> > > >  On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > > > >
> > > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > > devices and from that perspective they're isolated.  That's the 
> > > > > view of
> > > > > the device used to generate the grouping.  However, not visible 
> > > > > to us,
> > > > > these devices are interconnected via NVLink.  What isolation 
> > > > > properties
> > > > > does NVLink provide given that its entire purpose for existing 
> > > > > seems to
> > > > > be to provide a high performance link for p2p between devices?
> > > > > 
> > > > 
> > > >  Not entire. On POWER chips, we also have an nvlink between the 
> > > >  device
> > > >  and the CPU which is running significantly faster than PCIe.
> > > > 
> > > >  But yes, there are cross-links and those should probably be 
> > > >  accounted
> > > >  for in the grouping.  
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver 
> > > > >>> manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use 
> > > > >>> case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)   
> > > > >>>
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, 
> > > > > but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
> > > > > 
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset 
> > > > and
> > > > using mediated vGPUs instead, correct?  
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > > 
> > > > >> My current understanding is that every P9 chip in that box has some 
> > > > >> NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 
> > > > >> links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can 
> > > > >> perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this 
> > > > >> guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset 
> > > > >> hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is 
> > > > >> some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we 
> > > > >> speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single 
> > > > >> GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnectX is not doing bad things, how is this
> > > > different in principle?  
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-10 Thread Alex Williamson
On Tue, 10 Jul 2018 14:10:20 +1000
Alexey Kardashevskiy  wrote:

> On Thu, 7 Jun 2018 23:03:23 -0600
> Alex Williamson  wrote:
> 
> > On Fri, 8 Jun 2018 14:14:23 +1000
> > Alexey Kardashevskiy  wrote:
> >   
> > > On 8/6/18 1:44 pm, Alex Williamson wrote:
> > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > Alexey Kardashevskiy  wrote:
> > > >   
> > > >> On 8/6/18 8:15 am, Alex Williamson wrote:  
> > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > >>> Benjamin Herrenschmidt  wrote:
> > > >>> 
> > >  On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > > >
> > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > devices and from that perspective they're isolated.  That's the 
> > > > view of
> > > > the device used to generate the grouping.  However, not visible to 
> > > > us,
> > > > these devices are interconnected via NVLink.  What isolation 
> > > > properties
> > > > does NVLink provide given that its entire purpose for existing 
> > > > seems to
> > > > be to provide a high performance link for p2p between devices?  
> > > > 
> > > 
> > >  Not entire. On POWER chips, we also have an nvlink between the device
> > >  and the CPU which is running significantly faster than PCIe.
> > > 
> > >  But yes, there are cross-links and those should probably be accounted
> > >  for in the grouping.
> > > >>>
> > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > >>> this coherent memory range and expose vGPUs to guests?  The use case 
> > > >>> of
> > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > >>> convince NVIDIA to support more than a single vGPU per VM though) 
> > > >>>
> > > >>
> > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > >> implementing as well elsewhere.  
> > > > 
> > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > either.  That's why we have mdev devices now to implement software
> > > > defined devices.  I don't have first hand experience with V-series, but
> > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.  
> > > 
> > > So assuming V100 can do vGPU, you are suggesting ditching this patchset 
> > > and
> > > using mediated vGPUs instead, correct?
> > 
> > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > account for lack of isolation on the NVLink side and we correct that,
> > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > whether they choose to support vGPU on these GPUs or whether they can
> > be convinced to support multiple vGPUs per VM.
> >   
> > > >> My current understanding is that every P9 chip in that box has some 
> > > >> NVLink2
> > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 
> > > >> links
> > > >> as well.
> > > >>
> > > >> From small bits of information I have it seems that a GPU can perfectly
> > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > >> (because we do not pass the rest of the big 3xGPU group to this 
> > > >> guest), it
> > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset 
> > > >> hammer
> > > >> which simply refuses to work until all 3 GPUs are passed so there is 
> > > >> some
> > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) 
> > > >> to
> > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > >>
> > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > >> interconnected group).  
> > > > 
> > > > I'm not gaining much confidence that we can rely on isolation between
> > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > is going to play nice and nobody will figure out how to do bad things
> > > > because... obfuscation?  Thanks,  
> > > 
> > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > adapter like Mellanox ConnectX is not doing bad things, how is this
> > > different in principle?
> > 
> > It seems like the scope and hierarchy are different.  Here we're
> > talking about exposing big discrete devices, which are peers of one
> > another (and have history of being reverse engineered), to userspace
> > drivers.  Once handed to userspace, each of those devices needs to be
> > considered untrusted.  In the case of SR-IOV, we typically have a
> > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > some sanity in the hardware/firmware in isolating the VFs from each
> > other and from the PF, but we also often have source code for Linux
> > drivers for these devices and sometimes even datasheets.

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-09 Thread Alexey Kardashevskiy
On Thu, 7 Jun 2018 23:03:23 -0600
Alex Williamson  wrote:

> On Fri, 8 Jun 2018 14:14:23 +1000
> Alexey Kardashevskiy  wrote:
> 
> > On 8/6/18 1:44 pm, Alex Williamson wrote:  
> > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > Alexey Kardashevskiy  wrote:
> > > 
> > >> On 8/6/18 8:15 am, Alex Williamson wrote:
> > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > >>> Benjamin Herrenschmidt  wrote:
> > >>>   
> >  On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > >
> > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > connected devices makes sense?  AIUI we have a PCI view of these
> > > devices and from that perspective they're isolated.  That's the view 
> > > of
> > > the device used to generate the grouping.  However, not visible to us,
> > > these devices are interconnected via NVLink.  What isolation 
> > > properties
> > > does NVLink provide given that its entire purpose for existing seems 
> > > to
> > > be to provide a high performance link for p2p between devices?
> > 
> >  Not entire. On POWER chips, we also have an nvlink between the device
> >  and the CPU which is running significantly faster than PCIe.
> > 
> >  But yes, there are cross-links and those should probably be accounted
> >  for in the grouping.  
> > >>>
> > >>> Then after we fix the grouping, can we just let the host driver manage
> > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > >>> convince NVIDIA to support more than a single vGPU per VM though)  
> > >>
> > >> These are physical GPUs, not virtual sriov-alike things they are
> > >> implementing as well elsewhere.
> > > 
> > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > either.  That's why we have mdev devices now to implement software
> > > defined devices.  I don't have first hand experience with V-series, but
> > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
> > 
> > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > using mediated vGPUs instead, correct?  
> 
> If it turns out that our PCIe-only-based IOMMU grouping doesn't
> account for lack of isolation on the NVLink side and we correct that,
> limiting assignment to sets of 3 interconnected GPUs, is that still a
> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> whether they choose to support vGPU on these GPUs or whether they can
> be convinced to support multiple vGPUs per VM.
> 
> > >> My current understanding is that every P9 chip in that box has some 
> > >> NVLink2
> > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > >> as well.
> > >>
> > >> From small bits of information I have it seems that a GPU can perfectly
> > >> work alone and if the NVIDIA driver does not see these interconnects
> > >> (because we do not pass the rest of the big 3xGPU group to this guest), 
> > >> it
> > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > >>
> > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > >> interconnected group).
> > > 
> > > I'm not gaining much confidence that we can rely on isolation between
> > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > is going to play nice and nobody will figure out how to do bad things
> > > because... obfuscation?  Thanks,
> > 
> > Well, we already believe that a proprietary firmware of a sriov-capable
> > adapter like Mellanox ConnectX is not doing bad things, how is this
> > different in principle?  
> 
> It seems like the scope and hierarchy are different.  Here we're
> talking about exposing big discrete devices, which are peers of one
> another (and have history of being reverse engineered), to userspace
> drivers.  Once handed to userspace, each of those devices needs to be
> considered untrusted.  In the case of SR-IOV, we typically have a
> trusted host driver for the PF managing untrusted VFs.  We do rely on
> some sanity in the hardware/firmware in isolating the VFs from each
> other and from the PF, but we also often have source code for Linux
> drivers for these devices and sometimes even datasheets.  Here we have
> neither of those and perhaps we won't know the extent of the lack of
> isolation between these devices until nouveau (best case) or some
> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> of isolation between devices unless the hardware provides some
> indication that isolation exists, for example ACS on PCIe.
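
The grouping decision the kernel actually made is visible from userspace, so
it is easy to check which devices would have to be assigned together. A
minimal sketch that lists the members of one IOMMU group (the group number is
whatever sysfs shows for the GPU):

#include <dirent.h>
#include <stdio.h>

/* Print the devices the kernel placed into one IOMMU group; everything
 * listed here must be assigned to the same user (or left unused). */
int list_group_devices(int group)
{
	char path[64];
	struct dirent *d;
	DIR *dir;

	snprintf(path, sizeof(path),
		 "/sys/kernel/iommu_groups/%d/devices", group);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((d = readdir(dir)) != NULL)
		if (d->d_name[0] != '.')
			printf("group %d: %s\n", group, d->d_name);

	closedir(dir);
	return 0;
}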

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 14:14:23 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 1:44 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:08:54 +1000
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 8/6/18 8:15 am, Alex Williamson wrote:  
> >>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>> Benjamin Herrenschmidt  wrote:
> >>> 
>  On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> >
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
>  Not entire. On POWER chips, we also have an nvlink between the device
>  and the CPU which is running significantly faster than PCIe.
> 
>  But yes, there are cross-links and those should probably be accounted
>  for in the grouping.
> >>>
> >>> Then after we fix the grouping, can we just let the host driver manage
> >>> this coherent memory range and expose vGPUs to guests?  The use case of
> >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>> convince NVIDIA to support more than a single vGPU per VM though)
> >>
> >> These are physical GPUs, not virtual sriov-alike things they are
> >> implementing as well elsewhere.  
> > 
> > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > either.  That's why we have mdev devices now to implement software
> > defined devices.  I don't have first hand experience with V-series, but
> > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.  
> 
> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> using mediated vGPUs instead, correct?

If it turns out that our PCIe-only-based IOMMU grouping doesn't
account for lack of isolation on the NVLink side and we correct that,
limiting assignment to sets of 3 interconnected GPUs, is that still a
useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
whether they choose to support vGPU on these GPUs or whether they can
be convinced to support multiple vGPUs per VM.

> >> My current understanding is that every P9 chip in that box has some NVLink2
> >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >> as well.
> >>
> >> From small bits of information I have it seems that a GPU can perfectly
> >> work alone and if the NVIDIA driver does not see these interconnects
> >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >> which simply refuses to work until all 3 GPUs are passed so there is some
> >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>
> >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >> interconnected group).  
> > 
> > I'm not gaining much confidence that we can rely on isolation between
> > NVLink connected GPUs, it sounds like you're simply expecting that
> > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > is going to play nice and nobody will figure out how to do bad things
> > because... obfuscation?  Thanks,  
> 
> Well, we already believe that the proprietary firmware of an SR-IOV-capable
> adapter like Mellanox ConnectX is not doing bad things, how is this
> different in principle?

It seems like the scope and hierarchy are different.  Here we're
talking about exposing big discrete devices, which are peers of one
another (and have history of being reverse engineered), to userspace
drivers.  Once handed to userspace, each of those devices needs to be
considered untrusted.  In the case of SR-IOV, we typically have a
trusted host driver for the PF managing untrusted VFs.  We do rely on
some sanity in the hardware/firmware in isolating the VFs from each
other and from the PF, but we also often have source code for Linux
drivers for these devices and sometimes even datasheets.  Here we have
neither of those and perhaps we won't know the extent of the lack of
isolation between these devices until nouveau (best case) or some
exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
of isolation between devices unless the hardware provides some
indication that isolation exists, for example ACS on PCIe.  If NVIDIA
wants to expose isolation on NVLink, perhaps they need to document
enough of it that the host kernel can manipulate and test for isolation,
perhaps even enabling virtualization of the 

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alexey Kardashevskiy
On 8/6/18 1:44 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:08:54 +1000
> Alexey Kardashevskiy  wrote:
> 
>> On 8/6/18 8:15 am, Alex Williamson wrote:
>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>> Benjamin Herrenschmidt  wrote:
>>>   
 On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
>
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

 Not entire. On POWER chips, we also have an nvlink between the device
 and the CPU which is running significantly faster than PCIe.

 But yes, there are cross-links and those should probably be accounted
 for in the grouping.  
>>>
>>> Then after we fix the grouping, can we just let the host driver manage
>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>> convince NVIDIA to support more than a single vGPU per VM though)  
>>
>> These are physical GPUs, not the virtual SR-IOV-like things they are
>> also implementing elsewhere.
> 
> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> either.  That's why we have mdev devices now to implement software
> defined devices.  I don't have first hand experience with V-series, but
> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.

So assuming V100 can do vGPU, you are suggesting ditching this patchset and
using mediated vGPUs instead, correct?


>> My current understanding is that every P9 chip in that box has some NVLink2
>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>> as well.
>>
>> From small bits of information I have it seems that a GPU can perfectly
>> work alone and if the NVIDIA driver does not see these interconnects
>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>> which simply refuses to work until all 3 GPUs are passed so there is some
>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>
>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>> interconnected group).
> 
> I'm not gaining much confidence that we can rely on isolation between
> NVLink connected GPUs, it sounds like you're simply expecting that
> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> is going to play nice and nobody will figure out how to do bad things
> because... obfuscation?  Thanks,

Well, we already believe that the proprietary firmware of an SR-IOV-capable
adapter like Mellanox ConnectX is not doing bad things, how is this
different in principle?


ps. their obfuscation is funny indeed :)
-- 
Alexey


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 13:08:54 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 8:15 am, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt  wrote:
> >   
> >> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> >>>
> >>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>> connected devices makes sense?  AIUI we have a PCI view of these
> >>> devices and from that perspective they're isolated.  That's the view of
> >>> the device used to generate the grouping.  However, not visible to us,
> >>> these devices are interconnected via NVLink.  What isolation properties
> >>> does NVLink provide given that its entire purpose for existing seems to
> >>> be to provide a high performance link for p2p between devices?
> >>
> >> Not entire. On POWER chips, we also have an nvlink between the device
> >> and the CPU which is running significantly faster than PCIe.
> >>
> >> But yes, there are cross-links and those should probably be accounted
> >> for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)  
> 
> These are physical GPUs, not the virtual SR-IOV-like things they are
> also implementing elsewhere.

vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
either.  That's why we have mdev devices now to implement software
defined devices.  I don't have first hand experience with V-series, but
I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
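
Not something this patchset relies on, but for context, a minimal sketch of
what the mdev-based vGPU path looks like from userspace: write a UUID into
the "create" attribute of one of the parent GPU's supported mdev types.
Both the parent PCI address and the type name below are hypothetical (the
real type names come from the vendor driver).

#include <stdio.h>

int main(void)
{
        const char *create =
            "/sys/bus/pci/devices/0000:3b:00.0/mdev_supported_types/"
            "nvidia-999/create";
        const char *uuid = "c7e0ff96-3f37-4b9e-9a2d-2f3c5d7a0b11";
        FILE *f = fopen(create, "w");

        if (!f) {
                perror(create);
                return 1;
        }
        fputs(uuid, f);          /* the kernel instantiates the mdev device */
        fclose(f);
        printf("created /sys/bus/mdev/devices/%s\n", uuid);
        return 0;
}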

> My current understanding is that every P9 chip in that box has some NVLink2
> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> as well.
> 
> From small bits of information I have it seems that a GPU can perfectly
> work alone and if the NVIDIA driver does not see these interconnects
> (because we do not pass the rest of the big 3xGPU group to this guest), it
> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> which simply refuses to work until all 3 GPUs are passed so there is some
> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> 
> So we will either have 6 groups (one per GPU) or 2 groups (one per
> interconnected group).

I'm not gaining much confidence that we can rely on isolation between
NVLink connected GPUs, it sounds like you're simply expecting that
proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
is going to play nice and nobody will figure out how to do bad things
because... obfuscation?  Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alexey Kardashevskiy
On 8/6/18 8:15 am, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt  wrote:
> 
>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>
>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>> connected devices makes sense?  AIUI we have a PCI view of these
>>> devices and from that perspective they're isolated.  That's the view of
>>> the device used to generate the grouping.  However, not visible to us,
>>> these devices are interconnected via NVLink.  What isolation properties
>>> does NVLink provide given that its entire purpose for existing seems to
>>> be to provide a high performance link for p2p between devices?  
>>
>> Not entire. On POWER chips, we also have an nvlink between the device
>> and the CPU which is running significantly faster than PCIe.
>>
>> But yes, there are cross-links and those should probably be accounted
>> for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)

These are physical GPUs, not the virtual SR-IOV-like things they are
also implementing elsewhere.

My current understanding is that every P9 chip in that box has some NVLink2
logic on it so each P9 is directly connected to 3 GPUs via PCIe and
2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
as well.

From small bits of information I have it seems that a GPU can perfectly
work alone and if the NVIDIA driver does not see these interconnects
(because we do not pass the rest of the big 3xGPU group to this guest), it
continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
which simply refuses to work until all 3 GPUs are passed so there is some
distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
get a confirmation from NVIDIA that it is ok to pass just a single GPU.

So we will either have 6 groups (one per GPU) or 2 groups (one per
interconnected group).
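
Either way, the grouping the kernel settles on is easy to check from
userspace; a small sketch that prints which IOMMU group a given GPU landed
in (the device address is taken from the listing in the cover letter):

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *link = "/sys/bus/pci/devices/0004:04:00.0/iommu_group";
        char target[PATH_MAX];
        ssize_t len = readlink(link, target, sizeof(target) - 1);

        if (len < 0) {
                perror(link);
                return 1;
        }
        target[len] = '\0';
        /* the last path component of the symlink is the group number */
        printf("group %s\n", strrchr(target, '/') + 1);
        return 0;
}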


-- 
Alexey


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 08 Jun 2018 10:58:54 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > > We *can* allow individual GPUs to be passed through, either if somebody
> > > designs a system without cross links, or if the user is ok with the
> > > security risk as the guest driver will not enable them if it doesn't
> > > "find" both sides of them.  
> > 
> > If GPUs are not isolated and we cannot prevent them from probing each
> > other via these links, then I think we have an obligation to configure
> > grouping in a way that doesn't rely on a benevolent userspace.  Thanks,  
> 
> Well, it's a user decision, no? Like how we used to let the user
> decide whether to pass through things that have LSIs shared out of
> their domain.

No, users don't get to pinky swear they'll be good.  The kernel creates
IOMMU groups assuming worst-case isolation and malicious users.
It's the kernel's job to protect itself from users and to protect users
from each other.  Anything else is unsupportable.  The only way to
bypass the default grouping is to modify the kernel.  Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Benjamin Herrenschmidt
On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > We *can* allow individual GPUs to be passed through, either if somebody
> > designs a system without cross links, or if the user is ok with the
> > security risk as the guest driver will not enable them if it doesn't
> > "find" both sides of them.
> 
> If GPUs are not isolated and we cannot prevent them from probing each
> other via these links, then I think we have an obligation to configure
> grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Well, it's a user decision, no? Like how we used to let the user
decide whether to pass through things that have LSIs shared out of
their domain.

Cheers,
Ben.


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 08 Jun 2018 09:20:30 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt  wrote:
> >   
> > > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > > > 
> > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > devices and from that perspective they're isolated.  That's the view of
> > > > the device used to generate the grouping.  However, not visible to us,
> > > > these devices are interconnected via NVLink.  What isolation properties
> > > > does NVLink provide given that its entire purpose for existing seems to
> > > > be to provide a high performance link for p2p between devices?
> > > 
> > > Not entire. On POWER chips, we also have an nvlink between the device
> > > and the CPU which is running significantly faster than PCIe.
> > > 
> > > But yes, there are cross-links and those should probably be accounted
> > > for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)
> > Thanks,  
> 
> I don't know about "vGPUs" and what nVidia may be cooking in that area.
> 
> The patches from Alexey allow for passing through the full thing, but
> they aren't trivial (there are additional issues, I'm not sure how
> covered they are, as we need to play with the mapping attributes of
> portions of the GPU memory on the host side...).
> 
> Note: The cross-links are only per-socket so that would be 2 groups of
> 3.
> 
> We *can* allow individual GPUs to be passed through, either if somebody
> designs a system without cross links, or if the user is ok with the
> security risk as the guest driver will not enable them if it doesn't
> "find" both sides of them.

If GPUs are not isolated and we cannot prevent them from probing each
other via these links, then I think we have an obligation to configure
grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Benjamin Herrenschmidt
On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt  wrote:
> 
> > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > > 
> > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > connected devices makes sense?  AIUI we have a PCI view of these
> > > devices and from that perspective they're isolated.  That's the view of
> > > the device used to generate the grouping.  However, not visible to us,
> > > these devices are interconnected via NVLink.  What isolation properties
> > > does NVLink provide given that its entire purpose for existing seems to
> > > be to provide a high performance link for p2p between devices?  
> > 
> > Not entire. On POWER chips, we also have an nvlink between the device
> > and the CPU which is running significantly faster than PCIe.
> > 
> > But yes, there are cross-links and those should probably be accounted
> > for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)
> Thanks,

I don't know about "vGPUs" and what nVidia may be cooking in that area.

The patches from Alexey allow for passing through the full thing, but
they aren't trivial (there are additional issues, I'm not sure how
covered they are, as we need to play with the mapping attributes of
portions of the GPU memory on the host side...).

Note: The cross-links are only per-socket so that would be 2 groups of
3.

We *can* allow individual GPUs to be passed through, either if somebody
designs a system without cross links, or if the user is ok with the
security risk as the guest driver will not enable them if it doesn't
"find" both sides of them.

Cheers,
Ben.



Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 08 Jun 2018 07:54:02 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > 
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
> Not entire. On POWER chips, we also have an nvlink between the device
> and the CPU which is running significantly faster than PCIe.
> 
> But yes, there are cross-links and those should probably be accounted
> for in the grouping.

Then after we fix the grouping, can we just let the host driver manage
this coherent memory range and expose vGPUs to guests?  The use case of
assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
convince NVIDIA to support more than a single vGPU per VM though)
Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Benjamin Herrenschmidt
On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entire. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2";
> > it is not a PCI link but a separate bus. The design inherits from the
> > original NVLink on POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on the GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running; when
> > it stops, the memory is offlined.
> > 
> > The number of GPUs suggests passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver, so we have a host with
> > a 128GB window (bigger than or equal to the actual GPU RAM size) in system
> > memory with no page structs backing this window, and we cannot touch this
> > memory before the NVIDIA driver configures it in a host or a guest, as an
> > HMI (hardware management interrupt?) occurs otherwise.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400  
> > 0x0420  
> > 0x0440  
> > 0x2400  
> > 0x2420  
> > 0x2440  
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and an HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need to
> > translate user addresses to host physical and map GPU RAM memory, but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides a userspace driver, this is of no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but which is
> > worth mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile  |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h|  11 ++
> >  include/uapi/linux/vfio.h  |   3 +
> >  arch/powerpc/kernel/iommu.c|   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c|  70 +---
> >  drivers/vfio/pci/vfio_pci.c|  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c| 190 +
> >  drivers/vfio/vfio_iommu_spapr_tce.c|  42 +---
> >  drivers/vfio/pci/Kconfig   |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy  wrote:

> Here is an RFC of some patches adding pass-through support
> for the NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2";
> it is not a PCI link but a separate bus. The design inherits from the
> original NVLink on POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on the GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running; when
> it stops, the memory is offlined.
> 
> The number of GPUs suggests passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver, so we have a host with
> a 128GB window (bigger than or equal to the actual GPU RAM size) in system
> memory with no page structs backing this window, and we cannot touch this
> memory before the NVIDIA driver configures it in a host or a guest, as an
> HMI (hardware management interrupt?) occurs otherwise.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400  
> 0x0420  
> 0x0440  
> 0x2400  
> 0x2420  
> 0x2440  
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and an HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need to
> translate user addresses to host physical and map GPU RAM memory, but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides a userspace driver, this is of no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but which is
> worth mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

[RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alexey Kardashevskiy
Here is an RFC of some patches adding pass-through support
for the NVIDIA V100 GPU found in some POWER9 boxes.

The example P9 system has 6 GPUs, each accompanied with 2 bridges
representing the hardware links (aka NVLink2):

 4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
 4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)

^^ the number is an IOMMU group ID.
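
The members of any of the groups above can be listed straight from sysfs; a
minimal sketch, taking the group number as an argument and defaulting to
group 4 from the listing:

#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        char path[128];
        struct dirent *d;
        DIR *dir;

        snprintf(path, sizeof(path), "/sys/kernel/iommu_groups/%s/devices",
                 argc > 1 ? argv[1] : "4");
        dir = opendir(path);
        if (!dir) {
                perror(path);
                return 1;
        }
        while ((d = readdir(dir)) != NULL)
                if (d->d_name[0] != '.')
                        printf("%s\n", d->d_name);  /* PCI addresses in the group */
        closedir(dir);
        return 0;
}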

Each bridge represents an additional hardware interface called "NVLink2";
it is not a PCI link but a separate bus. The design inherits from the
original NVLink on POWER8.

The new feature of V100 is 16GB of cache coherent memory on the GPU board.
This memory is presented to the host via the device tree and remains offline
until the NVIDIA driver loads, trains NVLink2 (via the config space of these
bridges above) and the nvidia-persistenced daemon then onlines it.
The memory remains online as long as nvidia-persistenced is running; when
it stops, the memory is offlined.
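
A minimal sketch of the generic memory hotplug sysfs interface this kind of
onlining/offlining presumably goes through (nothing here is NVIDIA specific;
the memory block number is hypothetical, the real blocks depend on where the
GPU RAM windows land):

#include <stdio.h>

static int set_memory_block_state(unsigned int block, const char *state)
{
        char path[96];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/memory/memory%u/state", block);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fputs(state, f);                /* "online" or "offline" */
        return fclose(f);
}

int main(void)
{
        return set_memory_block_state(4096, "online") ? 1 : 0;
}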

The number of GPUs suggests passing them through to a guest. However,
in order to do so we cannot use the NVIDIA driver, so we have a host with
a 128GB window (bigger than or equal to the actual GPU RAM size) in system
memory with no page structs backing this window, and we cannot touch this
memory before the NVIDIA driver configures it in a host or a guest, as an
HMI (hardware management interrupt?) occurs otherwise.

On the example system the GPU RAM windows are located at:
0x0400  
0x0420  
0x0440  
0x2400  
0x2420  
0x2440  

So the complications are:

1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
to VFIO-to-userspace or guest-to-host-physical translations till
the driver trains it (i.e. nvidia-persistenced has started), otherwise
prefetching happens and an HMI occurs; I am trying to get this changed
somehow;

2. since it appears as normal cache coherent memory, it will be used
for DMA which means it has to be pinned and mapped in the host. Having
no page structs makes it different from the usual case - we only need to
translate user addresses to host physical and map GPU RAM memory, but
pinning is not required.

This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
register this memory as a KVM memory slot and present memory nodes to
the guest. Unless NVIDIA provides a userspace driver, this is of no use
for things like DPDK.
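
A minimal sketch of the QEMU side this enables, not the series' actual code:
query a vfio-pci region, mmap() it, and hand it to KVM as a memory slot.
The device and VM fds are assumed to come from the usual VFIO group / KVM
setup, and GPU_RAM_REGION_INDEX stands in for whatever device-specific
region index the subdriver ends up exposing.

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

#define GPU_RAM_REGION_INDEX 9          /* hypothetical */

static int map_gpu_ram(int device_fd, int vm_fd, __u64 guest_phys_addr)
{
        struct vfio_region_info info = {
                .argsz = sizeof(info),
                .index = GPU_RAM_REGION_INDEX,
        };
        struct kvm_userspace_memory_region slot;
        void *ram;

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
                return -1;

        /* map the GPU RAM window through the vfio-pci device fd */
        ram = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   device_fd, info.offset);
        if (ram == MAP_FAILED)
                return -1;

        /* let the guest see it as ordinary memory at guest_phys_addr */
        slot = (struct kvm_userspace_memory_region) {
                .slot = 1,                       /* hypothetical slot number */
                .guest_phys_addr = guest_phys_addr,
                .memory_size = info.size,
                .userspace_addr = (__u64)(unsigned long)ram,
        };
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slot);
}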


There is another problem which the series does not address but which is
worth mentioning - it is not strictly necessary to map GPU RAM to the guest
exactly where it is in the host (I tested this to some extent), we still
might want to represent the memory at the same offset as on the host
which increases the size of a TCE table needed to cover such a huge
window: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
I am addressing this in a separate patchset by allocating indirect TCE
levels on demand and using 16MB IOMMU pages in the guest as we can now
back emulated pages with the smaller hardware ones.
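
The arithmetic above in one place (a flat table needs one 8-byte TCE per
IOMMU page up to the top of the window; window_top below is a placeholder,
substitute the real top of the last GPU RAM window from the layout above):

#include <stdio.h>

int main(void)
{
        unsigned long long window_top = 1ULL << 46;     /* placeholder */
        unsigned int shift;

        /* 64K pages (shift 16) vs the 16MB guest IOMMU pages (shift 24) */
        for (shift = 16; shift <= 24; shift += 8)
                printf("page shift %u -> TCE table %llu MB\n", shift,
                       ((window_top >> shift) * 8) >> 20);
        return 0;
}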


This is an RFC. Please comment. Thanks.



Alexey Kardashevskiy (5):
  vfio/spapr_tce: Simplify page contained test
  powerpc/iommu_context: Change referencing in API
  powerpc/iommu: Do not pin memory of a memory device
  vfio_pci: Allow mapping extra regions
  vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

 drivers/vfio/pci/Makefile  |   1 +
 arch/powerpc/include/asm/mmu_context.h |   5 +-
 drivers/vfio/pci/vfio_pci_private.h|  11 ++
 include/uapi/linux/vfio.h  |   3 +
 arch/powerpc/kernel/iommu.c|   8 +-
 arch/powerpc/mm/mmu_context_iommu.c|  70 +---
 drivers/vfio/pci/vfio_pci.c|  19 +++-
 drivers/vfio/pci/vfio_pci_nvlink2.c| 190 +
 drivers/vfio/vfio_iommu_spapr_tce.c|  42 +---
 drivers/vfio/pci/Kconfig   |   4 +
 10 files changed, 319 insertions(+), 34 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

-- 
2.11.0