Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-04-04 Thread David Gibson
On Fri, Mar 22, 2019 at 05:10:25PM -0600, Alex Williamson wrote:
> On Fri, 22 Mar 2019 14:08:38 +1100
> David Gibson  wrote:
> 
> > On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> > > On Thu, 21 Mar 2019 10:56:00 +1100
> > > David Gibson  wrote:
> > >   
> > > > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:  
> > > > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > > > David Gibson  wrote:
> > > > > 
> > > > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > > > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > > > Alexey Kardashevskiy  wrote:
> > > > > > >   
> > > > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe 
> > > > > > > > links and
> > > > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have 
> > > > > > > > direct
> > > > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment 
> > > > > > > > the POWERNV
> > > > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > > > 
> > > > > > > > However the user may want to pass individual GPUs to the 
> > > > > > > > userspace so
> > > > > > > > in order to do so we need to put them into separate IOMMU 
> > > > > > > > groups and
> > > > > > > > cut off the interconnects.
> > > > > > > > 
> > > > > > > > Thankfully V100 GPUs implement an interface to do by 
> > > > > > > > programming link
> > > > > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a 
> > > > > > > > GPU using
> > > > > > > > this interface, it cannot be re-enabled until the secondary bus 
> > > > > > > > reset is
> > > > > > > > issued to the GPU.
> > > > > > > > 
> > > > > > > > This defines a reset_done() handler for V100 NVlink2 device 
> > > > > > > > which
> > > > > > > > determines what links need to be disabled. This relies on 
> > > > > > > > presence
> > > > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU 
> > > > > > > > telling which
> > > > > > > > PCI peers it is connected to (which includes NVLink bridges or 
> > > > > > > > peer GPUs).
> > > > > > > > 
> > > > > > > > This does not change the existing behaviour and instead adds
> > > > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > > > 
> > > > > > > > The alternative approaches would be:
> > > > > > > > 
> > > > > > > > 1. do this in the system firmware (skiboot) but for that we 
> > > > > > > > would need
> > > > > > > > to tell skiboot via an additional OPAL call whether or not we 
> > > > > > > > want this
> > > > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > > > 
> > > > > > > > 2. do this in the secondary bus reset handler in the POWERNV 
> > > > > > > > platform -
> > > > > > > > the problem with that is at that point the device is not 
> > > > > > > > enabled, i.e.
> > > > > > > > config space is not restored so we need to enable the device 
> > > > > > > > (i.e. MMIO
> > > > > > > > bit in CMD register + program valid address to BAR0) in order 
> > > > > > > > to disable
> > > > > > > > links and then perhaps undo all this initialization to bring 
> > > > > > > > the device
> > > > > > > > back to the state where pci_try_reset_function() expects it to 
> > > > > > > > be.  
> > > > > > > 
> > > > > > > The trouble seems to be that this approach only maintains the 
> > > > > > > isolation
> > > > > > > exposed by the IOMMU group when vfio-pci is the active driver for 
> > > > > > > the
> > > > > > > device.  IOMMU groups can be used by any driver and the IOMMU 
> > > > > > > core is
> > > > > > > incorporating groups in various ways.  
> > > > > > 
> > > > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > > > necessarily represent devices which *are* isolated, just devices 
> > > > > > which
> > > > > > *can be* isolated.  There are plenty of instances when we don't need
> > > > > > to isolate devices in different IOMMU groups: passing both groups to
> > > > > > the same guest or userspace VFIO driver for example, or indeed when
> > > > > > both groups are owned by regular host kernel drivers.
> > > > > > 
> > > > > > In at least some of those cases we also don't want to isolate the
> > > > > > devices when we don't have to, usually for performance reasons.
> > > > > 
> > > > > I see IOMMU groups as representing the current isolation of the 
> > > > > device,
> > > > > not just the possible isolation.  If there are ways to break down that
> > > > > isolation then ideally the group would be updated to reflect it.  The
> > > > > ACS disable patches seem to support this, at boot time we can choose 
> > > > > to
> > > > > disable ACS at certain points in the topology to favor peer-to-peer
> > > > > performance over isolation.  This is then reflected in the group
> > > > > composition, because even though ACS *can be* enabled at the given
> > > > > isolation points, it's intentionally not with this option.  Whether or
> > > > > not a 

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-22 Thread Alex Williamson
On Fri, 22 Mar 2019 14:08:38 +1100
David Gibson  wrote:

> On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> > On Thu, 21 Mar 2019 10:56:00 +1100
> > David Gibson  wrote:
> >   
> > > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:  
> > > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > > David Gibson  wrote:
> > > > 
> > > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > > Alexey Kardashevskiy  wrote:
> > > > > >   
> > > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links 
> > > > > > > and
> > > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have 
> > > > > > > direct
> > > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the 
> > > > > > > POWERNV
> > > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > > 
> > > > > > > However the user may want to pass individual GPUs to the 
> > > > > > > userspace so
> > > > > > > in order to do so we need to put them into separate IOMMU groups 
> > > > > > > and
> > > > > > > cut off the interconnects.
> > > > > > > 
> > > > > > > Thankfully V100 GPUs implement an interface to do by programming 
> > > > > > > link
> > > > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU 
> > > > > > > using
> > > > > > > this interface, it cannot be re-enabled until the secondary bus 
> > > > > > > reset is
> > > > > > > issued to the GPU.
> > > > > > > 
> > > > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > > > determines what links need to be disabled. This relies on presence
> > > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU 
> > > > > > > telling which
> > > > > > > PCI peers it is connected to (which includes NVLink bridges or 
> > > > > > > peer GPUs).
> > > > > > > 
> > > > > > > This does not change the existing behaviour and instead adds
> > > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > > 
> > > > > > > The alternative approaches would be:
> > > > > > > 
> > > > > > > 1. do this in the system firmware (skiboot) but for that we would 
> > > > > > > need
> > > > > > > to tell skiboot via an additional OPAL call whether or not we 
> > > > > > > want this
> > > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > > 
> > > > > > > 2. do this in the secondary bus reset handler in the POWERNV 
> > > > > > > platform -
> > > > > > > the problem with that is at that point the device is not enabled, 
> > > > > > > i.e.
> > > > > > > config space is not restored so we need to enable the device 
> > > > > > > (i.e. MMIO
> > > > > > > bit in CMD register + program valid address to BAR0) in order to 
> > > > > > > disable
> > > > > > > links and then perhaps undo all this initialization to bring the 
> > > > > > > device
> > > > > > > back to the state where pci_try_reset_function() expects it to 
> > > > > > > be.  
> > > > > > 
> > > > > > The trouble seems to be that this approach only maintains the 
> > > > > > isolation
> > > > > > exposed by the IOMMU group when vfio-pci is the active driver for 
> > > > > > the
> > > > > > device.  IOMMU groups can be used by any driver and the IOMMU core 
> > > > > > is
> > > > > > incorporating groups in various ways.  
> > > > > 
> > > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > > necessarily represent devices which *are* isolated, just devices which
> > > > > *can be* isolated.  There are plenty of instances when we don't need
> > > > > to isolate devices in different IOMMU groups: passing both groups to
> > > > > the same guest or userspace VFIO driver for example, or indeed when
> > > > > both groups are owned by regular host kernel drivers.
> > > > > 
> > > > > In at least some of those cases we also don't want to isolate the
> > > > > devices when we don't have to, usually for performance reasons.
> > > > 
> > > > I see IOMMU groups as representing the current isolation of the device,
> > > > not just the possible isolation.  If there are ways to break down that
> > > > isolation then ideally the group would be updated to reflect it.  The
> > > > ACS disable patches seem to support this, at boot time we can choose to
> > > > disable ACS at certain points in the topology to favor peer-to-peer
> > > > performance over isolation.  This is then reflected in the group
> > > > composition, because even though ACS *can be* enabled at the given
> > > > isolation points, it's intentionally not with this option.  Whether or
> > > > not a given user who owns multiple devices needs that isolation is
> > > > really beside the point, the user can choose to connect groups via IOMMU
> > > > mappings or reconfigure the system to disable ACS and potentially more
> > > > direct routing.  The IOMMU groups are still accurately reflecting the
> > > > topology and IOMMU 

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-21 Thread David Gibson
On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> On Thu, 21 Mar 2019 10:56:00 +1100
> David Gibson  wrote:
> 
> > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > David Gibson  wrote:
> > >   
> > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:  
> > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > Alexey Kardashevskiy  wrote:
> > > > > 
> > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links 
> > > > > > and
> > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have 
> > > > > > direct
> > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the 
> > > > > > POWERNV
> > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > 
> > > > > > However the user may want to pass individual GPUs to the userspace 
> > > > > > so
> > > > > > in order to do so we need to put them into separate IOMMU groups and
> > > > > > cut off the interconnects.
> > > > > > 
> > > > > > Thankfully V100 GPUs implement an interface to do by programming 
> > > > > > link
> > > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU 
> > > > > > using
> > > > > > this interface, it cannot be re-enabled until the secondary bus 
> > > > > > reset is
> > > > > > issued to the GPU.
> > > > > > 
> > > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > > determines what links need to be disabled. This relies on presence
> > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling 
> > > > > > which
> > > > > > PCI peers it is connected to (which includes NVLink bridges or peer 
> > > > > > GPUs).
> > > > > > 
> > > > > > This does not change the existing behaviour and instead adds
> > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > 
> > > > > > The alternative approaches would be:
> > > > > > 
> > > > > > 1. do this in the system firmware (skiboot) but for that we would 
> > > > > > need
> > > > > > to tell skiboot via an additional OPAL call whether or not we want 
> > > > > > this
> > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > 
> > > > > > 2. do this in the secondary bus reset handler in the POWERNV 
> > > > > > platform -
> > > > > > the problem with that is at that point the device is not enabled, 
> > > > > > i.e.
> > > > > > config space is not restored so we need to enable the device (i.e. 
> > > > > > MMIO
> > > > > > bit in CMD register + program valid address to BAR0) in order to 
> > > > > > disable
> > > > > > links and then perhaps undo all this initialization to bring the 
> > > > > > device
> > > > > > back to the state where pci_try_reset_function() expects it to be.  
> > > > > >   
> > > > > 
> > > > > The trouble seems to be that this approach only maintains the 
> > > > > isolation
> > > > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > > > incorporating groups in various ways.
> > > > 
> > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > necessarily represent devices which *are* isolated, just devices which
> > > > *can be* isolated.  There are plenty of instances when we don't need
> > > > to isolate devices in different IOMMU groups: passing both groups to
> > > > the same guest or userspace VFIO driver for example, or indeed when
> > > > both groups are owned by regular host kernel drivers.
> > > > 
> > > > In at least some of those cases we also don't want to isolate the
> > > > devices when we don't have to, usually for performance reasons.  
> > > 
> > > I see IOMMU groups as representing the current isolation of the device,
> > > not just the possible isolation.  If there are ways to break down that
> > > isolation then ideally the group would be updated to reflect it.  The
> > > ACS disable patches seem to support this, at boot time we can choose to
> > > disable ACS at certain points in the topology to favor peer-to-peer
> > > performance over isolation.  This is then reflected in the group
> > > composition, because even though ACS *can be* enabled at the given
> > > isolation points, it's intentionally not with this option.  Whether or
> > > not a given user who owns multiple devices needs that isolation is
> > > really beside the point, the user can choose to connect groups via IOMMU
> > > mappings or reconfigure the system to disable ACS and potentially more
> > > direct routing.  The IOMMU groups are still accurately reflecting the
> > > topology and IOMMU based isolation.  
> > 
> > Huh, ok, I think we need to straighten this out.  Thinking of iommu
> > groups as possible rather than potential isolation was a conscious
> 
> possible ~= potential

Sorry, I meant "current" not "potential".

> > decision on my part when we were first coming up with 

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-21 Thread Alex Williamson
On Thu, 21 Mar 2019 10:56:00 +1100
David Gibson  wrote:

> On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> > On Wed, 20 Mar 2019 15:38:24 +1100
> > David Gibson  wrote:
> >   
> > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:  
> > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > Alexey Kardashevskiy  wrote:
> > > > 
> > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the 
> > > > > POWERNV
> > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > 
> > > > > However the user may want to pass individual GPUs to the userspace so
> > > > > in order to do so we need to put them into separate IOMMU groups and
> > > > > cut off the interconnects.
> > > > > 
> > > > > Thankfully V100 GPUs implement an interface to do by programming link
> > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU 
> > > > > using
> > > > > this interface, it cannot be re-enabled until the secondary bus reset 
> > > > > is
> > > > > issued to the GPU.
> > > > > 
> > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > determines what links need to be disabled. This relies on presence
> > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling 
> > > > > which
> > > > > PCI peers it is connected to (which includes NVLink bridges or peer 
> > > > > GPUs).
> > > > > 
> > > > > This does not change the existing behaviour and instead adds
> > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > 
> > > > > The alternative approaches would be:
> > > > > 
> > > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > > to tell skiboot via an additional OPAL call whether or not we want 
> > > > > this
> > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > 
> > > > > 2. do this in the secondary bus reset handler in the POWERNV platform 
> > > > > -
> > > > > the problem with that is at that point the device is not enabled, i.e.
> > > > > config space is not restored so we need to enable the device (i.e. 
> > > > > MMIO
> > > > > bit in CMD register + program valid address to BAR0) in order to 
> > > > > disable
> > > > > links and then perhaps undo all this initialization to bring the 
> > > > > device
> > > > > back to the state where pci_try_reset_function() expects it to be.
> > > > 
> > > > The trouble seems to be that this approach only maintains the isolation
> > > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > > incorporating groups in various ways.
> > > 
> > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > necessarily represent devices which *are* isolated, just devices which
> > > *can be* isolated.  There are plenty of instances when we don't need
> > > to isolate devices in different IOMMU groups: passing both groups to
> > > the same guest or userspace VFIO driver for example, or indeed when
> > > both groups are owned by regular host kernel drivers.
> > > 
> > > In at least some of those cases we also don't want to isolate the
> > > devices when we don't have to, usually for performance reasons.  
> > 
> > I see IOMMU groups as representing the current isolation of the device,
> > not just the possible isolation.  If there are ways to break down that
> > isolation then ideally the group would be updated to reflect it.  The
> > ACS disable patches seem to support this, at boot time we can choose to
> > disable ACS at certain points in the topology to favor peer-to-peer
> > performance over isolation.  This is then reflected in the group
> > composition, because even though ACS *can be* enabled at the given
> > isolation points, it's intentionally not with this option.  Whether or
> > not a given user who owns multiple devices needs that isolation is
> > really beside the point, the user can choose to connect groups via IOMMU
> > mappings or reconfigure the system to disable ACS and potentially more
> > direct routing.  The IOMMU groups are still accurately reflecting the
> > topology and IOMMU based isolation.  
> 
> Huh, ok, I think we need to straighten this out.  Thinking of iommu
> groups as possible rather than potential isolation was a conscious

possible ~= potential

> decision on my part when we were first coming up with them.  The
> rationale was that that way iommu groups could be static for the
> lifetime of boot, with more dynamic isolation state layered on top.
> 
> Now, that was based on analogy with PAPR's concept of "Partitionable
> Endpoints" which are decided by firmware before boot.  However, I
> think it makes sense in other contexts too: if iommu groups represent
> current isolation, then we 

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-20 Thread David Gibson
On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> On Wed, 20 Mar 2019 15:38:24 +1100
> David Gibson  wrote:
> 
> > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > Alexey Kardashevskiy  wrote:
> > >   
> > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > 
> > > > However the user may want to pass individual GPUs to the userspace so
> > > > in order to do so we need to put them into separate IOMMU groups and
> > > > cut off the interconnects.
> > > > 
> > > > Thankfully V100 GPUs implement an interface to do by programming link
> > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > > issued to the GPU.
> > > > 
> > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > determines what links need to be disabled. This relies on presence
> > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling 
> > > > which
> > > > PCI peers it is connected to (which includes NVLink bridges or peer 
> > > > GPUs).
> > > > 
> > > > This does not change the existing behaviour and instead adds
> > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > 
> > > > The alternative approaches would be:
> > > > 
> > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > to tell skiboot via an additional OPAL call whether or not we want this
> > > > isolation - skiboot is unaware of IOMMU groups.
> > > > 
> > > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > > the problem with that is at that point the device is not enabled, i.e.
> > > > config space is not restored so we need to enable the device (i.e. MMIO
> > > > bit in CMD register + program valid address to BAR0) in order to disable
> > > > links and then perhaps undo all this initialization to bring the device
> > > > back to the state where pci_try_reset_function() expects it to be.  
> > > 
> > > The trouble seems to be that this approach only maintains the isolation
> > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > incorporating groups in various ways.  
> > 
> > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > necessarily represent devices which *are* isolated, just devices which
> > *can be* isolated.  There are plenty of instances when we don't need
> > to isolate devices in different IOMMU groups: passing both groups to
> > the same guest or userspace VFIO driver for example, or indeed when
> > both groups are owned by regular host kernel drivers.
> > 
> > In at least some of those cases we also don't want to isolate the
> > devices when we don't have to, usually for performance reasons.
> 
> I see IOMMU groups as representing the current isolation of the device,
> not just the possible isolation.  If there are ways to break down that
> isolation then ideally the group would be updated to reflect it.  The
> ACS disable patches seem to support this, at boot time we can choose to
> disable ACS at certain points in the topology to favor peer-to-peer
> performance over isolation.  This is then reflected in the group
> composition, because even though ACS *can be* enabled at the given
> isolation points, it's intentionally not with this option.  Whether or
> not a given user who owns multiple devices needs that isolation is
> really beside the point, the user can choose to connect groups via IOMMU
> mappings or reconfigure the system to disable ACS and potentially more
> direct routing.  The IOMMU groups are still accurately reflecting the
> topology and IOMMU based isolation.

Huh, ok, I think we need to straighten this out.  Thinking of iommu
groups as possible rather than potential isolation was a conscious
decision on my part when we were first coming up with them.  The
rationale was that that way iommu groups could be static for the
lifetime of boot, with more dynamic isolation state layered on top.

Now, that was based on analogy with PAPR's concept of "Partitionable
Endpoints" which are decided by firmware before boot.  However, I
think it makes sense in other contexts too: if iommu groups represent
current isolation, then we need some other way to advertise possible
isolation - otherwise how will the admin (and/or tools) know how they
can configure the iommu groups?

VFIO already has the container, which represents explicitly a "group
of groups" that we don't care to isolate from each other.  I don't
actually know what other uses of the iommu group infrastructure we
have at 
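
To illustrate the container-as-a-group-of-groups point, a userspace VFIO
client that does not need isolation between two groups simply binds both to
one container. The following is a minimal sketch against the standard VFIO
ioctls; the group numbers and the choice of the SPAPR TCE IOMMU type are
assumptions for illustration only, and error handling is omitted:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group_a = open("/dev/vfio/12", O_RDWR);  /* hypothetical group numbers */
        int group_b = open("/dev/vfio/13", O_RDWR);

        /* Both groups join the same container: no isolation requested between them. */
        ioctl(group_a, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(group_b, VFIO_GROUP_SET_CONTAINER, &container);

        /* One shared IOMMU context for everything in the container. */
        ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU);

        return 0;
}

Devices in both groups then share a single DMA address space, which is the
"we don't care to isolate these from each other" statement made explicit.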

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-20 Thread Alex Williamson
On Wed, 20 Mar 2019 15:38:24 +1100
David Gibson  wrote:

> On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > On Fri, 15 Mar 2019 19:18:35 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > platform puts all interconnected GPUs to the same IOMMU group.
> > > 
> > > However the user may want to pass individual GPUs to the userspace so
> > > in order to do so we need to put them into separate IOMMU groups and
> > > cut off the interconnects.
> > > 
> > > Thankfully V100 GPUs implement an interface to do by programming link
> > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > issued to the GPU.
> > > 
> > > This defines a reset_done() handler for V100 NVlink2 device which
> > > determines what links need to be disabled. This relies on presence
> > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > 
> > > This does not change the existing behaviour and instead adds
> > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > 
> > > The alternative approaches would be:
> > > 
> > > 1. do this in the system firmware (skiboot) but for that we would need
> > > to tell skiboot via an additional OPAL call whether or not we want this
> > > isolation - skiboot is unaware of IOMMU groups.
> > > 
> > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > the problem with that is at that point the device is not enabled, i.e.
> > > config space is not restored so we need to enable the device (i.e. MMIO
> > > bit in CMD register + program valid address to BAR0) in order to disable
> > > links and then perhaps undo all this initialization to bring the device
> > > back to the state where pci_try_reset_function() expects it to be.  
> > 
> > The trouble seems to be that this approach only maintains the isolation
> > exposed by the IOMMU group when vfio-pci is the active driver for the
> > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > incorporating groups in various ways.  
> 
> I don't think that reasoning is quite right.  An IOMMU group doesn't
> necessarily represent devices which *are* isolated, just devices which
> *can be* isolated.  There are plenty of instances when we don't need
> to isolate devices in different IOMMU groups: passing both groups to
> the same guest or userspace VFIO driver for example, or indeed when
> both groups are owned by regular host kernel drivers.
> 
> In at least some of those cases we also don't want to isolate the
> devices when we don't have to, usually for performance reasons.

I see IOMMU groups as representing the current isolation of the device,
not just the possible isolation.  If there are ways to break down that
isolation then ideally the group would be updated to reflect it.  The
ACS disable patches seem to support this: at boot time we can choose to
disable ACS at certain points in the topology to favor peer-to-peer
performance over isolation.  This is then reflected in the group
composition, because even though ACS *can be* enabled at the given
isolation points, it is intentionally not enabled with this option.  Whether
or not a given user who owns multiple devices needs that isolation is
really beside the point; the user can choose to connect groups via IOMMU
mappings or reconfigure the system to disable ACS and potentially more
direct routing.  The IOMMU groups are still accurately reflecting the
topology and IOMMU-based isolation.
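
For concreteness, and assuming the "ACS disable patches" refer to a mechanism
along the lines of mainline's pci=disable_acs_redir= boot option (an
assumption, not something stated in this thread), the boot-time choice would
look roughly like:

        pci=disable_acs_redir=0000:40:01.0;0000:40:03.0

with made-up downstream-port addresses; the resulting, larger groups can then
be inspected under /sys/kernel/iommu_groups/<group>/devices/.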

> > So, if there's a device specific
> > way to configure the isolation reported in the group, which requires
> > some sort of active management against things like secondary bus
> > resets, then I think we need to manage it above the attached endpoint
> > driver.  
> 
> The problem is that above the endpoint driver, we don't actually have
> enough information about what should be isolated.  For VFIO we want to
> isolate things if they're in different containers, for most regular
> host kernel drivers we don't need to isolate at all (although we might
> as well when it doesn't have a cost).

This idea that we only want to isolate things if they're in different
containers is bogus, imo.  There are performance reasons why we might
not want things isolated, but there are also address space reasons why
we do.  If there are direct routes between devices, the user needs to
be aware of the IOVA pollution; if we maintain singleton groups, they
don't.  Granted, we don't really account for this well in most
userspaces and fumble through it by luck of the address space layout
and lack of devices really attempting peer-to-peer access.
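
To make the IOVA-pollution point concrete: once peer devices end up in one
group and container, the IOVAs a user picks for DMA mappings live in the same
address space as whatever ranges get routed peer-to-peer, so the user has to
steer around those windows.  A minimal sketch using the standard VFIO DMA map
ioctl (the container fd, buffer and IOVA are made up; error handling omitted):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map an existing user buffer at a caller-chosen IOVA. */
static int map_dma(int container, void *vaddr, unsigned long long iova,
                   unsigned long long size)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (unsigned long long)vaddr;
        map.iova  = iova;   /* must not collide with peer MMIO windows */
        map.size  = size;

        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}

Nothing in this interface tells the user where those peer windows are, which
is the "we don't really account for this well" part.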


Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-20 Thread David Gibson
On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> On Fri, 15 Mar 2019 19:18:35 +1100
> Alexey Kardashevskiy  wrote:
> 
> > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > platform puts all interconnected GPUs to the same IOMMU group.
> > 
> > However the user may want to pass individual GPUs to the userspace so
> > in order to do so we need to put them into separate IOMMU groups and
> > cut off the interconnects.
> > 
> > Thankfully V100 GPUs implement an interface to do by programming link
> > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > this interface, it cannot be re-enabled until the secondary bus reset is
> > issued to the GPU.
> > 
> > This defines a reset_done() handler for V100 NVlink2 device which
> > determines what links need to be disabled. This relies on presence
> > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > 
> > This does not change the existing behaviour and instead adds
> > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > 
> > The alternative approaches would be:
> > 
> > 1. do this in the system firmware (skiboot) but for that we would need
> > to tell skiboot via an additional OPAL call whether or not we want this
> > isolation - skiboot is unaware of IOMMU groups.
> > 
> > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > the problem with that is at that point the device is not enabled, i.e.
> > config space is not restored so we need to enable the device (i.e. MMIO
> > bit in CMD register + program valid address to BAR0) in order to disable
> > links and then perhaps undo all this initialization to bring the device
> > back to the state where pci_try_reset_function() expects it to be.
> 
> The trouble seems to be that this approach only maintains the isolation
> exposed by the IOMMU group when vfio-pci is the active driver for the
> device.  IOMMU groups can be used by any driver and the IOMMU core is
> incorporating groups in various ways.

I don't think that reasoning is quite right.  An IOMMU group doesn't
necessarily represent devices which *are* isolated, just devices which
*can be* isolated.  There are plenty of instances when we don't need
to isolate devices in different IOMMU groups: passing both groups to
the same guest or userspace VFIO driver for example, or indeed when
both groups are owned by regular host kernel drivers.

In at least some of those cases we also don't want to isolate the
devices when we don't have to, usually for performance reasons.

> So, if there's a device specific
> way to configure the isolation reported in the group, which requires
> some sort of active management against things like secondary bus
> resets, then I think we need to manage it above the attached endpoint
> driver.

The problem is that above the endpoint driver, we don't actually have
enough information about what should be isolated.  For VFIO we want to
isolate things if they're in different containers, for most regular
host kernel drivers we don't need to isolate at all (although we might
as well when it doesn't have a cost).  The host-side NVIDIA GPGPU
drivers also won't want to isolate the (host-owned) NVLink devices
from each other, since they'll want to use the fast interconnects.

> Ideally I'd see this as a set of PCI quirks so that we might
> leverage it beyond POWER platforms.  I'm not sure how we get past the
> reliance on device tree properties that we won't have on other
> platforms though, if only NVIDIA could at least open a spec addressing
> the discovery and configuration of NVLink registers on their
> devices :-\  Thanks,

Yeah, that'd be nice :/.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-19 Thread Alexey Kardashevskiy



On 20/03/2019 03:36, Alex Williamson wrote:
> On Fri, 15 Mar 2019 19:18:35 +1100
> Alexey Kardashevskiy  wrote:
> 
>> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
>> (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
>> peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
>> platform puts all interconnected GPUs to the same IOMMU group.
>>
>> However the user may want to pass individual GPUs to the userspace so
>> in order to do so we need to put them into separate IOMMU groups and
>> cut off the interconnects.
>>
>> Thankfully V100 GPUs implement an interface to do by programming link
>> disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
>> this interface, it cannot be re-enabled until the secondary bus reset is
>> issued to the GPU.
>>
>> This defines a reset_done() handler for V100 NVlink2 device which
>> determines what links need to be disabled. This relies on presence
>> of the new "ibm,nvlink-peers" device tree property of a GPU telling which
>> PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
>>
>> This does not change the existing behaviour and instead adds
>> a new "isolate_nvlink" kernel parameter to allow such isolation.
>>
>> The alternative approaches would be:
>>
>> 1. do this in the system firmware (skiboot) but for that we would need
>> to tell skiboot via an additional OPAL call whether or not we want this
>> isolation - skiboot is unaware of IOMMU groups.
>>
>> 2. do this in the secondary bus reset handler in the POWERNV platform -
>> the problem with that is at that point the device is not enabled, i.e.
>> config space is not restored so we need to enable the device (i.e. MMIO
>> bit in CMD register + program valid address to BAR0) in order to disable
>> links and then perhaps undo all this initialization to bring the device
>> back to the state where pci_try_reset_function() expects it to be.
> 
> The trouble seems to be that this approach only maintains the isolation
> exposed by the IOMMU group when vfio-pci is the active driver for the
> device.  IOMMU groups can be used by any driver and the IOMMU core is
> incorporating groups in various ways.  So, if there's a device specific
> way to configure the isolation reported in the group, which requires
> some sort of active management against things like secondary bus
> resets, then I think we need to manage it above the attached endpoint
> driver.

Fair point. So for now I'll go for 2) then.
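
For context, option 2) amounts to something like the following sketch in the
platform's secondary-bus-reset path.  This is not the posted patch: the BAR0
register layout for the link-disable mask is not public, so only the
config-space bookkeeping described in the commit message is shown.

#include <linux/pci.h>

static void nvlink_disable_links_after_sbr(struct pci_dev *pdev, u32 bar0_addr)
{
        u16 saved_cmd;
        u32 saved_bar0;

        pci_read_config_word(pdev, PCI_COMMAND, &saved_cmd);
        pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0, &saved_bar0);

        /* Temporarily give BAR0 a valid address and turn on memory decode. */
        pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0, bar0_addr);
        pci_write_config_word(pdev, PCI_COMMAND, saved_cmd | PCI_COMMAND_MEMORY);

        /* ... ioremap BAR0 and program the link-disable mask here ... */

        /* Undo everything so pci_try_reset_function() sees the expected state. */
        pci_write_config_word(pdev, PCI_COMMAND, saved_cmd);
        pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0, saved_bar0);
}

(A 64-bit BAR would also need PCI_BASE_ADDRESS_1; that detail is omitted here.)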

> Ideally I'd see this as a set of PCI quirks so that we might
> leverage it beyond POWER platforms.  I'm not sure how we get past the
> reliance on device tree properties that we won't have on other
> platforms though, if only NVIDIA could at least open a spec addressing
> the discovery and configuration of NVLink registers on their
> devices :-\  Thanks,

This would be nice, yes...


-- 
Alexey


Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-19 Thread Alex Williamson
On Fri, 15 Mar 2019 19:18:35 +1100
Alexey Kardashevskiy  wrote:

> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> platform puts all interconnected GPUs to the same IOMMU group.
> 
> However the user may want to pass individual GPUs to the userspace so
> in order to do so we need to put them into separate IOMMU groups and
> cut off the interconnects.
> 
> Thankfully V100 GPUs implement an interface to do by programming link
> disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> this interface, it cannot be re-enabled until the secondary bus reset is
> issued to the GPU.
> 
> This defines a reset_done() handler for V100 NVlink2 device which
> determines what links need to be disabled. This relies on presence
> of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> 
> This does not change the existing behaviour and instead adds
> a new "isolate_nvlink" kernel parameter to allow such isolation.
> 
> The alternative approaches would be:
> 
> 1. do this in the system firmware (skiboot) but for that we would need
> to tell skiboot via an additional OPAL call whether or not we want this
> isolation - skiboot is unaware of IOMMU groups.
> 
> 2. do this in the secondary bus reset handler in the POWERNV platform -
> the problem with that is at that point the device is not enabled, i.e.
> config space is not restored so we need to enable the device (i.e. MMIO
> bit in CMD register + program valid address to BAR0) in order to disable
> links and then perhaps undo all this initialization to bring the device
> back to the state where pci_try_reset_function() expects it to be.

The trouble seems to be that this approach only maintains the isolation
exposed by the IOMMU group when vfio-pci is the active driver for the
device.  IOMMU groups can be used by any driver and the IOMMU core is
incorporating groups in various ways.  So, if there's a device specific
way to configure the isolation reported in the group, which requires
some sort of active management against things like secondary bus
resets, then I think we need to manage it above the attached endpoint
driver.  Ideally I'd see this as a set of PCI quirks so that we might
leverage it beyond POWER platforms.  I'm not sure how we get past the
reliance on device tree properties that we won't have on other
platforms, though; if only NVIDIA could at least open a spec addressing
the discovery and configuration of NVLink registers on their
devices :-\  Thanks,

Alex
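
A rough sketch of the quirk shape being suggested, purely illustrative: the
device matching and the reliance on the "ibm,nvlink-peers" property are
assumptions, and the actual link-disable programming is not public.

#include <linux/pci.h>
#include <linux/of.h>

static void nvlink_isolation_quirk(struct pci_dev *pdev)
{
        struct device_node *dn = pdev->dev.of_node;

        /* Only act on GPUs that firmware has marked as having NVLink peers. */
        if (!dn || !of_find_property(dn, "ibm,nvlink-peers", NULL))
                return;

        /*
         * A platform-agnostic hook would compute and program the
         * link-disable mask into BAR0 here, much like the RFC's
         * reset_done() handler does.
         */
        pci_info(pdev, "NVLink isolation quirk would run here\n");
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, nvlink_isolation_quirk);

The open question above still stands: without a device tree (or an open
spec), a quirk like this has nothing to key off on non-POWER platforms.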


[PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-15 Thread Alexey Kardashevskiy
The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
(on POWER9) NVLinks. In addition to that, GPUs themselves have direct
peer-to-peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
platform puts all interconnected GPUs into the same IOMMU group.

However, the user may want to pass individual GPUs to userspace, and
in order to do so we need to put them into separate IOMMU groups and
cut off the interconnects.

Thankfully V100 GPUs implement an interface to do this, by programming a
link-disabling mask into BAR0 of a GPU. Once a link is disabled in a GPU
using this interface, it cannot be re-enabled until a secondary bus reset
is issued to the GPU.

This defines a reset_done() handler for the V100 NVLink2 device which
determines which links need to be disabled. This relies on the presence
of the new "ibm,nvlink-peers" device tree property of a GPU telling which
PCI peers it is connected to (which include NVLink bridges or peer GPUs).

This does not change the existing behaviour and instead adds
a new "isolate_nvlink" kernel parameter to allow such isolation.

The alternative approaches would be:

1. do this in the system firmware (skiboot) but for that we would need
to tell skiboot via an additional OPAL call whether or not we want this
isolation - skiboot is unaware of IOMMU groups.

2. do this in the secondary bus reset handler in the POWERNV platform -
the problem with that is that at that point the device is not enabled, i.e.
config space is not restored, so we need to enable the device (i.e. set the
MMIO bit in the CMD register and program a valid address into BAR0) in order
to disable the links, and then perhaps undo all this initialization to bring
the device back to the state where pci_try_reset_function() expects it to be.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 24 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c  | 98 
 2 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 3a102378c8dc..6f5c769b6fc8 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -441,6 +441,23 @@ static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
++npucomp->pe_num;
 }
 
+static bool isolate_nvlink;
+
+static int __init parse_isolate_nvlink(char *p)
+{
+   bool val;
+
+   if (!p)
+   val = true;
+   else if (kstrtobool(p, &val))
+   return -EINVAL;
+
+   isolate_nvlink = val;
+
+   return 0;
+}
+early_param("isolate_nvlink", parse_isolate_nvlink);
+
 struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 {
struct iommu_table_group *table_group;
@@ -463,7 +480,7 @@ struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
hose = pci_bus_to_host(npdev->bus);
phb = hose->private_data;
 
-   if (hose->npu) {
+   if (hose->npu && !isolate_nvlink) {
if (!phb->npucomp) {
phb->npucomp = kzalloc(sizeof(struct npu_comp),
GFP_KERNEL);
@@ -477,7 +494,10 @@ struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
pe->pe_number);
}
} else {
-   /* Create a group for 1 GPU and attached NPUs for POWER8 */
+   /*
+* Create a group for 1 GPU and attached NPUs for
+* POWER8 (always) or POWER9 (when isolate_nvlink).
+*/
pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
table_group = &pe->npucomp->table_group;
table_group->ops = &pnv_npu_peers_ops;
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 32f695ffe128..bb6bba762f46 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -206,6 +206,102 @@ static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
return NOTIFY_OK;
 }
 
+static int vfio_pci_nvdia_v100_is_ph_in_group(struct device *dev, void *data)
+{
+   return dev->of_node->phandle == *(phandle *) data;
+}
+
+static u32 vfio_pci_nvdia_v100_get_disable_mask(struct device *dev)
+{
+   int npu, peer;
+   u32 mask;
+   struct device_node *dn;
+   struct iommu_group *group;
+
+   dn = dev->of_node;
+   if (!of_find_property(dn, "ibm,nvlink-peers", NULL))
+   return 0;
+
+   group = iommu_group_get(dev);
+   if (!group)
+   return 0;
+
+   /*
+* Collect links to keep which includes links to NPU and links to
+* other GPUs in the same IOMMU group.
+*/
+   for (npu = 0, mask = 0; ; ++npu) {
+   u32 npuph = 0;
+
+   if (of_property_read_u32_index(dn, "ibm,npu", npu, &npuph))
+   break;
+
+