Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-11 Thread Christoph Hellwig
On Wed, Feb 10, 2021 at 09:34:52AM -0400, Jason Gunthorpe wrote:
> > I'm a bit confused about the change from v1 to v2, especially about
> > how to inject module-specific operations. From a live migration p.o.v.
> > it may require at least two hook points for some devices (e.g. i40e
> > in Yan's original example):
> 
> IMHO, it was too soon to give up on putting the vfio_device_ops in the
> final driver - we should try to define a reasonable public/private
> split of vfio_pci_device as is the norm in the kernel. No reason we
> can't achieve that.
> 
> >  register a migration region and intercept guest writes to specific
> > registers. [PATCH 4/9] demonstrates the former but not the latter
> > (which is allowed in v1).
> 
> And this is why the ROI of wrapping every vfio op in a PCI op just to
> keep vfio_pci_device completely private is poor :(

Yes.  If Alex has a strong preference to keep some values private,
a split between vfio_pci_device and vfio_pci_device_priv might be doable,
but it is somewhat silly.
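
As an illustration, such a split could keep a slim public struct and hide
the core-only state behind an opaque pointer; a minimal sketch, with all
struct and field names hypothetical rather than taken from the patchset:

struct vfio_pci_device_priv;                 /* defined only in vfio-pci-core */

struct vfio_pci_device {
        struct pci_dev *pdev;                /* public: vendor drivers may use */
        u8 *pci_config_map;                  /* public: shared emulation state */
        struct vfio_pci_device_priv *priv;   /* opaque outside the core */
};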

> > Then another question. Once we have this framework in place, do we 
> > mandate this approach for any vendor specific tweak or still allow
> > doing it as vfio_pci_core extensions (such as igd and zdev in this
> > series)?
> 
> I would say no to any further vfio_pci_core extensions that are tied
> to specific PCI devices. Things like zdev are platform features, they
> are not tied to specific PCI devices.

Yes, ZDEV is just a special case of exposing extra information for any
PCI device on s390.  It does not fit any split-up vfio_pci framework.
In fact I wonder why it even has its own config option.

> > vfio-mdev is just the channel to bring VFIO APIs through mdev core
> > to the underlying vendor specific mdev device driver, which is already
> > granted flexibility to tweak whatever it needs through mdev_parent_ops.
> 
> This is the second thing, and it could just be deleted. The actual
> final mdev driver can just use vfio_device_ops directly. The
> redirection shim in vfio_mdev.c doesn't add value.

Yes, that would simplify a lot of things.


Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 09:37:46AM -0700, Alex Williamson wrote:

> > >  register a migration region and intercept guest writes to specific
> > > registers. [PATCH 4/9] demonstrates the former but not the latter
> > > (which is allowed in v1).  
> > 
> > And this is why the ROI of wrapping every vfio op in a PCI op just to
> > keep vfio_pci_device completely private is poor :(
> 
> Says someone who doesn't need to maintain the core, fixing bugs and
> adding features, while not breaking vendor drivers touching private data
> in unexpected ways ;)

Said as someone who maintains a driver subsystem 25x larger than VFIO
and is really experienced in "crazy things drivers do". :)

Private/public is rarely at the top of my worries, and I'm confident
in saying this is the general kernel philosophy. Again, look anywhere: we
rarely have private data split out of major structs like struct
device, struct netdev, struct pci_device, etc. This data has to be
public because we are using C and we expect inline functions,
container_of() and so on to work. It is rarely done with hidden
structs.
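
For reference, this is the container_of() idiom being defended; it only
works because the embedded struct's layout is public (the mlx5 names here
are illustrative):

struct mlx5_vfio_pci_device {
        struct vfio_pci_device core;    /* layout must be fully visible here */
        void *migration_state;          /* driver-private extras */
};

static inline struct mlx5_vfio_pci_device *
to_mlx5_vdev(struct vfio_pci_device *core)
{
        return container_of(core, struct mlx5_vfio_pci_device, core);
}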

If we can get private data in some places it is a nice win, but not
worth making a mess to achieve. E.g. I would not give up the normal
container_of pattern just to obscure some struct; the overall ROI is
bad.

Drivers always do unexpected and crazy things; I wouldn't get fixated
on touching private data as the worst sin a driver can do :(

So, no, I don't agree that exposing a struct vfio_pci_device is the
end of the world - it is normal in the kernel to do this kind of
thing, and yes, drivers can do crazy things with that if crazy slips
past the review process.

Honestly I expect people to test their drivers and fix things if a
core change broke them. It happens, QA finds it, it gets fixed, normal
stuff for Linux, IMHO.

> > > Then what exact extension is being talked about here by creating another
> > > subsystem module? Or are we talking about some general library which can
> > > be shared by underlying mdev device drivers to reduce duplicated
> > > emulation code?
> > 
> > IMHO it is more a design philosophy that the end driver should
> > implement the vfio_device_ops directly vs having a stack of ops
> > structs.

> Like Kevin though, I don't really understand the hand-wave
> application to mdev.  Sure, vfio-mdev could be collapsed now that
> we've rejected that there could be other drivers binding to mdev
> devices,

Again, I think the point Max was trying to make is only that vfio_mdev
can follow the same design as proposed here and replace the
mdev_parent_ops with the vfio_device_ops.

Jason


Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-10 Thread Alex Williamson
On Wed, 10 Feb 2021 09:34:52 -0400
Jason Gunthorpe  wrote:

> On Wed, Feb 10, 2021 at 07:52:08AM +0000, Tian, Kevin wrote:
> > > This subsystem framework will also ease adding vendor-specific
> > > functionality to VFIO devices in the future by allowing another module
> > > to provide the pci_driver that can set up a number of details before
> > > registering to the VFIO subsystem (such as injecting its own operations).
> > 
> > I'm a bit confused about the change from v1 to v2, especially about
> > how to inject module-specific operations. From a live migration p.o.v.
> > it may require at least two hook points for some devices (e.g. i40e
> > in Yan's original example):
> 
> IMHO, it was too soon to give up on putting the vfio_device_ops in the
> final driver - we should try to define a reasonable public/private
> split of vfio_pci_device as is the norm in the kernel. No reason we
> can't achieve that.
> 
> >  register a migration region and intercept guest writes to specific
> > registers. [PATCH 4/9] demonstrates the former but not the latter
> > (which is allowed in v1).  
> 
> And this is why the ROI of wrapping every vfio op in a PCI op just to
> keep vfio_pci_device completely private is poor :(

Says someone who doesn't need to maintain the core, fixing bugs and
adding features, while not breaking vendor drivers touching private data
in unexpected ways ;)

> > Then another question. Once we have this framework in place, do we 
> > mandate this approach for any vendor specific tweak or still allow
> > doing it as vfio_pci_core extensions (such as igd and zdev in this
> > series)?  
> 
> I would say no to any further vfio_pci_core extensions that are tied
> to specific PCI devices. Things like zdev are platform features, they
> are not tied to specific PCI devices.
> 
> > If the latter, what are the criteria to judge which way is desired? Also 
> > what 
> > about the scenarios where we just want one-time vendor information, 
> > e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
> > the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
> > Do we expect to create a module for each device to provide such info?
> > Having those questions answered is helpful for better understanding of
> > this proposal IMO. 
> > 
> > [1] 
> > https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/
> >   
> 
> SVA is a platform feature, so no problem. Don't see a vfio-pci change
> in here?
> 
> > [2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/  
> 
> This one could have been done as a broadcom_vfio_pci driver. Not sure
> exposing the entire config space unprotected is safe; it is hard to know
> what the device has put in there, and whether it is secure to share with a
> guest...

The same argument could be made about the whole device, not just config
space.  In fact we know that devices like GPUs provide access to config
space through other address spaces, I/O port and MMIO BARs.  Config
space is only the architected means to manipulate the link interface
and device features; we can't determine if it's the only means.  Any
emulation or access restrictions we put in config space are meant to
make the device work better for the user (ex. re-using host kernel
device quirks) or prevent casual misconfiguration (ex. incompatible MPS
settings, BAR registers); the safety of assigning a device to a user
is only as good as the isolation and error handling that the platform
allows.

 
> > MDEV core is already a well defined subsystem to connect mdev
> > bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).  
> 
> mdev is two things
> 
>  - a driver core bus layer and sysfs that makes a lifetime model
>  - a vfio bus driver that doesn't do anything but forward ops to the
>    main ops
> 
> > vfio-mdev is just the channel to bring VFIO APIs through mdev core
> > to the underlying vendor specific mdev device driver, which is already
> > granted flexibility to tweak whatever it needs through mdev_parent_ops.
> 
> This is the second thing, and it could just be deleted. The actual
> final mdev driver can just use vfio_device_ops directly. The
> redirection shim in vfio_mdev.c doesn't add value.
> 
> > Then what exact extension is being talked about here by creating another
> > subsystem module? Or are we talking about some general library which can
> > be shared by underlying mdev device drivers to reduce duplicated
> > emulation code?
> 
> IMHO it is more a design philosophy that the end driver should
> implement the vfio_device_ops directly vs having a stack of ops
> structs.

And that's where turning vfio-pci into a vfio-pci-core library comes
in, lowering the bar for a vendor driver to re-use what vfio-pci has
already done.  This doesn't change the basic vfio architecture that
already allows any vendor driver to register with its own
vfio_device_ops; it's just code re-use, which only makes sense if we're
not just shifting the support burden from the vendor driver to the new

Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 07:52:08AM +0000, Tian, Kevin wrote:
> > This subsystem framework will also ease adding vendor-specific
> > functionality to VFIO devices in the future by allowing another module
> > to provide the pci_driver that can set up a number of details before
> > registering to the VFIO subsystem (such as injecting its own operations).
> 
> I'm a bit confused about the change from v1 to v2, especially about
> how to inject module-specific operations. From a live migration p.o.v.
> it may require at least two hook points for some devices (e.g. i40e
> in Yan's original example):

IMHO, it was too soon to give up on putting the vfio_device_ops in the
final driver - we should try to define a reasonable public/private
split of vfio_pci_device as is the norm in the kernel. No reason we
can't achieve that.

>  register a migration region and intercept guest writes to specific
> registers. [PATCH 4/9] demonstrates the former but not the latter
> (which is allowed in v1).

And this is why the ROI of wrapping every vfio op in a PCI op just to
keep vfio_pci_device completely private is poor :(

> Then another question. Once we have this framework in place, do we 
> mandate this approach for any vendor specific tweak or still allow
> doing it as vfio_pci_core extensions (such as igd and zdev in this
> series)?

I would say no to any further vfio_pci_core extensions that are tied
to specific PCI devices. Things like zdev are platform features, they
are not tied to specific PCI devices.

> If the latter, what are the criteria to judge which way is desired? Also what 
> about the scenarios where we just want one-time vendor information, 
> e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
> the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
> Do we expect to create a module for each device to provide such info?
> Having those questions answered is helpful for better understanding of
> this proposal IMO. 
> 
> [1] 
> https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/

SVA is a platform feature, so no problem. Don't see a vfio-pci change
in here?

> [2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

This one could have been done as a broadcom_vfio_pci driver. Not sure
exposing the entire config space unprotected is safe; it is hard to know
what the device has put in there, and whether it is secure to share with a
guest...

> MDEV core is already a well defined subsystem to connect mdev
> bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).

mdev is two things

 - a driver core bus layer and sysfs that makes a lifetime model
 - a vfio bus driver that doesn't do anything but forward ops to the
   main ops

> vfio-mdev is just the channel to bring VFIO APIs through mdev core
> to the underlying vendor specific mdev device driver, which is already
> granted flexibility to tweak whatever it needs through mdev_parent_ops.

This is the second thing, and it could just be deleted. The actual
final mdev driver can just use vfio_device_ops directly. The
redirection shim in vfio_mdev.c doesn't add value.
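
In other words, instead of filling in mdev_parent_ops and letting
vfio_mdev.c forward each call, the end driver would register its own
vfio_device_ops. A sketch against the vfio_device_ops interface of the
time (the mlx5 callback names are hypothetical):

static const struct vfio_device_ops mlx5_mdev_ops = {
        .name    = "mlx5-mdev",
        .open    = mlx5_mdev_open,      /* was mdev_parent_ops->open */
        .release = mlx5_mdev_release,   /* was mdev_parent_ops->release */
        .read    = mlx5_mdev_read,
        .write   = mlx5_mdev_write,
        .ioctl   = mlx5_mdev_ioctl,
        .mmap    = mlx5_mdev_mmap,
};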

> Then what exact extension is being talked about here by creating another
> subsystem module? Or are we talking about some general library which can
> be shared by underlying mdev device drivers to reduce duplicated
> emulation code?

IMHO it is more a design philosophy that the end driver should
implement the vfio_device_ops directly vs having a stack of ops
structs.

Jason


RE: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-09 Thread Tian, Kevin
Hi, Max,

> From: Max Gurtovoy 
> Sent: Tuesday, February 2, 2021 12:28 AM
> 
> Hi Alex and Cornelia,
> 
> This series splits the vfio_pci driver into 2 parts: a pci driver and a
> subsystem driver that will also be a library of code. The pci driver,
> vfio_pci.ko, will be used as before and will bind to the subsystem
> driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
> is fully backward compatible. This is typical Linux subsystem
> framework behaviour. This framework can also be adopted by vfio_mdev
> devices, as we'll see in the sketch below.
> 
> This series comes to solve the issues raised in the previous
> attempt to extend vfio-pci for vendor-specific
> functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.
> 
> This solution is also deterministic in the sense that when a user
> binds to a vendor-specific vfio-pci driver, it will get all the special
> goodies of the HW.
> 
> This subsystem framework will also ease adding vendor-specific
> functionality to VFIO devices in the future by allowing another module
> to provide the pci_driver that can set up a number of details before
> registering to the VFIO subsystem (such as injecting its own operations).

I'm a bit confused about the change from v1 to v2, especially about
how to inject module-specific operations. From a live migration p.o.v.
it may require at least two hook points for some devices (e.g. i40e
in Yan's original example): register a migration region and intercept
guest writes to specific registers. [PATCH 4/9] demonstrates the
former but not the latter (which is allowed in v1). I saw some concerns
about exposing struct vfio_pci_device outside of vfio-pci-core in v1,
which seem to be the main reason driving this change. But I'm still
curious to know how we plan to address this requirement (allowing a
vendor driver to tweak specific vfio_device_ops) moving forward.

Then another question. Once we have this framework in place, do we 
mandate this approach for any vendor specific tweak or still allow
doing it as vfio_pci_core extensions (such as igd and zdev in this series)? 
If the latter, what are the criteria to judge which way is desired? Also what 
about the scenarios where we just want one-time vendor information, 
e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
Do we expect to create a module for each device to provide such info?
Having those questions answered is helpful for better understanding of
this proposal IMO. 

[1] 
https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/
[2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

> 
> Below we can see the proposed changes (this patchset only deals with
> VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
> as well):
> 
> +------------------------------+
> |                              |
> |             VFIO             |
> |                              |
> +------------------------------+
> 
> +--------------------------+    +--------------------------+
> |                          |    |                          |
> |      VFIO_PCI_CORE       |    |      VFIO_MDEV_CORE      |
> |                          |    |                          |
> +--------------------------+    +--------------------------+
> 
> +--------------+ +---------------+ +-------------+ +----------------+
> |              | |               | |             | |                |
> |              | |               | |             | |                |
> |   VFIO_PCI   | | MLX5_VFIO_PCI | |  VFIO_MDEV  | | MLX5_VFIO_MDEV |
> |              | |               | |             | |                |
> |              | |               | |             | |                |
> +--------------+ +---------------+ +-------------+ +----------------+
> 

The VFIO_PCI part is clear, but I didn't get how this concept is applied
to VFIO_MDEV. The mdev sub-system looks like below in my mind:

+------------------------------+
|                              |
|             VFIO             |
|                              |
+------------------------------+

+--------------------------+    +--------------------------+
|                          |    |                          |
|      VFIO_PCI_CORE       |    |        VFIO_MDEV         |
|                          |    |                          |
+--------------------------+    +--------------------------+


[PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-01 Thread Max Gurtovoy
Hi Alex and Cornelia,

This series splits the vfio_pci driver into 2 parts: a pci driver and a
subsystem driver that will also be a library of code. The pci driver,
vfio_pci.ko, will be used as before and will bind to the subsystem
driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
is fully backward compatible. This is typical Linux subsystem
framework behaviour. This framework can also be adopted by vfio_mdev
devices, as we'll see in the sketch below.

This series comes to solve the issues raised in the previous
attempt to extend vfio-pci for vendor-specific
functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.

This solution is also deterministic in the sense that when a user
binds to a vendor-specific vfio-pci driver, it will get all the special
goodies of the HW.
 
This subsystem framework will also ease adding vendor-specific
functionality to VFIO devices in the future by allowing another module
to provide the pci_driver that can set up a number of details before
registering to the VFIO subsystem (such as injecting its own operations).
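
As a rough sketch of that flow (vfio_pci_core_register_device() stands in
for whatever entry point vfio-pci-core ends up exporting; the mlx5 names
and helpers are hypothetical):

static int mlx5_vfio_pci_probe(struct pci_dev *pdev,
                               const struct pci_device_id *id)
{
        struct mlx5_vfio_pci_device *vdev;
        int ret;

        vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
        if (!vdev)
                return -ENOMEM;

        /* vendor-specific setup before registration, e.g. wiring up
         * the migration region handlers */
        mlx5_vfio_pci_init_migration(vdev);

        ret = vfio_pci_core_register_device(&vdev->core, pdev);
        if (ret)
                kfree(vdev);
        return ret;
}

static struct pci_driver mlx5_vfio_pci_driver = {
        .name     = "mlx5-vfio-pci",
        .id_table = mlx5_vfio_pci_table,        /* the matched PFs */
        .probe    = mlx5_vfio_pci_probe,
        .remove   = mlx5_vfio_pci_remove,
};
module_pci_driver(mlx5_vfio_pci_driver);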

Below we can see the proposed changes (this patchset only deals with
VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
as well):

+------------------------------+
|                              |
|             VFIO             |
|                              |
+------------------------------+

+--------------------------+    +--------------------------+
|                          |    |                          |
|      VFIO_PCI_CORE       |    |      VFIO_MDEV_CORE      |
|                          |    |                          |
+--------------------------+    +--------------------------+

+--------------+ +---------------+ +-------------+ +----------------+
|              | |               | |             | |                |
|              | |               | |             | |                |
|   VFIO_PCI   | | MLX5_VFIO_PCI | |  VFIO_MDEV  | | MLX5_VFIO_MDEV |
|              | |               | |             | |                |
|              | |               | |             | |                |
+--------------+ +---------------+ +-------------+ +----------------+

The first 3 patches introduce the above changes for vfio_pci and
vfio_pci_core.

Patch (4/9) introduces a new mlx5 vfio-pci module that registers to the
VFIO subsystem using vfio_pci_core. It also registers to the Auxiliary bus
to bind to mlx5_core, which is the parent of mlx5-vfio-pci devices. This
will allow extending mlx5-vfio-pci devices with HW-specific features
such as Live Migration (the mlx5_core patches are not part of this series,
which only proposes the changes needed in the vfio pci subsystem).

For our testing and development we used 2 VirtIO-BLK physical functions
based on NVIDIA's Bluefield-2 SNAP technology. These 2 PCI functions were
enumerated as 08:00.0 and 09:00.0 by the hypervisor.

The Bluefield-2 device driver, mlx5_core, will create an auxiliary device
for each VirtIO-BLK SNAP PF (the "parent"/"manager" of each VirtIO-BLK PF
will actually issue the auxiliary device creation).

These auxiliary devices will be seen on the Auxiliary bus as:
mlx5_core.vfio_pci.2048 -> 
../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
mlx5_core.vfio_pci.2304 -> 
../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304
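
Publishing such a device follows the standard auxiliary bus pattern; a
sketch, with the mlx5_core embedding and the release callback assumed:

static int mlx5_publish_vfio_pci_adev(struct auxiliary_device *adev,
                                      struct device *parent, u32 id)
{
        int ret;

        adev->name = "vfio_pci";
        adev->id = id;                          /* e.g. 2048 for 08:00.0 */
        adev->dev.parent = parent;
        adev->dev.release = mlx5_vfio_pci_adev_release; /* hypothetical */

        ret = auxiliary_device_init(adev);
        if (ret)
                return ret;

        /* appears on the bus as mlx5_core.vfio_pci.<id> */
        return auxiliary_device_add(adev);
}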

2048 represents BDF 08:00.0 (parent is 0000:07:00.0, Bluefield-2 p0) and
2304 represents BDF 09:00.0 (parent is 0000:07:00.1, Bluefield-2 p1) in
decimal view. In this manner, the administrator will be able to locate the
correct vfio-pci module to bind the desired BDF to (by finding the module
according to the Auxiliary driver of that BDF).
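
For reference, the decimal id is just the 16-bit bus/devfn value of the
function (the kernel's PCI_DEVID(bus, devfn)): 08:00.0 gives
(0x08 << 8) | 0x00 = 0x0800 = 2048, and 09:00.0 gives 0x0900 = 2304.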

Note: The discovery mechanism we used for the RFC might need some
  improvements as mentioned in the TODO list.
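
For context, the usual way an administrator steers a specific BDF to a
non-default vfio driver is driver_override (standard sysfs knobs; the
driver name below is hypothetical):

# echo 0000:08:00.0 > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
# echo mlx5-vfio-pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
# echo 0000:08:00.0 > /sys/bus/pci/drivers_probe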

In this way, we'll use the HW vendor driver core to manage the lifecycle
of these devices. This is reasonable since only the vendor driver knows
exactly the status of its internal state and the capabilities of its
accelerators, for example.

changes from v1:
 - Create private and public vfio-pci structures (From Alex)
 - register to vfio core directly from vfio-pci-core (for now, expose
   minimal public functionality to vfio pci drivers. This will remove the
   implicit behaviour from v1. More power to the drivers can be added in
   the future)
 - Add patch 3/9 to emphasize the needed extension for the LM feature (From
   Cornelia)
 - take/release refcount for the pci module during core open/release
 - update nvlink, igd and zdev to PowerNV, X86 and s390 extensions for
   vfio-pci core
 - fix segfault bugs in current vfio-pci zdev code

TODOs:
1.