Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
On Wed, Feb 10, 2021 at 09:34:52AM -0400, Jason Gunthorpe wrote:
> > I'm a bit confused about the change from v1 to v2, especially about
> > how to inject module specific operations. From live migration p.o.v
> > it may require two hook points at least for some devices (e.g. i40e
> > in original Yan's example):
>
> IMHO, it was too soon to give up on putting the vfio_device_ops in the
> final driver - we should try to define a reasonable public/private
> split of vfio_pci_device as is the norm in the kernel. No reason we
> can't achieve that.
>
> > register a migration region and intercept guest writes to specific
> > registers. [PATCH 4/9] demonstrates the former but not the latter
> > (which is allowed in v1).
>
> And this is why, the ROI to wrapper every vfio op in a PCI op just to
> keep vfio_pci_device completely private is poor :(

Yes.  If Alex has a strong preference to keep some values private, a
split between vfio_pci_device and vfio_pci_device_priv might be doable,
but it is somewhat silly.

> > Then another question. Once we have this framework in place, do we
> > mandate this approach for any vendor specific tweak or still allow
> > doing it as vfio_pci_core extensions (such as igd and zdev in this
> > series)?
>
> I would say no to any further vfio_pci_core extensions that are tied
> to specific PCI devices. Things like zdev are platform features, they
> are not tied to specific PCI devices

Yes, ZDEV is just a special case of exposing extra information for any
PCI device on s390. It does not fit any split up vfio_pci framework.
In fact I wonder why it even has its own config option.

> > vfio-mdev is just the channel to bring VFIO APIs through mdev core
> > to underlying vendor specific mdev device driver, which is already
> > granted flexibility to tweak whatever needs through mdev_parent_ops.
>
> This is the second thing, and it could just be deleted. The actual
> final mdev driver can just use vfio_device_ops directly. The
> redirection shim in vfio_mdev.c doesn't add value.

Yes, that would simplify a lot of things.
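For readers who have not seen it, the public/private split being called "somewhat silly" above would look roughly like the sketch below: a small public struct shared with vendor drivers, with the rest of the state hidden behind an opaque pointer. All names here are illustrative and not taken from the series.

```c
#include <linux/pci.h>
#include <linux/types.h>

/*
 * Public part, visible to vendor drivers; the private part is only an
 * opaque forward declaration here. All names are hypothetical.
 */
struct example_pci_device_priv;

struct example_pci_device {
	struct pci_dev *pdev;			/* fields a vendor driver may use */
	struct example_pci_device_priv *priv;	/* core-only state, hidden */
};

/* The full definition lives only inside the core module's private header. */
struct example_pci_device_priv {
	bool needs_reset;	/* ...whatever the core keeps to itself... */
};
```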
Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
On Wed, Feb 10, 2021 at 09:37:46AM -0700, Alex Williamson wrote:
> > > register a migration region and intercept guest writes to specific
> > > registers. [PATCH 4/9] demonstrates the former but not the latter
> > > (which is allowed in v1).
> >
> > And this is why, the ROI to wrapper every vfio op in a PCI op just to
> > keep vfio_pci_device completely private is poor :(
>
> Says someone who doesn't need to maintain the core, fixing bugs and
> adding features, while not breaking vendor driver touching private data
> in unexpected ways ;)

Said as someone that maintains a driver subsystem 25x larger than VFIO
that is really experienced in "crazy things drivers do". :)

Private/public is rarely at the top of my worries, and I'm confident to
say this is the general kernel philosophy. Again, look anywhere, we
rarely have private data split out of major structs like struct device,
struct netdev, struct pci_device, etc.

This data has to be public because we are using C and we expect inline
functions, container_of() and so on to work. It is rarely done with
hidden structs.

If we can get private data in some places it is a nice win, but not
worth making a mess to achieve. eg I would not give up the normal
container_of pattern just to obscure some struct, the overall ROI is
bad.

Drivers always do unexpected and crazy things, I wouldn't get fixated
on touching private data as the worst sin a driver can do :(

So, no, I don't agree that exposing a struct vfio_pci_device is the end
of the world - it is normal in the kernel to do this kind of thing, and
yes drivers can do crazy things with that if crazy slips past the
review process.

Honestly I expect people to test their drivers and fix things if a core
change broke them. It happens, QA finds it, it gets fixed, normal stuff
for Linux, IMHO.

> > > Then what exact extension is talked here by creating another subsystem
> > > module? or are we talking about some general library which can be
> > > shared by underlying mdev device drivers to reduce duplicated
> > > emulation code?
> >
> > IMHO it is more a design philosophy that the end driver should
> > implement the vfio_device_ops directly vs having a stack of ops
> > structs.
> Like Kevin though, I don't really understand the hand-wave
> application to mdev. Sure, vfio-mdev could be collapsed now that
> we've rejected that there could be other drivers binding to mdev
> devices,

Again, I think the point Max was trying to make is only that vfio_mdev
can follow the same design as proposed here and replace the
mdev_parent_ops with the vfio_device_ops.

Jason
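As an aside for readers less familiar with the idiom Jason is defending: container_of() only works when the embedded struct's layout is visible to the vendor driver, which is why public struct definitions are the norm. A minimal, purely illustrative sketch (none of these type or function names come from the series):

```c
#include <linux/kernel.h>	/* container_of() */
#include <linux/types.h>

/* Stand-in for a public core struct; real names/fields are up to the series. */
struct example_core_device {
	void __iomem *barmap;
	int num_regions;
};

/* A vendor driver wraps the core struct in its own, larger struct... */
struct example_vendor_device {
	struct example_core_device core;	/* embedded by value */
	u32 migration_state;			/* vendor-private extra state */
};

/* ...and recovers its own struct from a pointer handed out by the core. */
static inline struct example_vendor_device *
to_vendor_device(struct example_core_device *core)
{
	return container_of(core, struct example_vendor_device, core);
}
```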
Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
On Wed, 10 Feb 2021 09:34:52 -0400
Jason Gunthorpe wrote:

> On Wed, Feb 10, 2021 at 07:52:08AM +, Tian, Kevin wrote:
> > > This subsystem framework will also ease on adding vendor specific
> > > functionality to VFIO devices in the future by allowing another module
> > > to provide the pci_driver that can set up a number of details before
> > > registering to the VFIO subsystem (such as injecting its own operations).
> >
> > I'm a bit confused about the change from v1 to v2, especially about
> > how to inject module specific operations. From live migration p.o.v
> > it may require two hook points at least for some devices (e.g. i40e
> > in original Yan's example):
>
> IMHO, it was too soon to give up on putting the vfio_device_ops in the
> final driver - we should try to define a reasonable public/private
> split of vfio_pci_device as is the norm in the kernel. No reason we
> can't achieve that.
>
> > register a migration region and intercept guest writes to specific
> > registers. [PATCH 4/9] demonstrates the former but not the latter
> > (which is allowed in v1).
>
> And this is why, the ROI to wrapper every vfio op in a PCI op just to
> keep vfio_pci_device completely private is poor :(

Says someone who doesn't need to maintain the core, fixing bugs and
adding features, while not breaking vendor driver touching private data
in unexpected ways ;)

> > Then another question. Once we have this framework in place, do we
> > mandate this approach for any vendor specific tweak or still allow
> > doing it as vfio_pci_core extensions (such as igd and zdev in this
> > series)?
>
> I would say no to any further vfio_pci_core extensions that are tied
> to specific PCI devices. Things like zdev are platform features, they
> are not tied to specific PCI devices
>
> > If the latter, what is the criteria to judge which way is desired? Also
> > what about the scenarios where we just want one-time vendor information,
> > e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
> > the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
> > Do we expect to create a module for each device to provide such info?
> > Having those questions answered is helpful for better understanding of
> > this proposal IMO.
> >
> > [1]
> > https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/
>
> SVA is a platform feature, so no problem. Don't see a vfio-pci change
> in here?
>
> > [2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/
>
> This one could have been done as a broadcom_vfio_pci driver. Not sure
> exposing the entire config space unprotected is safe, hard to know
> what the device has put in there, and if it is secure to share with a
> guest..

The same argument could be made about the whole device, not just config
space. In fact we know that devices like GPUs provide access to config
space through other address spaces, I/O port and MMIO BARs. Config
space is only the architected means to manipulate the link interface
and device features, we can't determine if it's the only means.

Any emulation or access restrictions we put in config space are meant
to make the device work better for the user (ex. re-using host kernel
device quirks) or prevent casual misconfiguration (ex. incompatible mps
settings, BAR registers), the safety of assigning a device to a user is
only as good as the isolation and error handling that the platform
allows.

> > MDEV core is already a well defined subsystem to connect mdev
> > bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).
>
> mdev is two things
>
>  - a driver core bus layer and sysfs that makes a lifetime model
>  - a vfio bus driver that doesn't do anything but forward ops to the
>    main ops
>
> > vfio-mdev is just the channel to bring VFIO APIs through mdev core
> > to underlying vendor specific mdev device driver, which is already
> > granted flexibility to tweak whatever needs through mdev_parent_ops.
>
> This is the second thing, and it could just be deleted. The actual
> final mdev driver can just use vfio_device_ops directly. The
> redirection shim in vfio_mdev.c doesn't add value.
>
> > Then what exact extension is talked here by creating another subsystem
> > module? or are we talking about some general library which can be
> > shared by underlying mdev device drivers to reduce duplicated
> > emulation code?
>
> IMHO it is more a design philosophy that the end driver should
> implement the vfio_device_ops directly vs having a stack of ops
> structs.

And that's where turning vfio-pci into a vfio-pci-core library comes
in, lowering the bar for a vendor driver to re-use what vfio-pci has
already done. This doesn't change the basic vfio architecture that
already allows any vendor driver to register with its own
vfio_device_ops, it's just code-reuse, which only makes sense if we're
not just shifting the support burden from the vendor driver to the new
Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
On Wed, Feb 10, 2021 at 07:52:08AM +, Tian, Kevin wrote:
> > This subsystem framework will also ease on adding vendor specific
> > functionality to VFIO devices in the future by allowing another module
> > to provide the pci_driver that can set up a number of details before
> > registering to the VFIO subsystem (such as injecting its own operations).
>
> I'm a bit confused about the change from v1 to v2, especially about
> how to inject module specific operations. From live migration p.o.v
> it may require two hook points at least for some devices (e.g. i40e
> in original Yan's example):

IMHO, it was too soon to give up on putting the vfio_device_ops in the
final driver - we should try to define a reasonable public/private
split of vfio_pci_device as is the norm in the kernel. No reason we
can't achieve that.

> register a migration region and intercept guest writes to specific
> registers. [PATCH 4/9] demonstrates the former but not the latter
> (which is allowed in v1).

And this is why, the ROI to wrapper every vfio op in a PCI op just to
keep vfio_pci_device completely private is poor :(

> Then another question. Once we have this framework in place, do we
> mandate this approach for any vendor specific tweak or still allow
> doing it as vfio_pci_core extensions (such as igd and zdev in this
> series)?

I would say no to any further vfio_pci_core extensions that are tied
to specific PCI devices. Things like zdev are platform features, they
are not tied to specific PCI devices

> If the latter, what is the criteria to judge which way is desired? Also
> what about the scenarios where we just want one-time vendor information,
> e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
> the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
> Do we expect to create a module for each device to provide such info?
> Having those questions answered is helpful for better understanding of
> this proposal IMO.
>
> [1]
> https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/

SVA is a platform feature, so no problem. Don't see a vfio-pci change
in here?

> [2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

This one could have been done as a broadcom_vfio_pci driver. Not sure
exposing the entire config space unprotected is safe, hard to know
what the device has put in there, and if it is secure to share with a
guest..

> MDEV core is already a well defined subsystem to connect mdev
> bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).

mdev is two things

 - a driver core bus layer and sysfs that makes a lifetime model
 - a vfio bus driver that doesn't do anything but forward ops to the
   main ops

> vfio-mdev is just the channel to bring VFIO APIs through mdev core
> to underlying vendor specific mdev device driver, which is already
> granted flexibility to tweak whatever needs through mdev_parent_ops.

This is the second thing, and it could just be deleted. The actual
final mdev driver can just use vfio_device_ops directly. The
redirection shim in vfio_mdev.c doesn't add value.

> Then what exact extension is talked here by creating another subsystem
> module? or are we talking about some general library which can be
> shared by underlying mdev device drivers to reduce duplicated
> emulation code?

IMHO it is more a design philosophy that the end driver should
implement the vfio_device_ops directly vs having a stack of ops
structs.

Jason
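For context, "implement the vfio_device_ops directly" means the end driver fills in the ops table that VFIO core calls, instead of routing everything through mdev_parent_ops and the vfio_mdev.c shim. A hedged sketch of what that looks like, using the void *device_data style of vfio_device_ops from that era; the my_mdev_* names are hypothetical:

```c
#include <linux/errno.h>
#include <linux/vfio.h>

static int my_mdev_open(void *device_data)
{
	/* take references and set up per-open emulation state */
	return 0;
}

static void my_mdev_release(void *device_data)
{
	/* undo whatever open() set up */
}

static long my_mdev_ioctl(void *device_data, unsigned int cmd,
			  unsigned long arg)
{
	/* handle VFIO_DEVICE_GET_INFO, region/irq info, etc. */
	return -ENOTTY;
}

static const struct vfio_device_ops my_mdev_vfio_ops = {
	.name		= "my-mdev",
	.open		= my_mdev_open,
	.release	= my_mdev_release,
	.ioctl		= my_mdev_ioctl,
	/* .read, .write, .mmap, .request filled in as needed */
};
```

The driver would then register this table with VFIO (at the time via vfio_add_group_dev()) rather than handing an mdev_parent_ops struct to the mdev layer.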
RE: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
Hi, Max,

> From: Max Gurtovoy
> Sent: Tuesday, February 2, 2021 12:28 AM
>
> Hi Alex and Cornelia,
>
> This series splits the vfio_pci driver into 2 parts: a pci driver and a
> subsystem driver that will also be a library of code. The pci driver,
> vfio_pci.ko, will be used as before and it will bind to the subsystem
> driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
> is fully backward compatible. This is a typical Linux subsystem
> framework behaviour. This framework can also be adopted by vfio_mdev
> devices as we'll see in the below sketch.
>
> This series is coming to solve the issues that were raised in the
> previous attempt for extending vfio-pci for vendor specific
> functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.
>
> This solution is also deterministic in the sense that when a user binds
> to a vendor specific vfio-pci driver, it will get all the special
> goodies of the HW.
>
> This subsystem framework will also ease adding vendor specific
> functionality to VFIO devices in the future by allowing another module
> to provide the pci_driver that can set up a number of details before
> registering to the VFIO subsystem (such as injecting its own operations).

I'm a bit confused about the change from v1 to v2, especially about
how to inject module specific operations. From live migration p.o.v
it may require two hook points at least for some devices (e.g. i40e
in original Yan's example): register a migration region and intercept
guest writes to specific registers. [PATCH 4/9] demonstrates the former
but not the latter (which is allowed in v1).

I saw some concerns about reporting struct vfio_pci_device outside of
vfio-pci-core in v1, which should be the main reason driving this
change. But I'm still curious to know how we plan to address this
requirement (allowing a vendor driver to tweak specific vfio_device_ops)
moving forward.

Then another question. Once we have this framework in place, do we
mandate this approach for any vendor specific tweak or still allow
doing it as vfio_pci_core extensions (such as igd and zdev in this
series)? If the latter, what is the criteria to judge which way is
desired? Also what about the scenarios where we just want one-time
vendor information, e.g. to tell whether a device can tolerate
arbitrary I/O page faults [1] or the offset in VF PCI config space to
put PASID/ATS/PRI capabilities [2]? Do we expect to create a module for
each device to provide such info? Having those questions answered is
helpful for better understanding of this proposal IMO.

[1] https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/
[2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

> Below we can see the proposed changes (this patchset only deals with
> the VFIO_PCI subsystem but it can easily be extended to the VFIO_MDEV
> subsystem as well):
>
>             +-----------------------------------------+
>             |                                         |
>             |                  VFIO                   |
>             |                                         |
>             +-----------------------------------------+
>
>             +------------------+  +------------------+
>             |                  |  |                  |
>             |  VFIO_PCI_CORE   |  |  VFIO_MDEV_CORE  |
>             |                  |  |                  |
>             +------------------+  +------------------+
>
>  +----------+  +---------------+  +-----------+  +----------------+
>  |          |  |               |  |           |  |                |
>  |          |  |               |  |           |  |                |
>  | VFIO_PCI |  | MLX5_VFIO_PCI |  | VFIO_MDEV |  | MLX5_VFIO_MDEV |
>  |          |  |               |  |           |  |                |
>  |          |  |               |  |           |  |                |
>  +----------+  +---------------+  +-----------+  +----------------+
>

The VFIO_PCI part is clear but I didn't get how this concept is applied
to VFIO_MDEV. The mdev sub-system looks like below in my mind:

            +-----------------------------------------+
            |                                         |
            |                  VFIO                   |
            |                                         |
            +-----------------------------------------+

            +------------------+  +------------------+
            |                  |  |                  |
            |  VFIO_PCI_CORE   |  |    VFIO_MDEV     |
            |                  |  |                  |
            +------------------+  +------------------+

 +----------+  +---------------+
[PATCH v2 0/9] Introduce vfio-pci-core subsystem
Hi Alex and Cornelia,

This series splits the vfio_pci driver into 2 parts: a pci driver and a
subsystem driver that will also be a library of code. The pci driver,
vfio_pci.ko, will be used as before and it will bind to the subsystem
driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
is fully backward compatible. This is a typical Linux subsystem
framework behaviour. This framework can also be adopted by vfio_mdev
devices as we'll see in the below sketch.

This series is coming to solve the issues that were raised in the
previous attempt for extending vfio-pci for vendor specific
functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.

This solution is also deterministic in the sense that when a user binds
to a vendor specific vfio-pci driver, it will get all the special
goodies of the HW.

This subsystem framework will also ease adding vendor specific
functionality to VFIO devices in the future by allowing another module
to provide the pci_driver that can set up a number of details before
registering to the VFIO subsystem (such as injecting its own operations).

Below we can see the proposed changes (this patchset only deals with the
VFIO_PCI subsystem but it can easily be extended to the VFIO_MDEV
subsystem as well):

            +-----------------------------------------+
            |                                         |
            |                  VFIO                   |
            |                                         |
            +-----------------------------------------+

            +------------------+  +------------------+
            |                  |  |                  |
            |  VFIO_PCI_CORE   |  |  VFIO_MDEV_CORE  |
            |                  |  |                  |
            +------------------+  +------------------+

 +----------+  +---------------+  +-----------+  +----------------+
 |          |  |               |  |           |  |                |
 |          |  |               |  |           |  |                |
 | VFIO_PCI |  | MLX5_VFIO_PCI |  | VFIO_MDEV |  | MLX5_VFIO_MDEV |
 |          |  |               |  |           |  |                |
 |          |  |               |  |           |  |                |
 +----------+  +---------------+  +-----------+  +----------------+

The first 3 patches introduce the above changes for vfio_pci and
vfio_pci_core.

Patch (4/9) introduces a new mlx5 vfio-pci module that registers to the
VFIO subsystem using vfio_pci_core. It also registers to the Auxiliary
bus for binding to mlx5_core, which is the parent of mlx5-vfio-pci
devices. This will allow extending mlx5-vfio-pci devices with HW
specific features such as Live Migration (mlx5_core patches are not part
of this series, which only proposes the changes needed for the vfio pci
subsystem).

For our testing and development we used 2 VirtIO-BLK physical functions
based on NVIDIA's Bluefield-2 SNAP technology. These 2 PCI functions
were enumerated as 08:00.0 and 09:00.0 by the Hypervisor.

The Bluefield-2 device driver, mlx5_core, will create auxiliary devices
for each VirtIO-BLK SNAP PF (the "parent"/"manager" of each VirtIO-BLK
PF will actually issue the auxiliary device creation). These auxiliary
devices will be seen on the Auxiliary bus as:

mlx5_core.vfio_pci.2048 -> ../../../devices/pci:00/:00:02.0/:05:00.0/:06:00.0/:07:00.0/mlx5_core.vfio_pci.2048
mlx5_core.vfio_pci.2304 -> ../../../devices/pci:00/:00:02.0/:05:00.0/:06:00.0/:07:00.1/mlx5_core.vfio_pci.2304

2048 represents BDF 08:00.0 (parent is :07:00.0 Bluefield-2 p0) and 2304
represents BDF 09:00.0 (parent is :07:00.1 Bluefield-2 p1) in decimal
view. In this manner, the administrator will be able to locate the
correct vfio-pci module it should bind the desired BDF to (by finding
the pointer to the module according to the Auxiliary driver of that
BDF).

Note: the discovery mechanism we used for the RFC might need some
improvements as mentioned in the TODO list.

In this way, we'll use the HW vendor driver core to manage the lifecycle
of these devices. This is reasonable since only the vendor driver knows
exactly the status of its internal state and the capabilities of its
accelerators, for example.
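To make the layering in the diagram above concrete, here is a rough sketch of the shape such a vendor-specific vfio-pci module could take: an ordinary pci_driver whose probe() hands the device over to the vfio-pci-core library. This is only an illustration of the structure the cover letter describes; the my_vfio_pci_* names and the example vendor/device IDs are hypothetical, and the actual vfio-pci-core entry points are deliberately left as comments because their names belong to the series itself.

```c
#include <linux/module.h>
#include <linux/pci.h>

static int my_vfio_pci_probe(struct pci_dev *pdev,
			     const struct pci_device_id *id)
{
	/*
	 * A real driver would: allocate per-device state (possibly
	 * embedding the core's public struct), wire in vendor-specific
	 * behaviour such as a migration region, then hand the device to
	 * vfio-pci-core so it gets registered with the VFIO subsystem.
	 */
	return 0;
}

static void my_vfio_pci_remove(struct pci_dev *pdev)
{
	/* unregister from vfio-pci-core / VFIO and free the state */
}

static const struct pci_device_id my_vfio_pci_table[] = {
	{ PCI_DEVICE(0x15b3, 0xcafe) },	/* vendor/device IDs are placeholders */
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, my_vfio_pci_table);

static struct pci_driver my_vfio_pci_driver = {
	.name		= "my-vfio-pci",
	.id_table	= my_vfio_pci_table,
	.probe		= my_vfio_pci_probe,
	.remove		= my_vfio_pci_remove,
};
module_pci_driver(my_vfio_pci_driver);

MODULE_LICENSE("GPL");
```

Binding a specific BDF to such a module would then follow the usual driver_override/new_id flow, which is why the cover letter stresses that the administrator must be able to locate the right module for a given device.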
changes from v1:
 - Create private and public vfio-pci structures (from Alex)
 - Register to vfio core directly from vfio-pci-core (for now, expose
   minimal public functionality to vfio pci drivers. This will remove
   the implicit behaviour from v1. More power to the drivers can be
   added in the future)
 - Add patch 3/9 to emphasize the needed extension for the LM feature
   (from Cornelia)
 - Take/release refcount for the pci module during core open/release
 - Update nvlink, igd and zdev to PowerNV, X86 and s390 extensions for
   vfio-pci core
 - Fix segfault bugs in current vfio-pci zdev code

TODOs:
1.