Re: [RFC net-next v1 1/3] vfio/mdev: Inherit dma masks of parent device

2019-03-08 Thread Alex Williamson
On Fri,  8 Mar 2019 16:07:54 -0600
Parav Pandit  wrote:

> Inherit the dma mask of the parent device in child mdev devices, so that
> the protocol stack can use the right dma mask while doing dma mappings.
> 
> Signed-off-by: Parav Pandit 
> ---
>  drivers/vfio/mdev/mdev_core.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 0212f0e..9b8bdc9 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -315,6 +315,10 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>   mdev->dev.parent  = dev;
>   mdev->dev.bus = &mdev_bus_type;
>   mdev->dev.release = mdev_device_release;
> + mdev->dev.dma_mask = dev->dma_mask;
> + mdev->dev.dma_parms = dev->dma_parms;
> + mdev->dev.coherent_dma_mask = dev->coherent_dma_mask;
> +
>   dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
>   ret = device_register(&mdev->dev);

This seems like a rather large assumption and none of the existing mdev
drivers even make use of DMA ops.  Why shouldn't this be done in
mdev_parent_ops.create?  Thanks,

Alex
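For illustration, the alternative Alex raises — doing the mask inheritance from the vendor driver's mdev_parent_ops.create callback rather than unconditionally in mdev core — can be sketched in plain userspace C. Everything below (struct layouts, function names) is a simplified stand-in, not kernel code:

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel structures involved; the field
 * names mirror struct device, but this is illustrative code only. */
struct fake_device {
    uint64_t *dma_mask;          /* points into the owning driver's storage */
    uint64_t coherent_dma_mask;
};

struct fake_mdev {
    struct fake_device dev;      /* the mdev's own "struct device" */
    struct fake_device *parent;  /* the physical parent device */
};

/* Hypothetical vendor create callback: only a driver that actually
 * performs DMA on behalf of the mdev opts in to inheriting the masks. */
static int vendor_create(struct fake_mdev *mdev)
{
    mdev->dev.dma_mask = mdev->parent->dma_mask;
    mdev->dev.coherent_dma_mask = mdev->parent->coherent_dma_mask;
    return 0;
}
```

The point of the placement is policy: mdev core cannot know which mdev types do DMA through the parent, but the vendor driver's create callback can.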


Re: [PATCH v7 9/9] vfio/type1: Handle different mdev isolation type

2019-03-07 Thread Alex Williamson
On Thu, 7 Mar 2019 00:44:54 -0800
Neo Jia  wrote:

> On Fri, Feb 22, 2019 at 10:19:27AM +0800, Lu Baolu wrote:
> > This adds the support to determine the isolation type
> > of a mediated device group by checking whether it has
> > an iommu device. If an iommu device exists, an iommu
> > domain will be allocated and then attached to the iommu
> > device. Otherwise, the existing behavior is kept.
> > 
> > Cc: Ashok Raj 
> > Cc: Jacob Pan 
> > Cc: Kevin Tian 
> > Signed-off-by: Sanjay Kumar 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Lu Baolu 
> > Reviewed-by: Jean-Philippe Brucker 
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 48 -
> >  1 file changed, 41 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index ccc4165474aa..f1392c582a3c 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -1368,13 +1368,40 @@ static void vfio_iommu_detach_group(struct vfio_domain *domain,
> > iommu_detach_group(domain->domain, group->iommu_group);
> >  }
> >
> 
> Hi Baolu,
> 
> To allow IOMMU-aware mdev, I think you need to modify the
> vfio_iommu_type1_pin_pages and vfio_iommu_type1_unpin_pages to remove the
> iommu->external_domain check.
> 
> Could you please include that in your patch? If not, I can send out a separate
> patch to address that issue.

I figured it was intentional that an IOMMU backed mdev would not use
the pin/unpin interface and therefore the existing -EINVAL returns would
be correct.  Can you elaborate on the use case for still requiring the
mdev pin/unpin interface for these devices?  Thanks,

Alex


Re: [RFC v1 0/2] vfio/pci: wrap pci device as mdev with vfio-pci driver

2019-03-07 Thread Alex Williamson
On Sun,  3 Mar 2019 20:57:59 +0800
"Liu, Yi L"  wrote:

> This patchset aims to add a vfio-pci-like meta driver on existing
> PCI devices, as a demo user of the vfio changes introduced in
> "vfio/mdev: IOMMU aware mediated device" patchset from Baolu Lu.
> 
> To build such a meta driver. We have two choices.
> a) add a vfio-pci alike sample driver under samples directory
> b) add some extensions in vfio-pci driver to make it wrap pci
>device as mdev
> 
> For choice a), the new sample driver would duplicate quite a
> bit of code from the vfio-pci driver, since the sample driver
> also wants to virtualize the PCI config space. So this choice
> may bring extra maintenance effort into the kernel, and it also
> looks strange to carry a bunch of code duplicated from the
> vfio-pci driver.
> 
> For choice b), it may reuse the existing vfio-pci driver by
> adding a new working mode. With this mode, a user can wrap a
> pci device as a mediated device by binding it to the vfio-pci
> driver running in the new mode. This can then be used to verify
> the "vfio/mdev: IOMMU aware mediated device" patchset.
> 
> This patchset follows choice b). However, we are open on the
> direction of the implementation of this vfio-pci-like meta
> driver. Please feel free to give your suggestions.

Thanks for doing this Yi!  Rather than a module option for vfio-pci,
what about having this build into a separate module (ex.
vfio-pci-mdev)?  Then we could test "regular" vfio-pci alongside
mdev-wrapped devices simply by which driver we bind, and it'd probably be more
friendly to existing users, like libvirt.  This might also make a good
base driver for experimenting with device specific mdev migration as
well.  Thanks,

Alex


[GIT PULL] VFIO updates for v5.1-rc1

2019-02-22 Thread Alex Williamson
Hi Linus,

Sorry for the duplicate, botched the [GIT PULL] first time around.
This is for the next merge window.  Thanks!

The following changes since commit 8834f5600cf3c8db365e18a3d5cac2c2780c81e5:

  Linux 5.0-rc5 (2019-02-03 13:48:04 -0800)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.1-rc1

for you to fetch changes up to 0cfd027be1d6def4a462cdc180c055143af24069:

  vfio_pci: Enable memory accesses before calling pci_map_rom (2019-02-18 14:57:50 -0700)


VFIO updates for v5.1-rc1

 - Switch mdev to generic UUID API (Andy Shevchenko)

 - Fixup platform reset include paths (Masahiro Yamada)

 - Fix usage of MINORMASK (Chengguang Xu)

 - Remove noise from duplicate spapr table unsets (Alexey Kardashevskiy)

 - Restore device state after PM reset (Alex Williamson)

 - Ensure memory translation enabled for PCI ROM access (Eric Auger)


Alex Williamson (1):
  vfio/pci: Restore device state on PM transition

Alexey Kardashevskiy (1):
  vfio/spapr_tce: Skip unsetting already unset table

Andy Shevchenko (1):
  vfio-mdev: Switch to use new generic UUID API

Chengguang Xu (4):
  vfio: expand minor range when registering chrdev region
  samples/vfio-mdev/mbochs: expand minor range when registering chrdev region
  samples/vfio-mdev/mdpy: expand minor range when registering chrdev region
  samples/vfio-mdev/mtty: expand minor range when registering chrdev region

Eric Auger (1):
  vfio_pci: Enable memory accesses before calling pci_map_rom

Masahiro Yamada (1):
  vfio: platform: reset: fix up include directives to remove ccflags-y

 drivers/vfio/mdev/mdev_core.c  | 16 ++--
 drivers/vfio/mdev/mdev_private.h   |  5 +-
 drivers/vfio/mdev/mdev_sysfs.c |  6 +-
 drivers/vfio/pci/vfio_pci.c| 90 ++
 drivers/vfio/pci/vfio_pci_config.c |  2 +-
 drivers/vfio/pci/vfio_pci_private.h|  6 ++
 drivers/vfio/platform/reset/Makefile   |  2 -
 .../vfio/platform/reset/vfio_platform_amdxgbe.c|  2 +-
 .../vfio/platform/reset/vfio_platform_bcmflexrm.c  |  2 +-
 .../platform/reset/vfio_platform_calxedaxgmac.c|  2 +-
 drivers/vfio/vfio.c|  8 +-
 drivers/vfio/vfio_iommu_spapr_tce.c|  3 +-
 include/linux/mdev.h   |  2 +-
 samples/vfio-mdev/mbochs.c |  8 +-
 samples/vfio-mdev/mdpy.c   |  8 +-
 samples/vfio-mdev/mtty.c   | 17 ++--
 16 files changed, 125 insertions(+), 54 deletions(-)


Re: [PATCH v7 7/9] vfio/mdev: Add iommu related member in mdev_device

2019-02-22 Thread Alex Williamson
On Fri, 22 Feb 2019 06:34:38 -0800
Christoph Hellwig  wrote:

> On Fri, Feb 22, 2019 at 10:19:25AM +0800, Lu Baolu wrote:
> > A parent device might create different types of mediated
> > devices. For example, a mediated device could be created
> > by the parent device with full isolation and protection
> > provided by the IOMMU. One usage case could be found on
> > Intel platforms where a mediated device is an assignable
> > subset of a PCI, the DMA requests on behalf of it are all
> > tagged with a PASID. Since IOMMU supports PASID-granular
> > translations (scalable mode in VT-d 3.0), this mediated
> > device could be individually protected and isolated by an
> > IOMMU.
> > 
> > This patch adds a new member in the struct mdev_device to
> > indicate that the mediated device represented by mdev could
> > be isolated and protected by attaching a domain to a device
> > represented by mdev->iommu_device. It also adds a helper to
> > add or set the iommu device.
> > 
> > * mdev_device->iommu_device
> >   - This, if set, indicates that the mediated device could
> > be fully isolated and protected by IOMMU via attaching
> > an iommu domain to this device. If empty, it indicates
> > vendor-defined isolation, hence the IOMMU is bypassed.
> > 
> > * mdev_set/get_iommu_device(dev, iommu_device)
> >   - Set or get the iommu device which represents this mdev
> > in the IOMMU's device scope. Drivers don't need to set the
> > iommu device if the mdev uses vendor-defined isolation.
> > 
> > Cc: Ashok Raj 
> > Cc: Jacob Pan 
> > Cc: Kevin Tian 
> > Cc: Liu Yi L 
> > Suggested-by: Kevin Tian 
> > Suggested-by: Alex Williamson 
> > Signed-off-by: Lu Baolu 
> > Reviewed-by: Jean-Philippe Brucker 
> > ---
> >  drivers/vfio/mdev/mdev_core.c| 18 ++
> >  drivers/vfio/mdev/mdev_private.h |  1 +
> >  include/linux/mdev.h | 14 ++
> >  3 files changed, 33 insertions(+)
> > 
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index 0212f0ee8aea..9be58d392d2b 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -390,6 +390,24 @@ int mdev_device_remove(struct device *dev, bool force_remove)
> > return 0;
> >  }
> >  
> > +int mdev_set_iommu_device(struct device *dev, struct device *iommu_device)
> > +{
> > +   struct mdev_device *mdev = to_mdev_device(dev);
> > +
> > +   mdev->iommu_device = iommu_device;
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL(mdev_set_iommu_device);  
> 
> As said before, please make all new mdev/vfio exports EXPORT_SYMBOL_GPL
> to fit the other exports in vfio.

Well...

$ grep EXPORT_SYMBOL drivers/vfio/mdev/*
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_parent_dev);
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_get_drvdata);
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_set_drvdata);
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_dev);
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_from_dev);
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_uuid);
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_register_device);
drivers/vfio/mdev/mdev_core.c:EXPORT_SYMBOL(mdev_unregister_device);
drivers/vfio/mdev/mdev_driver.c:EXPORT_SYMBOL_GPL(mdev_bus_type);
drivers/vfio/mdev/mdev_driver.c:EXPORT_SYMBOL(mdev_register_driver);
drivers/vfio/mdev/mdev_driver.c:EXPORT_SYMBOL(mdev_unregister_driver);

For better or worse, the mdev interface does allow non-GPL vendor
drivers.  This export seems consistent with that, it's a simple
association allowing the vendor driver to define an IOMMU API backing
device for an mdev device.  I don't think this association implies
sufficient operational knowledge to require a GPL symbol and it's been
requested for use by one of those non-GPL mdev vendor drivers,
therefore I support this definition.  Thanks,

Alex


Re: [PATCH v3] vfio_pci: Enable memory accesses before calling pci_map_rom

2019-02-19 Thread Alex Williamson
On Fri, 15 Feb 2019 17:16:06 +0100
Eric Auger  wrote:

> pci_map_rom()/pci_get_rom_size() perform memory accesses in the ROM.
> In case the Memory Space accesses were disabled, readw() is likely
> to trigger a synchronous external abort on some platforms.
> 
> In case memory accesses were disabled, re-enable them before the
> call and disable them back again just after.
> 
> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> 
> Signed-off-by: Eric Auger 
> Suggested-by: Alex Williamson 

Applied to vfio next branch for v5.1.  Thanks,

Alex

> ---
> v2 -> v3:
> - follow Alex re-writing
> 
> v1 -> v2:
> - also re-enable in case of error
> ---
>  drivers/vfio/pci/vfio_pci.c | 25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index ff60bd1ea587..4b0d30f1eabc 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -708,6 +708,7 @@ static long vfio_pci_ioctl(void *device_data,
>   {
>   void __iomem *io;
>   size_t size;
> + u16 orig_cmd;
>  
>   info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
>   info.flags = 0;
> @@ -723,15 +724,23 @@ static long vfio_pci_ioctl(void *device_data,
>   break;
>   }
>  
> - /* Is it really there? */
> - io = pci_map_rom(pdev, &size);
> - if (!io || !size) {
> - info.size = 0;
> - break;
> - }
> - pci_unmap_rom(pdev, io);
> + /*
> +  * Is it really there?  Enable memory decode for
> +  * implicit access in pci_map_rom().
> +  */
> + pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
> + pci_write_config_word(pdev, PCI_COMMAND,
> +   orig_cmd | PCI_COMMAND_MEMORY);
>  
> - info.flags = VFIO_REGION_INFO_FLAG_READ;
> + io = pci_map_rom(pdev, &size);
> + if (io) {
> + info.flags = VFIO_REGION_INFO_FLAG_READ;
> + pci_unmap_rom(pdev, io);
> + } else {
> + info.size = 0;
> + }
> +
> + pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
>   break;
>   }
>   case VFIO_PCI_VGA_REGION_INDEX:



[PATCH v2] PCI: Fix "try" semantics of bus and slot reset

2019-02-18 Thread Alex Williamson
The commit referenced below introduced device locking around save and
restore of state for each device during a PCI bus "try" reset, making
it decidedly non-"try" and prone to deadlock in the event that a device
is already locked.  Restore __pci_reset_bus() and __pci_reset_slot()
to their advertised locking semantics by pushing the save and restore
functions into the branch where the entire tree is already locked.
Extend the helper function names with "_locked" and update the comment
to reflect this calling requirement.

Fixes: b014e96d1abb ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()")
Signed-off-by: Alex Williamson 
---
 drivers/pci/pci.c |   54 ++---
 1 file changed, 26 insertions(+), 28 deletions(-)

v2: White space only fix suggested by Myron Stowe, removing an additional
empty line from __pci_reset_slot() after the restore call is moved.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c25acace7d91..2fb149216cde 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5058,39 +5058,42 @@ static int pci_slot_trylock(struct pci_slot *slot)
return 0;
 }
 
-/* Save and disable devices from the top of the tree down */
-static void pci_bus_save_and_disable(struct pci_bus *bus)
+/*
+ * Save and disable devices from the top of the tree down while holding
+ * the @dev mutex lock for the entire tree.
+ */
+static void pci_bus_save_and_disable_locked(struct pci_bus *bus)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list) {
-   pci_dev_lock(dev);
pci_dev_save_and_disable(dev);
-   pci_dev_unlock(dev);
if (dev->subordinate)
-   pci_bus_save_and_disable(dev->subordinate);
+   pci_bus_save_and_disable_locked(dev->subordinate);
}
 }
 
 /*
- * Restore devices from top of the tree down - parent bridges need to be
- * restored before we can get to subordinate devices.
+ * Restore devices from top of the tree down while holding @dev mutex lock
+ * for the entire tree.  Parent bridges need to be restored before we can
+ * get to subordinate devices.
  */
-static void pci_bus_restore(struct pci_bus *bus)
+static void pci_bus_restore_locked(struct pci_bus *bus)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list) {
-   pci_dev_lock(dev);
pci_dev_restore(dev);
-   pci_dev_unlock(dev);
if (dev->subordinate)
-   pci_bus_restore(dev->subordinate);
+   pci_bus_restore_locked(dev->subordinate);
}
 }
 
-/* Save and disable devices from the top of the tree down */
-static void pci_slot_save_and_disable(struct pci_slot *slot)
+/*
+ * Save and disable devices from the top of the tree down while holding
+ * the @dev mutex lock for the entire tree.
+ */
+static void pci_slot_save_and_disable_locked(struct pci_slot *slot)
 {
struct pci_dev *dev;
 
@@ -5099,26 +5102,25 @@ static void pci_slot_save_and_disable(struct pci_slot *slot)
continue;
pci_dev_save_and_disable(dev);
if (dev->subordinate)
-   pci_bus_save_and_disable(dev->subordinate);
+   pci_bus_save_and_disable_locked(dev->subordinate);
}
 }
 
 /*
- * Restore devices from top of the tree down - parent bridges need to be
- * restored before we can get to subordinate devices.
+ * Restore devices from top of the tree down while holding @dev mutex lock
+ * for the entire tree.  Parent bridges need to be restored before we can
+ * get to subordinate devices.
  */
-static void pci_slot_restore(struct pci_slot *slot)
+static void pci_slot_restore_locked(struct pci_slot *slot)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &slot->bus->devices, bus_list) {
if (!dev->slot || dev->slot != slot)
continue;
-   pci_dev_lock(dev);
pci_dev_restore(dev);
-   pci_dev_unlock(dev);
if (dev->subordinate)
-   pci_bus_restore(dev->subordinate);
+   pci_bus_restore_locked(dev->subordinate);
}
 }
 
@@ -5177,17 +5179,15 @@ static int __pci_reset_slot(struct pci_slot *slot)
if (rc)
return rc;
 
-   pci_slot_save_and_disable(slot);
-
if (pci_slot_trylock(slot)) {
+   pci_slot_save_and_disable_locked(slot);
might_sleep();
rc = pci_reset_hotplug_slot(slot->hotplug, 0);
+   pci_slot_restore_locked(slot);
pci_slot_unlock(slot);
} else
rc = -EAGAIN;
 
-   pci_slot_restore(slot);
-
return rc;
 }
 

[PATCH] PCI: Fix "try" semantics of bus and slot reset

2019-02-15 Thread Alex Williamson
The commit referenced below introduced device locking around save and
restore of state for each device during a PCI bus "try" reset, making
it decidedly non-"try" and prone to deadlock in the event that a device
is already locked.  Restore __pci_reset_bus() and __pci_reset_slot()
to their advertised locking semantics by pushing the save and restore
functions into the branch where the entire tree is already locked.
Extend the helper function names with "_locked" and update the comment
to reflect this calling requirement.

Fixes: b014e96d1abb ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()")
Signed-off-by: Alex Williamson 
---
 drivers/pci/pci.c |   53 ++---
 1 file changed, 26 insertions(+), 27 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c25acace7d91..870834cb6672 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5058,39 +5058,42 @@ static int pci_slot_trylock(struct pci_slot *slot)
return 0;
 }
 
-/* Save and disable devices from the top of the tree down */
-static void pci_bus_save_and_disable(struct pci_bus *bus)
+/*
+ * Save and disable devices from the top of the tree down while holding
+ * the @dev mutex lock for the entire tree.
+ */
+static void pci_bus_save_and_disable_locked(struct pci_bus *bus)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list) {
-   pci_dev_lock(dev);
pci_dev_save_and_disable(dev);
-   pci_dev_unlock(dev);
if (dev->subordinate)
-   pci_bus_save_and_disable(dev->subordinate);
+   pci_bus_save_and_disable_locked(dev->subordinate);
}
 }
 
 /*
- * Restore devices from top of the tree down - parent bridges need to be
- * restored before we can get to subordinate devices.
+ * Restore devices from top of the tree down while holding @dev mutex lock
+ * for the entire tree.  Parent bridges need to be restored before we can
+ * get to subordinate devices.
  */
-static void pci_bus_restore(struct pci_bus *bus)
+static void pci_bus_restore_locked(struct pci_bus *bus)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list) {
-   pci_dev_lock(dev);
pci_dev_restore(dev);
-   pci_dev_unlock(dev);
if (dev->subordinate)
-   pci_bus_restore(dev->subordinate);
+   pci_bus_restore_locked(dev->subordinate);
}
 }
 
-/* Save and disable devices from the top of the tree down */
-static void pci_slot_save_and_disable(struct pci_slot *slot)
+/*
+ * Save and disable devices from the top of the tree down while holding
+ * the @dev mutex lock for the entire tree.
+ */
+static void pci_slot_save_and_disable_locked(struct pci_slot *slot)
 {
struct pci_dev *dev;
 
@@ -5099,26 +5102,25 @@ static void pci_slot_save_and_disable(struct pci_slot *slot)
continue;
pci_dev_save_and_disable(dev);
if (dev->subordinate)
-   pci_bus_save_and_disable(dev->subordinate);
+   pci_bus_save_and_disable_locked(dev->subordinate);
}
 }
 
 /*
- * Restore devices from top of the tree down - parent bridges need to be
- * restored before we can get to subordinate devices.
+ * Restore devices from top of the tree down while holding @dev mutex lock
+ * for the entire tree.  Parent bridges need to be restored before we can
+ * get to subordinate devices.
  */
-static void pci_slot_restore(struct pci_slot *slot)
+static void pci_slot_restore_locked(struct pci_slot *slot)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &slot->bus->devices, bus_list) {
if (!dev->slot || dev->slot != slot)
continue;
-   pci_dev_lock(dev);
pci_dev_restore(dev);
-   pci_dev_unlock(dev);
if (dev->subordinate)
-   pci_bus_restore(dev->subordinate);
+   pci_bus_restore_locked(dev->subordinate);
}
 }
 
@@ -5177,16 +5179,15 @@ static int __pci_reset_slot(struct pci_slot *slot)
if (rc)
return rc;
 
-   pci_slot_save_and_disable(slot);
-
if (pci_slot_trylock(slot)) {
+   pci_slot_save_and_disable_locked(slot);
might_sleep();
rc = pci_reset_hotplug_slot(slot->hotplug, 0);
+   pci_slot_restore_locked(slot);
pci_slot_unlock(slot);
} else
rc = -EAGAIN;
 
-   pci_slot_restore(slot);
 
return rc;
 }
@@ -5273,17 +5274,15 @@ static int __pci_reset_bus(struct pci_bus *bus)
if (rc)
return rc;
 
-   pci_bus_

Re: [PATCH v2] vfio_pci: Enable memory accesses before calling pci_map_rom

2019-02-14 Thread Alex Williamson
On Thu, 14 Feb 2019 19:27:15 +0100
Auger Eric  wrote:

> Hi Alex,
> 
> On 2/13/19 6:52 PM, Alex Williamson wrote:
> > On Wed, 13 Feb 2019 12:06:10 +0100
> > Eric Auger  wrote:
> >   
> >> pci_map_rom/pci_get_rom_size() performs memory access in the ROM.
> >> In case the Memory Space accesses were disabled, readw() is likely to
> >> crash the host with a synchronous external abort (aarch64).  
> > 
> > As implied in response to Konrad, the likeliness really depends on the
> > whole platform, not just the CPU architecture.  It's a class of
> > problems that depends on OS control or error handling, which we simply
> > don't have on many systems.  But we can fix this instance of it.  
> 
> Agreed, I just hit this issue on one specific aarch64 machine
> >   
> >> In case memory accesses were disabled, re-enable them before the call
> >> and disable them back again just after.
> >>
> >> Signed-off-by: Eric Auger   
> > 
> > This has been around since the beginning, but maybe a Fixes tag would
> > be useful:
> > 
> > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")  
> OK
> >   
> >>
> >> ---
> >>
> >> v1 -> v2:
> >> - also re-enable in case of error
> >> ---
> >>  drivers/vfio/pci/vfio_pci.c | 17 -
> >>  1 file changed, 16 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index ff60bd1ea587..721aa55424a4 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -706,8 +706,10 @@ static long vfio_pci_ioctl(void *device_data,
> >>break;
> >>case VFIO_PCI_ROM_REGION_INDEX:
> >>{
> >> +  bool mem_access_disabled;
> >>void __iomem *io;
> >>size_t size;
> >> +  u16 cmd;
> >>  
> >>info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> >>info.flags = 0;
> >> @@ -723,15 +725,28 @@ static long vfio_pci_ioctl(void *device_data,
> >>break;
> >>}
> >>  
> >> +  pci_read_config_word(pdev, PCI_COMMAND, &cmd);
> >> +  mem_access_disabled = !(cmd & PCI_COMMAND_MEMORY);
> >> +  if (mem_access_disabled) {
> >> +  cmd |= PCI_COMMAND_MEMORY;
> >> +  pci_write_config_word(pdev, PCI_COMMAND, cmd);
> >> +  }
> >> +
> >>/* Is it really there? */
> >>io = pci_map_rom(pdev, &size);
> >>if (!io || !size) {
> >>info.size = 0;
> >> -  break;
> >> +  goto rom_info_out;
> >>}
> >>pci_unmap_rom(pdev, io);
> >>  
> >>info.flags = VFIO_REGION_INFO_FLAG_READ;
> >> +rom_info_out:
> >> +  if (mem_access_disabled) {
> >> +  cmd &= ~PCI_COMMAND_MEMORY;
> >> +  pci_write_config_word(pdev, PCI_COMMAND, cmd);
> >> +  }
> >> +
> >>break;
> >>}
> >>case VFIO_PCI_VGA_REGION_INDEX:  
> > 
> > I don't think we need to be so timid about the command register and we
> > can also avoid the goto by modifying the test (testing io and size in
> > the original is probably overly paranoid), perhaps simply:  
> Yes looks fine.
> 
> Do you want to respin or do you prefer I do?

Please take it, test it, and repost it, I haven't tested it at all.
Thanks,

Alex
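For context, the rewrite discussed here is the approach that appears as the v3 patch earlier in this digest: enable memory decode unconditionally around the probe and restore the original command word afterwards. A rough userspace model of that control flow, with local stubs standing in for the real pci_* config-space accessors, might look like:

```c
#include <stdint.h>
#include <stddef.h>

#define PCI_COMMAND_MEMORY 0x2   /* memory decode enable bit */

/* Stub config space: a single command register. */
static uint16_t cmd_reg;

static void read_cmd(uint16_t *val) { *val = cmd_reg; }
static void write_cmd(uint16_t val) { cmd_reg = val; }

/* Stub pci_map_rom(): succeeds only while memory decode is enabled,
 * modelling the platforms where a disabled decode faults instead. */
static void *map_rom(size_t *size)
{
    static char rom[64];
    if (!(cmd_reg & PCI_COMMAND_MEMORY))
        return NULL;
    *size = sizeof(rom);
    return rom;
}

/* Probe whether the ROM is really there; returns 1 if readable.
 * Decode is enabled for the probe and the original command word is
 * restored unconditionally, whatever its initial state was. */
static int probe_rom(void)
{
    uint16_t orig_cmd;
    size_t size = 0;
    void *io;
    int readable = 0;

    read_cmd(&orig_cmd);
    write_cmd(orig_cmd | PCI_COMMAND_MEMORY); /* enable decode */

    io = map_rom(&size);
    if (io && size)
        readable = 1;                         /* real code unmaps here */

    write_cmd(orig_cmd);                      /* restore original state */
    return readable;
}
```

Compared with the v2 patch quoted above, there is no conditional bookkeeping and no goto: the restore path is the same whether or not decode was already enabled.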


Re: [PATCH v6 0/9] vfio/mdev: IOMMU aware mediated device

2019-02-14 Thread Alex Williamson
On Wed, 13 Feb 2019 12:02:52 +0800
Lu Baolu  wrote:

> Hi,
> 
> The Mediated Device is a framework for fine-grained physical device
> sharing across isolated domains. Currently the mdev framework
> is designed to be independent of the platform IOMMU support. As a
> result, the DMA isolation relies on the mdev parent device in a
> vendor-specific way.
> 
> There are several cases where a mediated device could be protected
> and isolated by the platform IOMMU. For example, Intel vt-d rev3.0
> [1] introduces a new translation mode called 'scalable mode', which
> enables PASID-granular translations. The vt-d scalable mode is the
> key ingredient for Scalable I/O Virtualization [2] [3] which allows
> sharing a device in minimal possible granularity (ADI - Assignable
> Device Interface).
> 
> A mediated device backed by an ADI could be protected and isolated
> by the IOMMU since 1) the parent device supports tagging a unique
> PASID to all DMA traffic out of the mediated device; and 2) the DMA
> translation unit (IOMMU) supports the PASID granular translation.
> We can apply IOMMU protection and isolation to this kind of device
> just as we do with an assignable PCI device.
> 
> In order to distinguish the IOMMU-capable mediated devices from those
> which still need to rely on parent devices, this patch set adds one
> new member in struct mdev_device.
> 
> * iommu_device
>   - This, if set, indicates that the mediated device could
> be fully isolated and protected by IOMMU via attaching
> an iommu domain to this device. If empty, it indicates
> using vendor defined isolation.
> 
> Below helpers are added to set and get above iommu device in mdev core
> implementation.
> 
> * mdev_set/get_iommu_device(dev, iommu_device)
>   - Set or get the iommu device which represents this mdev
> in IOMMU's device scope. Drivers don't need to set the
> iommu device if it uses vendor defined isolation.
> 
> The mdev parent device driver could opt-in that the mdev could be
> fully isolated and protected by the IOMMU when the mdev is being
> created by invoking mdev_set_iommu_device() in its @create().
> 
> In the vfio_iommu_type1_attach_group(), a domain allocated through
> iommu_domain_alloc() will be attached to the mdev iommu device if
> an iommu device has been set. Otherwise, the dummy external domain
> will be used and all the DMA isolation and protection are routed to
> parent driver as the result.
> 
> On IOMMU side, a basic requirement is allowing to attach multiple
> domains to a PCI device if the device advertises the capability
> and the IOMMU hardware supports finer granularity translations than
> the normal PCI Source ID based translation.
> 
> As a result, a PCI device could work in two modes: normal mode
> and auxiliary mode. In the normal mode, a pci device is
> isolated at Source ID granularity; the pci device itself could
> be assigned to a user application by attaching a single domain
> to it. In the auxiliary mode, a pci device could be isolated at
> finer granularity, hence subsets of the device could be assigned
> to different user level applications by attaching a different
> domain to each subset.
> 
> Below APIs are introduced in iommu generic layer for aux-domain
> purpose:
> 
> * iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)
>   - Check whether both IOMMU and device support IOMMU aux
> domain feature. Below aux-domain specific interfaces
> are available only after this returns true.
> 
> * iommu_dev_enable/disable_feature(dev, IOMMU_DEV_FEAT_AUX)
>   - Enable/disable device specific aux-domain feature.
> 
> * iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX)
>   - Check whether the aux-domain specific feature is enabled
> or not.
> 
> * iommu_aux_attach_device(domain, dev)
>   - Attaches @domain to @dev in the auxiliary mode. Multiple
> domains could be attached to a single device in the
> auxiliary mode with each domain representing an isolated
> address space for an assignable subset of the device.
> 
> * iommu_aux_detach_device(domain, dev)
>   - Detach @domain which has been attached to @dev in the
> auxiliary mode.
> 
> * iommu_aux_get_pasid(domain, dev)
>   - Return ID used for finer-granularity DMA translation.
> For the Intel Scalable IOV usage model, this will be
> a PASID. The device which supports Scalable IOV needs
> to write this ID to the device register so that DMA
> requests could be tagged with a right PASID prefix.
> 
> For ease of discussion, we sometimes say 'a domain in
> auxiliary mode' or simply 'an auxiliary domain' when a domain is
> attached to a device for finer granularity translations. But we need
> to keep in mind that this doesn't mean there is a different domain
> type. The same domain could be bound to a device for Source ID based
> translation, and bound to another device for finer granularity
> translation at the same time.
> 
> This patch series extends both IOMMU 
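For clarity, the intended call flow through the new APIs quoted above can be sketched as pseudocode. This is not compilable kernel code: error handling is omitted, and `program_device_pasid()` is a hypothetical helper standing in for whatever device-specific register write tags DMA with the PASID.

```c
/*
 * Pseudocode sketch of the aux-domain call flow described above
 * (illustrative only; not compilable as-is).
 */
if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
        int pasid;

        iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);

        /* one domain per assignable subset of the device */
        iommu_aux_attach_device(domain, dev);

        /* tag the subset's DMA with the returned ID (a PASID on
         * Intel Scalable IOV) by writing it to a device register */
        pasid = iommu_aux_get_pasid(domain, dev);
        program_device_pasid(dev, pasid);   /* hypothetical helper */

        /* teardown */
        iommu_aux_detach_device(domain, dev);
        iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
}
```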

Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-13 Thread Alex Williamson
On Tue, 12 Feb 2019 19:26:50 -0500
Daniel Jordan  wrote:

> On Tue, Feb 12, 2019 at 11:41:10AM -0700, Alex Williamson wrote:
> > Daniel Jordan  wrote:  
> > > On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:  
> > > > I haven't looked at this super closely, but how does this stuff work?
> > > > 
> > > > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> > > > 
> > > > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
> > > >
> > > > Otherwise MEMLOCK is really doubled..
> > > 
> > > So this has been a problem for some time, but it's not as easy as adding 
> > > them
> > > together, see [1][2] for a start.
> > > 
> > > The locked_vm/pinned_vm issue definitely needs fixing, but all this 
> > > series is
> > > trying to do is account to the right counter.  
> 
> Thanks for taking a look, Alex.
> 
> > This still makes me nervous because we have userspace dependencies on
> > setting process locked memory.  
> 
> Could you please expand on this?  Trying to get more context.

VFIO is a userspace driver interface and the pinned/locked page
accounting we're doing here is trying to prevent a user from exceeding
their locked memory limits.  Thus a VM management tool or unprivileged
userspace driver needs to have appropriate locked memory limits
configured for their use case.  Currently we do not have a unified
accounting scheme, so if a page is mlock'd by the user and also mapped
through VFIO for DMA, it's accounted twice, these both increment
locked_vm and userspace needs to manage that.  If pinned memory
and locked memory are now two separate buckets and we're only comparing
one of them against the locked memory limit, then it seems we have
effectively doubled the user's locked memory for this use case, as
Jason questioned.  The user could mlock one page and DMA map another,
they're both "locked", but now they only take one slot in each bucket.

If we continue forward with using a separate bucket here, userspace
could infer that accounting is unified and lower the user's locked
memory limit, or exploit the gap that their effective limit might
actually exceed system memory.  In the former case, if we do eventually
correct to compare the total of the combined buckets against the user's
locked memory limits, we'll break users that have adapted their locked
memory limits to meet the apparent needs.  In the latter case, the
inconsistent accounting is potentially an attack vector.

> > There's a user visible difference if we
> > account for them in the same bucket vs separate.  Perhaps we're
> > counting in the wrong bucket now, but if we "fix" that and userspace
> > adapts, how do we ever go back to accounting both mlocked and pinned
> > memory combined against rlimit?  Thanks,  
> 
> PeterZ posted an RFC that addresses this point[1].  It kept pinned_vm and
> locked_vm accounting separate, but allowed the two to be added safely to be
> compared against RLIMIT_MEMLOCK.

Unless I'm incorrect in the concerns above, I don't see how we can
convert vfio before this occurs.
 
> Anyway, until some solution is agreed on, are there objections to converting
> locked_vm to an atomic, to avoid user-visible changes, instead of switching
> locked_vm users to pinned_vm?

Seems that as long as we have separate buckets that are compared
individually to rlimit that we've got problems, it's just a matter of
where they're exposed based on which bucket is used for which
interface.  Thanks,

Alex


Re: [PATCH v2] vfio_pci: Enable memory accesses before calling pci_map_rom

2019-02-13 Thread Alex Williamson
On Wed, 13 Feb 2019 12:06:10 +0100
Eric Auger  wrote:

> pci_map_rom/pci_get_rom_size() performs memory access in the ROM.
> In case the Memory Space accesses were disabled, readw() is likely to
> crash the host with a synchronous external abort (aarch64).

As implied in response to Konrad, the likeliness really depends on the
whole platform, not just the CPU architecture.  It's a class of
problems that depends on OS control or error handling, which we simply
don't have on many systems.  But we can fix this instance of it.

> In case memory accesses were disabled, re-enable them before the call
> and disable them back again just after.
> 
> Signed-off-by: Eric Auger 

This has been around since the beginning, but maybe a Fixes tag would
be useful:

Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")

> 
> ---
> 
> v1 -> v2:
> - also re-enable in case of error
> ---
>  drivers/vfio/pci/vfio_pci.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index ff60bd1ea587..721aa55424a4 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -706,8 +706,10 @@ static long vfio_pci_ioctl(void *device_data,
>   break;
>   case VFIO_PCI_ROM_REGION_INDEX:
>   {
> + bool mem_access_disabled;
>   void __iomem *io;
>   size_t size;
> + u16 cmd;
>  
>   info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
>   info.flags = 0;
> @@ -723,15 +725,28 @@ static long vfio_pci_ioctl(void *device_data,
>   break;
>   }
>  
> + pci_read_config_word(pdev, PCI_COMMAND, &cmd);
> + mem_access_disabled = !(cmd & PCI_COMMAND_MEMORY);
> + if (mem_access_disabled) {
> + cmd |= PCI_COMMAND_MEMORY;
> + pci_write_config_word(pdev, PCI_COMMAND, cmd);
> + }
> +
>   /* Is it really there? */
>   io = pci_map_rom(pdev, &size);
>   if (!io || !size) {
>   info.size = 0;
> - break;
> + goto rom_info_out;
>   }
>   pci_unmap_rom(pdev, io);
>  
>   info.flags = VFIO_REGION_INFO_FLAG_READ;
> +rom_info_out:
> + if (mem_access_disabled) {
> + cmd &= ~PCI_COMMAND_MEMORY;
> + pci_write_config_word(pdev, PCI_COMMAND, cmd);
> + }
> +
>   break;
>   }
>   case VFIO_PCI_VGA_REGION_INDEX:

I don't think we need to be so timid about the command register and we
can also avoid the goto by modifying the test (testing io and size in
the original is probably overly paranoid), perhaps simply:

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index ff60bd1ea587..659b7c1ea8fb 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -708,6 +708,7 @@ static long vfio_pci_ioctl(void *device_data,
{
void __iomem *io;
size_t size;
+   u16 orig_cmd;
 
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.flags = 0;
@@ -723,15 +724,23 @@ static long vfio_pci_ioctl(void *device_data,
break;
}
 
-   /* Is it really there? */
+   /*
+* Is it really there?  Enable memory decode for
+* implicit access in pci_map_rom().
+*/
+   pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
+   pci_write_config_word(pdev, PCI_COMMAND,
+ orig_cmd | PCI_COMMAND_MEMORY);
+
io = pci_map_rom(pdev, &size);
-   if (!io || !size) {
+   if (io) {
+   info.flags = VFIO_REGION_INFO_FLAG_READ;
+   pci_unmap_rom(pdev, io);
+   } else
info.size = 0;
-   break;
-   }
-   pci_unmap_rom(pdev, io);
 
-   info.flags = VFIO_REGION_INFO_FLAG_READ;
+   pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
+
break;
}
case VFIO_PCI_VGA_REGION_INDEX:


Re: [PATCH] vfio_pci: Enable memory accesses before calling pci_map_rom

2019-02-13 Thread Alex Williamson
On Wed, 13 Feb 2019 11:28:21 -0500
Konrad Rzeszutek Wilk  wrote:

> On Wed, Feb 13, 2019 at 11:14:06AM +0100, Eric Auger wrote:
> > pci_map_rom/pci_get_rom_size() performs memory access in the ROM.
> > In case the Memory Space accesses were disabled, readw() is likely to
> > crash the host with a synchronous external abort (aarch64).  
> 
> Ouch. Is there an CVE for this?
> 
> Also I think this can cause x86 machines to blow up.
> 
> See https://xenbits.xen.org/xsa/advisory-120.html

The far more common response to a target abort on x86 is simply a -1
return.  Device assignment quickly becomes infeasible in the general
case (as outlined in the above link by restricting assignment to
SR-IOV VFs only) if we try to claim there is no possible way the
device can trigger a fatal error on the host.  Systems implementing APEI pretty
much guarantee that by escalating device specific faults to fatal
errors and removing the host OS from the error handling path.  Some
platforms will even trigger a fatal error for a DMA outside of the
range mapped for the device by the IOMMU.  We can't probe for this
behavior to restrict the devices, we cannot know how DMA is programmed
for every device in order to babysit it, nor can we guarantee that the
PCI config space command register is the only way a device manages
access to I/O resources hosted on the device.  Restricting user access
to the command register therefore seems like a false sense of security,
potentially with behavioral issues to the user.

It would be great if we always had a hook into the error handling path
such that we could declare this as a user generated fault, kill the
user process, and keep running, but we're limited by the error handling
capabilities of the hardware and the degree to which the
platform/firmware allows OS control of that error handling.  Thanks,

Alex

> > 
> > In case memory accesses were disabled, re-enable them before the call
> > and disable them back again just after.
> > 
> > Signed-off-by: Eric Auger 
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 14 ++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index ff60bd1ea587..96b8bbd909d7 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -706,8 +706,10 @@ static long vfio_pci_ioctl(void *device_data,
> > break;
> > case VFIO_PCI_ROM_REGION_INDEX:
> > {
> > +   bool mem_access_disabled;
> > void __iomem *io;
> > size_t size;
> > +   u16 cmd;
> >  
> > info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > info.flags = 0;
> > @@ -723,6 +725,13 @@ static long vfio_pci_ioctl(void *device_data,
> > break;
> > }
> >  
> > +   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
> > +   mem_access_disabled = !(cmd & PCI_COMMAND_MEMORY);
> > +   if (mem_access_disabled) {
> > +   cmd |= PCI_COMMAND_MEMORY;
> > +   pci_write_config_word(pdev, PCI_COMMAND, cmd);
> > +   }
> > +
> > /* Is it really there? */
> > io = pci_map_rom(pdev, &size);
> > if (!io || !size) {
> > @@ -731,6 +740,11 @@ static long vfio_pci_ioctl(void *device_data,
> > }
> > pci_unmap_rom(pdev, io);
> >  
> > +   if (mem_access_disabled) {
> > +   cmd &= ~PCI_COMMAND_MEMORY;
> > +   pci_write_config_word(pdev, PCI_COMMAND, cmd);
> > +   }
> > +
> > info.flags = VFIO_REGION_INFO_FLAG_READ;
> > break;
> > }
> > -- 
> > 2.20.1
> >   



Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages

2019-02-12 Thread Alex Williamson
On Tue, 12 Feb 2019 17:56:18 +1100
Alexey Kardashevskiy  wrote:

> On 12/02/2019 09:44, Daniel Jordan wrote:
> > Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> > pages"), locked and pinned pages are accounted separately.  The SPAPR
> > TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
> > instead.
> > 
> > pinned_vm recently became atomic and so no longer relies on mmap_sem
> > held as writer: delete.
> > 
> > Signed-off-by: Daniel Jordan 
> > ---
> >  Documentation/vfio.txt  |  6 +--
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++---
> >  2 files changed, 33 insertions(+), 37 deletions(-)
> > 
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > index f1a4d3c3ba0b..fa37d65363f9 100644
> > --- a/Documentation/vfio.txt
> > +++ b/Documentation/vfio.txt
> > @@ -308,7 +308,7 @@ This implementation has some specifics:
> > currently there is no way to reduce the number of calls. In order to 
> > make
> > things faster, the map/unmap handling has been implemented in real mode
> > which provides an excellent performance which has limitations such as
> > -   inability to do locked pages accounting in real time.
> > +   inability to do pinned pages accounting in real time.
> >  
> >  4) According to sPAPR specification, A Partitionable Endpoint (PE) is an 
> > I/O
> > subtree that can be treated as a unit for the purposes of partitioning 
> > and
> > @@ -324,7 +324,7 @@ This implementation has some specifics:
> > returns the size and the start of the DMA window on the PCI bus.
> >  
> > VFIO_IOMMU_ENABLE
> > -   enables the container. The locked pages accounting
> > +   enables the container. The pinned pages accounting
> > is done at this point. This lets user first to know what
> > the DMA window is and adjust rlimit before doing any real job.

I don't know of a ulimit only covering pinned pages, so for
documentation it seems more correct to continue referring to this as
locked page accounting.

> > @@ -454,7 +454,7 @@ This implementation has some specifics:
> >  
> > PPC64 paravirtualized guests generate a lot of map/unmap requests,
> > and the handling of those includes pinning/unpinning pages and updating
> > -   mm::locked_vm counter to make sure we do not exceed the rlimit.
> > +   mm::pinned_vm counter to make sure we do not exceed the rlimit.
> > The v2 IOMMU splits accounting and pinning into separate operations:
> >  
> > - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY 
> > ioctls
> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> > b/drivers/vfio/vfio_iommu_spapr_tce.c
> > index c424913324e3..f47e020dc5e4 100644
> > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > @@ -34,9 +34,11 @@
> >  static void tce_iommu_detach_group(void *iommu_data,
> > struct iommu_group *iommu_group);
> >  
> > -static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> > +static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
> >  {
> > -   long ret = 0, locked, lock_limit;
> > +   long ret = 0;
> > +   s64 pinned;
> > +   unsigned long lock_limit;
> >  
> > if (WARN_ON_ONCE(!mm))
> > return -EPERM;
> > @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct 
> > *mm, long npages)
> > if (!npages)
> > return 0;
> >  
> > -   down_write(&mm->mmap_sem);
> > -   locked = mm->locked_vm + npages;
> > +   pinned = atomic64_add_return(npages, &mm->pinned_vm);
> > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> > +   if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
> > ret = -ENOMEM;
> > -   else
> > -   mm->locked_vm += npages;
> > +   atomic64_sub(npages, &mm->pinned_vm);
> > +   }
> >  
> > -   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> > +   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
> > npages << PAGE_SHIFT,
> > -   mm->locked_vm << PAGE_SHIFT,
> > -   rlimit(RLIMIT_MEMLOCK),
> > -   ret ? " - exceeded" : "");
> > -
> > -   up_write(&mm->mmap_sem);
> > +   atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> > +   rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
> >  
> > return ret;
> >  }
> >  
> > -static void decrement_locked_vm(struct mm_struct *mm, long npages)
> > +static void decrement_pinned_vm(struct mm_struct *mm, long npages)
> >  {
> > if (!mm || !npages)
> > return;
> >  
> > -   down_write(&mm->mmap_sem);
> > -   if (WARN_ON_ONCE(npages > mm->locked_vm))
> > -   npages = mm->locked_vm;
> > -   mm->locked_vm -= npages;
> > -   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
> > +   if (WARN_ON_ONCE

Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-12 Thread Alex Williamson
On Mon, 11 Feb 2019 18:11:53 -0500
Daniel Jordan  wrote:

> On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:  
> > > @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, 
> > > long npage, bool async)
> > >   if (!mm)
> > >   return -ESRCH; /* process exited */
> > >  
> > > - ret = down_write_killable(&mm->mmap_sem);
> > > - if (!ret) {
> > > - if (npage > 0) {
> > > - if (!dma->lock_cap) {
> > > - unsigned long limit;
> > > -
> > > - limit = task_rlimit(dma->task,
> > > - RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > + pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
> > >  
> > > - if (mm->locked_vm + npage > limit)
> > > - ret = -ENOMEM;
> > > - }
> > > + if (npage > 0 && !dma->lock_cap) {
> > > + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> > > + PAGE_SHIFT;  
> > 
> > I haven't looked at this super closely, but how does this stuff work?
> > 
> > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> > 
> > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
> >
> > Otherwise MEMLOCK is really doubled..  
> 
> So this has been a problem for some time, but it's not as easy as adding them
> together, see [1][2] for a start.
> 
> The locked_vm/pinned_vm issue definitely needs fixing, but all this series is
> trying to do is account to the right counter.

This still makes me nervous because we have userspace dependencies on
setting process locked memory.  There's a user visible difference if we
account for them in the same bucket vs separate.  Perhaps we're
counting in the wrong bucket now, but if we "fix" that and userspace
adapts, how do we ever go back to accounting both mlocked and pinned
memory combined against rlimit?  Thanks,

Alex


[PATCH] vfio/pci: Restore device state on PM transition

2019-02-11 Thread Alex Williamson
PCI core handles save and restore of device state around reset, but
when using pci_set_power_state() we can unintentionally trigger a soft
reset of the device, where PCI core only restores the BAR state.  If
we're using vfio-pci's idle D3 support to try to put devices into low
power when unused, this might trigger a reset when the device is woken
for use.  Also power state management by the user, or within a guest,
can put the device into D3 power state with potentially limited
ability to restore the device if it should undergo a reset.  The PCI
spec does not define the extent of a soft reset and many devices
reporting soft reset on D3->D0 transition do not undergo a PCI config
space reset.  It's therefore assumed safe to unconditionally restore
the remainder of the state if the device indicates soft reset
support, even on a user initiated wakeup.

Implement a wrapper in vfio-pci to tag devices reporting PM reset
support, save their state on transitions into D3 and restore on
transitions back to D0.

Reported-by: Alexander Duyck 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |   71 +++
 drivers/vfio/pci/vfio_pci_config.c  |2 -
 drivers/vfio/pci/vfio_pci_private.h |6 +++
 3 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index ff60bd1ea587..84b593b9fb1f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -209,6 +209,57 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
return false;
 }
 
+static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+{
+   struct pci_dev *pdev = vdev->pdev;
+   u16 pmcsr;
+
+   if (!pdev->pm_cap)
+   return;
+
+   pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
+
+   vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
+}
+
+/*
+ * pci_set_power_state() wrapper handling devices which perform a soft reset on
+ * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
+ * restore when returned to D0.  Saved separately from pci_saved_state for use
+ * by PM capability emulation and separately from pci_dev internal saved state
+ * to avoid it being overwritten and consumed around other resets.
+ */
+int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
+{
+   struct pci_dev *pdev = vdev->pdev;
+   bool needs_restore = false, needs_save = false;
+   int ret;
+
+   if (vdev->needs_pm_restore) {
+   if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
+   pci_save_state(pdev);
+   needs_save = true;
+   }
+
+   if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
+   needs_restore = true;
+   }
+
+   ret = pci_set_power_state(pdev, state);
+
+   if (!ret) {
+   /* D3 might be unsupported via quirk, skip unless in D3 */
+   if (needs_save && pdev->current_state >= PCI_D3hot) {
+   vdev->pm_save = pci_store_saved_state(pdev);
+   } else if (needs_restore) {
+   pci_load_and_free_saved_state(pdev, &vdev->pm_save);
+   pci_restore_state(pdev);
+   }
+   }
+
+   return ret;
+}
+
 static int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
struct pci_dev *pdev = vdev->pdev;
@@ -216,7 +267,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
u16 cmd;
u8 msix_pos;
 
-   pci_set_power_state(pdev, PCI_D0);
+   vfio_pci_set_power_state(vdev, PCI_D0);
 
/* Don't allow our initial saved state to include busmaster */
pci_clear_master(pdev);
@@ -407,7 +458,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
vfio_pci_try_bus_reset(vdev);
 
if (!disable_idle_d3)
-   pci_set_power_state(pdev, PCI_D3hot);
+   vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
 static void vfio_pci_release(void *device_data)
@@ -1286,6 +1337,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
vfio_pci_set_vga_decode(vdev, false));
}
 
+   vfio_pci_probe_power_state(vdev);
+
if (!disable_idle_d3) {
/*
 * pci-core sets the device power state to an unknown value at
@@ -1296,8 +1349,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
 * be able to get to D3.  Therefore first do a D0 transition
 * before going to D3.
 */
-   pci_set_power_state(pdev, PCI_D0);
-   pci_set_power_state(pdev, PCI_D3hot);
+   vfio_pci_set_power_state(vdev, PCI

Re: [PATCH V3 2/3] HYPERV/IOMMU: Add Hyper-V stub IOMMU driver

2019-02-07 Thread Alex Williamson
On Thu,  7 Feb 2019 23:33:48 +0800
lantianyu1...@gmail.com wrote:

> From: Lan Tianyu 
> 
> On the bare metal, enabling X2APIC mode requires interrupt remapping
> function which helps to deliver irq to cpu with 32-bit APIC ID.
> Hyper-V doesn't provide interrupt remapping function so far and Hyper-V
> MSI protocol already supports to deliver interrupt to the CPU whose
> virtual processor index is more than 255. IO-APIC interrupt still has
> 8-bit APIC ID limitation.
> 
> This patch is to add Hyper-V stub IOMMU driver in order to enable
> X2APIC mode successfully in Hyper-V Linux guest. The driver returns X2APIC
> interrupt remapping capability when X2APIC mode is available. Otherwise,
> it creates a Hyper-V irq domain to limit IO-APIC interrupts' affinity
> and make sure cpus assigned with IO-APIC interrupt have 8-bit APIC ID.
> 
> Define 24 IO-APIC remapping entries because Hyper-V exposes only a
> single IO-APIC, and one IO-APIC has 24 pins according to the IO-APIC
> spec (https://pdos.csail.mit.edu/6.828/2016/readings/ia32/ioapic.pdf).
> 
> Signed-off-by: Lan Tianyu 
> ---
> Change since v2:
>- Improve comment about why save IO-APIC entry in the irq chip data.
>- Some code improvement.
>- Improve statement in the IOMMU Kconfig.
> 
> Change since v1:
>   - Remove unused pr_fmt
>   - Make ioapic_ir_domain as static variable
>   - Remove unused variables cfg and entry in the 
> hyperv_irq_remapping_alloc()
>   - Fix comments
> ---
>  drivers/iommu/Kconfig |   8 ++
>  drivers/iommu/Makefile|   1 +
>  drivers/iommu/hyperv-iommu.c  | 194 
> ++
>  drivers/iommu/irq_remapping.c |   3 +
>  drivers/iommu/irq_remapping.h |   1 +
>  5 files changed, 207 insertions(+)
>  create mode 100644 drivers/iommu/hyperv-iommu.c
...
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> new file mode 100644
> index 000..d8572c5
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu.c
...
> +static int __init hyperv_prepare_irq_remapping(void)
> +{
> + struct fwnode_handle *fn;
> + int i;
> +
> + if (!hypervisor_is_type(x86_hyper_type) ||
> + !x2apic_supported())
> + return -ENODEV;
> +
> + fn = irq_domain_alloc_named_id_fwnode("HYPERV-IR", 0);
> + if (!fn)
> + return -ENOMEM;
> +
> + ioapic_ir_domain =
> + irq_domain_create_hierarchy(arch_get_ir_parent_domain(),
> + 0, IOAPIC_REMAPPING_ENTRY, fn,
> + &hyperv_ir_domain_ops, NULL);
> +
> + irq_domain_free_fwnode(fn);
> +
> + /*
> +  * Hyper-V doesn't provide irq remapping function for
> +  * IO-APIC and so IO-APIC only accepts 8-bit APIC ID.
> +  * A cpu's APIC ID is read from the ACPI MADT table, and APIC IDs
> +  * in the MADT table on Hyper-V are sorted in monotonically
> +  * increasing order. APIC ID reflects cpu topology. There may be
> +  * some APIC ID gaps when the cpu count in a socket is not a
> +  * power of two. Prepare
> +  * max cpu affinity for IOAPIC irqs. Scan cpu 0-255 and set cpu
> +  * into ioapic_max_cpumask if its APIC ID is less than 256.
> +  */
> + for (i = 0; i < 256; i++)
> + if (cpu_physical_id(i) < 256)
> + cpumask_set_cpu(i, &ioapic_max_cpumask);

This looks sketchy.  What if NR_CPUS is less than 256?  Thanks,

Alex

> +
> + return 0;
> +}


Re: [PATCH] vfio: platform: reset: fix up include directives to remove ccflags-y

2019-02-05 Thread Alex Williamson
On Wed, 30 Jan 2019 11:52:31 +0900
Masahiro Yamada  wrote:

> For an include directive with double quotes "", the preprocessor
> searches for the header relative to the directory of the current file.
> 
> Fix them up, and remove the header search path option.
> 
> Signed-off-by: Masahiro Yamada 
> ---
> 
>  drivers/vfio/platform/reset/Makefile | 2 --
>  drivers/vfio/platform/reset/vfio_platform_amdxgbe.c  | 2 +-
>  drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c| 2 +-
>  drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c | 2 +-
>  4 files changed, 3 insertions(+), 5 deletions(-)

Applied with Eric's Ack to the vfio next branch for v5.1.  Thanks,

Alex

> diff --git a/drivers/vfio/platform/reset/Makefile 
> b/drivers/vfio/platform/reset/Makefile
> index 57abd4f..7294c5e 100644
> --- a/drivers/vfio/platform/reset/Makefile
> +++ b/drivers/vfio/platform/reset/Makefile
> @@ -2,8 +2,6 @@
>  vfio-platform-calxedaxgmac-y := vfio_platform_calxedaxgmac.o
>  vfio-platform-amdxgbe-y := vfio_platform_amdxgbe.o
>  
> -ccflags-y += -Idrivers/vfio/platform
> -
>  obj-$(CONFIG_VFIO_PLATFORM_CALXEDAXGMAC_RESET) += 
> vfio-platform-calxedaxgmac.o
>  obj-$(CONFIG_VFIO_PLATFORM_AMDXGBE_RESET) += vfio-platform-amdxgbe.o
>  obj-$(CONFIG_VFIO_PLATFORM_BCMFLEXRM_RESET) += vfio_platform_bcmflexrm.o
> diff --git a/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c 
> b/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
> index bcd419c..3ddb270 100644
> --- a/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
> +++ b/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
> @@ -25,7 +25,7 @@
>  #include 
>  #include 
>  
> -#include "vfio_platform_private.h"
> +#include "../vfio_platform_private.h"
>  
>  #define DMA_MR   0x3000
>  #define MAC_VR   0x0110
> diff --git a/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c 
> b/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
> index d45c3be..16165a6 100644
> --- a/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
> +++ b/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
> @@ -23,7 +23,7 @@
>  #include 
>  #include 
>  
> -#include "vfio_platform_private.h"
> +#include "../vfio_platform_private.h"
>  
>  /* FlexRM configuration */
>  #define RING_REGS_SIZE   0x1
> diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
> b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> index 49e5df6..e0356de 100644
> --- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> +++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> @@ -24,7 +24,7 @@
>  #include 
>  #include 
>  
> -#include "vfio_platform_private.h"
> +#include "../vfio_platform_private.h"
>  
>  #define DRIVER_VERSION  "0.1"
>  #define DRIVER_AUTHOR   "Eric Auger "



Re: [RFC PATCH 8/8] svm: Allow AVIC with in-kernel irqchip mode

2019-02-05 Thread Alex Williamson
On Mon, 4 Feb 2019 14:42:32 +
"Suthikulpanit, Suravee"  wrote:

> Once the IRQ ack notifier for in-kernel PIT is no longer required
> and run-time AVIC activate/deactivate is supported, we can remove
> the kernel irqchip split mode requirement for AVIC.
> 
> Hence, remove the check for irqchip split mode when enabling AVIC.

Yay!  Could we also at this point make avic enabled by default or are
there remaining incompatibilities?  Thanks,

Alex


> Cc: Radim Krčmář 
> Cc: Paolo Bonzini 
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  arch/x86/kvm/svm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 580ab40ba207..24dfa6a93711 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5157,7 +5157,7 @@ static void svm_set_virtual_apic_mode(struct kvm_vcpu 
> *vcpu)
>  
>  static bool svm_get_enable_apicv(struct kvm_vcpu *vcpu)
>  {
> - return avic && irqchip_split(vcpu->kvm);
> + return avic;
>  }
>  
>  static void svm_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)



Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API

2019-01-29 Thread Alex Williamson
On Fri, 25 Jan 2019 17:49:20 +0100
Auger Eric  wrote:

> Hi Alex,
> 
> On 1/11/19 10:30 PM, Alex Williamson wrote:
> > On Tue,  8 Jan 2019 11:26:14 +0100
> > Eric Auger  wrote:
> >   
> >> From: "Liu, Yi L" 
> >>
> >> In any virtualization use case, when the first translation stage
> >> is "owned" by the guest OS, the host IOMMU driver has no knowledge
> >> of caching structure updates unless the guest invalidation activities
> >> are trapped by the virtualizer and passed down to the host.
> >>
> >> Since the invalidation data are obtained from user space and will be
> >> written into physical IOMMU, we must allow security check at various
> >> layers. Therefore, generic invalidation data format are proposed here,
> >> model specific IOMMU drivers need to convert them into their own format.
> >>
> >> Signed-off-by: Liu, Yi L 
> >> Signed-off-by: Jean-Philippe Brucker 
> >> Signed-off-by: Jacob Pan 
> >> Signed-off-by: Ashok Raj 
> >> Signed-off-by: Eric Auger 
> >>
> >> ---
> >> v1 -> v2:
> >> - add arch_id field
> >> - renamed tlb_invalidate into cache_invalidate as this API allows
> >>   to invalidate context caches on top of IOTLBs
> >>
> >> v1:
> >> renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
> >> header. Commit message reworded.
> >> ---
> >>  drivers/iommu/iommu.c  | 14 ++
> >>  include/linux/iommu.h  | 14 ++
> >>  include/uapi/linux/iommu.h | 95 ++
> >>  3 files changed, 123 insertions(+)
> >>
> >> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >> index 0f2b7f1fc7c8..b2e248770508 100644
> >> --- a/drivers/iommu/iommu.c
> >> +++ b/drivers/iommu/iommu.c
> >> @@ -1403,6 +1403,20 @@ int iommu_set_pasid_table(struct iommu_domain 
> >> *domain,
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
> >>  
> >> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device 
> >> *dev,
> >> + struct iommu_cache_invalidate_info *inv_info)
> >> +{
> >> +  int ret = 0;
> >> +
> >> +  if (unlikely(!domain->ops->cache_invalidate))
> >> +  return -ENODEV;
> >> +
> >> +  ret = domain->ops->cache_invalidate(domain, dev, inv_info);
> >> +
> >> +  return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
> >> +
> >>  static void __iommu_detach_device(struct iommu_domain *domain,
> >>  struct device *dev)
> >>  {
> >> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >> index 1da2a2357ea4..96d59886f230 100644
> >> --- a/include/linux/iommu.h
> >> +++ b/include/linux/iommu.h
> >> @@ -186,6 +186,7 @@ struct iommu_resv_region {
> >>   * @of_xlate: add OF master IDs to iommu grouping
> >>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> >>   * @set_pasid_table: set pasid table
> >> + * @cache_invalidate: invalidate translation caches
> >>   */
> >>  struct iommu_ops {
> >>bool (*capable)(enum iommu_cap);
> >> @@ -231,6 +232,9 @@ struct iommu_ops {
> >>int (*set_pasid_table)(struct iommu_domain *domain,
> >>   struct iommu_pasid_table_config *cfg);
> >>  
> >> +  int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
> >> +  struct iommu_cache_invalidate_info *inv_info);
> >> +
> >>unsigned long pgsize_bitmap;
> >>  };
> >>  
> >> @@ -294,6 +298,9 @@ extern void iommu_detach_device(struct iommu_domain 
> >> *domain,
> >>struct device *dev);
> >>  extern int iommu_set_pasid_table(struct iommu_domain *domain,
> >> struct iommu_pasid_table_config *cfg);
> >> +extern int iommu_cache_invalidate(struct iommu_domain *domain,
> >> +  struct device *dev,
> >> +  struct iommu_cache_invalidate_info *inv_info);
> >>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
> >>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
> >>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> >> @@ -709,6 +716,13 @@ int iommu

[GIT PULL] VFIO fixes for v5.0-rc4

2019-01-25 Thread Alex Williamson
Hi Linus,

The following changes since commit 49a57857aeea06ca831043acbb0fa5e0f50602fd:

  Linux 5.0-rc3 (2019-01-21 13:14:44 +1300)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.0-rc4

for you to fetch changes up to 9a71ac7e15a723e90fc40388b4b92eefaabf747c:

  vfio-pci/nvlink2: Fix ancient gcc warnings (2019-01-23 08:20:43 -0700)


VFIO fixes for v5.0-rc4

 - Cleanup licenses in new files (Thomas Gleixner)

 - Cleanup new compiler warnings (Alexey Kardashevskiy)


Alexey Kardashevskiy (1):
  vfio-pci/nvlink2: Fix ancient gcc warnings

Thomas Gleixner (1):
  vfio/pci: Cleanup license mess

 drivers/vfio/pci/trace.h|  6 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c | 36 
 2 files changed, 17 insertions(+), 25 deletions(-)


Re: [patch 9/9] vfio/pci: Cleanup license mess

2019-01-23 Thread Alex Williamson
On Fri, 18 Jan 2019 00:14:25 +0100
Thomas Gleixner  wrote:

> The recently added nvlink2 VFIO driver introduced a license conflict in two
> files. In both cases the SPDX license identifier is:
> 
>   SPDX-License-Identifier: GPL-2.0+
> 
> but the files contain also the following license boiler plate text:
> 
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
>   * published by the Free Software Foundation
> 
> The latter is GPL-2.0-only and not GPL-2.0-or-later.
> 
> Looking deeper. The nvlink source file is derived from vfio_pci_igd.c which
> is also licensed under GPL-2.0-only and it can be assumed that the file was
> copied and modified. As the original file is licensed GPL-2.0-only it's not
> possible to relicense derivative work to GPL-2.0-or-later.
> 
> Fix the SPDX identifier and remove the boiler plate as it is redundant.
> 
> Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] 
> subdriver")
> Signed-off-by: Thomas Gleixner 
> Cc: Alexey Kardashevskiy 
> Cc: Alex Williamson 
> Cc: Michael Ellerman 
> Cc: k...@vger.kernel.org
> ---
> 
> P.S.: This patch is part of a larger cleanup, but independent of other
>   patches and is intended to be picked up by the maintainer directly.

Applied to vfio for-linus branch for v5.0.  Thanks,

Alex


> ---
>  drivers/vfio/pci/trace.h|6 +-
>  drivers/vfio/pci/vfio_pci_nvlink2.c |6 +-
>  2 files changed, 2 insertions(+), 10 deletions(-)
> 
> --- a/drivers/vfio/pci/trace.h
> +++ b/drivers/vfio/pci/trace.h
> @@ -1,13 +1,9 @@
> -/* SPDX-License-Identifier: GPL-2.0+ */
> +/* SPDX-License-Identifier: GPL-2.0-only */
>  /*
>   * VFIO PCI mmap/mmap_fault tracepoints
>   *
>   * Copyright (C) 2018 IBM Corp.  All rights reserved.
>   * Author: Alexey Kardashevskiy 
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
>   */
>  
>  #undef TRACE_SYSTEM
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -1,14 +1,10 @@
> -// SPDX-License-Identifier: GPL-2.0+
> +// SPDX-License-Identifier: GPL-2.0-only
>  /*
>   * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
>   *
>   * Copyright (C) 2018 IBM Corp.  All rights reserved.
>   * Author: Alexey Kardashevskiy 
>   *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
>   * Register an on-GPU RAM region for cacheable access.
>   *
>   * Derived from original vfio_pci_igd.c:
> 
> 



Re: [PATCH kernel] vfio-pci/nvlink2: Fix ancient gcc warnings

2019-01-22 Thread Alex Williamson
Hi Geert,

The below patch comes about from the build regressions and improvements
list you've sent out, but something doesn't add up that we'd be testing
with an old compiler where initialization with { 0 } generates a
"missing braces around initialization" warning.  Is this really the
case or are we missing something here?  There's no harm that I can see
with Alexey's fix, but are these really just false positives from a
compiler bug that we should selectively ignore if the "fix" is less
clean?  Thanks,

Alex

On Wed, 23 Jan 2019 15:07:11 +1100
Alexey Kardashevskiy  wrote:

> Using the {0} construct as a generic initializer is perfectly fine in C,
> however due to a bug in old gcc there is a warning:
> 
>   + /kisskb/src/drivers/vfio/pci/vfio_pci_nvlink2.c: warning: (near
> initialization for 'cap.header') [-Wmissing-braces]:  => 181:9
> 
> Since for whatever reason we still want to compile the modern kernel
> with such an old gcc without warnings, this changes the capabilities
> initialization.
> 
> The gcc bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 30 ++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index 054a2cf..91d945b 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -178,11 +178,11 @@ static int vfio_pci_nvgpu_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_nvgpu_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 };
> -
> - cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - cap.header.version = 1;
> - cap.tgt = data->gpu_tgt;
> + struct vfio_region_info_cap_nvlink2_ssatgt cap = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
>  
>   return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
>  }
> @@ -365,18 +365,18 @@ static int vfio_pci_npu2_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_npu2_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt captgt = { 0 };
> - struct vfio_region_info_cap_nvlink2_lnkspd capspd = { 0 };
> + struct vfio_region_info_cap_nvlink2_ssatgt captgt = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
> + struct vfio_region_info_cap_nvlink2_lnkspd capspd = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD,
> + .header.version = 1,
> + .link_speed = data->link_speed
> + };
>   int ret;
>  
> - captgt.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - captgt.header.version = 1;
> - captgt.tgt = data->gpu_tgt;
> -
> - capspd.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD;
> - capspd.header.version = 1;
> - capspd.link_speed = data->link_speed;
> -
>   ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt));
>   if (ret)
>   return ret;



Re: [PATCH v1] iommu/s390: Declare s390 iommu reserved regions

2019-01-22 Thread Alex Williamson
On Mon, 21 Jan 2019 12:51:14 +0100
Pierre Morel  wrote:

> On 18/01/2019 14:51, Jean-Philippe Brucker wrote:
> > Hi Pierre,
> > 
> > On 18/01/2019 13:29, Pierre Morel wrote:  
> >> On 17/01/2019 14:02, Robin Murphy wrote:  
> >>> On 15/01/2019 17:37, Pierre Morel wrote:  
>  The s390 iommu can only allow DMA transactions between the zPCI device
>  entries start_dma and end_dma.
>   
> 
> ...
> 
> >>
> >> I already posted a patch retrieving the geometry through
> >> VFIO_IOMMU_GET_INFO using a specific capability for the geometry [1],
> >> and AFAIU, Alex did not agree with this.  
> > 
> > On arm we also need to report the IOMMU geometry to userspace (max IOVA
> > size in particular). Shameer has been working on a solution [2] that
> > presents a unified view of both geometry and reserved regions into the
> > VFIO_IOMMU_GET_INFO call, and I think we should go with that. If I
> > understand correctly it's currently blocked on the RMRR problem and
> > we're waiting for Jacob or Ashok to take a look at it, as Kevin pinged
> > them on thread [1]?
> > 
> > [2] https://lkml.org/lkml/2018/4/18/293
> > 
> > Thanks,
> > Jean
> >   
> 
> Hi Jean,
> 
> I hoped that this proposition went in the same direction based on the 
> following assumptions:
> 
> 
> - The goal of the get_resv_region is defined in iommu.h as:
> -
> * @get_resv_regions: Request list of reserved regions for a device
> -
> 
> - An iommu reserved region is a region which should not be mapped.
> Isn't it exactly what happens outside the aperture?
> Shouldn't it be reported by the iommu reserved region?
> 
> - If we use VFIO and want to get all reserved regions, we will have the 
> VFIO_IOMMU_GET_INFO call provided by Shameer; it can get all reserved 
> regions, depending on the iommu driver itself, at once by calling the 
> get_reserved_region callback instead of having to merge them with the 
> aperture.
> 
> - If there are other reserved regions, depending on the system 
> configuration and not on the IOMMU itself, the VFIO_IOMMU_GET_INFO call 
> will have to merge them with the regions obtained from the iommu driver.
> 
> - If we use neither QEMU nor VFIO at all, AFAIU, the standard way to 
> retrieve the reserved regions associated with a device is to call the 
> get_reserved_region callback from the associated iommu.
> 
> Please tell me where I am wrong.

The existing proposal by Shameer exposes an IOVA list, which includes
ranges that the user _can_ map through the IOMMU, therefore this
original patch is unnecessary, the IOMMU geometry is already the
starting point of the IOVA list, creating the original node, which is
split as necessary to account for the reserved regions.  To me,
presenting a user interface that exposes ranges that _cannot_ be mapped
is a strange interface.  If we started from a position of what
information do we want to provide to the user and how will they consume
it, would we present a list of reserved ranges?  I think the only
argument for the negative ranges is that we already have them in the
kernel, but I don't see that it necessarily makes them the right
solution for userspace.


> >> What is different in what you propose?
> >>
> >> @Alex: I was hoping that this patch goes in your direction. What do you
> >> think?

I think it's unnecessary given that in Shameer's proposal:

 - We start from the IOMMU exposed geometry
 - We present a list of usable IOVA ranges, not a list of reserved
   ranges

Thanks,
Alex


Re: [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type

2019-01-14 Thread Alex Williamson
On Mon, 14 Jan 2019 21:48:06 +0100
Auger Eric  wrote:

> Hi Alex,
> 
> On 1/12/19 12:58 AM, Alex Williamson wrote:
> > On Tue,  8 Jan 2019 11:26:30 +0100
> > Eric Auger  wrote:
> >   
> >> This patch adds a new 64kB region aiming to report nested mode
> >> translation faults.
> >>
> >> The region contains a header with the size of the queue,
> >> the producer and consumer indices and then the actual
> >> fault queue data. The producer is updated by the kernel while
> >> the consumer is updated by the userspace.
> >>
> >> Signed-off-by: Eric Auger 
> >>
> >> ---
> >> ---
> >>  drivers/vfio/pci/vfio_pci.c | 102 +++-
> >>  drivers/vfio/pci/vfio_pci_private.h |   2 +
> >>  include/uapi/linux/vfio.h   |  15 
> >>  3 files changed, 118 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index ff60bd1ea587..2ba181ab2edd 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -56,6 +56,11 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> >>  MODULE_PARM_DESC(disable_idle_d3,
> >> "Disable using the PCI D3 low power state for idle, unused 
> >> devices");
> >>  
> >> +#define VFIO_FAULT_REGION_SIZE 0x10000  
> > 
> > Why 64K?  
> For the region to be mmappable with 64kB page size.

Isn't hard coding 64K just as bad as hard coding 4K?  The kernel knows
what PAGE_SIZE is after all.  Is there some target number of queue
entries here that we could round up to a multiple of PAGE_SIZE?
 
> >> +#define VFIO_FAULT_QUEUE_SIZE \
> >> +  ((VFIO_FAULT_REGION_SIZE - sizeof(struct vfio_fault_region_header)) / \
> >> +  sizeof(struct iommu_fault))
> >> +
> >>  static inline bool vfio_vga_disabled(void)
> >>  {
> >>  #ifdef CONFIG_VFIO_PCI_VGA
> >> @@ -1226,6 +1231,100 @@ static const struct vfio_device_ops vfio_pci_ops = 
> >> {
> >>  static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
> >>  static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
> >>  
> >> +static size_t
> >> +vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf,
> >> +size_t count, loff_t *ppos, bool iswrite)
> >> +{
> >> +  unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> >> +  void *base = vdev->region[i].data;
> >> +  loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +
> >> +  if (pos >= vdev->region[i].size)
> >> +  return -EINVAL;
> >> +
> >> +  count = min(count, (size_t)(vdev->region[i].size - pos));
> >> +
> >> +  if (copy_to_user(buf, base + pos, count))
> >> +  return -EFAULT;
> >> +
> >> +  *ppos += count;
> >> +
> >> +  return count;
> >> +}
> >> +
> >> +static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev,
> >> + struct vfio_pci_region *region,
> >> + struct vm_area_struct *vma)
> >> +{
> >> +  u64 phys_len, req_len, pgoff, req_start;
> >> +  unsigned long long addr;
> >> +  unsigned int index;
> >> +
> >> +  index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> >> +
> >> +  if (vma->vm_end < vma->vm_start)
> >> +  return -EINVAL;
> >> +  if ((vma->vm_flags & VM_SHARED) == 0)
> >> +  return -EINVAL;
> >> +
> >> +  phys_len = VFIO_FAULT_REGION_SIZE;
> >> +
> >> +  req_len = vma->vm_end - vma->vm_start;
> >> +  pgoff = vma->vm_pgoff &
> >> +  ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> >> +  req_start = pgoff << PAGE_SHIFT;
> >> +
> >> +  if (req_start + req_len > phys_len)
> >> +  return -EINVAL;
> >> +
> >> +  addr = virt_to_phys(vdev->fault_region);
> >> +  vma->vm_private_data = vdev;
> >> +  vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
> >> +
> >> +  return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> >> + req_len, vma->vm_page_prot);
> >> +}
> >> +
> >> +void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,
> >> +  

Re: [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type

2019-01-11 Thread Alex Williamson
On Tue,  8 Jan 2019 11:26:30 +0100
Eric Auger  wrote:

> This patch adds a new 64kB region aiming to report nested mode
> translation faults.
> 
> The region contains a header with the size of the queue,
> the producer and consumer indices and then the actual
> fault queue data. The producer is updated by the kernel while
> the consumer is updated by the userspace.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> ---
>  drivers/vfio/pci/vfio_pci.c | 102 +++-
>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>  include/uapi/linux/vfio.h   |  15 
>  3 files changed, 118 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index ff60bd1ea587..2ba181ab2edd 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -56,6 +56,11 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(disable_idle_d3,
>"Disable using the PCI D3 low power state for idle, unused 
> devices");
>  
> +#define VFIO_FAULT_REGION_SIZE 0x10000

Why 64K?

> +#define VFIO_FAULT_QUEUE_SIZE\
> + ((VFIO_FAULT_REGION_SIZE - sizeof(struct vfio_fault_region_header)) / \
> + sizeof(struct iommu_fault))
> +
>  static inline bool vfio_vga_disabled(void)
>  {
>  #ifdef CONFIG_VFIO_PCI_VGA
> @@ -1226,6 +1231,100 @@ static const struct vfio_device_ops vfio_pci_ops = {
>  static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
>  static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
>  
> +static size_t
> +vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf,
> +   size_t count, loff_t *ppos, bool iswrite)
> +{
> + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> + void *base = vdev->region[i].data;
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> + if (pos >= vdev->region[i].size)
> + return -EINVAL;
> +
> + count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> + if (copy_to_user(buf, base + pos, count))
> + return -EFAULT;
> +
> + *ppos += count;
> +
> + return count;
> +}
> +
> +static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev,
> +struct vfio_pci_region *region,
> +struct vm_area_struct *vma)
> +{
> + u64 phys_len, req_len, pgoff, req_start;
> + unsigned long long addr;
> + unsigned int index;
> +
> + index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +
> + if (vma->vm_end < vma->vm_start)
> + return -EINVAL;
> + if ((vma->vm_flags & VM_SHARED) == 0)
> + return -EINVAL;
> +
> + phys_len = VFIO_FAULT_REGION_SIZE;
> +
> + req_len = vma->vm_end - vma->vm_start;
> + pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> + req_start = pgoff << PAGE_SHIFT;
> +
> + if (req_start + req_len > phys_len)
> + return -EINVAL;
> +
> + addr = virt_to_phys(vdev->fault_region);
> + vma->vm_private_data = vdev;
> + vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
> +
> + return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> +req_len, vma->vm_page_prot);
> +}
> +
> +void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,
> + struct vfio_pci_region *region)
> +{
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
> + .rw = vfio_pci_dma_fault_rw,
> + .mmap   = vfio_pci_dma_fault_mmap,
> + .release= vfio_pci_dma_fault_release,
> +};
> +
> +static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
> +{
> + u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
> + VFIO_REGION_INFO_FLAG_MMAP;
> + int ret;
> +
> + spin_lock_init(&vdev->fault_queue_lock);
> +
> + vdev->fault_region = kmalloc(VFIO_FAULT_REGION_SIZE, GFP_KERNEL);
> + if (!vdev->fault_region)
> + return -ENOMEM;
> +
> + ret = vfio_pci_register_dev_region(vdev,
> + VFIO_REGION_TYPE_NESTED,
> + VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION,
> + &vfio_pci_dma_fault_regops, VFIO_FAULT_REGION_SIZE,
> + flags, vdev->fault_region);
> + if (ret) {
> + kfree(vdev->fault_region);
> + return ret;
> + }
> +
> + vdev->fault_region->header.prod = 0;
> + vdev->fault_region->header.cons = 0;
> + vdev->fault_region->header.reserved = 0;

Use kzalloc above or else we're leaking kernel memory to userspace
anyway.

> + vdev->fault_region->header.size = VFIO_FAULT_QUEUE_SIZE;
> + return 0;
> +}
> +
>  static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id 
> *id)
>  {
>   struct vfio_pci_device *vdev;
> @@ -1300,7 +1399,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, 

Re: [RFC v3 06/21] vfio: VFIO_IOMMU_BIND_MSI

2019-01-11 Thread Alex Williamson
On Fri, 11 Jan 2019 16:02:44 -0700
Alex Williamson  wrote:

> On Tue,  8 Jan 2019 11:26:18 +0100
> Eric Auger  wrote:
> 
> > This patch adds the VFIO_IOMMU_BIND_MSI ioctl which aims at
> > passing the guest MSI binding to the host.
> > 
> > Signed-off-by: Eric Auger 
> > 
> > ---
> > 
> > v2 -> v3:
> > - adapt to new proto of bind_guest_msi
> > - directly use vfio_iommu_for_each_dev
> > 
> > v1 -> v2:
> > - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 27 +++
> >  include/uapi/linux/vfio.h   |  7 +++
> >  2 files changed, 34 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index c3ba3f249438..59229f6e2d84 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -1673,6 +1673,15 @@ static int vfio_cache_inv_fn(struct device *dev, 
> > void *data)
> > return iommu_cache_invalidate(d, dev, &ustruct->info);
> >  }
> >  
> > +static int vfio_bind_guest_msi_fn(struct device *dev, void *data)
> > +{
> > +   struct vfio_iommu_type1_bind_guest_msi *ustruct =
> > +   (struct vfio_iommu_type1_bind_guest_msi *)data;
> > +   struct iommu_domain *d = iommu_get_domain_for_dev(dev);
> > +
> > +   return iommu_bind_guest_msi(d, dev, &ustruct->binding);
> > +}
> > +
> >  static int
> >  vfio_set_pasid_table(struct vfio_iommu *iommu,
> >   struct vfio_iommu_type1_set_pasid_table *ustruct)
> > @@ -1792,6 +1801,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >   vfio_cache_inv_fn);
> > mutex_unlock(&iommu->lock);
> > return ret;
> > +   } else if (cmd == VFIO_IOMMU_BIND_MSI) {
> > +   struct vfio_iommu_type1_bind_guest_msi ustruct;
> > +   int ret;
> > +
> > +   minsz = offsetofend(struct vfio_iommu_type1_bind_guest_msi,
> > +   binding);
> > +
> > +   if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> > +   return -EFAULT;
> > +
> > +   if (ustruct.argsz < minsz || ustruct.flags)
> > +   return -EINVAL;
> > +
> > +   mutex_lock(&iommu->lock);
> > +   ret = vfio_iommu_for_each_dev(iommu, &ustruct,
> > + vfio_bind_guest_msi_fn);  
> 
> The vfio_iommu_for_each_dev() interface is fine for invalidation, where
> a partial failure requires no unwind, but it's not sufficiently robust
> here.

Additionally, what happens as devices are added and removed from the
guest?  Are we designing an interface that specifically precludes
hotplug?  Thanks,

Alex
 
> > +   mutex_unlock(&iommu->lock);
> > +   return ret;
> > }
> >  
> > return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 11a07165e7e1..352e795a93c8 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -774,6 +774,13 @@ struct vfio_iommu_type1_cache_invalidate {
> >  };
> >  #define VFIO_IOMMU_CACHE_INVALIDATE  _IO(VFIO_TYPE, VFIO_BASE + 23)
> >  
> > +struct vfio_iommu_type1_bind_guest_msi {
> > +   __u32   argsz;
> > +   __u32   flags;
> > +   struct iommu_guest_msi_binding binding;
> > +};
> > +#define VFIO_IOMMU_BIND_MSI  _IO(VFIO_TYPE, VFIO_BASE + 24)  
> 
> -ENOCOMMENTS  MSIs are set up and torn down, is this only a machine init
> sort of interface?  How does the user un-bind?  Thanks,
> 
> Alex
> 
> > +
> >  /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
> >  
> >  /*  
> 



Re: [RFC v3 06/21] vfio: VFIO_IOMMU_BIND_MSI

2019-01-11 Thread Alex Williamson
On Tue,  8 Jan 2019 11:26:18 +0100
Eric Auger  wrote:

> This patch adds the VFIO_IOMMU_BIND_MSI ioctl which aims at
> passing the guest MSI binding to the host.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v2 -> v3:
> - adapt to new proto of bind_guest_msi
> - directly use vfio_iommu_for_each_dev
> 
> v1 -> v2:
> - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
> ---
>  drivers/vfio/vfio_iommu_type1.c | 27 +++
>  include/uapi/linux/vfio.h   |  7 +++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c3ba3f249438..59229f6e2d84 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1673,6 +1673,15 @@ static int vfio_cache_inv_fn(struct device *dev, void 
> *data)
>   return iommu_cache_invalidate(d, dev, &ustruct->info);
>  }
>  
> +static int vfio_bind_guest_msi_fn(struct device *dev, void *data)
> +{
> + struct vfio_iommu_type1_bind_guest_msi *ustruct =
> + (struct vfio_iommu_type1_bind_guest_msi *)data;
> + struct iommu_domain *d = iommu_get_domain_for_dev(dev);
> +
> + return iommu_bind_guest_msi(d, dev, &ustruct->binding);
> +}
> +
>  static int
>  vfio_set_pasid_table(struct vfio_iommu *iommu,
> struct vfio_iommu_type1_set_pasid_table *ustruct)
> @@ -1792,6 +1801,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> vfio_cache_inv_fn);
>   mutex_unlock(&iommu->lock);
>   return ret;
> + } else if (cmd == VFIO_IOMMU_BIND_MSI) {
> + struct vfio_iommu_type1_bind_guest_msi ustruct;
> + int ret;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_bind_guest_msi,
> + binding);
> +
> + if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (ustruct.argsz < minsz || ustruct.flags)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->lock);
> + ret = vfio_iommu_for_each_dev(iommu, &ustruct,
> +   vfio_bind_guest_msi_fn);

The vfio_iommu_for_each_dev() interface is fine for invalidation, where
a partial failure requires no unwind, but it's not sufficiently robust
here.

> + mutex_unlock(&iommu->lock);
> + return ret;
>   }
>  
>   return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 11a07165e7e1..352e795a93c8 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -774,6 +774,13 @@ struct vfio_iommu_type1_cache_invalidate {
>  };
>  #define VFIO_IOMMU_CACHE_INVALIDATE  _IO(VFIO_TYPE, VFIO_BASE + 23)
>  
> +struct vfio_iommu_type1_bind_guest_msi {
> + __u32   argsz;
> + __u32   flags;
> + struct iommu_guest_msi_binding binding;
> +};
> +#define VFIO_IOMMU_BIND_MSI  _IO(VFIO_TYPE, VFIO_BASE + 24)

-ENOCOMMENTS  MSIs are set up and torn down, is this only a machine init
sort of interface?  How does the user un-bind?  Thanks,

Alex

> +
>  /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
>  
>  /*



Re: [RFC v3 04/21] vfio: VFIO_IOMMU_SET_PASID_TABLE

2019-01-11 Thread Alex Williamson
On Tue,  8 Jan 2019 11:26:16 +0100
Eric Auger  wrote:

> From: "Liu, Yi L" 
> 
> This patch adds VFIO_IOMMU_SET_PASID_TABLE ioctl which aims at
> passing the virtual iommu guest configuration to the VFIO driver
> downto to the iommu subsystem.
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Eric Auger 
> 
> ---
> v2 -> v3:
> - s/BIND_PASID_TABLE/SET_PASID_TABLE
> 
> v1 -> v2:
> - s/BIND_GUEST_STAGE/BIND_PASID_TABLE
> - remove the struct device arg
> ---
>  drivers/vfio/vfio_iommu_type1.c | 31 +++
>  include/uapi/linux/vfio.h   |  8 
>  2 files changed, 39 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 7651cfb14836..d9dd23f64f00 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1644,6 +1644,24 @@ static int vfio_domains_have_iommu_cache(struct 
> vfio_iommu *iommu)
>   return ret;
>  }
>  
> +static int
> +vfio_set_pasid_table(struct vfio_iommu *iommu,
> +   struct vfio_iommu_type1_set_pasid_table *ustruct)
> +{
> + struct vfio_domain *d;
> + int ret = 0;
> +
> + mutex_lock(&iommu->lock);
> +
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + ret = iommu_set_pasid_table(d->domain, &ustruct->config);
> + if (ret)
> + break;
> + }

There's no unwind on failure here, leaves us in an inconsistent state
should something go wrong or domains don't have homogeneous PASID
support.  What's expected to happen if a PASID table is already set for
a domain, does it replace the old one or return -EBUSY?

> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -1714,6 +1732,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>   return copy_to_user((void __user *)arg, &unmap, minsz) ?
>   -EFAULT : 0;
> + } else if (cmd == VFIO_IOMMU_SET_PASID_TABLE) {
> + struct vfio_iommu_type1_set_pasid_table ustruct;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_set_pasid_table,
> + config);
> +
> + if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (ustruct.argsz < minsz || ustruct.flags)
> + return -EINVAL;
> +
> + return vfio_set_pasid_table(iommu, &ustruct);
>   }
>  
>   return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 02bb7ad6e986..0d9f4090c95d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define VFIO_API_VERSION 0
>  
> @@ -759,6 +760,13 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE   _IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +struct vfio_iommu_type1_set_pasid_table {
> + __u32   argsz;
> + __u32   flags;
> + struct iommu_pasid_table_config config;
> +};
> +#define VFIO_IOMMU_SET_PASID_TABLE   _IO(VFIO_TYPE, VFIO_BASE + 22)

-ENOCOMMENTS  Thanks,

Alex

> +
>  /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
>  
>  /*



Re: [RFC v3 03/21] iommu: Introduce bind_guest_msi

2019-01-11 Thread Alex Williamson
On Tue,  8 Jan 2019 11:26:15 +0100
Eric Auger  wrote:

> On ARM, MSI are translated by the SMMU. An IOVA is allocated
> for each MSI doorbell. If both the host and the guest are exposed
> with SMMUs, we end up with 2 different IOVAs allocated by each.
> guest allocates an IOVA (gIOVA) to map onto the guest MSI
> doorbell (gDB). The Host allocates another IOVA (hIOVA) to map
> onto the physical doorbell (hDB).
> 
> So we end up with 2 untied mappings:
>          S1            S2
> gIOVA -> gDB
>              hIOVA -> gDB
                        ^^^ hDB

> Currently the PCI device is programmed by the host with hIOVA
> as MSI doorbell. So this does not work.
> 
> This patch introduces an API to pass gIOVA/gDB to the host so
> that gIOVA can be reused by the host instead of re-allocating
> a new IOVA. So the goal is to create the following nested mapping:
> 
>          S1        S2
> gIOVA -> gDB  ->  hDB
> 
> and program the PCI device with gIOVA MSI doorbell.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v2 -> v3:
> - add a struct device handle
> ---
>  drivers/iommu/iommu.c  | 10 ++
>  include/linux/iommu.h  | 13 +
>  include/uapi/linux/iommu.h |  6 ++
>  3 files changed, 29 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b2e248770508..ea11442e7054 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1431,6 +1431,16 @@ static void __iommu_detach_device(struct iommu_domain 
> *domain,
>   trace_detach_device_from_domain(dev);
>  }
>  
> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
> +  struct iommu_guest_msi_binding *binding)
> +{
> + if (unlikely(!domain->ops->bind_guest_msi))
> + return -ENODEV;
> +
> + return domain->ops->bind_guest_msi(domain, dev, binding);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
> +
>  void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
>  {
>   struct iommu_group *group;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 96d59886f230..244c1a3d5989 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -235,6 +235,9 @@ struct iommu_ops {
>   int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
>   struct iommu_cache_invalidate_info *inv_info);
>  
> + int (*bind_guest_msi)(struct iommu_domain *domain, struct device *dev,
> +   struct iommu_guest_msi_binding *binding);
> +
>   unsigned long pgsize_bitmap;
>  };
>  
> @@ -301,6 +304,9 @@ extern int iommu_set_pasid_table(struct iommu_domain *domain,
>  extern int iommu_cache_invalidate(struct iommu_domain *domain,
>   struct device *dev,
>   struct iommu_cache_invalidate_info *inv_info);
> +extern int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
> + struct iommu_guest_msi_binding *binding);
> +
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -724,6 +730,13 @@ iommu_cache_invalidate(struct iommu_domain *domain,
>   return -ENODEV;
>  }
>  
> +static inline
> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
> +  struct iommu_guest_msi_binding *binding)
> +{
> + return -ENODEV;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #ifdef CONFIG_IOMMU_DEBUGFS
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4605f5cfac84..f28cd9a1aa96 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -142,4 +142,10 @@ struct iommu_cache_invalidate_info {
>   __u64   arch_id;
>   __u64   addr;
>  };
> +
> +struct iommu_guest_msi_binding {
> + __u64   iova;
> + __u64   gpa;
> + __u32   granule;

What's granule?  The size?  This looks a lot like just a stage 1
mapping interface, I can't really figure out from the description how
this matches to any specific MSI mapping.  Zero comments in the code
or headers here about how this is supposed to work.  Thanks,

Alex


Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API

2019-01-11 Thread Alex Williamson
On Tue,  8 Jan 2019 11:26:14 +0100
Eric Auger  wrote:

> From: "Liu, Yi L" 
> 
> In any virtualization use case, when the first translation stage
> is "owned" by the guest OS, the host IOMMU driver has no knowledge
> of caching structure updates unless the guest invalidation activities
> are trapped by the virtualizer and passed down to the host.
> 
> Since the invalidation data are obtained from user space and will be
> written into physical IOMMU, we must allow security checks at various
> layers. Therefore, generic invalidation data formats are proposed here,
> model specific IOMMU drivers need to convert them into their own format.
> 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Eric Auger 
> 
> ---
> v1 -> v2:
> - add arch_id field
> - renamed tlb_invalidate into cache_invalidate as this API allows
>   to invalidate context caches on top of IOTLBs
> 
> v1:
> renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
> header. Commit message reworded.
> ---
>  drivers/iommu/iommu.c  | 14 ++
>  include/linux/iommu.h  | 14 ++
>  include/uapi/linux/iommu.h | 95 ++
>  3 files changed, 123 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 0f2b7f1fc7c8..b2e248770508 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1403,6 +1403,20 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>  
> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> +struct iommu_cache_invalidate_info *inv_info)
> +{
> + int ret = 0;
> +
> + if (unlikely(!domain->ops->cache_invalidate))
> + return -ENODEV;
> +
> + ret = domain->ops->cache_invalidate(domain, dev, inv_info);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 1da2a2357ea4..96d59886f230 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -186,6 +186,7 @@ struct iommu_resv_region {
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>   * @set_pasid_table: set pasid table
> + * @cache_invalidate: invalidate translation caches
>   */
>  struct iommu_ops {
>   bool (*capable)(enum iommu_cap);
> @@ -231,6 +232,9 @@ struct iommu_ops {
>   int (*set_pasid_table)(struct iommu_domain *domain,
>  struct iommu_pasid_table_config *cfg);
>  
> + int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
> + struct iommu_cache_invalidate_info *inv_info);
> +
>   unsigned long pgsize_bitmap;
>  };
>  
> @@ -294,6 +298,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
>   struct device *dev);
>  extern int iommu_set_pasid_table(struct iommu_domain *domain,
>struct iommu_pasid_table_config *cfg);
> +extern int iommu_cache_invalidate(struct iommu_domain *domain,
> + struct device *dev,
> + struct iommu_cache_invalidate_info *inv_info);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -709,6 +716,13 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>  {
>   return -ENODEV;
>  }
> +static inline int
> +iommu_cache_invalidate(struct iommu_domain *domain,
> +struct device *dev,
> +struct iommu_cache_invalidate_info *inv_info)
> +{
> + return -ENODEV;
> +}
>  
>  #endif /* CONFIG_IOMMU_API */
>  
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 7a7cf7a3de7c..4605f5cfac84 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
>   };
>  };
>  
> +/**
> + * enum iommu_inv_granularity - Generic invalidation granularity
> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID: TLB entries or PASID caches of all
> + *                                    PASIDs associated with a domain ID
> + * @IOMMU_INV_GRANU_PASID_SEL:        TLB entries or PASID cache associated
> + *                                    with a PASID and a domain
> + * @IOMMU_INV_GRANU_PAGE_PASID:       TLB entries of selected page range
> + *                                    within a PASID
> + *
> + * When an invalidation request is passed down to IOMMU to flush translation
> + * caches, it may carry different granularity l

Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API

2019-01-11 Thread Alex Williamson
On Tue,  8 Jan 2019 11:26:13 +0100
Eric Auger  wrote:

> From: Jacob Pan 
> 
> In virtualization use case, when a guest is assigned
> a PCI host device, protected by a virtual IOMMU on a guest,
> the physical IOMMU must be programmed to be consistent with
> the guest mappings. If the physical IOMMU supports two
> translation stages it makes sense to program guest mappings
> onto the first stage/level (ARM/VTD terminology) while the host
> owns the stage/level 2.
> 
> In that case, it is mandated to trap on guest configuration
> settings and pass those to the physical iommu driver.
> 
> This patch adds a new API to the iommu subsystem that allows
> to set the pasid table information.
> 
> A generic iommu_pasid_table_config struct is introduced in
> a new iommu.h uapi header. This is going to be used by the VFIO
> user API. We foresee at least two specializations of this struct,
> for PASID table passing and ARM SMMUv3.
> 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> This patch generalizes the API introduced by Jacob & co-authors in
> https://lwn.net/Articles/754331/
> 
> v2 -> v3:
> - replace unbind/bind by set_pasid_table
> - move table pointer and pasid bits in the generic part of the struct
> 
> v1 -> v2:
> - restore the original pasid table name
> - remove the struct device * parameter in the API
> - reworked iommu_pasid_smmuv3
> ---
>  drivers/iommu/iommu.c  | 10 
>  include/linux/iommu.h  | 14 +++
>  include/uapi/linux/iommu.h | 50 ++
>  3 files changed, 74 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3ed4db334341..0f2b7f1fc7c8 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +   struct iommu_pasid_table_config *cfg)
> +{
> + if (unlikely(!domain->ops->set_pasid_table))
> + return -ENODEV;
> +
> + return domain->ops->set_pasid_table(domain, cfg);
> +}
> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e90da6b6f3d1..1da2a2357ea4 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define IOMMU_READ   (1 << 0)
>  #define IOMMU_WRITE  (1 << 1)
> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>   * @domain_window_disable: Disable a particular window for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @set_pasid_table: set pasid table
>   */
>  struct iommu_ops {
>   bool (*capable)(enum iommu_cap);
> @@ -226,6 +228,9 @@ struct iommu_ops {
>   int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>   bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>  
> + int (*set_pasid_table)(struct iommu_domain *domain,
> +struct iommu_pasid_table_config *cfg);
> +
>   unsigned long pgsize_bitmap;
>  };
>  
> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>  struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>   struct device *dev);
> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
> +  struct iommu_pasid_table_config *cfg);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>   return NULL;
>  }
>  
> +static inline
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +   struct iommu_pasid_table_config *cfg)
> +{
> + return -ENODEV;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #ifdef CONFIG_IOMMU_DEBUGFS
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index ..7a7cf7a3de7c
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + *
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as

Re: [PATCH] vfio_pci: set TRACE_INCLUDE_PATH to fix the build error

2019-01-10 Thread Alex Williamson
On Fri, 11 Jan 2019 12:13:35 +1100
Alexey Kardashevskiy  wrote:

> On 11/01/2019 01:47, Steven Rostedt wrote:
> > On Tue, Jan 08, 2019 at 12:08:03PM +0900, Masahiro Yamada wrote:  
> >> ---
> >>
> >>  drivers/vfio/pci/trace.h | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h
> >> index 228ccdb..4d13e51 100644
> >> --- a/drivers/vfio/pci/trace.h
> >> +++ b/drivers/vfio/pci/trace.h
> >> @@ -94,7 +94,7 @@ TRACE_EVENT(vfio_pci_npu2_mmap,
> >>  #endif /* _TRACE_VFIO_PCI_H */
> >>  
> >>  #undef TRACE_INCLUDE_PATH
> >> -#define TRACE_INCLUDE_PATH .
> >> +#define TRACE_INCLUDE_PATH ../../drivers/vfio/pci  
> > 
> > Note, the reason why I did not show this method in the samples/trace_events/
> > is that there's one "gotcha" that you need to be careful about. It may not be
> > an issue here, but please be aware of it.
> > 
> > The words in TRACE_INCLUDE_PATH can be updated by C preprocessor defines. For
> > example, if for some reason you had:
> > 
> > #define pci special_pci
> > 
> > The above would turn into:
> > 
> >  ../../drivers/vfio/special_pci
> > 
> > and it wont build, and you will be left scratching your head wondering why.

Thanks for the info Steve, that'd definitely be a head scratcher, but
it also seems really unlikely for this path.

> Lovely :) imho it is +1 for
> CFLAGS_vfio_pci_nvlink2.o += -I$(src)
> and a comment.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d1fc1176c055c9ec9c6ec4d113a284e0bad9d09a

Obviously we can still refine further, but I don't see this new piece
of information making a meaningful difference in the choice.  Thanks,

Alex


[GIT PULL] VFIO fixes for v5.0-rc2

2019-01-10 Thread Alex Williamson
Hi Linus,

The following changes since commit bfeffd155283772bbe78c6a05dec7c0128ee500c:

  Linux 5.0-rc1 (2019-01-06 17:08:20 -0800)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.0-rc2

for you to fetch changes up to 58fec830fc19208354895d9832785505046d6c01:

  vfio/type1: Fix unmap overflow off-by-one (2019-01-08 09:31:28 -0700)


VFIO fixes for v5.0-rc2

 - Fix trace header include path for in-tree builds (Masahiro Yamada)

 - Fix overflow in unmap wrap-around test (Alex Williamson)


Alex Williamson (1):
  vfio/type1: Fix unmap overflow off-by-one

Masahiro Yamada (1):
  vfio/pci: set TRACE_INCLUDE_PATH to fix the build error

 drivers/vfio/pci/trace.h| 2 +-
 drivers/vfio/vfio_iommu_type1.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)


Re: [PATCH v1 1/2] vfio:iommu: Use capabilities do report IOMMU informations

2019-01-09 Thread Alex Williamson
On Wed, 9 Jan 2019 18:07:19 +0100
Pierre Morel  wrote:

> On 09/01/2019 16:37, Alex Williamson wrote:
> > On Wed,  9 Jan 2019 13:41:53 +0100
> > Pierre Morel  wrote:
> >   
> >> We add a new flag, VFIO_IOMMU_INFO_CAPABILITIES, inside the
> >> vfio_iommu_type1_info to specify the support for capabilities.
> >>
> >> We add a new capability, with id VFIO_IOMMU_INFO_CAP_DMA
> >> in the capability list of the VFIO_IOMMU_GET_INFO ioctl.
> >>
> >> Signed-off-by: Pierre Morel 
> >> ---
> >>   include/uapi/linux/vfio.h | 9 +
> >>   1 file changed, 9 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 8131028..54c4fcb 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -669,6 +669,15 @@ struct vfio_iommu_type1_info {
> >>__u32   flags;
> >>   #define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
> >>__u64   iova_pgsizes;   /* Bitmap of supported page sizes */
> >> +#define VFIO_IOMMU_INFO_CAPABILITIES (1 << 1)  /* support capabilities info */
> >> +  __u64   cap_offset; /* Offset within info struct of first cap */
> >> +};
> >> +
> >> +#define VFIO_IOMMU_INFO_CAP_DMA 1
> >> +struct vfio_iommu_cap_dma {
> >> +  struct vfio_info_cap_header header;
> >> +  __u64   dma_start;
> >> +  __u64   dma_end;
> >>   };
> >>   
> >>   #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)  
> > 
> > Unfortunately for most systems, a simple start and end is not really
> > sufficient to describe the available IOVA space, there are often
> > reserved regions intermixed, so this is not really a complete
> > solution.  Shameer tried to solve this last year[1] but we ran into a
> > road block that Intel IGD devices impose a reserved range of IOVA
> > spaces reported to the user that conflict with existing assignment of
> > this device and we haven't figured out yet how to be more selective of
> > the enforcement of those reserved ranges.  Thanks,
> > 
> > Alex
> > 
> > [1] https://lkml.org/lkml/2018/4/18/293
> >   
> 
> I understand that some architecture may be more complex and have special 
> needs.
> However the IOMMU geometry is a constant for all IOMMU devices and
> is reported by the geometry in the iommu operations.
> 
> This makes the IOMMU geometry a special case.

I'm not so sure that the geometry is a constant for all IOMMU devices,
nor am I sure how if that were true and it's part of an in-kernel
interface that it automatically qualifies it as the right way to expose
it to userspace.  The fact that we have a reserved region interface to
augment a basic contiguous range suggests it's known to be insufficient
even for in-kernel use.

> It is also a special case because it is an inclusive description of 
> available memory, to oppose to the exclusive description given by the 
> windows.

Geometry doesn't really have anything to do with available memory, it's
the minimum and maximum IOVA aperture.  Shameer's proposal gave us an
IOVA list, which is based on the IOMMU geometry, from which it excludes
various reserved ranges.  So if you have a less complex architecture,
you might only have one entry in the list, which gives you the start
and end of the base geometry.  More complex architectures might have
more entries, but the geometry can still be deduced from the absolute
highest and lowest addresses within the list.  Therefore a basic
geometry capability is automatically redundant to the interface that's
already been proposed.

> Isn't it possible to separate the IOMMU geometry, which is really 
> related to the IOMMU chip, from other windows exclusion related to the 
> system memory mapping?

Why would we ever have both given the description above?

> Retrieving the IOMMU geometry is very important for us because the 
> driver inside the guest must get it and program the IOMMU based on these 
> values.

So you have motivation to help move the IOVA list proposal forward,
or some equally inclusive proposal that isn't just a stop-gap ;)
Thanks,

Alex


Re: [PATCH v1 1/2] vfio:iommu: Use capabilities do report IOMMU informations

2019-01-09 Thread Alex Williamson
On Wed,  9 Jan 2019 13:41:53 +0100
Pierre Morel  wrote:

> We add a new flag, VFIO_IOMMU_INFO_CAPABILITIES, inside the
> vfio_iommu_type1_info to specify the support for capabilities.
> 
> We add a new capability, with id VFIO_IOMMU_INFO_CAP_DMA
> in the capability list of the VFIO_IOMMU_GET_INFO ioctl.
> 
> Signed-off-by: Pierre Morel 
> ---
>  include/uapi/linux/vfio.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8131028..54c4fcb 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -669,6 +669,15 @@ struct vfio_iommu_type1_info {
>   __u32   flags;
>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
>   __u64   iova_pgsizes;   /* Bitmap of supported page sizes */
> +#define VFIO_IOMMU_INFO_CAPABILITIES (1 << 1)  /* support capabilities info */
> + __u64   cap_offset; /* Offset within info struct of first cap */
> +};
> +
> +#define VFIO_IOMMU_INFO_CAP_DMA 1
> +struct vfio_iommu_cap_dma {
> + struct vfio_info_cap_header header;
> + __u64   dma_start;
> + __u64   dma_end;
>  };
>  
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)

Unfortunately for most systems, a simple start and end is not really
sufficient to describe the available IOVA space, there are often
reserved regions intermixed, so this is not really a complete
solution.  Shameer tried to solve this last year[1] but we ran into a
road block that Intel IGD devices impose a reserved range of IOVA
spaces reported to the user that conflict with existing assignment of
this device and we haven't figured out yet how to be more selective of
the enforcement of those reserved ranges.  Thanks,

Alex

[1] https://lkml.org/lkml/2018/4/18/293


Re: [PATCH] vfio_pci: set TRACE_INCLUDE_PATH to fix the build error

2019-01-08 Thread Alex Williamson
On Tue,  8 Jan 2019 12:08:03 +0900
Masahiro Yamada  wrote:

> drivers/vfio/pci/vfio_pci_nvlink2.c cannot be compiled for in-tree
> building.
> 
> CC  drivers/vfio/pci/vfio_pci_nvlink2.o
>   In file included from drivers/vfio/pci/trace.h:102,
>    from drivers/vfio/pci/vfio_pci_nvlink2.c:29:
>   ./include/trace/define_trace.h:89:42: fatal error: ./trace.h: No such file or directory
>#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
>   ^
>   compilation terminated.
>   make[1]: *** [scripts/Makefile.build;277: drivers/vfio/pci/vfio_pci_nvlink2.o] Error 1
> 
> To fix the build error, let's tell include/trace/define_trace.h the
> location of drivers/vfio/pci/trace.h
> 
> Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
> Reported-by: Laura Abbott 
> Signed-off-by: Masahiro Yamada 
> ---

Thanks for posting this, it's my preferred fix.  I'll give it another
day to collect reviews/objections then pop it into my for-linus branch
for rc2.  Thanks!

Alex
 
>  drivers/vfio/pci/trace.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h
> index 228ccdb..4d13e51 100644
> --- a/drivers/vfio/pci/trace.h
> +++ b/drivers/vfio/pci/trace.h
> @@ -94,7 +94,7 @@ TRACE_EVENT(vfio_pci_npu2_mmap,
>  #endif /* _TRACE_VFIO_PCI_H */
>  
>  #undef TRACE_INCLUDE_PATH
> -#define TRACE_INCLUDE_PATH .
> +#define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
>  #undef TRACE_INCLUDE_FILE
>  #define TRACE_INCLUDE_FILE trace
>  



[PATCH] vfio/type1: Fix unmap overflow off-by-one

2019-01-08 Thread Alex Williamson
The below referenced commit adds a test for integer overflow, but in
doing so prevents the unmap ioctl from ever including the last page of
the address space.  Subtract one to compare to the last address of the
unmap to avoid the overflow and wrap-around.

Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Cc: Dan Carpenter 
Reported-by: Pei Zhang 
Debugged-by: Peter Xu 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7651cfb14836..73652e21efec 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -878,7 +878,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
return -EINVAL;
if (!unmap->size || unmap->size & mask)
return -EINVAL;
-   if (unmap->iova + unmap->size < unmap->iova ||
+   if (unmap->iova + unmap->size - 1 < unmap->iova ||
unmap->size > SIZE_MAX)
return -EINVAL;
 



Re: [PATCH] vfio_pci: Add local source directory as include

2019-01-07 Thread Alex Williamson
On Tue, 8 Jan 2019 10:52:43 +1100
Alexey Kardashevskiy  wrote:

> On 08/01/2019 07:13, Alex Williamson wrote:
> > On Mon, 7 Jan 2019 20:39:19 +0900
> > Masahiro Yamada  wrote:
> >   
> >> On Mon, Jan 7, 2019 at 8:09 PM Cornelia Huck  wrote:  
> >>>
> >>> On Mon, 7 Jan 2019 19:12:10 +0900
> >>> Masahiro Yamada  wrote:
> >>>
> >>>> On Mon, Jan 7, 2019 at 6:18 PM Michael Ellerman  wrote:
> >>>>>
> >>>>> Laura Abbott  writes:
> >>>>>> Commit 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2]
> >>>>>> subdriver") introduced a trace.h file in the local directory but
> >>>>>> missed adding the local include path, resulting in compilation
> >>>>>> failures with tracepoints:
> >>>>>>
> >>>>>> In file included from drivers/vfio/pci/trace.h:102,
> >>>>>>  from drivers/vfio/pci/vfio_pci_nvlink2.c:29:
> >>>>>> ./include/trace/define_trace.h:89:42: fatal error: ./trace.h: No such file or directory
> >>>>>>  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> >>>>>>
> >>>>>> Fix this by adjusting the include path.
> >>>>>>
> >>>>>> Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
> >>>>>> Signed-off-by: Laura Abbott 
> >>>
> >>> (...)
> >>>
> >>>>> Alex I assume you'll merge this fix via the vfio tree?
> >>>>>
> >>>>> cheers
> >>>>>
> >>>>>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> >>>>>> index 9662c063a6b1..08d4676a8495 100644
> >>>>>> --- a/drivers/vfio/pci/Makefile
> >>>>>> +++ b/drivers/vfio/pci/Makefile
> >>>>>> @@ -1,3 +1,4 @@
> >>>>>> +ccflags-y   += -I$(src)
> >>>>>>
> >>>>>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> >>>>>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> >>>>>> --
> >>>>>> 2.20.1
> >>>>
> >>>>
> >>>> Hi.
> >>>>
> >>>> If I correctly understand the usage of TRACE_INCLUDE_PATH,
> >>>> the correct fix should be like follows:
> >>>>
> >>>>
> >>>> diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h
> >>>> index 228ccdb..4d13e51 100644
> >>>> --- a/drivers/vfio/pci/trace.h
> >>>> +++ b/drivers/vfio/pci/trace.h
> >>>> @@ -94,7 +94,7 @@ TRACE_EVENT(vfio_pci_npu2_mmap,
> >>>>  #endif /* _TRACE_VFIO_PCI_H */
> >>>>
> >>>>  #undef TRACE_INCLUDE_PATH
> >>>> -#define TRACE_INCLUDE_PATH .
> >>>> +#define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
> >>>>  #undef TRACE_INCLUDE_FILE
> >>>>  #define TRACE_INCLUDE_FILE trace
> >>>
> >>> Going from the comments in samples/trace_events/trace-events-sample.h,
> >>> I think both approaches are possible, and I see both used in various
> >>> places.
> >>>
> >>> Personally, I'd prefer Laura's patch, as it doesn't involve hardcoding
> >>> a path.  
> > 
> > Numbering options for clarity:
> > 
> > 1)  
> >> ccflags-y += -I$(src)
> >> would add the header search path for all files in drivers/vfio/pci/
> >> whereas only the drivers/vfio/pci/vfio_pci_nvlink2.c needs it.
> >>  
> > 
> > 2)  
> >> CFLAGS_vfio_pci_nvlink2.o += -I$(src)
> >> is a bit better.
> >> However, it is not obvious why this extra header search path is needed
> >> until you find vfio_pci_nvlink2.c including trace.h
> >>  
> > 
> > 3)  
> >> #define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
> >> clarifies the intention because the related code is all placed in trace.h
> >>
> >>
> >>
> >> From the comment in include/trace/define_trace.h
> >> TRACE_INCLUDE_PATH is relative to include/trace/define_trace.h  
> > 
> > In my scan of the tree, th

Re: [PATCH] vfio_pci: Add local source directory as include

2019-01-07 Thread Alex Williamson
On Mon, 7 Jan 2019 20:39:19 +0900
Masahiro Yamada  wrote:

> On Mon, Jan 7, 2019 at 8:09 PM Cornelia Huck  wrote:
> >
> > On Mon, 7 Jan 2019 19:12:10 +0900
> > Masahiro Yamada  wrote:
> >  
> > > On Mon, Jan 7, 2019 at 6:18 PM Michael Ellerman  wrote:
> > > >
> > > > Laura Abbott  writes:  
> > > > > Commit 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2]
> > > > > subdriver") introduced a trace.h file in the local directory but
> > > > > missed adding the local include path, resulting in compilation
> > > > > failures with tracepoints:
> > > > >
> > > > > In file included from drivers/vfio/pci/trace.h:102,
> > > > >  from drivers/vfio/pci/vfio_pci_nvlink2.c:29:
> > > > > ./include/trace/define_trace.h:89:42: fatal error: ./trace.h: No such file or directory
> > > > >  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> > > > >
> > > > > Fix this by adjusting the include path.
> > > > >
> > > > > Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
> > > > > Signed-off-by: Laura Abbott   
> >
> > (...)
> >  
> > > > Alex I assume you'll merge this fix via the vfio tree?
> > > >
> > > > cheers
> > > >  
> > > > > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> > > > > index 9662c063a6b1..08d4676a8495 100644
> > > > > --- a/drivers/vfio/pci/Makefile
> > > > > +++ b/drivers/vfio/pci/Makefile
> > > > > @@ -1,3 +1,4 @@
> > > > > +ccflags-y   += -I$(src)
> > > > >
> > > > >  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> > > > >  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> > > > > --
> > > > > 2.20.1  
> > >
> > >
> > > Hi.
> > >
> > > If I correctly understand the usage of TRACE_INCLUDE_PATH,
> > > the correct fix should be like follows:
> > >
> > >
> > > diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h
> > > index 228ccdb..4d13e51 100644
> > > --- a/drivers/vfio/pci/trace.h
> > > +++ b/drivers/vfio/pci/trace.h
> > > @@ -94,7 +94,7 @@ TRACE_EVENT(vfio_pci_npu2_mmap,
> > >  #endif /* _TRACE_VFIO_PCI_H */
> > >
> > >  #undef TRACE_INCLUDE_PATH
> > > -#define TRACE_INCLUDE_PATH .
> > > +#define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
> > >  #undef TRACE_INCLUDE_FILE
> > >  #define TRACE_INCLUDE_FILE trace  
> >
> > Going from the comments in samples/trace_events/trace-events-sample.h,
> > I think both approaches are possible, and I see both used in various
> > places.
> >
> > Personally, I'd prefer Laura's patch, as it doesn't involve hardcoding
> > a path.

Numbering options for clarity:

1)
> ccflags-y += -I$(src)
> would add the header search path for all files in drivers/vfio/pci/
> whereas only the drivers/vfio/pci/vfio_pci_nvlink2.c needs it.
> 

2)
> CFLAGS_vfio_pci_nvlink2.o += -I$(src)
> is a bit better.
> However, it is not obvious why this extra header search path is needed
> until you find vfio_pci_nvlink2.c including trace.h
> 

3)
> #define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
> clarifies the intention because the related code is all placed in trace.h
> 
> 
> 
> From the comment in include/trace/define_trace.h
> TRACE_INCLUDE_PATH is relative to include/trace/define_trace.h

In my scan of the tree, the most common solution seems to be 2) as this
is essentially recommended in the sample file.  3) is well represented,
with far fewer examples of 1), though it might depend on how liberally
we grep out or examine the use cases.  Choice 1) also seems to be the
most shotgun approach, adding to the search path for all files.  From a
maintenance perspective I agree that 2) seems more error prone,
especially when the build system only catches the error on in-tree
builds, something I rarely do.  Therefore I'm leaning towards option
3).  The hardcoded path here doesn't seem much of an issue relative to
the negatives of the other approaches (how often do we move these
files?) and it keeps the trace support relatively self-contained.  Are
there further arguments for or against these options?  Otherwise who
wants to formally post the TRACE_INCLUDE_PATH version?  Thanks,

Alex


[GIT PULL] VFIO updates for v4.21-rc1

2018-12-21 Thread Alex Williamson
Hi Linus,

An early pull request for the next merge window.  Happy holidays!

The following changes since commit 40e020c129cfc991e8ab4736d2665351ffd1468d:

  Linux 4.20-rc6 (2018-12-09 15:31:00 -0800)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.21-rc1

for you to fetch changes up to 8ba35b3a0046d6573c98f00461d9bd1b86250d35:

  vfio-mdev/samples: Use u8 instead of char for handle functions (2018-12-17 11:07:13 -0700)


VFIO updates for v4.21

 - Replace global vfio-pci lock with per bus lock to allow concurrent
   open and release (Alex Williamson)

 - Declare mdev function as static (Paolo Cretaro)

 - Convert char to u8 in mdev/mtty sample driver (Nathan Chancellor)


Alex Williamson (1):
  vfio/pci: Parallelize device open and release

Nathan Chancellor (1):
  vfio-mdev/samples: Use u8 instead of char for handle functions

Paolo Cretaro (1):
  vfio/mdev: add static modifier to add_mdev_supported_type

 drivers/vfio/mdev/mdev_sysfs.c  |   4 +-
 drivers/vfio/pci/vfio_pci.c | 160 ++--
 drivers/vfio/pci/vfio_pci_private.h |   6 ++
 samples/vfio-mdev/mtty.c|  26 +++---
 4 files changed, 157 insertions(+), 39 deletions(-)


Re: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-20 Thread Alex Williamson
On Fri, 21 Dec 2018 12:50:00 +1100
Alexey Kardashevskiy  wrote:

> On 21/12/2018 12:37, Alex Williamson wrote:
> > On Fri, 21 Dec 2018 12:23:16 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 21/12/2018 03:46, Alex Williamson wrote:  
> >>> On Thu, 20 Dec 2018 19:23:50 +1100
> >>> Alexey Kardashevskiy  wrote:
> >>> 
> >>>> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >>>> pluggable PCIe devices but still have PCIe links which are used
> >>>> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> >>>> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> >>>> have a special unit on a die called an NPU which is an NVLink2 host bus
> >>>> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> >>>> These systems also support ATS (address translation services) which is
> >>>> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >>>> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> >>>> cache-coherent access to a GPU RAM.
> >>>>
> >>>> This exports GPU RAM to the userspace as a new VFIO device region. This
> >>>> preregisters the new memory as device memory as it might be used for DMA.
> >>>> This inserts pfns from the fault handler as the GPU memory is not onlined
> >>>> until the vendor driver is loaded and trained the NVLinks so doing this
> >>>> earlier causes low level errors which we fence in the firmware so
> >>>> it does not hurt the host system but still better be avoided; for the 
> >>>> same
> >>>> reason this does not map GPU RAM into the host kernel (usual thing for
> >>>> emulated access otherwise).
> >>>>
> >>>> This exports an ATSD (Address Translation Shootdown) register of NPU 
> >>>> which
> >>>> allows TLB invalidations inside GPU for an operating system. The register
> >>>> conveniently occupies a single 64k page. It is also presented to
> >>>> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> >>>> each of them can be used for TLB invalidation in a GPU linked to this 
> >>>> NPU.
> >>>> This allocates one ATSD register per an NVLink bridge allowing passing
> >>>> up to 6 registers. Due to the host firmware bug (just recently fixed),
> >>>> only 1 ATSD register per NPU was actually advertised to the host system
> >>>> so this passes that alone register via the first NVLink bridge device in
> >>>> the group which is still enough as QEMU collects them all back and
> >>>> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> >>>>
> >>>> In order to provide the userspace with the information about 
> >>>> GPU-to-NVLink
> >>>> connections, this exports an additional capability called "tgt"
> >>>> (which is an abbreviated host system bus address). The "tgt" property
> >>>> tells the GPU its own system address and allows the guest driver to
> >>>> conglomerate the routing information so each GPU knows how to get 
> >>>> directly
> >>>> to the other GPUs.
> >>>>
> >>>> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> >>>> know LPID (a logical partition ID or a KVM guest hardware ID in other
> >>>> words) and PID (a memory context ID of a userspace process, not to be
> >>>> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> >>>> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> >>>> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> >>>>
> >>>> This requires coherent memory and ATSD to be available on the host as
> >>>> the GPU vendor only supports configurations with both features enabled
> >>>> and other configurations are known not to work. Because of this and
> >>>> because of the ways the features are advertised to the host system
> >>>> (which is a device tree with very platform specific properties),
> >>>> this requires enabled POWERNV platform.
> >>>>
> >>>> The V100 GPUs do not advertise any of these capabilities via the config
> >>>> space and there are more than just one de

Re: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-20 Thread Alex Williamson
On Fri, 21 Dec 2018 12:23:16 +1100
Alexey Kardashevskiy  wrote:

> On 21/12/2018 03:46, Alex Williamson wrote:
> > On Thu, 20 Dec 2018 19:23:50 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >> pluggable PCIe devices but still have PCIe links which are used
> >> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> >> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> >> have a special unit on a die called an NPU which is an NVLink2 host bus
> >> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> >> These systems also support ATS (address translation services) which is
> >> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> >> cache-coherent access to a GPU RAM.
> >>
> >> This exports GPU RAM to the userspace as a new VFIO device region. This
> >> preregisters the new memory as device memory as it might be used for DMA.
> >> This inserts pfns from the fault handler as the GPU memory is not onlined
> >> until the vendor driver is loaded and trained the NVLinks so doing this
> >> earlier causes low level errors which we fence in the firmware so
> >> it does not hurt the host system but still better be avoided; for the same
> >> reason this does not map GPU RAM into the host kernel (usual thing for
> >> emulated access otherwise).
> >>
> >> This exports an ATSD (Address Translation Shootdown) register of NPU which
> >> allows TLB invalidations inside GPU for an operating system. The register
> >> conveniently occupies a single 64k page. It is also presented to
> >> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> >> each of them can be used for TLB invalidation in a GPU linked to this NPU.
> >> This allocates one ATSD register per an NVLink bridge allowing passing
> >> up to 6 registers. Due to the host firmware bug (just recently fixed),
> >> only 1 ATSD register per NPU was actually advertised to the host system
> >> so this passes that alone register via the first NVLink bridge device in
> >> the group which is still enough as QEMU collects them all back and
> >> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> >>
> >> In order to provide the userspace with the information about GPU-to-NVLink
> >> connections, this exports an additional capability called "tgt"
> >> (which is an abbreviated host system bus address). The "tgt" property
> >> tells the GPU its own system address and allows the guest driver to
> >> conglomerate the routing information so each GPU knows how to get directly
> >> to the other GPUs.
> >>
> >> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> >> know LPID (a logical partition ID or a KVM guest hardware ID in other
> >> words) and PID (a memory context ID of a userspace process, not to be
> >> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> >> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> >> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> >>
> >> This requires coherent memory and ATSD to be available on the host as
> >> the GPU vendor only supports configurations with both features enabled
> >> and other configurations are known not to work. Because of this and
> >> because of the ways the features are advertised to the host system
> >> (which is a device tree with very platform specific properties),
> >> this requires enabled POWERNV platform.
> >>
> >> The V100 GPUs do not advertise any of these capabilities via the config
> >> space and there are more than just one device ID so this relies on
> >> the platform to tell whether these GPUs have special abilities such as
> >> NVLinks.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >> Changes:
> >> v6.1:
> >> * fixed outdated comment about VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD
> >>
> >> v6:
> >> * reworked capabilities - tgt for nvlink and gpu and link-speed
> >> for nvlink only
> >>
> >> v5:
> >> * do not memremap GPU RAM for emulation, map it only when it is needed
> >> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
> >> the region with a zero size
> >>

Re: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-20 Thread Alex Williamson
On Thu, 20 Dec 2018 19:23:50 +1100
Alexey Kardashevskiy  wrote:

> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but still have PCIe links which are used
> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> have a special unit on a die called an NPU which is an NVLink2 host bus
> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
> 
> This exports GPU RAM to the userspace as a new VFIO device region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler as the GPU memory is not onlined
> until the vendor driver is loaded and trained the NVLinks so doing this
> earlier causes low level errors which we fence in the firmware so
> it does not hurt the host system but still better be avoided; for the same
> reason this does not map GPU RAM into the host kernel (usual thing for
> emulated access otherwise).
> 
> This exports an ATSD (Address Translation Shootdown) register of NPU which
> allows TLB invalidations inside GPU for an operating system. The register
> conveniently occupies a single 64k page. It is also presented to
> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> each of them can be used for TLB invalidation in a GPU linked to this NPU.
> This allocates one ATSD register per an NVLink bridge allowing passing
> up to 6 registers. Due to the host firmware bug (just recently fixed),
> only 1 ATSD register per NPU was actually advertised to the host system
> so this passes that alone register via the first NVLink bridge device in
> the group which is still enough as QEMU collects them all back and
> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> 
> In order to provide the userspace with the information about GPU-to-NVLink
> connections, this exports an additional capability called "tgt"
> (which is an abbreviated host system bus address). The "tgt" property
> tells the GPU its own system address and allows the guest driver to
> conglomerate the routing information so each GPU knows how to get directly
> to the other GPUs.
> 
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of a userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> 
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
> 
> The V100 GPUs do not advertise any of these capabilities via the config
> space and there are more than just one device ID so this relies on
> the platform to tell whether these GPUs have special abilities such as
> NVLinks.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v6.1:
> * fixed outdated comment about VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD
> 
> v6:
> * reworked capabilities - tgt for nvlink and gpu and link-speed
> for nvlink only
> 
> v5:
> * do not memremap GPU RAM for emulation, map it only when it is needed
> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
> the region with a zero size
> * separate caps per device type
> * addressed AW review comments
> 
> v4:
> * added nvlink-speed to the NPU bridge capability as this turned out to
> be not a constant value
> * instead of looking at the exact device ID (which also changes from system
> to system), now this (indirectly) looks at the device tree to know
> if GPU and NPU support NVLink
> 
> v3:
> * reworded the commit log about tgt
> * added tracepoints (do we want them enabled for entire vfio-pci?)
> * added code comments
> * added write|mmap flags to the new regions
> * auto enabled VFIO_PCI_NVLINK2 config option
> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> references; there are required by the NVIDIA driver
> * keep notifier registered only for short time
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/trace.h| 102 ++
>  drivers/vfio/pci/vfio_pci_private.h |  14 +
>  include/uapi/linux/vfio.h   |  37 +++
>  drivers/vfio/pci/vfio_pci.c |  27 +-
>  drivers/vfio/pci

Re: [PATCH kernel v6 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-19 Thread Alex Williamson
ction */
> +#include 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 93c1738..127071b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -163,4 +163,18 @@ static inline int vfio_pci_igd_init(struct 
> vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
> +extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device 
> *vdev)
> +{
> + return -ENODEV;
> +}
> +
> +static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
> +{
> + return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8131028..22b825c 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -353,6 +353,21 @@ struct vfio_region_gfx_edid {
>  #define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
>  };
>  
> +/*
> + * 10de vendor sub-type
> + *
> + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> + */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> +
> +/*
> + * 1014 vendor sub-type
> + *
> + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> + * to do TLB invalidation on a GPU.
> + */
> +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be 
> within
> @@ -363,6 +378,29 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE   3
>  
> +/*
> + * Capability with compressed real address (aka SSA - small system address)
> + * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing.
> + */
> +#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT  4
> +
> +struct vfio_region_info_cap_nvlink2_ssatgt {
> + struct vfio_info_cap_header header;
> + __u64 tgt;
> +};
> +
> +/*
> + * Capability with compressed real address (aka SSA - small system address),
> + * used to match the NVLink bridge with a GPU. Also contains a link speed.
> + */

Comments carried over from previous definitions are no longer
accurate.  Thanks,

Alex

> +#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD  5
> +
> +struct vfio_region_info_cap_nvlink2_lnkspd {
> + struct vfio_info_cap_header header;
> + __u32 link_speed;
> + __u32 __pad;
> +};
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *   struct vfio_irq_info)
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 6cb70cf..67c03f2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -302,14 +302,37 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   if (ret) {
>   dev_warn(&vdev->pdev->dev,
>"Failed to setup Intel IGD regions\n");
> - vfio_pci_disable(vdev);
> - return ret;
> + goto disable_exit;
> + }
> + }
> +
> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> + ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
> + if (ret && ret != -ENODEV) {
> + dev_warn(&vdev->pdev->dev,
> +  "Failed to setup NVIDIA NV2 RAM region\n");
> + goto disable_exit;
> + }
> + }
> +
> + if (pdev->vendor == PCI_VENDOR_ID_IBM &&
> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> + ret = vfio_pci_ibm_npu2_init(vdev);
> + if (ret && ret != -ENODEV) {
> + dev_warn(&vdev->pdev->dev,
> + "Failed to setup NVIDIA NV2 ATSD 
> region\n");
> + goto disable_exit;
>   }
>   }
>  
>   vfio_pci_probe_mmaps(vdev);
>  
>   return 0;
> +
> +disable_exit:
> + vfio_pci_disable(vdev);
> + return ret;
>  }
>  
>  static void vfio_pci_disable(struct vfio_pci_device *vdev)
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 000..054a2cf
> --- /dev/null

Re: [PATCH kernel v6 18/20] vfio_pci: Allow mapping extra regions

2018-12-19 Thread Alex Williamson
[cc +kvm, +lkml]

Sorry, just noticed these are only visible on ppc lists or for those
directly cc'd.  vfio's official development list is the kvm list.  I'll
let spapr specific changes get away without copying this list, but
changes like this really need to be visible to everyone.  Thanks,

Alex

On Wed, 19 Dec 2018 19:52:30 +1100
Alexey Kardashevskiy  wrote:

> So far we only allowed mapping of MMIO BARs to the userspace. However
> there are GPUs with on-board coherent RAM accessible via side
> channels which we also want to map to the userspace. The first client
> for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
> NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
> to the system address space, we are going to export these as an extra
> PCI region.
> 
> We already support extra PCI regions and this adds support for mapping
> them to the userspace.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 
> Acked-by: Alex Williamson 
> ---
> Changes:
> v2:
> * reverted one of mistakenly removed error checks
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 9 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
> size_t count, loff_t *ppos, bool iswrite);
>   void(*release)(struct vfio_pci_device *vdev,
>  struct vfio_pci_region *region);
> + int (*mmap)(struct vfio_pci_device *vdev,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index fef5002..4a6f7c0 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1130,6 +1130,15 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   return -EINVAL;
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
> + if (index >= VFIO_PCI_NUM_REGIONS) {
> + int regnum = index - VFIO_PCI_NUM_REGIONS;
> + struct vfio_pci_region *region = vdev->region + regnum;
> +
> + if (region && region->ops && region->ops->mmap &&
> + (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> + return region->ops->mmap(vdev, region, vma);
> + return -EINVAL;
> + }
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;
>   if (!vdev->bar_mmap_supported[index])



Re: [PATCH kernel v6 19/20] vfio_pci: Allow regions to add own capabilities

2018-12-19 Thread Alex Williamson
[cc +kvm, +lkml]

Ditto list cc comment from 18/20

On Wed, 19 Dec 2018 19:52:31 +1100
Alexey Kardashevskiy  wrote:

> VFIO regions already support region capabilities with a limited set of
> fields. However the subdriver might have to report to the userspace
> additional bits.
> 
> This adds an add_capability() hook to vfio_pci_regops.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Acked-by: Alex Williamson 
> ---
> Changes:
> v3:
> * removed confusing rationale for the patch, the next patch makes
> use of it anyway
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 6 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..93c1738 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -62,6 +62,9 @@ struct vfio_pci_regops {
>   int (*mmap)(struct vfio_pci_device *vdev,
>   struct vfio_pci_region *region,
>   struct vm_area_struct *vma);
> + int (*add_capability)(struct vfio_pci_device *vdev,
> +   struct vfio_pci_region *region,
> +   struct vfio_info_cap *caps);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 4a6f7c0..6cb70cf 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -763,6 +763,12 @@ static long vfio_pci_ioctl(void *device_data,
>   if (ret)
>   return ret;
>  
> + if (vdev->region[i].ops->add_capability) {
> + ret = vdev->region[i].ops->add_capability(vdev,
> + &vdev->region[i], &caps);
> + if (ret)
> + return ret;
> + }
>   }
>   }
>  



Re: [PATCH] vfio-mdev/samples: Use u8 instead of char for handle functions

2018-12-17 Thread Alex Williamson
On Fri, 19 Oct 2018 11:04:27 -0700
Nathan Chancellor  wrote:

> Clang warns:
> 
> samples/vfio-mdev/mtty.c:592:39: warning: implicit conversion from 'int'
> to 'char' changes value from 162 to -94 [-Wconstant-conversion]
> *buf = UART_MSR_DSR | UART_MSR_DDSR | UART_MSR_DCD;
>  ~ ~^~
> 1 warning generated.
> 
> Turns out that all uses of buf in this function ultimately end up stored
> or cast to an unsigned type. Just use u8, which has the same number of
> bits but can store this larger number so Clang no longer warns.
> 
> Signed-off-by: Nathan Chancellor 
> ---
>  samples/vfio-mdev/mtty.c | 26 +-
>  1 file changed, 13 insertions(+), 13 deletions(-)

Applied to vfio next branch for v4.21.  Thanks,

Alex

> 
> diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
> index 7abb79d8313d..f6732aa16bb1 100644
> --- a/samples/vfio-mdev/mtty.c
> +++ b/samples/vfio-mdev/mtty.c
> @@ -171,7 +171,7 @@ static struct mdev_state *find_mdev_state_by_uuid(uuid_le 
> uuid)
>   return NULL;
>  }
>  
> -void dump_buffer(char *buf, uint32_t count)
> +void dump_buffer(u8 *buf, uint32_t count)
>  {
>  #if defined(DEBUG)
>   int i;
> @@ -250,7 +250,7 @@ static void mtty_create_config_space(struct mdev_state 
> *mdev_state)
>  }
>  
>  static void handle_pci_cfg_write(struct mdev_state *mdev_state, u16 offset,
> -  char *buf, u32 count)
> +  u8 *buf, u32 count)
>  {
>   u32 cfg_addr, bar_mask, bar_index = 0;
>  
> @@ -304,7 +304,7 @@ static void handle_pci_cfg_write(struct mdev_state 
> *mdev_state, u16 offset,
>  }
>  
>  static void handle_bar_write(unsigned int index, struct mdev_state 
> *mdev_state,
> - u16 offset, char *buf, u32 count)
> + u16 offset, u8 *buf, u32 count)
>  {
>   u8 data = *buf;
>  
> @@ -475,7 +475,7 @@ static void handle_bar_write(unsigned int index, struct 
> mdev_state *mdev_state,
>  }
>  
>  static void handle_bar_read(unsigned int index, struct mdev_state 
> *mdev_state,
> - u16 offset, char *buf, u32 count)
> + u16 offset, u8 *buf, u32 count)
>  {
>   /* Handle read requests by guest */
>   switch (offset) {
> @@ -650,7 +650,7 @@ static void mdev_read_base(struct mdev_state *mdev_state)
>   }
>  }
>  
> -static ssize_t mdev_access(struct mdev_device *mdev, char *buf, size_t count,
> +static ssize_t mdev_access(struct mdev_device *mdev, u8 *buf, size_t count,
>  loff_t pos, bool is_write)
>  {
>   struct mdev_state *mdev_state;
> @@ -698,7 +698,7 @@ static ssize_t mdev_access(struct mdev_device *mdev, char 
> *buf, size_t count,
>  #if defined(DEBUG_REGS)
>   pr_info("%s: BAR%d  WR @0x%llx %s val:0x%02x dlab:%d\n",
>   __func__, index, offset, wr_reg[offset],
> - (u8)*buf, mdev_state->s[index].dlab);
> + *buf, mdev_state->s[index].dlab);
>  #endif
>   handle_bar_write(index, mdev_state, offset, buf, count);
>   } else {
> @@ -708,7 +708,7 @@ static ssize_t mdev_access(struct mdev_device *mdev, char 
> *buf, size_t count,
>  #if defined(DEBUG_REGS)
>   pr_info("%s: BAR%d  RD @0x%llx %s val:0x%02x dlab:%d\n",
>   __func__, index, offset, rd_reg[offset],
> - (u8)*buf, mdev_state->s[index].dlab);
> + *buf, mdev_state->s[index].dlab);
>  #endif
>   }
>   break;
> @@ -827,7 +827,7 @@ ssize_t mtty_read(struct mdev_device *mdev, char __user 
> *buf, size_t count,
>   if (count >= 4 && !(*ppos % 4)) {
>   u32 val;
>  
> - ret =  mdev_access(mdev, (char *)&val, sizeof(val),
> + ret =  mdev_access(mdev, (u8 *)&val, sizeof(val),
>  *ppos, false);
>   if (ret <= 0)
>   goto read_err;
> @@ -839,7 +839,7 @@ ssize_t mtty_read(struct mdev_device *mdev, char __user 
> *buf, size_t count,
>   } else if (count >= 2 && !(*ppos % 2)) {
>   u16 val;
>  
> - ret = mdev_access(mdev, (char *)&val, sizeof(val),
> + ret = mdev_access(mdev, (u8 *)&val, sizeof(val),
> *ppos, false);
>   if (ret <= 0)
>   goto read_err;
> @@ -851,7 +851,7 @@ ssize_t mtty_read(struct mdev_device *mdev, char __user 
> *buf, size_t count,
>   } else {
>   u8 val;
>  
> - ret = mdev_access(mdev, (char *)&val, sizeof(val),
> + ret = mdev_access(mdev, (u8 *)&val, sizeof(val),
> 

Re: [PATCH] vfio/mdev: add static modifier to add_mdev_supported_type

2018-12-12 Thread Alex Williamson
On Tue, 13 Nov 2018 09:45:43 +0100
Paolo Cretaro  wrote:

> Set add_mdev_supported_type as static since it is only used within
> mdev_sysfs.c.
> This fixes -Wmissing-prototypes gcc warning.
> 
> Signed-off-by: Paolo Cretaro 
> ---
>  drivers/vfio/mdev/mdev_sysfs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 249472f05509..ce5dd219f2c8 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -92,8 +92,8 @@ static struct kobj_type mdev_type_ktype = {
>   .release = mdev_type_release,
>  };
>  
> -struct mdev_type *add_mdev_supported_type(struct mdev_parent *parent,
> -   struct attribute_group *group)
> +static struct mdev_type *add_mdev_supported_type(struct mdev_parent *parent,
> +  struct attribute_group *group)
>  {
>   struct mdev_type *type;
>   int ret;

Applied to vfio next branch with Cornelia's review.  Thanks,

Alex


Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus

2018-11-30 Thread Alex Williamson
On Fri, 30 Nov 2018 06:24:16 +
Bharat Bhushan  wrote:

> Hi Alex,
> 
> > -Original Message-
> > From: Alex Williamson 
> > Sent: Friday, November 30, 2018 11:26 AM
> > To: Bharat Bhushan 
> > Cc: Bjorn Helgaas ; Bjorn Helgaas
> > ; linux-...@vger.kernel.org; Linux Kernel Mailing List
> > ; bharatb.ya...@gmail.com; David Daney
> > ; jglau...@cavium.com;
> > mbroe...@libmpq.org; chrisrblak...@gmail.com
> > Subject: Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus
> > 
> > On Fri, 30 Nov 2018 05:29:47 +
> > Bharat Bhushan  wrote:
> >   
> > > Hi,
> > >  
> > > > -Original Message-
> > > > From: Bjorn Helgaas 
> > > > Sent: Thursday, November 29, 2018 1:46 AM
> > > > To: Bharat Bhushan 
> > > > Cc: alex.william...@redhat.com; Bjorn Helgaas ;
> > > > linux- p...@vger.kernel.org; Linux Kernel Mailing List  > > > ker...@vger.kernel.org>; bharatb.ya...@gmail.com; David Daney  
> > > > ; jglau...@cavium.com;  
> > mbroe...@libmpq.org;  
> > > > chrisrblak...@gmail.com
> > > > Subject: Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus
> > > >
> > > > On Tue, Nov 27, 2018 at 10:32 PM Bharat Bhushan
> > > >  wrote:
> > > >  
> > > > > > -Original Message-
> > > > > > From: Alex Williamson 
> > > > > > Sent: Tuesday, November 27, 2018 9:39 PM
> > > > > > To: Bjorn Helgaas 
> > > > > > Cc: Bharat Bhushan ;
> > > > > > linux-...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > > bharatb.ya...@gmail.com; David Daney  
> > ;  
> > > > Jan  
> > > > > > Glauber ; Maik Broemme  
> > > > ;  
> > > > > > Chris Blake 
> > > > > > Subject: Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus
> > > > > >
> > > > > > On Tue, 27 Nov 2018 09:33:56 -0600 Bjorn Helgaas
> > > > > >  wrote:  
> > > >  
> > > > > > > 4) Is there a hardware erratum for this?  If so, please
> > > > > > > include the URL here.  
> > > > >
> > > > > No h/w errata as of now.  
> > > >
> > > > Does that mean (a) the HW folks agree this is a hardware problem but
> > > > they haven't written an erratum, (b) there is an erratum but it
> > > > isn't public, (c) we don't have any concrete evidence of a hardware
> > > > problem, but things just don't work if we do a bus reset, (d) something
> > > > else?
> > >
> > > I will say it is (c) - not concluded to be hardware h/w issue.
> > >  
> > > >  
> > > > > In pci_reset_secondary_bus() I have tried to increase the delay
> > > > > after reset  
> > > > but not helped.  
> > > > > Do I need to add delay at some other place as well?  
> > > >
> > > > No, I think the place you tried should be enough.
> > > >
> > > > You should also be able to exercise this from user-space by using
> > > > "setpci" to set and clear the Secondary Bus Reset bit in the Bridge
> > > > Control register.  Then you can also use setpci to read/write config
> > > > space of the NIC.  The kernel would normally read the Vendor and
> > > > Device IDs as the first access to the device during enumeration.
> > > > You also might be able to learn something by using "lspci -vv" on
> > > > the bridge before and after the reset to see if it logs any AER bits 
> > > > (if it  
> > supports AER) or the other standard error logging bits.  
> > >
> > > I tried below sequence for Secondary bus reset and device config space
> > > show 0xff
> > >
> > > root@localhost:~# lspci -x
> > > 0002:00:00.0 PCI bridge: Freescale Semiconductor Inc Device 80c0 (rev
> > > 10)
> > > 00: 57 19 c0 80 07 01 10 00 10 00 04 06 08 00 01 00
> > > 10: 00 00 00 00 00 00 00 00 00 01 ff 00 01 01 00 00
> > > 20: 00 40 00 40 f1 ff 01 00 00 00 00 00 00 00 00 00
> > > 30: 00 00 00 00 40 00 00 00 00 00 00 40 63 01 00 00
> > >
> > > 0002:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit
> > > Network Connection
> > > 00: 86 80 d3 10 06 04 10 00 00 00 00 02 10 00 00 00
> > > 10: 00 00 0c 40 00 00 00 40 01 00 00 

Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus

2018-11-29 Thread Alex Williamson
On Fri, 30 Nov 2018 05:29:47 +
Bharat Bhushan  wrote:

> Hi,
> 
> > -Original Message-
> > From: Bjorn Helgaas 
> > Sent: Thursday, November 29, 2018 1:46 AM
> > To: Bharat Bhushan 
> > Cc: alex.william...@redhat.com; Bjorn Helgaas ; linux-
> > p...@vger.kernel.org; Linux Kernel Mailing List  > ker...@vger.kernel.org>; bharatb.ya...@gmail.com; David Daney  
> > ; jglau...@cavium.com;
> > mbroe...@libmpq.org; chrisrblak...@gmail.com
> > Subject: Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus
> > 
> > On Tue, Nov 27, 2018 at 10:32 PM Bharat Bhushan
> >  wrote:
> >   
> > > > -Original Message-
> > > > From: Alex Williamson 
> > > > Sent: Tuesday, November 27, 2018 9:39 PM
> > > > To: Bjorn Helgaas 
> > > > Cc: Bharat Bhushan ;
> > > > linux-...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > bharatb.ya...@gmail.com; David Daney ;  
> > Jan  
> > > > Glauber ; Maik Broemme  
> > ;  
> > > > Chris Blake 
> > > > Subject: Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus
> > > >
> > > > On Tue, 27 Nov 2018 09:33:56 -0600
> > > > Bjorn Helgaas  wrote:  
> >   
> > > > > 4) Is there a hardware erratum for this?  If so, please include
> > > > > the URL here.  
> > >
> > > No h/w errata as of now.  
> > 
> > Does that mean (a) the HW folks agree this is a hardware problem but they
> > haven't written an erratum, (b) there is an erratum but it isn't public, 
> > (c) we
> > don't have any concrete evidence of a hardware problem, but things just
> > don't work if we do a bus reset, (d) something else?  
> 
> I will say it is (c) - not yet concluded to be a h/w issue.
> 
> >   
> > > In pci_reset_secondary_bus() I have tried to increase the delay after
> > > reset but it did not help.
> > > Do I need to add delay at some other place as well?
> > 
> > No, I think the place you tried should be enough.
> > 
> > You should also be able to exercise this from user-space by using "setpci" 
> > to
> > set and clear the Secondary Bus Reset bit in the Bridge Control register.  
> > Then
> > you can also use setpci to read/write config space of the NIC.  The kernel
> > would normally read the Vendor and Device IDs as the first access to the
> > device during enumeration.  You also might be able to learn something by
> > using "lspci -vv" on the bridge before and after the reset to see if it 
> > logs any
> > AER bits (if it supports AER) or the other standard error logging bits.  
> 
> I tried the below sequence for secondary bus reset and the device config
> space shows 0xff
> 
> root@localhost:~# lspci -x
> 0002:00:00.0 PCI bridge: Freescale Semiconductor Inc Device 80c0 (rev 10)
> 00: 57 19 c0 80 07 01 10 00 10 00 04 06 08 00 01 00
> 10: 00 00 00 00 00 00 00 00 00 01 ff 00 01 01 00 00
> 20: 00 40 00 40 f1 ff 01 00 00 00 00 00 00 00 00 00
> 30: 00 00 00 00 40 00 00 00 00 00 00 40 63 01 00 00
> 
> 0002:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network 
> Connection
> 00: 86 80 d3 10 06 04 10 00 00 00 00 02 10 00 00 00
> 10: 00 00 0c 40 00 00 00 40 01 00 00 00 00 00 0e 40
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 1f a0
> 30: 00 00 24 40 c8 00 00 00 00 00 00 00 63 01 00 00
> 
> root@localhost:~#  setpci -s 0002:00:00.0 0x3e.b=0x40
> root@localhost:~#  setpci -s 0002:00:00.0 0x3e.b=0x00
> 
> root@localhost:~# lspci -x
> 0002:00:00.0 PCI bridge: Freescale Semiconductor Inc Device 80c0 (rev 10)
> 00: 57 19 c0 80 07 01 10 00 10 00 04 06 08 00 01 00
> 10: 00 00 00 00 00 00 00 00 00 01 ff 00 01 01 00 00
> 20: 00 40 00 40 f1 ff 01 00 00 00 00 00 00 00 00 00
> 30: 00 00 00 00 40 00 00 00 00 00 00 40 63 01 00 00

Just for curiosity sake, what if you re-write the secondary and
subordinate bus registers here:

# setpci -s 0002:00:00.0 0x19.b=0x01
# setpci -s 0002:00:00.0 0x1a.b=0xff

IIRC the users that debugged the AMD bus reset issue re-wrote the
entire 64 bytes of the bridge config header and then further narrowed
the issue down to the two registers above.  If one bridge
implementation can have such an issue, maybe others do too.  Perhaps
there's common IP in use.  Are you able to test other endpoints besides
this e1000e device with this setpci technique?  Thanks,

Alex
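
Putting the suggested experiment together as a script (a sketch of the test
sequence, not a fix — the BDFs are taken from the lspci output in this
thread, and it defaults to a dry run that only echoes the setpci commands):

```shell
#!/bin/sh
# Dry run by default: prints the setpci commands it would issue.
# Set PCI=setpci in the environment to actually poke the hardware (root).
PCI="${PCI:-echo setpci}"
BRIDGE="0002:00:00.0"   # NXP root port from the lspci output above
EP="0002:01:00.0"       # e1000e endpoint behind it

# Secondary bus reset: set then clear SBR (bit 6) of Bridge Control (0x3e)
$PCI -s "$BRIDGE" 0x3e.b=0x40
$PCI -s "$BRIDGE" 0x3e.b=0x00

# Re-write secondary (0x19) and subordinate (0x1a) bus numbers, as was
# narrowed down for the AMD/Threadripper bus reset issue
$PCI -s "$BRIDGE" 0x19.b=0x01
$PCI -s "$BRIDGE" 0x1a.b=0xff

# Check whether the endpoint's config space is readable again
if [ "$PCI" = "setpci" ]; then
	lspci -s "$EP" -x
fi
```

Running it with `PCI=setpci` on the affected system and diffing the lspci
output before/after would show whether the bus number rewrite recovers the
endpoint.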

> 0002:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network 
> Connection (rev ff)
> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 
> Thanks
> -Bharat
> 
> 



Re: [PATCH v2 2/3] vfio: ap: ioctl definitions for AP Queue Interrupt Control

2018-11-27 Thread Alex Williamson
On Thu, 22 Nov 2018 18:11:14 +0100
Pierre Morel  wrote:

> We define two VFIO ioctl command to setup and clear
> the AP Queues interrupt.
> 
> Arguments passed by the guest are:
> - the apqn, AP queue number
> - the notification indicator byte (nib) address
> - the identifier of the previously associated adapter


We have an extensible VFIO_DEVICE_SET_IRQS ioctl already, why does AP
need its own?


> Signed-off-by: Pierre Morel 
> ---
>  include/uapi/linux/vfio.h | 25 +
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8131028..9a1b350 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -866,6 +866,31 @@ struct vfio_iommu_spapr_tce_remove {
>  };
>  #define VFIO_IOMMU_SPAPR_TCE_REMOVE  _IO(VFIO_TYPE, VFIO_BASE + 20)
>  
> +/**
> + * VFIO_AP_SET_IRQ - _IOWR(VFIO_TYPE, VFIO_BASE + 21, struct vfio_ap_aqic)
> + *
> + * Setup IRQ for an AP Queue
> + * @cmd contains the AP queue number (apqn)
> + * @status receives the resulting status of the command
> + * @nib is the Notification Indicator byte address
> + * @adapter_id allows to retrieve the associated adapter
> + */
> +struct vfio_ap_aqic {
> + __u32   argsz;
> + __u32   flags;
> + /* out */
> + __u32 status;
> + /* in */
> + __u32 adapter_id;
> + __u64 nib;
> + __u16 apqn;
> + __u8 isc;
> + __u8 reserved[5];
> +};
> +#define VFIO_AP_SET_IRQ  _IO(VFIO_TYPE, VFIO_BASE + 21)
> +#define VFIO_AP_CLEAR_IRQ_IO(VFIO_TYPE, VFIO_BASE + 22)
> +
>  /* * */
>  
> +
>  #endif /* _UAPIVFIO_H */



Re: [PATCH] PCI: Mark NXP LS1088 to avoid bus reset bus

2018-11-27 Thread Alex Williamson
On Tue, 27 Nov 2018 09:33:56 -0600
Bjorn Helgaas  wrote:

> [+cc David, Jan, Alex, Maik, Chris]
> 
> On Tue, Nov 27, 2018 at 08:46:33AM +, Bharat Bhushan wrote:
> > NXP (Freescale Vendor ID) LS1088 chips do not behave correctly after
> > bus reset with e1000e. Link state of the device does not come UP and so
> > config space is never accessible again.  
> 
> Previous similar commits:
> 
>   822155100e58 ("PCI: Mark Cavium CN8xxx to avoid bus reset")
>   8e2e03179923 ("PCI: Mark Atheros AR9580 to avoid bus reset")
>   9ac0108c2bac ("PCI: Mark Atheros AR9485 and QCA9882 to avoid bus reset")
>   c3e59ee4e766 ("PCI: Mark Atheros AR93xx to avoid bus reset")
> 
> 1) Please make your subject match (remove the spurious "bus" at the
> end)
> 
> 2) This should probably be marked for stable (v3.14 and later, since
> the quirk itself appeared in v3.19 and marked for v3.14 and later
> stable kernels).  Maybe even mark it as "Fixes: c3e59ee4e766..." to
> connect it.
> 
> 3) The 1957:80c0 PCI ID doesn't appear in https://pci-ids.ucw.cz/; can
> you add it?
> 
> 4) Is there a hardware erratum for this?  If so, please include the
> URL here.
> 
> 5) Can you reproduce the problem using the same endpoint (e1000e) on a
> different system with a different bridge?
> 
> 6) Have you looked at this with a PCIe analyzer?  It would be very
> interesting to compare the boot-time or system reboot path with the
> individual bus reset path you're fixing.
> 
> Since there are several similar reports and they sometimes involve the
> same devices (both your patch and 822155100e58 mention e1000e), I'm a
> little suspicious that we're doing something wrong in the bus reset
> path.

I agree, entirely excluding bus resets is not something to be taken
lightly.  It's less than ideal for an endpoint and a fairly major
functional gap for a downstream port.  It should really be considered
a last resort.

> I think bus reset uses Secondary Bus Reset in the Bridge Control
> register.  That's a generic mechanism that I would expect to be pretty
> well-tested.  I suspect the BIOS probably uses it in the reboot path,
> and the device probably works after that.
> 
> So I wonder if the Linux delay isn't quite long enough, or our first
> access to the device isn't quite right, e.g., maybe there's some issue
> with the bus/device number capture (PCIe r4.0, sec 2.2.6.2).

Tweaking the delay would be a reasonable solution, though we are seeing
some issues where users with lots of assigned devices that require bus
resets experience long delays as vfio file descriptors are closed
sequentially on exit.  So perhaps we could flag downstream ports
requiring an extra delay, if that becomes a solution.  Your mention of
the bus/device number also reminds me of the issue we saw on
Threadripper where there were patches proposed to re-write the
secondary and subordinate bus numbers after reset.  AMD was able to
resolve that in a firmware update, but there could be something similar
occurring here. Thanks,

Alex

> > Signed-off-by: Bharat Bhushan 
> > ---
> >  drivers/pci/quirks.c | 7 +++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 4700d24e5d55..b9ae4e9f101a 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3391,6 +3391,13 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> >   */
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_no_bus_reset);
> >  
> > +/*
> > + * NXP (Freescale Vendor ID) LS1088 chips do not behave correctly after
> > + * bus reset. Link state of the device does not come UP and so config
> > + * space is never accessible again.
> > + */
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x80c0, quirk_no_bus_reset);
> > +
> >  static void quirk_no_pm_reset(struct pci_dev *dev)
> >  {
> > /*
> > -- 
> > 2.19.1
> >   



[PATCH v2] vfio/pci: Parallelize device open and release

2018-11-16 Thread Alex Williamson
In commit 61d792562b53 ("vfio-pci: Use mutex around open, release, and
remove") a mutex was added to freeze the refcnt for a device so that
we can handle errors and perform bus resets on final close.  However,
bus resets can be rather slow and a global mutex here is undesirable.
Evaluating the potential locking granularity, a per-device mutex
provides the best resolution but with multiple devices on a bus all
released concurrently, they'll race to acquire each other's mutex,
likely resulting in no reset at all if we use trylock.  We therefore
lock at the granularity of the bus/slot reset as we're only attempting
a single reset for this group of devices anyway.  This allows much
greater scaling as we're bounded in the number of devices protected by
a single reflck object.

Reported-by: Christian Ehrhardt 
Tested-by: Christian Ehrhardt 
Reviewed-by: Eric Auger 
Signed-off-by: Alex Williamson 
---

v2:
 - Rolled in PTR_ERR_OR_ZERO suggestion from kbuild bot
 - Updated commit log and comments per Eric's feedback

 drivers/vfio/pci/vfio_pci.c |  160 ++-
 drivers/vfio/pci/vfio_pci_private.h |6 +
 2 files changed, 142 insertions(+), 24 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 50cdedfca9fe..ea0670c60c80 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -56,8 +56,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 "Disable using the PCI D3 low power state for idle, unused 
devices");
 
-static DEFINE_MUTEX(driver_lock);
-
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -393,14 +391,14 @@ static void vfio_pci_release(void *device_data)
 {
struct vfio_pci_device *vdev = device_data;
 
-   mutex_lock(&driver_lock);
+   mutex_lock(&vdev->reflck->lock);
 
if (!(--vdev->refcnt)) {
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
}
 
-   mutex_unlock(&driver_lock);
+   mutex_unlock(&vdev->reflck->lock);
 
module_put(THIS_MODULE);
 }
@@ -413,7 +411,7 @@ static int vfio_pci_open(void *device_data)
if (!try_module_get(THIS_MODULE))
return -ENODEV;
 
-   mutex_lock(&driver_lock);
+   mutex_lock(&vdev->reflck->lock);
 
if (!vdev->refcnt) {
ret = vfio_pci_enable(vdev);
@@ -424,7 +422,7 @@ static int vfio_pci_open(void *device_data)
}
vdev->refcnt++;
 error:
-   mutex_unlock(&driver_lock);
+   mutex_unlock(&vdev->reflck->lock);
if (ret)
module_put(THIS_MODULE);
return ret;
@@ -1187,6 +1185,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
.request= vfio_pci_request,
 };
 
+static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
struct vfio_pci_device *vdev;
@@ -1233,6 +1234,14 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
return ret;
}
 
+   ret = vfio_pci_reflck_attach(vdev);
+   if (ret) {
+   vfio_del_group_dev(&pdev->dev);
+   vfio_iommu_group_put(group, &pdev->dev);
+   kfree(vdev);
+   return ret;
+   }
+
if (vfio_pci_is_vga(pdev)) {
vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
vga_set_legacy_decoding(pdev,
@@ -1264,6 +1273,8 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;
 
+   vfio_pci_reflck_put(vdev->reflck);
+
vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev->region);
mutex_destroy(&vdev->ioeventfds_lock);
@@ -1320,16 +1331,97 @@ static struct pci_driver vfio_pci_driver = {
.err_handler= &vfio_err_handlers,
 };
 
+static DEFINE_MUTEX(reflck_lock);
+
+static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
+{
+   struct vfio_pci_reflck *reflck;
+
+   reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
+   if (!reflck)
+   return ERR_PTR(-ENOMEM);
+
+   kref_init(&reflck->kref);
+   mutex_init(&reflck->lock);
+
+   return reflck;
+}
+
+static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
+{
+   kref_get(&reflck->kref);
+}
+
+static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
+{
+   struct vfio_pci_reflck **preflck = data;
+   struct vfio_device *device;
+   struct vfio_pci_device *vdev;
+
+   device = vfio_device_get_from_dev(&pdev->dev);
+   if (!device)
+   ret

Re: [PATCH] vfio/pci: Parallelize device open and release

2018-11-13 Thread Alex Williamson
On Tue, 13 Nov 2018 15:42:49 +0100
Auger Eric  wrote:

> Hi Alex,
> 
> On 11/9/18 11:09 PM, Alex Williamson wrote:
> > In commit 61d792562b53 ("vfio-pci: Use mutex around open, release, and
> > remove") a mutex was added to freeze the refcnt for a device so that
> > we can handle errors and perform bus resets on final close.  However,
> > bus resets can be rather slow and a global mutex here is undesirable.
> > A per-device mutex provides the best granularity, but then our chances
> > of triggering a bus/slot reset with multiple affected devices is slim
> > when devices are released in parallel.  
> Sorry I don't get the above sentence.

There's a locking granularity question here, where currently we're
locking at a global level.  If I want to reduce that granularity, it
seems the obvious question is what minimum granularity can we achieve.
A per-device lock is technically that minimum for the purposes of
serializing the device around open/release, but then concurrent
releases don't necessarily provide us the opportunity to perform a bus
reset affecting multiple devices since all the devices are racing each
other.  Therefore I conclude that a bus/slot locking granularity
provides us the best compromise of granularity vs functionality.

> > Instead create a reflck object
> > shared among all devices under the same bus or slot, allowing devices
> > on independent buses to be released in parallel while serializing per
> > bus/slot.
> > Reported-by: Christian Ehrhardt 
> > Tested-by: Christian Ehrhardt 
> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/vfio/pci/vfio_pci.c |  157 
> > ++-
> >  drivers/vfio/pci/vfio_pci_private.h |6 +
> >  2 files changed, 139 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 50cdedfca9fe..d443fb7a4e75 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -56,8 +56,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> >  MODULE_PARM_DESC(disable_idle_d3,
> >  "Disable using the PCI D3 low power state for idle, unused 
> > devices");
> >  
> > -static DEFINE_MUTEX(driver_lock);
> > -
> >  static inline bool vfio_vga_disabled(void)
> >  {
> >  #ifdef CONFIG_VFIO_PCI_VGA
> > @@ -393,14 +391,14 @@ static void vfio_pci_release(void *device_data)
> >  {
> > struct vfio_pci_device *vdev = device_data;
> >  
> > -   mutex_lock(&driver_lock);
> > +   mutex_lock(&vdev->reflck->lock);
> >  
> > if (!(--vdev->refcnt)) {
> > vfio_spapr_pci_eeh_release(vdev->pdev);
> > vfio_pci_disable(vdev);
> > }
> >  
> > -   mutex_unlock(&driver_lock);
> > +   mutex_unlock(&vdev->reflck->lock);
> >  
> > module_put(THIS_MODULE);
> >  }
> > @@ -413,7 +411,7 @@ static int vfio_pci_open(void *device_data)
> > if (!try_module_get(THIS_MODULE))
> > return -ENODEV;
> >  
> > -   mutex_lock(&driver_lock);
> > +   mutex_lock(&vdev->reflck->lock);
> >  
> > if (!vdev->refcnt) {
> > ret = vfio_pci_enable(vdev);
> > @@ -424,7 +422,7 @@ static int vfio_pci_open(void *device_data)
> > }
> > vdev->refcnt++;
> >  error:
> > -   mutex_unlock(&driver_lock);
> > +   mutex_unlock(&vdev->reflck->lock);
> > if (ret)
> > module_put(THIS_MODULE);
> > return ret;
> > @@ -1187,6 +1185,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
> > .request= vfio_pci_request,
> >  };
> >  
> > +static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
> > +static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
> > +
> >  static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id 
> > *id)
> >  {
> > struct vfio_pci_device *vdev;
> > @@ -1233,6 +1234,14 @@ static int vfio_pci_probe(struct pci_dev *pdev, 
> > const struct pci_device_id *id)
> > return ret;
> > }
> >  
> > +   ret = vfio_pci_reflck_attach(vdev);
> > +   if (ret) {
> > +   vfio_del_group_dev(&pdev->dev);
> > +   vfio_iommu_group_put(group, &pdev->dev);
> > +   kfree(vdev);
> > +   return ret;
> > +   }
> > +
> > if (vfio_pci_is_vga(pdev)) {
> > vga_client_register(pdev, vdev, NULL, vf

[PATCH] vfio/pci: Parallelize device open and release

2018-11-09 Thread Alex Williamson
In commit 61d792562b53 ("vfio-pci: Use mutex around open, release, and
remove") a mutex was added to freeze the refcnt for a device so that
we can handle errors and perform bus resets on final close.  However,
bus resets can be rather slow and a global mutex here is undesirable.
A per-device mutex provides the best granularity, but then our chances
of triggering a bus/slot reset with multiple affected devices is slim
when devices are released in parallel.  Instead create a reflck object
shared among all devices under the same bus or slot, allowing devices
on independent buses to be released in parallel while serializing per
bus/slot.

Reported-by: Christian Ehrhardt 
Tested-by: Christian Ehrhardt 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |  157 ++-
 drivers/vfio/pci/vfio_pci_private.h |6 +
 2 files changed, 139 insertions(+), 24 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 50cdedfca9fe..d443fb7a4e75 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -56,8 +56,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 "Disable using the PCI D3 low power state for idle, unused 
devices");
 
-static DEFINE_MUTEX(driver_lock);
-
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -393,14 +391,14 @@ static void vfio_pci_release(void *device_data)
 {
struct vfio_pci_device *vdev = device_data;
 
-   mutex_lock(&driver_lock);
+   mutex_lock(&vdev->reflck->lock);
 
if (!(--vdev->refcnt)) {
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
}
 
-   mutex_unlock(&driver_lock);
+   mutex_unlock(&vdev->reflck->lock);
 
module_put(THIS_MODULE);
 }
@@ -413,7 +411,7 @@ static int vfio_pci_open(void *device_data)
if (!try_module_get(THIS_MODULE))
return -ENODEV;
 
-   mutex_lock(&driver_lock);
+   mutex_lock(&vdev->reflck->lock);
 
if (!vdev->refcnt) {
ret = vfio_pci_enable(vdev);
@@ -424,7 +422,7 @@ static int vfio_pci_open(void *device_data)
}
vdev->refcnt++;
 error:
-   mutex_unlock(&driver_lock);
+   mutex_unlock(&vdev->reflck->lock);
if (ret)
module_put(THIS_MODULE);
return ret;
@@ -1187,6 +1185,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
.request= vfio_pci_request,
 };
 
+static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
struct vfio_pci_device *vdev;
@@ -1233,6 +1234,14 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
return ret;
}
 
+   ret = vfio_pci_reflck_attach(vdev);
+   if (ret) {
+   vfio_del_group_dev(&pdev->dev);
+   vfio_iommu_group_put(group, &pdev->dev);
+   kfree(vdev);
+   return ret;
+   }
+
if (vfio_pci_is_vga(pdev)) {
vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
vga_set_legacy_decoding(pdev,
@@ -1264,6 +1273,8 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;
 
+   vfio_pci_reflck_put(vdev->reflck);
+
vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev->region);
mutex_destroy(&vdev->ioeventfds_lock);
@@ -1320,16 +1331,97 @@ static struct pci_driver vfio_pci_driver = {
.err_handler= &vfio_err_handlers,
 };
 
+static DEFINE_MUTEX(reflck_lock);
+
+static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
+{
+   struct vfio_pci_reflck *reflck;
+
+   reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
+   if (!reflck)
+   return ERR_PTR(-ENOMEM);
+
+   kref_init(&reflck->kref);
+   mutex_init(&reflck->lock);
+
+   return reflck;
+}
+
+static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
+{
+   kref_get(&reflck->kref);
+}
+
+static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
+{
+   struct vfio_pci_reflck **preflck = data;
+   struct vfio_device *device;
+   struct vfio_pci_device *vdev;
+
+   device = vfio_device_get_from_dev(&pdev->dev);
+   if (!device)
+   return 0;
+
+   if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+   vfio_device_put(device);
+   return 0;
+   }
+
+   vdev = vfio_device_data(device);
+
+   if (vdev->reflck) {
+   vfio_pci_reflck_get(vdev->reflck);
+   *pre

Re: [RFC PATCH v4 06/13] vfio: parallelize vfio_pin_map_dma

2018-11-05 Thread Alex Williamson
On Mon,  5 Nov 2018 11:55:51 -0500
Daniel Jordan  wrote:

> When starting a large-memory kvm guest, it takes an excessively long
> time to start the boot process because qemu must pin all guest pages to
> accommodate DMA when VFIO is in use.  Currently just one CPU is
> responsible for the page pinning, which usually boils down to page
> clearing time-wise, so the ways to optimize this are buying a faster
> CPU ;-) or using more of the CPUs you already have.
> 
> Parallelize with ktask.  Refactor so workqueue workers pin with the mm
> of the calling thread, and to enable an undo callback for ktask to
> handle errors during page pinning.
> 
> Performance results appear later in the series.
> 
> Signed-off-by: Daniel Jordan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 106 +++-
>  1 file changed, 76 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d9fd3188615d..e7cfbf0c8071 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -41,6 +41,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson "
> @@ -395,7 +396,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>   */
>  static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> long npage, unsigned long *pfn_base,
> -   unsigned long limit)
> +   unsigned long limit, struct mm_struct *mm)
>  {
>   unsigned long pfn = 0;
>   long ret, pinned = 0, lock_acct = 0;
> @@ -403,10 +404,10 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
> unsigned long vaddr,
>   dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
>  
>   /* This code path is only user initiated */
> - if (!current->mm)
> + if (!mm)
>   return -ENODEV;
>  
> - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
> + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
>   if (ret)
>   return ret;
>  
> @@ -418,7 +419,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
> unsigned long vaddr,
>* pages are already counted against the user.
>*/
>   if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> - if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
> + if (!dma->lock_cap && mm->locked_vm + 1 > limit) {
>   put_pfn(*pfn_base, dma->prot);
>   pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
>   limit << PAGE_SHIFT);
> @@ -433,7 +434,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
> unsigned long vaddr,
>   /* Lock all the consecutive pages from pfn_base */
>   for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
>pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
> - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
> + ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn);
>   if (ret)
>   break;
>  
> @@ -445,7 +446,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
> unsigned long vaddr,
>  
>   if (!rsvd && !vfio_find_vpfn(dma, iova)) {
>   if (!dma->lock_cap &&
> - current->mm->locked_vm + lock_acct + 1 > limit) {
> + mm->locked_vm + lock_acct + 1 > limit) {
>   put_pfn(pfn, dma->prot);
>   pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>   __func__, limit << PAGE_SHIFT);
> @@ -752,15 +753,15 @@ static size_t unmap_unpin_slow(struct vfio_domain 
> *domain,
>  }
>  
>  static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> +  dma_addr_t iova, dma_addr_t end,
>bool do_accounting)
>  {
> - dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
>   struct vfio_domain *domain, *d;
>   LIST_HEAD(unmapped_region_list);
>   int unmapped_region_cnt = 0;
>   long unlocked = 0;
>  
> - if (!dma->size)
> + if (iova == end)
>   return 0;
>  
>   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> @@ -777,7 +778,7 @@ static long vfio_unmap_unpin(struct vfio_iom

[GIT PULL] VFIO updates for v4.20-rc1

2018-10-30 Thread Alex Williamson
Hi Linus,

The following changes since commit 6bf4ca7fbc85d80446ac01c0d1d77db4d91a6d84:

  Linux 4.19-rc5 (2018-09-23 19:15:18 +0200)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.20-rc1.v2

for you to fetch changes up to 104c7405a64d937254b6a154938e6151f91c9e0d:

  vfio: add edid support to mbochs sample driver (2018-10-11 10:22:36 -0600)


VFIO updates for v4.20

 - EDID interfaces for vfio devices supporting display extensions
   (Gerd Hoffmann)

 - Generically select Type-1 IOMMU model support on ARM/ARM64
   (Geert Uytterhoeven)

 - Quirk for VFs reporting INTx pin (Alex Williamson)

 - Fix error path memory leak in MSI support (Li Qiang)


Alex Williamson (1):
  vfio/pci: Mask buggy SR-IOV VF INTx support

Geert Uytterhoeven (1):
  drivers/vfio: Allow type-1 IOMMU instantiation with all ARM/ARM64 IOMMUs

Gerd Hoffmann (2):
  vfio: add edid api for display (vgpu) devices.
  vfio: add edid support to mbochs sample driver

Li Qiang (1):
  vfio/pci: Fix potential memory leak in vfio_msi_cap_len

 drivers/vfio/Kconfig   |   2 +-
 drivers/vfio/pci/vfio_pci.c|   8 ++-
 drivers/vfio/pci/vfio_pci_config.c |  31 -
 include/uapi/linux/vfio.h  |  50 ++
 samples/vfio-mdev/mbochs.c | 136 +++--
 5 files changed, 204 insertions(+), 23 deletions(-)


Re: [PATCH v4] drivers/vfio: Fix a redundant copy bug

2018-10-29 Thread Alex Williamson
On Mon, 29 Oct 2018 13:56:54 -0500
Wenwen Wang  wrote:

> Hello,
> 
> Could you please apply this patch? Thanks!

I'd like to see testing and/or review from David or Alexey since I also
don't have an environment for spapr/eeh.  We're already late into the
v4.20 merge window so this is probably v4.21 material.  Thanks,

Alex

> On Wed, Oct 17, 2018 at 2:18 PM Wenwen Wang  wrote:
> >
> > In vfio_spapr_iommu_eeh_ioctl(), if the ioctl command is VFIO_EEH_PE_OP,
> > the user-space buffer 'arg' is copied to the kernel object 'op' and the
> > 'argsz' and 'flags' fields of 'op' are checked. If the check fails, an
> > error code EINVAL is returned. Otherwise, 'op.op' is further checked
> > through a switch statement to invoke related handlers. If 'op.op' is
> > VFIO_EEH_PE_INJECT_ERR, the whole user-space buffer 'arg' is copied again
> > to 'op' to obtain the err information. However, in the following execution
> > of this case, the fields of 'op', except the field 'err', are actually not
> > used. That is, the second copy has a redundant part. Therefore, for
> > performance consideration, the redundant part of the second copy should be
> > removed.
> >
> > This patch removes such a part in the second copy. It only copies from
> > 'err.type' to 'err.mask', which is exactly required by the
> > VFIO_EEH_PE_INJECT_ERR op.
> >
> > Signed-off-by: Wenwen Wang 
> > ---
> >  drivers/vfio/vfio_spapr_eeh.c | 9 ++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_spapr_eeh.c b/drivers/vfio/vfio_spapr_eeh.c
> > index 38edeb4..66634c6 100644
> > --- a/drivers/vfio/vfio_spapr_eeh.c
> > +++ b/drivers/vfio/vfio_spapr_eeh.c
> > @@ -37,6 +37,7 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
> > struct eeh_pe *pe;
> > struct vfio_eeh_pe_op op;
> > unsigned long minsz;
> > +   unsigned long start, end;
> > long ret = -EINVAL;
> >
> > switch (cmd) {
> > @@ -86,10 +87,12 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
> > ret = eeh_pe_configure(pe);
> > break;
> > case VFIO_EEH_PE_INJECT_ERR:
> > -   minsz = offsetofend(struct vfio_eeh_pe_op, err.mask);
> > -   if (op.argsz < minsz)
> > +   start = offsetof(struct vfio_eeh_pe_op, err.type);
> > +   end = offsetofend(struct vfio_eeh_pe_op, err.mask);
> > +   if (op.argsz < end)
> > return -EINVAL;
> > -   if (copy_from_user(&op, (void __user *)arg, minsz))
> > +   if (copy_from_user(&op.err, (char __user *)arg +
> > +   start, end - start))
> > return -EFAULT;
> >
> > ret = eeh_pe_inject_err(pe, op.err.type, 
> > op.err.func,
> > --
> > 2.7.4
> >  



Re: Can VFIO pin only a specific region of guest mem when use pass through devices?

2018-10-29 Thread Alex Williamson
On Mon, 29 Oct 2018 17:14:46 +0800
Jason Wang  wrote:

> On 2018/10/29 上午10:42, Simon Guo wrote:
> > Hi,
> >
> > I am using network device pass-through mode with qemu x86 (-device
> > vfio-pci,host=:xx:yy.z) and “intel_iommu=on” in the host kernel command
> > line, and it shows that the whole guest memory is pinned
> > (vfio_pin_pages()), as viewed by the “top” RES memory output. I
> > understand this is because the device can DMA to any guest memory
> > address and so it cannot be swapped.
> >
> > However, can we just pin a range of address space allowed by the iommu
> > group of that device, instead of pinning the whole address space? I do
> > notice some code like vtd_host_dma_iommu(). Maybe there is already some
> > way to enable that?
> >
> > Sorry if I missed some basics. I googled a bit but had no luck finding
> > the answer yet. Please let me know if any discussion has already been
> > raised on that.
> >
> > Any other suggestion will also be appreciated. For example, can we
> > modify the guest network card driver to allocate only from a specific
> > memory region (zone), and have qemu advise the guest kernel to pin only
> > that memory region (zone) accordingly?
> >
> > Thanks,
> > - Simon  
> 
> 
> One possible method is to enable IOMMU of VM.

Right, making use of a virtual IOMMU in the VM is really the only way
to bound the DMA to some subset of guest memory, but vIOMMU usage by
the guest is optional on x86 and even if the guest does use it, it might
enable passthrough mode, which puts you back at the problem that all
guest memory is pinned with the additional problem that it might also
be accounted for once per assigned device and may hit locked memory
limits.  Also, the DMA mapping and unmapping path with a vIOMMU is very
slow, so performance of the device in the guest will be abysmal unless
the use case is limited to very static mappings, such as userspace use
within the guest for nested assignment or perhaps DPDK use cases.

Modifying the guest to only use a portion of memory for DMA sounds like
a quite intrusive option.  There are certainly IOMMU models where the
IOMMU provides a fixed IOVA range, but creating dynamic mappings within
that range doesn't really solve anything given that it simply returns
us to a vIOMMU with slow mapping.  A window with a fixed identity
mapping used as a DMA zone seems plausible, but again, also pretty
intrusive to the guest, possibly also to the drivers.  Host IOMMU page
faulting can also help the pinned memory footprint, but of course
requires hardware support and lots of new code paths, many of which are
already being discussed for things like Scalable IOV and SVA.  Thanks,

Alex


Re: [PATCH] drivers/vfio: Fix an 8-byte alignment issue

2018-10-17 Thread Alex Williamson
On Wed, 17 Oct 2018 17:15:33 -0400
Konrad Rzeszutek Wilk  wrote:

> On Wed, Oct 17, 2018 at 01:18:19PM -0500, Wenwen Wang wrote:
> > This patch adds a 4-byte reserved field in the structure
> > vfio_eeh_pe_op to make sure that the u64 fields in the structure
> > vfio_eeh_pe_err are 8-byte aligned.  
> 
> Won't this break 32-bit kernels? That is the size of the structure
> will now be 4 bytes bigger..

Hi Konrad,

EEH support here depends on SPAPR_TCE_IOMMU which depends on either
PPC_POWERNV or PPC_PSERIES, both of which depend on PPC64.  So I don't
think 32-bit kernels are a concern here.  Thanks,

Alex
 
> > Signed-off-by: Wenwen Wang 
> > ---
> >  include/uapi/linux/vfio.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 1aa7b82..3e71ded 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -729,6 +729,7 @@ struct vfio_eeh_pe_op {
> > __u32 argsz;
> > __u32 flags;
> > __u32 op;
> > +   __u32 __res;
> > union {
> > struct vfio_eeh_pe_err err;
> > };
> > -- 
> > 2.7.4
> >   



Re: [PATCH v3] drivers/vfio: Fix a redundant copy bug

2018-10-17 Thread Alex Williamson
On Wed, 17 Oct 2018 12:58:26 -0500
Wenwen Wang  wrote:

> On Wed, Oct 17, 2018 at 10:45 AM Alex Williamson
>  wrote:
> >
> > On Wed, 17 Oct 2018 09:32:04 -0500
> > Wenwen Wang  wrote:
> >  
> > > In vfio_spapr_iommu_eeh_ioctl(), if the ioctl command is VFIO_EEH_PE_OP,
> > > the user-space buffer 'arg' is copied to the kernel object 'op' and the
> > > 'argsz' and 'flags' fields of 'op' are checked. If the check fails, an
> > > error code EINVAL is returned. Otherwise, 'op.op' is further checked
> > > through a switch statement to invoke related handlers. If 'op.op' is
> > > VFIO_EEH_PE_INJECT_ERR, the whole user-space buffer 'arg' is copied again
> > > to 'op' to obtain the err information. However, in the following execution
> > > of this case, the fields of 'op', except the field 'err', are actually not
> > > used. That is, the second copy has a redundant part. Therefore, for
> > > performance considerations, the redundant part of the second copy should be
> > > removed.
> > >
> > > This patch removes such a part in the second copy. It only copies from
> > > 'err.type' to 'err.mask', which is exactly required by the
> > > VFIO_EEH_PE_INJECT_ERR op.
> > >
> > > This patch also adds a 4-byte reserved field in the structure
> > > vfio_eeh_pe_op to make sure that the u64 fields in the structure
> > > vfio_eeh_pe_err are 8-byte aligned.
> > >
> > > Signed-off-by: Wenwen Wang 
> > > ---
> > >  drivers/vfio/vfio_spapr_eeh.c | 9 ++---
> > >  include/uapi/linux/vfio.h | 1 +
> > >  2 files changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio_spapr_eeh.c b/drivers/vfio/vfio_spapr_eeh.c
> > > index 38edeb4..66634c6 100644
> > > --- a/drivers/vfio/vfio_spapr_eeh.c
> > > +++ b/drivers/vfio/vfio_spapr_eeh.c
> > > @@ -37,6 +37,7 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group 
> > > *group,
> > >   struct eeh_pe *pe;
> > >   struct vfio_eeh_pe_op op;
> > >   unsigned long minsz;
> > > + unsigned long start, end;
> > >   long ret = -EINVAL;
> > >
> > >   switch (cmd) {
> > > @@ -86,10 +87,12 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group 
> > > *group,
> > >   ret = eeh_pe_configure(pe);
> > >   break;
> > >   case VFIO_EEH_PE_INJECT_ERR:
> > > - minsz = offsetofend(struct vfio_eeh_pe_op, 
> > > err.mask);
> > > - if (op.argsz < minsz)
> > > + start = offsetof(struct vfio_eeh_pe_op, err.type);  
> >
> > I noted in the previous version that we already have this in minsz, so
> > you're fixing a redundant copy with a redundant operation.  
> 
> The value in start is different from the value in minsz. So why is
> this a redundant operation?

I suppose that's true given the alignment issue below, so we're
actually avoiding 16 bytes rather than 12.  The benefit of this change
still seems pretty thin to me, but it is more correct, so I guess it's
ok.  Do you want to send a new version or shall I just drop the vfio.h
changes and the last paragraph of the commit log in favor of the
separate patch?  Alexey or David, do you want to provide an Ack for
these?  Thanks,

Alex

> > > + end = offsetofend(struct vfio_eeh_pe_op, err.mask);
> > > + if (op.argsz < end)
> > >   return -EINVAL;
> > > - if (copy_from_user(&op, (void __user *)arg, minsz))
> > > + if (copy_from_user(&op.err, (char __user *)arg +
> > > + start, end - start))
> > >   return -EFAULT;
> > >
> > >   ret = eeh_pe_inject_err(pe, op.err.type, 
> > > op.err.func,
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 1aa7b82..d904c42 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -729,6 +729,7 @@ struct vfio_eeh_pe_op {
> > >   __u32 argsz;
> > >   __u32 flags;
> > >   __u32 op;
> > > + __u32 __resv;
> > >   union {
> > >   struct vfio_eeh_pe_err err;
> > >   };  
> >
> > Please don't include two separate issues in the same patch.  Am I also
> > correct in assuming that this is untested?  Thanks,  
> 
> No problem. I will seperate these two patches. And yes, this is not tested.
> 
> Thanks,
> Wenwen



Re: [PATCH v3] drivers/vfio: Fix a redundant copy bug

2018-10-17 Thread Alex Williamson
On Wed, 17 Oct 2018 09:32:04 -0500
Wenwen Wang  wrote:

> In vfio_spapr_iommu_eeh_ioctl(), if the ioctl command is VFIO_EEH_PE_OP,
> the user-space buffer 'arg' is copied to the kernel object 'op' and the
> 'argsz' and 'flags' fields of 'op' are checked. If the check fails, an
> error code EINVAL is returned. Otherwise, 'op.op' is further checked
> through a switch statement to invoke related handlers. If 'op.op' is
> VFIO_EEH_PE_INJECT_ERR, the whole user-space buffer 'arg' is copied again
> to 'op' to obtain the err information. However, in the following execution
> of this case, the fields of 'op', except the field 'err', are actually not
> used. That is, the second copy has a redundant part. Therefore, for
> performance considerations, the redundant part of the second copy should be
> removed.
> 
> This patch removes such a part in the second copy. It only copies from
> 'err.type' to 'err.mask', which is exactly required by the
> VFIO_EEH_PE_INJECT_ERR op.
> 
> This patch also adds a 4-byte reserved field in the structure
> vfio_eeh_pe_op to make sure that the u64 fields in the structure
> vfio_eeh_pe_err are 8-byte aligned.
> 
> Signed-off-by: Wenwen Wang 
> ---
>  drivers/vfio/vfio_spapr_eeh.c | 9 ++---
>  include/uapi/linux/vfio.h | 1 +
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_spapr_eeh.c b/drivers/vfio/vfio_spapr_eeh.c
> index 38edeb4..66634c6 100644
> --- a/drivers/vfio/vfio_spapr_eeh.c
> +++ b/drivers/vfio/vfio_spapr_eeh.c
> @@ -37,6 +37,7 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>   struct eeh_pe *pe;
>   struct vfio_eeh_pe_op op;
>   unsigned long minsz;
> + unsigned long start, end;
>   long ret = -EINVAL;
>  
>   switch (cmd) {
> @@ -86,10 +87,12 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>   ret = eeh_pe_configure(pe);
>   break;
>   case VFIO_EEH_PE_INJECT_ERR:
> - minsz = offsetofend(struct vfio_eeh_pe_op, err.mask);
> - if (op.argsz < minsz)
> + start = offsetof(struct vfio_eeh_pe_op, err.type);

I noted in the previous version that we already have this in minsz, so
you're fixing a redundant copy with a redundant operation.

> + end = offsetofend(struct vfio_eeh_pe_op, err.mask);
> + if (op.argsz < end)
>   return -EINVAL;
> - if (copy_from_user(&op, (void __user *)arg, minsz))
> + if (copy_from_user(&op.err, (char __user *)arg +
> + start, end - start))
>   return -EFAULT;
>  
>   ret = eeh_pe_inject_err(pe, op.err.type, op.err.func,
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..d904c42 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -729,6 +729,7 @@ struct vfio_eeh_pe_op {
>   __u32 argsz;
>   __u32 flags;
>   __u32 op;
> + __u32 __resv;
>   union {
>   struct vfio_eeh_pe_err err;
>   };

Please don't include two separate issues in the same patch.  Am I also
correct in assuming that this is untested?  Thanks,

Alex


Re: [PATCH v2] drivers/vfio: Fix a redundant copy bug

2018-10-08 Thread Alex Williamson
On Mon,  8 Oct 2018 13:06:20 -0500
Wenwen Wang  wrote:

> In vfio_spapr_iommu_eeh_ioctl(), if the ioctl command is VFIO_EEH_PE_OP,
> the user-space buffer 'arg' is copied to the kernel object 'op' and the
> 'argsz' and 'flags' fields of 'op' are checked. If the check fails, an
> error code EINVAL is returned. Otherwise, 'op.op' is further checked
> through a switch statement to invoke related handlers. If 'op.op' is
> VFIO_EEH_PE_INJECT_ERR, the whole user-space buffer 'arg' is copied again
> to 'op' to obtain the err information. However, in the following execution
> of this case, the fields of 'op', except the field 'err', are actually not
> used. That is, the second copy has a redundant part. Therefore, for
> performance considerations, the redundant part of the second copy should be
> removed.
> 
> This patch removes such a part in the second copy. It only copies from
> 'err.type' to 'err.mask', which is exactly required by the
> VFIO_EEH_PE_INJECT_ERR op.
> 
> Signed-off-by: Wenwen Wang 
> ---
>  drivers/vfio/vfio_spapr_eeh.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_spapr_eeh.c b/drivers/vfio/vfio_spapr_eeh.c
> index 38edeb4..66634c6 100644
> --- a/drivers/vfio/vfio_spapr_eeh.c
> +++ b/drivers/vfio/vfio_spapr_eeh.c
> @@ -37,6 +37,7 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>   struct eeh_pe *pe;
>   struct vfio_eeh_pe_op op;
>   unsigned long minsz;
> + unsigned long start, end;
>   long ret = -EINVAL;
>  
>   switch (cmd) {
> @@ -86,10 +87,12 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>   ret = eeh_pe_configure(pe);
>   break;
>   case VFIO_EEH_PE_INJECT_ERR:
> - minsz = offsetofend(struct vfio_eeh_pe_op, err.mask);
> - if (op.argsz < minsz)
> + start = offsetof(struct vfio_eeh_pe_op, err.type);

We already have this in minsz, offsetofend(,op) == offsetof(,err.type).
That can't change without breaking userspace.

> + end = offsetofend(struct vfio_eeh_pe_op, err.mask);
> + if (op.argsz < end)
>   return -EINVAL;
> - if (copy_from_user(&op, (void __user *)arg, minsz))
> + if (copy_from_user(&op.err, (char __user *)arg +
> + start, end - start))

So we trade 12 bytes of redundant copy for an extra stack variable and
an arithmetic operation, not necessarily an obvious win, but more
correct I guess.

Alexey, I also notice that these 12 bytes means that the u64 fields in
struct vfio_eeh_pe_err are not 8-byte aligned which could lead to
compiler dependent packing interpretation issues with userspace.
Should there be a 4-byte reserved field in there to make it explicit
(so long as it matches the current interpretation)?  Thanks,

Alex


Re: [PATCH] drivers/vfio: Fix a redundant copy bug

2018-10-08 Thread Alex Williamson
Hi,

On Sun,  7 Oct 2018 09:44:25 -0500
Wenwen Wang  wrote:

> In vfio_spapr_iommu_eeh_ioctl(), if the ioctl command is VFIO_EEH_PE_OP,
> the user-space buffer 'arg' is copied to the kernel object 'op' and the
> 'argsz' and 'flags' fields of 'op' are checked. If the check fails, an
> error code EINVAL is returned. Otherwise, 'op.op' is further checked
> through a switch statement to invoke related handlers. If 'op.op' is
> VFIO_EEH_PE_INJECT_ERR, the whole user-space buffer 'arg' is copied again
> to 'op' to obtain the err information. However, in the following execution
> of this case, the fields of 'op', except the field 'err', are actually not
> used. That is, the second copy has a redundant part. Therefore, for both
> performance and security reasons, the redundant part of the second copy
> should be removed.

Redundant, yes.  Performance-wise it's 12 bytes on a non-performance
path, so theoretically yes, but in practice maybe it's a simplicity
trade-off.  Security?  I don't see it, please explain.

> This patch removes such a part in the second copy. It only copies the 'err'
> information from the buffer 'arg'.
> 
> Signed-off-by: Wenwen Wang 
> ---
>  drivers/vfio/vfio_spapr_eeh.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_spapr_eeh.c b/drivers/vfio/vfio_spapr_eeh.c
> index 38edeb4..5bc4b60 100644
> --- a/drivers/vfio/vfio_spapr_eeh.c
> +++ b/drivers/vfio/vfio_spapr_eeh.c
> @@ -86,10 +86,10 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>   ret = eeh_pe_configure(pe);
>   break;
>   case VFIO_EEH_PE_INJECT_ERR:
> - minsz = offsetofend(struct vfio_eeh_pe_op, err.mask);
> - if (op.argsz < minsz)
> + if (op.argsz < sizeof(op))
>   return -EINVAL;

The original code is written such that new operations can be added,
possibly with new entries in the struct vfio_eeh_pe_op union, which
might change sizeof(op) to be more than necessary for a
VFIO_EEH_PE_INJECT_ERR op.  Existing userspace suddenly wouldn't work
without effectively reverting this change.  This is a subtle dependency
that is not worth the above code change, imo.

> - if (copy_from_user(&op, (void __user *)arg, minsz))
> + if (copy_from_user(&op.err, (char __user *)arg +
> + minsz, sizeof(op.err)))
>   return -EFAULT;

Please rework with the assumption that the union in struct
vfio_eeh_pe_op can be expanded and must not break existing userspace.
Thanks,

Alex


Re: [PATCH v3 2/2] vfio: add edid support to mbochs sample driver

2018-09-28 Thread Alex Williamson
On Fri, 28 Sep 2018 14:10:26 +0530
Kirti Wankhede  wrote:

> On 9/28/2018 11:10 AM, Gerd Hoffmann wrote:
> >>> + case MBOCHS_EDID_REGION_INDEX:
> >>> + ext->base.argsz = sizeof(*ext);
> >>> + ext->base.offset = MBOCHS_EDID_OFFSET;
> >>> + ext->base.size = MBOCHS_EDID_SIZE;
> >>> + ext->base.flags = (VFIO_REGION_INFO_FLAG_READ  |
> >>> +VFIO_REGION_INFO_FLAG_WRITE |
> >>> +VFIO_REGION_INFO_FLAG_CAPS);  
> >>
> >> Any reason to not to use _MMAP flag?  
> > 
> > There is no page backing this.  Also it is not performance-critical,
> > edid updates should be rare, so the extra code for mmap support doesn't
> > look like it is worth it.
> > 
> > Also for the virtual registers (especially link_state) it is probably
> > useful to have the write callback of the mdev driver called to get
> > notified about the change.
> >   
> >> How would QEMU side code read this region? will it be always trapped?  
> > 
> > qemu uses read & write syscalls (well, pread & pwrite actually).
> >   
> >> If vendor driver sets _MMAP flag, will QEMU side handle that case as well? 
> >>  
> > 
> > The current test branch doesn't, it expects read+write to work.
> >   https://git.kraxel.org/cgit/qemu/log/?h=sirius/edid-vfio
> >   
> 
> Ok.
> Can you add a comment in vfio.h that this region is non-mmappable?

No, region access mechanisms are self describing through the region
info ioctl, it's left to the implementation to decide what to support.
The region definition should not impose such a restriction.  Thanks,

Alex


Re: [PATCH v3 2/2] vfio: add edid support to mbochs sample driver

2018-09-27 Thread Alex Williamson
On Fri, 28 Sep 2018 01:27:16 +0530
Kirti Wankhede  wrote:

> On 9/21/2018 2:00 PM, Gerd Hoffmann wrote:
> > Signed-off-by: Gerd Hoffmann 
> > @@ -964,6 +1050,20 @@ static int mbochs_get_region_info(struct mdev_device 
> > *mdev,
> > region_info->flags  = (VFIO_REGION_INFO_FLAG_READ  |
> >VFIO_REGION_INFO_FLAG_WRITE);
> > break;
> > +   case MBOCHS_EDID_REGION_INDEX:
> > +   ext->base.argsz = sizeof(*ext);
> > +   ext->base.offset = MBOCHS_EDID_OFFSET;
> > +   ext->base.size = MBOCHS_EDID_SIZE;
> > +   ext->base.flags = (VFIO_REGION_INFO_FLAG_READ  |
> > +  VFIO_REGION_INFO_FLAG_WRITE |
> > +  VFIO_REGION_INFO_FLAG_CAPS);  
> 
> Any reason to not to use _MMAP flag?
> How would QEMU side code read this region? will it be always trapped?
> If vendor driver sets _MMAP flag, will QEMU side handle that case as well?
> I think since it's a blob, edid could be read by QEMU using one memcpy
> rather than adding multiple memcpys of 4 or 8 bytes.

"Trapping" would only come into play if the region were exposed to the
VM, which there's no intention to do here afaik.  Also, just because
it doesn't support mmap doesn't mean that QEMU necessarily needs to
break down accesses into smaller words, QEMU could do:

pwrite(fd, buf, edid_max_size, region_offset + edid_offset)

ie. write the entire edid area with one operation.  I don't think
there's anything in the specification that prevents mmap now,
edid_offset could be at a page alignment, edid_max_size could be
PAGE_SIZE, and a sparse mmap capability could indicate that only the
EDID area is mmap'able, but is it worth the code to support that?
Thanks,

Alex


Re: [PATCH v11 26/26] s390: doc: detailed specifications for AP virtualization

2018-09-26 Thread Alex Williamson
On Tue, 25 Sep 2018 19:16:41 -0400
Tony Krowiak  wrote:

> From: Tony Krowiak 
> 
> This patch provides documentation describing the AP architecture and
> design concepts behind the virtualization of AP devices. It also
> includes an example of how to configure AP devices for exclusive
> use of KVM guests.
> 
> Signed-off-by: Tony Krowiak 
> Reviewed-by: Halil Pasic 
> ---
>  Documentation/s390/vfio-ap.txt | 782 +
>  MAINTAINERS|   1 +
>  2 files changed, 783 insertions(+)
>  create mode 100644 Documentation/s390/vfio-ap.txt
...
> +Example:
> +===
> +Let's now provide an example to illustrate how KVM guests may be given
> +access to AP facilities. For this example, we will show how to configure
> +three guests such that executing the lszcrypt command on the guests would
> +look like this:
> +
> +Guest1
> +--
> +CARD.DOMAIN TYPE  MODE
> +--
> +05  CEX5C CCA-Coproc
> +05.0004 CEX5C CCA-Coproc
> +05.00ab CEX5C CCA-Coproc
> +06  CEX5A Accelerator
> +06.0004 CEX5A Accelerator
> +06.00ab CEX5C CCA-Coproc
> +
> +Guest2
> +--
> +CARD.DOMAIN TYPE  MODE
> +--
> +05  CEX5A Accelerator
> +05.0047 CEX5A Accelerator
> +05.00ff CEX5A Accelerator (5,4), (5,171), (6,4), (6,171),
 ^^^
Seems like an unfinished thought here. 

> +
> +Guest2
> +--
> +CARD.DOMAIN TYPE  MODE
> +--
> +06  CEX5A Accelerator
> +06.0047 CEX5A Accelerator
> +06.00ff CEX5A Accelerator
> +
> +These are the steps:
> +
> +1. Install the vfio_ap module on the linux host. The dependency chain for the
> +   vfio_ap module is:
> +   * iommu
> +   * s390
> +   * zcrypt
> +   * vfio
> +   * vfio_mdev
> +   * vfio_mdev_device
> +   * KVM
> +
> +   To build the vfio_ap module, the kernel build must be configured with the
> +   following Kconfig elements selected:
> +   * IOMMU_SUPPORT
> +   * S390
> +   * ZCRYPT
> +   * S390_AP_IOMMU
> +   * VFIO
> +   * VFIO_MDEV
> +   * VFIO_MDEV_DEVICE
> +   * KVM
> +
> +   If using make menuconfig select the following to build the vfio_ap module:
> +   -> Device Drivers
> +  -> IOMMU Hardware Support
> + select S390 AP IOMMU Support
> +  -> VFIO Non-Privileged userspace driver framework
> +  -> Mediated device driver framework
> +-> VFIO driver for Mediated devices
> +   -> I/O subsystem
> +  -> VFIO support for AP devices
> +
> +2. Secure the AP queues to be used by the three guests so that the host can 
> not
> +   access them. To secure them, there are two sysfs files that specify
> +   bitmasks marking a subset of the APQN range as 'usable by the default AP
> +   queue device drivers' or 'not usable by the default device drivers' and 
> thus
> +   available for use by the vfio_ap device driver'. The sysfs files 
> containing
> +   the sysfs locations of the masks are:
> +
> +   /sys/bus/ap/apmask
> +   /sys/bus/ap/aqmask
> +
> +   The 'apmask' is a 256-bit mask that identifies a set of AP adapter IDs
> +   (APID). Each bit in the mask, from most significant to least significant 
> bit,
> +   corresponds to an APID from 0-255. If a bit is set, the APID is marked as
> +   usable only by the default AP queue device drivers; otherwise, the APID is
> +   usable by the vfio_ap device driver.
> +
> +   The 'aqmask' is a 256-bit mask that identifies a set of AP queue indexes
> +   (APQI). Each bit in the mask, from most significant to least significant 
> bit,
> +   corresponds to an APQI from 0-255. If a bit is set, the APQI is marked as
> +   usable only by the default AP queue device drivers; otherwise, the APQI is
> +   usable by the vfio_ap device driver.
> +
> +   The APQN of each AP queue device assigned to the linux host is checked by 
> the
> +   AP bus against the set of APQNs derived from the cross product of APIDs
> +   and APQIs marked as usable only by the default AP queue device drivers. 
> If a
> +   match is detected,  only the default AP queue device drivers will be 
> probed;
> +   otherwise, the vfio_ap device driver will be probed.
> +
> +   By default, the two masks are set to reserve all APQNs for use by the 
> default
> +   AP queue device drivers. There are two ways the default masks can be 
> changed:
> +
> +   1. The masks can be changed at boot time with the kernel command line
> +  like this:
> +
> + ap.apmask=0x ap.aqmask=0x40
> +
> + This would give these two pools:
> +
> +default drivers pool:adapter 0-15, domain 1
> +alternate drivers pool:  adapter 16-255, domains 2-255

What happened to domain 0?  I'm also a little confused by the bit
ordering.  If 0x40 is bit 1 and 0x is bits 0-15, then the least
significant bit is furthest left?  Did I miss documentation of that?

> +
> +   2. The sysfs mask files can also be edited by echoing 
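
   The thread is cut off here, but a hedged sketch of what such an echo
   might look like follows. The mask value is computed per the MSB-first
   bit ordering described above, and a temp file stands in for
   /sys/bus/ap/apmask so the snippet runs unprivileged; the exact accepted
   write formats should be checked against the kernel's vfio-ap
   documentation:

```shell
# Hypothetical sketch: clear bits 5 and 6 (counted from the MSB) in an
# otherwise all-ones 256-bit apmask, handing adapters 5 and 6 to the
# vfio_ap driver.  Top byte 11111001b == 0xf9.
apmask_file=$(mktemp)                  # stand-in for /sys/bus/ap/apmask
mask=0xf9$(printf 'f%.0s' $(seq 62))   # 0xf9 followed by 62 'f' digits
echo "$mask" > "$apmask_file"
cat "$apmask_file"
```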

Re: [PATCH v2] vfio/pci: Mask buggy SR-IOV VF INTx support

2018-09-21 Thread Alex Williamson
On Thu, 20 Sep 2018 22:53:04 -0700
Christoph Hellwig  wrote:

> > +/*
> > + * Nag about hardware bugs, hopefully to have vendors fix them, but at 
> > least
> > + * to collect a list of dependencies for the VF INTx pin quirk below.
> > + */
> > +static const struct pci_device_id known_bogus_vf_intx_pin[] = {
> > +   { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x270c) },
> > +   {}
> > +};  
> 
> What device is this? We don't have the device ID anywhere, so I guess
> it is something match by the class code?

Intel hasn't disclosed the device yet, so we don't know that there's an
existing driver at all for it.  Thanks,

Alex


[PATCH v2] vfio/pci: Mask buggy SR-IOV VF INTx support

2018-09-20 Thread Alex Williamson
The SR-IOV spec requires that VFs must report zero for the INTx pin
register as VFs are precluded from INTx support.  It's much easier for
the host kernel to understand whether a device is a VF and therefore
whether a non-zero pin register value is bogus than it is to do the
same in userspace.  Override the INTx count for such devices and
virtualize the pin register to provide a consistent view of the device
to the user.

As this is clearly a spec violation, warn about it to support hardware
validation, but also provide a known whitelist as it doesn't do much
good to continue complaining if the hardware vendor doesn't plan to
fix it.

Known devices with this issue: 8086:270c

Signed-off-by: Alex Williamson 
---

v2:
Moved the warning to vfio_config_init(), so it triggers on device open and
no longer depends on the user looking at the number of INTx IRQs available.
Also changed from dev_warn_once() to pci_warn() as this new location seems
sufficiently low frequency to nag repeatedly.  Please test.  Thanks,

Alex

 drivers/vfio/pci/vfio_pci.c|8 ++--
 drivers/vfio/pci/vfio_pci_config.c |   27 +++
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index cddb453a1ba5..50cdedfca9fe 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -434,10 +434,14 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device 
*vdev, int irq_type)
 {
if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
u8 pin;
+
+   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
+   vdev->nointx || vdev->pdev->is_virtfn)
+   return 0;
+
pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
-   if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && !vdev->nointx && pin)
-   return 1;
 
+   return pin ? 1 : 0;
} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
u8 pos;
u16 flags;
diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 62023b4a373b..423ea1f98441 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1611,6 +1611,15 @@ static int vfio_ecap_init(struct vfio_pci_device *vdev)
return 0;
 }
 
+/*
+ * Nag about hardware bugs, hopefully to have vendors fix them, but at least
+ * to collect a list of dependencies for the VF INTx pin quirk below.
+ */
+static const struct pci_device_id known_bogus_vf_intx_pin[] = {
+   { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x270c) },
+   {}
+};
+
 /*
  * For each device we allocate a pci_config_map that indicates the
  * capability occupying each dword and thus the struct perm_bits we
@@ -1676,6 +1685,24 @@ int vfio_config_init(struct vfio_pci_device *vdev)
if (pdev->is_virtfn) {
*(__le16 *)&vconfig[PCI_VENDOR_ID] = cpu_to_le16(pdev->vendor);
*(__le16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
+
+   /*
+* Per SR-IOV spec rev 1.1, 3.4.1.18 the interrupt pin register
+* does not apply to VFs and VFs must implement this register
+* as read-only with value zero.  Userspace is not readily able
+* to identify whether a device is a VF and thus that the pin
+* definition on the device is bogus should it violate this
+* requirement.  We already virtualize the pin register for
+* other purposes, so we simply need to replace the bogus value
+* and consider VFs when we determine INTx IRQ count.
+*/
+   if (vconfig[PCI_INTERRUPT_PIN] &&
+   !pci_match_id(known_bogus_vf_intx_pin, pdev))
+   pci_warn(pdev,
+"Hardware bug: VF reports bogus INTx pin %d\n",
+vconfig[PCI_INTERRUPT_PIN]);
+
+   vconfig[PCI_INTERRUPT_PIN] = 0; /* Gratuitous for good VFs */
}
 
if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)



Re: [PATCH v2 1/2] vfio: add edid api for display (vgpu) devices.

2018-09-19 Thread Alex Williamson
On Tue, 18 Sep 2018 15:38:12 +0200
Gerd Hoffmann  wrote:

No empty commit logs please.  There must be something to say about the
goal or motivation beyond the subject.

> Signed-off-by: Gerd Hoffmann 
> ---
>  include/uapi/linux/vfio.h | 39 +++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82e81..78e5a37d83 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,45 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> +#define VFIO_REGION_TYPE_PCI_GFX(1)

nit, what's the PCI dependency?

> +#define VFIO_REGION_SUBTYPE_GFX_EDID(1)
> +
> +/**
> + * Set display link state and edid blob.
> + *
> + * For the edid blob spec look here:
> + * https://en.wikipedia.org/wiki/Extended_Display_Identification_Data
> + *
> + * The guest should be notified about edid changes, for example by
> + * setting the link status to down temporarily (emulate monitor
> + * hotplug).

Who is responsible for this notification, the user interacting with
this region or the driver providing the region when a new edid is
provided?  This comment needs to state the expected API as clearly as
possible.

> + *
> + * @link_state:
> + * VFIO_DEVICE_GFX_LINK_STATE_UP: Monitor is turned on.
> + * VFIO_DEVICE_GFX_LINK_STATE_DOWN: Monitor is turned off.
> + *
> + * @edid_size: Size of the edid data blob.
> + * @edid_blob: The actual edid data.

What signals that the user edid_blob update is complete?  Should the
size be written before or after the blob?  Is the user required to
update the entire blob in a single write or can it be written
incrementally?

It might also be worth defining access widths, I see that you use
memcpy to support any width in mbochs, but we could define only native
field accesses for discrete registers if it makes the implementation
easier.

> + *
> + * Returns 0 on success, error code (such as -EINVAL) on failure.

Left over from ioctl.

> + */
> +struct vfio_region_gfx_edid {
> + /* device capability hints (read only) */
> + __u32 max_xres;
> + __u32 max_yres;
> + __u32 __reserved1[6];

Is the plan to use the version field within vfio_info_cap_header to
make use of these reserved fields later, ie. version 2 might define a
field from this reserved block?

> +
> + /* device state (read/write) */
> + __u32 link_state;
> +#define VFIO_DEVICE_GFX_LINK_STATE_UP1
> +#define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
> + __u32 edid_size;
> + __u32 __reserved2[6];
> +
> + /* edid blob (read/write) */
> + __u8 edid_blob[512];

It seems the placement of this blob is what makes us feel like we need
to introduce reserved fields for later use, but we could instead define
an edid_offset read-only field so that the blob is always at the end of
whatever discrete fields we define.  Perhaps then we wouldn't even need
a read-only vs read-write section, simply define it per virtual
register.

Overall, I prefer this approach rather than adding yet more ioctls for
every feature and extension we add, thanks for implementing it.  What's
your impression vs ioctls?  Thanks,

Alex


Re: [PATCH 1/2] vfio: add edid api for display (vgpu) devices.

2018-09-14 Thread Alex Williamson
On Fri, 14 Sep 2018 14:25:52 +0200
Gerd Hoffmann  wrote:

>   Hi,
> 
> > Another possible implementation would be via a vfio region, we already
> > support device specific regions via capabilities with vfio_region_info,
> > so we could have an edid region which could handle both input and
> > output using a defined structure and protocol within the region.  With
> > an edid blob of up to 512 bytes now, that means the vendor driver would
> > need to buffer writes to that section of the region until some sort of
> > activation, possibly using another "register" within the field to
> > trigger the link state and only processing the edid blob on link down
> > to link up transition.  
> 
> Hmm, using a virtual register space makes things more complicated for no
> good reason.  This is a configuration interface for qemu, not something
> we expose to the guest.  So, I'm not a fan ...

Ok, I'm curious what makes it more complicated though and why it
matters that it's for QEMU vs exposed to the guest.  From the user
perspective, it's just a few pwrites rather than an ioctl.  If we use
the ioctl, I think we still need to improve the protocol, it's
confusing that the user can specify both an EDID and a link state.
Does the user need to toggle the link state when setting an EDID or
does that happen automatically?  How are link state vs EDID specified?
Simply link_state=0 and edid_size=0 is reserved as not provided?

> New revision of the vfio.h patch attached below, how does that look
> like?  If it is ok I'll go continue with that next week (more verbose
> docs, update qemu code and test, ...)

Yes, modulo the ioctl protocol questions above.  Thanks,

Alex

> From 818f2ea4dd756d28908e58a32a2fdd0d197a28da Mon Sep 17 00:00:00 2001
> From: Gerd Hoffmann 
> Date: Thu, 6 Sep 2018 16:17:17 +0200
> Subject: [PATCH] vfio: add edid api for display (vgpu) devices.
> 
> Signed-off-by: Gerd Hoffmann 
> ---
>  include/uapi/linux/vfio.h | 48 
> +++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82e81..901f279033 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -200,12 +200,25 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)  /* vfio-platform device */
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3) /* vfio-amba device */
>  #define VFIO_DEVICE_FLAGS_CCW(1 << 4)/* vfio-ccw device */
> +#define VFIO_DEVICE_FLAGS_CAPS   (1 << 5)/* cap_offset present */
>   __u32   num_regions;/* Max region index + 1 */
>   __u32   num_irqs;   /* Max IRQ index + 1 */
> + __u32   cap_offset; /* Offset within info struct of first cap */
>  };
>  #define VFIO_DEVICE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 7)
>  
>  /*
> + * FIXME: docs ...
> + */
> +#define VFIO_DEVICE_INFO_CAP_EDID  1
> +
> +struct vfio_device_info_edid_cap {
> + struct vfio_info_cap_header header;
> + __u32   max_x; /* Max display width (zero == no limit) */
> + __u32   max_y; /* Max display height (zero == no limit) */
> +};
> +
> +/*
>   * Vendor driver using Mediated device framework should provide device_api
>   * attribute in supported type attribute groups. Device API string should be 
> one
>   * of the following corresponding to device flags in vfio_device_info 
> structure.
> @@ -602,6 +615,41 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_DEVICE_SET_GFX_EDID - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> + * struct vfio_device_set_gfx_edid)
> + *
> + * Set display link state and edid blob (for UP state).
> + *
> + * For the edid blob spec look here:
> + * https://en.wikipedia.org/wiki/Extended_Display_Identification_Data
> + *
> + * The guest should be notified about edid changes, for example by
> + * setting the link status to down temporarily (emulate monitor
> + * hotplug).
> + *
> + * @link_state:
> + * VFIO_DEVICE_GFX_LINK_STATE_UP: Monitor is turned on.
> + * VFIO_DEVICE_GFX_LINK_STATE_DOWN: Monitor is turned off.
> + *
> + * @edid_size: Size of the edid data blob.
> + * @edid_blob: The actual edid data.
> + *
> + * Returns 0 on success, error code (such as -EINVAL) on failure.
> + */
> +struct vfio_device_set_gfx_edid {
> + __u32 argsz;
> + __u32 flags;
> + /* in */
> + __u32 link_state;
> +#define VFIO_DEVICE_GFX_LINK_STATE_UP  1
> +#define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
> + __u32 edid_size;
> + __u8  edid_blob[512];
> +};
> +
> +#define VFIO_DEVICE_SET_GFX_EDID _IO(VFIO_TYPE, VFIO_BASE + 17)
> +
>  /*  API for Type1 VFIO IOMMU  */
>  
>  /**



Re: [PATCH 1/2] vfio: add edid api for display (vgpu) devices.

2018-09-13 Thread Alex Williamson
On Thu, 13 Sep 2018 07:47:44 +0200
Gerd Hoffmann  wrote:

Some sort of commit log indicating the motivation for the change is
always appreciated.

> Signed-off-by: Gerd Hoffmann 
> ---
>  include/uapi/linux/vfio.h | 38 ++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82e81..38b591e909 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -200,8 +200,11 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)  /* vfio-platform device */
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3) /* vfio-amba device */
>  #define VFIO_DEVICE_FLAGS_CCW(1 << 4)/* vfio-ccw device */
> +#define VFIO_DEVICE_FLAGS_EDID   (1 << 5)/* Device supports edid */
>   __u32   num_regions;/* Max region index + 1 */
>   __u32   num_irqs;   /* Max IRQ index + 1 */
> + __u32   edid_max_x; /* Max display width (zero == no limit) */
> + __u32   edid_max_y; /* Max display height (zero == no limit) */
>  };

Hmm, not really what I was looking for, devices providing these values
are only a very small subset of devices supported by vfio, so I was
thinking a new flag bit would indicate the presence of a new __u32
cap_offset field and we'd define a capability something like:

struct vfio_device_info_edid_cap {
struct vfio_info_cap_header header;
__u32 edid_max_x;
__u32 edid_max_y;
};

Therefore the capability is a generic expansion and the user would look
for this specific edid capability within that.  The protocol would be
as we have today with region info where a call using the base
vfio_device_info would return success regardless of argsz, but indicate
capabilities are supported and return in argsz the size necessary to
receive them.

Another possible implementation would be via a vfio region, we already
support device specific regions via capabilities with vfio_region_info,
so we could have an edid region which could handle both input and
output using a defined structure and protocol within the region.  With
an edid blob of up to 512 bytes now, that means the vendor driver would
need to buffer writes to that section of the region until some sort of
activation, possibly using another "register" within the field to
trigger the link state and only processing the edid blob on link down
to link up transition.  So the virtual register space of the region
might look like

struct vfio_device_edid_region {
__u32 max_x;/* read-only */
__u32 max_y;/* read-only */
__u32 link_state;   /* read-write, 0=down, 1=up */
/* edid blob processed on 0->1 */
__u32 blob_size;/* read-write */
/* max size = region_size - end of blob_size */
__u8 blob[];
};

This is sort of the "we're defining our own hardware, so why not use a
region as virtual register space for the device rather than throw a new
ioctl at everything" approach.  Thoughts?  Thanks,

Alex

>  #define VFIO_DEVICE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 7)
>  
> @@ -602,6 +605,41 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_DEVICE_SET_GFX_EDID - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> + * struct vfio_device_set_gfx_edid)
> + *
> + * Set display link state and edid blob (for UP state).
> + *
> + * For the edid blob spec look here:
> + * https://en.wikipedia.org/wiki/Extended_Display_Identification_Data
> + *
> + * The guest should be notified about edid changes, for example by
> + * setting the link status to down temporarily (emulate monitor
> + * hotplug).
> + *
> + * @link_state:
> + * VFIO_DEVICE_GFX_LINK_STATE_UP: Monitor is turned on.
> + * VFIO_DEVICE_GFX_LINK_STATE_DOWN: Monitor is turned off.
> + *
> + * @edid_size: Size of the edid data blob.
> + * @edid_blob: The actual edid data.
> + *
> + * Returns 0 on success, error code (such as -EINVAL) on failure.
> + */
> +struct vfio_device_set_gfx_edid {
> + __u32 argsz;
> + __u32 flags;
> + /* in */
> + __u32 link_state;
> +#define VFIO_DEVICE_GFX_LINK_STATE_UP  1
> +#define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
> + __u32 edid_size;
> + __u8  edid_blob[512];
> +};
> +
> +#define VFIO_DEVICE_SET_GFX_EDID _IO(VFIO_TYPE, VFIO_BASE + 17)
> +
>  /*  API for Type1 VFIO IOMMU  */
>  
>  /**



Re: [PATCH] vfio: fix potential memory leak in vfio_msi_cap_len

2018-09-04 Thread Alex Williamson
On Mon, 27 Aug 2018 05:47:21 -0700
Li Qiang  wrote:

> Free the vdev->msi_perm in error path.
> 
> Signed-off-by: Li Qiang 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index 115a36f6f403..62023b4a373b 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1180,8 +1180,10 @@ static int vfio_msi_cap_len(struct vfio_pci_device 
> *vdev, u8 pos)
>   return -ENOMEM;
>  
>   ret = init_pci_cap_msi_perm(vdev->msi_perm, len, flags);
> - if (ret)
> + if (ret) {
> + kfree(vdev->msi_perm);
>   return ret;
> + }
>  
>   return len;
>  }

Fix looks correct to me, I'll queue this for v4.20 with Eric's R-b.
Thanks,

Alex


[GIT PULL] VFIO updates for v4.19-rc1

2018-08-16 Thread Alex Williamson
Hi Linus,

The following changes since commit 1ffaddd029c867d134a1dde39f540dcc8c52e274:

  Linux 4.18-rc8 (2018-08-05 12:37:41 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.19-rc1

for you to fetch changes up to 0dd0e297f0ec780b6b3484ba38b27d18c8ca7af9:

  vfio-pci: Disable binding to PFs with SR-IOV enabled (2018-08-06 12:23:19 
-0600)


VFIO updates for v4.19

 - Mark switch fall throughs (Gustavo A. R. Silva)

 - Disable binding SR-IOV enabled PFs (Alex Williamson)


Alex Williamson (1):
  vfio-pci: Disable binding to PFs with SR-IOV enabled

Gustavo A. R. Silva (1):
  vfio: Mark expected switch fall-throughs

 drivers/vfio/pci/vfio_pci.c | 15 ++-
 drivers/vfio/vfio_iommu_type1.c |  1 +
 2 files changed, 15 insertions(+), 1 deletion(-)


[PATCH v4 2/3] PCI: Samsung SM961/PM961 NVMe disable before FLR quirk

2018-08-09 Thread Alex Williamson
The Samsung SM961/PM961 (960 EVO) sometimes fails to return from FLR
with the PCI config space reading back as -1.  A reproducible instance
of this behavior is resolved by clearing the enable bit in the NVMe
configuration register and waiting for the ready status to clear
(disabling the NVMe controller) prior to FLR.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1542494
Signed-off-by: Alex Williamson 
---
 drivers/pci/quirks.c |   83 ++
 1 file changed, 83 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e72c8742aafa..0a4d802cb307 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include/* isa_dma_bridge_buggy */
 #include "pci.h"
 
@@ -3669,6 +3670,87 @@ static int reset_chelsio_generic_dev(struct pci_dev 
*dev, int probe)
 #define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
 #define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
 
+/*
+ * The Samsung SM961/PM961 controller can sometimes enter a fatal state after
+ * FLR where config space reads from the device return -1.  We seem to be
+ * able to avoid this condition if we disable the NVMe controller prior to
+ * FLR.  This quirk is generic for any NVMe class device requiring similar
+ * assistance to quiesce the device prior to FLR.
+ *
+ * NVMe specification: https://nvmexpress.org/resources/specifications/
+ * Revision 1.0e:
+ *Chapter 2: Required and optional PCI config registers
+ *Chapter 3: NVMe control registers
+ *Chapter 7.3: Reset behavior
+ */
+static int nvme_disable_and_flr(struct pci_dev *dev, int probe)
+{
+   void __iomem *bar;
+   u16 cmd;
+   u32 cfg;
+
+   if (dev->class != PCI_CLASS_STORAGE_EXPRESS ||
+   !pcie_has_flr(dev) || !pci_resource_start(dev, 0))
+   return -ENOTTY;
+
+   if (probe)
+   return 0;
+
+   bar = pci_iomap(dev, 0, NVME_REG_CC + sizeof(cfg));
+   if (!bar)
+   return -ENOTTY;
+
+   pci_read_config_word(dev, PCI_COMMAND, &cmd);
+   pci_write_config_word(dev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
+
+   cfg = readl(bar + NVME_REG_CC);
+
+   /* Disable controller if enabled */
+   if (cfg & NVME_CC_ENABLE) {
+   u32 cap = readl(bar + NVME_REG_CAP);
+   unsigned long timeout;
+
+   /*
+* Per nvme_disable_ctrl() skip shutdown notification as it
+* could complete commands to the admin queue.  We only intend
+* to quiesce the device before reset.
+*/
+   cfg &= ~(NVME_CC_SHN_MASK | NVME_CC_ENABLE);
+
+   writel(cfg, bar + NVME_REG_CC);
+
+   /*
+* Some controllers require an additional delay here, see
+* NVME_QUIRK_DELAY_BEFORE_CHK_RDY.  None of those are yet
+* supported by this quirk.
+*/
+
+   /* Cap register provides max timeout in 500ms increments */
+   timeout = ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
+
+   for (;;) {
+   u32 status = readl(bar + NVME_REG_CSTS);
+
+   /* Ready status becomes zero on disable complete */
+   if (!(status & NVME_CSTS_RDY))
+   break;
+
+   msleep(100);
+
+   if (time_after(jiffies, timeout)) {
+   pci_warn(dev, "Timeout waiting for NVMe ready 
status to clear after disable\n");
+   break;
+   }
+   }
+   }
+
+   pci_iounmap(dev, bar);
+
+   pcie_flr(dev);
+
+   return 0;
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 reset_intel_82599_sfp_virtfn },
@@ -3676,6 +3758,7 @@ static const struct pci_dev_reset_methods 
pci_dev_reset_methods[] = {
reset_ivb_igd },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
reset_ivb_igd },
+   { PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
reset_chelsio_generic_dev },
{ 0 }



[PATCH v4 3/3] PCI: Intel DC P3700 NVMe delay after FLR quirk

2018-08-09 Thread Alex Williamson
Add a device specific reset for Intel DC P3700 NVMe device which
exhibits a timeout failure in drivers waiting for the ready status to
update after NVMe enable if the driver interacts with the device too
quickly after FLR.  As this has been observed in device assignment
scenarios, resolve this with a device specific reset quirk to add an
additional, heuristically determined, delay after the FLR completes.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1592654
Signed-off-by: Alex Williamson 
---
 drivers/pci/quirks.c |   22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 0a4d802cb307..93791bd31a55 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3751,6 +3751,27 @@ static int nvme_disable_and_flr(struct pci_dev *dev, int 
probe)
return 0;
 }
 
+/*
+ * Intel DC P3700 NVMe controller will timeout waiting for ready status
+ * to change after NVMe enable if the driver starts interacting with the
+ * device too quickly after FLR.  A 250ms delay after FLR has heuristically
+ * proven to produce reliably working results for device assignment cases.
+ */
+static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
+{
+   if (!pcie_has_flr(dev))
+   return -ENOTTY;
+
+   if (probe)
+   return 0;
+
+   pcie_flr(dev);
+
+   msleep(250);
+
+   return 0;
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 reset_intel_82599_sfp_virtfn },
@@ -3759,6 +3780,7 @@ static const struct pci_dev_reset_methods 
pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
reset_ivb_igd },
{ PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
+   { PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
reset_chelsio_generic_dev },
{ 0 }



[PATCH v4 1/3] PCI: Export pcie_has_flr()

2018-08-09 Thread Alex Williamson
pcie_flr() suggests pcie_has_flr() to ensure that PCIe FLR support is
present prior to calling.  pcie_flr() is exported while pcie_has_flr()
is not.  Resolve this.

Signed-off-by: Alex Williamson 
---
 drivers/pci/pci.c   |3 ++-
 include/linux/pci.h |1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2bec76c9d9a7..52fe2d72a99c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4071,7 +4071,7 @@ static int pci_dev_wait(struct pci_dev *dev, char 
*reset_type, int timeout)
  * Returns true if the device advertises support for PCIe function level
  * resets.
  */
-static bool pcie_has_flr(struct pci_dev *dev)
+bool pcie_has_flr(struct pci_dev *dev)
 {
u32 cap;
 
@@ -4081,6 +4081,7 @@ static bool pcie_has_flr(struct pci_dev *dev)
pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
return cap & PCI_EXP_DEVCAP_FLR;
 }
+EXPORT_SYMBOL_GPL(pcie_has_flr);
 
 /**
  * pcie_flr - initiate a PCIe function level reset
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 04c7ea6ed67b..bbe030d7814f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1092,6 +1092,7 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct 
pci_dev **limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width);
 void pcie_print_link_status(struct pci_dev *dev);
+bool pcie_has_flr(struct pci_dev *dev);
 int pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);



[PATCH v4 0/3] PCI: NVMe reset quirks

2018-08-09 Thread Alex Williamson
v4: Fix 0-day i386 build error for readq, simply use readl
instead, the bits we're interested in are 24:31 and the NVMe
spec indicates that smaller, aligned accesses are allowed.
Update bz links for both device specific resets.

v3: Separate quirks, only for the afflicted devices

v2: Add bug link, use Samsung vendor ID, add spec references

Fix two different NVMe device reset issues with device specific
quirks.  The Samsung controller in patch 2 sometimes doesn't like
being reset while enabled, so disable the NVMe controller prior to
FLR.  This quirk is generic to all NVMe class devices, though I've
dropped the additional delay some devices require between disabling
and checking ready status.  This can be added later should any of
those devices need this quirk.  The Intel controller quirk is now
just a simple delay after FLR, which clearly any device needing
similar behavior can also use.  Thanks,

Alex

---

Alex Williamson (3):
  PCI: Export pcie_has_flr()
  PCI: Samsung SM961/PM961 NVMe disable before FLR quirk
  PCI: Intel DC P3700 NVMe delay after FLR quirk


 drivers/pci/pci.c|3 +
 drivers/pci/quirks.c |  105 ++
 include/linux/pci.h  |1 
 3 files changed, 108 insertions(+), 1 deletion(-)


Re: [PATCH] vfio: Mark expected switch fall-throughs

2018-08-07 Thread Alex Williamson
On Mon, 9 Jul 2018 17:53:09 -0500
"Gustavo A. R. Silva"  wrote:

> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/vfio/pci/vfio_pci.c | 2 +-
>  drivers/vfio/vfio_iommu_type1.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)

Applied to vfio next branch for v4.19.  Thanks,

Alex


Re: [RFC PATCH] vfio/pci: map prefetchble bars as writecombine

2018-08-01 Thread Alex Williamson
On Wed, 1 Aug 2018 23:28:53 +0530
Srinath Mannam  wrote:

> Hi Alex,
> 
> In user space UIO driver (DPDK) implementation, sysfs interface
> "/sys/devices/pci/.../resource0_wc" is used to map prefetchable PCI
> resources as WC.
> Platforms which support write-combining mappings of PCI resources return
> true from arch_can_pci_mmap_wc(), which allows mapping resources as WC.
> In this approach mmap calls the pci_mmap_resource_range() kernel function
> with the write_combine parameter set.
> "drivers/pci/pci-sysfs.c" kernel file has this implementation.
> 
> If this approach fits the vfio driver, then the code changes in the vfio
> driver are
> 
>  if (arch_can_pci_mmap_wc() &&
>  (pci_resource_flags(pdev, index) & IORESOURCE_PREFETCH))
> vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> else
> vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> 
> Please provide your feedback.

Let me see if I've got this straight, UIO (in reality pci-sysfs)
provides a separate interface such that a driver can *choose* to get
either a UC or, at their discretion, a WC mapping to a region, and
you're using that as justification that vfio-pci should arbitrarily
convert all existing users from UC mappings to WC mappings, without
their consent, any time the architecture supports WC mappings.  The fact
that pci-sysfs provides separate interfaces such that drives can choose
their preferred mapping attributes only reaffirms to me that this needs
to be a driver decision.  How would we ever validate that a change like
you're proposing above would not introduce regressions for existing
users?  Thanks,

Alex


Re: [PATCH v3 2/3] PCI: Samsung SM961/PM961 NVMe disable before FLR quirk

2018-07-24 Thread Alex Williamson
On Wed, 25 Jul 2018 04:53:18 +0900
Minwoo Im  wrote:

> Hi Alex,
> 
> On Tue, 2018-07-24 at 10:14 -0600, Alex Williamson wrote:
> > The Samsung SM961/PM961 (960 EVO) sometimes fails to return from FLR
> > with the PCI config space reading back as -1.  A reproducible instance
> > of this behavior is resolved by clearing the enable bit in the NVMe
> > configuration register and waiting for the ready status to clear
> > (disabling the NVMe controller) prior to FLR.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/pci/quirks.c |   83
> > ++
> >  1 file changed, 83 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index e72c8742aafa..3899cdd2514b 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -28,6 +28,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include/* isa_dma_bridge_buggy */
> >  #include "pci.h"
> >  
> > @@ -3669,6 +3670,87 @@ static int reset_chelsio_generic_dev(struct pci_dev
> > *dev, int probe)
> >  #define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
> >  #define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
> >  
> > +/*
> > + * The Samsung SM961/PM961 controller can sometimes enter a fatal state 
> > after
> > + * FLR where config space reads from the device return -1.  We seem to be
> > + * able to avoid this condition if we disable the NVMe controller prior to
> > + * FLR.  This quirk is generic for any NVMe class device requiring similar
> > + * assistance to quiesce the device prior to FLR.
> > + *
> > + * NVMe specification: https://nvmexpress.org/resources/specifications/
> > + * Revision 1.0e:  
> 
> That seems like a rather old version of the NVMe specification.  Do you
> have any special reason to reference the 1.0 version instead of 1.3 or
> something newer?

I wanted to show that I'm using NVMe features that have been available
since the initial release and there's no reason to check the version
field for their support.

> > + *Chapter 2: Required and optional PCI config registers
> > + *Chapter 3: NVMe control registers
> > + *Chapter 7.3: Reset behavior
> > + */
> > +static int nvme_disable_and_flr(struct pci_dev *dev, int probe)  
> 
> The name of this function could start with a 'reset_' prefix, just
> like the other reset quirks.
> What about reset_samsung_pm961 or something?

I'm fine with any generic prefix, but I'm not fine with obfuscating the
purpose of the function behind a vendor/device specific name.  If
someone else comes along needing this same functionality, they'll
probably be reluctant to even look at what a "reset_samsung_sm961"
function does.  If they do, they might still be reluctant to reuse
something so obviously made for a specific device.  I thought this was
pretty descriptive of what it's doing.  Prefixing with 'reset_' is a
tad redundant.

> > +{
> > +   void __iomem *bar;
> > +   u16 cmd;
> > +   u32 cfg;
> > +
> > +   if (dev->class != PCI_CLASS_STORAGE_EXPRESS ||
> > +   !pcie_has_flr(dev) || !pci_resource_start(dev, 0))
> > +   return -ENOTTY;
> > +
> > +   if (probe)
> > +   return 0;
> > +
> > +   bar = pci_iomap(dev, 0, NVME_REG_CC + sizeof(cfg));
> > +   if (!bar)
> > +   return -ENOTTY;
> > +
> > +   pci_read_config_word(dev, PCI_COMMAND, &cmd);
> > +   pci_write_config_word(dev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
> > +
> > +   cfg = readl(bar + NVME_REG_CC);
> > +
> > +   /* Disable controller if enabled */
> > +   if (cfg & NVME_CC_ENABLE) {
> > +   u64 cap = readq(bar + NVME_REG_CAP);
> > +   unsigned long timeout;
> > +
> > +   /*
> > +    * Per nvme_disable_ctrl() skip shutdown notification as it
> > +    * could complete commands to the admin queue.  We only
> > intend
> > +    * to quiesce the device before reset.
> > +    */
> > +   cfg &= ~(NVME_CC_SHN_MASK | NVME_CC_ENABLE);
> > +
> > +   writel(cfg, bar + NVME_REG_CC);
> > +
> > +   /*
> > +    * Some controllers require an additional delay here, see
> > +    * NVME_QUIRK_DELAY_BEFORE_CHK_RDY.  None of those are yet
> > +    * supported by this quirk.
> > +    */
> > +
> > +   /* Cap register provides max time

Re: [PATCH v3 3/3] PCI: Intel DC P3700 NVMe delay after FLR quirk

2018-07-24 Thread Alex Williamson
On Tue, 24 Jul 2018 10:14:46 -0600
Alex Williamson  wrote:

> Add a device specific reset for Intel DC P3700 NVMe device which
> exhibits a timeout failure in drivers waiting for the ready status to
> update after NVMe enable if the driver interacts with the device too
> quickly after FLR.  As this has been observed in device assignment
> scenarios, resolve this with a device specific reset quirk to add an
> additional, heuristically determined, delay after the FLR completes.
> 
> Signed-off-by: Alex Williamson 
> ---
>  drivers/pci/quirks.c |   22 ++
>  1 file changed, 22 insertions(+)

I forgot to link the bz in this one; if this somehow becomes the final
version, please add:

Link: https://bugzilla.redhat.com/show_bug.cgi?id=159265

Thanks,
Alex
 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 3899cdd2514b..08fafd804588 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3751,6 +3751,27 @@ static int nvme_disable_and_flr(struct pci_dev *dev, 
> int probe)
>   return 0;
>  }
>  
> +/*
> + * Intel DC P3700 NVMe controller will timeout waiting for ready status
> + * to change after NVMe enable if the driver starts interacting with the
> + * device too quickly after FLR.  A 250ms delay after FLR has heuristically
> + * proven to produce reliably working results for device assignment cases.
> + */
> +static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
> +{
> + if (!pcie_has_flr(dev))
> + return -ENOTTY;
> +
> + if (probe)
> + return 0;
> +
> + pcie_flr(dev);
> +
> + msleep(250);
> +
> + return 0;
> +}
> +
>  static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
>reset_intel_82599_sfp_virtfn },
> @@ -3759,6 +3780,7 @@ static const struct pci_dev_reset_methods 
> pci_dev_reset_methods[] = {
>   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
>   reset_ivb_igd },
>   { PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
> + { PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
>   { PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
>   reset_chelsio_generic_dev },
>   { 0 }
> 



[PATCH v3 0/3] PCI: NVMe reset quirks

2018-07-24 Thread Alex Williamson
v3: Separate quirks, only for the afflicted devices

v2: Add bug link, use Samsung vendor ID, add spec references

Fix two different NVMe device reset issues with device specific
quirks.  The Samsung controller in patch 2 sometimes doesn't like
being reset while enabled, so disable the NVMe controller prior to
FLR.  This quirk is generic to all NVMe class devices, though I've
dropped the additional delay some devices require between disabling
and checking ready status.  This can be added later should any of
those devices need this quirk.  The Intel controller quirk is now
just a simple delay after FLR, which clearly any device needing
similar behavior can also use.  Thanks,

Alex

---

Alex Williamson (3):
  PCI: Export pcie_has_flr()
  PCI: Samsung SM961/PM961 NVMe disable before FLR quirk
  PCI: Intel DC P3700 NVMe delay after FLR quirk


 drivers/pci/pci.c|3 +
 drivers/pci/quirks.c |  105 ++
 include/linux/pci.h  |1 
 3 files changed, 108 insertions(+), 1 deletion(-)


[PATCH v3 2/3] PCI: Samsung SM961/PM961 NVMe disable before FLR quirk

2018-07-24 Thread Alex Williamson
The Samsung SM961/PM961 (960 EVO) sometimes fails to return from FLR
with the PCI config space reading back as -1.  A reproducible instance
of this behavior is resolved by clearing the enable bit in the NVMe
configuration register and waiting for the ready status to clear
(disabling the NVMe controller) prior to FLR.

Signed-off-by: Alex Williamson 
---
 drivers/pci/quirks.c |   83 ++
 1 file changed, 83 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e72c8742aafa..3899cdd2514b 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include/* isa_dma_bridge_buggy */
 #include "pci.h"
 
@@ -3669,6 +3670,87 @@ static int reset_chelsio_generic_dev(struct pci_dev 
*dev, int probe)
 #define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
 #define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
 
+/*
+ * The Samsung SM961/PM961 controller can sometimes enter a fatal state after
+ * FLR where config space reads from the device return -1.  We seem to be
+ * able to avoid this condition if we disable the NVMe controller prior to
+ * FLR.  This quirk is generic for any NVMe class device requiring similar
+ * assistance to quiesce the device prior to FLR.
+ *
+ * NVMe specification: https://nvmexpress.org/resources/specifications/
+ * Revision 1.0e:
+ *Chapter 2: Required and optional PCI config registers
+ *Chapter 3: NVMe control registers
+ *Chapter 7.3: Reset behavior
+ */
+static int nvme_disable_and_flr(struct pci_dev *dev, int probe)
+{
+   void __iomem *bar;
+   u16 cmd;
+   u32 cfg;
+
+   if (dev->class != PCI_CLASS_STORAGE_EXPRESS ||
+   !pcie_has_flr(dev) || !pci_resource_start(dev, 0))
+   return -ENOTTY;
+
+   if (probe)
+   return 0;
+
+   bar = pci_iomap(dev, 0, NVME_REG_CC + sizeof(cfg));
+   if (!bar)
+   return -ENOTTY;
+
+   pci_read_config_word(dev, PCI_COMMAND, &cmd);
+   pci_write_config_word(dev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
+
+   cfg = readl(bar + NVME_REG_CC);
+
+   /* Disable controller if enabled */
+   if (cfg & NVME_CC_ENABLE) {
+   u64 cap = readq(bar + NVME_REG_CAP);
+   unsigned long timeout;
+
+   /*
+* Per nvme_disable_ctrl() skip shutdown notification as it
+* could complete commands to the admin queue.  We only intend
+* to quiesce the device before reset.
+*/
+   cfg &= ~(NVME_CC_SHN_MASK | NVME_CC_ENABLE);
+
+   writel(cfg, bar + NVME_REG_CC);
+
+   /*
+* Some controllers require an additional delay here, see
+* NVME_QUIRK_DELAY_BEFORE_CHK_RDY.  None of those are yet
+* supported by this quirk.
+*/
+
+   /* Cap register provides max timeout in 500ms increments */
+   timeout = ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
+
+   for (;;) {
+   u32 status = readl(bar + NVME_REG_CSTS);
+
+   /* Ready status becomes zero on disable complete */
+   if (!(status & NVME_CSTS_RDY))
+   break;
+
+   msleep(100);
+
+   if (time_after(jiffies, timeout)) {
+   pci_warn(dev, "Timeout waiting for NVMe ready 
status to clear after disable\n");
+   break;
+   }
+   }
+   }
+
+   pci_iounmap(dev, bar);
+
+   pcie_flr(dev);
+
+   return 0;
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 reset_intel_82599_sfp_virtfn },
@@ -3676,6 +3758,7 @@ static const struct pci_dev_reset_methods 
pci_dev_reset_methods[] = {
reset_ivb_igd },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
reset_ivb_igd },
+   { PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
reset_chelsio_generic_dev },
{ 0 }



[PATCH v3 3/3] PCI: Intel DC P3700 NVMe delay after FLR quirk

2018-07-24 Thread Alex Williamson
Add a device specific reset for Intel DC P3700 NVMe device which
exhibits a timeout failure in drivers waiting for the ready status to
update after NVMe enable if the driver interacts with the device too
quickly after FLR.  As this has been observed in device assignment
scenarios, resolve this with a device specific reset quirk to add an
additional, heuristically determined, delay after the FLR completes.

Signed-off-by: Alex Williamson 
---
 drivers/pci/quirks.c |   22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 3899cdd2514b..08fafd804588 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3751,6 +3751,27 @@ static int nvme_disable_and_flr(struct pci_dev *dev, int 
probe)
return 0;
 }
 
+/*
+ * Intel DC P3700 NVMe controller will timeout waiting for ready status
+ * to change after NVMe enable if the driver starts interacting with the
+ * device too quickly after FLR.  A 250ms delay after FLR has heuristically
+ * proven to produce reliably working results for device assignment cases.
+ */
+static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
+{
+   if (!pcie_has_flr(dev))
+   return -ENOTTY;
+
+   if (probe)
+   return 0;
+
+   pcie_flr(dev);
+
+   msleep(250);
+
+   return 0;
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 reset_intel_82599_sfp_virtfn },
@@ -3759,6 +3780,7 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
reset_ivb_igd },
{ PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
+   { PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
reset_chelsio_generic_dev },
{ 0 }



[PATCH v3 1/3] PCI: Export pcie_has_flr()

2018-07-24 Thread Alex Williamson
pcie_flr() suggests pcie_has_flr() to ensure that PCIe FLR support is
present prior to calling.  pcie_flr() is exported while pcie_has_flr()
is not.  Resolve this.

Signed-off-by: Alex Williamson 
---
 drivers/pci/pci.c   |3 ++-
 include/linux/pci.h |1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2bec76c9d9a7..52fe2d72a99c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4071,7 +4071,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
  * Returns true if the device advertises support for PCIe function level
  * resets.
  */
-static bool pcie_has_flr(struct pci_dev *dev)
+bool pcie_has_flr(struct pci_dev *dev)
 {
u32 cap;
 
@@ -4081,6 +4081,7 @@ static bool pcie_has_flr(struct pci_dev *dev)
pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
return cap & PCI_EXP_DEVCAP_FLR;
 }
+EXPORT_SYMBOL_GPL(pcie_has_flr);
 
 /**
  * pcie_flr - initiate a PCIe function level reset
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 04c7ea6ed67b..bbe030d7814f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1092,6 +1092,7 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width);
 void pcie_print_link_status(struct pci_dev *dev);
+bool pcie_has_flr(struct pci_dev *dev);
 int pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);



Re: [PATCH v2 2/2] PCI: NVMe device specific reset quirk

2018-07-23 Thread Alex Williamson
On Mon, 23 Jul 2018 19:20:41 -0700
Sinan Kaya  wrote:

> On 7/23/18, Alex Williamson  wrote:
> > On Mon, 23 Jul 2018 17:40:02 -0700
> > Sinan Kaya  wrote:
> >  
> >> On 7/23/2018 5:13 PM, Alex Williamson wrote:  
> >> > + * The NVMe specification requires that controllers support PCIe FLR,
> >> > but
> >> > + * but some Samsung SM961/PM961 controllers fail to recover after FLR
> >> > (-1
> >> > + * config space) unless the device is quiesced prior to FLR.  
> >>
> >> Does disabling the memory bit in PCI config space as part of the FLR
> >> reset function help? (like the very first thing)  
> >
> > No, it does not.  I modified this to only clear PCI_COMMAND_MEMORY and
> > call pcie_flr(), the Samsung controller dies just as it did previously.
> >  
> >> Can we do that in the pcie_flr() function to cover other endpoint types
> >> that might be pushing traffic while code is trying to do a reset?  
> >
> > Do you mean PCI_COMMAND_MASTER rather than PCI_COMMAND_MEMORY?  
> 
> Yes
> 
> >  I tried
> > that too, it doesn't work either.  I'm not really sure the theory
> > behind clearing memory, clearing busmaster to stop DMA seems like a
> > sane thing to do, but doesn't help here.  
> 
> Let me explain what I guessed. You might be able to fill in the blanks
> where I am completely off.
> 
> We do vfio initiated flr reset immediately following guest machine
> shutdown. The card could be fully enabled and pushing traffic to the
> system at this moment.
> 
> I don't know if vfio does any device disable or not.

Yes, pci_clear_master() is the very first thing we do in
vfio_pci_disable(), well before we try to reset the device.
 
> FLR is supposed to reset the endpoint but endpoint doesn't recover per
> your report.
> 
> Having vendor specific reset routines for PCIE endpoints defeats the
> purpose of FLR.
> 
> Since the adapter is fully functional, i suggested turning off bus
> master and memory enable bits to stop endpoint from sending packets.
> 
> But, this is not helping either.
> 
> Those sleep statements looked very fragile to be honest.
> 
> I was curious if there is something else that we could do for other endpoints.
> 
> No objections otherwise.

I certainly agree that it would be nice if FLR was more robust on these
devices, but if all devices behaved within the specs we wouldn't have
these quirks to start with ;)  Just as you're suggesting maybe we could
disable busmaster before FLR, which is reasonable but doesn't work
here, I'm basically moving that to a class specific action, quiesce the
controller at the NVMe level rather than PCI level.  Essentially that's
why I thought it reasonable to apply to all NVMe class devices rather
than create just a quirk that delays after FLR for Intel and another
that disables the NVMe controller just for Samsung.  Once I decide to
apply to the whole class, then I need to bring in the device specific
knowledge already found in the native nvme driver for the delay between
clearing the enable bit and checking the ready status bit.  If it's
fragile, then the bare metal nvme driver has the same frailty.  For the
delay I added, all I can say is that it works for me and improves the
usability of the device for this purpose.  I know that 200ms is too
low, ISTR the issue was fixed at 210-220ms, so 250ms provides some
headroom and I've not seen any issues there.  If we want to make it 500
or 1000ms, that's fine by me, I expect it'd work, it's just unnecessary
until we find devices that need longer delays.  Thanks,

Alex


Re: [PATCH v2 2/2] PCI: NVMe device specific reset quirk

2018-07-23 Thread Alex Williamson
On Mon, 23 Jul 2018 17:40:02 -0700
Sinan Kaya  wrote:

> On 7/23/2018 5:13 PM, Alex Williamson wrote:
> > + * The NVMe specification requires that controllers support PCIe FLR, but
> > + * but some Samsung SM961/PM961 controllers fail to recover after FLR (-1
> > + * config space) unless the device is quiesced prior to FLR.  
> 
> Does disabling the memory bit in PCI config space as part of the FLR 
> reset function help? (like the very first thing)

No, it does not.  I modified this to only clear PCI_COMMAND_MEMORY and
call pcie_flr(), the Samsung controller dies just as it did previously.
 
> Can we do that in the pcie_flr() function to cover other endpoint types
> that might be pushing traffic while code is trying to do a reset?

Do you mean PCI_COMMAND_MASTER rather than PCI_COMMAND_MEMORY?  I tried
that too, it doesn't work either.  I'm not really sure the theory
behind clearing memory, clearing busmaster to stop DMA seems like a
sane thing to do, but doesn't help here.  Thanks,

Alex


[PATCH v2 2/2] PCI: NVMe device specific reset quirk

2018-07-23 Thread Alex Williamson
Take advantage of NVMe devices using a standard interface to quiesce
the controller prior to reset, including device specific delays before
and after that reset.  This resolves several NVMe device assignment
scenarios with two different vendors.  The Intel DC P3700 controller
has been shown to only work as a VM boot device on the initial VM
startup, failing after reset or reboot, and also fails to initialize
after hot-plug into a VM.  Adding a delay after FLR resolves these
cases.  The Samsung SM961/PM961 (960 EVO) sometimes fails to return
from FLR with the PCI config space reading back as -1.  A reproducible
instance of this behavior is resolved by clearing the enable bit in
the configuration register and waiting for the ready status to clear
(disabling the NVMe controller) prior to FLR.

As all NVMe devices make use of this standard interface and the NVMe
specification also requires PCIe FLR support, we can apply this quirk
to all devices with matching class code.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1592654
Signed-off-by: Alex Williamson 
---
 drivers/pci/quirks.c |  118 ++
 1 file changed, 118 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e72c8742aafa..bbd029e8d3ae 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/nvme.h>
 #include/* isa_dma_bridge_buggy */
 #include "pci.h"
 
@@ -3669,6 +3670,122 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, int probe)
 #define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
 #define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
 
+/* NVMe controller needs delay before testing ready status */
+#define NVME_QUIRK_CHK_RDY_DELAY   (1 << 0)
+/* NVMe controller needs post-FLR delay */
+#define NVME_QUIRK_POST_FLR_DELAY  (1 << 1)
+
+static const struct pci_device_id nvme_reset_tbl[] = {
+   { PCI_DEVICE(0x1bb1, 0x0100),   /* Seagate Nytro Flash Storage */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x1c58, 0x0023),   /* WDC SN200 adapter */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x1c5f, 0x0540),   /* Memblaze Pblaze4 adapter */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(PCI_VENDOR_ID_SAMSUNG, 0xa821),   /* Samsung PM1725 */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(PCI_VENDOR_ID_SAMSUNG, 0xa822),   /* Samsung PM1725a */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x0953),   /* Intel DC P3700 */
+   .driver_data = NVME_QUIRK_POST_FLR_DELAY, },
+   { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
+   { 0 }
+};
+
+/*
+ * The NVMe specification requires that controllers support PCIe FLR, but
+ * some Samsung SM961/PM961 controllers fail to recover after FLR (-1
+ * config space) unless the device is quiesced prior to FLR.  Do this for
+ * all NVMe devices by disabling the controller before reset.  Some Intel
+ * controllers also require an additional post-FLR delay or else attempts
+ * to re-enable will time out; do that here as well with a heuristically
+ * determined delay value.  Also maintain the delay between disabling and
+ * checking ready status as used by the native NVMe driver.
+ *
+ * NVMe specification: https://nvmexpress.org/resources/specifications/
+ * Revision 1.0e:
+ *Chapter 2: Required and optional PCI config registers
+ *Chapter 3: NVMe control registers
+ *Chapter 7.3: Reset behavior
+ */
+static int reset_nvme(struct pci_dev *dev, int probe)
+{
+   const struct pci_device_id *id;
+   void __iomem *bar;
+   u16 cmd;
+   u32 cfg;
+
+   id = pci_match_id(nvme_reset_tbl, dev);
+   if (!id || !pcie_has_flr(dev) || !pci_resource_start(dev, 0))
+   return -ENOTTY;
+
+   if (probe)
+   return 0;
+
+   bar = pci_iomap(dev, 0, NVME_REG_CC + sizeof(cfg));
+   if (!bar)
+   return -ENOTTY;
+
+   pci_read_config_word(dev, PCI_COMMAND, &cmd);
+   pci_write_config_word(dev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
+
+   cfg = readl(bar + NVME_REG_CC);
+
+   /* Disable controller if enabled */
+   if (cfg & NVME_CC_ENABLE) {
+   u64 cap = readq(bar + NVME_REG_CAP);
+   unsigned long timeout;
+
+   /*
+* Per nvme_disable_ctrl() skip shutdown notification as it
+* could complete commands to the admin queue.  We only intend
+* to quiesce the device before reset.
+*/
+   cfg &= ~(NVME_CC_SHN_MASK | NVME_CC_ENABLE);
+
+   writel(cfg, bar + NVME_REG_CC);
+
+  

[PATCH v2 0/2] PCI: NVMe reset quirk

2018-07-23 Thread Alex Williamson
v2: Add bug link, use Samsung vendor ID, add spec references

As discussed in the 2nd patch, at least one NVMe controller sometimes
doesn't like being reset while enabled and another will timeout during
a subsequent re-enable if it happens too quickly after reset.
Introduce a device specific reset quirk for all NVMe class devices so
that we can try to get reliable behavior from them for device
assignment and any other users of the PCI subsystem reset interface.

Patches against current PCI next branch.  Thanks,

Alex

---

Alex Williamson (2):
  PCI: Export pcie_has_flr()
  PCI: NVMe device specific reset quirk


 drivers/pci/pci.c|3 +
 drivers/pci/quirks.c |  118 ++
 include/linux/pci.h  |1 
 3 files changed, 121 insertions(+), 1 deletion(-)


[PATCH v2 1/2] PCI: Export pcie_has_flr()

2018-07-23 Thread Alex Williamson
pcie_flr() suggests pcie_has_flr() to ensure that PCIe FLR support is
present prior to calling.  pcie_flr() is exported while pcie_has_flr()
is not.  Resolve this.

Signed-off-by: Alex Williamson 
---
 drivers/pci/pci.c   |3 ++-
 include/linux/pci.h |1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2bec76c9d9a7..52fe2d72a99c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4071,7 +4071,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
  * Returns true if the device advertises support for PCIe function level
  * resets.
  */
-static bool pcie_has_flr(struct pci_dev *dev)
+bool pcie_has_flr(struct pci_dev *dev)
 {
u32 cap;
 
@@ -4081,6 +4081,7 @@ static bool pcie_has_flr(struct pci_dev *dev)
pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
return cap & PCI_EXP_DEVCAP_FLR;
 }
+EXPORT_SYMBOL_GPL(pcie_has_flr);
 
 /**
  * pcie_flr - initiate a PCIe function level reset
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 04c7ea6ed67b..bbe030d7814f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1092,6 +1092,7 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width);
 void pcie_print_link_status(struct pci_dev *dev);
+bool pcie_has_flr(struct pci_dev *dev);
 int pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);



Re: [PATCH 2/2] PCI: NVMe device specific reset quirk

2018-07-23 Thread Alex Williamson
On Mon, 23 Jul 2018 17:45:33 -0500
Bjorn Helgaas  wrote:

> On Mon, Jul 23, 2018 at 04:24:31PM -0600, Alex Williamson wrote:
> > Take advantage of NVMe devices using a standard interface to quiesce
> > the controller prior to reset, including device specific delays before
> > and after that reset.  This resolves several NVMe device assignment
> > scenarios with two different vendors.  The Intel DC P3700 controller
> > has been shown to only work as a VM boot device on the initial VM
> > startup, failing after reset or reboot, and also fails to initialize
> > after hot-plug into a VM.  Adding a delay after FLR resolves these
> > cases.  The Samsung SM961/PM961 (960 EVO) sometimes fails to return
> > from FLR with the PCI config space reading back as -1.  A reproducible
> > instance of this behavior is resolved by clearing the enable bit in
> > the configuration register and waiting for the ready status to clear
> > (disabling the NVMe controller) prior to FLR.
> > 
> > As all NVMe devices make use of this standard interface and the NVMe
> > specification also requires PCIe FLR support, we can apply this quirk
> > to all devices with matching class code.  
> 
> Do you have any pointers to problem reports or bugzilla entries that
> we could include here?

Yes, https://bugzilla.redhat.com/show_bug.cgi?id=1592654

This only covers the Intel P3700 issue.  The Samsung issue has been
reported via a couple bugs, but was not reproducible until recently.
Those bugs were previously closed due to lack of information.  I'll add
the above link.

> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/pci/quirks.c |  112 
> > ++
> >  1 file changed, 112 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index e72c8742aafa..83853562f220 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -28,6 +28,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include/* isa_dma_bridge_buggy */
> >  #include "pci.h"
> >  
> > @@ -3669,6 +3670,116 @@ static int reset_chelsio_generic_dev(struct pci_dev 
> > *dev, int probe)
> >  #define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
> >  #define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
> >  
> > +/* NVMe controller needs delay before testing ready status */
> > +#define NVME_QUIRK_CHK_RDY_DELAY   (1 << 0)
> > +/* NVMe controller needs post-FLR delay */
> > +#define NVME_QUIRK_POST_FLR_DELAY  (1 << 1)
> > +
> > +static const struct pci_device_id nvme_reset_tbl[] = {
> > +   { PCI_DEVICE(0x1bb1, 0x0100),   /* Seagate Nytro Flash Storage */
> > +   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
> > +   { PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
> > +   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
> > +   { PCI_DEVICE(0x1c58, 0x0023),   /* WDC SN200 adapter */
> > +   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
> > +   { PCI_DEVICE(0x1c5f, 0x0540),   /* Memblaze Pblaze4 adapter */
> > +   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
> > +   { PCI_DEVICE(0x144d, 0xa821),   /* Samsung PM1725 */  
> 
> We do have PCI_VENDOR_ID_SAMSUNG if you want to use it here.  I
> don't see Seagate, HGST, etc.

Oops, cut and pasted those from the nvme driver, I'll use the Samsung
macro.
 
> > +   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
> > +   { PCI_DEVICE(0x144d, 0xa822),   /* Samsung PM1725a */
> > +   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
> > +   { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x0953),   /* Intel DC P3700 */
> > +   .driver_data = NVME_QUIRK_POST_FLR_DELAY, },
> > +   { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
> > +   { 0 }
> > +};
> > +
> > +/*
> > + * The NVMe specification requires that controllers support PCIe FLR, but
> > + * but some Samsung SM961/PM961 controllers fail to recover after FLR (-1
> > + * config space) unless the device is quiesced prior to FLR.  Do this for
> > + * all NVMe devices by disabling the controller before reset.  Some Intel
> > + * controllers also require an additional post-FLR delay or else attempts
> > + * to re-enable will timeout, do that here as well with heuristically
> > + * determined delay value.  Also maintain the delay between disabling and
> > + * checking ready status as used by the native NVMe driver.
> > + */
> > +static int reset_nvme(struct pci_dev *dev, int probe)
> > +{
> > +   const struct pci_device_id *id;
> > +   void __i

Re: [PATCH 2/2] PCI: NVMe device specific reset quirk

2018-07-23 Thread Alex Williamson
On Mon, 23 Jul 2018 16:45:08 -0600
Keith Busch  wrote:

> On Mon, Jul 23, 2018 at 04:24:31PM -0600, Alex Williamson wrote:
> > Take advantage of NVMe devices using a standard interface to quiesce
> > the controller prior to reset, including device specific delays before
> > and after that reset.  This resolves several NVMe device assignment
> > scenarios with two different vendors.  The Intel DC P3700 controller
> > has been shown to only work as a VM boot device on the initial VM
> > startup, failing after reset or reboot, and also fails to initialize
> > after hot-plug into a VM.  Adding a delay after FLR resolves these
> > cases.  The Samsung SM961/PM961 (960 EVO) sometimes fails to return
> > from FLR with the PCI config space reading back as -1.  A reproducible
> > instance of this behavior is resolved by clearing the enable bit in
> > the configuration register and waiting for the ready status to clear
> > (disabling the NVMe controller) prior to FLR.
> > 
> > As all NVMe devices make use of this standard interface and the NVMe
> > specification also requires PCIe FLR support, we can apply this quirk
> > to all devices with matching class code.  
> 
> Shouldn't this go in the nvme driver's reset_prepare/reset_done callbacks?

The scenario I'm trying to fix is device assignment, the nvme driver
isn't in play there.  The device is bound to the vfio-pci driver at the
time of these resets.  Thanks,

Alex


[PATCH 1/2] PCI: Export pcie_has_flr()

2018-07-23 Thread Alex Williamson
pcie_flr() suggests pcie_has_flr() to ensure that PCIe FLR support is
present prior to calling.  pcie_flr() is exported while pcie_has_flr()
is not.  Resolve this.

Signed-off-by: Alex Williamson 
---
 drivers/pci/pci.c   |3 ++-
 include/linux/pci.h |1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2bec76c9d9a7..52fe2d72a99c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4071,7 +4071,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
  * Returns true if the device advertises support for PCIe function level
  * resets.
  */
-static bool pcie_has_flr(struct pci_dev *dev)
+bool pcie_has_flr(struct pci_dev *dev)
 {
u32 cap;
 
@@ -4081,6 +4081,7 @@ static bool pcie_has_flr(struct pci_dev *dev)
pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
return cap & PCI_EXP_DEVCAP_FLR;
 }
+EXPORT_SYMBOL_GPL(pcie_has_flr);
 
 /**
  * pcie_flr - initiate a PCIe function level reset
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 04c7ea6ed67b..bbe030d7814f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1092,6 +1092,7 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width);
 void pcie_print_link_status(struct pci_dev *dev);
+bool pcie_has_flr(struct pci_dev *dev);
 int pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);



[PATCH 0/2] PCI: NVMe reset quirk

2018-07-23 Thread Alex Williamson
As discussed in the 2nd patch, at least one NVMe controller sometimes
doesn't like being reset while enabled and another will timeout during
a subsequent re-enable if it happens too quickly after reset.
Introduce a device specific reset quirk for all NVMe class devices so
that we can try to get reliable behavior from them for device
assignment and any other users of the PCI subsystem reset interface.

Patches against current PCI next branch.  Thanks,

Alex

---

Alex Williamson (2):
  PCI: Export pcie_has_flr()
  PCI: NVMe device specific reset quirk


 drivers/pci/pci.c|3 +
 drivers/pci/quirks.c |  112 ++
 include/linux/pci.h  |1 
 3 files changed, 115 insertions(+), 1 deletion(-)


[PATCH 2/2] PCI: NVMe device specific reset quirk

2018-07-23 Thread Alex Williamson
Take advantage of NVMe devices using a standard interface to quiesce
the controller prior to reset, including device specific delays before
and after that reset.  This resolves several NVMe device assignment
scenarios with two different vendors.  The Intel DC P3700 controller
has been shown to only work as a VM boot device on the initial VM
startup, failing after reset or reboot, and also fails to initialize
after hot-plug into a VM.  Adding a delay after FLR resolves these
cases.  The Samsung SM961/PM961 (960 EVO) sometimes fails to return
from FLR with the PCI config space reading back as -1.  A reproducible
instance of this behavior is resolved by clearing the enable bit in
the configuration register and waiting for the ready status to clear
(disabling the NVMe controller) prior to FLR.

As all NVMe devices make use of this standard interface and the NVMe
specification also requires PCIe FLR support, we can apply this quirk
to all devices with matching class code.

Signed-off-by: Alex Williamson 
---
 drivers/pci/quirks.c |  112 ++
 1 file changed, 112 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e72c8742aafa..83853562f220 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/nvme.h>
 #include/* isa_dma_bridge_buggy */
 #include "pci.h"
 
@@ -3669,6 +3670,116 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, int probe)
 #define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
 #define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
 
+/* NVMe controller needs delay before testing ready status */
+#define NVME_QUIRK_CHK_RDY_DELAY   (1 << 0)
+/* NVMe controller needs post-FLR delay */
+#define NVME_QUIRK_POST_FLR_DELAY  (1 << 1)
+
+static const struct pci_device_id nvme_reset_tbl[] = {
+   { PCI_DEVICE(0x1bb1, 0x0100),   /* Seagate Nytro Flash Storage */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x1c58, 0x0023),   /* WDC SN200 adapter */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x1c5f, 0x0540),   /* Memblaze Pblaze4 adapter */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x144d, 0xa821),   /* Samsung PM1725 */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(0x144d, 0xa822),   /* Samsung PM1725a */
+   .driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+   { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x0953),   /* Intel DC P3700 */
+   .driver_data = NVME_QUIRK_POST_FLR_DELAY, },
+   { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
+   { 0 }
+};
+
+/*
+ * The NVMe specification requires that controllers support PCIe FLR, but
+ * some Samsung SM961/PM961 controllers fail to recover after FLR (-1
+ * config space) unless the device is quiesced prior to FLR.  Do this for
+ * all NVMe devices by disabling the controller before reset.  Some Intel
+ * controllers also require an additional post-FLR delay or else attempts
+ * to re-enable will time out; do that here as well with a heuristically
+ * determined delay value.  Also maintain the delay between disabling and
+ * checking ready status as used by the native NVMe driver.
+ */
+static int reset_nvme(struct pci_dev *dev, int probe)
+{
+   const struct pci_device_id *id;
+   void __iomem *bar;
+   u16 cmd;
+   u32 cfg;
+
+   id = pci_match_id(nvme_reset_tbl, dev);
+   if (!id || !pcie_has_flr(dev) || !pci_resource_start(dev, 0))
+   return -ENOTTY;
+
+   if (probe)
+   return 0;
+
+   bar = pci_iomap(dev, 0, NVME_REG_CC + sizeof(cfg));
+   if (!bar)
+   return -ENOTTY;
+
+   pci_read_config_word(dev, PCI_COMMAND, &cmd);
+   pci_write_config_word(dev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
+
+   cfg = readl(bar + NVME_REG_CC);
+
+   /* Disable controller if enabled */
+   if (cfg & NVME_CC_ENABLE) {
+   u64 cap = readq(bar + NVME_REG_CAP);
+   unsigned long timeout;
+
+   /*
+* Per nvme_disable_ctrl() skip shutdown notification as it
+* could complete commands to the admin queue.  We only intend
+* to quiesce the device before reset.
+*/
+   cfg &= ~(NVME_CC_SHN_MASK | NVME_CC_ENABLE);
+
+   writel(cfg, bar + NVME_REG_CC);
+
+   /* A heuristic value, matches NVME_QUIRK_DELAY_AMOUNT */
+   if (id->driver_data & NVME_QUIRK_CHK_RDY_DELAY)
+   msleep(2300);
+
+   /* Cap register provides max timeout in 500ms increments */
+   timeout = ((NVME_CAP_TIMEOUT(cap) + 1) *

[GIT PULL] VFIO fixes for v4.18-rc6

2018-07-20 Thread Alex Williamson
Hi Linus,

The following changes since commit 9d3cce1e8b8561fed5f383d22a4d6949db4eadbe:

  Linux 4.18-rc5 (2018-07-15 12:49:31 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.18-rc6

for you to fetch changes up to 0e714d27786ce1fb3efa9aac58abc096e68b1c2a:

  vfio/pci: Fix potential Spectre v1 (2018-07-18 12:57:25 -0600)


VFIO fixes for v4.18

 - Harden potential Spectre v1 issue (Gustavo A. R. Silva)


Gustavo A. R. Silva (1):
  vfio/pci: Fix potential Spectre v1

 drivers/vfio/pci/vfio_pci.c | 4 
 1 file changed, 4 insertions(+)


Re: [RFC PATCH] vfio/pci: map prefetchble bars as writecombine

2018-07-20 Thread Alex Williamson
On Thu, 19 Jul 2018 21:49:48 +0530
Srinath Mannam  wrote:

> HI Alex,
> 
> On Thu, Jul 19, 2018 at 8:42 PM, Alex Williamson
>  wrote:
> > On Thu, 19 Jul 2018 20:17:11 +0530
> > Srinath Mannam  wrote:
> >  
> >> HI Alex,
> >>
> >> On Thu, Jul 19, 2018 at 2:55 AM, Alex Williamson
> >>  wrote:  
> >> > On Thu, 19 Jul 2018 00:05:18 +0530
> >> > Srinath Mannam  wrote:
> >> >  
> >> >> Hi Alex,
> >> >>
> >> >> On Tue, Jul 17, 2018 at 8:52 PM, Alex Williamson
> >> >>  wrote:  
> >> >> > On Fri, 13 Jul 2018 10:26:17 +0530
> >> >> > Srinath Mannam  wrote:
> >> >> >  
> >> >> >> By default all BARs map with VMA access permissions
> >> >> >> as pgprot_noncached.
> >> >> >>
> >> >> >> In ARM64 pgprot_noncached is MT_DEVICE_nGnRnE which
> >> >> >> is strongly ordered and allows aligned access.
> >> >> >> This type of mapping works for NON-PREFETCHABLE bars
> >> >> >> containing EP controller registers.
> >> >> >> But it restricts PREFETCHABLE bars from doing
> >> >> >> unaligned access.
> >> >> >>
> >> >> >> In CMB NVMe drives PREFETCHABLE bars are required to
> >> >> >> map as MT_NORMAL_NC to do unaligned access.
> >> >> >>
> >> >> >> Signed-off-by: Srinath Mannam 
> >> >> >> Reviewed-by: Ray Jui 
> >> >> >> Reviewed-by: Vikram Prakash 
> >> >> >> ---  
> >> >> >
> >> >> > This has been discussed before:
> >> >> >
> >> >> > https://www.spinics.net/lists/kvm/msg156548.html  
> >> >> Thank you for inputs.. I have gone through the long list of mail chain
> >> >> discussion.  
> >> >> >
> >> >> > CC'ing the usual suspects from the previous thread.  I'm not convinced
> >> >> > that the patch here has considered anything other than the ARM64
> >> >> > implications and it's not clear that it considers compatibility with
> >> >> > existing users or devices at all.  Can we guarantee for all devices 
> >> >> > and
> >> >> > use cases that WC is semantically equivalent and preferable to UC?  If
> >> >> > not then we need to device an extension to the interface that allows
> >> >> > the user to specify WC.  Thanks,
> >> >> >  
> >> >> To implement with user specified WC flags, many changes need to be done.
> >> >> Suppose In DPDK, prefetcable BARs map using WC flag, then also same
> >> >> question comes
> >> >> that WC may be different for different CPUs.
> >> >> As per functionality, both WC and PREFETCHABLE are same, like merging 
> >> >> writes and
> >> >> typically WC is uncached.
> >> >> So, based on prefetchable BARs behavior and usage we need to map bar 
> >> >> memory.
> >> >> Is it right to map prefetchable BARs as strongly ordered, aligned
> >> >> access and uncached?  
> >> >
> >> > Is it possible to answer that question generically?  Whether to map a
> >> > BAR as UC or WC is generally a question for the driver.  Does the
> >> > device handle unaligned accesses?  Does the device need strong memory
> >> > ordering?  If this is a driver level question then the driver that
> >> > needs to make that decision is the userspace driver.  VFIO is just a
> >> > pass-through here and since we don't offer the user a choice of
> >> > mappings, we take the safer and more conservative mapping, ie. UC.
> >> >  
> >> Yes, you are right, driver should make the decision based on its 
> >> requirement.
> >> In my case, user space driver is part of SPDK, so SPDK should request DPDK
> >> and DPDK should request VFIO to map BAR for its choice of mapping.
> >> So to implement this we need code changes in VFIO, DPDK and SPDK.
> >>  
> >> > You're suggesting that there are many changes to be done if we modify
> >> > the vfio interface to expose WC under the user's control rather than
> >> > simply transparently impose WC for all mappings, but is that really the
> >> > case?  Most devices on most pla
