RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-11-11 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Wednesday, November 3, 2021 9:26 PM
> 
> On Tue, Nov 02, 2021 at 09:53:29AM +0000, Liu, Yi L wrote:
> 
> > >   vfio_uninit_group_dev(&mdev_state->vdev);
> > >   kfree(mdev_state->pages);
> > >   kfree(mdev_state->vconfig);
> > >   kfree(mdev_state);
> > >
> > > pages/vconfig would logically be in a release function
> >
> > I see. So the criterion is: pointer fields that point to memory buffers
> > allocated by the device driver should logically be freed in a release
> > function, right?
> 
> Often yes, that is usually a good idea
> 
> > I can see there are such fields in struct vfio_pci_core_device
> > and mdev_state (both mbochs and mdpy). So we may go with your option #2.
> > Is it? otherwise, needs to add release callback for all the related drivers.
> 
> Yes, that is the approx trade off
> 
> > > On the other hand ccw needs to rcu free the vfio_device, so that would
> > > have to be global overhead with this api design.
> >
> > not quite get. why ccw is special here? could you elaborate?
> 
> I added a rcu usage to it in order to fix a race
> 
> +static inline struct vfio_ccw_private *vfio_ccw_get_priv(struct subchannel *sch)
> +{
> +   struct vfio_ccw_private *private;
> +
> +   rcu_read_lock();
> +   private = dev_get_drvdata(&sch->dev);
> +   if (private && !vfio_device_try_get(&private->vdev))
> +           private = NULL;
> +   rcu_read_unlock();
> +   return private;
> +}

You are right. After checking your ccw patch, I see that the private data
freed via vfio_ccw_free_private() has to go through kfree_rcu(). That differs
from the other vfio_device users, which only need kfree() to free the
vfio_device. How should I handle this difference when moving the vfio_device
alloc/free into the vfio core? Any suggestions?

@@ -164,14 +173,14 @@ static void vfio_ccw_free_private(struct vfio_ccw_private *private)
        kmem_cache_free(vfio_ccw_io_region, private->io_region);
        kfree(private->cp.guest_cp);
        mutex_destroy(&private->io_mutex);
-       kfree(private);
+       vfio_uninit_group_dev(&private->vdev);
+       kfree_rcu(private, rcu);
 }

https://lore.kernel.org/kvm/10-v3-57c1502c62fd+2190-ccw_mdev_...@nvidia.com/

Regards,
Yi Liu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-11-02 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Monday, November 1, 2021 8:50 PM
> 
> On Fri, Oct 29, 2021 at 09:47:27AM +0000, Liu, Yi L wrote:
> > Hi Jason,
> >
> > > From: Jason Gunthorpe 
> > > Sent: Monday, October 25, 2021 8:53 PM
> > >
> > > > On Mon, Oct 25, 2021 at 06:28:09AM +0000, Liu, Yi L wrote:
> > > >thanks for the guiding. will also refer to your vfio_group_cdev 
> > > > series.
> > > >
> > > >Need to double confirm here. Not quite following on the kfree. Is
> > > >this kfree to free the vfio_device structure? But now the
> > > >vfio_device pointer is provided by callers (e.g. vfio-pci). Do
> > > >you want to let vfio core allocate the vfio_device struct and
> > > >return the pointer to callers?
> > >
> > > There are several common patterns for this problem, two that would be
> > > suitable:
> > >
> > > - Require each driver to provide a release op inside vfio_device_ops
> > >   that does the kfree. Have the core provide a struct device release
> > >   op that calls this one. Keep the kalloc/kfree in the drivers
> >
> > > this way seems to suit the existing vfio registration manner listed
> > > below. right?
> 
> Not really, most drivers are just doing kfree. The need for release
> comes if the drivers are doing more stuff.
> 
> > But device drivers needs to do the kfree in the
> > newly added release op instead of doing it on their own (e.g.
> > doing kfree in remove).
> 
> Yes
> 
> > > > struct ib_device *_ib_alloc_device(size_t size);
> > > > #define ib_alloc_device(drv_struct, member)                              \
> > > >         container_of(_ib_alloc_device(sizeof(struct drv_struct) +        \
> > > >                                       BUILD_BUG_ON_ZERO(offsetof(        \
> > > >                                               struct drv_struct, member))), \
> > > >                      struct drv_struct, member)
> > >
> >
> > thanks for the example. If this way, still requires driver to provide
> > a release op inside vfio_device_ops. right?
> 
> No, it would be optional. It would contain the stuff the driver is doing
> before kfree()
> 
> For instance mdev looks like the only driver that cares:
> 
>   vfio_uninit_group_dev(&mdev_state->vdev);
>   kfree(mdev_state->pages);
>   kfree(mdev_state->vconfig);
>   kfree(mdev_state);
> 
> pages/vconfig would logically be in a release function

I see. So the criterion is: pointer fields that point to memory buffers
allocated by the device driver should logically be freed in a release
function, right? I can see such fields in struct vfio_pci_core_device
and mdev_state (both mbochs and mdpy). So should we go with your option #2?
Otherwise we would need to add a release callback to all the related drivers.

struct vfio_pci_core_device {
        struct vfio_device vdev;
        ...
        u8 *pci_config_map;
        u8 *vconfig;
        ...
};

struct mdev_state {
        struct vfio_device vdev;
        ...
        u8 *vconfig;
        struct page **pages;
        ...
};

> On the other hand ccw needs to rcu free the vfio_device, so that would
> have to be global overhead with this api design.

I don't quite get this. Why is ccw special here? Could you elaborate?

Thanks,
Yi Liu


RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-10-29 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Monday, October 25, 2021 8:53 PM
> 
> On Mon, Oct 25, 2021 at 06:28:09AM +0000, Liu, Yi L wrote:
> >thanks for the guiding. will also refer to your vfio_group_cdev series.
> >
> >Need to double confirm here. Not quite following on the kfree. Is
> >this kfree to free the vfio_device structure? But now the
> >vfio_device pointer is provided by callers (e.g. vfio-pci). Do
> >you want to let vfio core allocate the vfio_device struct and
> >return the pointer to callers?
> 
> There are several common patterns for this problem, two that would be
> suitable:
> 
> - Require each driver to provide a release op inside vfio_device_ops
>   that does the kfree. Have the core provide a struct device release
>   op that calls this one. Keep the kalloc/kfree in the drivers

This way seems to suit the existing vfio registration flow listed
below, right? But device drivers would need to do the kfree in the
newly added release op instead of doing it on their own (e.g.
doing kfree in remove).

vfio_init_group_dev()
vfio_register_group_dev()
vfio_unregister_group_dev()
vfio_uninit_group_dev()

> - Move the kalloc into the core and have the core provide the kfree
>   with an optional release callback for any driver-specific cleanup
> 
>   This requires some macro to make the memory layout work. RDMA has
>   a version of this:
> 
> struct ib_device *_ib_alloc_device(size_t size);
> #define ib_alloc_device(drv_struct, member)                              \
>         container_of(_ib_alloc_device(sizeof(struct drv_struct) +        \
>                                       BUILD_BUG_ON_ZERO(offsetof(        \
>                                               struct drv_struct, member))), \
>                      struct drv_struct, member)
> 

Thanks for the example. With this approach, is the driver still required
to provide a release op inside vfio_device_ops?
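
To make the second option concrete, here is a minimal userspace sketch of a
core-owned allocation with an optional release callback. It is not vfio code:
core_dev, core_dev_ops, core_alloc_dev(), core_put_dev() and my_drv_state are
hypothetical names, calloc()/free() stand in for kzalloc()/kfree(), and
BUILD_BUG_ON_ZERO is re-implemented with a negative-size array so the sketch
stays standard C. The point it tries to show is the layout rule: the embedded
core struct must sit at offset zero, so the core can free the driver struct
through the core pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct core_dev;

/* Hypothetical ops table; only the optional release hook matters here. */
struct core_dev_ops {
	void (*release)(struct core_dev *cdev);
};

struct core_dev {
	const struct core_dev_ops *ops;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Evaluates to 0, or fails to compile when e is non-zero. */
#define BUILD_BUG_ON_ZERO(e) (sizeof(char[1 - 2 * !!(e)]) - 1)

/* The core owns the allocation; size covers the whole driver struct. */
static struct core_dev *_core_alloc_dev(size_t size,
					const struct core_dev_ops *ops)
{
	struct core_dev *cdev = calloc(1, size);

	if (cdev)
		cdev->ops = ops;
	return cdev;
}

/* Same shape as the RDMA macro: member must be first in drv_struct. */
#define core_alloc_dev(drv_struct, member, ops)                          \
	container_of(_core_alloc_dev(sizeof(struct drv_struct) +          \
				     BUILD_BUG_ON_ZERO(offsetof(          \
					     struct drv_struct, member)), \
				     ops),                                \
		     struct drv_struct, member)

/* Core-side free: optional driver cleanup first, then the allocation. */
static void core_put_dev(struct core_dev *cdev)
{
	if (cdev->ops && cdev->ops->release)
		cdev->ops->release(cdev);
	free(cdev);
}

/* Hypothetical driver state with one buffer that needs driver cleanup. */
struct my_drv_state {
	struct core_dev cdev;	/* must stay the first member */
	unsigned char *vconfig;
};

static int release_calls;

static void my_release(struct core_dev *cdev)
{
	struct my_drv_state *st = container_of(cdev, struct my_drv_state, cdev);

	free(st->vconfig);
	release_calls++;
}

static const struct core_dev_ops my_ops = { .release = my_release };

/* Allocates, tears down, and reports how often release ran. */
int run_alloc_demo(void)
{
	struct my_drv_state *st = core_alloc_dev(my_drv_state, cdev, &my_ops);

	if (!st)
		return -1;
	st->vconfig = malloc(256);
	core_put_dev(&st->cdev);
	return release_calls;
}
```

A driver with nothing to clean up would pass an ops table without a release
hook and the core would just free the memory, which is why the callback can
stay optional in this scheme.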

> In part the choice is how many drivers require a release callback
> anyhow, if they all do then the first is easier to understand. If only
> few or none do then the latter is less code in drivers, and never
> exposes the driver to the tricky transition from alloc to refcount
> cleanup.

I'm not quite sure. But per my understanding, since the vfio_device
is expected to be embedded in the device state struct (e.g.
vfio_pci_core_device), I guess most of the drivers will require callback
to do driver specific cleanup. Seems like option #1 may make sense?

Regards,
Yi Liu



RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-10-25 Thread Liu, Yi L
> From: Jason Gunthorpe <j...@nvidia.com>
> Sent: Tuesday, September 21, 2021 11:57 PM
>
> On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> >
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> >
> > Signed-off-by: Liu Yi L <yi.l@intel.com>
> > ---
> >  drivers/vfio/vfio.c  | 228 +++----
> >  include/linux/vfio.h |   2 +
> >  2 files changed, 213 insertions(+), 17 deletions(-)
>
> > +static int vfio_init_device_class(void)
> > +{
> > +  int ret;
> > +
> > +  mutex_init(&vfio.device_lock);
> > +  idr_init(&vfio.device_idr);
> > +
> > +  /* /dev/vfio/devices/$DEVICE */
> > +  vfio.device_class = class_create(THIS_MODULE, "vfio-device");
> > +  if (IS_ERR(vfio.device_class))
> > +          return PTR_ERR(vfio.device_class);
> > +
> > +  vfio.device_class->devnode = vfio_device_devnode;
> > +
> > +  ret = alloc_chrdev_region(&vfio.device_devt, 0, MINORMASK + 1, "vfio-device");
> > +  if (ret)
> > +          goto err_alloc_chrdev;
> > +
> > +  cdev_init(&vfio.device_cdev, &vfio_device_fops);
> > +  ret = cdev_add(&vfio.device_cdev, vfio.device_devt, MINORMASK + 1);
> > +  if (ret)
> > +          goto err_cdev_add;
>
> Huh? This is not how cdevs are used. This patch needs rewriting.
>
> The struct vfio_device should gain a 'struct device' and 'struct cdev'
> as non-pointer members
>
> vfio register path should end up doing cdev_device_add() for each
> vfio_device
>
> vfio_unregister path should do cdev_device_del()
>
> No idr should be needed, an ida is used to allocate minor numbers
>
> The struct device release function should trigger a kfree which
> requires some reworking of the callers

Thanks for the guidance. I will also refer to your vfio_group_cdev series.

I need to double check one thing: I don't quite follow the kfree. Is this
kfree meant to free the vfio_device structure? Today the vfio_device pointer
is provided by the callers (e.g. vfio-pci). Do you want the vfio core to
allocate the vfio_device struct and return the pointer to callers?

Thanks,
Yi Liu

> vfio_init_group_dev() should do a device_initialize()
> vfio_uninit_group_dev() should do a device_put()
>
> The opened atomic is awful. A newly created fd should start in a
> state where it has a disabled fops
>
> The only thing the disabled fops can do is register the device to the
> iommu fd. When successfully registered the device gets the normal fops.
>
> The registration steps should be done under a normal lock inside the
> vfio_device. If a vfio_device is already registered then further
> registration should fail.
>
> Getting the device fd via the group fd triggers the same sequence as
> above.
>
> Jason


RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-10-20 Thread Liu, Yi L
> From: David Gibson 
> Sent: Wednesday, September 29, 2021 10:09 AM
> 
> On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> >
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> >
> > Signed-off-by: Liu Yi L 
> [snip]
> 
> > +static bool vfio_device_in_container(struct vfio_device *device)
> > +{
> > +   return !!(device->group && device->group->container);
> 
> You don't need !! here.  && is already a logical operation, so returns
> a valid bool.

got it. thanks.

Regards,
Yi Liu
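
The point about "!!" can be checked with a small standalone sketch. The
struct definitions below are hypothetical stand-ins reduced to the two fields
the helper reads; only the shape of the return expression follows the quoted
patch. In C, "&&" already evaluates to an int that is exactly 0 or 1, so the
conversion to bool needs no extra normalization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal stand-ins for the kernel structures. */
struct group {
	void *container;
};

struct device {
	struct group *group;
};

/* Same logic as the reviewed helper, without the redundant "!!". */
bool device_in_container(const struct device *device)
{
	return device->group && device->group->container;
}

/* Returns 1 when all three cases behave as expected. */
int check_in_container(void)
{
	int token = 0;
	struct group with = { .container = &token };
	struct group without = { .container = NULL };
	struct device a = { .group = &with };
	struct device b = { .group = &without };
	struct device c = { .group = NULL };

	return device_in_container(&a) == true &&
	       device_in_container(&b) == false &&
	       device_in_container(&c) == false;
}
```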



RE: [RFC 01/20] iommu/iommufd: Add /dev/iommu core

2021-10-15 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Friday, October 15, 2021 7:18 PM
> 
> On Fri, Oct 15, 2021 at 09:18:06AM +0000, Liu, Yi L wrote:
> 
> > >   Acquire from the xarray is
> > >rcu_lock()
> > >ioas = xa_load()
> > >if (ioas)
> > >   if (down_read_trylock(&ioas->destroying_lock))
> >
> > all good suggestions, will refine accordingly. Here destroying_lock is a
> > rw_semaphore. right? Since down_read_trylock() accepts a rwsem.
> 
> Yes, you probably need a sleeping lock

got it. thanks,

Regards,
Yi Liu


RE: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

2021-10-15 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Wednesday, September 22, 2021 1:15 AM
> 
> On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:
> 
> > +/*
> > + * A iommufd_device object represents the binding relationship
> > + * between iommufd and device. It is created per a successful
> > + * binding request from device driver. The bound device must be
> > + * a physical device so far. Subdevice will be supported later
> > + * (with additional PASID information). An user-assigned cookie
> > + * is also recorded to mark the device in the /dev/iommu uAPI.
> > + */
> > +struct iommufd_device {
> > +   unsigned int id;
> > +   struct iommufd_ctx *ictx;
> > +   struct device *dev; /* always be the physical device */
> > +   u64 dev_cookie;
> >  };
> >
> >  static int iommufd_fops_open(struct inode *inode, struct file *filep)
> > @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
> > return -ENOMEM;
> >
> > refcount_set(&ictx->refs, 1);
> > +   mutex_init(&ictx->lock);
> > +   xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
> > filep->private_data = ictx;
> >
> > return ret;
> >  }
> >
> > +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> > +{
> > +   refcount_inc(&ictx->refs);
> > +}
> 
> See my earlier remarks about how to structure the lifetime logic, this
> ref isn't necessary.
> 
> > +static const struct file_operations iommufd_fops;
> > +
> > +/**
> > + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
> > + * @fd: [in] iommufd file descriptor.
> > + *
> > + * Returns a pointer to the iommufd context, otherwise NULL;
> > + *
> > + */
> > +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> > +{
> > +   struct fd f = fdget(fd);
> > +   struct file *file = f.file;
> > +   struct iommufd_ctx *ictx;
> > +
> > +   if (!file)
> > +   return NULL;
> > +
> > +   if (file->f_op != &iommufd_fops)
> > +   return NULL;
> 
> Leaks the fdget
> 
> > +
> > +   ictx = file->private_data;
> > +   if (ictx)
> > +   iommufd_ctx_get(ictx);
> 
> Use success oriented flow
> 
> > +   fdput(f);
> > +   return ictx;
> > +}
> 
> > + */
> > +struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
> > +  u64 dev_cookie)
> > +{
> > +   struct iommufd_ctx *ictx;
> > +   struct iommufd_device *idev;
> > +   unsigned long index;
> > +   unsigned int id;
> > +   int ret;
> > +
> > +   ictx = iommufd_ctx_fdget(fd);
> > +   if (!ictx)
> > +   return ERR_PTR(-EINVAL);
> > +
> > +   mutex_lock(&ictx->lock);
> > +
> > +   /* check duplicate registration */
> > +   xa_for_each(&ictx->device_xa, index, idev) {
> > +   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> > +   idev = ERR_PTR(-EBUSY);
> > +   goto out_unlock;
> > +   }
> 
> I can't think of a reason why this expensive check is needed.
> 
> > +   }
> > +
> > +   idev = kzalloc(sizeof(*idev), GFP_KERNEL);
> > +   if (!idev) {
> > +   ret = -ENOMEM;
> > +   goto out_unlock;
> > +   }
> > +
> > +   /* Establish the security context */
> > +   ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
> > +   if (ret)
> > +   goto out_free;
> > +
> > +   ret = xa_alloc(&ictx->device_xa, &id, idev,
> > +  XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
> > +  GFP_KERNEL);
> 
> idev should be fully initialized before being placed in the xarray, so
> this should be the last thing done.

all good suggestions above. thanks for catching them.

> Why not just use the standard xa_limit_32b instead of special single
> use constants?

yeah. should use xa_limit_32b.
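
The "Leaks the fdget" and "Use success oriented flow" remarks can be sketched
in userspace as follows. Everything here is a hypothetical stand-in: struct
handle plays the role of the struct fd / struct file pair, handle_get() and
handle_put() play fdget() and fdput(), and the plain int counters only exist
so a leaked reference becomes visible. The sketch still pins the context with
its own counter, as the RFC did; the review suggests relying on the file
reference instead, which this userspace model does not attempt to show.

```c
#include <assert.h>
#include <stddef.h>

struct ctx {
	int refs;		/* long-lived references to the context */
};

struct handle {
	int refs;		/* temporary references currently held */
	const void *ops;	/* identifies the handle type */
	struct ctx *private_data;
};

static const int my_ops_tag = 0;	/* its address is the expected type tag */

static struct handle *handle_get(struct handle *h)
{
	if (h)
		h->refs++;
	return h;
}

static void handle_put(struct handle *h)
{
	h->refs--;
}

/*
 * Failures branch to the shared unwind label and the success path runs
 * straight down the function, so the temporary reference taken by
 * handle_get() is dropped on every path that acquired it.
 */
struct ctx *ctx_from_handle(struct handle *maybe)
{
	struct handle *h = handle_get(maybe);
	struct ctx *ctx = NULL;

	if (!h)
		return NULL;	/* nothing was taken, nothing to drop */
	if (h->ops != &my_ops_tag)
		goto out_put;	/* wrong type: the reference is still dropped */
	if (!h->private_data)
		goto out_put;

	ctx = h->private_data;
	ctx->refs++;		/* pin the context before the handle goes away */
out_put:
	handle_put(h);
	return ctx;
}

/* Returns the number of leaked temporary references, or -1 on misbehaviour. */
int leak_check(void)
{
	struct ctx c = { 0 };
	struct handle good = { 0, &my_ops_tag, &c };
	struct handle wrong = { 0, NULL, &c };

	if (ctx_from_handle(&good) != &c || c.refs != 1)
		return -1;
	if (ctx_from_handle(&wrong) != NULL)
		return -1;
	if (ctx_from_handle(NULL) != NULL)
		return -1;
	return good.refs + wrong.refs;
}
```

The same ordering idea applies to the xa_alloc() comment above: do all the
initialization that can fail first, and publish the object as the last step.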

Regards,
Yi Liu

> Jason


RE: [RFC 01/20] iommu/iommufd: Add /dev/iommu core

2021-10-15 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, September 21, 2021 11:42 PM
> 
> On Sun, Sep 19, 2021 at 02:38:29PM +0800, Liu Yi L wrote:
> > /dev/iommu aims to provide a unified interface for managing I/O address
> > spaces for devices assigned to userspace. This patch adds the initial
> > framework to create a /dev/iommu node. Each open of this node returns an
> > iommufd. And this fd is the handle for userspace to initiate its I/O
> > address space management.
> >
> > One open:
> > - We call this feature as IOMMUFD in Kconfig in this RFC. However this
> >   name is not clear enough to indicate its purpose to user. Back to 2010
> >   vfio even introduced a /dev/uiommu [1] as the predecessor of its
> >   container concept. Is that a better name? Appreciate opinions here.
> >
> > [1] https://lore.kernel.org/kvm/4c0eb470.1hmjondo00nivfm6%25p...@cisco.com/
> >
> > Signed-off-by: Liu Yi L 
> >  drivers/iommu/Kconfig   |   1 +
> >  drivers/iommu/Makefile  |   1 +
> >  drivers/iommu/iommufd/Kconfig   |  11 
> >  drivers/iommu/iommufd/Makefile  |   2 +
> >  drivers/iommu/iommufd/iommufd.c | 112
> >  5 files changed, 127 insertions(+)
> >  create mode 100644 drivers/iommu/iommufd/Kconfig
> >  create mode 100644 drivers/iommu/iommufd/Makefile
> >  create mode 100644 drivers/iommu/iommufd/iommufd.c
> >
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index 07b7c25cbed8..a83ce0acd09d 100644
> > +++ b/drivers/iommu/Kconfig
> > @@ -136,6 +136,7 @@ config MSM_IOMMU
> >
> >  source "drivers/iommu/amd/Kconfig"
> >  source "drivers/iommu/intel/Kconfig"
> > +source "drivers/iommu/iommufd/Kconfig"
> >
> >  config IRQ_REMAP
> > bool "Support for Interrupt Remapping"
> > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> > index c0fb0ba88143..719c799f23ad 100644
> > +++ b/drivers/iommu/Makefile
> > @@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> >  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> >  obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
> >  obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
> > +obj-$(CONFIG_IOMMUFD) += iommufd/
> > diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
> > new file mode 100644
> > index 000000000000..9fb7769a815d
> > +++ b/drivers/iommu/iommufd/Kconfig
> > @@ -0,0 +1,11 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +config IOMMUFD
> > +   tristate "I/O Address Space management framework for passthrough devices"
> > +   select IOMMU_API
> > +   default n
> > +   help
> > + provides unified I/O address space management framework for
> > + isolating untrusted DMAs via devices which are passed through
> > + to userspace drivers.
> > +
> > + If you don't know what to do here, say N.
> > diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> > new file mode 100644
> > index 000000000000..54381a01d003
> > +++ b/drivers/iommu/iommufd/Makefile
> > @@ -0,0 +1,2 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +obj-$(CONFIG_IOMMUFD) += iommufd.o
> > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > new file mode 100644
> > index 000000000000..710b7e62988b
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -0,0 +1,112 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * I/O Address Space Management for passthrough devices
> > + *
> > + * Copyright (C) 2021 Intel Corporation
> > + *
> > + * Author: Liu Yi L 
> > + */
> > +
> > +#define pr_fmt(fmt) "iommufd: " fmt
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +/* Per iommufd */
> > +struct iommufd_ctx {
> > +   refcount_t refs;
> > +};
> 
> A private_data of a struct file should avoid having a refcount (and
> this should have been a kref anyhow)
> 
> Use the refcount on the struct file instead.
> 
> In general the lifetime models look overly convoluted to me with
> refcounts being used as locks and going in all manner of directions.
> 
> - No refcount on iommufd_ctx, this should use the fget on the fd.
>   The driver facing version of the API has the driver holds a fget
>   inside the iommufd_device.
> 
> - Put a rwlock inside the iommufd_ioas that

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-29 Thread Liu, Yi L
> From: Jean-Philippe Brucker 
> Sent: Wednesday, September 22, 2021 9:45 PM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> 
> Is this the reason for passing addr_width to IOASID_ALLOC?  I didn't get
> what it's used for or why it's mandatory. But for PPC it sounds like it
> should be an address range instead of an upper limit?

Yes, as this open item describes, it may need to be a range. But I am not
sure whether PPC requires multiple ranges or just one. Perhaps David can
advise here.

Regards,
Yi Liu
 
> Thanks,
> Jean
> 
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> >
> > - Currently ioasid term has already been used in the kernel (drivers/iommu/
> >   ioasid.c) to represent the hardware I/O address space ID in the wire. It
> >   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
> >   ID). We need find a way to resolve the naming conflict between the hardware
> >   ID and software handle. One option is to rename the existing ioasid to be
> >   pasid or ssid, given their full names still sound generic. Appreciate more
> >   thoughts on this open!


RE: [RFC 17/20] iommu/iommufd: Report iova range to userspace

2021-09-29 Thread Liu, Yi L
> From: Jean-Philippe Brucker 
> Sent: Wednesday, September 22, 2021 10:49 PM
> 
> On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> > [HACK. will fix in v2]
> >
> > IOVA range is critical info for userspace to manage DMA for an I/O address
> > space. This patch reports the valid iova range info of a given device.
> >
> > Due to aforementioned hack, this info comes from the hacked vfio type1
> > driver. To follow the same format in vfio, we also introduce a cap chain
> > format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
> [...]
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 49731be71213..f408ad3c8ade 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -68,6 +68,7 @@
> >   *+---++
> >   *...
> >   * @addr_width:the address width of supported I/O address spaces.
> > + * @cap_offset:   Offset within info struct of first cap
> >   *
> >   * Availability: after device is bound to iommufd
> >   */
> > @@ -77,9 +78,11 @@ struct iommu_device_info {
> >  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU enforced snoop */
> >  #define IOMMU_DEVICE_INFO_PGSIZES       (1 << 1) /* supported page sizes */
> >  #define IOMMU_DEVICE_INFO_ADDR_WIDTH    (1 << 2) /* addr_width field valid */
> > +#define IOMMU_DEVICE_INFO_CAPS          (1 << 3) /* info supports cap chain */
> > __u64   dev_cookie;
> > __u64   pgsize_bitmap;
> > __u32   addr_width;
> > +   __u32   cap_offset;
> 
> We can also add vendor-specific page table and PASID table properties as
> capabilities, otherwise we'll need giant unions in the iommu_device_info
> struct. That made me wonder whether pgsize and addr_width should also be
> separate capabilities for consistency, but this way might be good enough.
> There won't be many more generic capabilities. I have "output address
> width"

what do you mean by "output address width"? Is it the output address
of stage-1 translation?

> and "PASID width", the rest is specific to Arm and SMMU table
> formats.

When it comes to nested translation support, the stage-1 related info is
likely to be vendor-specific and will be reported in the cap chain.
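
For readers unfamiliar with the cap chain idea, here is a minimal userspace
sketch of how such a chain can be walked. The header layout (id, version,
offset of the next entry, 0 terminating the chain) is modelled loosely on
vfio's capability header; struct cap_header, struct cap_iova_range,
CAP_ID_IOVA_RANGE and find_cap() are hypothetical names for this sketch and
not the uAPI proposed in the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header; offsets are relative to the start of the buffer. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;		/* offset of the next capability, 0 = end */
};

/* Hypothetical payload for an IOVA range capability. */
struct cap_iova_range {
	struct cap_header hdr;
	uint64_t start;
	uint64_t end;
};

#define CAP_ID_IOVA_RANGE 1

/*
 * Walk the chain from cap_offset and copy out the first capability whose
 * id matches. Returns 0 on success, -1 if it is absent or the chain is
 * malformed.
 */
int find_cap(const uint8_t *buf, size_t len, uint32_t cap_offset,
	     uint16_t id, void *out, size_t out_len)
{
	uint32_t off = cap_offset;
	size_t steps = 0;

	while (off) {
		struct cap_header hdr;

		/* Bounds check, plus a step limit so a looping chain ends. */
		if (off > len || len - off < sizeof(hdr) || steps++ > len)
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id) {
			if (len - off < out_len)
				return -1;
			memcpy(out, buf + off, out_len);
			return 0;
		}
		off = hdr.next;
	}
	return -1;
}

/* Builds a two-entry chain and returns the end of the IOVA range found. */
uint64_t demo_cap_chain_end(void)
{
	/* Pretend the fixed info struct occupies the first 32 bytes. */
	uint8_t buf[128] = { 0 };
	struct cap_header other = { .id = 7, .version = 1, .next = 48 };
	struct cap_iova_range range = {
		.hdr = { .id = CAP_ID_IOVA_RANGE, .version = 1, .next = 0 },
		.start = 0x1000,
		.end = 0xffffffff,
	};
	struct cap_iova_range found;

	memcpy(buf + 32, &other, sizeof(other));
	memcpy(buf + 48, &range, sizeof(range));
	if (find_cap(buf, sizeof(buf), 32, CAP_ID_IOVA_RANGE,
		     &found, sizeof(found)))
		return 0;
	return found.end;
}
```

Unknown ids are simply skipped during the walk, which is what lets
vendor-specific capabilities be added later without changing the fixed part
of the info struct.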

Regards,
Yi Liu

> Thanks,
> Jean
> 
> >  };
> >
> >  #define IOMMU_DEVICE_GET_INFO  _IO(IOMMU_TYPE, IOMMU_BASE + 1)
> > --
> > 2.25.1
> >


RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Wednesday, September 22, 2021 9:32 PM
> 
> On Wed, Sep 22, 2021 at 12:51:38PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > [...]
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > > > index 641f199f2d41..4839f128b24a 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -24,6 +24,7 @@
> > > >  struct iommufd_ctx {
> > > > refcount_t refs;
> > > > struct mutex lock;
> > > > +   struct xarray ioasid_xa; /* xarray of ioasids */
> > > > struct xarray device_xa; /* xarray of bound devices */
> > > >  };
> > > >
> > > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > > > u64 dev_cookie;
> > > >  };
> > > >
> > > > +/* Represent an I/O address space */
> > > > +struct iommufd_ioas {
> > > > +   int ioasid;
> > >
> > > xarray id's should consistently be u32s everywhere.
> >
> > sure. just one more check, this id is supposed to be returned to
> > userspace as the return value of ioctl(IOASID_ALLOC). That's why
> > I chose to use "int" as its prototype to make it aligned with the
> > return type of ioctl(). Based on this, do you think it's still better
> > to use "u32" here?
> 
> I suggest not using the return code from ioctl to exchange data.. The
> rest of the uAPI uses an in/out struct, everything should do
> that consistently.

got it.
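
A small userspace sketch of why the in/out struct is the safer convention:
the return value carries only 0 or a negative errno, while the ID travels in
a u32 field, so IDs above INT_MAX stay representable. struct ioas_alloc_req,
its field names and ioas_alloc() are hypothetical and only stand in for an
ioctl handler; they are not the uAPI from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical request struct; field names are illustrative only. */
struct ioas_alloc_req {
	uint32_t argsz;		/* in: sizeof(struct ioas_alloc_req) */
	uint32_t flags;		/* in: must be 0 in this sketch */
	uint32_t addr_width;	/* in */
	uint32_t out_ioasid;	/* out: allocated ID, full u32 range */
};

/* Deliberately starts above INT_MAX to show the return code could not hold it. */
static uint32_t next_id = 0x80000000u;

/* Stand-in for the ioctl handler: status via return, data via the struct. */
int ioas_alloc(struct ioas_alloc_req *req)
{
	if (req->argsz < sizeof(*req) || req->flags)
		return -EINVAL;
	req->out_ioasid = next_id++;
	return 0;
}

/* Returns the allocated ID, or 0 if the call failed. */
uint32_t demo_alloc(void)
{
	struct ioas_alloc_req req = {
		.argsz = sizeof(req),
		.addr_width = 48,
	};

	return ioas_alloc(&req) ? 0 : req.out_ioasid;
}
```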

Thanks,
Yi Liu


RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-22 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Wednesday, September 22, 2021 1:45 AM
> 
[...]
> > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > index 641f199f2d41..4839f128b24a 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -24,6 +24,7 @@
> >  struct iommufd_ctx {
> > refcount_t refs;
> > struct mutex lock;
> > +   struct xarray ioasid_xa; /* xarray of ioasids */
> > struct xarray device_xa; /* xarray of bound devices */
> >  };
> >
> > @@ -42,6 +43,16 @@ struct iommufd_device {
> > u64 dev_cookie;
> >  };
> >
> > +/* Represent an I/O address space */
> > +struct iommufd_ioas {
> > +   int ioasid;
> 
> xarray id's should consistently be u32s everywhere.

sure. just one more check, this id is supposed to be returned to
userspace as the return value of ioctl(IOASID_ALLOC). That's why
I chose to use "int" as its prototype to make it aligned with the
return type of ioctl(). Based on this, do you think it's still better
to use "u32" here?

Regards,
Yi Liu

> Many of the same prior comments repeated here
>
> Jason


RE: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management

2021-09-21 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, September 21, 2021 9:45 PM
> 
> On Sun, Sep 19, 2021 at 02:38:28PM +0800, Liu Yi L wrote:
> > Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
> > vDPA) to manage secure device access from the userspace. One critical task
> > of those frameworks is to put the assigned device in a secure,
> > IOMMU-protected context so user-initiated DMAs are prevented from doing
> > harm to the rest of the system.
> 
> Some bot will probably send this too, but it has compile warnings and
> needs to be rebased to 5.15-rc1

thanks Jason, will fix the warnings. yeah, I was using 5.14 in the test, will
rebase to 5.15-rc# in next version.

Regards,
Yi Liu

> drivers/iommu/iommufd/iommufd.c:269:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
> if (refcount_read(>refs) > 1) {
> ^~
> drivers/iommu/iommufd/iommufd.c:277:9: note: uninitialized use occurs here
> return ret;
>^~~
> drivers/iommu/iommufd/iommufd.c:269:2: note: remove the 'if' if its condition is always true
> if (refcount_read(>refs) > 1) {
> ^~~~
> drivers/iommu/iommufd/iommufd.c:253:17: note: initialize the variable 'ret' to silence this warning
> int ioasid, ret;
>^
> = 0
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
> return ERR_PTR(ret);
>^~~
> drivers/iommu/iommufd/iommufd.c:727:3: note: remove the 'if' if its condition is always false
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^~~~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
> return ERR_PTR(ret);
>^~~
> drivers/iommu/iommufd/iommufd.c:727:7: note: remove the '||' if its condition is always false
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^~~
> drivers/iommu/iommufd/iommufd.c:717:9: note: initialize the variable 'ret' to silence this warning
> int ret;
>^
> = 0
> 
> Jason


RE: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management

2021-09-19 Thread Liu, Yi L
> From: Liu, Yi L 
> Sent: Sunday, September 19, 2021 2:38 PM
[...]
> [Series Overview]
>
> * Basic skeleton:
>   0001-iommu-iommufd-Add-dev-iommu-core.patch
> 
> * VFIO PCI creates device-centric interface:
>   0002-vfio-Add-device-class-for-dev-vfio-devices.patch
>   0003-vfio-Add-vfio_-un-register_device.patch
>   0004-iommu-Add-iommu_device_get_info-interface.patch
>   0005-vfio-pci-Register-device-to-dev-vfio-devices.patch
> 
> * Bind device fd with iommufd:
>   0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
>   0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
>   0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch
> 
> * IOASID allocation:
>   0009-iommu-Add-page-size-and-address-width-attributes.patch
>   0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
>   0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
>   0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch
> 
> * IOASID [de]attach:
>   0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
>   0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
>   0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch
> 
> * DMA (un)map:
>   0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
>   0017-iommu-iommufd-Report-iova-range-to-userspace.patch
>   0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch
> 
> * Report the device info in vt-d driver to enable whole series:
>   0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch
> 
> * Add doc:
>   0020-Doc-Add-documentation-for-dev-iommu.patch

Please refer to the patch overview above. Sorry for the duplicated content.

thanks,
Yi Liu


[RFC 20/20] Doc: Add documentation for /dev/iommu

2021-09-19 Thread Liu Yi L
Document the /dev/iommu framework for users.

Open:
Do we want to document /dev/iommu in Documentation/userspace-api/iommu.rst?
The existing iommu.rst covers the vSVA interfaces and, honestly, may need to
be rewritten entirely.

Signed-off-by: Kevin Tian 
Signed-off-by: Liu Yi L 
---
 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 183 
 2 files changed, 184 insertions(+)
 create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst 
b/Documentation/userspace-api/index.rst
index 0b5eefed027e..54df5a278023 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
ebpf/index
ioctl/index
iommu
+   iommufd
media/index
sysfs-platform_profile
 
diff --git a/Documentation/userspace-api/iommufd.rst 
b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index ..abffbb47dc02
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. iommu:
+
+===================
+IOMMU Userspace API
+===================
+
+Direct device access from userspace has been a critical feature in
+high performance computing and virtualization usages. Linux now
+includes multiple device-passthrough frameworks (e.g. VFIO and vDPA)
+to manage secure device access from the userspace. One critical
+task of those frameworks is to put the assigned device in a secure,
+IOMMU-protected context so the device is prevented from doing harm
+to the rest of the system.
+
+Currently those frameworks implement their own logic for managing
+I/O page tables to isolate user-initiated DMAs. This doesn't scale
+to support many new IOMMU features, such as PASID-granular DMA
+remapping, nested translation, I/O page fault, IOMMU dirty bit, etc.
+
+The /dev/iommu framework provides a unified interface for managing
+I/O page tables for passthrough devices. Existing passthrough
+frameworks are expected to use this interface instead of continuing
+their ad-hoc implementations.
+
+IOMMUFDs, IOASIDs, Devices and Groups
+-------------------------------------
+
+The core concepts in /dev/iommu are IOMMUFDs and IOASIDs. IOMMUFD (by
+opening /dev/iommu) is the container holding multiple I/O address
+spaces for a user, while IOASID is the fd-local software handle
+representing an I/O address space and associated with a single I/O
+page table. User manages those address spaces through fd operations,
+e.g. by using vfio type1v2 mapping semantics to manage respective
+I/O page tables.
+
+IOASID is comparable to the container concept in VFIO. The latter
+is also associated with a single I/O address space. A main difference
+between them is that multiple IOASIDs in the same IOMMUFD can be
+nested together (not supported yet) to allow centralized accounting
+of locked pages, while multiple containers are disconnected thus
+duplicated accounting is incurred. Typically one IOMMUFD is
+sufficient for all intended IOMMU usages for a user.
+
+An I/O address space takes effect in the IOMMU only after it is
+attached by a device. One I/O address space can be attached by
+multiple devices. One device can be only attached to a single I/O
+address space at this point (on par with current vfio behavior).
+
+Device must be bound to an iommufd before the attach operation can
+be conducted. The binding operation builds the connection between
+the devicefd (opened via device-passthrough framework) and IOMMUFD.
+An IOMMU-protected security context is established when the binding
+operation is completed. The passthrough framework must block user
+access to the assigned device until bind() returns success.
+
+The entire /dev/iommu framework adopts a device-centric model without
+carrying any container/group legacy as current vfio does. However
+the group is the minimum granularity that must be used to ensure
+secure user access (refer to vfio.rst). This framework relies on
+the IOMMU core layer to map device-centric model into group-granular
+isolation.
+
+Managing I/O Address Spaces
+---------------------------
+
+When creating an I/O address space (by allocating IOASID), the user
+must specify the type of underlying I/O page table. Currently only
+one type (kernel-managed) is supported. In the future other types
+will be introduced, e.g. to support user-managed I/O page table or
+a shared I/O page table which is managed by another kernel sub-
+system (mm, ept, etc.). Kernel-managed I/O page table is currently
+managed via vfio type1v2 equivalent mapping semantics.
+
+The user also needs to specify the format of the I/O page table
+when allocating an IOASID. The format must be compatible with the
+attached devices (or more specifically to the IOMMU which serves
+the DMA from the attached devices). User can query the device IOMMU
+format via IOMMUFD once a device is successfully bound. Attaching a
+device

[RFC 19/20] iommu/vt-d: Implement device_info iommu_ops callback

2021-09-19 Thread Liu Yi L
From: Lu Baolu 

Expose per-device IOMMU attributes to the upper layers.
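
As an illustration of the page-size attribute reported by this patch, the sketch below builds the supported-page-size bitmap from a 2-bit super-page capability field, mirroring how the callback uses cap_super_page_val(). The function name and the locally defined SZ_* constants are assumptions for a standalone example, not kernel definitions.

```c
#include <stdint.h>

#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL
#define SZ_1G 0x40000000ULL

/*
 * Build a supported-page-size bitmap from a super-page capability field:
 * bit 0 advertises 2M pages, bit 1 advertises 1G pages, and 4K is always
 * present. Each set bit in the result is one supported page size.
 */
static uint64_t pgsize_bitmap_from_sps(unsigned int sps)
{
	return SZ_4K |
	       ((sps & 0x1) ? SZ_2M : 0) |
	       ((sps & 0x2) ? SZ_1G : 0);
}
```

Encoding page sizes as a bitmap lets a consumer intersect the capabilities of several IOMMUs with a single AND.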

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index dd22fc7d5176..d531ea44f418 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5583,6 +5583,40 @@ static void intel_iommu_iotlb_sync_map(struct 
iommu_domain *domain,
}
 }
 
+static int
+intel_iommu_device_info(struct device *dev, enum iommu_devattr type, void 
*data)
+{
+   struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
+   int ret = 0;
+
+   if (!iommu)
+   return -ENODEV;
+
+   switch (type) {
+   case IOMMU_DEV_INFO_PAGE_SIZE:
+   *(u64 *)data = SZ_4K |
+   (cap_super_page_val(iommu->cap) & BIT(0) ? SZ_2M : 0) |
+   (cap_super_page_val(iommu->cap) & BIT(1) ? SZ_1G : 0);
+   break;
+   case IOMMU_DEV_INFO_FORCE_SNOOP:
+   /*
+* Force snoop is always supported in the scalable mode. For 
the legacy
+* mode, check the capability register.
+*/
+   *(bool *)data = sm_supported(iommu) || 
ecap_sc_support(iommu->ecap);
+   break;
+   case IOMMU_DEV_INFO_ADDR_WIDTH:
+   *(u32 *)data = min_t(u32, agaw_to_width(iommu->agaw),
+cap_mgaw(iommu->cap));
+   break;
+   default:
+   ret = -EINVAL;
+   break;
+   }
+
+   return ret;
+}
+
 const struct iommu_ops intel_iommu_ops = {
.capable= intel_iommu_capable,
.domain_alloc   = intel_iommu_domain_alloc,
@@ -5621,6 +5655,7 @@ const struct iommu_ops intel_iommu_ops = {
.sva_get_pasid  = intel_svm_get_pasid,
.page_response  = intel_svm_page_response,
 #endif
+   .device_info= intel_iommu_device_info,
 };
 
 static void quirk_iommu_igfx(struct pci_dev *dev)
-- 
2.25.1



[RFC 18/20] iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID

2021-09-19 Thread Liu Yi L
[HACK. will fix in v2]

This patch introduces a vfio type1v2-equivalent interface to userspace. Due
to the aforementioned hack, iommufd currently calls exported vfio symbols to
handle map/unmap requests from the user.
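
The request checks done before a map/unmap is forwarded can be sketched in plain C as below. This is a userspace model only: struct dma_op_hdr, check_dma_op() and the attached_devices counter are hypothetical stand-ins for the uapi header and the per-IOASID device list used in the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request header; field names follow the checks in the patch. */
struct dma_op_hdr {
	uint32_t argsz;
	uint32_t flags;
	int32_t  ioasid;
	uint32_t padding;
};

/*
 * Validate a map/unmap request header and refuse it while the IOASID has
 * no attached device. 'attached_devices' stands in for the kernel's
 * per-IOASID device list.
 */
static int check_dma_op(const struct dma_op_hdr *hdr, int attached_devices)
{
	/* Equivalent of offsetofend(struct dma_op_hdr, padding). */
	size_t minsz = offsetof(struct dma_op_hdr, padding) + sizeof(hdr->padding);

	if (hdr->argsz < minsz || hdr->flags || hdr->ioasid < 0)
		return -EINVAL;
	if (attached_devices == 0)
		return -EINVAL;	/* nothing attached yet: block map/unmap */
	return 0;
}
```

Rejecting requests before any device is attached matters because the I/O page table has no IOMMU behind it until then.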

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommufd/iommufd.c | 104 
 include/uapi/linux/iommu.h  |  29 +
 2 files changed, 133 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index cbf5e30062a6..f5f2274d658c 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -55,6 +55,7 @@ struct iommufd_ioas {
struct mutex lock;
struct list_head device_list;
struct iommu_domain *domain;
+   struct vfio_iommu *vfio_iommu; /* FIXME: added for reusing 
vfio_iommu_type1 code */
 };
 
 /*
@@ -158,6 +159,7 @@ static void ioas_put_locked(struct iommufd_ioas *ioas)
return;
 
WARN_ON(!list_empty(&ioas->device_list));
+   vfio_iommu_type1_release(ioas->vfio_iommu); /* FIXME: reused vfio code */
xa_erase(&ictx->ioasid_xa, ioasid);
iommufd_ctx_put(ictx);
kfree(ioas);
@@ -185,6 +187,7 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, 
unsigned long arg)
struct iommufd_ioas *ioas;
unsigned long minsz;
int ioasid, ret;
+   struct vfio_iommu *vfio_iommu;
 
minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
 
@@ -211,6 +214,18 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, 
unsigned long arg)
return ret;
}
 
+   /* FIXME: get a vfio_iommu object for dma map/unmap management */
+   vfio_iommu = vfio_iommu_type1_open(VFIO_TYPE1v2_IOMMU);
+   if (IS_ERR(vfio_iommu)) {
+   pr_err_ratelimited("Failed to get vfio_iommu object\n");
+   mutex_lock(&ictx->lock);
+   xa_erase(&ictx->ioasid_xa, ioasid);
+   mutex_unlock(&ictx->lock);
+   kfree(ioas);
+   return PTR_ERR(vfio_iommu);
+   }
+   ioas->vfio_iommu = vfio_iommu;
+
ioas->ioasid = ioasid;
 
/* only supports kernel managed I/O page table so far */
@@ -383,6 +398,49 @@ static int iommufd_get_device_info(struct iommufd_ctx 
*ictx,
return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
+static int iommufd_process_dma_op(struct iommufd_ctx *ictx,
+ unsigned long arg, bool map)
+{
+   struct iommu_ioasid_dma_op dma;
+   unsigned long minsz;
+   struct iommufd_ioas *ioas = NULL;
+   int ret;
+
+   minsz = offsetofend(struct iommu_ioasid_dma_op, padding);
+
+   if (copy_from_user(&dma, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (dma.argsz < minsz || dma.flags || dma.ioasid < 0)
+   return -EINVAL;
+
+   ioas = ioasid_get_ioas(ictx, dma.ioasid);
+   if (!ioas) {
+   pr_err_ratelimited("unknown IOASID %u\n", dma.ioasid);
+   return -EINVAL;
+   }
+
+   mutex_lock(&ioas->lock);
+
+   /*
+* Needs to block map/unmap request from userspace before IOASID
+* is attached to any device.
+*/
+   if (list_empty(&ioas->device_list)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   if (map)
+   ret = vfio_iommu_type1_map_dma(ioas->vfio_iommu, arg + minsz);
+   else
+   ret = vfio_iommu_type1_unmap_dma(ioas->vfio_iommu, arg + minsz);
+out:
+   mutex_unlock(&ioas->lock);
+   ioas_put(ioas);
+   return ret;
+}
+
 static long iommufd_fops_unl_ioctl(struct file *filep,
   unsigned int cmd, unsigned long arg)
 {
@@ -409,6 +467,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
case IOMMU_IOASID_FREE:
ret = iommufd_ioasid_free(ictx, arg);
break;
+   case IOMMU_MAP_DMA:
+   ret = iommufd_process_dma_op(ictx, arg, true);
+   break;
+   case IOMMU_UNMAP_DMA:
+   ret = iommufd_process_dma_op(ictx, arg, false);
+   break;
default:
pr_err_ratelimited("unsupported cmd %u\n", cmd);
break;
@@ -478,6 +542,39 @@ static int ioas_check_device_compatibility(struct 
iommufd_ioas *ioas,
return 0;
 }
 
+/* HACK:
+ * vfio_iommu_add/remove_device() is hacky implementation for
+ * this version to add the device/group to vfio iommu type1.
+ */
+static int vfio_iommu_add_device(struct vfio_iommu *vfio_iommu,
+struct device *dev,
+struct iommu_domain *domain)
+{
+   struct iommu_group *group;
+   int ret;
+
+   group = iommu_group_get(dev);
+   if (!group)
+   return -EINVAL;
+
+   ret = vfio_iommu_add_group(vfio_iommu, group, domain);
+   iommu_group_put(group);
+

[RFC 17/20] iommu/iommufd: Report iova range to userspace

2021-09-19 Thread Liu Yi L
[HACK. will fix in v2]

IOVA range is critical info for userspace to manage DMA for an I/O address
space. This patch reports the valid iova range info of a given device.

Due to the aforementioned hack, this info comes from the hacked vfio type1
driver. To follow the same format in vfio, we also introduce a cap chain
format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
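
The size negotiation behind the cap chain can be modelled in userspace as below. This is a sketch under stated assumptions: struct info_hdr, INFO_FLAG_CAPS and fill_caps() are hypothetical names, and memcpy() stands in for copy_to_user().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed header that precedes the capability data. */
struct info_hdr {
	uint32_t argsz;       /* in: buffer size, out: size required */
	uint32_t flags;
	uint32_t cap_offset;  /* 0 when no capability data was copied */
};

#define INFO_FLAG_CAPS 0x1u

/*
 * If the caller's buffer cannot hold header plus caps, report the size
 * needed and copy nothing; otherwise place the caps right after the
 * header and record their offset. The caps flag is set in both cases.
 */
static void fill_caps(struct info_hdr *hdr, unsigned char *buf,
		      const void *caps, uint32_t caps_size)
{
	uint32_t need = (uint32_t)sizeof(*hdr) + caps_size;

	if (!caps_size)
		return;

	hdr->flags |= INFO_FLAG_CAPS;
	if (hdr->argsz < need) {
		hdr->argsz = need;	/* caller retries with a bigger buffer */
		return;
	}
	memcpy(buf + sizeof(*hdr), caps, caps_size);
	hdr->cap_offset = (uint32_t)sizeof(*hdr);
}
```

A caller would typically issue the request once with only the header, read back argsz, then repeat the call with a buffer of that size.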

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommu.c   |  2 ++
 drivers/iommu/iommufd/iommufd.c | 41 +++-
 drivers/vfio/vfio_iommu_type1.c | 47 ++---
 include/linux/vfio.h|  2 ++
 include/uapi/linux/iommu.h  |  3 +++
 5 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b6178997aef1..44bba346ab52 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2755,6 +2755,7 @@ void iommu_get_resv_regions(struct device *dev, struct 
list_head *list)
if (ops && ops->get_resv_regions)
ops->get_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_get_resv_regions);
 
 void iommu_put_resv_regions(struct device *dev, struct list_head *list)
 {
@@ -2763,6 +2764,7 @@ void iommu_put_resv_regions(struct device *dev, struct 
list_head *list)
if (ops && ops->put_resv_regions)
ops->put_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_put_resv_regions);
 
 /**
  * generic_iommu_put_resv_regions - Reserved region driver helper
diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 25373a0e037a..cbf5e30062a6 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Per iommufd */
 struct iommufd_ctx {
@@ -298,6 +299,38 @@ iommu_find_device_from_cookie(struct iommufd_ctx *ictx, 
u64 dev_cookie)
return dev;
 }
 
+static int iommu_device_add_cap_chain(struct device *dev, unsigned long arg,
+ struct iommu_device_info *info)
+{
+   struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+   int ret;
+
+   ret = vfio_device_add_iova_cap(dev, &caps);
+   if (ret)
+   return ret;
+
+   if (caps.size) {
+   info->flags |= IOMMU_DEVICE_INFO_CAPS;
+
+   if (info->argsz < sizeof(*info) + caps.size) {
+   info->argsz = sizeof(*info) + caps.size;
+   } else {
+   vfio_info_cap_shift(&caps, sizeof(*info));
+   if (copy_to_user((void __user *)arg +
+   sizeof(*info), caps.buf,
+   caps.size)) {
+   kfree(caps.buf);
+   info->flags &= ~IOMMU_DEVICE_INFO_CAPS;
+   return -EFAULT;
+   }
+   info->cap_offset = sizeof(*info);
+   }
+
+   kfree(caps.buf);
+   }
+   return 0;
+}
+
 static void iommu_device_build_info(struct device *dev,
struct iommu_device_info *info)
 {
@@ -324,8 +357,9 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
struct iommu_device_info info;
unsigned long minsz;
struct device *dev;
+   int ret;
 
-   minsz = offsetofend(struct iommu_device_info, addr_width);
+   minsz = offsetofend(struct iommu_device_info, cap_offset);
 
if (copy_from_user(&info, (void __user *)arg, minsz))
return -EFAULT;
@@ -341,6 +375,11 @@ static int iommufd_get_device_info(struct iommufd_ctx 
*ictx,
 
iommu_device_build_info(dev, &info);
 
+   info.cap_offset = 0;
+   ret = iommu_device_add_cap_chain(dev, arg, &info);
+   if (ret)
+   pr_info_ratelimited("No cap chain added, error %d\n", ret);
+
return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c1c6bc803d94..28c1699aed6b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2963,15 +2963,15 @@ static int vfio_iommu_iova_add_cap(struct vfio_info_cap 
*caps,
return 0;
 }
 
-static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
- struct vfio_info_cap *caps)
+static int vfio_iova_list_build_caps(struct list_head *iova_list,
+struct vfio_info_cap *caps)
 {
struct vfio_iommu_type1_info_cap_iova_range *cap_iovas;
struct vfio_iova *iova;
size_t size;
int iovas = 0, i = 0, ret;
 
-   list_for_each_entry(iova, &iommu->iova_list, list)
+   list_for_each_entry(iova, iova_list, list)
iovas++;
 
if (!iovas) {
@@ -2990,7 +2990,7 @@ static int vfio_iommu_iova_build_caps(s

[RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing

2021-09-19 Thread Liu Yi L
[HACK. will fix in v2]

There are two options to implement vfio type1v2 mapping semantics in
/dev/iommu.

One is to duplicate the related code from vfio as the starting point,
and then merge with vfio type1 at a later time. However vfio_iommu_type1.c
has over 3000LOC with ~80% related to dma management logic, including:

- the dma map/unmap metadata management
- page pinning, and related accounting
- iova range reporting
- dirty bitmap retrieving
- dynamic vaddr update, etc.

We are not sure whether duplicating this much code in the transition phase
is acceptable.

The alternative is to consolidate type1v2 logic in /dev/iommu immediately,
which requires converting vfio_iommu_type1 to be a shim driver. The upside
is no code duplication and it is anyway the long-term goal even with the
first approach. The downside is that more effort is required for the
initial skeleton, so all new iommu features will be blocked for a longer
time. The main task is to figure out how to handle the remaining 20% of the
code (tied to groups) in vfio_iommu_type1 with the device-centric model in
iommufd (with groups managed by the iommu core). It also implies that no-snoop
DMA must be handled now, with extra work on a reworked kvm-vfio contract, and
that external page pinning, as required by sw mdev, must be supported.

Due to limited time, we choose a hacky approach in this RFC by directly
calling vfio_iommu_type1 functions in iommufd and raising this open for
discussion. This should not impact the review on other key aspects of the
new framework. Once we reach consensus, we'll follow it with a clean
implementation in the next version.

Signed-off-by: Liu Yi L 
---
 drivers/vfio/vfio_iommu_type1.c | 199 +++-
 include/linux/vfio.h|  13 +++
 2 files changed, 206 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4f7c174c7a..c1c6bc803d94 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -115,6 +115,7 @@ struct vfio_iommu_group {
struct list_headnext;
boolmdev_group; /* An mdev group */
boolpinned_page_dirty_scope;
+   int attach_cnt;
 };
 
 struct vfio_iova {
@@ -2240,6 +2241,135 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
 }
 
+/* HACK: called by /dev/iommu core to init group to vfio_iommu_type1 */
+int vfio_iommu_add_group(struct vfio_iommu *iommu,
+struct iommu_group *iommu_group,
+struct iommu_domain *iommu_domain)
+{
+   struct vfio_iommu_group *group;
+   struct vfio_domain *domain = NULL;
+   struct bus_type *bus = NULL;
+   int ret = 0;
+   bool resv_msi, msi_remap;
+   phys_addr_t resv_msi_base = 0;
+   struct iommu_domain_geometry *geo;
+   LIST_HEAD(iova_copy);
+   LIST_HEAD(group_resv_regions);
+
+   /* Determine bus_type */
+   ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
+   if (ret)
+   return ret;
+
+   mutex_lock(&iommu->lock);
+
+   /* Check for duplicates */
+   group = vfio_iommu_find_iommu_group(iommu, iommu_group);
+   if (group) {
+   group->attach_cnt++;
+   mutex_unlock(&iommu->lock);
+   return 0;
+   }
+
+   /* Get aperture info */
+   geo = &iommu_domain->geometry;
+   if (vfio_iommu_aper_conflict(iommu, geo->aperture_start,
+geo->aperture_end)) {
+   ret = -EINVAL;
+   goto out_free;
+   }
+
+   ret = iommu_get_group_resv_regions(iommu_group, &group_resv_regions);
+   if (ret)
+   goto out_free;
+
+   if (vfio_iommu_resv_conflict(iommu, &group_resv_regions)) {
+   ret = -EINVAL;
+   goto out_free;
+   }
+
+   /*
+* We don't want to work on the original iova list as the list
+* gets modified and in case of failure we have to retain the
+* original list. Get a copy here.
+*/
+   ret = vfio_iommu_iova_get_copy(iommu, &iova_copy);
+   if (ret)
+   goto out_free;
+
+   ret = vfio_iommu_aper_resize(&iova_copy, geo->aperture_start,
+geo->aperture_end);
+   if (ret)
+   goto out_free;
+
+   ret = vfio_iommu_resv_exclude(&iova_copy, &group_resv_regions);
+   if (ret)
+   goto out_free;
+
+   resv_msi = vfio_iommu_has_sw_msi(&group_resv_regions, &resv_msi_base);
+
+   msi_remap = irq_domain_check_msi_remap() ||
+   iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
+
+   if (!allow_unsafe_interrupts && !msi_remap) {
+   pr_warn("%s: No interrupt remapping support.  Use the module 
param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this 
platform\n",
+ 

[RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID

2021-09-19 Thread Liu Yi L
This patch adds an interface for userspace to attach a device to a specified
IOASID.

Note:
One device can only be attached to one IOASID in this version. This is
on par with what vfio provides today. In the future this restriction can
be relaxed when multiple I/O address spaces are supported per device.
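
The one-IOASID-per-device rule in the note above can be sketched as a tiny state machine. This is a userspace model, not the vfio-pci code: struct dev_state and the helper names are hypothetical, and the value -1 for the invalid IOASID is an assumption (the patch only names IOMMUFD_INVALID_IOASID).

```c
#include <errno.h>

#define INVALID_IOASID (-1)	/* assumed value, for illustration only */

/* Hypothetical per-device state; only the attached IOASID is tracked. */
struct dev_state {
	int ioasid;
};

static int dev_attach(struct dev_state *d, int ioasid)
{
	if (ioasid < 0)
		return -EINVAL;
	if (d->ioasid != INVALID_IOASID)
		return -EBUSY;	/* only one IOASID per device for now */
	d->ioasid = ioasid;
	return 0;
}

static int dev_detach(struct dev_state *d, int ioasid)
{
	if (d->ioasid == INVALID_IOASID || d->ioasid != ioasid)
		return -EINVAL;	/* not attached, or attached to another IOASID */
	d->ioasid = INVALID_IOASID;
	return 0;
}
```

Relaxing the restriction later would mean replacing the single ioasid field with a set keyed by extra routing information.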

Signed-off-by: Liu Yi L 
---
 drivers/vfio/pci/vfio_pci.c | 82 +
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 include/linux/iommufd.h |  1 +
 include/uapi/linux/vfio.h   | 26 +
 4 files changed, 110 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 20006bb66430..5b1fda333122 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -557,6 +557,11 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
if (vdev->videv) {
struct vfio_iommufd_device *videv = vdev->videv;
 
+   if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+   iommufd_device_detach_ioasid(videv->idev,
+videv->ioasid);
+   videv->ioasid = IOMMUFD_INVALID_IOASID;
+   }
vdev->videv = NULL;
iommufd_unbind_device(videv->idev);
kfree(videv);
@@ -839,6 +844,7 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
}
videv->idev = idev;
videv->iommu_fd = bind_data.iommu_fd;
+   videv->ioasid = IOMMUFD_INVALID_IOASID;
/*
 * A security context has been established. Unblock
 * user access.
@@ -848,6 +854,82 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
vdev->videv = videv;
mutex_unlock(&vdev->videv_lock);
 
+   return 0;
+   } else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
+   struct vfio_device_attach_ioasid attach;
+   unsigned long minsz;
+   struct vfio_iommufd_device *videv;
+   int ret = 0;
+
+   /* not allowed if the device is opened in legacy interface */
+   if (vfio_device_in_container(core_vdev))
+   return -ENOTTY;
+
+   minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+   if (copy_from_user(&attach, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (attach.argsz < minsz || attach.flags ||
+   attach.iommu_fd < 0 || attach.ioasid < 0)
+   return -EINVAL;
+
+   mutex_lock(&vdev->videv_lock);
+
+   videv = vdev->videv;
+   if (!videv || videv->iommu_fd != attach.iommu_fd) {
+   mutex_unlock(&vdev->videv_lock);
+   return -EINVAL;
+   }
+
+   /* Currently only allows one IOASID attach */
+   if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+   mutex_unlock(&vdev->videv_lock);
+   return -EBUSY;
+   }
+
+   ret = __pci_iommufd_device_attach_ioasid(vdev->pdev,
+videv->idev,
+attach.ioasid);
+   if (!ret)
+   videv->ioasid = attach.ioasid;
+   mutex_unlock(&vdev->videv_lock);
+
+   return ret;
+   } else if (cmd == VFIO_DEVICE_DETACH_IOASID) {
+   struct vfio_device_attach_ioasid attach;
+   unsigned long minsz;
+   struct vfio_iommufd_device *videv;
+
+   /* not allowed if the device is opened in legacy interface */
+   if (vfio_device_in_container(core_vdev))
+   return -ENOTTY;
+
+   minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+   if (copy_from_user(&attach, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (attach.argsz < minsz || attach.flags ||
+   attach.iommu_fd < 0 || attach.ioasid < 0)
+   return -EINVAL;
+
+   mutex_lock(&vdev->videv_lock);
+
+   videv = vdev->videv;
+   if (!videv || videv->iommu_fd != attach.iommu_fd) {
+   mutex_unlock(&vdev->videv_lock);
+   return -EINVAL;
+   }
+
+   if (videv->ioasid == IOMMUFD_INVALID_IOASID ||
+   videv->ioasid != attach.ioasid) {
+   mutex_unlock(&vdev->videv_lock);
+   return -EINVAL;
+   }
+
+   videv->ioasid = IOMMUFD_INVALID_IOASID;
+   iommufd_device_d

[RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()

2021-09-19 Thread Liu Yi L
An I/O address space takes effect in the iommu only after it's attached
by a device. This patch provides iommufd_device_[de/at]tach_ioasid()
helpers for this purpose. One device can be only attached to one ioasid
at this point, but one ioasid can be attached by multiple devices.

The caller specifies the iommufd_device (returned at binding time) and
the target ioasid when calling the helper function. Upon request, iommufd
installs the specified I/O page table to the correct place in the IOMMU,
according to the routing information (struct device* which represents
RID) recorded in iommufd_device. Future variants could allow the caller
to specify additional routing information (e.g. pasid/ssid) when multiple
I/O address spaces are supported per device.

Open:
Per Jason's comment in the link below, bus-specific wrappers are recommended.
This RFC implements one wrapper for pci devices. But it looks like struct
pci_device is not used at all, since iommufd_device already carries all
necessary info. So we want to have another discussion on its necessity, e.g.
whether it makes more sense to have bus-specific wrappers for binding, while
leaving a common attaching helper per iommufd_device.
https://lore.kernel.org/linux-iommu/20210528233649.gb3816...@nvidia.com/

TODO:
When multiple devices are attached to the same ioasid, the permitted iova
ranges and supported pgsize bitmap on this ioasid should be a common
subset of all attached devices. iommufd needs to track such info per
ioasid and update it every time when a new device is attached to the
ioasid. This has not been done in this version yet, due to the temporary
hack adopted in patches 16-18. The hack reuses the vfio type1 driver, which
already includes the necessary logic for iova ranges and pgsize bitmap.
Once we get a clear direction for those patches, that logic will be moved
to this patch.
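
The "common subset" tracking described in the TODO could look roughly like the sketch below. It is deliberately simplified: struct ioas_limits and ioas_limits_merge() are hypothetical, and a single contiguous IOVA window replaces the list of ranges (with reserved holes) that the vfio type1 driver maintains.

```c
#include <stdint.h>

/* Hypothetical per-IOASID constraints, narrowed as devices attach. */
struct ioas_limits {
	uint64_t pgsizes;     /* bitmap of supported page sizes */
	uint64_t iova_start;  /* inclusive */
	uint64_t iova_end;    /* inclusive */
};

/*
 * Narrow 'cur' to what both it and the new device support. Returns 0 on
 * success, -1 if no common page size or IOVA range remains (in which
 * case 'cur' is left untouched and the attach should be refused).
 */
static int ioas_limits_merge(struct ioas_limits *cur,
			     const struct ioas_limits *dev)
{
	uint64_t pgsizes = cur->pgsizes & dev->pgsizes;
	uint64_t start = cur->iova_start > dev->iova_start ?
			 cur->iova_start : dev->iova_start;
	uint64_t end = cur->iova_end < dev->iova_end ?
		       cur->iova_end : dev->iova_end;

	if (!pgsizes || start > end)
		return -1;

	cur->pgsizes = pgsizes;
	cur->iova_start = start;
	cur->iova_end = end;
	return 0;
}
```

Computing the merged values first and committing only on success keeps the IOASID state intact when an incompatible device is rejected.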

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommufd/iommufd.c | 226 
 include/linux/iommufd.h |  29 
 2 files changed, 255 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index e45d76359e34..25373a0e037a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -51,6 +51,19 @@ struct iommufd_ioas {
bool enforce_snoop;
struct iommufd_ctx *ictx;
refcount_t refs;
+   struct mutex lock;
+   struct list_head device_list;
+   struct iommu_domain *domain;
+};
+
+/*
+ * An ioas_device_info object is created per each successful attaching
+ * request. A list of objects are maintained per ioas when the address
+ * space is shared by multiple devices.
+ */
+struct ioas_device_info {
+   struct iommufd_device *idev;
+   struct list_head next;
 };
 
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -119,6 +132,21 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
kfree(ictx);
 }
 
+static struct iommufd_ioas *ioasid_get_ioas(struct iommufd_ctx *ictx, int 
ioasid)
+{
+   struct iommufd_ioas *ioas;
+
+   if (ioasid < 0)
+   return NULL;
+
+   mutex_lock(&ictx->lock);
+   ioas = xa_load(&ictx->ioasid_xa, ioasid);
+   if (ioas)
+   refcount_inc(&ioas->refs);
+   mutex_unlock(&ictx->lock);
+   return ioas;
+}
+
 /* Caller should hold ictx->lock */
 static void ioas_put_locked(struct iommufd_ioas *ioas)
 {
@@ -128,11 +156,28 @@ static void ioas_put_locked(struct iommufd_ioas *ioas)
if (!refcount_dec_and_test(&ioas->refs))
return;
 
+   WARN_ON(!list_empty(&ioas->device_list));
xa_erase(&ictx->ioasid_xa, ioasid);
iommufd_ctx_put(ictx);
kfree(ioas);
 }
 
+/*
+ * Caller should hold an ictx reference when calling this function,
+ * otherwise ictx might be freed in ioas_put_locked() and the final
+ * unlock becomes problematic. Alternatively we could have a fresh
+ * implementation of ioas_put instead of calling the locked function.
+ * In this case it can ensure ictx is freed after mutex_unlock().
+ */
+static void ioas_put(struct iommufd_ioas *ioas)
+{
+   struct iommufd_ctx *ictx = ioas->ictx;
+
+   mutex_lock(&ictx->lock);
+   ioas_put_locked(ioas);
+   mutex_unlock(&ictx->lock);
+}
+
 static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 {
struct iommu_ioasid_alloc req;
@@ -178,6 +223,9 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, 
unsigned long arg)
iommufd_ctx_get(ictx);
ioas->ictx = ictx;
 
+   mutex_init(&ioas->lock);
+   INIT_LIST_HEAD(&ioas->device_list);
+
refcount_set(&ioas->refs, 1);
 
return ioasid;
@@ -344,6 +392,166 @@ static struct miscdevice iommu_misc_dev = {
.mode = 0666,
 };
 
+/* Caller should hold ioas->lock */
+static struct ioas_device_info *ioas_find_device(struct iommufd_ioas *ioas,
+struct iommufd_device *idev)
+{
+   struct ioas_device_inf

[RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group

2021-09-19 Thread Liu Yi L
From: Lu Baolu 

These two helpers could be used when 1) the iommu group is singleton,
or 2) the upper layer has put the iommu group into the secure state by
calling iommu_device_init_user_dma().

As we want the iommufd design to be a device-centric model, we want to
remove any group knowledge in iommufd. Given that we already have
iommu_at[de]tach_device() interface, we could extend it for iommufd
simply by doing below:

 - first device in a group does group attach;
 - last device in a group does group detach.

as long as the group has been put into the secure context.

The commit <426a273834eae> ("iommu: Limit iommu_attach/detach_device to
device with their own group") deliberately restricts the two interfaces
to single-device group. To avoid the conflict with existing usages, we
keep this policy and put the new extension only when the group has been
marked for user_dma.
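
The "first device attaches, last device detaches" rule can be modelled as below. This is a simplified userspace sketch: struct group_model and both helpers are hypothetical, a plain counter replaces refcount_t, locking is omitted, and the group is assumed to be in the user_dma state already.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified view of a group already marked for user DMA. */
struct group_model {
	const void *domain;      /* currently attached domain, or NULL */
	unsigned int attach_cnt; /* devices that requested the attach */
};

static int group_attach(struct group_model *g, const void *domain)
{
	if (g->domain) {
		if (g->domain != domain)
			return -EBUSY;	/* group already uses another domain */
		g->attach_cnt++;	/* later device: just count it */
		return 0;
	}
	g->domain = domain;		/* first device: real group attach */
	g->attach_cnt = 1;
	return 0;
}

/* Returns 1 when this call performed the real group detach, else 0. */
static int group_detach(struct group_model *g)
{
	if (g->attach_cnt == 0 || --g->attach_cnt > 0)
		return 0;
	g->domain = NULL;		/* last device: real group detach */
	return 1;
}
```

Keeping the count in the group, rather than in the caller, is what lets iommufd stay free of group knowledge.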

Signed-off-by: Lu Baolu 
---
 drivers/iommu/iommu.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index bffd84e978fb..b6178997aef1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -47,6 +47,7 @@ struct iommu_group {
struct list_head entry;
unsigned long user_dma_owner_id;
refcount_t owner_cnt;
+   refcount_t attach_cnt;
 };
 
 struct group_device {
@@ -1994,7 +1995,7 @@ static int __iommu_attach_device(struct iommu_domain 
*domain,
 int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 {
struct iommu_group *group;
-   int ret;
+   int ret = 0;
 
group = iommu_group_get(dev);
if (!group)
@@ -2005,11 +2006,23 @@ int iommu_attach_device(struct iommu_domain *domain, 
struct device *dev)
 * change while we are attaching
 */
mutex_lock(&group->mutex);
-   ret = -EINVAL;
-   if (iommu_group_device_count(group) != 1)
+   if (group->user_dma_owner_id) {
+   if (group->domain) {
+   if (group->domain != domain)
+   ret = -EBUSY;
+   else
+   refcount_inc(&group->attach_cnt);
+
+   goto out_unlock;
+   }
+   } else if (iommu_group_device_count(group) != 1) {
+   ret = -EINVAL;
goto out_unlock;
+   }
 
ret = __iommu_attach_group(domain, group);
+   if (!ret && group->user_dma_owner_id)
+   refcount_set(&group->attach_cnt, 1);
 
 out_unlock:
mutex_unlock(&group->mutex);
@@ -2261,7 +2274,10 @@ void iommu_detach_device(struct iommu_domain *domain, 
struct device *dev)
return;
 
mutex_lock(&group->mutex);
-   if (iommu_group_device_count(group) != 1) {
+   if (group->user_dma_owner_id) {
+   if (!refcount_dec_and_test(&group->attach_cnt))
+   goto out_unlock;
+   } else if (iommu_group_device_count(group) != 1) {
WARN_ON(1);
goto out_unlock;
}
@@ -3368,6 +3384,7 @@ static int iommu_group_init_user_dma(struct iommu_group 
*group,
 
group->user_dma_owner_id = owner;
refcount_set(&group->owner_cnt, 1);
+   refcount_set(&group->attach_cnt, 0);
 
/* default domain is unsafe for user-initiated dma */
if (group->domain == group->default_domain)
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION

2021-09-19 Thread Liu Yi L
As mentioned earlier, userspace should check which extensions are
supported to learn what formats can be specified when allocating an
IOASID. This patch adds such an interface for userspace. In this RFC,
iommufd reports support for EXT_MAP_TYPE1V2 only; no-snoop support
(EXT_DMA_NO_SNOOP) is not reported yet.
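
As a rough userspace sketch of how this check might be issued: the macro
values below are copied from the uAPI hunk in this patch, while
sketch_has_extension() is a hypothetical helper name used only here. It
assumes the caller already holds an fd for /dev/iommu.

```c
#include <sys/ioctl.h>

/* Values copied from the uAPI hunk in this patch. */
#define IOMMU_TYPE		(';')
#define IOMMU_BASE		100
#define EXT_MAP_TYPE1V2		1
#define EXT_DMA_NO_SNOOP	2
#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE, IOMMU_BASE + 0)

/*
 * Hypothetical helper: returns 1 if the extension is supported,
 * 0 if it is not, and -1 if the ioctl itself failed.
 */
static int sketch_has_extension(int iommu_fd, unsigned long ext)
{
	int ret = ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, ext);

	return ret < 0 ? -1 : (ret > 0);
}
```

A caller would typically probe EXT_MAP_TYPE1V2 once after opening the fd
and fall back to the legacy vfio container path if the probe does not
return 1.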

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommufd/iommufd.c |  7 +++
 include/uapi/linux/iommu.h  | 27 +++
 2 files changed, 34 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 4839f128b24a..e45d76359e34 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
return ret;
 
switch (cmd) {
+   case IOMMU_CHECK_EXTENSION:
+   switch (arg) {
+   case EXT_MAP_TYPE1V2:
+   return 1;
+   default:
+   return 0;
+   }
case IOMMU_DEVICE_GET_INFO:
ret = iommufd_get_device_info(ictx, arg);
break;
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 5cbd300eb0ee..49731be71213 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -14,6 +14,33 @@
 #define IOMMU_TYPE (';')
 #define IOMMU_BASE 100
 
+/*
+ * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
+ *
+ * Check whether a uAPI extension is supported.
+ *
+ * It's unlikely that all planned capabilities in IOMMU fd will be ready
+ * in one breath. User should check which uAPI extension is supported
+ * according to its intended usage.
+ *
+ * A rough list of possible extensions may include:
+ *
+ * - EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
+ * - EXT_DMA_NO_SNOOP for no-snoop DMA support;
+ * - EXT_MAP_NEWTYPE for an enhanced map semantics;
+ * - EXT_MULTIDEV_GROUP for 1:N iommu group;
+ * - EXT_IOASID_NESTING for what the name stands;
+ * - EXT_USER_PAGE_TABLE for user managed page table;
+ * - EXT_USER_PASID_TABLE for user managed PASID table;
+ * - EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
+ * - ...
+ *
+ * Return: 0 if not supported, 1 if supported.
+ */
+#define EXT_MAP_TYPE1V2    1
+#define EXT_DMA_NO_SNOOP   2
+#define IOMMU_CHECK_EXTENSION  _IO(IOMMU_TYPE, IOMMU_BASE + 0)
+
 /*
  * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
  * struct iommu_device_info)
-- 
2.25.1



[RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-19 Thread Liu Yi L
This patch adds IOASID allocation/free interface per iommufd. When
allocating an IOASID, userspace is expected to specify the type and
format information for the target I/O page table.

This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
implying a kernel-managed I/O page table with vfio type1v2 mapping
semantics. For this type the user should specify the addr_width of
the I/O address space and whether the I/O page table is created in
an iommu enforce_snoop format. enforce_snoop must be true at this point,
as the false setting requires additional contract with KVM on handling
WBINVD emulation, which can be added later.
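
The argument checks described above can be sketched in standalone C as
follows. The struct layout and the two constant values are assumptions
made for this sketch (only the four field names are taken from the
kernel-side check in this patch); the real definitions live in the uAPI
header.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; the real ones come from the uAPI header. */
#define SKETCH_IOASID_ENFORCE_SNOOP		(1u << 0)
#define SKETCH_IOASID_TYPE_KERNEL_TYPE1V2	1u

/* Assumed layout: only the four fields the kernel-side check reads. */
struct sketch_ioasid_alloc {
	uint32_t argsz;
	uint32_t flags;
	uint32_t type;
	uint32_t addr_width;
};

static int sketch_validate_ioasid_alloc(const struct sketch_ioasid_alloc *req)
{
	/* Same idea as offsetofend(): size up to and including addr_width. */
	size_t minsz = offsetof(struct sketch_ioasid_alloc, addr_width) +
		       sizeof(req->addr_width);

	if (req->argsz < minsz || !req->addr_width)
		return -EINVAL;
	if (req->flags != SKETCH_IOASID_ENFORCE_SNOOP)
		return -EINVAL;	/* enforce_snoop must be set for now */
	if (req->type != SKETCH_IOASID_TYPE_KERNEL_TYPE1V2)
		return -EINVAL;	/* the only type supported in this RFC */
	return 0;
}
```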

Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
for what formats can be specified when allocating an IOASID.

Open:
- Devices on PPC platform currently use a different iommu driver in vfio.
  Per previous discussion they can also use vfio type1v2 as long as there
  is a way to claim a specific iova range from a system-wide address space.
  This requirement doesn't sound PPC specific, as addr_width for pci devices
  can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
  adopted this design yet. We hope to have formal alignment in v1 discussion
  and then decide how to incorporate it in v2.

- The term ioasid is already used in the kernel (drivers/iommu/ioasid.c)
  to represent the hardware I/O address space ID on the wire. It covers
  both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
  ID). We need to find a way to resolve the naming conflict between the
  hardware ID and the software handle. One option is to rename the
  existing ioasid to pasid or ssid, given that their full names still
  sound generic. More thoughts on this open are appreciated!

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommufd/iommufd.c | 120 
 include/linux/iommufd.h |   3 +
 include/uapi/linux/iommu.h  |  54 ++
 3 files changed, 177 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 641f199f2d41..4839f128b24a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -24,6 +24,7 @@
 struct iommufd_ctx {
refcount_t refs;
struct mutex lock;
+   struct xarray ioasid_xa; /* xarray of ioasids */
struct xarray device_xa; /* xarray of bound devices */
 };
 
@@ -42,6 +43,16 @@ struct iommufd_device {
u64 dev_cookie;
 };
 
+/* Represent an I/O address space */
+struct iommufd_ioas {
+   int ioasid;
+   u32 type;
+   u32 addr_width;
+   bool enforce_snoop;
+   struct iommufd_ctx *ictx;
+   refcount_t refs;
+};
+
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
 {
struct iommufd_ctx *ictx;
@@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file 
*filep)
 
refcount_set(&ictx->refs, 1);
mutex_init(&ictx->lock);
+   xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC);
xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
filep->private_data = ictx;
 
@@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
if (!refcount_dec_and_test(&ictx->refs))
return;
 
+   WARN_ON(!xa_empty(&ictx->ioasid_xa));
WARN_ON(!xa_empty(&ictx->device_xa));
kfree(ictx);
 }
 
+/* Caller should hold ictx->lock */
+static void ioas_put_locked(struct iommufd_ioas *ioas)
+{
+   struct iommufd_ctx *ictx = ioas->ictx;
+   int ioasid = ioas->ioasid;
+
+   if (!refcount_dec_and_test(&ioas->refs))
+   return;
+
+   xa_erase(&ictx->ioasid_xa, ioasid);
+   iommufd_ctx_put(ictx);
+   kfree(ioas);
+}
+
+static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
+{
+   struct iommu_ioasid_alloc req;
+   struct iommufd_ioas *ioas;
+   unsigned long minsz;
+   int ioasid, ret;
+
+   minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
+
+   if (copy_from_user(&req, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (req.argsz < minsz || !req.addr_width ||
+   req.flags != IOMMU_IOASID_ENFORCE_SNOOP ||
+   req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2)
+   return -EINVAL;
+
+   ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
+   if (!ioas)
+   return -ENOMEM;
+
+   mutex_lock(&ictx->lock);
+   ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas,
+  XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX),
+  GFP_KERNEL);
+   mutex_unlock(&ictx->lock);
+   if (ret) {
+   pr_err_ratelimited("Failed to alloc ioasid\n");
+   kfree(ioas);
+   return ret;
+   }
+
+   ioas->ioasid = ioasid;
+
+   /* only supports kernel managed I/O page table so far */
+   ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2;
+
+   ioas->addr_width = req.addr

[RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-19 Thread Liu Yi L
After a device is bound to the iommufd, userspace can use this interface
to query the underlying iommu capability and format info for this device.
Based on this information the user then creates I/O address space in a
compatible format with the to-be-attached devices.
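
To make the page-size part of the returned info concrete, here is a small
standalone sketch of how a user might interpret pgsize_bitmap, where bit N
set means a page size of 2^N bytes is supported (as documented in the
uAPI comment of this patch). Both helper names are hypothetical.

```c
#include <stdint.h>

/* Smallest page size advertised in the bitmap, or 0 if no bit is set. */
static uint64_t sketch_min_pgsize(uint64_t pgsize_bitmap)
{
	/* x & -x isolates the lowest set bit. */
	return pgsize_bitmap & (~pgsize_bitmap + 1);
}

/* Non-zero if 'size' is a power of two that is present in the bitmap. */
static int sketch_pgsize_supported(uint64_t pgsize_bitmap, uint64_t size)
{
	return size && !(size & (size - 1)) && (pgsize_bitmap & size);
}
```

For example, a bitmap with bits 12, 21 and 30 set advertises 4 KB, 2 MB
and 1 GB pages, so the minimum mapping granularity is 4 KB.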

Device cookie which is registered at binding time is used to mark the
device which is being queried here.
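
The cookie lookup is a plain linear search over the devices bound to this
fd. A minimal model of that idea, using a hypothetical array-based
structure instead of the kernel xarray, could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of the devices bound to one iommufd. */
struct sketch_bound_dev {
	uint64_t dev_cookie;	/* cookie registered by the user at bind time */
	const char *name;	/* stand-in for the struct device pointer */
};

/* Returns the entry whose cookie matches, or NULL if none does. */
static const struct sketch_bound_dev *
sketch_find_by_cookie(const struct sketch_bound_dev *devs, size_t n,
		      uint64_t cookie)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].dev_cookie == cookie)
			return &devs[i];
	return NULL;
}
```

Since the search returns the first match, the model implicitly assumes
cookies are unique per fd; whether duplicates are rejected at bind time is
not visible in this part of the series.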

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommufd/iommufd.c | 68 +
 include/uapi/linux/iommu.h  | 49 
 2 files changed, 117 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index e16ca21e4534..641f199f2d41 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, 
struct file *filep)
return 0;
 }
 
+static struct device *
+iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
+{
+   struct iommufd_device *idev;
+   struct device *dev = NULL;
+   unsigned long index;
+
+   mutex_lock(&ictx->lock);
+   xa_for_each(&ictx->device_xa, index, idev) {
+   if (idev->dev_cookie == dev_cookie) {
+   dev = idev->dev;
+   break;
+   }
+   }
+   mutex_unlock(&ictx->lock);
+
+   return dev;
+}
+
+static void iommu_device_build_info(struct device *dev,
+   struct iommu_device_info *info)
+{
+   bool snoop;
+   u64 awidth, pgsizes;
+
+   if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
+   info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
+
+   if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
+   info->pgsize_bitmap = pgsizes;
+   info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
+   }
+
+   if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
+   info->addr_width = awidth;
+   info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
+   }
+}
+
+static int iommufd_get_device_info(struct iommufd_ctx *ictx,
+  unsigned long arg)
+{
+   struct iommu_device_info info;
+   unsigned long minsz;
+   struct device *dev;
+
+   minsz = offsetofend(struct iommu_device_info, addr_width);
+
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz < minsz)
+   return -EINVAL;
+
+   info.flags = 0;
+
+   dev = iommu_find_device_from_cookie(ictx, info.dev_cookie);
+   if (!dev)
+   return -EINVAL;
+
+   iommu_device_build_info(dev, &info);
+
+   return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+}
+
 static long iommufd_fops_unl_ioctl(struct file *filep,
   unsigned int cmd, unsigned long arg)
 {
@@ -127,6 +192,9 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
return ret;
 
switch (cmd) {
+   case IOMMU_DEVICE_GET_INFO:
+   ret = iommufd_get_device_info(ictx, arg);
+   break;
default:
pr_err_ratelimited("unsupported cmd %u\n", cmd);
break;
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 59178fc229ca..76b71f9d6b34 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -7,6 +7,55 @@
 #define _UAPI_IOMMU_H
 
 #include 
+#include 
+
+/*  IOCTLs for IOMMU file descriptor (/dev/iommu)  */
+
+#define IOMMU_TYPE (';')
+#define IOMMU_BASE 100
+
+/*
+ * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
+ * struct iommu_device_info)
+ *
+ * Check IOMMU capabilities and format information on a bound device.
+ *
+ * The device is identified by device cookie (registered when binding
+ * this device).
+ *
+ * @argsz:         user filled size of this data.
+ * @flags:         tells userspace which capability info is available
+ * @dev_cookie:    user assigned cookie.
+ * @pgsize_bitmap: Bitmap of supported page sizes. 1-setting of the
+ *                 bit in pgsize_bitmap[63:12] indicates a supported
+ *                 page size. Details as below table:
+ *
+ *                 +===============+============+
+ *                 |  Bit[index]   |  Page Size |
+ *                 +---------------+------------+
+ *                 |  12           |  4 KB      |
+ *                 +---------------+------------+
+ *                 |  13           |  8 KB      |
+ *                 +---------------+------------+
+ *                 |  14           |  16 KB     |
+ *                 +---------------+------------+
+ *                 ...
+ * @addr_width:    the address width of supported I/O address spaces.
+ *
+ * Availability: after device is bound to iommufd
+ */
+struct iommu_device_info {
+   __u

[RFC 09/20] iommu: Add page size and address width attributes

2021-09-19 Thread Liu Yi L
From: Lu Baolu 

This exposes PAGE_SIZE and ADDR_WIDTH attributes. The iommufd could use
them to define the IOAS.

Signed-off-by: Lu Baolu 
---
 include/linux/iommu.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 943de6897f56..86d34e4ce05e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -153,9 +153,13 @@ enum iommu_dev_features {
 /**
  * enum iommu_devattr - Per device IOMMU attributes
  * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
+ * @IOMMU_DEV_INFO_PAGE_SIZE [u64]: Page sizes that iommu supports.
+ * @IOMMU_DEV_INFO_ADDR_WIDTH [u32]: Address width supported.
  */
 enum iommu_devattr {
IOMMU_DEV_INFO_FORCE_SNOOP,
+   IOMMU_DEV_INFO_PAGE_SIZE,
+   IOMMU_DEV_INFO_ADDR_WIDTH,
 };
 
 #define IOMMU_PASID_INVALID(-1U)
-- 
2.25.1



[RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD

2021-09-19 Thread Liu Yi L
This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the vfio
device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is provided
because it's implicitly done when the device fd is closed.

Conceptually a vfio device can be bound to multiple iommufds, each hosting
a subset of the I/O address spaces attached by this device. However, as a
starting point (matching current vfio), only one I/O address space is
supported per vfio device. This implies that one device can only be bound
to one iommufd at this point.
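
The resulting "bind at most once, unbind implicitly on close" behaviour
can be modelled with a few lines of standalone C. The structure and
function names below are invented for this sketch and only mirror the
error codes used by the ioctl handler in this patch.

```c
#include <errno.h>

/* Hypothetical per-device state; stands in for vdev->videv. */
struct sketch_vdev {
	int bound;	/* non-zero once bound to an iommufd */
	int iommu_fd;	/* the fd it is bound to, valid only if bound */
};

/* Models VFIO_DEVICE_BIND_IOMMUFD: a second bind is refused. */
static int sketch_bind_iommufd(struct sketch_vdev *v, int iommu_fd)
{
	if (iommu_fd < 0)
		return -EINVAL;
	if (v->bound)
		return -EBUSY;	/* only one iommufd per device for now */

	v->bound = 1;
	v->iommu_fd = iommu_fd;
	return 0;
}

/* Models closing the device fd, which performs the implicit unbind. */
static void sketch_release(struct sketch_vdev *v)
{
	v->bound = 0;
	v->iommu_fd = -1;
}
```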

Signed-off-by: Liu Yi L 
---
 drivers/vfio/pci/Kconfig|  1 +
 drivers/vfio/pci/vfio_pci.c | 72 -
 drivers/vfio/pci/vfio_pci_private.h |  8 
 include/uapi/linux/vfio.h   | 30 
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 5e2e1b9a9fd3..3abfb098b4dc 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -5,6 +5,7 @@ config VFIO_PCI
depends on MMU
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
+   select IOMMUFD
help
  Support for the PCI VFIO bus driver.  This is required to make
  use of PCI drivers using the VFIO framework.
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 145addde983b..20006bb66430 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -552,6 +552,16 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
vdev->req_trigger = NULL;
}
mutex_unlock(&vdev->igate);
+
+   mutex_lock(&vdev->videv_lock);
+   if (vdev->videv) {
+   struct vfio_iommufd_device *videv = vdev->videv;
+
+   vdev->videv = NULL;
+   iommufd_unbind_device(videv->idev);
+   kfree(videv);
+   }
+   mutex_unlock(&vdev->videv_lock);
}
 
mutex_unlock(&vdev->reflck->lock);
@@ -780,7 +790,66 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
container_of(core_vdev, struct vfio_pci_device, vdev);
unsigned long minsz;
 
-   if (cmd == VFIO_DEVICE_GET_INFO) {
+   if (cmd == VFIO_DEVICE_BIND_IOMMUFD) {
+   struct vfio_device_iommu_bind_data bind_data;
+   unsigned long minsz;
+   struct iommufd_device *idev;
+   struct vfio_iommufd_device *videv;
+
+   /*
+* Reject the request if the device is already opened and
+* attached to a container.
+*/
+   if (vfio_device_in_container(core_vdev))
+   return -ENOTTY;
+
+   minsz = offsetofend(struct vfio_device_iommu_bind_data, 
dev_cookie);
+
+   if (copy_from_user(&bind_data, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (bind_data.argsz < minsz ||
+   bind_data.flags || bind_data.iommu_fd < 0)
+   return -EINVAL;
+
+   mutex_lock(&vdev->videv_lock);
+   /*
+* Allow only one iommufd per device until multiple
+* address spaces (e.g. vSVA) support is introduced
+* in the future.
+*/
+   if (vdev->videv) {
+   mutex_unlock(&vdev->videv_lock);
+   return -EBUSY;
+   }
+
+   idev = iommufd_bind_device(bind_data.iommu_fd,
+  &vdev->pdev->dev,
+  bind_data.dev_cookie);
+   if (IS_ERR(idev)) {
+   mutex_unlock(&vdev->videv_lock);
+   return PTR_ERR(idev);
+   }
+
+   videv = kzalloc(sizeof(*videv), GFP_KERNEL);
+   if (!videv) {
+   iommufd_unbind_device(idev);
+   mutex_unlock(&vdev->videv_lock);
+   return -ENOMEM;
+   }
+   videv->idev = idev;
+   videv->iommu_fd = bind_data.iommu_fd;
+   /*
+* A security context has been established. Unblock
+* user access.
+*/
+   if (atomic_read(&vdev->block_access))
+   atomic_set(&vdev->block_access, 0);
+   vdev->videv = videv;
+   mutex_unlock(&vdev->videv_lock);
+
+   return 0;
+   } else if (cmd == VFIO_DEVICE_GET_INFO) {
struct vfio_device_info info;
struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
unsigned long capsz;
@@ -2031,6 +2100,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
mutex_init(&vdev->vma_lock);
INIT_LIST_HEAD(&vdev->vma_list);

[RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

2021-09-19 Thread Liu Yi L
Under the /dev/iommu model, iommufd provides the interface for I/O page
tables management such as dma map/unmap. However, it cannot work
independently since the device is still owned by the device-passthrough
frameworks (VFIO, vDPA, etc.) and vice versa. Device-passthrough frameworks
should build a connection between their devices and the iommufd to delegate
the I/O page table management to iommufd.

This patch introduces iommufd_[un]bind_device() helpers for the device-
passthrough framework to build such connection. The helper functions then
invoke iommu core (iommu_device_init/exit_user_dma()) to establish/exit
security context for the bound device. Each successfully bound device is
internally tracked by an iommufd_device object. This object is returned
to the caller for subsequent attaching operations on the device as well.

The caller should pass a user-provided cookie to mark the device in the
iommufd. Later this cookie will be used to represent the device in iommufd
uAPI, e.g. when querying device capabilities or handling per-device I/O
page faults. One alternative is to have iommufd allocate a device label
and return it to the user. Either way works, but the cookie is slightly
preferred per earlier discussion as it may allow the user to inject faults
slightly faster without ID->vRID lookup.

iommu_[un]bind_device() functions are only used for physical devices. Other
variants will be introduced in the future, e.g.:

-  iommu_[un]bind_device_pasid() for mdev/subdev which requires pasid granular
   DMA isolation;
-  iommu_[un]bind_sw_mdev() for sw mdev which relies on software measures
   instead of iommu to isolate DMA;

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommufd/iommufd.c | 160 +++-
 include/linux/iommufd.h |  38 
 2 files changed, 196 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 710b7e62988b..e16ca21e4534 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -16,10 +16,30 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 /* Per iommufd */
 struct iommufd_ctx {
refcount_t refs;
+   struct mutex lock;
+   struct xarray device_xa; /* xarray of bound devices */
+};
+
+/*
+ * An iommufd_device object represents the binding relationship
+ * between iommufd and device. It is created per a successful
+ * binding request from device driver. The bound device must be
+ * a physical device so far. Subdevice will be supported later
+ * (with additional PASID information). A user-assigned cookie
+ * is also recorded to mark the device in the /dev/iommu uAPI.
+ */
+struct iommufd_device {
+   unsigned int id;
+   struct iommufd_ctx *ictx;
+   struct device *dev; /* always be the physical device */
+   u64 dev_cookie;
 };
 
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct 
file *filep)
return -ENOMEM;
 
refcount_set(&ictx->refs, 1);
+   mutex_init(&ictx->lock);
+   xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
filep->private_data = ictx;
 
return ret;
 }
 
+static void iommufd_ctx_get(struct iommufd_ctx *ictx)
+{
+   refcount_inc(&ictx->refs);
+}
+
+static const struct file_operations iommufd_fops;
+
+/**
+ * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
+ * @fd: [in] iommufd file descriptor.
+ *
+ * Returns a pointer to the iommufd context, otherwise NULL;
+ *
+ */
+static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
+{
+   struct fd f = fdget(fd);
+   struct file *file = f.file;
+   struct iommufd_ctx *ictx;
+
+   if (!file)
+   return NULL;
+
+   if (file->f_op != &iommufd_fops)
+   return NULL;
+
+   ictx = file->private_data;
+   if (ictx)
+   iommufd_ctx_get(ictx);
+   fdput(f);
+   return ictx;
+}
+
+/**
+ * iommufd_ctx_put - Releases a reference to the internal iommufd context.
+ * @ictx: [in] Pointer to iommufd context.
+ *
+ */
 static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 {
-   if (refcount_dec_and_test(&ictx->refs))
-   kfree(ictx);
+   if (!refcount_dec_and_test(&ictx->refs))
+   return;
+
+   WARN_ON(!xa_empty(&ictx->device_xa));
+   kfree(ictx);
 }
 
 static int iommufd_fops_release(struct inode *inode, struct file *filep)
@@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
.mode = 0666,
 };
 
+/**
+ * iommufd_bind_device - Bind a physical device marked by a device
+ *  cookie to an iommu fd.
+ * @fd:[in] iommufd file descriptor.
+ * @dev:   [in] Pointer to a physical device struct.
+ * @dev_cookie:[in] A cookie to mark the device in /dev/iommu uAPI.
+ *
+ * A successful bind establishes 

[RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-19 Thread Liu Yi L
From: Lu Baolu 

This extends iommu core to manage security context for passthrough
devices. Please bear with a long explanation of how we reached this design
instead of managing it solely in iommufd like what vfio does today.

Devices which cannot be isolated from each other are organized into an
iommu group. When a device is assigned to the user space, the entire
group must be put in a security context so that user-initiated DMAs via
the assigned device cannot harm the rest of the system. No user access
should be granted on a device before the security context is established
for the group which the device belongs to.

Managing the security context must meet below criteria:

1)  The group is viable for user-initiated DMAs. This implies that the
devices in the group must be either bound to a device-passthrough
framework, or driver-less, or bound to a driver which is known safe
(not do DMA).

2)  The security context should only allow DMA to the user's memory and
devices in this group;

3)  After the security context is established for the group, the group
viability must be continuously monitored before the user relinquishes
all devices belonging to the group. The viability might be broken e.g.
when a driver-less device is later bound to a driver which does DMA.

4)  The security context should not be destroyed before user access
permission is withdrawn.

Existing vfio introduces explicit container/group semantics in its uAPI
to meet above requirements. A single security context (iommu domain)
is created per container. Attaching group to container moves the entire
group into the associated security context, and vice versa. The user can
open the device only after group attach. A group can be detached only
after all devices in the group are closed. Group viability is monitored
by listening to iommu group events.

Unlike vfio, iommufd adopts a device-centric design with all group
logistics hidden behind the fd. Binding a device to iommufd serves
as the contract to get security context established (and vice versa
for unbinding). One additional requirement in iommufd is to manage the
switch between multiple security contexts due to decoupled bind/attach:

1)  Open a device in "/dev/vfio/devices" with user access blocked;

2)  Bind the device to an iommufd with an initial security context
(an empty iommu domain which blocks dma) established for its
group, with user access unblocked;

3)  Attach the device to a user-specified ioasid (shared by all devices
attached to this ioasid). Before attaching, the device should be first
detached from the initial context;

4)  Detach the device from the ioasid and switch it back to the initial
security context;

5)  Unbind the device from the iommufd, back to access blocked state and
move its group out of the initial security context if it's the last
unbound device in the group;

(multiple attach/detach could happen between 2 and 5).
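
The five steps above can be read as a small state machine. The following
standalone C sketch encodes it; the enum names and sketch_next() are
hypothetical, and for simplicity the model requires an explicit detach
before unbind, which is stricter than anything stated in this series.

```c
/* Device states, named after the steps above (names are hypothetical). */
enum sketch_state {
	SK_CLOSED,		/* device fd not opened */
	SK_OPEN_BLOCKED,	/* step 1: opened, user access blocked */
	SK_BOUND,		/* step 2/4: initial security context, DMA blocked */
	SK_ATTACHED,		/* step 3: attached to a user-specified ioasid */
};

enum sketch_event { SK_OPEN, SK_BIND, SK_ATTACH, SK_DETACH, SK_UNBIND };

/* Returns the next state, or -1 if the transition is not allowed. */
static int sketch_next(enum sketch_state s, enum sketch_event e)
{
	switch (s) {
	case SK_CLOSED:
		return e == SK_OPEN ? SK_OPEN_BLOCKED : -1;
	case SK_OPEN_BLOCKED:
		return e == SK_BIND ? SK_BOUND : -1;
	case SK_BOUND:
		if (e == SK_ATTACH)
			return SK_ATTACHED;
		if (e == SK_UNBIND)
			return SK_OPEN_BLOCKED;	/* step 5: access blocked again */
		return -1;
	case SK_ATTACHED:
		return e == SK_DETACH ? SK_BOUND : -1;
	}
	return -1;
}
```

Note that there is no state in which the device sits in the default
domain while user access is unblocked; that is exactly the property the
transition problem described next would otherwise break.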

However, the existing iommu core has a problem with the above transition.
Detach in step 3/4 makes the device/group re-attached to the default domain
automatically, which opens the door for user-initiated DMAs to attack
the rest of the system. The existing vfio doesn't have this problem as
it combines 2/3 in one step (so does 4/5).

Fixing this problem requires the iommu core to also participate in the
security context management. Following this direction we also move group
viability check into the iommu core, which allows iommufd to stay fully
device-centric w/o keeping any group knowledge (combining with the
extension to iommu_at[de]tach_device() in a later patch).

Basically two new interfaces are provided:

int iommu_device_init_user_dma(struct device *dev,
unsigned long owner);
void iommu_device_exit_user_dma(struct device *dev);

iommufd calls them respectively when handling device binding/unbinding
requests.

The init_user_dma() for the 1st device in a group marks the entire group
for user-dma and establishes the initial security context (dma blocked)
according to aforementioned criteria. As long as the group is marked for
user-dma, auto-reattaching to default domain is disabled. Instead, upon
detaching the group is moved back to the initial security context.

The caller also provides an owner id to mark the ownership so that an
inadvertent attempt from another caller on the same device can be caught.
In this RFC iommufd will use the fd context pointer as the owner id.

The exit_user_dma() for the last device in the group clears the user-dma
mark and moves the group back to the default domain.
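
A simplified userspace model of the ownership semantics described above is
sketched below. The structure and helpers are hypothetical, and returning
-EBUSY on an owner mismatch is an assumption of this sketch, since the
relevant hunk is not shown here.

```c
#include <errno.h>

/* Hypothetical stand-in for the per-group user-dma bookkeeping. */
struct sketch_group {
	unsigned long owner;	/* 0 means not marked for user-dma */
	unsigned int owner_cnt;	/* devices that called init_user_dma */
	int dma_blocked;	/* 1 while in the initial security context */
};

static int sketch_init_user_dma(struct sketch_group *g, unsigned long owner)
{
	if (!owner)
		return -EINVAL;

	if (g->owner) {
		if (g->owner != owner)
			return -EBUSY;	/* assumed error code for a foreign owner */
		g->owner_cnt++;
		return 0;
	}

	/* First device: mark the group and block DMA. */
	g->owner = owner;
	g->owner_cnt = 1;
	g->dma_blocked = 1;
	return 0;
}

static void sketch_exit_user_dma(struct sketch_group *g)
{
	if (!g->owner_cnt || --g->owner_cnt)
		return;

	/* Last device: clear the mark, back to the default domain. */
	g->owner = 0;
	g->dma_blocked = 0;
}
```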

Signed-off-by: Kevin Tian 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/iommu.c | 145 +-
 include/linux/iommu.h |  12 
 2 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5ea3a007fd7c..bffd84e978fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ 

[RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

2021-09-19 Thread Liu Yi L
This patch exposes the device-centric interface for vfio-pci devices. To
be compatible with existing users, vfio-pci exposes both the legacy group
interface and the device-centric interface.

As explained in last patch, this change doesn't apply to devices which
cannot be forced to snoop cache by their upstream iommu. Such devices
are still expected to be opened via the legacy group interface.

When the device is opened via /dev/vfio/devices, vfio-pci should prevent
the user from accessing the assigned device because the device is still
attached to the default domain which may allow user-initiated DMAs to
touch arbitrary memory. The user access must be blocked until the device
is later bound to an iommufd (see patch 08). The binding acts as the
contract for putting the device in a security context which ensures user-
initiated DMAs via this device cannot harm the rest of the system.

This patch introduces a vdev->block_access flag for this purpose. It's set
when the device is opened via /dev/vfio/devices and cleared after binding
to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
user access should be blocked or not.
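
The gating logic is small enough to model in standalone C11. The sketch
below uses <stdatomic.h> in place of the kernel atomic_t helpers, and all
names in it are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the relevant part of struct vfio_pci_device. */
struct sketch_pci_dev {
	atomic_int block_access;
};

/* Open: block access unless the device came in via a container/group. */
static void sketch_open(struct sketch_pci_dev *d, int in_container)
{
	atomic_store(&d->block_access, in_container ? 0 : 1);
}

/* Called once binding to an iommufd has succeeded. */
static void sketch_bind_done(struct sketch_pci_dev *d)
{
	atomic_store(&d->block_access, 0);
}

/* Entry check shared by the read/write and mmap paths. */
static long sketch_access(struct sketch_pci_dev *d)
{
	if (atomic_load(&d->block_access))
		return -ENODEV;
	return 0;	/* the real handler would continue here */
}
```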

An alternative option is to use a dummy fops when the device is opened and
then switch to the real fops (replace_fops()) after binding. Appreciate
inputs on which option is better.

The legacy group interface doesn't have this problem. Its uAPI requires the
user to first put the device into a security context via container/group
attaching process, before opening the device through the groupfd.

Signed-off-by: Liu Yi L 
---
 drivers/vfio/pci/vfio_pci.c | 25 +++--
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 drivers/vfio/vfio.c |  3 ++-
 include/linux/vfio.h|  1 +
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 318864d52837..145addde983b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -572,6 +572,10 @@ static int vfio_pci_open(struct vfio_device *core_vdev)
 
vfio_spapr_pci_eeh_open(vdev->pdev);
vfio_pci_vf_token_user_add(vdev, 1);
+   if (!vfio_device_in_container(core_vdev))
+   atomic_set(&vdev->block_access, 1);
+   else
+   atomic_set(&vdev->block_access, 0);
}
vdev->refcnt++;
 error:
@@ -1374,6 +1378,9 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, 
char __user *buf,
 {
unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
 
+   if (atomic_read(&vdev->block_access))
+   return -ENODEV;
+
if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
return -EINVAL;
 
@@ -1640,6 +1647,9 @@ static int vfio_pci_mmap(struct vfio_device *core_vdev, 
struct vm_area_struct *v
u64 phys_len, req_len, pgoff, req_start;
int ret;
 
+   if (atomic_read(&vdev->block_access))
+   return -ENODEV;
+
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
@@ -1978,6 +1988,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
struct vfio_pci_device *vdev;
struct iommu_group *group;
int ret;
+   u32 flags;
+   bool snoop = false;
 
if (vfio_pci_is_denylisted(pdev))
return -EINVAL;
@@ -2046,9 +2058,18 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
vfio_pci_set_power_state(vdev, PCI_D3hot);
}
 
-   ret = vfio_register_group_dev(&vdev->vdev);
-   if (ret)
+   flags = VFIO_DEVNODE_GROUP;
+   ret = iommu_device_get_info(&pdev->dev,
+   IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
+   if (!ret && snoop)
+   flags |= VFIO_DEVNODE_NONGROUP;
+
+   ret = vfio_register_device(&vdev->vdev, flags);
+   if (ret) {
+   pr_debug("Failed to register device interface\n");
goto out_power;
+   }
+
dev_set_drvdata(&pdev->dev, vdev);
return 0;
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 5a36272cecbf..f12012e30b53 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -143,6 +143,7 @@ struct vfio_pci_device {
struct mutexvma_lock;
struct list_headvma_list;
struct rw_semaphore memory_lock;
+   atomic_tblock_access;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 1e87b25962f1..22851747e92c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1789,10 +1789,11 @@ static int vfio_device_fops_open(struct inode *inode, 
struct file *filep)
  

[RFC 04/20] iommu: Add iommu_device_get_info interface

2021-09-19 Thread Liu Yi L
From: Lu Baolu 

This provides an interface for upper layers to get the per-device iommu
attributes.

int iommu_device_get_info(struct device *dev,
  enum iommu_devattr attr, void *data);

The first attribute, IOMMU_DEV_INFO_FORCE_SNOOP, is added. It tells whether
the iommu can force DMA to snoop the cache. At this stage, only PCI devices
which have this attribute set can use the iommufd, because supporting
no-snoop DMA requires additional refactoring work on the current kvm-vfio
contract. The following patch will have vfio check this attribute to decide
whether a PCI device can be exposed through /dev/vfio/devices.
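
The dispatch pattern behind this interface (core helper validates, then
forwards to an optional driver callback) can be sketched in standalone C
as follows. All sketch_* names are hypothetical; only the error codes and
the callback shape follow the hunk in this patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_devattr { SKETCH_INFO_FORCE_SNOOP };

struct sketch_dev;

/* Stand-in for the relevant part of struct iommu_ops. */
struct sketch_ops {
	int (*device_info)(struct sketch_dev *dev,
			   enum sketch_devattr attr, void *data);
};

struct sketch_dev {
	const struct sketch_ops *ops;	/* NULL if no iommu behind the bus */
	bool snoop_capable;
};

/* Core helper: validate, then forward to the driver callback. */
static int sketch_get_info(struct sketch_dev *dev,
			   enum sketch_devattr attr, void *data)
{
	if (!dev->ops)
		return -EINVAL;
	if (!dev->ops->device_info)
		return -ENODEV;
	return dev->ops->device_info(dev, attr, data);
}

/* Example driver callback; 'data' is typed per attribute ([bool] here). */
static int sketch_drv_info(struct sketch_dev *dev,
			   enum sketch_devattr attr, void *data)
{
	switch (attr) {
	case SKETCH_INFO_FORCE_SNOOP:
		*(bool *)data = dev->snoop_capable;
		return 0;
	}
	return -ENODEV;
}
```

Because data is an untyped pointer, the caller has to pass a variable of
the type documented for each attribute; a mismatch is not caught by the
compiler.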

Signed-off-by: Lu Baolu 
---
 drivers/iommu/iommu.c | 16 
 include/linux/iommu.h | 19 +++
 2 files changed, 35 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 63f0af10c403..5ea3a007fd7c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3260,3 +3260,19 @@ static ssize_t iommu_group_store_type(struct iommu_group 
*group,
 
return ret;
 }
+
+/* Expose per-device iommu attributes. */
+int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void 
*data)
+{
+   const struct iommu_ops *ops;
+
+   if (!dev->bus || !dev->bus->iommu_ops)
+   return -EINVAL;
+
+   ops = dev->bus->iommu_ops;
+   if (unlikely(!ops->device_info))
+   return -ENODEV;
+
+   return ops->device_info(dev, attr, data);
+}
+EXPORT_SYMBOL_GPL(iommu_device_get_info);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..52a6d33c82dc 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -150,6 +150,14 @@ enum iommu_dev_features {
IOMMU_DEV_FEAT_IOPF,
 };
 
+/**
+ * enum iommu_devattr - Per device IOMMU attributes
+ * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
+ */
+enum iommu_devattr {
+   IOMMU_DEV_INFO_FORCE_SNOOP,
+};
+
 #define IOMMU_PASID_INVALID (-1U)
 
 #ifdef CONFIG_IOMMU_API
@@ -215,6 +223,7 @@ struct iommu_iotlb_gather {
  * - IOMMU_DOMAIN_IDENTITY: must use an identity domain
  * - IOMMU_DOMAIN_DMA: must use a dma domain
  * - 0: use the default setting
+ * @device_info: query per-device iommu attributes
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @owner: Driver module providing these ops
  */
@@ -283,6 +292,8 @@ struct iommu_ops {
 
int (*def_domain_type)(struct device *dev);
 
+   int (*device_info)(struct device *dev, enum iommu_devattr attr, void 
*data);
+
unsigned long pgsize_bitmap;
struct module *owner;
 };
@@ -604,6 +615,8 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
+int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void 
*data);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -999,6 +1012,12 @@ static inline struct iommu_fwspec 
*dev_iommu_fwspec_get(struct device *dev)
 {
return NULL;
 }
+
+static inline int iommu_device_get_info(struct device *dev,
+   enum iommu_devattr type, void *data)
+{
+   return -ENODEV;
+}
 #endif /* CONFIG_IOMMU_API */
 
 /**
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC 03/20] vfio: Add vfio_[un]register_device()

2021-09-19 Thread Liu Yi L
With /dev/vfio/devices introduced, now a vfio device driver has three
options to expose its device to userspace:

a)  only legacy group interface, for devices which haven't been moved to
iommufd (e.g. platform devices, sw mdev, etc.);

b)  both legacy group interface and new device-centric interface, for
devices which support iommufd but also want to keep backward
compatibility (e.g. pci devices in this RFC);

c)  only new device-centric interface, for new devices which don't carry
backward compatibility burden (e.g. hw mdev/subdev with pasid);

This patch introduces vfio_[un]register_device() helpers that let device
drivers specify their device exposure policy to the vfio core. The
existing vfio_[un]register_group_dev() functions become wrappers around
the new helpers. The new device-centric interface is described as
'nongroup' to differentiate it from the existing 'group' code.
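
The three exposure options can be pictured as a two-bit policy mask. The
sketch below is a userspace model only: the DEMO_EXPOSE_* flag names and the
demo_* functions are hypothetical and are not the identifiers used by the
patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical policy flags; the real names are defined by the patch. */
#define DEMO_EXPOSE_GROUP	(1u << 0)	/* legacy /dev/vfio/$GROUP */
#define DEMO_EXPOSE_NONGROUP	(1u << 1)	/* /dev/vfio/devices/$DEVICE */

struct demo_vfio_device {
	bool in_group_list;
	bool in_nongroup_list;
};

/* Register the device on every interface selected by @flags. */
static int demo_register_device(struct demo_vfio_device *dev,
				unsigned int flags)
{
	if (!(flags & (DEMO_EXPOSE_GROUP | DEMO_EXPOSE_NONGROUP)))
		return -EINVAL;

	if (flags & DEMO_EXPOSE_GROUP)
		dev->in_group_list = true;
	if (flags & DEMO_EXPOSE_NONGROUP)
		dev->in_nongroup_list = true;
	return 0;
}

/* The legacy entry point becomes a thin wrapper, as the text describes. */
static int demo_register_group_dev(struct demo_vfio_device *dev)
{
	return demo_register_device(dev, DEMO_EXPOSE_GROUP);
}
```

Option a) maps to the wrapper, option b) to both bits set, and option c) to
the nongroup bit alone.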

TBD: this patch needs to be rebased on top of the series below from
Christoph in the next version.

"cleanup vfio iommu_group creation"

Legacy userspace continues to follow the legacy group interface.

Newer userspace can first try the new device-centric interface if the
device is present under /dev/vfio/devices; otherwise it falls back to the
group interface.
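
A minimal userspace sketch of that "try the device node first, then fall
back" order is shown below. The function name, the enum and the root
parameter are illustrative assumptions. The sketch also assumes the caller
already knows the group number, and note that the fallback yields a group
fd, from which a device fd still has to be obtained through the legacy flow.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Which interface the returned fd belongs to (illustrative only). */
enum demo_vfio_path {
	DEMO_VFIO_NONE,
	DEMO_VFIO_DEVICE,	/* fd is a device fd */
	DEMO_VFIO_GROUP,	/* fd is a legacy group fd */
};

/*
 * Try <root>/devices/<devname> first; if that node cannot be opened,
 * fall back to <root>/<group_id>. On failure *fd is left negative.
 */
static enum demo_vfio_path demo_vfio_open(const char *root,
					  const char *devname,
					  int group_id, int *fd)
{
	char path[256];

	snprintf(path, sizeof(path), "%s/devices/%s", root, devname);
	*fd = open(path, O_RDWR);
	if (*fd >= 0)
		return DEMO_VFIO_DEVICE;

	snprintf(path, sizeof(path), "%s/%d", root, group_id);
	*fd = open(path, O_RDWR);
	if (*fd >= 0)
		return DEMO_VFIO_GROUP;

	return DEMO_VFIO_NONE;
}
```

On a real system the root would be "/dev/vfio"; it is a parameter here only
so the logic can be exercised without the device nodes.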

One open question is how to organize the device nodes under
/dev/vfio/devices/. This RFC adopts a simple policy: a flat layout that
mixes devnames from all kinds of devices. The prerequisite of this model is
that devnames from different bus types have unique formats:

/dev/vfio/devices/0000:00:14.2 (pci)
/dev/vfio/devices/PNP0103:00 (platform)
/dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)

One alternative is to arrange device nodes in sub-directories based on the
device type. But doing so also burdens userspace. The current vfio uAPI is
designed to have the user query the device type via VFIO_DEVICE_GET_INFO
after opening the device. With this option the user instead needs to figure
out the device type before opening the device, in order to identify the
sub-directory. Another tricky thing is that "pdev vs. mdev" and
"pci vs. platform vs. ccw, ..." are orthogonal categorizations. More thought
is needed on whether both or just one of them should be used to define the
sub-directories.

Signed-off-by: Liu Yi L 
---
 drivers/vfio/vfio.c  | 137 +++
 include/linux/vfio.h |   9 +++
 2 files changed, 134 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 84436d7abedd..1e87b25962f1 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -51,6 +51,7 @@ static struct vfio {
struct cdev device_cdev;
dev_t   device_devt;
struct mutexdevice_lock;
+   struct list_headdevice_list;
struct idr  device_idr;
 } vfio;
 
@@ -757,7 +758,7 @@ void vfio_init_group_dev(struct vfio_device *device, struct 
device *dev,
 }
 EXPORT_SYMBOL_GPL(vfio_init_group_dev);
 
-int vfio_register_group_dev(struct vfio_device *device)
+static int __vfio_register_group_dev(struct vfio_device *device)
 {
struct vfio_device *existing_device;
struct iommu_group *iommu_group;
@@ -794,8 +795,13 @@ int vfio_register_group_dev(struct vfio_device *device)
/* Our reference on group is moved to the device */
device->group = group;
 
-   /* Refcounting can't start until the driver calls register */
-   refcount_set(&device->refcount, 1);
+   /*
+* Refcounting can't start until the driver calls register. Don't
+* start twice when the device is exposed in both group and nongroup
+* interfaces.
+*/
+   if (!refcount_read(&device->refcount))
+   refcount_set(&device->refcount, 1);
 
mutex_lock(&group->device_lock);
list_add(&device->group_next, &group->device_list);
@@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device *device)
 
return 0;
 }
-EXPORT_SYMBOL_GPL(vfio_register_group_dev);
+
+static int __vfio_register_nongroup_dev(struct vfio_device *device)
+{
+   struct vfio_device *existing_device;
+   struct device *dev;
+   int ret = 0, minor;
+
+   mutex_lock(&vfio.device_lock);
+   list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
+   if (existing_device == device) {
+   ret = -EBUSY;
+   goto out_unlock;
+   }
+   }
+
+   minor = idr_alloc(&vfio.device_idr, device, 0, MINORMASK + 1,
GFP_KERNEL);
+   pr_debug("%s - minor: %d\n", __func__, minor);
+   if (minor < 0) {
+   ret = minor;
+   goto out_unlock;
+   }
+
+   dev = device_create(vfio.device_class, NULL,
+   MKDEV(MAJOR(vfio.device_devt), minor),
+   device, "%s", dev_name(device

[RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-09-19 Thread Liu Yi L
This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
userspace to directly open a vfio device w/o relying on container/group
(/dev/vfio/$GROUP). Anything related to group is now hidden behind
iommufd (more specifically in iommu core by this RFC) in a device-centric
manner.

In case a device is exposed in both legacy and new interfaces (see next
patch for how to decide it), this patch also ensures that when the device
is already opened via one interface then the other one must be blocked.
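
The mutual-exclusion rule can be modelled in a few lines of userspace C.
This is only a sketch of the decision logic in vfio_group_try_open() from
this patch: the demo_* names are made up, the container pointer is reduced
to a boolean, and the opened_lock mutex is left out for brevity.

```c
#include <errno.h>
#include <stdbool.h>

/* Cut-down stand-in for the fields of struct vfio_group used here. */
struct demo_group {
	unsigned int opened;		/* open count */
	bool opened_by_nongroup_dev;	/* opened via /dev/vfio/devices? */
	bool has_container;		/* leftover state from a prior open */
};

/*
 * Returns 0 if the open is allowed, -EBUSY otherwise. Locking omitted:
 * the patch takes group->opened_lock around this logic.
 */
static int demo_group_try_open(struct demo_group *g, bool nongroup_dev)
{
	if (g->opened) {
		/* Only further device-centric opens may share the group. */
		if (nongroup_dev && g->opened_by_nongroup_dev) {
			g->opened++;
			return 0;
		}
		return -EBUSY;
	}

	/* Something is still in use from a previous open. */
	if (g->has_container)
		return -EBUSY;

	g->opened = 1;
	g->opened_by_nongroup_dev = nongroup_dev;
	return 0;
}
```

Once the group fd has been opened, every later open is refused. After a
device-centric open, only further device-centric opens in the same group are
admitted, and each one bumps the count.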

Signed-off-by: Liu Yi L 
---
 drivers/vfio/vfio.c  | 228 +++
 include/linux/vfio.h |   2 +
 2 files changed, 213 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 02cc51ce6891..84436d7abedd 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -46,6 +46,12 @@ static struct vfio {
struct mutexgroup_lock;
struct cdev group_cdev;
dev_t   group_devt;
+   /* Fields for /dev/vfio/devices interface */
+   struct class*device_class;
+   struct cdev device_cdev;
+   dev_t   device_devt;
+   struct mutexdevice_lock;
+   struct idr  device_idr;
 } vfio;
 
 struct vfio_iommu_driver {
@@ -81,9 +87,11 @@ struct vfio_group {
struct list_headcontainer_next;
struct list_headunbound_list;
struct mutexunbound_lock;
-   atomic_topened;
-   wait_queue_head_t   container_q;
+   struct mutexopened_lock;
+   u32 opened;
+   boolopened_by_nongroup_dev;
boolnoiommu;
+   wait_queue_head_t   container_q;
unsigned intdev_counter;
struct kvm  *kvm;
struct blocking_notifier_head   notifier;
@@ -327,7 +335,7 @@ static struct vfio_group *vfio_create_group(struct 
iommu_group *iommu_group)
INIT_LIST_HEAD(&group->unbound_list);
mutex_init(&group->unbound_lock);
atomic_set(&group->container_users, 0);
-   atomic_set(&group->opened, 0);
+   mutex_init(&group->opened_lock);
init_waitqueue_head(&group->container_q);
group->iommu_group = iommu_group;
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -1489,10 +1497,53 @@ static long vfio_group_fops_unl_ioctl(struct file 
*filep,
return ret;
 }
 
+/*
+ * group->opened is used to ensure that the group can be opened only via
+ * one of the two interfaces (/dev/vfio/$GROUP and /dev/vfio/devices/
+ * $DEVICE) instead of both.
+ *
+ * We also introduce a new group flag to indicate whether this group is
+ * opened via /dev/vfio/devices/$DEVICE. For multi-devices group,
+ * group->opened also tracks how many devices have been opened in the
+ * group if the new flag is true.
+ *
+ * Also add a new lock since two flags are operated here.
+ */
+static int vfio_group_try_open(struct vfio_group *group, bool nongroup_dev)
+{
+   int ret = 0;
+
+   mutex_lock(&group->opened_lock);
+   if (group->opened) {
+   if (nongroup_dev && group->opened_by_nongroup_dev)
+   group->opened++;
+   else
+   ret = -EBUSY;
+   goto out;
+   }
+
+   /*
+* Is something still in use from a previous open? Should
+* not allow new open if it is such case.
+*/
+   if (group->container) {
+   ret = -EBUSY;
+   goto out;
+   }
+
+   group->opened = 1;
+   group->opened_by_nongroup_dev = nongroup_dev;
+
+out:
+   mutex_unlock(&group->opened_lock);
+
+   return ret;
+}
+
 static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 {
struct vfio_group *group;
-   int opened;
+   int ret;
 
group = vfio_group_get_from_minor(iminor(inode));
if (!group)
@@ -1503,18 +1554,10 @@ static int vfio_group_fops_open(struct inode *inode, 
struct file *filep)
return -EPERM;
}
 
-   /* Do we need multiple instances of the group open?  Seems not. */
-   opened = atomic_cmpxchg(&group->opened, 0, 1);
-   if (opened) {
-   vfio_group_put(group);
-   return -EBUSY;
-   }
-
-   /* Is something still in use from a previous open? */
-   if (group->container) {
-   atomic_dec(&group->opened);
+   ret = vfio_group_try_open(group, false);
+   if (ret) {
vfio_group_put(group);
-   return -EBUSY;
+   return ret;
}
 
/* Warn if previous user didn't cleanup and re-init to drop them */
@@ -1534,7 +1577,9 @@ static 

[RFC 01/20] iommu/iommufd: Add /dev/iommu core

2021-09-19 Thread Liu Yi L
/dev/iommu aims to provide a unified interface for managing I/O address
spaces for devices assigned to userspace. This patch adds the initial
framework to create a /dev/iommu node. Each open of this node returns an
iommufd. And this fd is the handle for userspace to initiate its I/O
address space management.
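
The per-open context lifetime in this patch follows the usual
"one reference per holder" pattern. The sketch below models it in userspace
C with a plain counter. The demo_* names are illustrative, and a real kernel
implementation would use refcount_t rather than an unprotected integer.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the per-iommufd context allocated on each open. */
struct demo_ctx {
	unsigned int refs;
};

/* Models open(): allocate a context holding one reference. */
static struct demo_ctx *demo_ctx_open(void)
{
	struct demo_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		ctx->refs = 1;
	return ctx;
}

/* Another holder (for example a bound device) takes a reference. */
static void demo_ctx_get(struct demo_ctx *ctx)
{
	ctx->refs++;
}

/* Drop a reference; returns true if this was the last one and ctx is gone. */
static bool demo_ctx_put(struct demo_ctx *ctx)
{
	if (--ctx->refs)
		return false;
	free(ctx);
	return true;
}
```

With this shape, closing the fd does not free the context while another
holder still has a reference; the final put does.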

One open:
- We call this feature IOMMUFD in Kconfig in this RFC. However, this
  name does not indicate its purpose clearly enough to the user. Back in
  2010, vfio even introduced a /dev/uiommu [1] as the predecessor of its
  container concept. Is that a better name? Opinions are appreciated.

[1] https://lore.kernel.org/kvm/4c0eb470.1hmjondo00nivfm6%25p...@cisco.com/

Signed-off-by: Liu Yi L 
---
 drivers/iommu/Kconfig   |   1 +
 drivers/iommu/Makefile  |   1 +
 drivers/iommu/iommufd/Kconfig   |  11 
 drivers/iommu/iommufd/Makefile  |   2 +
 drivers/iommu/iommufd/iommufd.c | 112 
 5 files changed, 127 insertions(+)
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 07b7c25cbed8..a83ce0acd09d 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -136,6 +136,7 @@ config MSM_IOMMU
 
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/iommufd/Kconfig"
 
 config IRQ_REMAP
bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c0fb0ba88143..719c799f23ad 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
 obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
 obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
+obj-$(CONFIG_IOMMUFD) += iommufd/
diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
new file mode 100644
index 000000000000..9fb7769a815d
--- /dev/null
+++ b/drivers/iommu/iommufd/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config IOMMUFD
+   tristate "I/O Address Space management framework for passthrough 
devices"
+   select IOMMU_API
+   default n
+   help
+ provides unified I/O address space management framework for
+ isolating untrusted DMAs via devices which are passed through
+ to userspace drivers.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
new file mode 100644
index ..54381a01d003
--- /dev/null
+++ b/drivers/iommu/iommufd/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
new file mode 100644
index 000000000000..710b7e62988b
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * I/O Address Space Management for passthrough devices
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Liu Yi L 
+ */
+
+#define pr_fmt(fmt) "iommufd: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Per iommufd */
+struct iommufd_ctx {
+   refcount_t refs;
+};
+
+static int iommufd_fops_open(struct inode *inode, struct file *filep)
+{
+   struct iommufd_ctx *ictx;
+   int ret = 0;
+
+   ictx = kzalloc(sizeof(*ictx), GFP_KERNEL);
+   if (!ictx)
+   return -ENOMEM;
+
+   refcount_set(&ictx->refs, 1);
+   filep->private_data = ictx;
+
+   return ret;
+}
+
+static void iommufd_ctx_put(struct iommufd_ctx *ictx)
+{
+   if (refcount_dec_and_test(&ictx->refs))
+   kfree(ictx);
+}
+
+static int iommufd_fops_release(struct inode *inode, struct file *filep)
+{
+   struct iommufd_ctx *ictx = filep->private_data;
+
+   filep->private_data = NULL;
+
+   iommufd_ctx_put(ictx);
+
+   return 0;
+}
+
+static long iommufd_fops_unl_ioctl(struct file *filep,
+  unsigned int cmd, unsigned long arg)
+{
+   struct iommufd_ctx *ictx = filep->private_data;
+   long ret = -EINVAL;
+
+   if (!ictx)
+   return ret;
+
+   switch (cmd) {
+   default:
+   pr_err_ratelimited("unsupported cmd %u\n", cmd);
+   break;
+   }
+   return ret;
+}
+
+static const struct file_operations iommufd_fops = {
+   .owner  = THIS_MODULE,
+   .open   = iommufd_fops_open,
+   .release= iommufd_fops_release,
+   .unlocked_ioctl = iommufd_fops_unl_ioctl,
+};
+
+static struct miscdevice iommu_misc_dev = {
+   .minor = MISC_DYNAMIC_MINOR,
+   .name = "iommu",
+   .fops = &iommufd_fops,
+   .nodena

[RFC 00/20] Introduce /dev/iommu for userspace I/O address space management

2021-09-19 Thread Liu Yi L
detail please refer to the design proposal [2]:

1. Move more vfio device types to iommufd:
* device which does no-snoop DMA
* software mdev
* PPC device
* platform device

2. New vfio device type
* hardware mdev/subdev (with PASID)

3. vDPA adoption

4. User-managed I/O page table
* ioasid nesting (hardware)
* ioasid nesting (software)
* pasid virtualization
o pdev (arm/amd)
o pdev/mdev which doesn't support enqcmd (intel)
o pdev/mdev which supports enqcmd (intel)
* I/O page fault (stage-1)

5. Miscellaneous
* I/O page fault (stage-2), for on-demand paging
* IOMMU dirty bit, for hardware-assisted dirty page tracking
* shared I/O page table (mm, ept, etc.)
* vfio/vdpa shim to avoid code duplication for legacy uAPI
* hardware-assisted vIOMMU

[1] https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/
[2] 
https://lore.kernel.org/kvm/bn9pr11mb5433b1e4ae5b0480369f97178c...@bn9pr11mb5433.namprd11.prod.outlook.com/

[Series Overview]
* Basic skeleton:
  0001-iommu-iommufd-Add-dev-iommu-core.patch

* VFIO PCI creates device-centric interface:
  0002-vfio-Add-vfio-device-class-for-device-nodes.patch
  0003-vfio-Add-vfio_-un-register_device.patch
  0004-iommu-Add-iommu_device_get_info-interface.patch
  0005-vfio-pci-Register-device-centric-interface.patch

* Bind device fd with iommufd:
  0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
  0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
  0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch

* IOASID allocation:
  0009-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
  0010-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch

* IOASID [de]attach:
  0011-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
  0012-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
  0013-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch

* /dev/iommu DMA (un)map:
  0014-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
  0015-iommu-iommufd-Report-iova-range-to-userspace.patch
  0016-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch

* Report the device info:
  0017-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch

* Add doc:
  0018-Doc-Add-documentation-for-dev-iommu.patch
 
* Basic skeleton:
  0001-iommu-iommufd-Add-dev-iommu-core.patch

* VFIO PCI creates device-centric interface:
  0002-vfio-Add-device-class-for-dev-vfio-devices.patch
  0003-vfio-Add-vfio_-un-register_device.patch
  0004-iommu-Add-iommu_device_get_info-interface.patch
  0005-vfio-pci-Register-device-to-dev-vfio-devices.patch

* Bind device fd with iommufd:
  0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
  0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
  0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch

* IOASID allocation:
  0009-iommu-Add-page-size-and-address-width-attributes.patch
  0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
  0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
  0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch

* IOASID [de]attach:
  0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
  0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
  0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch

* DMA (un)map:
  0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
  0017-iommu-iommufd-Report-iova-range-to-userspace.patch
  0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch

* Report the device info in vt-d driver to enable whole series:
  0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch

* Add doc:
  0020-Doc-Add-documentation-for-dev-iommu.patch

Complete code can be found in:
https://github.com/luxis1999/dev-iommu/commits/dev-iommu-5.14-rfcv1

Thanks for your time!

Regards,
Yi Liu
---

Liu Yi L (15):
  iommu/iommufd: Add /dev/iommu core
  vfio: Add device class for /dev/vfio/devices
  vfio: Add vfio_[un]register_device()
  vfio/pci: Register device to /dev/vfio/devices
  iommu/iommufd: Add iommufd_[un]bind_device()
  vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  vfio/type1: Export symbols for dma [un]map code sharing
  iommu/iommufd: Report iova range to userspace
  iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
  Doc: Add documentation for /dev/iommu

Lu Baolu (5):
  iommu: Add iommu_device_get_info interface
  iommu: Add iommu_device_init[exit]_user_dma interfaces
  iommu: Add page size and address width attributes
  iommu: Extend iommu_at[de]tach_device() for multiple devices group
  iommu/vt-d: Implement device_info iommu_ops callback

 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 183 ++
 drivers/iommu/Kconfig   |   1 +
 drivers/iommu/Makefile  |   1 +
 drivers

[PATCH v1 1/3] iommu/vt-d: Using pasid_pte_is_present() helper function

2021-08-16 Thread Liu Yi L
Use pasid_pte_is_present() for present bit check in 
intel_pasid_tear_down_entry().

Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/pasid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index c6cf44a6c923..02e10491184b 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -517,7 +517,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, 
struct device *dev,
if (WARN_ON(!pte))
return;
 
-   if (!(pte->val[0] & PASID_PTE_PRESENT))
+   if (!pasid_pte_is_present(pte))
return;
 
did = pasid_get_domain_id(pte);
-- 
2.25.1



[PATCH v1 3/3] iommu/vt-d: Fix Unexpected page request in Privilege Mode

2021-08-16 Thread Liu Yi L
This patch fixes the below error. This is due to improper iotlb invalidation
in intel_pasid_tear_down_entry().

[  180.187556] Unexpected page request in Privilege Mode
[  180.187565] Unexpected page request in Privilege Mode
[  180.279933] Unexpected page request in Privilege Mode
[  180.279937] Unexpected page request in Privilege Mode

Per chapter 6.5.3.3 of VT-d spec 3.3, when tearing down a pasid entry,
software should use a Domain-selective IOTLB flush if the PGTT of the pasid
entry is SL only or Nested, while for pasid entries whose PGTT is FL only
or PT, a PASID-based IOTLB flush is enough.
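
The rule boils down to a small decision on the PGTT field. The sketch below
models it in userspace C. The DEMO_PGTT_* values are assumed to match the
PASID_ENTRY_PGTT_* encodings in the kernel's pasid.h and should be checked
against the tree; the demo_* names are illustrative. The bit position of the
field follows pasid_pte_get_pgtt() from this patch.

```c
#include <stdint.h>

/* Assumed PGTT encodings; verify against drivers/iommu/intel/pasid.h. */
#define DEMO_PGTT_FL_ONLY	1
#define DEMO_PGTT_SL_ONLY	2
#define DEMO_PGTT_NESTED	3
#define DEMO_PGTT_PT		4

enum demo_flush {
	DEMO_FLUSH_PASID_BASED,		/* qi_flush_piotlb() path */
	DEMO_FLUSH_DOMAIN_SELECTIVE,	/* DMA_TLB_DSI_FLUSH path */
};

/* PGTT occupies bits 8:6 of the first quadword, as in the patch. */
static uint16_t demo_get_pgtt(uint64_t val0)
{
	return (val0 >> 6) & 0x7;
}

/* Pick the IOTLB flush type for tearing down an entry with this val[0]. */
static enum demo_flush demo_pick_flush(uint64_t val0)
{
	uint16_t pgtt = demo_get_pgtt(val0);

	if (pgtt == DEMO_PGTT_PT || pgtt == DEMO_PGTT_FL_ONLY)
		return DEMO_FLUSH_PASID_BASED;
	return DEMO_FLUSH_DOMAIN_SELECTIVE;
}
```

Note that the PGTT must be read before the entry is cleared, which is why
the patch samples it ahead of intel_pasid_clear_entry().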

Fixes: 1c4f88b7f1f9 ("iommu/vt-d: Shared virtual address in scalable mode")
Cc: Lu Baolu 
Signed-off-by: Kumar Sanjay K 
Signed-off-by: Liu Yi L 
Tested-by: Yi Sun 
---
 drivers/iommu/intel/pasid.c | 10 --
 drivers/iommu/intel/pasid.h |  5 +
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index b1d0c2945c9a..07c390aed1fe 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -511,7 +511,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, 
struct device *dev,
 u32 pasid, bool fault_ignore)
 {
struct pasid_entry *pte;
-   u16 did;
+   u16 did, pgtt;
 
pte = intel_pasid_get_entry(dev, pasid);
if (WARN_ON(!pte))
@@ -521,13 +521,19 @@ void intel_pasid_tear_down_entry(struct intel_iommu 
*iommu, struct device *dev,
return;
 
did = pasid_get_domain_id(pte);
+   pgtt = pasid_pte_get_pgtt(pte);
+
intel_pasid_clear_entry(dev, pasid, fault_ignore);
 
if (!ecap_coherent(iommu->ecap))
clflush_cache_range(pte, sizeof(*pte));
 
pasid_cache_invalidation_with_pasid(iommu, did, pasid);
-   qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
+
+   if (pgtt == PASID_ENTRY_PGTT_PT || pgtt == PASID_ENTRY_PGTT_FL_ONLY)
+   qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
+   else
+   iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
 
/* Device IOTLB doesn't need to be flushed in caching mode. */
if (!cap_caching_mode(iommu->cap))
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 5ff61c3d401f..637141d71092 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -99,6 +99,11 @@ static inline bool pasid_pte_is_present(struct pasid_entry 
*pte)
return READ_ONCE(pte->val[0]) & PASID_PTE_PRESENT;
 }
 
+static inline u16 pasid_pte_get_pgtt(struct pasid_entry *pte)
+{
+   return (READ_ONCE(pte->val[0]) >> 6) & 0x7;
+}
+
 extern unsigned int intel_pasid_max_id;
 int intel_pasid_alloc_table(struct device *dev);
 void intel_pasid_free_table(struct device *dev);
-- 
2.25.1



[PATCH v1 2/3] iommu/vt-d: Add present bit check in pasid entry setup helper functions

2021-08-16 Thread Liu Yi L
The helper functions are not capable of modifying pasid entries which
are still in use, so they should check the present bit first.
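
The guard added to each setup helper has the shape sketched below, written
as a userspace model. The entry layout (eight quadwords), the present-bit
position (bit 0) and the demo_* names are assumptions made for illustration;
the kernel helpers program many more fields than this.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PTE_PRESENT	1ULL	/* assumption: present is bit 0 */

/* Stand-in for struct pasid_entry (assumed 512 bits wide). */
struct demo_pasid_entry {
	uint64_t val[8];
};

/*
 * Program a fresh entry. Refuses with -EBUSY if the entry is still
 * present, mirroring the check this patch adds to the setup helpers.
 */
static int demo_setup_entry(struct demo_pasid_entry *pte, uint64_t val0)
{
	if (pte->val[0] & DEMO_PTE_PRESENT)
		return -EBUSY;

	memset(pte, 0, sizeof(*pte));
	pte->val[0] = val0 | DEMO_PTE_PRESENT;
	return 0;
}
```

A caller that wants to reprogram an entry would first tear it down, which
clears the present bit, and only then call the setup helper again.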

Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/pasid.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 02e10491184b..b1d0c2945c9a 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -534,6 +534,10 @@ void intel_pasid_tear_down_entry(struct intel_iommu 
*iommu, struct device *dev,
devtlb_invalidation_with_pasid(iommu, dev, pasid);
 }
 
+/*
+ * This function flushes cache for a newly setup pasid table entry.
+ * Caller of it should not modify the in-use pasid table entries.
+ */
 static void pasid_flush_caches(struct intel_iommu *iommu,
struct pasid_entry *pte,
   u32 pasid, u16 did)
@@ -585,6 +589,10 @@ int intel_pasid_setup_first_level(struct intel_iommu 
*iommu,
if (WARN_ON(!pte))
return -EINVAL;
 
+   /* Caller must ensure PASID entry is not in use. */
+   if (pasid_pte_is_present(pte))
+   return -EBUSY;
+
pasid_clear_entry(pte);
 
/* Setup the first level page table pointer: */
@@ -684,6 +692,10 @@ int intel_pasid_setup_second_level(struct intel_iommu 
*iommu,
return -ENODEV;
}
 
+   /* Caller must ensure PASID entry is not in use. */
+   if (pasid_pte_is_present(pte))
+   return -EBUSY;
+
pasid_clear_entry(pte);
pasid_set_domain_id(pte, did);
pasid_set_slptr(pte, pgd_val);
@@ -723,6 +735,10 @@ int intel_pasid_setup_pass_through(struct intel_iommu 
*iommu,
return -ENODEV;
}
 
+   /* Caller must ensure PASID entry is not in use. */
+   if (pasid_pte_is_present(pte))
+   return -EBUSY;
+
pasid_clear_entry(pte);
pasid_set_domain_id(pte, did);
pasid_set_address_width(pte, iommu->agaw);
-- 
2.25.1



[PATCH v1 0/3] Misc fixes to intel iommu driver

2021-08-16 Thread Liu Yi L
Hi,

This series includes two minor enhancements and one bug fix. Please have
a review.

Thanks,
Yi Liu
---

Liu Yi L (3):
  iommu/vt-d: Using pasid_pte_is_present() helper function
  iommu/vt-d: Add present bit check in pasid entry setup helper
functions
  iommu/vt-d: Fix Unexpected page request in Privilege Mode

 drivers/iommu/intel/pasid.c | 28 +---
 drivers/iommu/intel/pasid.h |  5 +
 2 files changed, 30 insertions(+), 3 deletions(-)

-- 
2.25.1



Re: Plan for /dev/ioasid RFC v2

2021-06-16 Thread Liu Yi L
Hi Alex,

On Wed, 16 Jun 2021 13:39:37 -0600, Alex Williamson wrote:

> On Wed, 16 Jun 2021 06:43:23 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Wednesday, June 16, 2021 12:12 AM
> > > 
> > > On Tue, 15 Jun 2021 02:31:39 +0000
> > > "Tian, Kevin"  wrote:
> > > 
> > > > > From: Alex Williamson 
> > > > > Sent: Tuesday, June 15, 2021 12:28 AM
> > > > >
> > > > [...]
> > > > > > IOASID. Today the group fd requires an IOASID before it hands out a
> > > > > > device_fd. With iommu_fd the device_fd will not allow IOCTLs until 
> > > > > > it
> > > > > > > has a blocked DMA IOASID and is successfully joined to an 
> > > > > > > iommu_fd.
> > > > >
> > > > > Which is the root of my concern.  Who owns ioctls to the device fd?
> > > > > It's my understanding this is a vfio provided file descriptor and it's
> > > > > therefore vfio's responsibility.  A device-level IOASID interface
> > > > > therefore requires that vfio manage the group aspect of device access.
> > > > > AFAICT, that means that device access can therefore only begin when 
> > > > > all
> > > > > devices for a given group are attached to the IOASID and must halt for
> > > > > all devices in the group if any device is ever detached from an 
> > > > > IOASID,
> > > > > even temporarily.  That suggests a lot more oversight of the IOASIDs 
> > > > > by
> > > > > vfio than I'd prefer.
> > > > >
> > > >
> > > > This is possibly the point that is worthy of more clarification and
> > > > alignment, as it sounds like the root of controversy here.
> > > >
> > > > I feel the goal of vfio group management is more about ownership, i.e.
> > > > all devices within a group must be assigned to a single user. Following
> > > > the three rules defined by Jason, what we really care is whether a group
> > > > of devices can be isolated from the rest of the world, i.e. no access to
> > > > memory/device outside of its security context and no access to its
> > > > security context from devices outside of this group. This can be 
> > > > achieved
> > > > as long as every device in the group is either in block-DMA state when
> > > > it's not attached to any security context or attached to an IOASID 
> > > > context
> > > > in IOMMU fd.
> > > >
> > > > As long as group-level isolation is satisfied, how devices within a 
> > > > group
> > > > are further managed is decided by the user (unattached, all attached to
> > > > same IOASID, attached to different IOASIDs) as long as the user
> > > > understands the implication of lacking of isolation within the group. 
> > > > This
> > > > is what a device-centric model comes to play. Misconfiguration just 
> > > > hurts
> > > > the user itself.
> > > >
> > > > If this rationale can be agreed, then I didn't see the point of having 
> > > > VFIO
> > > > to mandate all devices in the group must be attached/detached in
> > > > lockstep.
> > > 
> > > In theory this sounds great, but there are still too many assumptions
> > > and too much hand waving about where isolation occurs for me to feel
> > > like I really have the complete picture.  So let's walk through some
> > > examples.  Please fill in and correct where I'm wrong.
> > 
> > Thanks for putting these examples. They are helpful for clearing the 
> > whole picture.
> > 
> > Before filling in let's first align on what is the key difference between
> > current VFIO model and this new proposal. With this comparison we'll
> > know which of following questions are answered with existing VFIO
> > mechanism and which are handled differently.
> > 
> > With Yi's help we figured out the current mechanism:
> > 
> > 1) vfio_group_viable. The code comment explains the intention clearly:
> > 
> > --
> > * A vfio group is viable for use by userspace if all devices are in
> >  * one of the following states:
> >  *  - driver-less
> >  *  - bound to a vfio driver
> >  *  - bound to an otherwise allowed driver
> >  *  - a PCI interconnect device
> > --
> > 
> > Note this check is not related to an IOMMU security context.  
> 
> Because this is a pre-requisite for imposing that IOMMU security
> context.
>  
> > 2) vfio_iommu_group_notifier. When an IOMMU_GROUP_NOTIFY_
> > BOUND_DRIVER event is notified, vfio_group_viable is re-evaluated.
> > If the affected group was previously viable but now becomes not 
> > viable, BUG_ON() as it implies that this device is bound to a non-vfio 
> > driver which breaks the group isolation.  
> 
> This notifier action is conditional on there being users of devices
> within a secure group IOMMU context.
> 
> > 3) vfio_group_get_device_fd. User can acquire a device fd only after
> > a) the group is viable;
> > b) the group is attached to a container;
> > c) iommu is set on the container (implying a security context
> > established);  
> 
> The order is actually b) a) c) but arguably b) is a no-op until:
> 
> d) a device fd is provided to the user

Per the code in QEMU vfio_get_group(). The 

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Liu, Yi L
> From: Shenming Lu 
> Sent: Friday, June 4, 2021 10:03 AM
> 
> On 2021/6/4 2:19, Jacob Pan wrote:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
> 
> > wrote:
> >
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>
> >>>> The drivers register per page table fault handlers to /dev/ioasid which
> >>>> will then register itself to iommu core to listen and route the per-
> >>>> device I/O page faults.
> >>>
> >>> I'm still confused why drivers need fault handlers at all?
> >>
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>
> > I would think IOASID_BIND_PGTABLE is where the fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> > anyway.
> 
> Yeah, I also proposed this before, registering the handler in the
> BIND_PGTABLE ioctl does make sense for the guest page faults. :-)
> 
> But how about the page faults from the GPA address space (its page table is
> mapped through the MAP_DMA ioctl)? From your point of view, it seems
> that we should register the handler for the GPA address space in the (first)
> MAP_DMA ioctl.

Under the new proposal, I think the page fault handler is also registered
per ioasid object. The difference from the guest page table case is that
there is no need to inject the fault into the VM.
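
As a rough illustration only (not part of the proposal; every name below is hypothetical), per-ioasid handler registration could be modelled like this. The handler's return value says whether the fault still has to be injected into the VM, which is the only difference between the two cases:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_IOASID 16 /* hypothetical table size */

struct io_fault {
	uint32_t ioasid; /* faults are reported per IOASID */
	uint64_t addr;
};

/* Returns 1 if the fault must be injected into the VM, 0 if the host
 * resolved it (e.g. on-demand paging of the GPA address space). */
typedef int (*fault_handler_t)(const struct io_fault *f);

static fault_handler_t handlers[MAX_IOASID];

/* One handler per ioasid object; a second registration is rejected. */
int register_fault_handler(uint32_t ioasid, fault_handler_t h)
{
	if (ioasid >= MAX_IOASID || handlers[ioasid] != NULL)
		return -1;
	handlers[ioasid] = h;
	return 0;
}

/* Guest page table case: the fault is relayed to the vIOMMU. */
int guest_pgtable_fault(const struct io_fault *f)
{
	(void)f;
	return 1;
}

/* GPA address space case: handled on the host, nothing to inject. */
int gpa_fault(const struct io_fault *f)
{
	(void)f;
	return 0;
}

/* Route a fault to the handler of the IOASID it was reported on. */
int dispatch_fault(const struct io_fault *f)
{
	if (f->ioasid >= MAX_IOASID || handlers[f->ioasid] == NULL)
		return -1;
	return handlers[f->ioasid](f);
}
```

Where the registration happens (ALLOC, BIND_PGTABLE or the first MAP_DMA) is left open here; the sketch only shows that the lookup key is the ioasid.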
 
Regards,
Yi Liu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Liu Yi L
On Tue, 1 Jun 2021 10:36:36 +0800, Jason Wang wrote:

> On 2021/5/31 4:41 PM, Liu Yi L wrote:
> >> I guess VFIO_ATTACH_IOASID will fail if the underlying layer doesn't support
> >> hardware nesting. Or is there a way to detect the capability before?
> > I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
> > is not able to support nesting, then it should fail.
> >  
> >> I think GET_INFO only works after the ATTACH.  
> > yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
> > gpa_ioasid and check if nesting is supported or not. right?  
> 
> 
> Some more questions:
> 
> 1) Is the handle returned by IOASID_ALLOC an fd?

it's an ID so far in this proposal.

> 2) If yes, what's the reason for not simply using the fd opened from
> /dev/ioas. (This is the question that is not answered) and what happens
> if we call GET_INFO for the ioasid_fd?
> 3) If not, how does GET_INFO work?

Oh, I missed this question in my prior reply. Personally, no special reason
yet. But using an ID may give us the opportunity to customize the management
of the handle. First, better lookup efficiency by using an xarray to
store the allocated IDs. Second, we could categorize the allocated IDs
(parent or nested). GET_INFO just works with an input FD and an ID.
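
A minimal userspace model of that idea, assuming a small fixed array in place of the xarray the kernel would use; all struct and function names here are made up for illustration and are not the proposed uAPI:

```c
#include <stddef.h>
#include <stdint.h>

#define IDS_PER_FD 8 /* hypothetical limit, stands in for an xarray */

enum ioasid_kind { IOASID_KIND_PARENT, IOASID_KIND_NESTED };

struct ioasid_entry {
	int used;
	enum ioasid_kind kind; /* lets the core categorize the IDs */
	uint32_t parent;       /* meaningful only for nested IDs */
};

/* Per-FD context: the ID is only meaningful together with its FD. */
struct ioasid_fd_ctx {
	struct ioasid_entry ids[IDS_PER_FD];
};

/* Hand out the lowest free ID, or -1 if the table is full. */
int ioasid_alloc(struct ioasid_fd_ctx *ctx, enum ioasid_kind kind,
		 uint32_t parent)
{
	for (int id = 0; id < IDS_PER_FD; id++) {
		if (!ctx->ids[id].used) {
			ctx->ids[id].used = 1;
			ctx->ids[id].kind = kind;
			ctx->ids[id].parent = parent;
			return id;
		}
	}
	return -1;
}

/* GET_INFO analogue: input is the FD context plus an ID. */
const struct ioasid_entry *ioasid_get_info(const struct ioasid_fd_ctx *ctx,
					   uint32_t id)
{
	if (id >= IDS_PER_FD || !ctx->ids[id].used)
		return NULL;
	return &ctx->ids[id];
}
```

With an fd per IOASID the category and lookup would have to live behind each file instead; with IDs they sit in one per-FD table.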

> 
> >  
> >>>   /* Bind guest I/O page table  */
> >>>   bind_data = {
> >>>   .ioasid = giova_ioasid;
> >>>   .addr   = giova_pgtable;
> >>>   // and format information
> >>>   };
> >>>   ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> >>>
> >>>   /* Invalidate IOTLB when required */
> >>>   inv_data = {
> >>>   .ioasid = giova_ioasid;
> >>>   // granular information
> >>>   };
> >>>   ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
> >>>
> >>>   /* See 5.6 for I/O page fault handling */
> >>>   
> >>> 5.5. Guest SVA (vSVA)
> >>> ++
> >>>
> >>> After booting, the guest further creates a GVA address space (gpasid1) on
> >>> dev1. Dev2 is not affected (still attached to giova_ioasid).
> >>>
> >>> As explained in section 4, user should avoid exposing ENQCMD on both
> >>> pdev and mdev.
> >>>
> >>> The sequence applies to all device types (being pdev or mdev), except
> >>> one additional step to call KVM for ENQCMD-capable mdev:  
> >> My understanding is ENQCMD is Intel specific and not a requirement for
> >> having vSVA.  
> > ENQCMD is not really Intel specific although only Intel supports it today.
> > The PCIe DMWr capability is the capability for software to enumerate the
> > ENQCMD support in device side. yes, it is not a requirement for vSVA. They
> > are orthogonal.  
> 
> 
> Right, then it's better to mention DMWr instead of a vendor specific 
> instruction in a general framework like ioasid.

good suggestion. :)

-- 
Regards,
Yi Liu

Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Liu Yi L
On Fri, 28 May 2021 20:36:49 -0300, Jason Gunthorpe wrote:

> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +
> > 
> > /*
> >   * Check whether an uAPI extension is supported. 
> >   *
> >   * This is for FD-level capabilities, such as locked page pre-registration.
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION  _IO(IOASID_TYPE, IOASID_BASE + 0)  
> 
>  
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so sub-
> >   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   * - vaddr;
> >   * - size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY  _IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
> 
> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID. 
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address 
> >   * space. Each IOASID is associated with a single I/O page table. User 
> >   * must call this ioctl to get an IOASID for every I/O address space that is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is 
> >   * attached to a device. Once attached, an empty I/O page table is 
> >   * bound with the IOMMU then the user could use either DMA mapping 
> >   * or pgtable binding commands to manage this I/O page table.  
> 
> Can the IOASID be populated before being attached?

perhaps a MAP/UNMAP operation on a gpa_ioasid?

> 
> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)  
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
>
> 
> > /*
> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   * - VFIO type1 map/unmap;
> >   * - pgtable/pasid_table binding
> >   * - hardware nesting vs. software nesting;
> >   * - ...
> >   *
> >   * Related attributes:
> >   * - supported page sizes, reserved IOVA ranges (DMA mapping);
> >   * - vendor pgtable formats (pgtable binding);
> >   * - number of child IOASIDs (nesting);
> >   * - ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.  
> 
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

But an IOASID is just a software handle before it is attached to a specific
device. E.g. before attaching to a device, we have no idea about the
supported page sizes of the underlying iommu, coherency, etc.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
> 
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
> 
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.

Actually, we have only two kinds of IOASIDs so far. One is used as a parent
and the other as a child. For the child, this proposal has defined
IOASID_CREATE_NESTING. But yeah, I think it is doable to indicate the type
in ALLOC. For a child IOASID, though, one more step is required to configure
its parent IOASID, or such info may be included in the ioctl input as well.
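
To make the second option concrete, here is a sketch of what an ALLOC input carrying both the type and the parent might look like. The struct layout, the IOASID_ALLOC_* constants and IOASID_INVALID are assumptions for illustration, not something defined in the proposal:

```c
#include <stdint.h>

/* Hypothetical values; the proposal does not define them. */
#define IOASID_ALLOC_PARENT 0u
#define IOASID_ALLOC_CHILD  1u
#define IOASID_INVALID      UINT32_MAX

struct ioasid_alloc_args {
	uint32_t type;   /* IOASID_ALLOC_PARENT or IOASID_ALLOC_CHILD */
	uint32_t parent; /* parent IOASID for a child, else IOASID_INVALID */
	uint32_t format; /* page table format, relevant for a child only */
};

/* Validate the intention stated at allocation time:
 * a parent must not name a parent, a child must name one. */
int ioasid_alloc_check(const struct ioasid_alloc_args *a)
{
	switch (a->type) {
	case IOASID_ALLOC_PARENT:
		return a->parent == IOASID_INVALID ? 0 : -1;
	case IOASID_ALLOC_CHILD:
		return a->parent != IOASID_INVALID ? 0 : -1;
	default:
		return -1;
	}
}
```

Folding the parent into ALLOC would make a separate IOASID_CREATE_NESTING step unnecessary for this case.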
 
> device bind should fail if the device somehow isn't compatible with
> the scheme the user is trying to use.

Yeah, I guess you mean to fail the device attach when the IOASID is a
nesting IOASID but the device is behind an iommu without nesting support,
right?
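
In other words, something along these lines at attach time. This is a sketch under the assumption that the IOASID already carries its intended type; both structs are hypothetical:

```c
#include <errno.h>
#include <stdbool.h>

/* What the IOMMU behind the device can do (hypothetical summary). */
struct iommu_caps {
	bool nesting;
};

/* What the user declared when creating the IOASID. */
struct ioasid_desc {
	bool is_nesting;
};

/* Fail the attach when the device's IOMMU cannot honor the scheme
 * the IOASID was created for. */
int ioasid_attach_check(const struct ioasid_desc *ioasid,
			const struct iommu_caps *caps)
{
	if (ioasid->is_nesting && !caps->nesting)
		return -EOPNOTSUPP;
	return 0;
}
```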

> 
> > /*
> >   * Map/unmap process virtual addresses to I/O virtual 

Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Liu Yi L
On Fri, 28 May 2021 10:24:56 +0800, Jason Wang wrote:

> On 2021/5/27 3:58 PM, Tian, Kevin wrote:
> > /dev/ioasid provides a unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> > etc.) are expected to use this interface instead of creating their own
> > logic to isolate untrusted device DMAs initiated by userspace.
> 
> 
> Not a native speaker but /dev/ioas seems better?

We are open on it. We use /dev/ioasid just because it was the name used in
the previous discussion. ^_^

> 
> >
> > This proposal describes the uAPI of /dev/ioasid and also sample sequences
> > with VFIO as example in typical usages. The driver-facing kernel API
> > provided by the iommu layer is still TBD, which can be discussed after
> > consensus is made on this uAPI.
> >
> > It's based on a lengthy discussion starting from here:
> > https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/
> >
> > It ends up to be a long writing due to many things to be summarized and
> > non-trivial effort required to connect them into a complete proposal.
> > Hope it provides a clean base to converge.
> >
> > TOC
> > 
> > 1. Terminologies and Concepts
> > 2. uAPI Proposal
> >  2.1. /dev/ioasid uAPI
> >  2.2. /dev/vfio uAPI
> >  2.3. /dev/kvm uAPI
> > 3. Sample structures and helper functions
> > 4. PASID virtualization
> > 5. Use Cases and Flows
> >  5.1. A simple example
> >  5.2. Multiple IOASIDs (no nesting)
> >  5.3. IOASID nesting (software)
> >  5.4. IOASID nesting (hardware)
> >  5.5. Guest SVA (vSVA)
> >  5.6. I/O page fault
> >  5.7. BIND_PASID_TABLE
> > 
> >
[...]
> >
> > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > It doesn't include any device routing information, which is only
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). If there is a
> > need of further relaying this fault into the guest, the user is responsible
> > for identifying the device attached to this IOASID (randomly pick one if
> > multiple attached devices) and then generating a per-device virtual I/O
> > page fault into guest. Similarly the iotlb invalidation uAPI describes the
> > granularity in the I/O address space (all, or a range), different from the
> > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> >
> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure.  
> 
> 
> I'm not sure this is true for all archs.

Today, yes. And I echo JasonG's comment on it.

> 
> >   Some platforms implement the PASID table in the guest
> > physical space (GPA), expecting it managed by the guest. The guest
> > PASID table is bound to the IOMMU also by attaching to an IOASID,
> > representing the per-RID vPASID space.
> >
[...]
> >
> > /*
> >* Get information about an I/O address space
> >*
> >* Supported capabilities:
> >*- VFIO type1 map/unmap;
> >*- pgtable/pasid_table binding
> >*- hardware nesting vs. software nesting;
> >*- ...
> >*
> >* Related attributes:
> >*- supported page sizes, reserved IOVA ranges (DMA mapping);
> >*- vendor pgtable formats (pgtable binding);
> >*- number of child IOASIDs (nesting);
> >*- ...
> >*
> >* Above information is available only after one or more devices are
> >* attached to the specified IOASID. Otherwise the IOASID is just a
> >* number w/o any capability or attribute.
> >*
> >* Input parameters:
> >*- u32 ioasid;
> >*
> >* Output parameters:
> >*- many. TBD.
> >*/
> > #define IOASID_GET_INFO _IO(IOASID_TYPE, IOASID_BASE + 5)
> >
> >
> > /*
> >* Map/unmap process virtual addresses to I/O virtual addresses.
> >*
> >* Provide VFIO type1 equivalent semantics. Start with the same
> >* restriction e.g. the unmap size should match those used in the
> >* original mapping call.
> >*
> >* If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> >* must be already in the preregistered list.
> >*
> >* Input parameters:
> >*- u32 ioasid;
> >*- refer to vfio_iommu_type1_dma_{un}map
> >*
> >* Return: 0 on success, -errno on failure.
> >*/
> > #define IOASID_MAP_DMA  _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> >
> > /*
> >* Create a nesting IOASID (child) on an existing IOASID (parent)
> >*
> >* IOASIDs can be nested together, implying that the output address
> >* from one I/O page table (child) must be further translated by
> >* another I/O page table (parent).
> >*
> >* As the child adds essentially another reference to 

Re: [PATCH 3/6] vfio: remove the unused mdev iommu hook

2021-05-14 Thread Liu Yi L
Morning Jason,

On Fri, 14 May 2021 10:39:39 -0300, Jason Gunthorpe wrote:

> On Fri, May 14, 2021 at 01:17:23PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, May 13, 2021 8:01 PM
> > > 
> > > On Thu, May 13, 2021 at 03:28:52AM +, Tian, Kevin wrote:
> > >   
> > > > Are you specially concerned about this iommu_device hack which
> > > > directly connects mdev_device to iommu layer or the entire removed
> > > > logic including the aux domain concept? For the former we are now
> > > > following up the referred thread to find a clean way. But for the latter
> > > > > we feel it's still necessary regardless of how iommu interface is
> > > > > redesigned to support device connection from the upper level driver.
> > > > > The reason is that with mdev or subdevice one physical device could be
> > > > > attached to multiple domains now. There could be a primary domain with
> > > > > DOMAIN_DMA type for DMA_API use by parent driver itself, and multiple
> > > > > auxiliary domains with DOMAIN_UNMANAGED types for subdevices assigned
> > > > > to different VMs.
> > > 
> > > Why do we need more domains than just the physical domain for the
> > > parent? How does auxdomain appear in /dev/ioasid?
> > >   
> > 
> > Another simple reason. Say you have 4 mdevs each from a different 
> > parent are attached to an ioasid. If only using physical domain of the 
> > parent + PASID it means there are 4 domains (thus 4 page tables) under 
> > this IOASID thus every dma map operation must be replicated in all
> > 4 domains which is really unnecessary. Having the domain created
> > with ioasid and allow a device attaching to multiple domains is much
> > cleaner for the upper-layer drivers to work with iommu interface.  
> 
> Eh? That sounds messed up.
> 
> The IOASID is the page table. If you have one IOASID and you attach it
> to 4 IOMMU routings (be it pasid, rid, whatever) then there should
> only ever be one page table.

Yes, ioasid is the page table. But if we want to let the 4 mdevs share the
same page table, it would be natural to let them share a domain. Since an
mdev_device is not a hw device, we should not let it participate in the
IOMMU. That is how we got the aux-domain concept: mdev(RID#+PASID) is
attached to an aux-domain. This solution also fits the hybrid cases, e.g.
when both a PF(RID#1) and an mdev(RID#2+PASID) are assigned to an ioasid,
they should share a page table as well, right? Surely we cannot attach the
PF(RID#1) to the domain of the mdev's parent device(RID#2). A good way is to
attach the PF(RID#1) and the mdev(RID#2+PASID) to a single domain. This
domain is the primary domain for the PF(RID#1) but an aux-domain for the
mdev's parent(RID#2).
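
The hybrid case can be modelled roughly as below. This is a toy data structure with made-up names, not the iommu core's domain representation: one domain (one page table) records every routing attached to it, and whether it is primary or auxiliary for a given RID depends on whether that RID was attached with a PASID.

```c
#include <stdint.h>

#define NO_PASID     UINT32_MAX /* attachment by RID only */
#define MAX_ATTACHED 4          /* hypothetical limit */

struct attachment {
	uint16_t rid;
	uint32_t pasid; /* NO_PASID for a whole-RID attachment */
};

/* One domain == one page table shared by all attachments. */
struct domain {
	struct attachment att[MAX_ATTACHED];
	int n;
};

int domain_attach(struct domain *d, uint16_t rid, uint32_t pasid)
{
	if (d->n >= MAX_ATTACHED)
		return -1;
	d->att[d->n].rid = rid;
	d->att[d->n].pasid = pasid;
	d->n++;
	return 0;
}

/* 0: primary domain for this RID, 1: aux-domain (RID+PASID),
 * -1: RID not attached to this domain at all. */
int domain_is_aux_for(const struct domain *d, uint16_t rid)
{
	for (int i = 0; i < d->n; i++) {
		if (d->att[i].rid == rid)
			return d->att[i].pasid != NO_PASID;
	}
	return -1;
}
```

Attaching the PF as (RID#1, NO_PASID) and the mdev as (RID#2, some PASID) gives one page table, with no need to replicate map operations across per-parent domains.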

-- 
Regards,
Yi Liu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Liu Yi L
On Tue, 11 May 2021 09:10:03 +, Tian, Kevin wrote:

> > From: Jason Gunthorpe
> > Sent: Monday, May 10, 2021 8:37 PM
> >   
> [...] 
> > > gPASID!=hPASID has a problem when assigning a physical device which
> > > supports both shared work queue (ENQCMD with PASID in MSR)
> > > and dedicated work queue (PASID in device register) to a guest
> > > process which is associated to a gPASID. Say the host kernel has set up
> > > the hPASID entry with nested translation through /dev/ioasid. For
> > > shared work queue the CPU is configured to translate gPASID in MSR
> > > into **hPASID** before the payload goes out to the wire. However
> > > for dedicated work queue the device MMIO register is directly mapped
> > > to and programmed by the guest, thus containing a **gPASID** value
> > > implying DMA requests through this interface will hit IOMMU faults
> > > due to invalid gPASID entry. Having gPASID==hPASID is a simple
> > > workaround here. mdev doesn't have this problem because the
> > > PASID register is in emulated control-path thus can be translated
> > > to hPASID manually by mdev driver.  
> > 
> > This all must be explicit too.
> > 
> > If a PASID is allocated and it is going to be used with ENQCMD then
> > everything needs to know it is actually quite different than a PASID
> > that was allocated to be used with a normal SRIOV device, for
> > instance.
> > 
> > The former case can accept that the guest PASID is virtualized, while
> > the latter can not.
> > 
> > This is also why PASID per RID has to be an option. When I assign a
> > full SRIOV function to the guest then that entire RID space needs to
> > also be assigned to the guest. Upon migration I need to take all the
> > physical PASIDs and rebuild them in another hypervisor exactly as is.
> > 
> > If you force all RIDs into a global PASID pool then normal SRIOV
> > migration w/PASID becomes impossible. ie ENQCMD breaks everything else
> > that should work.
> > 
> > This is why you need to sort all this out and why it feels like some
> > of the specs here have been mis-designed.
> > 
> > I'm not sure carving out ranges is really workable for migration.
> > 
> > I think the real answer is to carve out entire RIDs as being in the
> > global pool or not. Then the ENQCMD HW can be bundled together and
> > everything else can live in the natural PASID per RID world.
> >   
> 
> OK. Here is the revised scheme, making it explicit.
> 
> There are three scenarios to be considered:
> 
> 1) SR-IOV (AMD/ARM):
>   - "PASID per RID" with guest-allocated PASIDs;
>   - PASID table managed by guest (in GPA space);
>   - the entire PASID space delegated to guest;
>   - no need to explicitly register guest-allocated PASIDs to host;
>   - uAPI for attaching PASID table:
> 
> // set to "PASID per RID"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> 
> // When Qemu captures a new PASID table through vIOMMU;
> pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);
> 
> // Set the PASID table to the RID associated with pasidtbl_ioasid;
> ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);
> 
> 2) SR-IOV, no ENQCMD (Intel):
>   - "PASID per RID" with guest-allocated PASIDs;
>   - PASID table managed by host (in HPA space);
>   - the entire PASID space delegated to guest too;
>   - host must be explicitly notified for guest-allocated PASIDs;
>   - uAPI for binding user-allocated PASIDs:
> 
> // set to "PASID per RID"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> 
> // When Qemu captures a new PASID allocated through vIOMMU;

Is this achieved by VCMD or by capturing guest's PASID cache invalidation?

> pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);
> 
> // Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
> // &pasid being a pointer due to a requirement in scenario-3
> ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid);
> 
> // Set guest page table to the RID+pasid associated to pgtbl_ioasid
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
> 
> 3) SRIOV, ENQCMD (Intel):
>   - "PASID global" with host-allocated PASIDs;
>   - PASID table managed by host (in HPA space);
>   - all RIDs bound to this ioasid_fd use the global pool;
>   - however, exposing global PASID into guest breaks migration;
>   - hybrid scheme: split local PASID range and global PASID range;
>   - force guest to use only local PASID range (through vIOMMU);
>   - for ENQCMD, configure CPU to translate local->global;
>   - for non-ENQCMD, setup both local/global pasid entries;
>   - uAPI for range split and CPU pasid mapping:
> 
> // set to "PASID global"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> 
> // split local/global range, applying to 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Liu Yi L
Hi Jason,

On Wed, 5 May 2021 19:21:20 -0300, Jason Gunthorpe wrote:

> On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe  wrote:
> >   
> > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:
> > >   
> > > > Global and pluggable are for slightly separate reasons.
> > > > - We need global PASID on VT-d in that we need to support shared
> > > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then
> > > > assigned to two VMs. Each VM uses its private guest PASID to submit
> > > > work but each guest PASID must be translated to a global (system-wide)
> > > > host PASID to avoid conflict. Also, since PASID table storage is per
> > > > PF, if two mdevs of the same PF are assigned to different VMs, the
> > > > PASIDs must be unique.
> > > 
> > > From a protocol perspective each RID has a unique PASID table, and
> > > RIDs can have overlapping PASIDs.
> > >   
> > True, per RID or per PF as I was referring to.
> >   
> > > Since your SWQ is connected to a single RID the requirement that
> > > PASIDs are unique to the RID ensures they are sufficiently unique.
> > >   
> > True, but one process can submit work to multiple mdevs from different
> > RIDs/PFs. One process uses one PASID and PASID translation table is per VM.
> > The same PASID is used for all the PASID tables of each RID.  
> 
> If the model is "assign this PASID to this RID" then yes, there is a
> big problem keeping everything straight that can only be solved with a
> global table.
> 
> But if the model is "give me a PASID for this RID" then it isn't such
> a problem.

Let me double confirm that I'm understanding you correctly. Your suggestion
is to have a per-RID PASID namespace, which can be maintained by the IOMMU
driver, right? Take native SVM usage as an example: every time a process is
bound to a device, a PASID within this RID will be allocated. Am I correct
so far?

If yes, then there is a case in which IOTLB efficiency is really low. Let's
say there is a process bound to multiple devices (RIDs), with a different
PASID allocated for each RID. As most vendors do, the PASID is used to tag
IOTLB entries. So in such a case, there will be multiple IOTLB entries for a
single VA->PA mapping, and the number of such duplicate IOTLB entries
increases linearly with the number of devices. That does not seem good from
a performance perspective.
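
A toy calculation of the concern, under the simplifying assumption stated above that an IOTLB entry is tagged by PASID only (real hardware tagging may differ). The function name is made up; it just counts the distinct tags one VA->PA mapping ends up cached under:

```c
#include <stddef.h>
#include <stdint.h>

/* pasids[i] is the PASID the process got on device (RID) i.
 * Returns how many differently tagged IOTLB entries a single
 * VA->PA mapping can occupy, i.e. the number of distinct PASIDs. */
size_t iotlb_entries_for_one_mapping(const uint32_t *pasids, size_t n)
{
	size_t distinct = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		/* Was this tag already seen on an earlier device? */
		for (j = 0; j < i; j++) {
			if (pasids[j] == pasids[i])
				break;
		}
		if (j == i)
			distinct++;
	}
	return distinct;
}
```

With four devices and per-RID allocation (say PASIDs 10, 11, 12, 13) the result is 4; with one global PASID used on all four it is 1.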

> 
> Basically trying to enforce a uniform PASID for an IOASID across all
> RIDs attached to it is not such a nice choice.
> 
> > > That is fine, but all this stuff should be inside the Intel vIOMMU
> > > driver not made into a global resource of the entire iommu subsystem.
> > >   
> > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic
> > code need to have this option. I guess you are saying we should also have a
> > per RID allocation option in addition to global?  
> 
> There always has to be a RID involvement for the PASID, for security,
> this issue really boils down to where the PASID lives.
> 
> If you need the PASID attached to the IOASID then it has to be global
> because the IOASID can be attached to any RID and must keep the same
> PASID.
> 
> If the PASID is learned when the IOASID is attached to a RID then the
> PASID is more flexible and isn't attached to the IOASID.
> 
> Honestly I'm a little leery to bake into a UAPI a specific HW choice
> that Intel made here.
> 
> I would advise making the "attach a global PASID to this IOASID"
> operation explicit and opt-in for the cases that actually need it.
> 
> Which implies the API to the iommu driver should be more like:
> 
>   'assign an IOASID to this RID and return the PASID'
>   'reserve a PASID from every RID'
>   'assign an IOASID to this RID and use this specific PASID'
> 
> In all cases the scope of those operations is completely local to a
> certain IOMMU driver - 'reserve a PASID from every RID' is really
> every RID that driver can operate on.

Also, this reservation will fail if the PASID happens to be occupied
by previous usage. As the PASID translation table is per-VM, ENQCMD in a VM
will be a problem under such a PASID management model.
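
The failure mode can be shown with a small model, assuming each RID keeps its own PASID bitmap; the sizes and names are hypothetical. Reserving the same value on every RID is all-or-nothing, so a single earlier per-RID allocation of that value makes the whole reservation fail:

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_RIDS   3 /* RIDs this IOMMU driver operates on (hypothetical) */
#define NR_PASIDS 8 /* per-RID PASID space (hypothetical) */

/* pasid_used[rid][pasid]: per-RID PASID namespace. */
bool pasid_used[NR_RIDS][NR_PASIDS];

/* 'reserve a PASID from every RID': succeeds only if the value is
 * free in all per-RID namespaces; otherwise nothing is changed. */
int reserve_pasid_from_every_rid(uint32_t pasid)
{
	if (pasid >= NR_PASIDS)
		return -1;

	for (int rid = 0; rid < NR_RIDS; rid++) {
		if (pasid_used[rid][pasid])
			return -1; /* occupied by previous usage */
	}
	for (int rid = 0; rid < NR_RIDS; rid++)
		pasid_used[rid][pasid] = true;
	return 0;
}
```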

> 
> So it is hard to see why the allocator should be a global resource and
> not something that is part of the iommu driver exclusively.
> 
> Jason

-- 
Regards,
Yi Liu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-22 Thread Liu Yi L
On Wed, 21 Apr 2021 13:33:12 -0600, Alex Williamson wrote:

> On Wed, 21 Apr 2021 14:52:03 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Apr 21, 2021 at 10:54:51AM -0600, Alex Williamson wrote:
> >   
> > > That's essentially replacing vfio-core, where I think we're more
> > 
> > I am only talking about /dev/vfio here which is basically the IOMMU
> > interface part.
> > 
> > I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
> > /dev/{ioasid,vfio} to the VFIO group and all the group and device
> > logic stays inside VFIO.  
> 
> But that group and device logic is also tied to the container, where
> the IOMMU backend is the interchangeable thing that provides the IOMMU
> manipulation for that container.  If you're using
> VFIO_GROUP_SET_CONTAINER to associate a group to a /dev/ioasid, then
> you're really either taking that group outside of vfio or you're
> re-implementing group management in /dev/ioasid.  I'd expect the
> transition point at VFIO_SET_IOMMU.

Per my understanding, transitioning at the VFIO_SET_IOMMU point makes more
sense, as VFIO can still keep the group and device logic, which is the key
concept of group-granularity isolation for userspace direct access.

-- 
Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-21 Thread Liu, Yi L
Hi Alex,

> From: Alex Williamson 
> Sent: Friday, April 16, 2021 11:46 PM
[...]
> > This is not a tactic or excuse for not working on the new /dev/ioasid
> > interface. In fact, I believe we can benefit from the lessons learned
> > while completing the existing. This will give confidence to the new
> > interface. Thoughts?
> 
> I understand a big part of Jason's argument is that we shouldn't be in
> the habit of creating duplicate interfaces; we should create one
> well-designed interface to share among multiple subsystems.  As new users
> have emerged, our solution needs to change to a common one rather than
> a VFIO specific one.  The IOMMU uAPI provides an abstraction, but at
> the wrong level, requiring userspace interfaces for each subsystem.
> 
> Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
> that changes if we proceed to enable the interfaces to tunnel it
> through VFIO.
> 
> The logical answer would therefore be that we don't make that
> commitment to the IOMMU uAPI if we believe now that it's fundamentally
> flawed.
> 
> Ideally this new /dev/ioasid interface, and making use of it as a VFIO
> IOMMU backend, should replace type1. 

Yeah, just to double check: I think this also requires a new set of uAPIs
(e.g. new MAP/UNMAP), which means the current VFIO IOMMU type1 related ioctls
would be deprecated in the future, right?

> Type1 will live on until that
> interface gets to parity, at which point we may deprecate type1, but it
> wouldn't make sense to continue to expand type1 in the same direction
> as we intend /dev/ioasid to take over in the meantime, especially if it
> means maintaining an otherwise dead uAPI.  Thanks,

understood.

Regards,
Yi Liu



RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 7:54 PM
> 
> On Thu, Apr 01, 2021 at 07:04:01AM +0000, Liu, Yi L wrote:
> 
> > After reading your reply in
> > https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t
> > So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above
> > skeleton doesn't suit your idea.
> 
> You can do it one PASID per FD or multiple PASID's per FD. Most likely
> we will have high numbers of PASID's in a qemu process so I assume
> that number of FDs will start to be a constraining factor, thus
> multiplexing is reasonable.
> 
> It doesn't really change anything about the basic flow.
> 
> digging deeply into it either seems like a reasonable choice.
> 
> > +-----------------------------+-----------------------------------------------+
> > |      userspace              |               kernel space                    |
> > +-----------------------------+-----------------------------------------------+
> > | ioasid_fd =                 | /dev/ioasid does below:                       |
> > | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
> > |                             |       struct list_head ioasid_list;           |
> > |                             |       ...                                     |
> > |                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |
> 
> Sure, possibly an xarray not a list
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       ALLOC, &ioasid);      |   struct ioasid_data {                        |
> > |                             |       ioasid_t ioasid;                        |
> > |                             |       struct list_head device_list;           |
> > |                             |       struct list_head next;                  |
> > |                             |       ...                                     |
> > |                             |   } id_data; // id_data is per ioasid         |
> > |                             |                                               |
> > |                             |   list_add(&id_data.next,                     |
> > |                             |            &ifd_ctx.ioasid_list);             |
> 
> Yes, this should have a kref in it too
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(device_fd,            | VFIO does below:                              |
> > |       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
> > |       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
> > |       ioasid);              | 3) register device/domain info to /dev/ioasid |
> > |                             |    tracked in id_data.device_list             |
> > |                             | 4) record the ioasid in VFIO's per-device     |
> > |                             |    ioasid list for future security check      |
> 
> You would provide a function that does steps 1&2; look at eventfd for
> instance.
> 
> I'm not sure we need to register the device with the ioasid. device
> should incr the kref on the ioasid_data at this point.
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       BIND_PGTBL,           | 1) find ioasid's id_data                      |
> > |       pgtbl_data,           | 2) loop the id_data.device_list and tell iommu|
> > |       ioasid);              |    give ioasid access to the devices          |
> 
> This seems backwards, DEVICE_ALLOW_IOASID should tell the iommu to
> give the ioasid to the device.
> 
> Here the ioctl should be about assigning a memory map from the current
> mm_struct to the pasid
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       UNBIND_PGTBL,         | 1) find ioasid's id_data                      |
> > |       ioasid);              | 2) loop the id_data.device_list and tell iommu|
> > |  

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:43 PM
> 
> On Thu, Apr 01, 2021 at 01:38:46PM +0000, Liu, Yi L wrote:
> > > From: Jean-Philippe Brucker 
> > > Sent: Thursday, April 1, 2021 8:05 PM
> > [...]
> > >
> > > Also wondering about:
> > >
> > > * Querying IOMMU nesting capabilities before binding page tables (which
> > >   page table formats are supported?). We were planning to have a VFIO cap,
> > >   but I'm guessing we need to go back to the sysfs solution?
> >
> > I think it can also be with /dev/ioasid.
> 
> Sure, anything to do with page table formats and setting page tables
> should go through ioasid.
> 
> > > * Invalidation, probably an ioasid_fd ioctl?
> >
> > yeah, if we are doing bind/unbind_pagtable via ioasid_fd, then yes,
> > invalidation should go this way as well. This is why I worried it may
> > fail to meet the requirement from you and Eric.
> 
> Yes, all manipulation of page tables, including removing memory ranges, or
> setting memory ranges to trigger a page fault behavior should go
> through here.
> 
> > > * Page faults, page response. From and to devices, and don't necessarily
> > >   have a PASID. But needed by vdpa as well, so that's also going through
> > >   /dev/ioasid?
> >
> > page faults should still be per-device, but the fault event fd may be stored
> > in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.
> 
> Here you mean non-SVA page faults that are delegated to userspace to
> handle?

no, just SVA page faults. otherwise, no need to let userspace handle.

> 
> Why would that be per-device?
>
> Can you show the flow you imagine?

DMA page faults are delivered to the root complex via page request messages,
and they are per-device according to the PCIe spec. The page request handling
flow is:

1) iommu driver receives a page request from the device
2) iommu driver parses the page request message to get the RID, PASID,
   faulting page, requested permissions, etc.
3) iommu driver triggers the fault handler registered by the device driver
   with iommu_report_device_fault()
4) device driver's fault handler signals an event FD to notify userspace to
   fetch the information about the page fault. In the VM case, the page fault
   is injected into the VM and the guest resolves it.
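
The four steps above can be sketched in userspace C. This is only a minimal
model under stated assumptions: every sketch_* name, the single-slot pending
queue, and the struct layouts are hypothetical and are not taken from any
patch in this thread. The only real API used is Linux eventfd(2), which
stands in for the event FD mentioned in step 4.

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Hypothetical, simplified result of parsing a page request (step 2). */
struct sketch_page_req {
	uint16_t rid;   /* requester ID of the faulting device */
	uint32_t pasid;
	uint64_t addr;  /* faulting address */
	uint32_t perm;  /* requested permissions */
};

/* Hypothetical per-device state kept by the device driver. */
struct sketch_fault_dev {
	int event_fd;                   /* signalled in step 4 */
	struct sketch_page_req pending; /* single-slot queue, for brevity */
	int has_pending;
};

/* Step 4: the handler records the fault and signals the eventfd. */
static int sketch_dev_fault_handler(struct sketch_fault_dev *dev,
				    const struct sketch_page_req *req)
{
	uint64_t one = 1;

	dev->pending = *req;
	dev->has_pending = 1;
	if (write(dev->event_fd, &one, sizeof(one)) != (ssize_t)sizeof(one))
		return -1;
	return 0;
}

/* Steps 1-3: stand-in for the iommu driver parsing and reporting a fault. */
static int sketch_report_fault(struct sketch_fault_dev *dev, uint16_t rid,
			       uint32_t pasid, uint64_t addr, uint32_t perm)
{
	struct sketch_page_req req = {
		.rid = rid, .pasid = pasid, .addr = addr, .perm = perm,
	};

	return sketch_dev_fault_handler(dev, &req);
}

/* Userspace side: consume the notification, then fetch the fault record. */
static int sketch_user_fetch(struct sketch_fault_dev *dev,
			     struct sketch_page_req *out)
{
	uint64_t count;

	if (read(dev->event_fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	if (!dev->has_pending)
		return -1;
	*out = dev->pending;
	dev->has_pending = 0;
	return 0;
}
```

A caller would create the FD with eventfd(0, 0), report a fault, and then
fetch it. In the VM case the fetched record is what userspace would turn into
a fault injected into the guest.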

Eric has sent below series for the page fault reporting for VM with passthru
device.
https://lore.kernel.org/kvm/20210223210625.604517-5-eric.au...@redhat.com/

Regards,
Yi Liu

> Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:16 PM
> 
> On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, April 1, 2021 7:47 PM
> > [...]
> > > I'm worried Intel views the only use of PASID in a guest is with
> > > ENQCMD, but that is not consistent with the industry. We need to see
> > > normal nested PASID support with assigned PCI VFs.
> >
> > I don't quite follow here. Intel also allows PASID usage in guest without
> > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without ENQCMD.
> 
> Then you need all the parts, the hypervisor calls from the vIOMMU, and
> you can't really use a vPASID.

This diagram shows the vSVA setup.

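    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'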
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
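
As a rough illustration of the nested translation in the diagram, the sketch
below composes a first-level (GVA to GPA) lookup with a second-level (GPA to
HPA) lookup. It is a toy model: the flat lookup tables, the 4K page size, and
all sketch_* names are assumptions made for the example. Real hardware walks
multi-level page tables, and the FL table pointers themselves are GPAs that
go through SL, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_MASK  ((1ULL << SKETCH_PAGE_SHIFT) - 1)
#define SKETCH_FAULT      UINT64_MAX

/* One page-granular mapping: input frame number to output frame number. */
struct sketch_map_entry {
	uint64_t in_pfn;
	uint64_t out_pfn;
};

/* A translation stage modelled as a flat array of mappings. */
struct sketch_stage {
	const struct sketch_map_entry *entries;
	size_t count;
};

/* Translate one address through one stage, or return SKETCH_FAULT. */
static uint64_t sketch_stage_lookup(const struct sketch_stage *st, uint64_t addr)
{
	uint64_t pfn = addr >> SKETCH_PAGE_SHIFT;
	size_t i;

	for (i = 0; i < st->count; i++) {
		if (st->entries[i].in_pfn == pfn)
			return (st->entries[i].out_pfn << SKETCH_PAGE_SHIFT) |
			       (addr & SKETCH_PAGE_MASK);
	}
	return SKETCH_FAULT;
}

/* Nested translation: FL gives GVA->GPA, SL then gives GPA->HPA. */
static uint64_t sketch_nested_translate(const struct sketch_stage *fl,
					const struct sketch_stage *sl,
					uint64_t gva)
{
	uint64_t gpa = sketch_stage_lookup(fl, gva);

	if (gpa == SKETCH_FAULT)
		return SKETCH_FAULT;
	return sketch_stage_lookup(sl, gpa);
}
```

A miss in either stage is reported as a fault, which matches the idea that
the guest owns FL while the host owns SL for the default domain.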

https://lore.kernel.org/linux-iommu/20210302203545.436623-1-yi.l@intel.com/

> 
> I'm not sure how Intel intends to resolve all of this.
> 
> > > > - this per-ioasid SVA operations is not aligned with the native SVA usage
> > > >   model. Native SVA bind is per-device.
> > >
> > > Seems like that is an error in native SVA.
> > >
> > > SVA is a particular mode of the PASID's memory mapping table, it has
> > > nothing to do with a device.
> >
> > I think it still has relationship with device. This is determined by the
> > DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation is
> > enforced first in device granularity and then PASID granularity. SVA makes
> > usage of both PASID and device granularity isolation.
> 
> When the device driver authorizes a PASID the VT-d stuff should setup
> the isolation parameters for the given pci_device and PASID.

yes, both device and PASID are needed to set up the VT-d stuff.

> Do not leak implementation details like this as uAPI. Authorization
> and memory map are distinct ideas with distinct interfaces. Do not mix
> them.

got you. Let's focus on the uAPI things here and leave implementation details
in RFC patches.

Thanks,
Yi Liu

> Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jean-Philippe Brucker 
> Sent: Thursday, April 1, 2021 8:05 PM
[...]
> 
> Also wondering about:
> 
> * Querying IOMMU nesting capabilities before binding page tables (which
>   page table formats are supported?). We were planning to have a VFIO cap,
>   but I'm guessing we need to go back to the sysfs solution?

I think it can also be with /dev/ioasid.

> 
> * Invalidation, probably an ioasid_fd ioctl?

yeah, if we are doing bind/unbind_pagtable via ioasid_fd, then yes,
invalidation should go this way as well. This is why I worried it may
fail to meet the requirement from you and Eric.

> * Page faults, page response. From and to devices, and don't necessarily
>   have a PASID. But needed by vdpa as well, so that's also going through
>   /dev/ioasid?

page faults should still be per-device, but the fault event fd may be stored
in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.

Regards,
Yi Liu

> 
> Thanks,
> Jean


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 7:47 PM
[...]
> I'm worried Intel views the only use of PASID in a guest is with
> ENQCMD, but that is not consistent with the industry. We need to see
> normal nested PASID support with assigned PCI VFs.

I don't quite follow here. Intel also allows PASID usage in guest without
ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without ENQCMD.

[...]

> I'm sure there will be some small differences, and you should clearly
> explain the entire uAPI surface so that someone from AMD and ARM can
> review it.

good suggestion, will do.

> > - this per-ioasid SVA operations is not aligned with the native SVA usage
> >   model. Native SVA bind is per-device.
> 
> Seems like that is an error in native SVA.
> 
> SVA is a particular mode of the PASID's memory mapping table, it has
> nothing to do with a device.

I think it still has relationship with device. This is determined by the
DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation is
enforced first in device granularity and then PASID granularity. SVA makes
usage of both PASID and device granularity isolation.

Regards,
Yi Liu

> Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
Hi Jason,

> From: Liu, Yi L 
> Sent: Thursday, April 1, 2021 12:39 PM
> 
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 31, 2021 8:41 PM
> >
> > On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:
> >
> > > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > > the VM should be able to be shared by all assigned device for the VM.
> > > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > > be per-device.
> >
> > It is not *per-device* it is *per-ioasid*
> >
> > And as /dev/ioasid is an interface for controlling multiple ioasid's
> > there is no issue to also multiplex the page table manipulation for
> > multiple ioasids as well.
> >
> > What you should do next is sketch out in some RFC the exact ioctls
> > each FD would have and show how the parts I outlined would work and
> > point out any remaining gaps.
> >
> > The device FD is something like the vfio_device FD from VFIO, it has
> > *nothing* to do with PASID beyond having a single ioctl to authorize
> > the device to use the PASID. All control of the PASID is in
> > /dev/ioasid.
> 
> good to see this reply. Your idea is much clearer to me now. If I'm getting
> you correctly, I think the skeleton is something like below:
>
> 1) userspace opens /dev/ioasid; an ioasid is allocated along with a
>    per-ioasid context which can be used to bind page tables and invalidate
>    caches, and an ioasid FD is returned to userspace.
> 2) userspace passes the ioasid FD to VFIO to associate it with a device
>    FD (like vfio_device FD).
> 3) userspace binds page table on the ioasid FD with the page table info.
> 4) userspace unbinds the page table on the ioasid FD
> 5) userspace de-associates the ioasid FD and device FD
>
> Does above suit your outline?
>
> If yes, I still have the concerns below and would like your opinion.
> - the ioasid FD and device association will happen at runtime instead of
>   just happen in the setup phase.
> - how about AMD and ARM's vSVA support? Their PASID allocation and page table
>   happens within guest. They only need to bind the guest PASID table to host.
>   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
>   to correct me)
> - this per-ioasid SVA operations is not aligned with the native SVA usage
>   model. Native SVA bind is per-device.

After reading your reply in 
https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t
So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
doesn't suit your idea. I drafted the skeleton below to see if we are on the
same page. But I still believe there is an open question on how to fit ARM and
AMD's vSVA support into this per-ioasid SVA operation model. Thoughts?

+-----------------------------+-----------------------------------------------+
| userspace                   | kernel space                                  |
+-----------------------------+-----------------------------------------------+
| ioasid_fd =                 | /dev/ioasid does below:                       |
| open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
|                             |       struct list_head ioasid_list;           |
|                             |       ...                                     |
|                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |
+-----------------------------+-----------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
|       ALLOC, );             |   struct ioasid_data {                        |
|                             |       ioasid_t ioasid;                        |
|                             |       struct list_head device_list;           |
|                             |       struct list_head next;                  |
|                             |       ...                                     |
|                             |   } id_data; // id_data is per ioasid         |
|                             |                                               |
|                             |   list_add(&id_data.next,                     |
|                             |            &ifd_ctx.ioasid_list);             |
+-----------------------------+-----------------------------------------------+
| ioctl(device_fd,            | VFIO does below:                              |
|       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
|       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
|       ioasid);              | 3) register device/domain info to /dev/ioasid |
|                             |    tracked in id_data.device_list             |

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Wednesday, March 31, 2021 8:41 PM
> 
> On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:
> 
> > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > the VM should be able to be shared by all assigned device for the VM.
> > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > be per-device.
> 
> It is not *per-device* it is *per-ioasid*
>
> And as /dev/ioasid is an interface for controlling multiple ioasid's
> there is no issue to also multiplex the page table manipulation for
> multiple ioasids as well.
> 
> What you should do next is sketch out in some RFC the exact ioctls
> each FD would have and show how the parts I outlined would work and
> point out any remaining gaps.
> 
> The device FD is something like the vfio_device FD from VFIO, it has
> *nothing* to do with PASID beyond having a single ioctl to authorize
> the device to use the PASID. All control of the PASID is in
> /dev/ioasid.

good to see this reply. Your idea is much clearer to me now. If I'm getting
you correctly, I think the skeleton is something like below:

1) userspace opens /dev/ioasid; an ioasid is allocated along with a
   per-ioasid context which can be used to bind page tables and invalidate
   caches, and an ioasid FD is returned to userspace.
2) userspace passes the ioasid FD to VFIO to associate it with a device
   FD (like vfio_device FD).
3) userspace binds page table on the ioasid FD with the page table info.
4) userspace unbinds the page table on the ioasid FD
5) userspace de-associates the ioasid FD and device FD
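
To make the ordering in this five-step skeleton concrete, here is a small
state-machine sketch in C. It is purely illustrative: the sketch_* names, the
error codes, and the rule that a device must be associated before a page
table can be bound are assumptions for the example, not an existing uAPI.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-FD context: one ioasid per /dev/ioasid FD, as in step 1. */
struct sketch_ioasid_fd {
	uint32_t ioasid;
	int device_fd;       /* -1 while no device FD is associated */
	bool pgtbl_bound;
	uint64_t pgtbl_addr;
};

/* Step 1: open allocates the ioasid and its context. */
static void sketch_open(struct sketch_ioasid_fd *ctx, uint32_t ioasid)
{
	ctx->ioasid = ioasid;
	ctx->device_fd = -1;
	ctx->pgtbl_bound = false;
	ctx->pgtbl_addr = 0;
}

/* Step 2: associate the ioasid FD with a device FD. */
static int sketch_attach_device(struct sketch_ioasid_fd *ctx, int device_fd)
{
	if (device_fd < 0)
		return -EINVAL;
	if (ctx->device_fd >= 0)
		return -EBUSY;
	ctx->device_fd = device_fd;
	return 0;
}

/* Step 3: bind a page table; assumed to require an associated device. */
static int sketch_bind_pgtbl(struct sketch_ioasid_fd *ctx, uint64_t pgtbl)
{
	if (ctx->device_fd < 0)
		return -ENODEV;
	if (ctx->pgtbl_bound)
		return -EBUSY;
	ctx->pgtbl_addr = pgtbl;
	ctx->pgtbl_bound = true;
	return 0;
}

/* Step 4: unbind the page table. */
static int sketch_unbind_pgtbl(struct sketch_ioasid_fd *ctx)
{
	if (!ctx->pgtbl_bound)
		return -EINVAL;
	ctx->pgtbl_bound = false;
	ctx->pgtbl_addr = 0;
	return 0;
}

/* Step 5: de-associate; refused while a page table is still bound. */
static int sketch_detach_device(struct sketch_ioasid_fd *ctx)
{
	if (ctx->device_fd < 0)
		return -EINVAL;
	if (ctx->pgtbl_bound)
		return -EBUSY;
	ctx->device_fd = -1;
	return 0;
}
```

Calling the helpers out of order returns an error instead of changing state,
which is one way the runtime association concern raised below could be
contained.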

Does above suit your outline?

If yes, I still have the concerns below and would like your opinion.
- the ioasid FD and device association will happen at runtime instead of
  just happen in the setup phase.
- how about AMD and ARM's vSVA support? Their PASID allocation and page table
  happens within guest. They only need to bind the guest PASID table to host.
  Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
  to correct me)
- this per-ioasid SVA operations is not aligned with the native SVA usage
  model. Native SVA bind is per-device.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:43 PM
[..]
> No, the mdev device driver must enforce this directly. It is the one
> that programs the physical shared HW, it is the one that needs a list
> of PASID's it is allowed to program *for each mdev*
> 
> ioasid_set doesn't seem to help at all, certainly not as a concept
> tied to /dev/ioasid.
> 

As replied in another thread, we introduced ioasid_set to have per-VM ioasid
tracking, which is required when user space tries to bind an ioasid with a
device: we should ensure the ioasid it is using was allocated to it.
Otherwise, we may suffer inter-VM ioasid problems. It does not necessarily
have to be ioasid_set, but per-VM ioasid tracking is necessary.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:28 PM
> 
> On Tue, Mar 30, 2021 at 04:14:58AM +, Tian, Kevin wrote:
> 
> > One correction. The mdev should still construct the list of allowed PASID's as
> > you said (by listening to IOASID_BIND/UNBIND event), in addition to the ioasid
> > set maintained per VM (updated when a PASID is allocated/freed). The per-VM
> > set is required for inter-VM isolation (verified when a pgtable is bound to the
> > mdev/PASID), while the mdev's own list is necessary for intra-VM isolation when
> > multiple mdevs are assigned to the same VM (verified before loading a PASID
> > to the mdev). This series just handles the general part i.e. per-VM ioasid set and
> > leaves the mdev's own list to be managed by specific mdev driver which listens
> > to various IOASID events).
> 
> This is better, but I don't understand why we need such a convoluted
> design.
> 
> Get rid of the ioasid set.
>
> Each driver has its own list of allowed ioasids.

First, I agree with you it's necessary to have a per-device allowed ioasid
list. But besides it, I think we still need to ensure the ioasid used by a
VM is really allocated to this VM. A VM should not use an ioasid allocated
to another VM. right? Actually, this is the major intention for introducing
ioasid_set.
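
The two checks being discussed here (a per-VM ownership check plus a
per-device allowed list) can be sketched as follows. This is only a model
under assumptions: the fixed-size arrays, the sketch_* names, and the chosen
error codes are hypothetical, and a real implementation would more likely
use an xarray and reference counting.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_IDS 8

/* A tiny ID set, used both for the per-VM set and the per-device list. */
struct sketch_id_list {
	uint32_t ids[SKETCH_MAX_IDS];
	size_t count;
};

static bool sketch_id_list_has(const struct sketch_id_list *l, uint32_t id)
{
	size_t i;

	for (i = 0; i < l->count; i++) {
		if (l->ids[i] == id)
			return true;
	}
	return false;
}

static int sketch_id_list_add(struct sketch_id_list *l, uint32_t id)
{
	if (sketch_id_list_has(l, id))
		return 0; /* already present, nothing to do */
	if (l->count == SKETCH_MAX_IDS)
		return -ENOSPC;
	l->ids[l->count++] = id;
	return 0;
}

/*
 * Inter-VM check: a device may only be allowed to use an ioasid that was
 * allocated to the VM owning the device.
 */
static int sketch_device_allow(const struct sketch_id_list *vm_set,
			       struct sketch_id_list *dev_allowed,
			       uint32_t ioasid)
{
	if (!sketch_id_list_has(vm_set, ioasid))
		return -EPERM;
	return sketch_id_list_add(dev_allowed, ioasid);
}

/* Intra-VM check: done by the driver before programming the device. */
static int sketch_device_may_use(const struct sketch_id_list *dev_allowed,
				 uint32_t ioasid)
{
	return sketch_id_list_has(dev_allowed, ioasid) ? 0 : -EACCES;
}
```

With a VM set of {1, 2}, allowing ioasid 3 on a device fails the inter-VM
check, and ioasid 2 still fails the per-device check until it has been
explicitly allowed for that device.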

> Register a ioasid in the driver's list by passing the fd and ioasid #

The fd here is a device fd, am I right? If yes, your idea is that the ioasid
is allocated via /dev/ioasid and associated with the device fd via either a
VFIO or vDPA ioctl, right? Sorry, I may be asking silly questions, but I
really need to ensure we are on the same page.

> No listening to events. A simple understandable security model.

For this suggestion, I am a little concerned that we may have an A-B/B-A
lock ordering issue, since it requires /dev/ioasid (if it supports)
to call back into VFIO/VDPA to check if the ioasid has been registered to
the device FD and record it in the per-device list, right? Let's have more
discussion based on the skeleton sent by Kevin.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> On Tue, Mar 30, 2021 at 01:37:05AM +, Tian, Kevin wrote:
[...]
> > Hi, Jason,
> >
> > Actually above is a major open while we are refactoring vSVA uAPI toward
> > this direction. We have two concerns about merging /dev/ioasid with
> > /dev/sva, and would like to hear your thought whether they are valid.
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> qemu should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 
> > Second, ioasid is managed per process/VM while pgtable binding is a
> > device-wise operation.  The userspace flow looks like below for an integral
> > /dev/ioasid interface:
> >
> > - ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
> > - ioasid_fd = open(/dev/ioasid)
> > - ioctl(ioasid_fd, IOASID_GET_USVA_FD, &sva_fd) //an empty context
> > - ioctl(device->fd, VFIO_DEVICE_SET_SVA, &sva_fd); //sva_fd ties to device
> > - ioctl(sva_fd, USVA_GET_INFO, _info);
> > - ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, );
> > - ioctl(sva_fd, USVA_BIND_PGTBL, _data);
> > - ioctl(sva_fd, USVA_FLUSH_CACHE, _info);
> > - ioctl(sva_fd, USVA_UNBIND_PGTBL, _data);
> > - ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, &sva_fd);
> > - close(sva_fd)
> > - close(ioasid_fd)
> >
> > Our hesitation here is based on one of your earlier comments that
> > you are not a fan of constructing fd's through ioctl. Are you OK with
> > above flow or have a better idea of handling it?
> 
> My reaction is to squash 'sva' and ioasid fds together, I can't see
> why you'd need two fds to manipulate a PASID.

The reason is that the /dev/ioasid FD is per-VM, since the ioasids allocated
to the VM should be sharable by all devices assigned to the VM. But the SVA
operations (bind/unbind page table, cache_invalidate) should be per-device.
Squashing the two fds into one would require a device tag for each vSVA
ioctl, and I'm not sure that is good. To me, it looks better to have an SVA
FD associated with a device FD, so that any ioctl on it is at the device
level. This also benefits ARM and AMD's vSVA support, since they bind the
guest PASID table to the host instead of binding guest page tables to
specific PASIDs.

Regards,
Yi Liu


RE: [Patch v8 04/10] vfio/type1: Support binding guest page tables to PASID

2021-03-03 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Thursday, March 4, 2021 3:45 AM
> 
> On Wed, Mar 03, 2021 at 11:42:12AM -0800, Jacob Pan wrote:
> > Hi Jason,
> >
> > On Tue, 2 Mar 2021 13:15:51 -0400, Jason Gunthorpe wrote:
> >
> > > On Tue, Mar 02, 2021 at 09:13:19AM -0800, Jacob Pan wrote:
> > > > Hi Jason,
> > > >
> > > > On Tue, 2 Mar 2021 08:56:28 -0400, Jason Gunthorpe wrote:
> > > > > On Wed, Mar 03, 2021 at 04:35:39AM +0800, Liu Yi L wrote:
> > > > > >
> > > > > > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > > > > > +{
> > > > > > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > > > +   unsigned long arg = *(unsigned long *)dc->data;
> > > > > > +
> > > > > > +   return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> > > > > > + (void __user *)arg);
> > > > >
> > > > > This arg business is really tortured. The type should be set at the
> > > > > ioctl, not constantly passed down as unsigned long or worse void *.
> > > > >
> > > > > And why is this passing a __user pointer deep into an iommu_* API??
> > > > >
> > > > The idea was that IOMMU UAPI (not API) is independent of VFIO or other
> > > > user driver frameworks. The design is documented here:
> > > > Documentation/userspace-api/iommu.rst
> > > > IOMMU UAPI handles the type and sanitation of user provided data.
> > >
> > > Why? If it is uapi it has defined types and those types should be
> > > completely clear from the C code, not obfuscated.
> > >
> > From the user's p.o.v., it is plain c code nothing obfuscated. As for
> > kernel handling of the data types, it has to be answered by the bigger
> > question of how we deal with sharing IOMMU among multiple subsystems with
> > UAPIs.
> 
> As I said, don't obfuscate types like this in the kernel. It is not
> good style.
> 
> > However, IOMMU is a system device which has little value to be exposed to
> > the userspace. Not to mention the device-IOMMU affinity/topology. VFIO
> > nicely abstracts IOMMU from the userspace, why do we want to reverse that?
> 
> The other patch was talking about a /dev/ioasid - why can't this stuff
> be run over that?

The operations in this patch are actually iommu domain operations, which are
ultimately backed by the iommu domain ops. /dev/ioasid in the other patch, by
contrast, is created for IOASID allocation/free to fit the PASID allocation
requirement from both vSVA and vDPA. It has no knowledge of the iommu domain
or of the device. Without that information, /dev/ioasid is unable to run
this stuff.

Thanks,
Yi Liu


RE: [Patch v8 03/10] vfio/type1: Report iommu nesting info to userspace

2021-03-03 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Tuesday, March 2, 2021 8:52 PM
> 
> On Wed, Mar 03, 2021 at 04:35:38AM +0800, Liu Yi L wrote:
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 4bb162c1d649..3a5c84d4f19b 100644
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -63,22 +63,24 @@ MODULE_PARM_DESC(dma_entry_limit,
> >  		 "Maximum number of user DMA mappings per container (65535).");
> >
> >  struct vfio_iommu {
> > -	struct list_head	domain_list;
> > -	struct list_head	iova_list;
> > -	struct vfio_domain	*external_domain; /* domain for external user */
> > -	struct mutex		lock;
> > -	struct rb_root		dma_list;
> > -	struct blocking_notifier_head notifier;
> > -	unsigned int		dma_avail;
> > -	unsigned int		vaddr_invalid_count;
> > -	uint64_t		pgsize_bitmap;
> > -	uint64_t		num_non_pinned_groups;
> > -	wait_queue_head_t	vaddr_wait;
> > -	bool			v2;
> > -	bool			nesting;
> > -	bool			dirty_page_tracking;
> > -	bool			pinned_page_dirty_scope;
> > -	bool			container_open;
> > +	struct list_head		domain_list;
> > +	struct list_head		iova_list;
> > +	/* domain for external user */
> > +	struct vfio_domain		*external_domain;
> > +	struct mutex			lock;
> > +	struct rb_root			dma_list;
> > +	struct blocking_notifier_head	notifier;
> > +	unsigned int			dma_avail;
> > +	unsigned int			vaddr_invalid_count;
> > +	uint64_t			pgsize_bitmap;
> > +	uint64_t			num_non_pinned_groups;
> > +	wait_queue_head_t		vaddr_wait;
> > +	struct iommu_nesting_info	*nesting_info;
> > +	bool				v2;
> > +	bool				nesting;
> > +	bool				dirty_page_tracking;
> > +	bool				pinned_page_dirty_scope;
> > +	bool				container_open;
> >  };
> 
> I always hate seeing one line patches done like this. If you want to
> re-indent you should remove the horizontal whitespace, not add an
> unreadable amount more.

Oops. will be careful in next version. Perhaps no need to re-indent
the existing fields to avoid the whitespace?

> 
> Also, Linus has been unhappy before to see lists of bool's in structs
> due to the huge amount of memory they waste.

How about something like below? I can do it if Alex is fine with it.

u64 v2:1;
u64 nesting:1;
u64 dirty_page_tracking:1;
u64 pinned_page_dirty_scope:1;
u64 container_open:1;
u64 reserved:59;
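
For reference, the layout trade-off can be checked with a small standalone
sketch. The struct names below are hypothetical and only mirror the five
flags; uint64_t and uint8_t stand in for the kernel's u64 and u8. Bit-fields
on these types are implementation-defined in standard C, although GCC and
Clang accept them, so the sizes should be treated as ABI-dependent.

```c
#include <stdbool.h>
#include <stdint.h>

/* The five flags as separate bools, as in the current struct. */
struct sketch_flags_bool {
	bool v2;
	bool nesting;
	bool dirty_page_tracking;
	bool pinned_page_dirty_scope;
	bool container_open;
};

/* The same flags as 1-bit fields in a 64-bit storage unit, as proposed. */
struct sketch_flags_u64 {
	uint64_t v2:1;
	uint64_t nesting:1;
	uint64_t dirty_page_tracking:1;
	uint64_t pinned_page_dirty_scope:1;
	uint64_t container_open:1;
	uint64_t reserved:59;
};

/* A narrower alternative: the same flags packed into a single byte. */
struct sketch_flags_u8 {
	uint8_t v2:1;
	uint8_t nesting:1;
	uint8_t dirty_page_tracking:1;
	uint8_t pinned_page_dirty_scope:1;
	uint8_t container_open:1;
};
```

On common 64-bit ABIs with GCC or Clang I would expect sizeof() to report
5, 8, and 1 bytes respectively. Inside struct vfio_iommu the flags follow
pointer-aligned members, so trailing padding may hide part of the difference.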

And thanks for sharing what Linus prefers.

Regards,
Yi Liu

> Jason


RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info

2021-03-03 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Friday, February 12, 2021 5:58 PM
> 
> Hi Vivek, Yi,
> 
> On 2/12/21 8:14 AM, Vivek Gautam wrote:
> > Hi Yi,
> >
> >
> > On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L  wrote:
> >>
> >> Hi Eric,
> >>
> >>> From: Auger Eric 
> >>> Sent: Tuesday, January 19, 2021 6:03 PM
> >>>
> >>> Hi Yi, Vivek,
> >>>
> >> [...]
> >>>> I see. I think there needs a change in the code there. Should also expect
> >>>> a nesting_info returned instead of an int anymore. @Eric, how about your
> >>>> opinion?
> >>>>
> >>>> domain = iommu_get_domain_for_dev(>pdev->dev);
> >>>> ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
> >>> );
> >>>> if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
> >>>> /*
> >>>>  * No need go futher as no page request service support.
> >>>>  */
> >>>> return 0;
> >>>> }
> >>> Sure I think it is "just" a matter of synchro between the 2 series. Yi,
> >>
> >> exactly.
> >>
> >>> do you have plans to respin part of
> >>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
> >>> or would you allow me to embed this patch in my series.
> >>
> >> My v7 hasn't touched the prq change yet. So I think it's better for you to
> >> embed it in your series. ^_^
> >>
> >
> > Can you please let me know if you have an updated series of these
> > patches? It will help me to work with virtio-iommu/arm side changes.
> 
> As per the previous discussion, I plan to take those 2 patches in my
> SMMUv3 nested stage series:
> 
> [PATCH v7 01/16] iommu: Report domain nesting info
> [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
> 
> we need to upgrade both since we do not want to report an empty nesting
> info anymore, for arm.

sorry for the late response. I've sent out the updated version. Also,
yeah, please feel free to take the patch in your series.

https://lore.kernel.org/linux-iommu/20210302203545.436623-2-yi.l@intel.com/

Regards,
Yi Liu

> Thanks
> 
> Eric
> >
> > Thanks & regards
> > Vivek
> >


RE: [PATCH V4 00/18] IOASID extensions for guest SVA

2021-03-02 Thread Liu, Yi L
> From: Jacob Pan 
> Sent: Sunday, February 28, 2021 6:01 AM
>
> I/O Address Space ID (IOASID) core code was introduced in v5.5 as a generic
> kernel allocator service for both PCIe Process Address Space ID (PASID) and
> ARM SMMU's Substream ID. IOASIDs are used to associate DMA requests with
> virtual address spaces, including both host and guest.
> 
> In addition to providing basic ID allocation, ioasid_set was defined as a
> token that is shared by a group of IOASIDs. This set token can be used
> for permission checking, but lack some features to address the following
> needs by guest Shared Virtual Address (SVA).
> - Manage IOASIDs by group, group ownership, quota, etc.
> - State synchronization among IOASID users (e.g. IOMMU driver, KVM, device
>   drivers)
> - Non-identity guest-host IOASID mapping
> - Lifecycle management
> 
> This patchset introduces the following extensions as solutions to the
> problems above.
> - Redefine and extend IOASID set such that IOASIDs can be managed by
> groups/pools.
> - Add notifications for IOASID state synchronization
> - Extend reference counting for life cycle alignment among multiple users
> - Support ioasid_set private IDs, which can be used as guest IOASIDs
> - Add a new cgroup controller for resource distribution
> 
> Please refer to Documentation/admin-guide/cgroup-v1/ioasids.rst and
> Documentation/driver-api/ioasid.rst in the enclosed patches for more
> details.
> 
> Based on discussions on LKML[1], a direction change was made in v4 such that
> the user interfaces for IOASID allocation are extracted from VFIO
> subsystem. The proposed IOASID subsystem now consists of three components:
> 1. IOASID core[01-14]: provides APIs for allocation, pool management,
>   notifications, and refcounting.
> 2. IOASID cgroup controller[RFC 15-17]: manage resource distribution[2].
> 3. IOASID user[RFC 18]:  provides user allocation interface via /dev/ioasid
> 
> This patchset only included VT-d driver as users of some of the new APIs.
> VFIO and KVM patches are coming up to fully utilize the APIs introduced
> here.
>
> [1] https://lore.kernel.org/linux-iommu/1599734733-6431-1-git-send-email-yi.l@intel.com/
> [2] Note that ioasid quota management code can be removed once the IOASIDs
> cgroup is ratified.
> 
> You can find this series, VFIO, KVM, and IOASID user at:
> https://github.com/jacobpan/linux.git ioasid_v4
> (VFIO and KVM patches will be available at this branch when published.)

VFIO and QEMU series are listed below:

VFIO: https://lore.kernel.org/linux-iommu/20210302203545.436623-1-yi.l@intel.com/
QEMU: https://lore.kernel.org/qemu-devel/20210302203827.437645-1-yi.l@intel.com/T/#t

Regards,
Yi Liu



[Patch v8 10/10] iommu/vt-d: Support reporting nesting capability info

2021-03-02 Thread Liu Yi L
This patch reports nesting info when iommu_domain_get_attr() is called with
DOMAIN_ATTR_NESTING on a domain that has nesting set.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v7 -> v8:
*) tweak per latest code base

v6 -> v7:
*) split the patch in v6 into two patches:
   [PATCH v7 15/16] iommu/vt-d: Only support nesting when nesting caps are 
consistent across iommu units
   [PATCH v7 16/16] iommu/vt-d: Support reporting nesting capability info

v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
---
 drivers/iommu/intel/cap_audit.h |  7 
 drivers/iommu/intel/iommu.c | 68 -
 2 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/cap_audit.h b/drivers/iommu/intel/cap_audit.h
index 74cfccae0e81..787e98282a02 100644
--- a/drivers/iommu/intel/cap_audit.h
+++ b/drivers/iommu/intel/cap_audit.h
@@ -60,6 +60,13 @@
 #define ECAP_QI_MASK   BIT_ULL(1)
 #define ECAP_C_MASK    BIT_ULL(0)
 
+/* Capabilities related to nested translation */
+#define VTD_CAP_MASK   (CAP_FL1GP_MASK | CAP_FL5LP_MASK)
+
+#define VTD_ECAP_MASK  (ECAP_PRS_MASK | ECAP_ERS_MASK | \
+ECAP_SRS_MASK | ECAP_EAFS_MASK | \
+ECAP_PASID_MASK)
+
 /*
  * u64 intel_iommu_cap_sanity, intel_iommu_ecap_sanity will be adjusted as each
  * IOMMU gets audited.
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4409d86b4e18..f7432fb1c6ea 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5508,13 +5508,79 @@ static bool domain_use_flush_queue(void)
return r;
 }
 
+static int intel_iommu_get_nesting_info(struct iommu_domain *domain,
+   struct iommu_nesting_info *info)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK;
+   struct device_domain_info *domain_info;
+   struct iommu_nesting_info_vtd vtd;
+   unsigned int size;
+
+   if (!info)
+   return -EINVAL;
+
+   if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
+   return -ENODEV;
+
+   size = sizeof(struct iommu_nesting_info);
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->argsz < size) {
+   info->argsz = size;
+   return 0;
+   }
+
+   /*
+* arbitrary select the first domain_info as all nesting
+* related capabilities should be consistent across iommu
+* units.
+*/
+   domain_info = list_first_entry(&dmar_domain->devices,
+  struct device_domain_info, link);
+   cap &= domain_info->iommu->cap;
+   ecap &= domain_info->iommu->ecap;
+
+   info->addr_width = dmar_domain->gaw;
+   info->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+   info->features = IOMMU_NESTING_FEAT_BIND_PGTBL |
+IOMMU_NESTING_FEAT_CACHE_INVLD;
+   info->pasid_bits = ilog2(intel_pasid_max_id);
+   memset(&info->padding, 0x0, 12);
+
+   vtd.flags = 0;
+   memset(&vtd.padding, 0x0, 12);
+   vtd.cap_reg = cap & VTD_CAP_MASK;
+   vtd.ecap_reg = ecap & VTD_ECAP_MASK;
+
+   memcpy(&info->vendor.vtd, &vtd, sizeof(vtd));
+   return 0;
+}
+
 static int
 intel_iommu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
 {
switch (domain->type) {
case IOMMU_DOMAIN_UNMANAGED:
-   return -ENODEV;
+   switch (attr) {
+   case DOMAIN_ATTR_NESTING:
+   {
+   struct iommu_nesting_info *info =
+   (struct iommu_nesting_info *)data;
+   unsigned long flags;
+   int ret;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   ret = intel_iommu_get_nesting_info(domain, info);
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   return ret;
+   }
+   default:
+   return -ENODEV;
+   }
case IOMMU_DOMAIN_DMA:
switch (attr) {
case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-- 
2.25.1
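
For readers following the argsz convention used by
intel_iommu_get_nesting_info() above: when the caller's buffer is too small,
the handler writes the required size back and still returns 0, so the caller
has to compare argsz itself. Below is a minimal userspace-style sketch of
that handshake together with the capability masking. The struct layout, the
mask value, the feature value and the name get_nesting_info() are simplified
stand-ins chosen for illustration; they are not the kernel definitions.

```c
#include <stdint.h>
#include <string.h>
#include <errno.h>

/* Simplified stand-in for struct iommu_nesting_info (not the real uAPI layout). */
struct nesting_info {
	uint32_t argsz;     /* in: caller's buffer size, out: required size */
	uint32_t features;
	uint64_t cap_reg;   /* filtered capability bits */
};

/* Illustrative mask: only these capability bits are reported. */
#define DEMO_CAP_MASK 0x00f0ULL

/*
 * Returns 0 both on success and when the buffer is too small; in the
 * latter case only argsz is updated, so the caller must check it.
 */
static int get_nesting_info(struct nesting_info *info, uint64_t hw_cap)
{
	uint32_t size = sizeof(*info);

	if (!info)
		return -EINVAL;

	if (info->argsz < size) {
		info->argsz = size;
		return 0;
	}

	memset(info, 0, size);
	info->argsz = size;
	info->features = 0x3;                   /* arbitrary demo value */
	info->cap_reg = hw_cap & DEMO_CAP_MASK; /* hide unrelated bits */
	return 0;
}
```

A caller would typically issue the query once, compare the returned argsz
with the size it passed in, and retry with a larger buffer if needed.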



[Patch v8 09/10] vfio: Document dual stage control

2021-03-02 Thread Liu Yi L
From: Eric Auger 

The VFIO API was enhanced to support nested stage control: a bunch of
new ioctls and usage guideline.

Let's document the process to follow to set up nested mode.
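
As background for the nested mode documented here, the sketch below models
how a DMA address passes through both stages: the guest-owned stage 1 first,
then the host-owned stage 2. It is only a toy. The single-level tables and
all toy_*/nested_* names are made up for illustration, and real hardware
also translates the stage-1 table pointers themselves through stage 2,
which is left out.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NPAGES     4
#define TOY_INVALID    UINT64_MAX

/* Toy single-level table indexed by page number; real tables are multi-level. */
struct toy_table {
	uint64_t pfn[TOY_NPAGES];  /* output page frame, or TOY_INVALID if unmapped */
};

static uint64_t toy_walk(const struct toy_table *t, uint64_t addr)
{
	uint64_t idx = addr >> TOY_PAGE_SHIFT;
	uint64_t off = addr & ((1ULL << TOY_PAGE_SHIFT) - 1);

	if (idx >= TOY_NPAGES || t->pfn[idx] == TOY_INVALID)
		return TOY_INVALID;
	return (t->pfn[idx] << TOY_PAGE_SHIFT) | off;
}

/* Nested walk: stage 1 (GVA -> GPA) first, then stage 2 (GPA -> HPA). */
static uint64_t nested_translate(const struct toy_table *stage1,
				 const struct toy_table *stage2,
				 uint64_t gva)
{
	uint64_t gpa = toy_walk(stage1, gva);

	if (gpa == TOY_INVALID)
		return TOY_INVALID;
	return toy_walk(stage2, gpa);
}
```

A fault in either stage makes the whole translation fail, which is why the
hypervisor can keep isolating the VM through stage 2 even though the guest
controls stage 1.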

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Eric Auger 
Signed-off-by: Liu Yi L 
Reviewed-by: Stefan Hajnoczi 
---
v7 -> v8:
*) remove SYSWIDE_PASID description, point to /dev/ioasid when mentioning
   PASID allocation from host.

v6 -> v7:
*) tweak per Eric's comments.

v5 -> v6:
*) tweak per Eric's comments.

v3 -> v4:
*) add review-by from Stefan Hajnoczi

v2 -> v3:
*) address comments from Stefan Hajnoczi

v1 -> v2:
*) new in v2, compared with Eric's original version, pasid table bind
   and fault reporting is removed as this series doesn't cover them.
   Original version from Eric.
   https://lore.kernel.org/kvm/20200320161911.27494-12-eric.au...@redhat.com/
---
 Documentation/driver-api/vfio.rst | 77 +++
 1 file changed, 77 insertions(+)

diff --git a/Documentation/driver-api/vfio.rst 
b/Documentation/driver-api/vfio.rst
index f1a4d3c3ba0b..9ccf9d63b72f 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -239,6 +239,83 @@ group and can access them as follows::
/* Gratuitous device reset and go... */
ioctl(device, VFIO_DEVICE_RESET);
 
+IOMMU Dual Stage Control
+------------------------
+
+Some IOMMUs support 2 stages/levels of translation. Stage corresponds
+to the ARM terminology while level corresponds to Intel's terminology.
+In the following text, we use either without distinction.
+
+This is useful when the guest is exposed with a virtual IOMMU and some
+devices are assigned to the guest through VFIO. Then the guest OS can
+use stage-1 (GIOVA -> GPA or GVA->GPA), while the hypervisor uses stage
+2 for VM isolation (GPA -> HPA).
+
+Under dual-stage translation, the guest gets ownership of the stage-1
+page tables or both the stage-1 configuration structures and page tables.
+This depends on vendor. e.g. on Intel platform, the guest owns stage-1
+page tables under nesting. While on ARM, the guest owns both the stage-1
+configuration structures and page tables under nesting. The hypervisor
+owns the root configuration structure (for security reasons), including
+stage-2 configuration. This works as long as configuration structures
+and page table formats are compatible between the virtual IOMMU and the
+physical IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+VFIO_TYPE1_NESTING_IOMMU type through:
+
+ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
+
+This forces the hypervisor to use the stage-2, leaving stage-1 available
+for guest usage.
+The stage-1 format and binding method are reported in the nesting capability
+(VFIO_IOMMU_TYPE1_INFO_CAP_NESTING) through VFIO_IOMMU_GET_INFO:
+
+ioctl(container->fd, VFIO_IOMMU_GET_INFO, _info);
+
+The nesting cap info is available only after NESTING_IOMMU is selected.
+If the underlying IOMMU doesn't support nesting, VFIO_SET_IOMMU fails and
+userspace should try other IOMMU types. Details of the nesting cap info
+can be found in Documentation/userspace-api/iommu.rst.
+
+Binding the stage-1 page table to the IOMMU differs per platform. On Intel,
+the stage-1 page table info is mediated by userspace for each PASID.
+On ARM, userspace directly passes the GPA of the whole PASID table.
+Currently only Intel's binding (IOMMU_NESTING_FEAT_BIND_PGTBL) is
+supported:
+
+nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
+memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
+ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
+
+When multiple stage-1 page tables are supported on a device, each page
+table is associated with a PASID (Process Address Space ID) to differentiate
+them from each other. In such a case, userspace should include the PASID in
+the bind_data when issuing direct binding requests.
+
+PASIDs could be managed per-device or system-wide which, again, depends on
+the IOMMU vendor. e.g. on Intel platforms, userspace *must* allocate a PASID
+from the host before attempting to bind a stage-1 page table; the allocation
+is done through the /dev/ioasid interface. On systems without /dev/ioasid,
+userspace should not attempt to bind page tables, and any such attempt will
+be failed by the kernel. For the usage of /dev/ioasid, please refer to the
+doc below:
+
+Documentation/userspace-api/ioasid.rst
+
+Once the stage-1 page table is bound to the IOMMU, the guest is allowed to
+fully manage its mapping at its disposal. The IOMMU walks nested stage-1
+and stage-2 page tables when serving DMA requests from assigned device, and
+may cache the stage-1 mapping in the IOTLB. When required (IOMMU_NESTING_
+FEAT_CACHE_INVLD), userspace *must* forward guest stage-1 invalidation to
+the host, so the IOTLB i

[Patch v8 08/10] vfio/pci: Expose PCIe PASID capability to userspace

2021-03-02 Thread Liu Yi L
This patch exposes the PCIe PASID capability to userspace, which can then
emulate the capability if it wants to expose it further to a VM.

The PASID capability is exposed only for devices that have the PCIe PASID
extended structure in their configuration space. For VFs, userspace is
still unable to see this capability, because the SR-IOV spec forbids VFs
from implementing the PASID capability extended structure. Supporting VFs
is left as a future TODO. The related discussion can be found at the link
below:

https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Reviewed-by: Eric Auger 
---
v7 -> v8:
*) refine the commit message and the subject.

v5 -> v6:
*) add review-by from Eric Auger.

v1 -> v2:
*) added in v2, but it was sent in a separate patchseries before
---
 drivers/vfio/pci/vfio_pci_config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index a402adee8a21..95b5478f51ac 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = 
{
[PCI_EXT_CAP_ID_LTR]=   PCI_EXT_CAP_LTR_SIZEOF,
[PCI_EXT_CAP_ID_SECPCI] =   0,  /* not yet */
[PCI_EXT_CAP_ID_PMUX]   =   0,  /* not yet */
-   [PCI_EXT_CAP_ID_PASID]  =   0,  /* not yet */
+   [PCI_EXT_CAP_ID_PASID]  =   PCI_EXT_CAP_PASID_SIZEOF,
 };
 
 /*
-- 
2.25.1
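
To make the effect of this one-line change easier to see, here is a small
standalone sketch of a length table of the same shape. As far as I read
vfio_pci_config.c, a zero entry causes the extended capability to be hidden
from the user, so giving PASID a non-zero length is what exposes it. The
IDs and sizes below are believed to match pci_regs.h; the demo_* names and
the trimmed table bound are assumptions made for this example only.

```c
#include <stdint.h>
#include <stdbool.h>

/* Extended capability IDs and sizes, believed to match pci_regs.h. */
#define PCI_EXT_CAP_ID_LTR		0x18
#define PCI_EXT_CAP_ID_SECPCI		0x19
#define PCI_EXT_CAP_ID_PMUX		0x1A
#define PCI_EXT_CAP_ID_PASID		0x1B
#define PCI_EXT_CAP_LTR_SIZEOF		8
#define PCI_EXT_CAP_PASID_SIZEOF	8

/* Table bound trimmed for the sketch; the kernel table covers more IDs. */
#define DEMO_EXT_CAP_ID_MAX		PCI_EXT_CAP_ID_PASID

static const uint16_t demo_ext_cap_length[DEMO_EXT_CAP_ID_MAX + 1] = {
	[PCI_EXT_CAP_ID_LTR]	= PCI_EXT_CAP_LTR_SIZEOF,
	[PCI_EXT_CAP_ID_SECPCI]	= 0,	/* not yet */
	[PCI_EXT_CAP_ID_PMUX]	= 0,	/* not yet */
	[PCI_EXT_CAP_ID_PASID]	= PCI_EXT_CAP_PASID_SIZEOF,
};

/* A zero length (or an ID outside the table) means "hidden from the user". */
static bool demo_ext_cap_visible(unsigned int id)
{
	return id <= DEMO_EXT_CAP_ID_MAX && demo_ext_cap_length[id] != 0;
}
```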



[Patch v8 07/10] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2021-03-02 Thread Liu Yi L
In recent years, the mediated device pass-through framework (e.g. vfio-mdev)
has been used to achieve flexible device sharing across domains (e.g. VMs).
There are also hardware-assisted mediated pass-through solutions from
platform vendors, e.g. Intel VT-d scalable mode, which supports Intel
Scalable I/O Virtualization technology. Such mdevs are called IOMMU-backed
mdevs because their DMA isolation is enforced by the IOMMU. In the kernel,
IOMMU-backed mdevs are exposed to the IOMMU layer through the aux-domain
concept, which means mdevs are protected by an iommu domain that is
auxiliary to the domain the kernel driver primarily uses for the DMA
API. Details can be found in the KVM presentation below:

https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf

This patch extends NESTING_IOMMU ops to IOMMU-backed mdev devices. The
main requirement is to use the auxiliary domain associated with mdev.

Cc: Kevin Tian 
CC: Jacob Pan 
CC: Jun Tian 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Reviewed-by: Eric Auger 
---
v5 -> v6:
*) add review-by from Eric Auger.

v1 -> v2:
*) check the iommu_device to ensure the handling mdev is IOMMU-backed
---
 drivers/vfio/vfio_iommu_type1.c | 35 -
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 86b6d8f9789a..883a79f36c46 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2635,18 +2635,37 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static struct device *vfio_get_iommu_device(struct vfio_group *group,
+   struct device *dev)
+{
+   if (group->mdev_group)
+   return vfio_mdev_get_iommu_device(dev);
+   else
+   return dev;
+}
+
 static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
+   return iommu_uapi_sva_bind_gpasid(dc->domain, iommu_device,
  (void __user *)arg);
 }
 
 static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
/*
 * dc->user is a toggle for the unbind operation. When user
@@ -2659,12 +2678,12 @@ static int vfio_dev_unbind_gpasid_fn(struct device 
*dev, void *data)
if (dc->user) {
unsigned long arg = *(unsigned long *)dc->data;
 
-   iommu_uapi_sva_unbind_gpasid(dc->domain,
-dev, (void __user *)arg);
+   iommu_uapi_sva_unbind_gpasid(dc->domain, iommu_device,
+(void __user *)arg);
} else {
ioasid_t pasid = *(ioasid_t *)dc->data;
 
-   iommu_sva_unbind_gpasid(dc->domain, dev, pasid);
+   iommu_sva_unbind_gpasid(dc->domain, iommu_device, pasid);
}
return 0;
 }
@@ -3295,8 +3314,14 @@ static int vfio_dev_cache_invalidate_fn(struct device 
*dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
+   iommu_uapi_cache_invalidate(dc->domain, iommu_device,
+   (void __user *)arg);
return 0;
 }
 
-- 
2.25.1
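
The device selection done by vfio_get_iommu_device() above can be modelled
in a few lines. In this sketch an mdev carries a pointer to the physical
device that provides its IOMMU backing, and a NULL pointer stands for an
mdev that is not IOMMU-backed. The demo_* types and the iommu_parent field
are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <errno.h>

struct demo_device {
	const char *name;
	struct demo_device *iommu_parent;  /* set only for IOMMU-backed mdevs */
};

struct demo_group {
	bool mdev_group;
};

/* Same shape as vfio_get_iommu_device(): NULL means "not IOMMU-backed". */
static struct demo_device *demo_get_iommu_device(const struct demo_group *group,
						 struct demo_device *dev)
{
	if (group->mdev_group)
		return dev->iommu_parent;
	return dev;
}

/* Callers reject the operation when no IOMMU device can be resolved. */
static int demo_bind(const struct demo_group *group, struct demo_device *dev,
		     struct demo_device **bound_to)
{
	struct demo_device *iommu_device = demo_get_iommu_device(group, dev);

	if (!iommu_device)
		return -EINVAL;
	*bound_to = iommu_device;
	return 0;
}
```

For a regular PCI group the operation lands on the device itself; for an
mdev group it lands on the backing device, or fails if there is none.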



[Patch v8 06/10] iommu: Pass domain to sva_unbind_gpasid()

2021-03-02 Thread Liu Yi L
From: Yi Sun 

The current interface is good enough for SVA virtualization on an assigned
physical PCI device, but when it comes to mediated devices, a physical
device may be attached to multiple aux-domains. Also, for guest unbind,
the PASID to be unbound must have been allocated to the VM. This check
requires knowing the ioasid_set associated with the domain.

So this interface needs to pass in domain info. The iommu driver is then
able to know which domain will be used for the 2nd stage translation of
the nesting mode, and is also able to do the PASID ownership check. This
patch passes @domain for the above reasons.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Yi Sun 
Signed-off-by: Liu Yi L 
---
v7 -> v8:
*) tweaked the commit message.

v6 -> v7:
*) correct the link for the details of modifying pasid prototype to be "u32".
*) hold off r-b from Eric Auger as there is modification in this patch, will
   seek r-b in this version.

v5 -> v6:
*) use "u32" prototype for @pasid.
*) add review-by from Eric Auger.

v2 -> v3:
*) pass in domain info only
*) use u32 for pasid instead of int type

v1 -> v2:
*) added in v2.
---
 drivers/iommu/intel/svm.c   | 3 ++-
 drivers/iommu/iommu.c   | 2 +-
 include/linux/intel-iommu.h | 3 ++-
 include/linux/iommu.h   | 3 ++-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 561d011c7287..7521b4aefd16 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -496,7 +496,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
return ret;
 }
 
-int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
+int intel_svm_unbind_gpasid(struct iommu_domain *domain,
+   struct device *dev, u32 pasid)
 {
struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
struct intel_svm_dev *sdev;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d46f103a1e4b..822e485683ae 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2185,7 +2185,7 @@ int iommu_sva_unbind_gpasid(struct iommu_domain *domain, 
struct device *dev,
if (unlikely(!domain->ops->sva_unbind_gpasid))
return -ENODEV;
 
-   return domain->ops->sva_unbind_gpasid(dev, pasid);
+   return domain->ops->sva_unbind_gpasid(domain, dev, pasid);
 }
 EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 554aa946f142..aaf403966444 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -755,7 +755,8 @@ extern int intel_svm_enable_prq(struct intel_iommu *iommu);
 extern int intel_svm_finish_prq(struct intel_iommu *iommu);
 int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
  struct iommu_gpasid_bind_data *data);
-int intel_svm_unbind_gpasid(struct device *dev, u32 pasid);
+int intel_svm_unbind_gpasid(struct iommu_domain *domain,
+   struct device *dev, u32 pasid);
 struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm,
 void *drvdata);
 void intel_svm_unbind(struct iommu_sva *handle);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 5e7fe519430a..4840217a590b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -299,7 +299,8 @@ struct iommu_ops {
int (*sva_bind_gpasid)(struct iommu_domain *domain,
struct device *dev, struct iommu_gpasid_bind_data 
*data);
 
-   int (*sva_unbind_gpasid)(struct device *dev, u32 pasid);
+   int (*sva_unbind_gpasid)(struct iommu_domain *domain,
+struct device *dev, u32 pasid);
 
int (*def_domain_type)(struct device *dev);
 
-- 
2.25.1
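
To illustrate why the extra @domain argument matters, the sketch below shows
an unbind callback that can only do its ownership check because the domain
is passed in. Everything here is hypothetical: the demo_* names, the way a
domain records its PASID set, the owner table and the error codes are
assumptions for the example and are not taken from the VT-d driver.

```c
#include <stdint.h>
#include <errno.h>

#define DEMO_MAX_PASID 8

struct demo_domain {
	int ioasid_set;  /* hypothetical: ID of the PASID pool owned by the VM */
};

/* Hypothetical table: which set each PASID was allocated from (-1 = free). */
static int demo_pasid_owner[DEMO_MAX_PASID] = { -1, 1, 1, 2, -1, -1, -1, -1 };

struct demo_ops {
	/* New-style callback: the domain is passed in alongside the PASID. */
	int (*sva_unbind_gpasid)(struct demo_domain *domain, uint32_t pasid);
};

static int demo_unbind_gpasid(struct demo_domain *domain, uint32_t pasid)
{
	if (pasid >= DEMO_MAX_PASID)
		return -EINVAL;
	/* Ownership check that needs the domain: refuse foreign PASIDs. */
	if (demo_pasid_owner[pasid] != domain->ioasid_set)
		return -EPERM;
	return 0;
}

static const struct demo_ops demo_iommu_ops = {
	.sva_unbind_gpasid = demo_unbind_gpasid,
};
```

With the old (dev, pasid) signature the callback would have no way to tell
which VM is asking, so a guest could name a PASID that belongs to another
set.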



[Patch v8 05/10] vfio/type1: Allow invalidating first-level/stage IOMMU cache

2021-03-02 Thread Liu Yi L
This patch provides an interface allowing the userspace to invalidate
IOMMU cache for first-level page table. It is required when the first
level IOMMU page table is not managed by the host kernel in the nested
translation setup.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Eric Auger 
Signed-off-by: Jacob Pan 
---
v1 -> v2:
*) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type"
*) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn()
*) vfio_dev_cache_inv_fn() always successful
*) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse VFIO_IOMMU_NESTING_OP
---
 drivers/vfio/vfio_iommu_type1.c | 38 +
 include/uapi/linux/vfio.h   |  3 +++
 2 files changed, 41 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0044931b80dc..86b6d8f9789a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3291,6 +3291,41 @@ static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   unsigned long arg = *(unsigned long *)dc->data;
+
+   iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
+   return 0;
+}
+
+static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu,
+   unsigned long arg)
+{
+   struct domain_capsule dc = { .data = &arg };
+   struct iommu_nesting_info *info;
+   int ret;
+
+   mutex_lock(&iommu->lock);
+   info = iommu->nesting_info;
+   if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) {
+   ret = -EOPNOTSUPP;
+   goto out_unlock;
+   }
+
+   ret = vfio_prepare_nesting_domain_capsule(iommu, &dc);
+   if (ret)
+   goto out_unlock;
+
+   iommu_group_for_each_dev(dc.group->iommu_group, &dc,
+vfio_dev_cache_invalidate_fn);
+
+out_unlock:
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
 static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
unsigned long arg)
 {
@@ -3313,6 +3348,9 @@ static long vfio_iommu_type1_nesting_op(struct vfio_iommu 
*iommu,
case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
break;
+   case VFIO_IOMMU_NESTING_OP_CACHE_INVLD:
+   ret = vfio_iommu_invalidate_cache(iommu, arg + minsz);
+   break;
default:
ret = -EINVAL;
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 985e6cf4c52d..08b8d236dfee 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1245,6 +1245,8 @@ struct vfio_iommu_type1_dirty_bitmap_get {
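  * +-----------------+-----------------------------------------------+
  * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
  * +-----------------+-----------------------------------------------+
+ * | CACHE_INVLD     |      struct iommu_cache_invalidate_info       |
+ * +-----------------+-----------------------------------------------+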
  *
  * returns: 0 on success, -errno on failure.
  */
@@ -1258,6 +1260,7 @@ struct vfio_iommu_type1_nesting_op {
 enum {
VFIO_IOMMU_NESTING_OP_BIND_PGTBL,
VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL,
+   VFIO_IOMMU_NESTING_OP_CACHE_INVLD,
 };
 
 #define VFIO_IOMMU_NESTING_OP  _IO(VFIO_TYPE, VFIO_BASE + 18)
-- 
2.25.1
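
The "arg + minsz" pattern in vfio_iommu_type1_nesting_op() is a fixed
header followed by an op-specific payload. A minimal standalone sketch of
that dispatch is given below. The struct is a simplified stand-in for
struct vfio_iommu_type1_nesting_op, the 0xffff op mask is an assumption
(the real mask is not visible in this hunk), and the demo_* names are
invented for the example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <errno.h>

/* Simplified stand-in for struct vfio_iommu_type1_nesting_op. */
struct demo_nesting_op {
	uint32_t argsz;
	uint32_t flags;   /* low bits select the operation */
	uint8_t  data[];  /* op-specific payload follows the header */
};

enum {
	DEMO_OP_BIND_PGTBL,
	DEMO_OP_UNBIND_PGTBL,
	DEMO_OP_CACHE_INVLD,
};

#define DEMO_OP_MASK 0xffffu  /* assumed width of the op field */

/* On success, stores the op and returns the offset of its payload. */
static long demo_nesting_op(const void *buf, size_t len, int *op_out)
{
	const size_t minsz = offsetof(struct demo_nesting_op, data);
	struct demo_nesting_op hdr;

	if (len < minsz)
		return -EINVAL;
	memcpy(&hdr, buf, minsz);   /* copy the fixed header only */
	if (hdr.argsz < minsz)
		return -EINVAL;

	switch (hdr.flags & DEMO_OP_MASK) {
	case DEMO_OP_BIND_PGTBL:
	case DEMO_OP_UNBIND_PGTBL:
	case DEMO_OP_CACHE_INVLD:
		*op_out = (int)(hdr.flags & DEMO_OP_MASK);
		return (long)minsz;  /* payload lives at arg + minsz */
	default:
		return -EINVAL;
	}
}
```

Each handler then parses its own structure starting at the returned offset,
which is why adding CACHE_INVLD only needs a new enum value and a new case.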



[Patch v8 04/10] vfio/type1: Support binding guest page tables to PASID

2021-03-02 Thread Liu Yi L
Nesting translation allows two-level/stage page tables, with the 1st level
for guest translations (e.g. GVA->GPA) and the 2nd level for host
translations (e.g. GPA->HPA). This patch adds an interface for binding guest
page tables to a PASID. This PASID must have been allocated by userspace
before the binding request, e.g. allocated from /dev/ioasid. As the bind
data is parsed by the iommu abstraction layer, this patch does not perform
the ownership check against the PASID from userspace. That check is done in
the iommu subsystem.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v7 -> v8:
*) adapt to /dev/ioasid
*) address comments from Alex on v7.
*) adapt to latest iommu_sva_unbind_gpasid() implementation.
*) remove the OP check against VFIO_IOMMU_NESTING_OP_NUM as it's redundant
   to the default switch case in vfio_iommu_handle_pgtbl_op().

v6 -> v7:
*) introduced @user in struct domain_capsule to simplify the code per Eric's
   suggestion.
*) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace.
*) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn().

v5 -> v6:
*) dropped vfio_find_nesting_group() and add vfio_get_nesting_domain_capsule().
   per comment from Eric.
*) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in
   linux/iommu.h for userspace operation and in-kernel operation.

v3 -> v4:
*) address comments from Alex on v3

v2 -> v3:
*) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
   
https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun@linux.intel.com/

v1 -> v2:
*) rename subject from "vfio/type1: Bind guest page tables to host"
*) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/
   unbind guest page table
*) replaced vfio_iommu_for_each_dev() with a group level loop since this
   series enforces one group per container w/ nesting type as start.
*) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
*) vfio_dev_unbind_gpasid() always successful
*) use vfio_mm->pasid_lock to avoid race between PASID free and page table
   bind/unbind
---
 drivers/vfio/vfio_iommu_type1.c | 156 
 include/uapi/linux/vfio.h   |  35 +++
 2 files changed, 191 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3a5c84d4f19b..0044931b80dc 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -164,6 +164,34 @@ struct vfio_regions {
 
 #define WAITED 1
 
+struct domain_capsule {
+   struct vfio_group   *group;
+   struct iommu_domain *domain;
+   void                *data;
+   /* set if @data contains a user pointer */
+   bool                user;
+};
+
+/* iommu->lock must be held */
+static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu,
+  struct domain_capsule *dc)
+{
+   struct vfio_domain *domain;
+   struct vfio_group *group;
+
+   if (!iommu->nesting_info)
+   return -EINVAL;
+
+   domain = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+   group = list_first_entry(&domain->group_list,
+struct vfio_group, next);
+   dc->group = group;
+   dc->domain = domain->domain;
+   dc->user = true;
+   return 0;
+}
+
 static int put_pfn(unsigned long pfn, int prot);
 
 static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
@@ -2607,6 +2635,51 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   unsigned long arg = *(unsigned long *)dc->data;
+
+   return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
+ (void __user *)arg);
+}
+
+static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+
+   /*
+* dc->user is a toggle for the unbind operation. When user
+* set, the dc->data passes in a __user pointer and requires
+* to use iommu_uapi_sva_unbind_gpasid(), in which it will
+* copy the unbind data from the user buffer. When user is
+* clear, the dc->data passes in a pasid which is going to
+* be unbind no need to copy data from userspace.
+*/
+   if (dc->user) {
+   unsigned long arg = *(unsigned long *)dc->data;
+
+   iommu_uapi_sva_unbind_gpasid(dc->domain,
+

[Patch v8 03/10] vfio/type1: Report iommu nesting info to userspace

2021-03-02 Thread Liu Yi L
This patch exports iommu nesting capability info to user space through
VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
bind page table, cache invalidation) and the vendor specific format
information for first level/stage page table that will be bound to.
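
For userspace, checking this info means walking the capability chain that
VFIO_IOMMU_GET_INFO returns. The sketch below shows such a walk over a
plain byte buffer. The header struct mirrors the layout of struct
vfio_info_cap_header as I understand it (id, version, next offset from the
start of the buffer, 0 terminating the chain); DEMO_CAP_NESTING is a
placeholder ID and demo_find_cap() is a made-up helper, so take the real
capability ID from the uAPI header.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Same field layout as struct vfio_info_cap_header in linux/vfio.h. */
struct demo_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;  /* offset of the next capability from buffer start, 0 = end */
};

#define DEMO_CAP_NESTING 3  /* placeholder ID for this example */

/* Walk the chain starting at cap_offset; return the offset of cap `id` or 0. */
static uint32_t demo_find_cap(const uint8_t *buf, size_t len,
			      uint32_t cap_offset, uint16_t id)
{
	uint32_t off = cap_offset;

	while (off != 0 && (size_t)off + sizeof(struct demo_cap_header) <= len) {
		struct demo_cap_header hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		if (hdr.next <= off)  /* end of chain, or a malformed backward link */
			break;
		off = hdr.next;
	}
	return 0;
}
```

If the nesting capability is not found, userspace should treat nesting as
unavailable on this container.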

The nesting info is available only after the container is set to NESTED type.
Current implementation imposes one limitation - one nesting container
should include at most one iommu group. The philosophy of vfio container
is having all groups/devices within the container share the same IOMMU
context. When vSVA is enabled, one IOMMU context could include one 2nd-
level address space and multiple 1st-level address spaces. While the
2nd-level address space is reasonably sharable by multiple groups, blindly
sharing 1st-level address spaces across all groups within the container
might instead break the guest expectation. In the future sub/super container
concept might be introduced to allow partial address space sharing within
an IOMMU context. But for now let's go with this restriction by requiring
singleton container for using nesting iommu features. Below link has the
related discussion about this decision.

https://lore.kernel.org/kvm/20200515115924.37e69...@w520.home/

This patch also changes the NESTING type container behaviour. Something
that would have succeeded before will now fail: Before this series, if
user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have succeeded even
if the SMMU didn't support stage-2, as the driver would have silently
fallen back on stage-1 mappings (which work exactly the same as stage-2
only since there was no nesting supported). After the series, we do check
for DOMAIN_ATTR_NESTING so if user asks for VFIO_IOMMU_TYPE1_NESTING and
the SMMU doesn't support stage-2, the ioctl fails. But it should be a good
fix and completely harmless. Detail can be found in below link as well.

https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
---
v7 -> v8:
*) tweak per Alex's comments against v7.
*) check "iommu->nesting_info->format == 0" in attach_group()

v6 -> v7:
*) using vfio_info_add_capability() for adding nesting cap per suggestion
   from Eric.

v5 -> v6:
*) address comments against v5 from Eric Auger.
*) don't report nesting cap to userspace if the nesting_info->format is
   invalid.

v4 -> v5:
*) address comments from Eric Auger.
*) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
   cap is much "cheap", if needs extension in future, just define another cap.
   https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/

v3 -> v4:
*) address comments against v3.

v1 -> v2:
*) added in v2
---
 drivers/vfio/vfio_iommu_type1.c | 102 +++-
 include/uapi/linux/vfio.h   |  19 ++
 2 files changed, 105 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4bb162c1d649..3a5c84d4f19b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -63,22 +63,24 @@ MODULE_PARM_DESC(dma_entry_limit,
 "Maximum number of user DMA mappings per container (65535).");
 
 struct vfio_iommu {
-   struct list_head    domain_list;
-   struct list_head    iova_list;
-   struct vfio_domain  *external_domain; /* domain for external user */
-   struct mutex        lock;
-   struct rb_root      dma_list;
-   struct blocking_notifier_head notifier;
-   unsigned int        dma_avail;
-   unsigned int        vaddr_invalid_count;
-   uint64_t            pgsize_bitmap;
-   uint64_t            num_non_pinned_groups;
-   wait_queue_head_t   vaddr_wait;
-   bool                v2;
-   bool                nesting;
-   bool                dirty_page_tracking;
-   bool                pinned_page_dirty_scope;
-   bool                container_open;
+   struct list_head                domain_list;
+   struct list_head                iova_list;
+   /* domain for external user */
+   struct vfio_domain              *external_domain;
+   struct mutex                    lock;
+   struct rb_root                  dma_list;
+   struct blocking_notifier_head   notifier;
+   unsigned int                    dma_avail;
+   unsigned int                    vaddr_invalid_count;
+   uint64_t                        pgsize_bitmap;
+   uint64_t                        num_non_pinned_groups;
+   wait_queue_head_t               vaddr_wait;
+   struct iommu_nesting_info       *nesting_info;
+   bool                            v2;
+   bool                            nesting;
+   bool

[Patch v8 02/10] iommu/smmu: Report empty domain nesting info

2021-03-02 Thread Liu Yi L
This patch is added because, instead of returning a boolean for
DOMAIN_ATTR_NESTING, iommu_domain_get_attr() should return an
iommu_nesting_info handle. For now, return an empty nesting info struct, as
true nesting is not yet supported by the SMMUs.

Note: this patch only ensures there is no compile issue. To be functional
on ARM platforms, the patches from Vivek Gautam at the link below need to
be applied.

https://lore.kernel.org/linux-iommu/20210212105859.8445-1-vivek.gau...@arm.com/

Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Suggested-by: Jean-Philippe Brucker 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
Reviewed-by: Eric Auger 
---
v5 -> v6:
*) add review-by from Eric Auger.

v4 -> v5:
*) address comments from Eric Auger.
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 +++--
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 29 +++--
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a83043..99ea3ee35826 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2449,6 +2449,32 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+   void *data)
+{
+   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+   unsigned int size;
+
+   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -ENODEV;
+
+   size = sizeof(struct iommu_nesting_info);
+
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->argsz < size) {
+   info->argsz = size;
+   return 0;
+   }
+
+   /* report an empty iommu_nesting_info for now */
+   memset(info, 0x0, size);
+   info->argsz = size;
+   return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
 {
@@ -2458,8 +2484,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
case IOMMU_DOMAIN_UNMANAGED:
switch (attr) {
case DOMAIN_ATTR_NESTING:
-   *(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
-   return 0;
+   return arm_smmu_domain_nesting_info(smmu_domain, data);
default:
return -ENODEV;
}
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index d8c6bfde6a61..d874c580ea80 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1481,6 +1481,32 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+   void *data)
+{
+   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+   unsigned int size;
+
+   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -ENODEV;
+
+   size = sizeof(struct iommu_nesting_info);
+
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->argsz < size) {
+   info->argsz = size;
+   return 0;
+   }
+
+   /* report an empty iommu_nesting_info for now */
+   memset(info, 0x0, size);
+   info->argsz = size;
+   return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
 {
@@ -1490,8 +1516,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
case IOMMU_DOMAIN_UNMANAGED:
switch (attr) {
case DOMAIN_ATTR_NESTING:
-   *(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
-   return 0;
+   return arm_smmu_domain_nesting_info(smmu_domain, data);
case DOMAIN_ATTR_IO_PGTABLE_CFG: {
struct io_pgtable_domain_attr *pgtbl_cfg = data;
*pgtbl_cfg = smmu_domain->pgtbl_cfg;
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
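
The argsz handling in arm_smmu_domain_nesting_info() above can be modelled in
userspace roughly as follows. This is only a sketch under stated assumptions:
struct nesting_info_model, report_nesting_info() and query_nesting_info() are
hypothetical names, the struct carries just a few of the generic fields, and
-1 stands in for -ENODEV. It shows that a too-small buffer is not treated as
an error; the callee only writes back the size it expects.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct iommu_nesting_info; only argsz matters here. */
struct nesting_info_model {
    uint32_t argsz;     /* in: caller's buffer size, out: size the callee expects */
    uint32_t flags;
    uint32_t format;
    uint32_t features;
};

/*
 * Mirrors the control flow of arm_smmu_domain_nesting_info() in the patch:
 * a short buffer is not an error, the callee only reports the expected size.
 */
static int report_nesting_info(struct nesting_info_model *info, int nested)
{
    uint32_t size = sizeof(*info);

    if (!info || !nested)
        return -1;              /* the patch returns -ENODEV here */

    if (info->argsz < size) {
        info->argsz = size;     /* tell the caller how much to provide */
        return 0;
    }

    memset(info, 0, size);      /* empty info: no features reported */
    info->argsz = size;
    return 0;
}

/* Caller side: 1 if the buffer was filled, 0 if it was too small, -1 on error. */
static int query_nesting_info(struct nesting_info_model *info, uint32_t provided)
{
    assert(info != NULL);
    info->argsz = provided;
    if (report_nesting_info(info, 1))
        return -1;
    return info->argsz <= provided;
}
```

Since both outcomes return 0, a caller in this model has to compare the
returned argsz with the size it passed in to tell "filled" from "too small".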


[Patch v8 01/10] iommu: Report domain nesting info

2021-03-02 Thread Liu Yi L
IOMMUs that support nesting translation need to report their capability info
to userspace. It gives information about the requirements userspace needs
to implement, plus other features characterizing the physical implementation.

This patch introduces a new IOMMU UAPI struct that gives information about
the nesting capabilities and features. This struct is supposed to be returned
by iommu_domain_get_attr() with the DOMAIN_ATTR_NESTING attribute parameter,
for a domain whose type has been set to DOMAIN_ATTR_NESTING.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v7 -> v8:
*) add padding in struct iommu_nesting_info_vtd
*) describe future extension rules for struct iommu_nesting_info in iommu.rst.
*) remove SYSWIDE_PASID

v6 -> v7:
*) rephrase the commit message, replace the @data[] field in struct
   iommu_nesting_info with union per comments from Eric Auger.

v5 -> v6:
*) rephrase the feature notes per comments from Eric Auger.
*) rename @size of struct iommu_nesting_info to @argsz.

v4 -> v5:
*) address comments from Eric Auger.

v3 -> v4:
*) split the SMMU driver changes to be a separate patch
*) move the @addr_width and @pasid_bits from vendor specific
   part to generic part.
*) tweak the description for the @features field of struct
   iommu_nesting_info.
*) add description on the @data[] field of struct iommu_nesting_info

v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
*) reuse DOMAIN_ATTR_NESTING to get nesting info.
*) return an empty iommu_nesting_info for SMMU drivers per Jean's
   suggestion.
---
 Documentation/userspace-api/iommu.rst |  5 +-
 include/uapi/linux/iommu.h| 72 +++
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/Documentation/userspace-api/iommu.rst 
b/Documentation/userspace-api/iommu.rst
index d3108c1519d5..ad06bb94aad5 100644
--- a/Documentation/userspace-api/iommu.rst
+++ b/Documentation/userspace-api/iommu.rst
@@ -26,6 +26,7 @@ supported user-kernel APIs are as follows:
 2. Bind/Unbind guest PASID table (e.g. ARM SMMU)
 3. Invalidate IOMMU caches upon guest requests
 4. Report errors to the guest and serve page requests
+5. Read iommu_nesting_info from kernel
 
 Requirements
 
@@ -96,7 +97,9 @@ kernel. Simply recompiling existing code with newer kernel 
header should
 not be an issue in that only existing flags are used.
 
 IOMMU vendor driver should report the below features to IOMMU UAPI
-consumers (e.g. via VFIO).
+consumers (e.g. via VFIO). The feature list is passed by struct
+iommu_nesting_info. The future extension to this structure follows
+the rule defined in section "Extension Rules & Precautions".
 
 1. IOMMU_NESTING_FEAT_SYSWIDE_PASID
 2. IOMMU_NESTING_FEAT_BIND_PGTBL
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index e1d9e75f2c94..e924bfc091e8 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -338,4 +338,76 @@ struct iommu_gpasid_bind_data {
} vendor;
 };
 
+/*
+ * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info.
+ *
+ * @flags: VT-d specific flags. Currently reserved for future
+ * extension. must be set to 0.
+ * @cap_reg:   Describe basic capabilities as defined in VT-d capability
+ * register.
+ * @ecap_reg:  Describe the extended capabilities as defined in VT-d
+ * extended capability register.
+ */
+struct iommu_nesting_info_vtd {
+   __u32   flags;
+   __u8padding[12];
+   __u64   cap_reg;
+   __u64   ecap_reg;
+};
+
+/*
+ * struct iommu_nesting_info - Information for nesting-capable IOMMU.
+ *userspace should check it before using
+ *nesting capability.
+ *
+ * @argsz: size of the whole structure.
+ * @flags: currently reserved for future extension. must set to 0.
+ * @format:PASID table entry format, the same definition as struct
+ * iommu_gpasid_bind_data @format.
+ * @features:  supported nesting features.
+ * @addr_width:the output addr width of first level/stage translation
+ * @pasid_bits:maximum supported PASID bits, 0 represents no PASID
+ * support.
+ * @vendor:vendor specific data, structure type can be deduced from
+ * @format field.
+ *
+ * +===+==+
+ * | feature   |  Notes   |
+ * +===+==+
+ * | BIND_PGTBL|  IOMMU vendor driver sets it to mandate userspace to |
+ * |   |  bind the first level/stage page table to associated |
+ * |   |  PASID (either the one specified in bind request or  |
+ * |   |  the default PASID

[Patch v8 00/10] vfio: expose virtual Shared Virtual Addressing to VMs

2021-03-02 Thread Liu Yi L
-66527-1-git-send-email-yi.l@intel.com/

- Patch v1 -> Patch v2:
  a) Refactor vfio_iommu_type1_ioctl() per suggestion from Christoph
 Hellwig.
  b) Re-sequence the patch series for better bisect support.
  c) Report IOMMU nesting cap info in detail instead of a format in
 v1.
  d) Enforce one group per nesting type container for vfio iommu type1
 driver.
  e) Build the vfio_mm related code from vfio.c to be a separate
 vfio_pasid.ko.
  f) Add PASID ownership check in IOMMU driver.
  g) Adapted to latest IOMMU UAPI design. Removed IOMMU UAPI version
 check. Added iommu_gpasid_unbind_data for unbind requests from
 userspace.
  h) Define a single ioctl:VFIO_IOMMU_NESTING_OP for bind/unbind_gtbl
 and cache_invld.
  i) Document dual stage control in vfio.rst.
  Patch v1: 
https://lore.kernel.org/kvm/1584880325-10561-1-git-send-email-yi.l@intel.com/

- RFC v3 -> Patch v1:
  a) Address comments to the PASID request(alloc/free) path
  b) Report PASID alloc/free availability to user-space
  c) Add a vfio_iommu_type1 parameter to support pasid quota tuning
  d) Adjusted to latest ioasid code implementation. e.g. remove the
 code for tracking the allocated PASIDs as latest ioasid code
 will track it, VFIO could use ioasid_free_set() to free all
 PASIDs.
  RFC v3: 
https://lore.kernel.org/kvm/1580299912-86084-1-git-send-email-yi.l@intel.com/

- RFC v2 -> v3:
  a) Refine the whole patchset to roughly fit the parts in this series
  b) Adds complete vfio PASID management framework. e.g. pasid alloc,
  free, reclaim in VM crash/down and per-VM PASID quota to prevent
  PASID abuse.
  c) Adds IOMMU uAPI version check and page table format check to ensure
  version compatibility and hardware compatibility.
  d) Adds vSVA vfio support for IOMMU-backed mdevs.
  RFC v2: 
https://lore.kernel.org/kvm/1571919983-3231-1-git-send-email-yi.l@intel.com/

- RFC v1 -> v2:
  Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.
  RFC v1: 
https://lore.kernel.org/kvm/1562324772-3084-1-git-send-email-yi.l@intel.com/

---
Eric Auger (1):
  vfio: Document dual stage control

Liu Yi L (8):
  iommu: Report domain nesting info
  iommu/smmu: Report empty domain nesting info
  vfio/type1: Report iommu nesting info to userspace
  vfio/type1: Support binding guest page tables to PASID
  vfio/type1: Allow invalidating first-level/stage IOMMU cache
  vfio/type1: Add vSVA support for IOMMU-backed mdevs
  vfio/pci: Expose PCIe PASID capability to userspace
  iommu/vt-d: Support reporting nesting capability info

Yi Sun (1):
  iommu: Pass domain to sva_unbind_gpasid()

 Documentation/driver-api/vfio.rst   |  77 +
 Documentation/userspace-api/iommu.rst   |   5 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  29 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |  29 +-
 drivers/iommu/intel/cap_audit.h |   7 +
 drivers/iommu/intel/iommu.c |  68 -
 drivers/iommu/intel/svm.c   |   3 +-
 drivers/iommu/iommu.c   |   2 +-
 drivers/vfio/pci/vfio_pci_config.c  |   2 +-
 drivers/vfio/vfio_iommu_type1.c | 321 +++-
 include/linux/intel-iommu.h |   3 +-
 include/linux/iommu.h   |   3 +-
 include/uapi/linux/iommu.h  |  72 +
 include/uapi/linux/vfio.h   |  57 
 14 files changed, 651 insertions(+), 27 deletions(-)

-- 
2.25.1



RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info

2021-01-23 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Tuesday, January 19, 2021 6:03 PM
> 
> Hi Yi, Vivek,
> 
[...]
> > I see. I think there needs a change in the code there. Should also expect
> > a nesting_info returned instead of an int anymore. @Eric, how about your
> > opinion?
> >
> > domain = iommu_get_domain_for_dev(>pdev->dev);
> > ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
> > &info);
> > if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
> > /*
> >  * No need to go further as no page request service support.
> >  */
> >  */
> > return 0;
> > }
> Sure I think it is "just" a matter of synchro between the 2 series. Yi,

exactly.

> do you have plans to respin part of
> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
> or would you allow me to embed this patch in my series.

My v7 hasn't touched the prq change yet, so I think it's better for you to
embed it in your series. ^_^

Regards,
Yi Liu


RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info

2021-01-12 Thread Liu, Yi L
Hi Vivek,

> From: Vivek Gautam 
> Sent: Tuesday, January 12, 2021 7:06 PM
> 
> Hi Yi,
> 
> 
> On Tue, Jan 12, 2021 at 2:51 PM Liu, Yi L  wrote:
> >
> > Hi Vivek,
> >
> > > From: Vivek Gautam 
> > > Sent: Tuesday, January 12, 2021 2:50 PM
> > >
> > > Hi Yi,
> > >
> > >
> > > On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L  wrote:
> > > >
> > > > This patch is added as instead of returning a boolean for
> > > DOMAIN_ATTR_NESTING,
> > > > iommu_domain_get_attr() should return an iommu_nesting_info
> handle.
> > > For
> > > > now, return an empty nesting info struct for now as true nesting is not
> > > > yet supported by the SMMUs.
> > > >
> > > > Cc: Will Deacon 
> > > > Cc: Robin Murphy 
> > > > Cc: Eric Auger 
> > > > Cc: Jean-Philippe Brucker 
> > > > Suggested-by: Jean-Philippe Brucker 
> > > > Signed-off-by: Liu Yi L 
> > > > Signed-off-by: Jacob Pan 
> > > > Reviewed-by: Eric Auger 
> > > > ---
> > > > v5 -> v6:
> > > > *) add review-by from Eric Auger.
> > > >
> > > > v4 -> v5:
> > > > *) address comments from Eric Auger.
> > > > ---
> > > >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29
> > > +++--
> > > >  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 29
> > > +++--
> > > >  2 files changed, 54 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > index 7196207..016e2e5 100644
> > > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > @@ -3019,6 +3019,32 @@ static struct iommu_group
> > > *arm_smmu_device_group(struct device *dev)
> > > > return group;
> > > >  }
> > > >
> > > > +static int arm_smmu_domain_nesting_info(struct
> arm_smmu_domain
> > > *smmu_domain,
> > > > +   void *data)
> > > > +{
> > > > +   struct iommu_nesting_info *info = (struct iommu_nesting_info
> > > *)data;
> > > > +   unsigned int size;
> > > > +
> > > > +   if (!info || smmu_domain->stage !=
> ARM_SMMU_DOMAIN_NESTED)
> > > > +   return -ENODEV;
> > > > +
> > > > +   size = sizeof(struct iommu_nesting_info);
> > > > +
> > > > +   /*
> > > > +* if provided buffer size is smaller than expected, should
> > > > +* return 0 and also the expected buffer size to caller.
> > > > +*/
> > > > +   if (info->argsz < size) {
> > > > +   info->argsz = size;
> > > > +   return 0;
> > > > +   }
> > > > +
> > > > +   /* report an empty iommu_nesting_info for now */
> > > > +   memset(info, 0x0, size);
> > > > +   info->argsz = size;
> > > > +   return 0;
> > > > +}
> > > > +
> > > >  static int arm_smmu_domain_get_attr(struct iommu_domain
> *domain,
> > > > enum iommu_attr attr, void *data)
> > > >  {
> > > > @@ -3028,8 +3054,7 @@ static int
> arm_smmu_domain_get_attr(struct
> > > iommu_domain *domain,
> > > > case IOMMU_DOMAIN_UNMANAGED:
> > > > switch (attr) {
> > > > case DOMAIN_ATTR_NESTING:
> > > > -   *(int *)data = (smmu_domain->stage ==
> > > ARM_SMMU_DOMAIN_NESTED);
> > > > -   return 0;
> > > > +   return
> arm_smmu_domain_nesting_info(smmu_domain,
> > > data);
> > >
> > > Thanks for the patch.
> > > This would unnecessarily overflow 'data' for any caller that's expecting
> only
> > > an int data. Dump from one such issue that I was seeing when testing
> > > this change along with local kvmtool changes is pasted below [1].
> > >
> > > I could get around with the issue by adding another (iommu_attr) -
> > > DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info).
> >
> > nice to hear from you. A

RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info

2021-01-12 Thread Liu, Yi L
Hi Vivek,

> From: Vivek Gautam 
> Sent: Tuesday, January 12, 2021 2:50 PM
> 
> Hi Yi,
> 
> 
> On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L  wrote:
> >
> > This patch is added as instead of returning a boolean for
> DOMAIN_ATTR_NESTING,
> > iommu_domain_get_attr() should return an iommu_nesting_info handle.
> For
> > now, return an empty nesting info struct for now as true nesting is not
> > yet supported by the SMMUs.
> >
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Suggested-by: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > Reviewed-by: Eric Auger 
> > ---
> > v5 -> v6:
> > *) add review-by from Eric Auger.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29
> +++--
> >  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 29
> +++--
> >  2 files changed, 54 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207..016e2e5 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3019,6 +3019,32 @@ static struct iommu_group
> *arm_smmu_device_group(struct device *dev)
> > return group;
> >  }
> >
> > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain
> *smmu_domain,
> > +   void *data)
> > +{
> > +   struct iommu_nesting_info *info = (struct iommu_nesting_info
> *)data;
> > +   unsigned int size;
> > +
> > +   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > +   return -ENODEV;
> > +
> > +   size = sizeof(struct iommu_nesting_info);
> > +
> > +   /*
> > +* if provided buffer size is smaller than expected, should
> > +* return 0 and also the expected buffer size to caller.
> > +*/
> > +   if (info->argsz < size) {
> > +   info->argsz = size;
> > +   return 0;
> > +   }
> > +
> > +   /* report an empty iommu_nesting_info for now */
> > +   memset(info, 0x0, size);
> > +   info->argsz = size;
> > +   return 0;
> > +}
> > +
> >  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> > enum iommu_attr attr, void *data)
> >  {
> > @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct
> iommu_domain *domain,
> > case IOMMU_DOMAIN_UNMANAGED:
> > switch (attr) {
> > case DOMAIN_ATTR_NESTING:
> > -   *(int *)data = (smmu_domain->stage ==
> ARM_SMMU_DOMAIN_NESTED);
> > -   return 0;
> > +   return arm_smmu_domain_nesting_info(smmu_domain,
> data);
> 
> Thanks for the patch.
> This would unnecessarily overflow 'data' for any caller that's expecting only
> an int data. Dump from one such issue that I was seeing when testing
> this change along with local kvmtool changes is pasted below [1].
> 
> I could get around with the issue by adding another (iommu_attr) -
> DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info).

Nice to hear from you. At first, we planned to have a separate iommu_attr
for getting nesting_info. However, we found no existing user that gets
DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu nesting info.
Could you share the code base you are using? If the error you
encountered is due to this change, there should be a place which gets
DOMAIN_ATTR_NESTING.

Regards,
Yi Liu

> Thanks & regards
> Vivek
> 
> [1]--
> [  811.756516] vfio-pci :08:00.1: vfio_ecap_init: hiding ecap
> 0x1b@0x108
> [  811.756516] Kernel panic - not syncing: stack-protector: Kernel
> stack is corrupted in: vfio_pci_open+0x644/0x648
> [  811.756516] CPU: 0 PID: 175 Comm: lkvm-cleanup-ne Not tainted
> 5.10.0-rc5-00096-gf015061e14cf #43
> [  811.756516] Call trace:
> [  811.756516]  dump_backtrace+0x0/0x1b0
> [  811.756516]  show_stack+0x18/0x68
> [  811.756516]  dump_stack+0xd8/0x134
> [  811.756516]  panic+0x174/0x33c
> [  811.756516]  __stack_chk_fail+0x3c/0x40
> [  811.756516]  vfio_pci_open+0x644/0x648
> [  811.756516]  vfio_group_fops_unl_ioctl+0x4bc/0x648
> [  811.756516]  0x0
> [  811.756516] SMP: stopping secondary CPUs
> [  811.756597] Kernel Offset: disabled
> [  811.756597] CPU features: 0x0040006,6a00aa38
> [  811.756602] Memory Limit: none
> [  811.768497] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: vfio_pci_open+0x644/0x648 ]
> -
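
The disagreement in this exchange is about what `data` points to: a legacy
caller passes an int, while the reworked handler treats it as a struct. A
rough userspace sketch of the separate-attribute approach Vivek mentions is
given below. All names are hypothetical (enum attr_model, domain_get_attr(),
struct nesting_info_model); DOMAIN_ATTR_NESTING_INFO itself is only a
proposal in this thread, and -1 stands in for -ENODEV.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute ids modelled on enum iommu_attr. */
enum attr_model {
    ATTR_NESTING,        /* legacy: data points to an int */
    ATTR_NESTING_INFO,   /* proposed in the thread: data points to a struct */
};

/* Hypothetical stand-in for struct iommu_nesting_info. */
struct nesting_info_model {
    uint32_t argsz;
    uint32_t flags;
    uint32_t format;
    uint32_t features;
};

static int domain_get_attr(int nested, enum attr_model attr, void *data)
{
    assert(data != NULL);

    switch (attr) {
    case ATTR_NESTING:
        *(int *)data = nested;      /* writes exactly sizeof(int) bytes */
        return 0;
    case ATTR_NESTING_INFO: {
        struct nesting_info_model *info = data;

        if (!nested)
            return -1;
        /* only clear the buffer when the caller says it is large enough */
        if (info->argsz >= sizeof(*info))
            memset(info, 0, sizeof(*info));
        info->argsz = sizeof(*info);
        return 0;
    }
    }
    return -1;
}
```

With two attributes, a caller that only declares an int on its stack never
reaches the path that writes a full struct, which is the kind of overflow the
stack-protector panic above points at.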


[PATCH v4 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2021-01-06 Thread Liu Yi L
iommu_flush_dev_iotlb() is called to invalidate caches on the device. It only
loops over the devices which are fully attached to the domain. For sub-devices,
this is ineffective. This results in invalid caching entries left on the
device. Fix it by adding a loop for subdevices as well. Also,
domain->has_iotlb_device needs to be updated when attaching to subdevices.

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L 
Acked-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 53 +++--
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d7720a8..65cf06d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -719,6 +719,8 @@ static int domain_update_device_node(struct dmar_domain 
*domain)
return nid;
 }
 
+static void domain_update_iotlb(struct dmar_domain *domain);
+
 /* Some capabilities may be different across iommus */
 static void domain_update_iommu_cap(struct dmar_domain *domain)
 {
@@ -744,6 +746,8 @@ static void domain_update_iommu_cap(struct dmar_domain 
*domain)
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw - 1);
else
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw);
+
+   domain_update_iotlb(domain);
 }
 
 struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
@@ -1464,17 +1468,22 @@ static void domain_update_iotlb(struct dmar_domain 
*domain)
 
assert_spin_locked(_domain_lock);
 
-   list_for_each_entry(info, >devices, link) {
-   struct pci_dev *pdev;
-
-   if (!info->dev || !dev_is_pci(info->dev))
-   continue;
-
-   pdev = to_pci_dev(info->dev);
-   if (pdev->ats_enabled) {
+   list_for_each_entry(info, >devices, link)
+   if (info->ats_enabled) {
has_iotlb_device = true;
break;
}
+
+   if (!has_iotlb_device) {
+   struct subdev_domain_info *sinfo;
+
+   list_for_each_entry(sinfo, >subdevices, link_domain) {
+   info = get_domain_info(sinfo->pdev);
+   if (info && info->ats_enabled) {
+   has_iotlb_device = true;
+   break;
+   }
+   }
}
 
domain->has_iotlb_device = has_iotlb_device;
@@ -1555,25 +1564,37 @@ static void iommu_disable_dev_iotlb(struct 
device_domain_info *info)
 #endif
 }
 
+static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
+   u64 addr, unsigned int mask)
+{
+   u16 sid, qdep;
+
+   if (!info || !info->ats_enabled)
+   return;
+
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask);
+}
+
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  u64 addr, unsigned mask)
 {
-   u16 sid, qdep;
unsigned long flags;
struct device_domain_info *info;
+   struct subdev_domain_info *sinfo;
 
if (!domain->has_iotlb_device)
return;
 
spin_lock_irqsave(_domain_lock, flags);
-   list_for_each_entry(info, >devices, link) {
-   if (!info->ats_enabled)
-   continue;
+   list_for_each_entry(info, >devices, link)
+   __iommu_flush_dev_iotlb(info, addr, mask);
 
-   sid = info->bus << 8 | info->devfn;
-   qdep = info->ats_qdep;
-   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-   qdep, addr, mask);
+   list_for_each_entry(sinfo, >subdevices, link_domain) {
+   info = get_domain_info(sinfo->pdev);
+   __iommu_flush_dev_iotlb(info, addr, mask);
}
spin_unlock_irqrestore(_domain_lock, flags);
 }
-- 
2.7.4
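
The shape of the fix above, one helper applied to both the fully attached
device list and the subdevice list, can be sketched in userspace as follows.
This is a simplified model, not the driver code: struct dev_info_model,
struct domain_model, flush_one() and flush_domain() are hypothetical names,
arrays replace the kernel lists, and a counter replaces qi_flush_dev_iotlb().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for device_domain_info and dmar_domain. */
struct dev_info_model {
    bool ats_enabled;
    unsigned int flushes;       /* counts devTLB invalidations issued */
};

struct domain_model {
    struct dev_info_model **devices;     /* fully attached devices */
    size_t ndevices;
    struct dev_info_model **subdevices;  /* parents of aux-attached subdevices */
    size_t nsubdevices;
};

/* Shared helper, like __iommu_flush_dev_iotlb(): skips NULL and non-ATS devices. */
static void flush_one(struct dev_info_model *info)
{
    if (!info || !info->ats_enabled)
        return;
    info->flushes++;
}

/* Walk both lists so that subdevice-only domains are not skipped. */
static void flush_domain(struct domain_model *domain)
{
    size_t i;

    assert(domain != NULL);
    for (i = 0; i < domain->ndevices; i++)
        flush_one(domain->devices[i]);
    for (i = 0; i < domain->nsubdevices; i++)
        flush_one(domain->subdevices[i]);
}
```

Keeping the NULL and ATS checks inside the helper lets both loops stay
one-liners, which matches how the patch restructures iommu_flush_dev_iotlb().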



[PATCH v4 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2021-01-06 Thread Liu Yi L
In the existing code, looping over all devices attached to a domain does not
include sub-devices attached via iommu_aux_attach_device().

This was found while working on the patch below. There is no
device in the domain->devices list, so the cap and ecap of the iommu
unit cannot be obtained. But this domain actually has a subdevice which is
attached in the aux manner. But it is tracked by domain. This patch is
going to fix it.

https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/

This fix goes beyond the patch above; such sub-device tracking is
necessary for other cases, for example flushing device_iotlb for a
domain which has sub-devices attached in the auxiliary manner.

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Co-developed-by: Xin Zeng 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
Acked-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 95 +
 include/linux/intel-iommu.h | 16 +---
 2 files changed, 82 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c..d7720a8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1877,6 +1877,7 @@ static struct dmar_domain *alloc_domain(int flags)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(>devices);
+   INIT_LIST_HEAD(>subdevices);
 
return domain;
 }
@@ -2547,7 +2548,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->iommu = iommu;
info->pasid_table = NULL;
info->auxd_enabled = 0;
-   INIT_LIST_HEAD(>auxiliary_domains);
+   INIT_LIST_HEAD(>subdevices);
 
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -4475,33 +4476,61 @@ is_aux_domain(struct device *dev, struct iommu_domain 
*domain)
domain->type == IOMMU_DOMAIN_UNMANAGED;
 }
 
-static void auxiliary_link_device(struct dmar_domain *domain,
- struct device *dev)
+static inline struct subdev_domain_info *
+lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
+{
+   struct subdev_domain_info *sinfo;
+
+   if (!list_empty(>subdevices)) {
+   list_for_each_entry(sinfo, >subdevices, link_domain) {
+   if (sinfo->pdev == dev)
+   return sinfo;
+   }
+   }
+
+   return NULL;
+}
+
+static int auxiliary_link_device(struct dmar_domain *domain,
+struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
 
assert_spin_locked(_domain_lock);
if (WARN_ON(!info))
-   return;
+   return -EINVAL;
+
+   if (!sinfo) {
+   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
+   sinfo->domain = domain;
+   sinfo->pdev = dev;
+   list_add(>link_phys, >subdevices);
+   list_add(>link_domain, >subdevices);
+   }
 
-   domain->auxd_refcnt++;
-   list_add(>auxd, >auxiliary_domains);
+   return ++sinfo->users;
 }
 
-static void auxiliary_unlink_device(struct dmar_domain *domain,
-   struct device *dev)
+static int auxiliary_unlink_device(struct dmar_domain *domain,
+  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
+   int ret;
 
assert_spin_locked(_domain_lock);
-   if (WARN_ON(!info))
-   return;
+   if (WARN_ON(!info || !sinfo || sinfo->users <= 0))
+   return -EINVAL;
 
-   list_del(>auxd);
-   domain->auxd_refcnt--;
+   ret = --sinfo->users;
+   if (!ret) {
+   list_del(>link_phys);
+   list_del(>link_domain);
+   kfree(sinfo);
+   }
 
-   if (!domain->auxd_refcnt && domain->default_pasid > 0)
-   ioasid_put(domain->default_pasid);
+   return ret;
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -4530,6 +4559,19 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
}
 
spin_lock_irqsave(_domain_lock, flags);
+   ret = auxiliary_link_device(domain, dev);
+   if (ret <= 0)
+   goto link_failed;
+
+   /*
+* Subdevices from the same physical device can be attached to the
+* same domain. For such cases, only the first subdevice attachment
+* needs to go through the full steps in this function. So if ret >
+ 
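
The per-pair user counting introduced by auxiliary_link_device() and
auxiliary_unlink_device() can be modelled in userspace as below. This is a
sketch under assumptions: struct subdev_model, lookup(), link_subdev() and
unlink_subdev() are hypothetical names, a singly linked list replaces the two
kernel lists, and the allocation-failure check is something this sketch adds
on its own. A return value of 1 from link_subdev() corresponds to the "first
subdevice attachment" case described in the comment above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of subdev_domain_info: one entry per (domain, device) pair. */
struct subdev_model {
    const void *dev;
    int users;                  /* number of aux attachments through this pair */
    struct subdev_model *next;
};

static struct subdev_model *lookup(struct subdev_model *head, const void *dev)
{
    for (; head; head = head->next)
        if (head->dev == dev)
            return head;
    return NULL;
}

/* Returns the new user count, or -1 on allocation failure. */
static int link_subdev(struct subdev_model **head, const void *dev)
{
    struct subdev_model *s;

    assert(head != NULL);
    s = lookup(*head, dev);
    if (!s) {
        s = calloc(1, sizeof(*s));
        if (!s)
            return -1;          /* NULL check added in this sketch */
        s->dev = dev;
        s->next = *head;
        *head = s;
    }
    return ++s->users;
}

/* Returns the remaining user count, or -1 if the pair was never linked. */
static int unlink_subdev(struct subdev_model **head, const void *dev)
{
    struct subdev_model **pp;

    for (pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->dev == dev) {
            struct subdev_model *s = *pp;
            int left = --s->users;

            if (!left) {        /* last user gone: drop the tracking entry */
                *pp = s->next;
                free(s);
            }
            return left;
        }
    }
    return -1;
}
```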

[PATCH v4 1/3] iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev

2021-01-06 Thread Liu Yi L
Currently, struct intel_svm has a field to record the struct intel_iommu
pointer for a PASID bind, and struct intel_svm is shared by all
the devices bound to the same process. The devices may be behind different
DMAR units. As the iommu driver code uses the intel_iommu pointer stored
in struct intel_svm to do cache invalidations, it may only flush the cache
on a single DMAR unit; for the others, the cache invalidation is missed.

As struct intel_svm already has a device list, this patch just moves the
intel_iommu pointer to be a field of struct intel_svm_dev.

Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode")
Cc: Lu Baolu 
Cc: Jacob Pan 
Cc: Raj Ashok 
Cc: David Woodhouse 
Reported-by: Guo Kaijie 
Reported-by: Xin Zeng 
Signed-off-by: Guo Kaijie 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
Tested-by: Guo Kaijie 
Cc: sta...@vger.kernel.org # v5.0+
Acked-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c   | 9 +
 include/linux/intel-iommu.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 4fa248b..6956669 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -142,7 +142,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
 
if (sdev->dev_iotlb) {
desc.qw0 = QI_DEV_EIOTLB_PASID(svm->pasid) |
@@ -166,7 +166,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
}
 }
 
@@ -211,7 +211,7 @@ static void intel_mm_release(struct mmu_notifier *mn, 
struct mm_struct *mm)
 */
rcu_read_lock();
list_for_each_entry_rcu(sdev, &svm->devs, list)
-   intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+   intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
svm->pasid, true);
rcu_read_unlock();
 
@@ -363,6 +363,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
}
sdev->dev = dev;
sdev->sid = PCI_DEVID(info->bus, info->devfn);
+   sdev->iommu = iommu;
 
/* Only count users if device has aux domains */
if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX))
@@ -546,6 +547,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
goto out;
}
sdev->dev = dev;
+   sdev->iommu = iommu;
 
ret = intel_iommu_enable_pasid(iommu, dev);
if (ret) {
@@ -575,7 +577,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
kfree(sdev);
goto out;
}
-   svm->iommu = iommu;
 
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d956987..9452268 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -758,6 +758,7 @@ struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
+   struct intel_iommu *iommu;
struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
@@ -771,7 +772,6 @@ struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
 
-   struct intel_iommu *iommu;
unsigned int flags;
u32 pasid;
int gpasid; /* In case that guest PASID is different from host PASID */
-- 
2.7.4



[PATCH v4 0/3] iommu/vt-d: Misc fixes on scalable mode

2021-01-06 Thread Liu Yi L
Hi Baolu, Joerg, Will,

This patchset aims to fix a bug regarding native SVM usage, and
also two bugs around subdevice (attached to a device in the auxiliary
manner) tracking and ineffective device_tlb flush.

v3 -> v4:
- Address comments from Baolu Lu and add acked-by
- Fix issue reported by "Dan Carpenter" and "kernel test robot"
- Add tested-by from Guo Kaijie on patch 1/3
- Rebase to 5.11-rc2
v3: 
https://lore.kernel.org/linux-iommu/20201229032513.486395-1-yi.l@intel.com/

v2 -> v3:
- Address comments from Baolu Lu against v2
- Rebased to 5.11-rc1
v2: 
https://lore.kernel.org/linux-iommu/20201223062720.29364-1-yi.l@intel.com/

v1 -> v2:
- Use a more recent Fix tag in "iommu/vt-d: Move intel_iommu info from struct 
intel_svm to struct intel_svm_dev"
- Refined the "iommu/vt-d: Track device aux-attach with subdevice_domain_info"
- Rename "iommu/vt-d: A fix to iommu_flush_dev_iotlb() for aux-domain" to be
  "iommu/vt-d: Fix ineffective devTLB invalidation for subdevices"
- Refined the commit messages
v1: 
https://lore.kernel.org/linux-iommu/2020122352.183523-1-yi.l@intel.com/

Regards,
Yi Liu

Liu Yi L (3):
  iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
intel_svm_dev
  iommu/vt-d: Track device aux-attach with subdevice_domain_info
  iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

 drivers/iommu/intel/iommu.c | 148 
 drivers/iommu/intel/svm.c   |   9 +--
 include/linux/intel-iommu.h |  18 --
 3 files changed, 125 insertions(+), 50 deletions(-)

-- 
2.7.4



RE: [PATCH v3 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2021-01-06 Thread Liu, Yi L
Hi Will,

> From: Will Deacon 
> Sent: Wednesday, January 6, 2021 1:24 AM
> 
> On Tue, Jan 05, 2021 at 05:50:22AM +0000, Liu, Yi L wrote:
> > > > +static void __iommu_flush_dev_iotlb(struct device_domain_info
> *info,
> > > > +   u64 addr, unsigned int mask)
> > > > +{
> > > > +   u16 sid, qdep;
> > > > +
> > > > +   if (!info || !info->ats_enabled)
> > > > +   return;
> > > > +
> > > > +   sid = info->bus << 8 | info->devfn;
> > > > +   qdep = info->ats_qdep;
> > > > +   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > > > +  qdep, addr, mask);
> > > > +}
> > > > +
> > > >   static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
> > > >   u64 addr, unsigned mask)
> > > >   {
> > > > -   u16 sid, qdep;
> > > > unsigned long flags;
> > > > struct device_domain_info *info;
> > > > +   struct subdev_domain_info *sinfo;
> > > >
> > > > if (!domain->has_iotlb_device)
> > > > return;
> > > >
> > > > spin_lock_irqsave(&device_domain_lock, flags);
> > > > -   list_for_each_entry(info, &domain->devices, link) {
> > > > -   if (!info->ats_enabled)
> > > > -   continue;
> > > > +   list_for_each_entry(info, &domain->devices, link)
> > > > +   __iommu_flush_dev_iotlb(info, addr, mask);
> > > >
> > > > -   sid = info->bus << 8 | info->devfn;
> > > > -   qdep = info->ats_qdep;
> > > > -   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > > > -   qdep, addr, mask);
> > > > +   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
> > > > +   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
> > > > +   addr, mask);
> > > > }
> > >
> > > Nit:
> > >   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
> > >   info = get_domain_info(sinfo->pdev);
> > >   __iommu_flush_dev_iotlb(info, addr, mask);
> > >   }
> >
> > you are right. this should be better.
> 
> Please can you post a v4, with Lu's acks and the issue reported by Dan fixed
> too?

sure, will send out later.

Regards,
Yi Liu

> Thanks,
> 
> Will


RE: [PATCH v3 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2021-01-04 Thread Liu, Yi L
Hi Baolu,

> From: Lu Baolu 
> Sent: Tuesday, December 29, 2020 4:38 PM
> 
> Hi Yi,
> 
> On 2020/12/29 11:25, Liu Yi L wrote:
> > In the existing code, loop all devices attached to a domain does not
> > include sub-devices attached via iommu_aux_attach_device().
> >
> > This was found by when I'm working on the belwo patch, There is no
>  ^
> below

nice catch. 

> > device in the domain->devices list, thus unable to get the cap and
> > ecap of iommu unit. But this domain actually has subdevice which is
> > attached via aux-manner. But it is tracked by domain. This patch is
> > going to fix it.
> >
> > https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-
> yi.l@intel.com/
> >
> > And this fix goes beyond the patch above, such sub-device tracking is
> > necessary for other cases. For example, flushing device_iotlb for a
> > domain which has sub-devices attached by auxiliary manner.
> >
> > Co-developed-by: Xin Zeng 
> > Signed-off-by: Xin Zeng 
> > Signed-off-by: Liu Yi L 
> 
> Others look good to me.
> 
> Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain
> attach/detach")
> Acked-by: Lu Baolu 

thanks,

Regards,
Yi Liu

> Best regards,
> baolu
> 
> > ---
> >   drivers/iommu/intel/iommu.c | 95 +++
> --
> >   include/linux/intel-iommu.h | 16 +--
> >   2 files changed, 82 insertions(+), 29 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 788119c5b021..d7720a836268 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1877,6 +1877,7 @@ static struct dmar_domain *alloc_domain(int
> flags)
> > domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> > domain->has_iotlb_device = false;
> > INIT_LIST_HEAD(&domain->devices);
> > +   INIT_LIST_HEAD(&domain->subdevices);
> >
> > return domain;
> >   }
> > @@ -2547,7 +2548,7 @@ static struct dmar_domain
> *dmar_insert_one_dev_info(struct intel_iommu *iommu,
> > info->iommu = iommu;
> > info->pasid_table = NULL;
> > info->auxd_enabled = 0;
> > -   INIT_LIST_HEAD(&info->auxiliary_domains);
> > +   INIT_LIST_HEAD(&info->subdevices);
> >
> > if (dev && dev_is_pci(dev)) {
> > struct pci_dev *pdev = to_pci_dev(info->dev);
> > @@ -4475,33 +4476,61 @@ is_aux_domain(struct device *dev, struct
> iommu_domain *domain)
> > domain->type == IOMMU_DOMAIN_UNMANAGED;
> >   }
> >
> > -static void auxiliary_link_device(struct dmar_domain *domain,
> > - struct device *dev)
> > +static inline struct subdev_domain_info *
> > +lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
> > +{
> > +   struct subdev_domain_info *sinfo;
> > +
> > +   if (!list_empty(&domain->subdevices)) {
> > +   list_for_each_entry(sinfo, &domain->subdevices,
> link_domain) {
> > +   if (sinfo->pdev == dev)
> > +   return sinfo;
> > +   }
> > +   }
> > +
> > +   return NULL;
> > +}
> > +
> > +static int auxiliary_link_device(struct dmar_domain *domain,
> > +struct device *dev)
> >   {
> > struct device_domain_info *info = get_domain_info(dev);
> > +   struct subdev_domain_info *sinfo = lookup_subdev_info(domain,
> dev);
> >
> > assert_spin_locked(&device_domain_lock);
> > if (WARN_ON(!info))
> > -   return;
> > +   return -EINVAL;
> > +
> > +   if (!sinfo) {
> > +   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
> > +   sinfo->domain = domain;
> > +   sinfo->pdev = dev;
> > +   list_add(&sinfo->link_phys, &info->subdevices);
> > +   list_add(&sinfo->link_domain, &domain->subdevices);
> > +   }
> >
> > -   domain->auxd_refcnt++;
> > -   list_add(&domain->auxd, &info->auxiliary_domains);
> > +   return ++sinfo->users;
> >   }
> >
> > -static void auxiliary_unlink_device(struct dmar_domain *domain,
> > -   struct device *dev)
> > +static int auxiliary_unlink_device(struct dmar_domain *domain,
> > +  struct device *dev)
> >   {
> > struct device_domain_info *info = get_domain_info(dev);
> > +   struct subdev_domain_info *sinfo = lookup_subdev_info(domain,

RE: [PATCH v3 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2021-01-04 Thread Liu, Yi L
Hi Baolu,

> From: Lu Baolu 
> Sent: Tuesday, December 29, 2020 4:42 PM
> 
> Hi Yi,
> 
> On 2020/12/29 11:25, Liu Yi L wrote:
> > iommu_flush_dev_iotlb() is called to invalidate caches on device. It only
> > loops the devices which are full-attached to the domain. For sub-devices,
> > this is ineffective. This results in invalid caching entries left on the
> > device. Fix it by adding loop for subdevices as well. Also, the domain->
> > has_iotlb_device needs to be updated when attaching to subdevices.
> >
> > Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain
> attach/detach")
> > Signed-off-by: Liu Yi L 
> > ---
> >   drivers/iommu/intel/iommu.c | 53 ++-
> --
> >   1 file changed, 37 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index d7720a836268..d48a60b61ba6 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -719,6 +719,8 @@ static int domain_update_device_node(struct
> dmar_domain *domain)
> > return nid;
> >   }
> >
> > +static void domain_update_iotlb(struct dmar_domain *domain);
> > +
> >   /* Some capabilities may be different across iommus */
> >   static void domain_update_iommu_cap(struct dmar_domain *domain)
> >   {
> > @@ -744,6 +746,8 @@ static void domain_update_iommu_cap(struct
> dmar_domain *domain)
> > domain->domain.geometry.aperture_end =
> __DOMAIN_MAX_ADDR(domain->gaw - 1);
> > else
> > domain->domain.geometry.aperture_end =
> __DOMAIN_MAX_ADDR(domain->gaw);
> > +
> > +   domain_update_iotlb(domain);
> >   }
> >
> >   struct context_entry *iommu_context_addr(struct intel_iommu *iommu,
> u8 bus,
> > @@ -1464,17 +1468,22 @@ static void domain_update_iotlb(struct
> dmar_domain *domain)
> >
> > assert_spin_locked(&device_domain_lock);
> >
> > -   list_for_each_entry(info, &domain->devices, link) {
> > -   struct pci_dev *pdev;
> > -
> > -   if (!info->dev || !dev_is_pci(info->dev))
> > -   continue;
> > -
> > -   pdev = to_pci_dev(info->dev);
> > -   if (pdev->ats_enabled) {
> > +   list_for_each_entry(info, &domain->devices, link)
> > +   if (info && info->ats_enabled) {
> > has_iotlb_device = true;
> > break;
> > }
> > +
> > +   if (!has_iotlb_device) {
> > +   struct subdev_domain_info *sinfo;
> > +
> > +   list_for_each_entry(sinfo, &domain->subdevices,
> link_domain) {
> > +   info = get_domain_info(sinfo->pdev);
> > +   if (info && info->ats_enabled) {
> > +   has_iotlb_device = true;
> > +   break;
> > +   }
> > +   }
> > }
> >
> > domain->has_iotlb_device = has_iotlb_device;
> > @@ -1555,25 +1564,37 @@ static void iommu_disable_dev_iotlb(struct
> device_domain_info *info)
> >   #endif
> >   }
> >
> > +static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
> > +   u64 addr, unsigned int mask)
> > +{
> > +   u16 sid, qdep;
> > +
> > +   if (!info || !info->ats_enabled)
> > +   return;
> > +
> > +   sid = info->bus << 8 | info->devfn;
> > +   qdep = info->ats_qdep;
> > +   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > +  qdep, addr, mask);
> > +}
> > +
> >   static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
> >   u64 addr, unsigned mask)
> >   {
> > -   u16 sid, qdep;
> > unsigned long flags;
> > struct device_domain_info *info;
> > +   struct subdev_domain_info *sinfo;
> >
> > if (!domain->has_iotlb_device)
> > return;
> >
> > spin_lock_irqsave(&device_domain_lock, flags);
> > -   list_for_each_entry(info, &domain->devices, link) {
> > -   if (!info->ats_enabled)
> > -   continue;
> > +   list_for_each_entry(info, &domain->devices, link)
> > +   __iommu_flush_dev_iotlb(info, addr, mask);
> >
> > -   sid = info->bus << 8 | info->devfn;
> > -   qdep = info->ats_qdep;
> > -   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > -   qdep, addr, mask);
> > +   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
> > +   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
> > +   addr, mask);
> > }
> 
> Nit:
>   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
>   info = get_domain_info(sinfo->pdev);
>   __iommu_flush_dev_iotlb(info, addr, mask);
>   }

you are right. this should be better.

> Others look good to me.
>
> Acked-by: Lu Baolu 
> 
> Best regards,
> baolu

Regards,
Yi Liu

> > spin_unlock_irqrestore(&device_domain_lock, flags);
> >   }
> >


[PATCH v3 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2020-12-28 Thread Liu Yi L
iommu_flush_dev_iotlb() is called to invalidate caches on the device. It only
loops over the devices that are fully attached to the domain, so it is
ineffective for sub-devices and leaves stale cache entries on the
device. Fix it by adding a loop over the subdevices as well. Also,
domain->has_iotlb_device needs to be updated when subdevices are attached.
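
The two parts of this fix can be illustrated with a small userspace
sketch. It is not the kernel code: the structs are simplified stand-ins,
and names such as ats_on(), update_iotlb_flag() and the flushes counter
are assumptions made up for the example. It only shows why both the
cached flag and the flush loop have to look at the subdevice list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for device_domain_info / subdev_domain_info. */
struct dev_info {
	bool ats_enabled;
	int flushes;			/* counts simulated devTLB invalidations */
	struct dev_info *next;		/* next entry in the domain's device list */
};

struct subdev_info {
	struct dev_info *parent;	/* info of the physical device, may be NULL */
	struct subdev_info *next;	/* next entry in the domain's subdevice list */
};

struct domain {
	bool has_iotlb_device;		/* cached "is there anything to flush" flag */
	struct dev_info *devices;
	struct subdev_info *subdevices;
};

static bool ats_on(const struct dev_info *info)
{
	return info && info->ats_enabled;
}

/* Recompute the cached flag from both lists, not only from devices. */
static void update_iotlb_flag(struct domain *d)
{
	struct dev_info *info;
	struct subdev_info *sinfo;
	bool found = false;

	assert(d != NULL);
	for (info = d->devices; info && !found; info = info->next)
		found = ats_on(info);
	for (sinfo = d->subdevices; sinfo && !found; sinfo = sinfo->next)
		found = ats_on(sinfo->parent);
	d->has_iotlb_device = found;
}

/* Flush every ATS-enabled device reachable through either list. */
static void flush_dev_iotlb(struct domain *d)
{
	struct dev_info *info;
	struct subdev_info *sinfo;

	assert(d != NULL);
	if (!d->has_iotlb_device)
		return;		/* a stale flag skips the flush entirely */

	for (info = d->devices; info; info = info->next)
		if (ats_on(info))
			info->flushes++;
	for (sinfo = d->subdevices; sinfo; sinfo = sinfo->next)
		if (ats_on(sinfo->parent))
			sinfo->parent->flushes++;
}
```

With a domain that has no fully attached device and one ATS-capable
subdevice, flush_dev_iotlb() does nothing until update_iotlb_flag() has
been run, which mirrors why the flag update belongs in this patch.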

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 53 ++---
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d7720a836268..d48a60b61ba6 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -719,6 +719,8 @@ static int domain_update_device_node(struct dmar_domain 
*domain)
return nid;
 }
 
+static void domain_update_iotlb(struct dmar_domain *domain);
+
 /* Some capabilities may be different across iommus */
 static void domain_update_iommu_cap(struct dmar_domain *domain)
 {
@@ -744,6 +746,8 @@ static void domain_update_iommu_cap(struct dmar_domain 
*domain)
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw - 1);
else
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw);
+
+   domain_update_iotlb(domain);
 }
 
 struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
@@ -1464,17 +1468,22 @@ static void domain_update_iotlb(struct dmar_domain 
*domain)
 
assert_spin_locked(&device_domain_lock);
 
-   list_for_each_entry(info, &domain->devices, link) {
-   struct pci_dev *pdev;
-
-   if (!info->dev || !dev_is_pci(info->dev))
-   continue;
-
-   pdev = to_pci_dev(info->dev);
-   if (pdev->ats_enabled) {
+   list_for_each_entry(info, &domain->devices, link)
+   if (info && info->ats_enabled) {
has_iotlb_device = true;
break;
}
+
+   if (!has_iotlb_device) {
+   struct subdev_domain_info *sinfo;
+
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   info = get_domain_info(sinfo->pdev);
+   if (info && info->ats_enabled) {
+   has_iotlb_device = true;
+   break;
+   }
+   }
}
 
domain->has_iotlb_device = has_iotlb_device;
@@ -1555,25 +1564,37 @@ static void iommu_disable_dev_iotlb(struct 
device_domain_info *info)
 #endif
 }
 
+static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
+   u64 addr, unsigned int mask)
+{
+   u16 sid, qdep;
+
+   if (!info || !info->ats_enabled)
+   return;
+
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask);
+}
+
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  u64 addr, unsigned mask)
 {
-   u16 sid, qdep;
unsigned long flags;
struct device_domain_info *info;
+   struct subdev_domain_info *sinfo;
 
if (!domain->has_iotlb_device)
return;
 
spin_lock_irqsave(&device_domain_lock, flags);
-   list_for_each_entry(info, &domain->devices, link) {
-   if (!info->ats_enabled)
-   continue;
+   list_for_each_entry(info, &domain->devices, link)
+   __iommu_flush_dev_iotlb(info, addr, mask);
 
-   sid = info->bus << 8 | info->devfn;
-   qdep = info->ats_qdep;
-   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-   qdep, addr, mask);
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
+   addr, mask);
}
spin_unlock_irqrestore(&device_domain_lock, flags);
 }
-- 
2.25.1



[PATCH v3 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2020-12-28 Thread Liu Yi L
In the existing code, looping over all devices attached to a domain does
not include sub-devices attached via iommu_aux_attach_device().

This was found while I was working on the patch below. There is no
device in the domain->devices list, so the cap and ecap of the iommu
unit cannot be obtained. But this domain actually has a subdevice which
is attached in the aux manner. But it is tracked by domain. This patch is
going to fix it.

https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/

This fix goes beyond the patch above: such sub-device tracking is also
necessary for other cases, for example flushing the device_iotlb for a
domain which has sub-devices attached in the auxiliary manner.
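
The tracking idea can be sketched in plain userspace C as below. This is
only an illustration under assumptions: the struct layout, the singly
linked list and the names link_subdev()/unlink_subdev() are invented for
the example and are not the kernel's list_head based implementation. The
sketch also checks the allocation result, which is an addition of the
example.

```c
#include <assert.h>
#include <stdlib.h>

/* One entry per (domain, physical device) pair, shared by its subdevices. */
struct subdev_info {
	const void *pdev;		/* the physical device */
	int users;			/* number of subdevice attachments */
	struct subdev_info *next;
};

struct domain {
	struct subdev_info *subdevices;
};

static struct subdev_info *lookup_subdev(struct domain *d, const void *pdev)
{
	struct subdev_info *s;

	for (s = d->subdevices; s; s = s->next)
		if (s->pdev == pdev)
			return s;
	return NULL;
}

/* Returns the new user count (>= 1), or -1 if the allocation failed. */
static int link_subdev(struct domain *d, const void *pdev)
{
	struct subdev_info *s = lookup_subdev(d, pdev);

	if (!s) {
		s = calloc(1, sizeof(*s));
		if (!s)
			return -1;
		s->pdev = pdev;
		s->next = d->subdevices;
		d->subdevices = s;
	}
	assert(s->users >= 0);
	return ++s->users;
}

/* Returns the remaining user count, or -1 if pdev is not tracked. */
static int unlink_subdev(struct domain *d, const void *pdev)
{
	struct subdev_info **pp, *s;
	int ret;

	for (pp = &d->subdevices; (s = *pp) != NULL; pp = &s->next)
		if (s->pdev == pdev)
			break;
	if (!s || s->users <= 0)
		return -1;

	ret = --s->users;
	if (!ret) {
		*pp = s->next;	/* last user gone: drop the tracking entry */
		free(s);
	}
	return ret;
}
```

A return value of 1 from link_subdev() marks the first attachment, which
is the case that needs the full attach steps; larger values mean the
physical device is already set up for this domain.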

Co-developed-by: Xin Zeng 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 95 +++--
 include/linux/intel-iommu.h | 16 +--
 2 files changed, 82 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..d7720a836268 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1877,6 +1877,7 @@ static struct dmar_domain *alloc_domain(int flags)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(&domain->devices);
+   INIT_LIST_HEAD(&domain->subdevices);
 
return domain;
 }
@@ -2547,7 +2548,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->iommu = iommu;
info->pasid_table = NULL;
info->auxd_enabled = 0;
-   INIT_LIST_HEAD(&info->auxiliary_domains);
+   INIT_LIST_HEAD(&info->subdevices);
 
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -4475,33 +4476,61 @@ is_aux_domain(struct device *dev, struct iommu_domain 
*domain)
domain->type == IOMMU_DOMAIN_UNMANAGED;
 }
 
-static void auxiliary_link_device(struct dmar_domain *domain,
- struct device *dev)
+static inline struct subdev_domain_info *
+lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
+{
+   struct subdev_domain_info *sinfo;
+
+   if (!list_empty(&domain->subdevices)) {
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   if (sinfo->pdev == dev)
+   return sinfo;
+   }
+   }
+
+   return NULL;
+}
+
+static int auxiliary_link_device(struct dmar_domain *domain,
+struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
 
assert_spin_locked(&device_domain_lock);
if (WARN_ON(!info))
-   return;
+   return -EINVAL;
+
+   if (!sinfo) {
+   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
+   sinfo->domain = domain;
+   sinfo->pdev = dev;
+   list_add(&sinfo->link_phys, &info->subdevices);
+   list_add(&sinfo->link_domain, &domain->subdevices);
+   }
 
-   domain->auxd_refcnt++;
-   list_add(&domain->auxd, &info->auxiliary_domains);
+   return ++sinfo->users;
 }
 
-static void auxiliary_unlink_device(struct dmar_domain *domain,
-   struct device *dev)
+static int auxiliary_unlink_device(struct dmar_domain *domain,
+  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
+   int ret;
 
assert_spin_locked(&device_domain_lock);
-   if (WARN_ON(!info))
-   return;
+   if (WARN_ON(!info || !sinfo || sinfo->users <= 0))
+   return -EINVAL;
 
-   list_del(&domain->auxd);
-   domain->auxd_refcnt--;
+   ret = --sinfo->users;
+   if (!ret) {
+   list_del(&sinfo->link_phys);
+   list_del(&sinfo->link_domain);
+   kfree(sinfo);
+   }
 
-   if (!domain->auxd_refcnt && domain->default_pasid > 0)
-   ioasid_put(domain->default_pasid);
+   return ret;
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -4530,6 +4559,19 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
}
 
spin_lock_irqsave(&device_domain_lock, flags);
+   ret = auxiliary_link_device(domain, dev);
+   if (ret <= 0)
+   goto link_failed;
+
+   /*
+* Subdevices from the same physical device can be attached to the
+* same domain. For such cases, only the first subdevice attachment
+* needs to go through the full steps in this function. So if ret >
+* 1, just goto out.
+*/
+   if (ret > 1)
+   goto out;
+
/*
 * iomm

[PATCH v3 1/3] iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev

2020-12-28 Thread Liu Yi L
The current struct intel_svm has a field that records the struct intel_iommu
pointer for a PASID bind, and struct intel_svm is shared by all the
devices bound to the same process. The devices may be behind different
DMAR units. As the iommu driver code uses the intel_iommu pointer stored
in struct intel_svm to do cache invalidations, it may only flush the cache
on a single DMAR unit; for the others, the cache invalidation is missed.

As struct intel_svm already has a device list, this patch just moves the
intel_iommu pointer to be a field of struct intel_svm_dev.
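
As a rough illustration of the problem (a userspace sketch, not kernel
code; iommu_unit, flush_shared() and flush_per_dev() are names made up
for this example), consider two devices bound to the same process but
sitting behind different DMAR units:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one DMAR unit; counts the invalidations it receives. */
struct iommu_unit {
	int flushes;
};

/* Stand-in for intel_svm_dev: one entry per bound device. */
struct svm_dev {
	struct iommu_unit *iommu;	/* per-device pointer, as this patch adds */
	struct svm_dev *next;
};

/* Stand-in for intel_svm: shared by all devices bound to one process. */
struct svm {
	struct iommu_unit *iommu;	/* old layout: a single shared pointer */
	struct svm_dev *devs;
};

/* Old behaviour: every device's invalidation goes to the shared unit. */
static void flush_shared(struct svm *svm)
{
	struct svm_dev *sdev;

	assert(svm->iommu != NULL);
	for (sdev = svm->devs; sdev; sdev = sdev->next)
		svm->iommu->flushes++;
}

/* New behaviour: each invalidation goes to the unit behind the device. */
static void flush_per_dev(struct svm *svm)
{
	struct svm_dev *sdev;

	for (sdev = svm->devs; sdev; sdev = sdev->next) {
		assert(sdev->iommu != NULL);
		sdev->iommu->flushes++;
	}
}
```

With one device behind each of two units, flush_shared() hits the first
unit twice and the second unit never, while flush_per_dev() reaches both.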

Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode")
Cc: Lu Baolu 
Cc: Jacob Pan 
Cc: Raj Ashok 
Cc: David Woodhouse 
Reported-by: Guo Kaijie 
Reported-by: Xin Zeng 
Signed-off-by: Guo Kaijie 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/svm.c   | 9 +
 include/linux/intel-iommu.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 4fa248b98031..69566695d032 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -142,7 +142,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
 
if (sdev->dev_iotlb) {
desc.qw0 = QI_DEV_EIOTLB_PASID(svm->pasid) |
@@ -166,7 +166,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
}
 }
 
@@ -211,7 +211,7 @@ static void intel_mm_release(struct mmu_notifier *mn, 
struct mm_struct *mm)
 */
rcu_read_lock();
list_for_each_entry_rcu(sdev, &svm->devs, list)
-   intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+   intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
svm->pasid, true);
rcu_read_unlock();
 
@@ -363,6 +363,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
}
sdev->dev = dev;
sdev->sid = PCI_DEVID(info->bus, info->devfn);
+   sdev->iommu = iommu;
 
/* Only count users if device has aux domains */
if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX))
@@ -546,6 +547,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
goto out;
}
sdev->dev = dev;
+   sdev->iommu = iommu;
 
ret = intel_iommu_enable_pasid(iommu, dev);
if (ret) {
@@ -575,7 +577,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
kfree(sdev);
goto out;
}
-   svm->iommu = iommu;
 
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d956987ed032..94522685a0d9 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -758,6 +758,7 @@ struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
+   struct intel_iommu *iommu;
struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
@@ -771,7 +772,6 @@ struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
 
-   struct intel_iommu *iommu;
unsigned int flags;
u32 pasid;
int gpasid; /* In case that guest PASID is different from host PASID */
-- 
2.25.1



[PATCH v3 0/3] iommu/vt-d: Misc fixes on scalable mode

2020-12-28 Thread Liu Yi L
Hi Baolu, Joerg, Will,

This patchset fixes a bug in native SVM usage, and also several bugs
around subdevice (attached to a device in the auxiliary manner)
tracking and ineffective device_tlb flush.

v2 -> v3:
- Address comments from Baolu Lu against v2
- Rebased to 5.11-rc1
v2: 
https://lore.kernel.org/linux-iommu/20201223062720.29364-1-yi.l@intel.com/

v1 -> v2:
- Use a more recent Fix tag in "iommu/vt-d: Move intel_iommu info from struct 
intel_svm to struct intel_svm_dev"
- Refined the "iommu/vt-d: Track device aux-attach with subdevice_domain_info"
- Rename "iommu/vt-d: A fix to iommu_flush_dev_iotlb() for aux-domain" to be
  "iommu/vt-d: Fix ineffective devTLB invalidation for subdevices"
- Refined the commit messages
v1: 
https://lore.kernel.org/linux-iommu/2020122352.183523-1-yi.l@intel.com/

Liu Yi L (3):
  iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
intel_svm_dev
  iommu/vt-d: Track device aux-attach with subdevice_domain_info
  iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

 drivers/iommu/intel/iommu.c | 148 ++--
 drivers/iommu/intel/svm.c   |   9 ++-
 include/linux/intel-iommu.h |  18 +++--
 3 files changed, 125 insertions(+), 50 deletions(-)

-- 
2.25.1



RE: [PATCH v2 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2020-12-25 Thread Liu, Yi L
Hi Baolu,

Well received, all comments accepted. thanks.

Regards,
Yi Liu

> From: Lu Baolu 
> Sent: Wednesday, December 23, 2020 6:10 PM
> 
> Hi Yi,
> 
> On 2020/12/23 14:27, Liu Yi L wrote:
> > iommu_flush_dev_iotlb() is called to invalidate caches on device. It only
> > loops the devices which are full-attached to the domain. For sub-devices,
> > this is ineffective. This results in invalid caching entries left on the
> > device. Fix it by adding loop for subdevices as well. Also, the domain->
> > has_iotlb_device needs to be updated when attaching to subdevices.
> >
> > Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain
> attach/detach")
> > Signed-off-by: Liu Yi L 
> > ---
> >   drivers/iommu/intel/iommu.c | 63 +++-
> -
> >   1 file changed, 47 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index acfe0a5b955e..e97c5ac1d7fc 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -726,6 +726,8 @@ static int domain_update_device_node(struct
> dmar_domain *domain)
> > return nid;
> >   }
> >
> > +static void domain_update_iotlb(struct dmar_domain *domain);
> > +
> >   /* Some capabilities may be different across iommus */
> >   static void domain_update_iommu_cap(struct dmar_domain *domain)
> >   {
> > @@ -739,6 +741,8 @@ static void domain_update_iommu_cap(struct
> dmar_domain *domain)
> >  */
> > if (domain->nid == NUMA_NO_NODE)
> > domain->nid = domain_update_device_node(domain);
> > +
> > +   domain_update_iotlb(domain);
> >   }
> >
> >   struct context_entry *iommu_context_addr(struct intel_iommu *iommu,
> u8 bus,
> > @@ -1459,6 +1463,18 @@ iommu_support_dev_iotlb (struct dmar_domain
> *domain, struct intel_iommu *iommu,
> > return NULL;
> >   }
> >
> > +static bool dev_iotlb_enabled(struct device_domain_info *info)
> > +{
> > +   struct pci_dev *pdev;
> > +
> > +   if (!info->dev || !dev_is_pci(info->dev))
> > +   return false;
> > +
> > +   pdev = to_pci_dev(info->dev);
> > +
> > +   return !!pdev->ats_enabled;
> > +}
> 
> I know this is just separated from below function. But isn't "(info &&
> info->ats_enabled)" is enough?
> 
> > +
> >   static void domain_update_iotlb(struct dmar_domain *domain)
> >   {
> > struct device_domain_info *info;
> > @@ -1466,17 +1482,20 @@ static void domain_update_iotlb(struct
> dmar_domain *domain)
> >
> > assert_spin_locked(&device_domain_lock);
> >
> > -   list_for_each_entry(info, &domain->devices, link) {
> > -   struct pci_dev *pdev;
> > -
> > -   if (!info->dev || !dev_is_pci(info->dev))
> > -   continue;
> > -
> > -   pdev = to_pci_dev(info->dev);
> > -   if (pdev->ats_enabled) {
> > +   list_for_each_entry(info, &domain->devices, link)
> > +   if (dev_iotlb_enabled(info)) {
> > has_iotlb_device = true;
> > break;
> > }
> > +
> > +   if (!has_iotlb_device) {
> > +   struct subdev_domain_info *sinfo;
> > +
> > +   list_for_each_entry(sinfo, &domain->subdevices, link_domain)
> > +   if (dev_iotlb_enabled(get_domain_info(sinfo->pdev)))
> {
> 
> Please make the code easier for reading by:
> 
>   info = get_domain_info(sinfo->pdev);
>   if (dev_iotlb_enabled(info))
>   
> 
> Best regards,
> baolu
> 
> > +   has_iotlb_device = true;
> > +   break;
> > +   }
> > }
> >
> > domain->has_iotlb_device = has_iotlb_device;
> > @@ -1557,25 +1576,37 @@ static void iommu_disable_dev_iotlb(struct
> device_domain_info *info)
> >   #endif
> >   }
> >
> > +static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
> > +   u64 addr, unsigned int mask)
> > +{
> > +   u16 sid, qdep;
> > +
> > +   if (!info || !info->ats_enabled)
> > +   return;
> > +
> > +   sid = info->bus << 8 | info->devfn;
> > +   qdep = info->ats_qdep;
> > +   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > +  qdep, addr, mask);
> >

[PATCH v2 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2020-12-22 Thread Liu Yi L
iommu_flush_dev_iotlb() is called to invalidate caches on the device. It only
loops over the devices that are fully attached to the domain, so it is
ineffective for sub-devices and leaves stale cache entries on the
device. Fix it by adding a loop over the subdevices as well. Also,
domain->has_iotlb_device needs to be updated when subdevices are attached.

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 63 +++--
 1 file changed, 47 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index acfe0a5b955e..e97c5ac1d7fc 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -726,6 +726,8 @@ static int domain_update_device_node(struct dmar_domain 
*domain)
return nid;
 }
 
+static void domain_update_iotlb(struct dmar_domain *domain);
+
 /* Some capabilities may be different across iommus */
 static void domain_update_iommu_cap(struct dmar_domain *domain)
 {
@@ -739,6 +741,8 @@ static void domain_update_iommu_cap(struct dmar_domain 
*domain)
 */
if (domain->nid == NUMA_NO_NODE)
domain->nid = domain_update_device_node(domain);
+
+   domain_update_iotlb(domain);
 }
 
 struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
@@ -1459,6 +1463,18 @@ iommu_support_dev_iotlb (struct dmar_domain *domain, 
struct intel_iommu *iommu,
return NULL;
 }
 
+static bool dev_iotlb_enabled(struct device_domain_info *info)
+{
+   struct pci_dev *pdev;
+
+   if (!info->dev || !dev_is_pci(info->dev))
+   return false;
+
+   pdev = to_pci_dev(info->dev);
+
+   return !!pdev->ats_enabled;
+}
+
 static void domain_update_iotlb(struct dmar_domain *domain)
 {
struct device_domain_info *info;
@@ -1466,17 +1482,20 @@ static void domain_update_iotlb(struct dmar_domain 
*domain)
 
assert_spin_locked(&device_domain_lock);
 
-   list_for_each_entry(info, &domain->devices, link) {
-   struct pci_dev *pdev;
-
-   if (!info->dev || !dev_is_pci(info->dev))
-   continue;
-
-   pdev = to_pci_dev(info->dev);
-   if (pdev->ats_enabled) {
+   list_for_each_entry(info, &domain->devices, link)
+   if (dev_iotlb_enabled(info)) {
has_iotlb_device = true;
break;
}
+
+   if (!has_iotlb_device) {
+   struct subdev_domain_info *sinfo;
+
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain)
+   if (dev_iotlb_enabled(get_domain_info(sinfo->pdev))) {
+   has_iotlb_device = true;
+   break;
+   }
}
 
domain->has_iotlb_device = has_iotlb_device;
@@ -1557,25 +1576,37 @@ static void iommu_disable_dev_iotlb(struct 
device_domain_info *info)
 #endif
 }
 
+static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
+   u64 addr, unsigned int mask)
+{
+   u16 sid, qdep;
+
+   if (!info || !info->ats_enabled)
+   return;
+
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask);
+}
+
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  u64 addr, unsigned mask)
 {
-   u16 sid, qdep;
unsigned long flags;
struct device_domain_info *info;
+   struct subdev_domain_info *sinfo;
 
if (!domain->has_iotlb_device)
return;
 
spin_lock_irqsave(&device_domain_lock, flags);
-   list_for_each_entry(info, &domain->devices, link) {
-   if (!info->ats_enabled)
-   continue;
+   list_for_each_entry(info, &domain->devices, link)
+   __iommu_flush_dev_iotlb(info, addr, mask);
 
-   sid = info->bus << 8 | info->devfn;
-   qdep = info->ats_qdep;
-   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-   qdep, addr, mask);
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
+   addr, mask);
}
spin_unlock_irqrestore(_domain_lock, flags);
 }
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
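
The two-list flush in the hunk above can be modelled in user space
roughly as follows. This is only a sketch under stated assumptions, not
the kernel code: struct dev_info, struct domain, flush_one() and
flush_domain() are hypothetical stand-ins, plain arrays replace the
kernel's list_head lists, and the locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for device_domain_info; not the kernel struct. */
struct dev_info {
	bool ats_enabled;
	unsigned int flushes;	/* counts simulated invalidations */
};

/* Hypothetical stand-in for a domain with two attachment lists. */
struct domain {
	struct dev_info **devices;	/* fully attached devices */
	size_t ndevices;
	struct dev_info **subdevices;	/* parents of aux-attached subdevices */
	size_t nsubdevices;
};

/* Shared helper: skip missing or non-ATS devices, otherwise "flush". */
static void flush_one(struct dev_info *info)
{
	if (!info || !info->ats_enabled)
		return;
	info->flushes++;
}

/* Walk both lists so that aux-attached devices are not missed. */
static void flush_domain(struct domain *d)
{
	size_t i;

	assert(d);
	for (i = 0; i < d->ndevices; i++)
		flush_one(d->devices[i]);
	for (i = 0; i < d->nsubdevices; i++)
		flush_one(d->subdevices[i]);
}
```

Keeping the NULL and ATS checks inside the helper lets both loops stay
one-liners, which seems to be the point of factoring out
__iommu_flush_dev_iotlb() in the patch.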


[PATCH v2 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2020-12-22 Thread Liu Yi L
In the existing code, looping over all devices attached to a domain does
not include sub-devices attached via iommu_aux_attach_device().

This was found while I was working on the patch below. There is no
device in the domain->devices list, so the cap and ecap of the iommu
unit cannot be obtained. But this domain actually has a subdevice which
is attached in the aux manner and is not tracked by the domain. This
patch fixes that.

https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/

This fix goes beyond the patch above: such sub-device tracking is also
necessary in other cases, for example when flushing device_iotlb for a
domain which has sub-devices attached in the auxiliary manner.

Co-developed-by: Xin Zeng 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 95 +++--
 include/linux/intel-iommu.h | 16 +--
 2 files changed, 82 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a49afa11673c..acfe0a5b955e 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1881,6 +1881,7 @@ static struct dmar_domain *alloc_domain(int flags)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(>devices);
+   INIT_LIST_HEAD(>subdevices);
 
return domain;
 }
@@ -2632,7 +2633,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->iommu = iommu;
info->pasid_table = NULL;
info->auxd_enabled = 0;
-   INIT_LIST_HEAD(>auxiliary_domains);
+   INIT_LIST_HEAD(>subdevices);
 
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -5172,33 +5173,61 @@ is_aux_domain(struct device *dev, struct iommu_domain *domain)
domain->type == IOMMU_DOMAIN_UNMANAGED;
 }
 
-static void auxiliary_link_device(struct dmar_domain *domain,
- struct device *dev)
+static inline struct subdev_domain_info *
+lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
+{
+   struct subdev_domain_info *sinfo;
+
+   if (!list_empty(>subdevices)) {
+   list_for_each_entry(sinfo, >subdevices, link_domain) {
+   if (sinfo->pdev == dev)
+   return sinfo;
+   }
+   }
+
+   return NULL;
+}
+
+static int auxiliary_link_device(struct dmar_domain *domain,
+struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
 
assert_spin_locked(_domain_lock);
if (WARN_ON(!info))
-   return;
+   return -EINVAL;
+
+   if (!sinfo) {
+   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
+   sinfo->domain = domain;
+   sinfo->pdev = dev;
+   list_add(>link_phys, >subdevices);
+   list_add(>link_domain, >subdevices);
+   }
 
-   domain->auxd_refcnt++;
-   list_add(>auxd, >auxiliary_domains);
+   return ++sinfo->users;
 }
 
-static void auxiliary_unlink_device(struct dmar_domain *domain,
-   struct device *dev)
+static int auxiliary_unlink_device(struct dmar_domain *domain,
+  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
+   int ret;
 
assert_spin_locked(_domain_lock);
-   if (WARN_ON(!info))
-   return;
+   if (WARN_ON(!info || !sinfo || sinfo->users <= 0))
+   return -EINVAL;
 
-   list_del(>auxd);
-   domain->auxd_refcnt--;
+   ret = --sinfo->users;
+   if (!ret) {
+   list_del(>link_phys);
+   list_del(>link_domain);
+   kfree(sinfo);
+   }
 
-   if (!domain->auxd_refcnt && domain->default_pasid > 0)
-   ioasid_free(domain->default_pasid);
+   return ret;
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -5227,6 +5256,19 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
}
 
spin_lock_irqsave(_domain_lock, flags);
+   ret = auxiliary_link_device(domain, dev);
+   if (ret <= 0)
+   goto link_failed;
+
+   /*
+* Subdevices from the same physical device can be attached to the
+* same domain. For such cases, only the first subdevice attachment
+* needs to go through the full steps in this function. So if ret >
+* 1, just goto out.
+*/
+   if (ret > 1)
+   goto out;
+
/*
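
(The archived diff is cut off at this point.) The refcounted
link/unlink scheme described above can be sketched in user space as
follows. This is an illustration under stated assumptions, not the
driver code: struct subdev, struct domain, link_subdev() and
unlink_subdev() are hypothetical names, a singly linked list replaces
the two list_head links, and the sketch checks the allocation result,
which the quoted hunk does not show.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-(domain, physical device) tracking entry. */
struct subdev {
	const void *pdev;	/* identity of the physical device */
	int users;		/* number of aux attachments */
	struct subdev *next;
};

struct domain {
	struct subdev *subdevices;
};

static struct subdev *lookup_subdev(struct domain *d, const void *pdev)
{
	struct subdev *s;

	for (s = d->subdevices; s; s = s->next)
		if (s->pdev == pdev)
			return s;
	return NULL;
}

/* Returns the new user count (> 0), or -ENOMEM on allocation failure. */
static int link_subdev(struct domain *d, const void *pdev)
{
	struct subdev *s;

	assert(d);
	s = lookup_subdev(d, pdev);
	if (!s) {
		s = calloc(1, sizeof(*s));
		if (!s)
			return -ENOMEM;
		s->pdev = pdev;
		s->next = d->subdevices;
		d->subdevices = s;
	}
	return ++s->users;
}

/* Returns the remaining user count, or -EINVAL if nothing is linked. */
static int unlink_subdev(struct domain *d, const void *pdev)
{
	struct subdev **pp, *s;
	int ret;

	assert(d);
	for (pp = &d->subdevices; (s = *pp); pp = &s->next)
		if (s->pdev == pdev)
			break;
	if (!s || s->users <= 0)
		return -EINVAL;

	ret = --s->users;
	if (!ret) {
		/* Last user gone: drop the tracking entry. */
		*pp = s->next;
		free(s);
	}
	return ret;
}
```

A caller can then treat a return value of 1 from link_subdev() as "first
attachment, do the full setup" and anything larger as "already set up",
which mirrors the ret > 1 shortcut in aux_domain_add_dev() above.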
   

[PATCH v2 1/3] iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev

2020-12-22 Thread Liu Yi L
Currently, struct intel_svm has a field that records the struct
intel_iommu pointer for a PASID bind, and struct intel_svm is shared by
all the devices bound to the same process. These devices may be behind
different DMAR units. As the iommu driver code uses the intel_iommu
pointer stored in struct intel_svm to do cache invalidations, it may only
flush the cache on a single DMAR unit; for the others, the cache
invalidation is missed.

As intel_svm struct already has a device list, this patch just moves the
intel_iommu pointer to be a field of intel_svm_dev struct.

Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode")
Cc: Lu Baolu 
Cc: Jacob Pan 
Cc: Raj Ashok 
Cc: David Woodhouse 
Reported-by: Guo Kaijie 
Reported-by: Xin Zeng 
Signed-off-by: Guo Kaijie 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
Tested-by: Guo Kaijie 
---
 drivers/iommu/intel/svm.c   | 9 +
 include/linux/intel-iommu.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 3242ebd0bca3..4a10c9ff368c 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -142,7 +142,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, , 1, 0);
+   qi_submit_sync(sdev->iommu, , 1, 0);
 
if (sdev->dev_iotlb) {
desc.qw0 = QI_DEV_EIOTLB_PASID(svm->pasid) |
@@ -166,7 +166,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, , 1, 0);
+   qi_submit_sync(sdev->iommu, , 1, 0);
}
 }
 
@@ -211,7 +211,7 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 */
rcu_read_lock();
list_for_each_entry_rcu(sdev, >devs, list)
-   intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+   intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
svm->pasid, true);
rcu_read_unlock();
 
@@ -363,6 +363,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
}
sdev->dev = dev;
sdev->sid = PCI_DEVID(info->bus, info->devfn);
+   sdev->iommu = iommu;
 
/* Only count users if device has aux domains */
if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX))
@@ -546,6 +547,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
goto out;
}
sdev->dev = dev;
+   sdev->iommu = iommu;
 
ret = intel_iommu_enable_pasid(iommu, dev);
if (ret) {
@@ -575,7 +577,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
kfree(sdev);
goto out;
}
-   svm->iommu = iommu;
 
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d956987ed032..94522685a0d9 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -758,6 +758,7 @@ struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
+   struct intel_iommu *iommu;
struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
@@ -771,7 +772,6 @@ struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
 
-   struct intel_iommu *iommu;
unsigned int flags;
u32 pasid;
int gpasid; /* In case that guest PASID is different from host PASID */
-- 
2.25.1
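
The effect of moving the pointer can be illustrated with a small
user-space model. It is a sketch only: struct iommu_unit, struct
svm_dev, struct svm and flush_all() are hypothetical stand-ins for
intel_iommu, intel_svm_dev, intel_svm and the invalidation paths, and a
counter replaces the real queued invalidation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one DMAR unit. */
struct iommu_unit {
	unsigned int invalidations;
};

/* Per-device state: each device remembers the unit it sits behind. */
struct svm_dev {
	struct iommu_unit *iommu;
	struct svm_dev *next;
};

/* Shared per-process state: only the device list, no single unit. */
struct svm {
	struct svm_dev *devs;
};

/* Invalidate on every device's own unit, not on one shared pointer. */
static void flush_all(struct svm *svm)
{
	struct svm_dev *sdev;

	assert(svm);
	for (sdev = svm->devs; sdev; sdev = sdev->next)
		sdev->iommu->invalidations++;
}
```

With a single pointer in the shared structure, only one of the two
units in such a setup would ever see an invalidation; with the
per-device pointer each unit is reached through its own device.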



[PATCH v2 0/3] iommu/vt-d: Misc fixes on scalable mode

2020-12-22 Thread Liu Yi L
This patchset fixes a bug in native SVM usage, and also several bugs
around the tracking of subdevices (attached to a device in the auxiliary
manner) and ineffective device_tlb flushes.

Liu Yi L (3):
  iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
intel_svm_dev
  iommu/vt-d: Track device aux-attach with subdevice_domain_info
  iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

 drivers/iommu/intel/iommu.c | 158 +++-
 drivers/iommu/intel/svm.c   |   9 +-
 include/linux/intel-iommu.h |  18 ++--
 3 files changed, 135 insertions(+), 50 deletions(-)

-- 
2.25.1



RE: [PATCH 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info.

2020-12-22 Thread Liu, Yi L
Hi Jacob,

> From: Jacob Pan 
> Sent: Wednesday, December 23, 2020 4:21 AM
>
> Hi Yi,
> 
> On Sun, 20 Dec 2020 08:03:51 +0800, Liu Yi L  wrote:
> 
> > In existing code, if wanting to loop all devices attached to a domain,
> > current code can only loop the devices which are attached to the domain
> > via normal manner. While for devices attached via auxiliary manner, this
> > is subdevice, they are not tracked in the domain. This patch adds struct
> How about "In the existing code, loop all devices attached to a domain does
> not include sub-devices attached via iommu_aux_attach_device()."

looks good. will refine accordingly. 

> 
> > subdevice_domain_info which is created per domain attachment via
> auxiliary
> > manner. So that such devices are also tracked in domain.
> >
> > This was found by when I'm working on the belwo patch, There is no device
> > in domain->devices, thus unable to get the cap and ecap of iommu unit. But
> > this domain actually has one sub-device which is attached via aux-manner.
> > This patch fixes the issue.
> >
> > https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/
> >
> > But looks like, it doesn't affect me only. Such auxiliary track should be
> > there for example if wanting to flush device_iotlb for a domain which has
> > devices attached by auxiliray manner, then this fix is also necessary.
> Perhaps:
> This fix goes beyond the patch above, such sub-device tracking is
> necessary for other cases. For example, flushing device_iotlb for a domain
> which has sub-devices attached by auxiliary manner.

Yep. Baolu also suggested such a refinement. Will tweak in the next version.

Regards,
Yi Liu

> 
> > This issue will also be fixed by another patch in this series with some
> > additional changes based on the sudevice tracking framework introduced in
> > this patch.
> >
> > Co-developed-by: Xin Zeng 
> > Signed-off-by: Xin Zeng 
> > Co-developed-by: Liu Yi L 
> > Signed-off-by: Liu Yi L 
> > Co-developed-by: Lu Baolu 
> > Signed-off-by: Lu Baolu 
> > ---
> >  drivers/iommu/intel/iommu.c | 92 --
> ---
> >  include/linux/intel-iommu.h | 11 -
> >  2 files changed, 90 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index a49afa11673c..4274b4acc325 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1881,6 +1881,7 @@ static struct dmar_domain *alloc_domain(int
> flags)
> > domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> > domain->has_iotlb_device = false;
> > INIT_LIST_HEAD(>devices);
> > +   INIT_LIST_HEAD(>sub_devices);
> >
> > return domain;
> >  }
> > @@ -5172,33 +5173,79 @@ is_aux_domain(struct device *dev, struct
> > iommu_domain *domain) domain->type ==
> IOMMU_DOMAIN_UNMANAGED;
> >  }
> >
> > +static inline
> > +void _auxiliary_link_device(struct dmar_domain *domain,
> > +   struct subdevice_domain_info *subinfo,
> > +   struct device *dev)
> > +{
> > +   subinfo->users++;
> > +}
> > +
> why pass in more arguments than subinfo? the function name does not match
> what it does, seems just refcount inc.
> 
> > +static inline
> > +int _auxiliary_unlink_device(struct dmar_domain *domain,
> > +struct subdevice_domain_info *subinfo,
> > +struct device *dev)
> > +{
> > +   subinfo->users--;
> > +   return subinfo->users;
> ditto. why not just
>   return subinfo->users--;
> 
> > +}
> > +
> >  static void auxiliary_link_device(struct dmar_domain *domain,
> >   struct device *dev)
> >  {
> > struct device_domain_info *info = get_domain_info(dev);
> > +   struct subdevice_domain_info *subinfo;
> >
> > assert_spin_locked(_domain_lock);
> > if (WARN_ON(!info))
> > return;
> >
> > +   subinfo = kzalloc(sizeof(*subinfo), GFP_ATOMIC);
> > +   if (!subinfo)
> > +   return;
> > +
> > +   subinfo->domain = domain;
> > +   subinfo->dev = dev;
> > +   list_add(>link_domain, >auxiliary_domains);
> > +   list_add(>link_phys, >sub_devices);
> > +   _auxiliary_link_device(domain, subinfo, dev);
> or just opencode subinfo->users++?
> > domain->auxd_refcnt++;
> > -   list_add(>auxd, >auxiliary_domain
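
(The archived reply is cut off here.) On the decrement helper discussed
above: in C, "return subinfo->users--;" taken literally would return the
count from before the decrement, whereas the quoted helper returns the
count after it; the v2 patch uses the pre-decrement form
"ret = --sinfo->users;". A small sketch of the difference, using a
hypothetical struct subinfo and helper names that are not from the
patch:

```c
#include <assert.h>

/* Hypothetical stand-in holding only the user count. */
struct subinfo {
	int users;
};

/* Decrement, then return the new count (as in the quoted helper). */
static int put_return_new(struct subinfo *s)
{
	assert(s);
	s->users--;
	return s->users;
}

/* Post-decrement: returns the count from *before* the decrement. */
static int put_return_old(struct subinfo *s)
{
	assert(s);
	return s->users--;
}

/* Pre-decrement: the one-line equivalent of put_return_new(). */
static int put_pre_decrement(struct subinfo *s)
{
	assert(s);
	return --s->users;
}
```

Starting from users == 1, the first and third helpers return 0 and the
second returns 1, so a caller testing "ret == 0" to free the entry would
behave differently with the post-decrement form.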

RE: [PATCH 0/4] iommu/vtd-: Misc fixes on scalable mode

2020-12-22 Thread Liu, Yi L
> From: Jacob Pan 
> Sent: Wednesday, December 23, 2020 2:17 AM
> 
> Hi Yi,
> 
> nit: The cover letter is 0/4, patches are 1/3 - 3/3. You also need to copy
> LKML.
> 
> On Sun, 20 Dec 2020 08:03:49 +0800, Liu Yi L  wrote:
> 
> > Hi,
> >
> > This patchset aims to fix a bug regards to SVM usage on native, and
> perhaps 'native SVM usage'

got it. thanks. will correct it.

Regards,
Yi Liu

> > also several bugs around subdevice (attached to device via auxiliary
> > manner) tracking and ineffective device_tlb flush.
> >
> > Regards,
> > Yi Liu
> >
> > Liu Yi L (3):
> >   iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
> > intel_svm_dev
> >   iommu/vt-d: Track device aux-attach with subdevice_domain_info.
> >   iommu/vt-d: A fix to iommu_flush_dev_iotlb() for aux-domain
> >
> >  drivers/iommu/intel/iommu.c | 182 ++---
> ---
> >  drivers/iommu/intel/svm.c   |   9 +-
> >  include/linux/intel-iommu.h |  13 ++-
> >  3 files changed, 168 insertions(+), 36 deletions(-)
> >
> 
> 
> Thanks,
> 
> Jacob


[PATCH 1/3] iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev

2020-12-18 Thread Liu Yi L
Currently, struct intel_svm has a field that records the struct
intel_iommu pointer for a PASID bind, and struct intel_svm is shared by
all the devices bound to the same process. These devices may be behind
different DMAR units. As the iommu driver code uses the intel_iommu
pointer stored in struct intel_svm to do cache invalidations, it may only
flush the cache on a single DMAR unit; for the others, the cache
invalidation is missed.

As intel_svm struct already has a device list, this patch just moves the
intel_iommu pointer to be a field of intel_svm_dev struct.

Fixes: 2f26e0a9c986 ("iommu/vt-d: Add basic SVM PASID support")
Cc: Lu Baolu 
Cc: Jacob Pan 
Cc: Raj Ashok 
Cc: David Woodhouse 
Reported-by: Guo Kaijie 
Reported-by: Xin Zeng 
Signed-off-by: Guo Kaijie 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/svm.c   | 9 +
 include/linux/intel-iommu.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 3242ebd0bca3..4a10c9ff368c 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -142,7 +142,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, , 1, 0);
+   qi_submit_sync(sdev->iommu, , 1, 0);
 
if (sdev->dev_iotlb) {
desc.qw0 = QI_DEV_EIOTLB_PASID(svm->pasid) |
@@ -166,7 +166,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, , 1, 0);
+   qi_submit_sync(sdev->iommu, , 1, 0);
}
 }
 
@@ -211,7 +211,7 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 */
rcu_read_lock();
list_for_each_entry_rcu(sdev, >devs, list)
-   intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+   intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
svm->pasid, true);
rcu_read_unlock();
 
@@ -363,6 +363,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
}
sdev->dev = dev;
sdev->sid = PCI_DEVID(info->bus, info->devfn);
+   sdev->iommu = iommu;
 
/* Only count users if device has aux domains */
if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX))
@@ -546,6 +547,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
goto out;
}
sdev->dev = dev;
+   sdev->iommu = iommu;
 
ret = intel_iommu_enable_pasid(iommu, dev);
if (ret) {
@@ -575,7 +577,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
kfree(sdev);
goto out;
}
-   svm->iommu = iommu;
 
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d956987ed032..94522685a0d9 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -758,6 +758,7 @@ struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
+   struct intel_iommu *iommu;
struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
@@ -771,7 +772,6 @@ struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
 
-   struct intel_iommu *iommu;
unsigned int flags;
u32 pasid;
int gpasid; /* In case that guest PASID is different from host PASID */
-- 
2.25.1



[PATCH 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info.

2020-12-18 Thread Liu Yi L
In the existing code, looping over all devices attached to a domain only
covers the devices which are attached to the domain in the normal manner.
Devices attached in the auxiliary manner (subdevices) are not tracked in
the domain. This patch adds struct subdevice_domain_info, which is created
per domain attachment in the auxiliary manner, so that such devices are
also tracked in the domain.

This was found while I was working on the patch below. There is no device
in domain->devices, so the cap and ecap of the iommu unit cannot be
obtained. But this domain actually has one sub-device which is attached
in the aux manner. This patch fixes the issue.

https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/

But it looks like this does not affect only me. Such auxiliary tracking
is also needed, for example, when flushing device_iotlb for a domain which
has devices attached in the auxiliary manner. That issue is fixed by
another patch in this series, with some additional changes based on the
subdevice tracking framework introduced in this patch.

Co-developed-by: Xin Zeng 
Signed-off-by: Xin Zeng 
Co-developed-by: Liu Yi L 
Signed-off-by: Liu Yi L 
Co-developed-by: Lu Baolu 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 92 -
 include/linux/intel-iommu.h | 11 -
 2 files changed, 90 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a49afa11673c..4274b4acc325 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1881,6 +1881,7 @@ static struct dmar_domain *alloc_domain(int flags)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(>devices);
+   INIT_LIST_HEAD(>sub_devices);
 
return domain;
 }
@@ -5172,33 +5173,79 @@ is_aux_domain(struct device *dev, struct iommu_domain *domain)
domain->type == IOMMU_DOMAIN_UNMANAGED;
 }
 
+static inline
+void _auxiliary_link_device(struct dmar_domain *domain,
+   struct subdevice_domain_info *subinfo,
+   struct device *dev)
+{
+   subinfo->users++;
+}
+
+static inline
+int _auxiliary_unlink_device(struct dmar_domain *domain,
+struct subdevice_domain_info *subinfo,
+struct device *dev)
+{
+   subinfo->users--;
+   return subinfo->users;
+}
+
 static void auxiliary_link_device(struct dmar_domain *domain,
  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdevice_domain_info *subinfo;
 
assert_spin_locked(_domain_lock);
if (WARN_ON(!info))
return;
 
+   subinfo = kzalloc(sizeof(*subinfo), GFP_ATOMIC);
+   if (!subinfo)
+   return;
+
+   subinfo->domain = domain;
+   subinfo->dev = dev;
+   list_add(>link_domain, >auxiliary_domains);
+   list_add(>link_phys, >sub_devices);
+   _auxiliary_link_device(domain, subinfo, dev);
domain->auxd_refcnt++;
-   list_add(>auxd, >auxiliary_domains);
 }
 
-static void auxiliary_unlink_device(struct dmar_domain *domain,
-   struct device *dev)
+static struct subdevice_domain_info *
+subdevice_domain_info_lookup(struct dmar_domain *domain, struct device *dev)
+{
+   struct subdevice_domain_info *subinfo;
+
+   assert_spin_locked(_domain_lock);
+
+   list_for_each_entry(subinfo, >sub_devices, link_phys)
+   if (subinfo->dev == dev)
+   return subinfo;
+
+   return NULL;
+}
+
+static int auxiliary_unlink_device(struct dmar_domain *domain,
+  struct subdevice_domain_info *subinfo,
+  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   int ret;
 
assert_spin_locked(_domain_lock);
if (WARN_ON(!info))
-   return;
+   return -EINVAL;
 
-   list_del(>auxd);
+   ret = _auxiliary_unlink_device(domain, subinfo, dev);
+   if (ret == 0) {
+   list_del(>link_domain);
+   list_del(>link_phys);
+   kfree(subinfo);
+   }
domain->auxd_refcnt--;
 
-   if (!domain->auxd_refcnt && domain->default_pasid > 0)
-   ioasid_free(domain->default_pasid);
+   return ret;
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -5207,6 +5254,8 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
int ret;
unsigned long flags;
struct intel_iommu *iommu;
+   struct device_domain_info *info = get_domain_info(dev);
+   struct subdevice_

[PATCH 0/4] iommu/vtd-: Misc fixes on scalable mode

2020-12-18 Thread Liu Yi L
Hi,

This patchset fixes a bug in native SVM usage, and also several bugs
around the tracking of subdevices (attached to a device in the auxiliary
manner) and ineffective device_tlb flushes.

Regards,
Yi Liu

Liu Yi L (3):
  iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
intel_svm_dev
  iommu/vt-d: Track device aux-attach with subdevice_domain_info.
  iommu/vt-d: A fix to iommu_flush_dev_iotlb() for aux-domain

 drivers/iommu/intel/iommu.c | 182 ++--
 drivers/iommu/intel/svm.c   |   9 +-
 include/linux/intel-iommu.h |  13 ++-
 3 files changed, 168 insertions(+), 36 deletions(-)

-- 
2.25.1



[PATCH 3/3] iommu/vt-d: A fix to iommu_flush_dev_iotlb() for aux-domain

2020-12-18 Thread Liu Yi L
iommu_flush_dev_iotlb() is called to invalidate caches on a device. It
only loops over the devices which are fully attached to the domain, so it
is ineffective for sub-devices and leaves stale cache entries on the
device. Fix it by looping over the subdevices as well. Also, update
domain->has_iotlb_device on both device/subdevice attach/detach and
ATS enabling/disabling.

Signed-off-by: Liu Yi L 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 90 +
 1 file changed, 72 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4274b4acc325..d9b6037b72b1 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1437,6 +1437,10 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
(unsigned long long)DMA_TLB_IAIG(val));
 }
 
+/**
+ * For a given bus/devfn, fetch its device_domain_info if it supports
+ * device tlb. Only needs to loop devices attached in normal manner.
+ */
 static struct device_domain_info *
 iommu_support_dev_iotlb (struct dmar_domain *domain, struct intel_iommu *iommu,
 u8 bus, u8 devfn)
@@ -1459,6 +1463,18 @@ iommu_support_dev_iotlb (struct dmar_domain *domain, struct intel_iommu *iommu,
return NULL;
 }
 
+static bool dev_iotlb_enabled(struct device *dev)
+{
+   struct pci_dev *pdev;
+
+   if (!dev || !dev_is_pci(dev))
+   return false;
+
+   pdev = to_pci_dev(dev);
+
+   return !!pdev->ats_enabled;
+}
+
 static void domain_update_iotlb(struct dmar_domain *domain)
 {
struct device_domain_info *info;
@@ -1467,21 +1483,37 @@ static void domain_update_iotlb(struct dmar_domain *domain)
assert_spin_locked(_domain_lock);
 
list_for_each_entry(info, >devices, link) {
-   struct pci_dev *pdev;
-
-   if (!info->dev || !dev_is_pci(info->dev))
-   continue;
-
-   pdev = to_pci_dev(info->dev);
-   if (pdev->ats_enabled) {
+   if (dev_iotlb_enabled(info->dev)) {
has_iotlb_device = true;
break;
}
}
 
+   if (!has_iotlb_device) {
+   struct subdevice_domain_info *subinfo;
+
+   list_for_each_entry(subinfo, >sub_devices, link_phys) {
+   if (dev_iotlb_enabled(subinfo->dev)) {
+   has_iotlb_device = true;
+   break;
+   }
+   }
+   }
domain->has_iotlb_device = has_iotlb_device;
 }
 
+static void dev_update_domain_iotlb(struct device_domain_info *info)
+{
+   struct subdevice_domain_info *subinfo;
+
+   assert_spin_locked(_domain_lock);
+
+   domain_update_iotlb(info->domain);
+
+   list_for_each_entry(subinfo, >auxiliary_domains, link_domain)
+   domain_update_iotlb(subinfo->domain);
+}
+
 static void iommu_enable_dev_iotlb(struct device_domain_info *info)
 {
struct pci_dev *pdev;
@@ -1524,7 +1556,7 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
if (info->ats_supported && pci_ats_page_aligned(pdev) &&
!pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
info->ats_enabled = 1;
-   domain_update_iotlb(info->domain);
+   dev_update_domain_iotlb(info);
info->ats_qdep = pci_ats_queue_depth(pdev);
}
 }
@@ -1543,7 +1575,7 @@ static void iommu_disable_dev_iotlb(struct device_domain_info *info)
if (info->ats_enabled) {
pci_disable_ats(pdev);
info->ats_enabled = 0;
-   domain_update_iotlb(info->domain);
+   dev_update_domain_iotlb(info);
}
 #ifdef CONFIG_INTEL_IOMMU_SVM
if (info->pri_enabled) {
@@ -1557,26 +1589,43 @@ static void iommu_disable_dev_iotlb(struct device_domain_info *info)
 #endif
 }
 
+static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
+   u64 addr, unsigned mask)
+{
+   u16 sid, qdep;
+
+   if (!info || !info->ats_enabled)
+   return;
+
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask);
+}
+
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  u64 addr, unsigned mask)
 {
-   u16 sid, qdep;
unsigned long flags;
struct device_domain_info *info;
+   struct subdevice_domain_info *subinfo;
 
if (!domain->has_iotlb_device)
return;
 
spin_lock_irqsave(_domain_lock, flags);
-   list_for_each_entry(info, >devices, link) {
- 
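
(The archived diff is cut off here.) The has_iotlb_device recomputation
described in the commit message can be sketched in user space as
follows. It is a simplified model under stated assumptions: struct dev,
struct domain, any_ats_enabled() and domain_update() are hypothetical
names, and arrays replace the kernel lists and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; only the ATS state is modelled. */
struct dev {
	bool ats_enabled;
};

struct domain {
	struct dev **devices;		/* attached in the normal manner */
	size_t ndevices;
	struct dev **subdevices;	/* attached in the auxiliary manner */
	size_t nsubdevices;
	bool has_iotlb_device;		/* cached "any ATS device?" flag */
};

static bool any_ats_enabled(struct dev **list, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (list[i] && list[i]->ats_enabled)
			return true;
	return false;
}

/* Recompute the cached flag from both lists. */
static void domain_update(struct domain *d)
{
	assert(d);
	d->has_iotlb_device =
		any_ats_enabled(d->devices, d->ndevices) ||
		any_ats_enabled(d->subdevices, d->nsubdevices);
}
```

If only the first list were consulted, a domain whose sole ATS-capable
device is aux-attached would keep the flag false, and the early return
in the flush path would skip it entirely.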

RE: [PATCH v2 1/1] vfio/type1: Add vfio_group_domain()

2020-11-25 Thread Liu, Yi L
On Thurs, Nov 26, 2020, at 9:27 AM, Lu Baolu wrote:
> Add the API for getting the domain from a vfio group. This could be used
> by the physical device drivers which rely on the vfio/mdev framework for
> mediated device user level access. The typical use case like below:
> 
>   unsigned int pasid;
>   struct vfio_group *vfio_group;
>   struct iommu_domain *iommu_domain;
>   struct device *dev = mdev_dev(mdev);
>   struct device *iommu_device = mdev_get_iommu_device(dev);
> 
>   if (!iommu_device ||
>   !iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
>   return -EINVAL;
> 
>   vfio_group = vfio_group_get_external_user_from_dev(dev);(dev);

duplicate (dev); other parts looks good to me. perhaps, you can also
describe that the release function of a sub-device fd should also call
vfio_group_put_external_user() to release its reference on the vfio_group.

Regards,
Yi Liu 

>   if (IS_ERR_OR_NULL(vfio_group))
>   return -EFAULT;
> 
>   iommu_domain = vfio_group_domain(vfio_group);
>   if (IS_ERR_OR_NULL(iommu_domain)) {
>   vfio_group_put_external_user(vfio_group);
>   return -EFAULT;
>   }
> 
>   pasid = iommu_aux_get_pasid(iommu_domain, iommu_device);
>   if (pasid < 0) {
>   vfio_group_put_external_user(vfio_group);
>   return -EFAULT;
>   }
> 
>   /* Program device context with pasid value. */
>   ...
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/vfio/vfio.c | 18 ++
>  drivers/vfio/vfio_iommu_type1.c | 23 +++
>  include/linux/vfio.h|  3 +++
>  3 files changed, 44 insertions(+)
> 
> Change log:
>  - v1: 
> https://lore.kernel.org/linux-iommu/20201112022407.2063896-1-baolu...@linux.intel.com/
>  - Changed according to comments @ 
> https://lore.kernel.org/linux-iommu/20201116125631.2d043...@w520.home/
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 2151bc7f87ab..62c652111c88 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2331,6 +2331,24 @@ int vfio_unregister_notifier(struct device *dev,
> enum vfio_notify_type type,
>  }
>  EXPORT_SYMBOL(vfio_unregister_notifier);
> 
> +struct iommu_domain *vfio_group_domain(struct vfio_group *group)
> +{
> + struct vfio_container *container;
> + struct vfio_iommu_driver *driver;
> +
> + if (!group)
> + return ERR_PTR(-EINVAL);
> +
> + container = group->container;
> + driver = container->iommu_driver;
> + if (likely(driver && driver->ops->group_domain))
> + return driver->ops->group_domain(container->iommu_data,
> +  group->iommu_group);
> + else
> + return ERR_PTR(-ENOTTY);
> +}
> +EXPORT_SYMBOL(vfio_group_domain);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 67e827638995..783f18f21b95 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2980,6 +2980,28 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data,
> dma_addr_t user_iova,
>   return ret;
>  }
> 
> +static void *vfio_iommu_type1_group_domain(void *iommu_data,
> +struct iommu_group *iommu_group)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + struct iommu_domain *domain = NULL;
> + struct vfio_domain *d;
> +
> + if (!iommu || !iommu_group)
> + return ERR_PTR(-EINVAL);
> +
> + mutex_lock(>lock);
> + list_for_each_entry(d, >domain_list, next) {
> + if (find_iommu_group(d, iommu_group)) {
> + domain = d->domain;
> + break;
> + }
> + }
> + mutex_unlock(>lock);
> +
> + return domain;
> +}
> +
>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>   .name   = "vfio-iommu-type1",
>   .owner  = THIS_MODULE,
> @@ -2993,6 +3015,7 @@ static const struct vfio_iommu_driver_ops 
> vfio_iommu_driver_ops_type1 = {
>   .register_notifier  = vfio_iommu_type1_register_notifier,
>   .unregister_notifier= vfio_iommu_type1_unregister_notifier,
>   .dma_rw = vfio_iommu_type1_dma_rw,
> + .group_domain   = vfio_iommu_type1_group_domain,
>  };
> 
>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 38d3c6a8dc7e..a0613a6f21cc 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -90,6 +90,7 @@ struct vfio_iommu_driver_ops {
>  struct notifier_block *nb);
>   int (*dma_rw)(void *iommu_data, dma_addr_t user_iova,
> void *data, size_t count, bool write);
> + void*(*group_domain)(void *iommu_data, struct 
