RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-02 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Friday, April 3, 2020 1:50 AM
> 
> On Sun, 22 Mar 2020 05:31:58 -0700
> "Liu, Yi L"  wrote:
> 
> > From: Liu Yi L 
> >
> > For a long time, devices have had only one DMA address space from the
> > platform IOMMU's point of view. This is true both on bare metal and for
> > directed access in virtualization environments. The reason is that the
> > source ID of a DMA request in PCIe is the BDF (bus/dev/fn ID), which
> > allows only device-granularity DMA isolation. However, this is changing
> > with the latest advancements in I/O technology. More and more platform
> > vendors are using the PCIe PASID TLP prefix in DMA requests, giving
> > devices multiple DMA address spaces identified by individual PASIDs.
> > For example, Shared Virtual Addressing (SVA, a.k.a. Shared Virtual
> > Memory) lets a device access multiple process virtual address spaces by
> > binding each virtual address space to a PASID, where the PASID is
> > allocated in software and programmed into the device in a device-
> > specific manner. Devices which support the PASID capability are called
> > PASID-capable devices. If such devices are passed through to VMs, guest
> > software is also able to bind guest process virtual address spaces on
> > them. The guest software can therefore reuse the bare metal programming
> > model, meaning the guest will also allocate a PASID and program it into
> > the device directly. This is a dangerous situation since it allows
> > PASID conflicts and unauthorized address space access. It is safer to
> > let the host intercept the guest software's PASID allocation, so that
> > PASIDs are managed system-wide.
> 
> Providing an allocation interface only allows for collaborative usage
> of PASIDs though.  Do we have any ability to enforce PASID usage or can
> a user spoof other PASIDs on the same BDF?

A user can access only the PASIDs allocated to itself, i.e. the specific
IOASID set tied to its mm_struct.

Thanks
Kevin

> 
> > This patch adds the VFIO_IOMMU_PASID_REQUEST ioctl, which passes down
> > PASID allocation/free requests from the virtual IOMMU. Since such
> > requests are issued by QEMU or other userspace applications, a
> > mechanism is needed to prevent a single application from exhausting
> > the PASIDs available in the system. With that consideration, this
> > patch tracks VFIO PASID allocation per-VM. There was a discussion
> > about making the quota per assigned device, e.g. a VM with many
> > assigned devices would get a larger quota. However, it is unclear how
> > many PASIDs an assigned device will actually use; a VM with multiple
> > assigned devices may request only a few PASIDs. A per-VM quota is
> > therefore the better choice.
> >
> > This patch uses the struct mm pointer as a per-VM token. The task
> > structure pointer and the vfio_iommu structure pointer were also
> > considered. However, the task structure is per-thread, so it cannot
> > serve per-VM PASID allocation tracking, and the vfio_iommu structure
> > is visible only within vfio. The struct mm pointer was therefore
> > selected. This patch adds a structure vfio_mm: a vfio_mm is created
> > when the first vfio container is opened by a VM and, conversely, is
> > freed when the last vfio container is released. Each VM is assigned a
> > PASID quota so that it cannot request PASIDs beyond it. This patch
> > adds a default quota of 1000; the quota can be tuned by the
> > administrator, and making it tunable is added in a later patch in
> > this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/vfio/vfio.c | 130
> 
> >  drivers/vfio/vfio_iommu_type1.c | 104
> 
> >  include/linux/vfio.h|  20 +++
> >  include/uapi/linux/vfio.h   |  41 +
> >  4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #define DRIVER_VERSION "0.3"
> >  #define DRIVER_AUTHOR  "Alex Williamson
> "
> > @@ -46,6 +47,8 @@ static struct vfio {
> > struct mutexgroup_lock;
> > struct cdev group_cdev;
> > dev_t   group_devt;
> > +   struct list_headvfio_mm_list;
> > +   struct mutexvfio_mm_lock;
> > wait_queue_head_t   release_q;
> >  } vfio;
> >
> > @@ -2129,6 

[RFC PATCH v4] iommu/virtio: Use page size bitmap supported by endpoint

2020-04-02 Thread Bharat Bhushan
Different endpoints can support different page sizes. Probe the
endpoint to see whether it advertises specific page sizes; otherwise
use the global page sizes.

Signed-off-by: Bharat Bhushan 
---
v3->v4:
 - Fix whitespace error

v2->v3:
 - Fixed error return for incompatible endpoint
 - __u64 changed to __le64 in header file

 drivers/iommu/virtio-iommu.c  | 53 +++
 include/uapi/linux/virtio_iommu.h |  7 
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index cce329d71fba..ab09c04b1702 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -78,6 +78,7 @@ struct viommu_endpoint {
struct viommu_dev   *viommu;
struct viommu_domain*vdomain;
struct list_headresv_regions;
+   u64 pgsize_bitmap;
 };
 
 struct viommu_request {
@@ -415,6 +416,19 @@ static int viommu_replay_mappings(struct viommu_domain 
*vdomain)
return ret;
 }
 
+static int viommu_set_pgsize_bitmap(struct viommu_endpoint *vdev,
+   struct virtio_iommu_probe_pgsize_mask *mask,
+   size_t len)
+{
+   /* validate the property length before reading the mask */
+   if (len < sizeof(*mask))
+   return -EINVAL;
+
+   vdev->pgsize_bitmap = le64_to_cpu(mask->pgsize_bitmap);
+   return 0;
+}
+
 static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
   struct virtio_iommu_probe_resv_mem *mem,
   size_t len)
@@ -499,6 +513,9 @@ static int viommu_probe_endpoint(struct viommu_dev *viommu, 
struct device *dev)
case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
ret = viommu_add_resv_mem(vdev, (void *)prop, len);
break;
+   case VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK:
+   ret = viommu_set_pgsize_bitmap(vdev, (void *)prop, len);
+   break;
default:
dev_err(dev, "unknown viommu prop 0x%x\n", type);
}
@@ -607,16 +624,17 @@ static struct iommu_domain *viommu_domain_alloc(unsigned 
type)
	return &vdomain->domain;
 }
 
-static int viommu_domain_finalise(struct viommu_dev *viommu,
+static int viommu_domain_finalise(struct viommu_endpoint *vdev,
  struct iommu_domain *domain)
 {
int ret;
struct viommu_domain *vdomain = to_viommu_domain(domain);
+   struct viommu_dev *viommu = vdev->viommu;
 
vdomain->viommu = viommu;
vdomain->map_flags  = viommu->map_flags;
 
-   domain->pgsize_bitmap   = viommu->pgsize_bitmap;
+   domain->pgsize_bitmap   = vdev->pgsize_bitmap;
domain->geometry= viommu->geometry;
 
	ret = ida_alloc_range(&viommu->domain_ids, viommu->first_domain,
@@ -642,6 +660,29 @@ static void viommu_domain_free(struct iommu_domain *domain)
kfree(vdomain);
 }
 
+/*
+ * Check whether the endpoint's capabilities are compatible with other
+ * endpoints in the domain. Report any inconsistency.
+ */
+static bool viommu_endpoint_is_compatible(struct viommu_endpoint *vdev,
+ struct viommu_domain *vdomain)
+{
+   struct device *dev = vdev->dev;
+
+   if (vdomain->viommu != vdev->viommu) {
+   dev_err(dev, "cannot attach to foreign vIOMMU\n");
+   return false;
+   }
+
+   if (vdomain->domain.pgsize_bitmap != vdev->pgsize_bitmap) {
+   dev_err(dev, "incompatible domain bitmap 0x%lx != 0x%llx\n",
+   vdomain->domain.pgsize_bitmap, vdev->pgsize_bitmap);
+   return false;
+   }
+
+   return true;
+}
+
 static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
int i;
@@ -657,10 +698,9 @@ static int viommu_attach_dev(struct iommu_domain *domain, 
struct device *dev)
 * Properly initialize the domain now that we know which viommu
 * owns it.
 */
-   ret = viommu_domain_finalise(vdev->viommu, domain);
-   } else if (vdomain->viommu != vdev->viommu) {
-   dev_err(dev, "cannot attach to foreign vIOMMU\n");
-   ret = -EXDEV;
+   ret = viommu_domain_finalise(vdev, domain);
+   } else if (!viommu_endpoint_is_compatible(vdev, vdomain)) {
+   ret = -EINVAL;
}
	mutex_unlock(&vdomain->mutex);
 
@@ -875,6 +915,7 @@ static int viommu_add_device(struct device *dev)
 
vdev->dev = dev;
vdev->viommu = viommu;
+   vdev->pgsize_bitmap = viommu->pgsize_bitmap;
	INIT_LIST_HEAD(&vdev->resv_regions);
fwspec->iommu_priv = vdev;
 
diff --git a/include/uapi/linux/virtio_iommu.h 
b/include/uapi/linux/virtio_iommu.h
index 237e36a280cb..bd0a1a844081 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ 

[RFC PATCH v3] iommu/virtio: Use page size bitmap supported by endpoint

2020-04-02 Thread Bharat Bhushan
Different endpoints can support different page sizes. Probe the
endpoint to see whether it advertises specific page sizes; otherwise
use the global page sizes.

Signed-off-by: Bharat Bhushan 
---
 drivers/iommu/virtio-iommu.c  | 54 +++
 include/uapi/linux/virtio_iommu.h |  7 
 2 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index cce329d71fba..2ea9e143d95c 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -78,6 +78,7 @@ struct viommu_endpoint {
struct viommu_dev   *viommu;
struct viommu_domain*vdomain;
struct list_headresv_regions;
+   u64 pgsize_bitmap;
 };
 
 struct viommu_request {
@@ -415,6 +416,20 @@ static int viommu_replay_mappings(struct viommu_domain 
*vdomain)
return ret;
 }
 
+static int viommu_set_pgsize_bitmap(struct viommu_endpoint *vdev,
+   struct virtio_iommu_probe_pgsize_mask *mask,
+   size_t len)
+
+{
+   /* validate the property length before reading the mask */
+   if (len < sizeof(*mask))
+   return -EINVAL;
+
+   vdev->pgsize_bitmap = le64_to_cpu(mask->pgsize_bitmap);
+   return 0;
+}
+
 static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
   struct virtio_iommu_probe_resv_mem *mem,
   size_t len)
@@ -499,6 +514,9 @@ static int viommu_probe_endpoint(struct viommu_dev *viommu, 
struct device *dev)
case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
ret = viommu_add_resv_mem(vdev, (void *)prop, len);
break;
+   case VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK:
+   ret = viommu_set_pgsize_bitmap(vdev, (void *)prop, len);
+   break;
default:
dev_err(dev, "unknown viommu prop 0x%x\n", type);
}
@@ -607,16 +625,17 @@ static struct iommu_domain *viommu_domain_alloc(unsigned 
type)
	return &vdomain->domain;
 }
 
-static int viommu_domain_finalise(struct viommu_dev *viommu,
+static int viommu_domain_finalise(struct viommu_endpoint *vdev,
  struct iommu_domain *domain)
 {
int ret;
struct viommu_domain *vdomain = to_viommu_domain(domain);
+   struct viommu_dev *viommu = vdev->viommu;
 
vdomain->viommu = viommu;
vdomain->map_flags  = viommu->map_flags;
 
-   domain->pgsize_bitmap   = viommu->pgsize_bitmap;
+   domain->pgsize_bitmap   = vdev->pgsize_bitmap;
domain->geometry= viommu->geometry;
 
	ret = ida_alloc_range(&viommu->domain_ids, viommu->first_domain,
@@ -642,6 +661,29 @@ static void viommu_domain_free(struct iommu_domain *domain)
kfree(vdomain);
 }
 
+/*
+ * Check whether the endpoint's capabilities are compatible with other
+ * endpoints in the domain. Report any inconsistency.
+ */
+static bool viommu_endpoint_is_compatible(struct viommu_endpoint *vdev,
+ struct viommu_domain *vdomain)
+{
+   struct device *dev = vdev->dev;
+
+   if (vdomain->viommu != vdev->viommu) {
+   dev_err(dev, "cannot attach to foreign vIOMMU\n");
+   return false;
+   }
+
+   if (vdomain->domain.pgsize_bitmap != vdev->pgsize_bitmap) {
+   dev_err(dev, "incompatible domain bitmap 0x%lx != 0x%llx\n",
+   vdomain->domain.pgsize_bitmap, vdev->pgsize_bitmap);
+   return false;
+   }
+
+   return true;
+}
+
 static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
int i;
@@ -657,10 +699,9 @@ static int viommu_attach_dev(struct iommu_domain *domain, 
struct device *dev)
 * Properly initialize the domain now that we know which viommu
 * owns it.
 */
-   ret = viommu_domain_finalise(vdev->viommu, domain);
-   } else if (vdomain->viommu != vdev->viommu) {
-   dev_err(dev, "cannot attach to foreign vIOMMU\n");
-   ret = -EXDEV;
+   ret = viommu_domain_finalise(vdev, domain);
+   } else if (!viommu_endpoint_is_compatible(vdev, vdomain)) {
+   ret = -EINVAL;
}
	mutex_unlock(&vdomain->mutex);
 
@@ -875,6 +916,7 @@ static int viommu_add_device(struct device *dev)
 
vdev->dev = dev;
vdev->viommu = viommu;
+   vdev->pgsize_bitmap = viommu->pgsize_bitmap;
	INIT_LIST_HEAD(&vdev->resv_regions);
fwspec->iommu_priv = vdev;
 
diff --git a/include/uapi/linux/virtio_iommu.h 
b/include/uapi/linux/virtio_iommu.h
index 237e36a280cb..bd0a1a844081 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -111,6 +111,7 @@ struct virtio_iommu_req_unmap {
 
 #define VIRTIO_IOMMU_PROBE_T_NONE

Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-02 Thread Alex Williamson
On Sun, 22 Mar 2020 05:33:14 -0700
"Liu, Yi L"  wrote:

> From: Liu Yi L 
> 
> Per PCIe r5.0, sec 9.3.7.14, if a PF implements the PASID Capability, the
> PF PASID configuration is shared by its VFs.  VFs must not implement their
> own PASID Capability.
> 
> Per PCIe r5.0, sec 9.3.7.11, VFs must not implement the PRI Capability. If
> the PF implements PRI, it is shared by the VFs.
> 
> On bare metal, this has been addressed by the efforts below:
> https://lkml.org/lkml/2019/9/5/996
> https://lkml.org/lkml/2019/9/5/995
> 
> This patch adds emulated PASID/PRI capabilities for VFs when assigned to
> VMs via the vfio-pci driver. This is required for enabling vSVA on
> pass-through VFs, as VFs have no PASID/PRI capability structure in their
> config space.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 325 
> -
>  1 file changed, 323 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index 4b9af99..84b4ea0 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1509,11 +1509,304 @@ static int vfio_cap_init(struct vfio_pci_device 
> *vdev)
>   return 0;
>  }
>  
> +static int vfio_fill_custom_vconfig_bytes(struct vfio_pci_device *vdev,
> + int offset, uint8_t *data, int size)
> +{
> + int ret = 0, data_offset = 0;
> +
> + while (size) {
> + int filled;
> +
> + if (size >= 4 && !(offset % 4)) {
> > + __le32 *dwordp = (__le32 *)&vdev->vconfig[offset];
> + u32 dword;
> +
> > + memcpy(&dword, data + data_offset, 4);
> + *dwordp = cpu_to_le32(dword);

Why wouldn't we do:

*dwordp = cpu_to_le32(*(u32 *)(data + data_offset));

or better yet, increment data on each iteration for:

*dwordp = cpu_to_le32(*(u32 *)data);

vfio_fill_vconfig_bytes() does almost this same thing, getting the data
from config space rather than a buffer, so please figure out how to
avoid duplicating the logic.

> + filled = 4;
> + } else if (size >= 2 && !(offset % 2)) {
> > + __le16 *wordp = (__le16 *)&vdev->vconfig[offset];
> + u16 word;
> +
> > + memcpy(&word, data + data_offset, 2);
> + *wordp = cpu_to_le16(word);
> + filled = 2;
> + } else {
> > + u8 *byte = &vdev->vconfig[offset];
> +
> + memcpy(byte, data + data_offset, 1);

This one is really silly.

vdev->vconfig[offset] = *data;

> + filled = 1;
> + }
> +
> + offset += filled;
> + data_offset += filled;
> + size -= filled;
> + }
> +
> + return ret;
> +}
> +
> +static int vfio_pci_get_ecap_content(struct pci_dev *pdev,
> + int cap, int cap_len, u8 *content)
> +{
> + int pos, offset, len = cap_len, ret = 0;
> +
> + pos = pci_find_ext_capability(pdev, cap);
> + if (!pos)
> + return -EINVAL;
> +
> + offset = 0;
> + while (len) {
> + int fetched;
> +
> + if (len >= 4 && !(pos % 4)) {
> + u32 *dwordp = (u32 *) (content + offset);
> + u32 dword;
> > + __le32 *dwptr = (__le32 *) &dword;
> +
> > + ret = pci_read_config_dword(pdev, pos, &dword);
> + if (ret)
> + return ret;
> + *dwordp = le32_to_cpu(*dwptr);

WHAT???  pci_read_config_dword() returns cpu endian data!

> + fetched = 4;
> + } else if (len >= 2 && !(pos % 2)) {
> + u16 *wordp = (u16 *) (content + offset);
> + u16 word;
> > + __le16 *wptr = (__le16 *) &word;
> +
> > + ret = pci_read_config_word(pdev, pos, &word);
> + if (ret)
> + return ret;
> + *wordp = le16_to_cpu(*wptr);
> + fetched = 2;
> + } else {
> + u8 *byte = (u8 *) (content + offset);
> +
> + ret = pci_read_config_byte(pdev, pos, byte);
> + if (ret)
> + return ret;
> + fetched = 1;
> + }
> +
> + pos += fetched;
> + offset += fetched;
> + len -= fetched;
> + }
> +
> + return ret;
> +}
> +
> +struct vfio_pci_pasid_cap_data {
> + u32 id:16;
> + u32 version:4;
> + u32 next:12;
> + union {
> + u16 cap_reg_val;
> + struct {
> + u16 rsv1:1;
> + u16 execs:1;
> +

Re: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2020-04-02 Thread Alex Williamson
On Sun, 22 Mar 2020 05:32:05 -0700
"Liu, Yi L"  wrote:

> From: Liu Yi L 
> 
> In recent years, mediated device pass-through frameworks (e.g.
> vfio-mdev) have been used to achieve flexible device sharing across
> domains (e.g. VMs). There are also hardware-assisted mediated
> pass-through solutions from platform vendors, e.g. Intel VT-d scalable
> mode, which supports the Intel Scalable I/O Virtualization technology.
> Such mdevs are called IOMMU-backed mdevs, as IOMMU-enforced DMA
> isolation is provided for them. In the kernel, IOMMU-backed mdevs are
> exposed to the IOMMU layer via the aux-domain concept: an mdev is
> protected by an iommu domain which is an aux-domain of its physical
> device. Details can be found in the KVM Forum presentation from Kevin
> Tian; here IOMMU-backed is equivalent to IOMMU-capable.
> 
> https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> 
> This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
> out the physical device of an IOMMU-backed mdev and then invoking IOMMU
> requests to IOMMU layer with the physical device and the mdev's aux
> domain info.
> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> on IOMMU-backed mdevs.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> CC: Jun Tian 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 937ec3f..d473665 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -132,6 +132,7 @@ struct vfio_regions {
>  
>  struct domain_capsule {
>   struct iommu_domain *domain;
> + struct vfio_group *group;
>   void *data;
>  };
>  
> @@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct vfio_iommu 
> *iommu,
> >   list_for_each_entry(d, &iommu->domain_list, next) {
>   dc.domain = d->domain;
> >   list_for_each_entry(g, &d->group_list, next) {
> + dc.group = g;
>   ret = iommu_group_for_each_dev(g->iommu_group,
> >  &dc, fn);
>   if (ret)
> @@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev, 
> void *data)
>   struct iommu_gpasid_bind_data *gbind_data =
>   (struct iommu_gpasid_bind_data *) dc->data;
>  
> - return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> + if (dc->group->mdev_group)
> + return iommu_sva_bind_gpasid(dc->domain,
> + vfio_mdev_get_iommu_device(dev), gbind_data);

But we can't assume an mdev device is IOMMU-backed, so this can be
called with a NULL dev, which will pretty quickly segfault
intel_svm_bind_gpasid.

> + else
> + return iommu_sva_bind_gpasid(dc->domain,
> + dev, gbind_data);
>  }
>  
>  static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> @@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device *dev, 
> void *data)
>   struct iommu_gpasid_bind_data *gbind_data =
>   (struct iommu_gpasid_bind_data *) dc->data;
>  
> - return iommu_sva_unbind_gpasid(dc->domain, dev,
> + if (dc->group->mdev_group)
> + return iommu_sva_unbind_gpasid(dc->domain,
> + vfio_mdev_get_iommu_device(dev),
>   gbind_data->hpasid);

Same

> + else
> + return iommu_sva_unbind_gpasid(dc->domain, dev,
> + gbind_data->hpasid);
>  }
>  
>  /**
> @@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev, void 
> *data)
>   struct iommu_cache_invalidate_info *cache_inv_info =
>   (struct iommu_cache_invalidate_info *) dc->data;
>  
> - return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> + if (dc->group->mdev_group)
> + return iommu_cache_invalidate(dc->domain,
> + vfio_mdev_get_iommu_device(dev), cache_inv_info);

And again

> + else
> + return iommu_cache_invalidate(dc->domain,
> + dev, cache_inv_info);
>  }
>  
>  static long vfio_iommu_type1_ioctl(void *iommu_data,

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE

2020-04-02 Thread Alex Williamson
On Sun, 22 Mar 2020 05:32:04 -0700
"Liu, Yi L"  wrote:

> From: Liu Yi L 
> 
> For VFIO IOMMUs of type VFIO_TYPE1_NESTING_IOMMU, the guest "owns" the
> first-level/stage-1 translation structures, so the host IOMMU driver
> has no knowledge of first-level/stage-1 structure cache updates unless
> the guest invalidation requests are trapped and propagated to the host.
> 
> This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to propagate guest
> first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU cache
> correctness.
> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> safely, as host IOMMU IOTLB correctness is ensured.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Eric Auger 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 49 
> +
>  include/uapi/linux/vfio.h   | 22 ++
>  2 files changed, 71 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index a877747..937ec3f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2423,6 +2423,15 @@ static long vfio_iommu_type1_unbind_gpasid(struct 
> vfio_iommu *iommu,
>   return ret;
>  }
>  
> +static int vfio_cache_inv_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_cache_invalidate_info *cache_inv_info =
> + (struct iommu_cache_invalidate_info *) dc->data;
> +
> + return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>   }
>   kfree(gbind_data);
>   return ret;
> + } else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> + struct vfio_iommu_type1_cache_invalidate cache_inv;
> + u32 version;
> + int info_size;
> + void *cache_info;
> + int ret;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
> + flags);

This breaks backward compatibility as soon as struct
iommu_cache_invalidate_info changes size by its defined versioning
scheme.  ie. a field gets added, the version is bumped, all existing
userspace breaks.  Our minsz is offsetofend to the version field,
interpret the version to size, then reevaluate argsz.

> +
> > + if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (cache_inv.argsz < minsz || cache_inv.flags)
> + return -EINVAL;
> +
> + /* Get the version of struct iommu_cache_invalidate_info */
> > + if (copy_from_user(&version,
> + (void __user *) (arg + minsz), sizeof(version)))
> + return -EFAULT;
> +
> + info_size = iommu_uapi_get_data_size(
> + IOMMU_UAPI_CACHE_INVAL, version);
> +
> + cache_info = kzalloc(info_size, GFP_KERNEL);
> + if (!cache_info)
> + return -ENOMEM;
> +
> + if (copy_from_user(cache_info,
> + (void __user *) (arg + minsz), info_size)) {
> + kfree(cache_info);
> + return -EFAULT;
> + }
> +
> > + mutex_lock(&iommu->lock);
> + ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> + cache_info);

How does a user respond when their cache invalidate fails?  Isn't this
also another case where our for_each_dev can fail at an arbitrary point
leaving us with no idea whether each device even had the opportunity to
perform the invalidation request.  I don't see how we have any chance
to maintain coherency after this faults.

> > + mutex_unlock(&iommu->lock);
> + kfree(cache_info);
> + return ret;
>   }
>  
>   return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2235bc6..62ca791 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
>   */
>  #define VFIO_IOMMU_BIND  _IO(VFIO_TYPE, VFIO_BASE + 23)
>  
> +/**
> + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> + *   struct vfio_iommu_type1_cache_invalidate)
> + *
> + * Propagate guest IOMMU cache invalidation to the host. The cache
> + * invalidation information is conveyed by @cache_info, the content
> + * format would be structures defined in uapi/linux/iommu.h. User
> + * should be aware of that the struct  

Re: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID

2020-04-02 Thread Jacob Pan
On Thu, 2 Apr 2020 02:18:45 +
"Tian, Kevin"  wrote:

> > From: Jacob Pan 
> > Sent: Wednesday, April 1, 2020 11:48 PM
> > 
> > On Sat, 28 Mar 2020 10:22:41 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Jacob Pan 
> > > > Sent: Saturday, March 21, 2020 7:28 AM
> > > >
> > > > When VT-d driver runs in the guest, PASID allocation must be
> > > > performed via virtual command interface. This patch registers a
> > > > custom IOASID allocator which takes precedence over the default
> > > > XArray based allocator. The resulting IOASID allocation will
> > > > always come from the host. This ensures that PASID namespace is
> > > > system- wide.
> > > >
> > > > Signed-off-by: Lu Baolu 
> > > > Signed-off-by: Liu, Yi L 
> > > > Signed-off-by: Jacob Pan 
> > > > ---
> > > >  drivers/iommu/intel-iommu.c | 84
> > > > +
> > > >  include/linux/intel-iommu.h |  2 ++
> > > >  2 files changed, 86 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/intel-iommu.c
> > > > b/drivers/iommu/intel-iommu.c index a76afb0fd51a..c1c0b0fb93c3
> > > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > > +++ b/drivers/iommu/intel-iommu.c
> > > > @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct  
> > intel_iommu  
> > > > *iommu)
> > > > if (ecap_prs(iommu->ecap))
> > > > intel_svm_finish_prq(iommu);
> > > > }
> > > > +   if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> > > > +
> > > > ioasid_unregister_allocator(&iommu->pasid_allocator); +
> > > >  #endif
> > > >  }
> > > >
> > > > @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct
> > > > intel_iommu *iommu)
> > > > return ret;
> > > >  }
> > > >
> > > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > > +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max,
> > > > void *data)  
> > >
> > > the name is too generic... can we add vcmd in the name to clarify
> > > its purpose, e.g. intel_vcmd_ioasid_alloc?
> > >  
> > I feel the intel_ prefix is a natural extension of a generic API,
> > we do that for other IOMMU APIs, right?  
> 
> other IOMMU APIs have no difference between host and guest, but
> this one only applies to guest with vcmd interface. 
> 
OK, sounds good. It is more explicit, improves readability.

> >   
> > > > +{
> > > > +   struct intel_iommu *iommu = data;
> > > > +   ioasid_t ioasid;
> > > > +
> > > > +   if (!iommu)
> > > > +   return INVALID_IOASID;
> > > > +   /*
> > > > +* VT-d virtual command interface always uses the full
> > > > 20 bit
> > > > +* PASID range. Host can partition guest PASID range
> > > > based on
> > > > +* policies but it is out of guest's control.
> > > > +*/
> > > > +   if (min < PASID_MIN || max > intel_pasid_max_id)
> > > > +   return INVALID_IOASID;
> > > > +
> > > > +   if (vcmd_alloc_pasid(iommu, &ioasid))
> > > > +   return INVALID_IOASID;
> > > > +
> > > > +   return ioasid;
> > > > +}
> > > > +
> > > > +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> > > > +{
> > > > +   struct intel_iommu *iommu = data;
> > > > +
> > > > +   if (!iommu)
> > > > +   return;
> > > > +   /*
> > > > +* Sanity check the ioasid owner is done at upper
> > > > layer, e.g. VFIO
> > > > +* We can only free the PASID when all the devices are
> > > > unbound.
> > > > +*/
> > > > +   if (ioasid_find(NULL, ioasid, NULL)) {
> > > > +   pr_alert("Cannot free active IOASID %d\n",
> > > > ioasid);
> > > > +   return;
> > > > +   }  
> > >
> > > However the sanity check is not done in default_free. Is there a
> > > reason why using vcmd adds such  new requirement?
> > >  
> > Since we don't support nested guest. This vcmd allocator is only
> > used by the guest IOMMU driver not VFIO. We expect IOMMU driver to
> > have control of the free()/unbind() ordering.
> > 
> > For default_free, it can come from user space and host VFIO which
> > can be out of order. But we will solve that issue with the blocking
> > notifier.
> >   
> > > > +   vcmd_free_pasid(iommu, ioasid);
> > > > +}
> > > > +
> > > > +static void register_pasid_allocator(struct intel_iommu *iommu)
> > > > +{
> > > > +   /*
> > > > +* If we are running in the host, no need for custom
> > > > allocator
> > > > +* in that PASIDs are allocated from the host
> > > > system-wide.
> > > > +*/
> > > > +   if (!cap_caching_mode(iommu->cap))
> > > > +   return;  
> > >
> > > is it more accurate to check against vcmd capability?
> > >  
> > I think this is sufficient. The spec says if vcmd is present, we
> > must use it but not the other way.  
> 
> No, what about an vIOMMU implementation reports CM but not
> VCMD?
> I didn't get the rationale why we check an indirect capability
> when there is already one well defined for the purpose.
> 
We _do_ check 

Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-04-02 Thread Alex Williamson
On Sun, 22 Mar 2020 05:32:03 -0700
"Liu, Yi L"  wrote:

> From: Liu Yi L 
> 
> VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by hardware
> IOMMUs that support nesting DMA translation (a.k.a. dual-stage address
> translation). For such hardware IOMMUs, there are two stages/levels of
> address translation, and software may let userspace/VM own the first-
> level/stage-1 translation structures. An example of such usage is vSVA
> (virtual Shared Virtual Addressing): the VM owns the first-level/
> stage-1 translation structures and binds them to the host, then the
> hardware IOMMU uses nesting translation when doing DMA translation for
> the devices behind the hardware IOMMU.
> 
> This patch adds vfio support for binding guest translation (a.k.a.
> stage-1) structures to the host IOMMU. For VFIO_TYPE1_NESTING_IOMMU,
> binding the guest page table is not enough; an interface must also be
> exposed to the guest for IOMMU cache invalidation when the guest
> modifies the first-level/stage-1 translation structures, since the
> hardware needs to be notified to flush stale IOTLBs. That is
> introduced in the next patch.
> 
> In this patch, guest page table bind and unbind are done using the
> flags VFIO_IOMMU_BIND_GUEST_PGTBL and VFIO_IOMMU_UNBIND_GUEST_PGTBL of
> the VFIO_IOMMU_BIND ioctl; the bind/unbind data are conveyed in
> struct iommu_gpasid_bind_data. Before binding a guest page table to
> the host, the VM should have a PASID allocated by the host via
> VFIO_IOMMU_PASID_REQUEST.
> 
> Binding the guest translation structures (here the guest page table)
> to the host is the first step in setting up vSVA (Virtual Shared
> Virtual Addressing).
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 158 ++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  46 ++++++
>  2 files changed, 204 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 82a9e0b..a877747 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -130,6 +130,33 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)  \
>   (!list_empty(&iommu->domain_list))
>  
> +struct domain_capsule {
> + struct iommu_domain *domain;
> + void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> +   int (*fn)(struct device *dev, void *data),
> +   void *data)
> +{
> + struct domain_capsule dc = {.data = data};
> + struct vfio_domain *d;
> + struct vfio_group *g;
> + int ret = 0;
> +
> > + list_for_each_entry(d, &iommu->domain_list, next) {
> > + dc.domain = d->domain;
> > + list_for_each_entry(g, &d->group_list, next) {
> > + ret = iommu_group_for_each_dev(g->iommu_group,
> > +&dc, fn);
> + if (ret)
> + break;
> + }
> + }
> + return ret;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
>  
>  /*
> @@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>   return 0;
>  }
>  
> +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_gpasid_bind_data *gbind_data =
> + (struct iommu_gpasid_bind_data *) dc->data;
> +
> + return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +}
> +
> +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_gpasid_bind_data *gbind_data =
> + (struct iommu_gpasid_bind_data *) dc->data;
> +
> + return iommu_sva_unbind_gpasid(dc->domain, dev,
> + gbind_data->hpasid);
> +}
> +
> +/**
> + * Unbind specific gpasid, caller of this function requires hold
> + * vfio_iommu->lock
> + */
> +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> + struct iommu_gpasid_bind_data *gbind_data)
> +{
> + return vfio_iommu_for_each_dev(iommu,
> + vfio_unbind_gpasid_fn, gbind_data);
> +}
> +
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> + struct iommu_gpasid_bind_data *gbind_data)
> +{
> + int ret = 0;
> +
> > + mutex_lock(&iommu->lock);
> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + ret = vfio_iommu_for_each_dev(iommu,
> + vfio_bind_gpasid_fn, gbind_data);
> + /*
> +  * If bind failed, it 

Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace

2020-04-02 Thread Alex Williamson
On Sun, 22 Mar 2020 05:32:02 -0700
"Liu, Yi L"  wrote:

> From: Liu Yi L 
> 
> VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> capability to userspace. Thus applications like QEMU could support
> vIOMMU with hardware's nesting translation capability for pass-through
> devices. Before setting up nesting translation for pass-through devices,
> QEMU and other applications need to learn the supported 1st-lvl/stage-1
> translation structure format like page table format.
> 
> Take vSVA (virtual Shared Virtual Addressing) as an example: to support
> vSVA for pass-through devices, QEMU sets up nesting translation for pass-
> through devices. The guest page tables are configured to host as 1st-lvl/
> stage-1 page tables. Therefore, the guest format should be compatible with
> the host side.
> 
> This patch reports the supported 1st-lvl/stage-1 page table format on the
> current platform to userspace. QEMU and similar applications should
> use this format info when trying to set up IOMMU nesting translation on
> the host IOMMU.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 56 ++++++++++
>  include/uapi/linux/vfio.h       |  1 +
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 9aa2a67..82a9e0b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
>   return ret;
>  }
>  
> +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> +  u32 *stage1_format)
> +{
> + struct vfio_domain *domain;
> + u32 format = 0, tmp_format = 0;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> + if (list_empty(&iommu->domain_list)) {
> + mutex_unlock(&iommu->lock);
> + return -EINVAL;
> + }
> +
> + list_for_each_entry(domain, &iommu->domain_list, next) {
> + if (iommu_domain_get_attr(domain->domain,
> + DOMAIN_ATTR_PASID_FORMAT, &format)) {
> + ret = -EINVAL;
> + format = 0;
> + goto out_unlock;
> + }
> + /*
> +  * format is always non-zero (the first format is
> +  * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). Since the
> +  * domains in the list could be backed by different
> +  * IOMMUs, we expect all domains to report an identical
> +  * format; mixed formats are not supported. Return
> +  * -EINVAL to fail the attempt to set up
> +  * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> +  * are detected.
> +  */
> + if (tmp_format && tmp_format != format) {
> + ret = -EINVAL;
> + format = 0;
> + goto out_unlock;
> + }
> +
> + tmp_format = format;
> + }
> + ret = 0;
> +
> +out_unlock:
> + if (format)
> + *stage1_format = format;
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
>  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>struct vfio_info_cap *caps)
>  {
>   struct vfio_info_cap_header *header;
>   struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> + u32 formats = 0;
> + int ret;
> +
> + ret = vfio_iommu_get_stage1_format(iommu, &formats);
> + if (ret) {
> + pr_warn("Failed to get stage-1 format\n");
> + return ret;

Looks like this generates a warning and causes the iommu_get_info ioctl
to fail if the hardware doesn't support the pasid format attribute, or
the domain list is empty.  This breaks users on existing hardware.

> + }
>  
>   header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>  VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>   /* nesting iommu type supports PASID requests (alloc/free) */
>   nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
>   }
> + nesting_cap->stage1_formats = formats;
>  
>   return 0;
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ed9881d..ebeaf3e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
>   struct  vfio_info_cap_header header;
>  #define VFIO_IOMMU_PASID_REQS (1 << 0)
>   __u32   nesting_capabilities;
> + __u32   stage1_formats;
>  };
>  
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)

___
iommu mailing list

Re: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-04-02 Thread Jacob Pan
On Wed, 1 Apr 2020 05:32:21 +
"Tian, Kevin"  wrote:

> > From: Jacob Pan 
> > Sent: Tuesday, March 31, 2020 11:55 PM
> > 
> > On Tue, 31 Mar 2020 06:06:38 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Jacob Pan 
> > > > Sent: Tuesday, March 31, 2020 12:08 AM
> > > >
> > > > On Mon, 30 Mar 2020 05:40:40 +
> > > > "Tian, Kevin"  wrote:
> > > >  
> > > > > > From: Jacob Pan 
> > > > > > Sent: Saturday, March 28, 2020 7:54 AM
> > > > > >
> > > > > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > > > > Christoph Hellwig  wrote:
> > > > > >  
> > > > > > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin
> > > > > > > wrote:  
> > > > > > > > If those API calls are inter-dependent for composing a
> > > > > > > > feature (e.g. SVA), shouldn't we need a way to check
> > > > > > > > them together before exposing the feature to the guest,
> > > > > > > > e.g. through a iommu_get_uapi_capabilities interface?  
> > > > > > >
> > > > > > > Yes, that makes sense.  The important bit is to have a
> > > > > > > capability flags and not version numbers.  
> > > > > >
> > > > > > The challenge is that there are two consumers in the kernel
> > > > > > for this. 1. VFIO only look for compatibility, and size of
> > > > > > each data struct such that it can copy_from_user.
> > > > > >
> > > > > > 2. IOMMU driver, the "real consumer" of the content.
> > > > > >
> > > > > > For 2, I agree and we do plan to use the capability flags to
> > > > > > check content and maintain backward compatibility etc.
> > > > > >
> > > > > > For VFIO, it is difficult to do size look up based on
> > > > > > capability flags.  
> > > > >
> > > > > Can you elaborate the difficulty in VFIO? if, as Christoph
> > > > > Hellwig pointed out, version number is already avoided
> > > > > everywhere, it is interesting to know whether this work
> > > > > becomes a real exception or just requires a different mindset.
> > > > >  
> > > > From VFIO p.o.v. the IOMMU UAPI data is opaque, it only needs
> > > > to do two things:
> > > > 1. is the UAPI compatible?
> > > > 2. what is the size to copy?
> > > >
> > > > If you look at the version number, this is really a "version as
> > > > size" lookup, as provided by the helper function in this patch.
> > > > An example can be the newly introduced clone3 syscall.
> > > > https://lwn.net/Articles/792628/
> > > > In clone3, new version must have new size. The slight difference
> > > > here is that, unlike clone3, we have multiple data structures
> > > > instead of a single struct clone_args {}. And each struct has
> > > > flags to enumerate its contents besides size.  
> > >
> > > Thanks for providing that link. However clone3 doesn't include a
> > > version field to do "version as size" lookup. Instead, as you
> > > said, it includes a size parameter which sounds like the option 3
> > > (argsz) listed below.
> > >  
> > Right, there is no version in clone3. size = version. I view this as
> > a 1:1 lookup.
> >   
> > > >
> > > > Besides breaching data abstraction, if VFIO has to check IOMMU
> > > > flags to determine the sizes, it has many combinations.
> > > >
> > > > We also separate the responsibilities into two parts
> > > > 1. compatibility - version, size by VFIO
> > > > 2. sanity check - capability flags - by IOMMU  
> > >
> > > I feel argsz+flags approach can perfectly meet above requirement.
> > > The userspace set the size and flags for whatever capabilities it
> > > uses, and VFIO simply copies the parameters by size and pass to
> > > IOMMU for further sanity check. Of course the assumption is that
> > > we do provide an interface for userspace to enumerate all
> > > supported capabilities. 
> > You cannot trust the user for argsz; the size to be copied from user
> > must be based on knowledge in the kernel. That is why we have this
> > version-to-size lookup.
> > 
> > In VFIO, the size to copy is based on knowledge of each VFIO UAPI
> > structures and VFIO flags. But here the flags are IOMMU UAPI flags.
> > As you pointed out in another thread, VFIO is one user.  
> 
> If that is the case, can we let VFIO only copy its own UAPI fields
> while simply passing the user pointer of IOMMU UAPI structure to IOMMU
> driver for further size check and copy? Otherwise we are entering a
> dead end that VFIO doesn't want to parse a structure which is not
> defined by him while using version to represent the black box size
> is considered as a discarded scheme and doesn't scale well...
> 
I think this could be an other viable option. Let me try to summarize
since this has been a long discussion since the original version.

Problem statements:
1. When launching vIOMMU in the guest, how can we ensure the host has
compatible support upfront? as compared to fail later.

2. As UAPI data gets extended (both in size and flags), how can we know
the size to copy?

3. Maintain backward compatibility while allowing extensions?

I think we all agreed that using flags (capability or types) is the way
to address #3. As Christoph 
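To make the argsz scheme under discussion concrete, here is a minimal userspace C sketch of the copy pattern (hypothetical struct and function names, not the real VFIO uapi): the kernel copies min(argsz, its own struct size) and zero-fills the remainder, so a caller built against an older, shorter layout still gets sane defaults for fields it never knew about.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical extensible UAPI struct: new fields are only ever appended,
 * and argsz (set by userspace) tells the kernel how much the caller has. */
struct fake_uapi_v2 {
    unsigned int argsz;     /* sizeof() of the struct the caller was built against */
    unsigned int flags;     /* capability flags describing which fields are valid */
    unsigned int old_field;
    unsigned int new_field; /* added in "v2"; absent from v1 callers */
};

#define FAKE_UAPI_V1_SIZE offsetof(struct fake_uapi_v2, new_field)

/* Kernel-side sketch: reject anything smaller than the v1 layout, then
 * copy min(argsz, sizeof(kernel struct)) and zero-fill the rest. */
static int fake_copy_from_user(struct fake_uapi_v2 *dst, const void *user,
                               unsigned int argsz)
{
    unsigned int minsz = FAKE_UAPI_V1_SIZE;

    if (argsz < minsz)      /* must at least cover the original layout */
        return -1;
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, user, argsz < sizeof(*dst) ? argsz : sizeof(*dst));
    return 0;
}
```

The design point is that neither side needs a version number: an old caller passes a small argsz and transparently gets zeroed defaults for newer fields, while a new caller on an old kernel has its extra bytes ignored.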

Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace

2020-04-02 Thread Alex Williamson
On Sun, 22 Mar 2020 05:32:00 -0700
"Liu, Yi L"  wrote:

> From: Liu Yi L 
> 
> This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
> thus userspace could do a pre-check before utilizing this feature.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++
>  include/uapi/linux/vfio.h       |  8 ++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e40afc0..ddd1ffe 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
>   return ret;
>  }
>  
> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> +  struct vfio_info_cap *caps)
> +{
> + struct vfio_info_cap_header *header;
> + struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +
> + header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> +VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> + if (IS_ERR(header))
> + return PTR_ERR(header);
> +
> + nesting_cap = container_of(header,
> + struct vfio_iommu_type1_info_cap_nesting,
> + header);
> +
> + nesting_cap->nesting_capabilities = 0;
> + if (iommu->nesting) {
> + /* nesting iommu type supports PASID requests (alloc/free) */
> + nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> + }
> +
> + return 0;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>   if (ret)
>   return ret;
>  
> + ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> + if (ret)
> + return ret;
> +
>   if (caps.size) {
>   info.flags |= VFIO_IOMMU_INFO_CAPS;
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 298ac80..8837219 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
>   struct  vfio_iova_range iova_ranges[];
>  };
>  
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> + struct  vfio_info_cap_header header;
> +#define VFIO_IOMMU_PASID_REQS (1 << 0)
> + __u32   nesting_capabilities;
> +};
> +
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>  
>  /**

I think this answers my PROBE question on patch 1/.  Should the
quota/usage be exposed to the user here?  Thanks,

Alex



Re: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-04-02 Thread Alex Williamson
On Mon, 30 Mar 2020 11:44:08 +
"Tian, Kevin"  wrote:

> > From: Liu, Yi L 
> > Sent: Monday, March 30, 2020 5:27 PM
> >   
> > > From: Tian, Kevin 
> > > Sent: Monday, March 30, 2020 5:20 PM
> > > To: Liu, Yi L ; alex.william...@redhat.com;
> > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter  
> > for quota  
> > > tuning
> > >  
> > > > From: Liu, Yi L 
> > > > Sent: Monday, March 30, 2020 4:53 PM
> > > >  
> > > > > From: Tian, Kevin 
> > > > > Sent: Monday, March 30, 2020 4:41 PM
> > > > > To: Liu, Yi L ; alex.william...@redhat.com;
> > > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1
> > > > > parameter  
> > > > for quota  
> > > > > tuning
> > > > >  
> > > > > > From: Liu, Yi L 
> > > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > > >
> > > > > > From: Liu Yi L 
> > > > > >
> > > > > > This patch adds a module option to make the PASID quota tunable by
> > > > > > administrator.
> > > > > >
> > > > > > TODO: need to think more on how to make the tuning per-process.
> > > > > >
> > > > > > Previous discussions:
> > > > > > https://patchwork.kernel.org/patch/11209429/
> > > > > >
> > > > > > Cc: Kevin Tian 
> > > > > > CC: Jacob Pan 
> > > > > > Cc: Alex Williamson 
> > > > > > Cc: Eric Auger 
> > > > > > Cc: Jean-Philippe Brucker 
> > > > > > Signed-off-by: Liu Yi L 
> > > > > > ---
> > > > > >  drivers/vfio/vfio.c             | 8 +++-
> > > > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++-
> > > > > >  include/linux/vfio.h            | 3 ++-
> > > > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > > > > index d13b483..020a792 100644
> > > > > > --- a/drivers/vfio/vfio.c
> > > > > > +++ b/drivers/vfio/vfio.c
> > > > > > @@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > > >
> > > > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min,
> > > > > > +int  
> > > > max)  
> > > > > >  {
> > > > > > ioasid_t pasid;
> > > > > > int ret = -ENOSPC;
> > > > > >
> > > > > > mutex_lock(&vmm->pasid_lock);
> > > > > >
> > > > > > +   /* update quota as it is tunable by admin */
> > > > > > +   if (vmm->pasid_quota != quota) {
> > > > > > +   vmm->pasid_quota = quota;
> > > > > > +   ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > > > +   }
> > > > > > +  
> > > > >
> > > > > It's a bit weird to have quota adjusted in the alloc path, since the
> > > > > latter  
> > > > might  
> > > > > be initiated by non-privileged users. Why not doing the simple math
> > > > > in  
> > > > vfio_  
> > > > > create_mm to set the quota when the ioasid set is created? even in
> > > > > the  
> > > > future  
> > > > > you may allow per-process quota setting, that should come from
> > > > > separate privileged path instead of thru alloc..  
> > > >
> > > > The reason is the kernel parameter modification has no event which can
> > > > be used to adjust the quota. So I chose to adjust it in pasid_alloc
> > > > path. If it's not good, how about adding one more IOCTL to let user-
> > > > space trigger a quota adjustment event? Then even non-privileged user
> > > > could trigger quota adjustment, the quota is actually controlled by
> > > > privileged user. How about your opinion?
> > > >  
> > >
> > > why do you need an event to adjust? As I said, you can set the quota
> > > when the set is created in vfio_create_mm...
> > 
> > oh, it's to support runtime adjustments. I guess it may be helpful to let
> > per-VM quota tunable even the VM is running. If just set the quota in
> > vfio_create_mm(), it is not able to adjust at runtime.
> >   
> 
> ok, I didn't note the module parameter was granted with a write permission.
> However there is a further problem. We cannot support PASID reclaim now.
> What about the admin sets a quota smaller than previous value while some
> IOASID sets already exceed the new quota? I'm not sure how to fail a runtime
> module parameter change due to that situation. Possibly a normal sysfs
> node better suits the runtime change requirement...

Yep, making this runtime adjustable seems a bit unpredictable and racy,
and it's not clear to me how a user is going to jump in at just the
right time to adjust the limit.  I'd probably go for a
simple non-runtime adjustable module option.  It's a safety net at this
point anyway afaict.  Thanks,

Alex



Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-02 Thread Alex Williamson
On Sun, 22 Mar 2020 05:31:58 -0700
"Liu, Yi L"  wrote:

> From: Liu Yi L 
> 
> For a long time, devices have had only one DMA address space from the
> platform IOMMU's point of view. This is true for both bare metal and
> directed access in virtualization environments. The reason is that the
> source ID of a DMA request in PCIe is the BDF (bus/dev/fn ID), which
> allows only device-granularity DMA isolation. However, this is changing
> with the latest advancements in the I/O technology area. More and more
> platform vendors are utilizing the PCIe PASID TLP prefix in DMA requests,
> thus giving devices multiple DMA address spaces as identified by their
> individual PASIDs. For example, Shared Virtual Addressing (SVA, a.k.a
> Shared Virtual Memory) is able to let a device access multiple process
> virtual address spaces by binding each virtual address space to a PASID.
> The PASID is allocated in software and programmed to the device in a
> device-specific manner. Devices which support the PASID capability are
> called PASID-capable devices. If such devices are passed through to VMs,
> guest software is also able to bind guest process virtual address spaces
> on such devices. Therefore, the guest software could reuse the bare metal
> software programming model, which means guest software will also allocate
> PASIDs and program them to the device directly. This is a dangerous
> situation since it has the potential for PASID conflicts and unauthorized
> address space access. It would be safer to let the host intercept the
> guest software's PASID allocation, so that PASIDs are managed system-wide.

Providing an allocation interface only allows for collaborative usage
of PASIDs though.  Do we have any ability to enforce PASID usage or can
a user spoof other PASIDs on the same BDF?

> This patch adds the VFIO_IOMMU_PASID_REQUEST ioctl which aims to pass down
> PASID allocation/free requests from the virtual IOMMU. Additionally, since
> such requests are intended to be invoked by QEMU or other applications
> which run in userspace, it is necessary to have a mechanism to prevent a
> single application from abusing the available PASIDs in the system. With
> such consideration, this patch tracks VFIO PASID allocation per-VM. There
> was a discussion about making the quota per assigned device, e.g. if a VM
> has many assigned devices, then it should have a larger quota. However, it
> is not clear how many PASIDs an assigned device will use; e.g. it is
> possible that a VM with multiple assigned devices requests few
> PASIDs. Therefore a per-VM quota is better.
> 
> This patch uses the struct mm pointer as a per-VM token. We also considered
> using the task structure pointer and the vfio_iommu structure pointer.
> However, the task structure is per-thread, which means it cannot achieve
> the per-VM PASID allocation tracking purpose, while the vfio_iommu
> structure is visible only within vfio. Therefore, the struct mm pointer is
> selected. This patch adds a structure vfio_mm. A vfio_mm is created when
> the first vfio container is opened by a VM; in the reverse order, the
> vfio_mm is freed when the last vfio container is released. Each VM is
> assigned a PASID quota, so that it is not able to request PASIDs beyond
> its quota. This patch adds a default quota of 1000. The quota can be tuned
> by the administrator; making the PASID quota tunable will be added in
> another patch in this series.
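The per-VM accounting described above can be sketched in plain userspace C (hypothetical `fake_*` names; the real code returns -ENOSPC and allocates real PASIDs via ioasid_alloc() under vmm->pasid_lock):

```c
#define FAKE_PASID_QUOTA_DEFAULT 1000  /* default per-VM quota from the patch */

/* Hypothetical per-VM accounting mirroring the vfio_mm idea:
 * allocations succeed until the VM's quota is exhausted. */
struct fake_vfio_mm {
    int pasid_quota;
    int pasid_count;
};

static int fake_pasid_alloc(struct fake_vfio_mm *vmm)
{
    if (vmm->pasid_count >= vmm->pasid_quota)
        return -1;              /* stands in for -ENOSPC */
    /* using the running count as the ID is a simplification;
     * the real allocator hands out IDs from an ioasid set */
    return vmm->pasid_count++;
}

static void fake_pasid_free(struct fake_vfio_mm *vmm)
{
    if (vmm->pasid_count > 0)
        vmm->pasid_count--;
}
```

The point of the sketch is only the accounting shape: the quota lives with the per-VM token, so every container opened by the same VM draws from one budget.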
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Yi Sun 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio.c             | 130 ++++++++++++++++
>  drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++
>  include/linux/vfio.h            |  20 +++
>  include/uapi/linux/vfio.h       |  41 +++++
>  4 files changed, 295 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DRIVER_VERSION   "0.3"
>  #define DRIVER_AUTHOR"Alex Williamson "
> @@ -46,6 +47,8 @@ static struct vfio {
>   struct mutex group_lock;
>   struct cdev group_cdev;
>   dev_t group_devt;
> + struct list_head vfio_mm_list;
> + struct mutex vfio_mm_lock;
>   wait_queue_head_t   release_q;
>  } vfio;
>  
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
>  EXPORT_SYMBOL(vfio_unregister_notifier);
>  
>  /**
> + * VFIO_MM objects - create, release, get, put, search
> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> + struct vfio_mm 

Re: How to fix WARN from drivers/base/dd.c in next-20200401 if CONFIG_MODULES=y?

2020-04-02 Thread John Stultz
On Thu, Apr 2, 2020 at 3:17 AM Yoshihiro Shimoda
 wrote:
>
> I found an issue after applied the following patches:
> ---
> 64c775f driver core: Rename deferred_probe_timeout and make it global
> 0e9f8d0 driver core: Remove driver_deferred_probe_check_state_continue()
> bec6c0e pinctrl: Remove use of driver_deferred_probe_check_state_continue()
> e2cec7d driver core: Set deferred_probe_timeout to a longer default if 
> CONFIG_MODULES is set
> c8c43ce driver core: Fix driver_deferred_probe_check_state() logic
> ---
>
> Before these patches, on my environment [1], some device drivers
> which has iommus property output the following message when probing:
>
> [3.05] ravb e680.ethernet: ignoring dependency for device, 
> assuming no driver
> [3.257174] ravb e680.ethernet eth0: Base address at 0xe680, 
> 2e:09:0a:02:eb:2d, IRQ 117.
>
> So, since ravb driver is probed within 4 seconds, we can use NFS rootfs 
> correctly.
>
> However, after these patches are applied, since of_iommu_configure() now
> always waits for 30 seconds when the IOMMU hardware is disabled,
> drivers/base/dd.c outputs a WARN. Also, since ravb cannot be probed for
> 30 seconds, we cannot use the NFS rootfs anymore.
> JFYI, I copied the kernel log to the end of this email.

Hey,
  Terribly sorry for the trouble. So as Robin mentioned I have a patch
to remove the WARN messages, but I'm a bit more concerned about why
after the 30 second delay, the ethernet driver loads:
  [   36.218666] ravb e680.ethernet eth0: Base address at
0xe680, 2e:09:0a:02:eb:2d, IRQ 117.
but NFS fails.

Is it just that the 30 second delay is too long and NFS gives up?

Does booting with deferred_probe_timeout=0 work?

thanks
-john


Re: [PATCH 00/10] IOASID extensions for guest SVA

2020-04-02 Thread Jacob Pan
On Thu, 2 Apr 2020 14:26:33 +0200
Jean-Philippe Brucker  wrote:

> On Wed, Apr 01, 2020 at 04:38:42PM -0700, Jacob Pan wrote:
> > On Wed, 1 Apr 2020 16:03:01 +0200
> > Jean-Philippe Brucker  wrote:
> >   
> > > Hi Jacob,
> > > 
> > > On Wed, Mar 25, 2020 at 10:55:21AM -0700, Jacob Pan wrote:  
> > > > IOASID was introduced in v5.5 as a generic kernel allocator
> > > > service for both PCIe Process Address Space ID (PASID) and ARM
> > > > SMMU's Sub Stream ID. In addition to basic ID allocation,
> > > > ioasid_set was introduced as a token that is shared by a group
> > > > of IOASIDs. This set token can be used for permission checking
> > > > but lacks some features needed by guest Shared Virtual
> > > > Address (SVA). In addition, IOASID support for life cycle
> > > > management is needed among multiple users.
> > > > 
> > > > This patchset introduces two extensions to the IOASID code,
> > > > 1. IOASID set operations
> > > > 2. Notifications for IOASID state synchronization
> > > 
> > > My main concern with this series is patch 7 changing the spinlock
> > > to a mutex, which prevents SVA from calling ioasid_free() from
> > > the RCU callback of MMU notifiers. Could we use atomic notifiers,
> > > or do the FREE notification another way?
> > >   
> > Maybe I am looking at the wrong code, I thought
> > mmu_notifier_ops.free_notifier() is called outside spinlock with
> > call_srcu(), which will be invoked in the thread context.
> > in mmu_notifier.c mmu_notifier_put()
> > spin_unlock(&mm->notifier_subscriptions->lock);
> > 
> > call_srcu(&srcu, &subscription->rcu,
> > mmu_notifier_free_rcu);  
> 
> free_notifier() is called from RCU callback, and according to
> Documentation/RCU/checklist.txt:
> 
> 5.  If call_rcu() or call_srcu() is used, the callback function
> will be called from softirq context.  In particular, it cannot block.
> 
> When applying the patch I get the sleep-in-atomic warning:
> 
> [   87.861793] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:935 [   87.863293] in_atomic(): 1,
> irqs_disabled(): 0, non_block: 0, pid: 74, name: kworker/6:1
> [   87.863993] 2 locks held by kworker/6:1/74: [   87.864493]  #0:
> ff885ac12538 ((wq_completion)rcu_gp){+.+.}-{0:0}, at:
> process_one_work+0x740/0x1880 [   87.865593]  #1: ff88591efd30
> ((work_completion)(>work)){+.+.}-{0:0}, at:
> process_one_work+0x740/0x1880 [   87.866993] CPU: 6 PID: 74 Comm:
> kworker/6:1 Not tainted 5.6.0-next-20200331+ #121 [   87.867393]
> Hardware name: FVP Base (DT) [   87.867893] Workqueue: rcu_gp
> srcu_invoke_callbacks [   87.868393] Call trace: [   87.868793]
> dump_backtrace+0x0/0x310 [   87.869293]  show_stack+0x14/0x20
> [   87.869693]  dump_stack+0x124/0x180 [   87.870193]
> ___might_sleep+0x2ac/0x428 [   87.870693]  __might_sleep+0x88/0x168
> [   87.871094]  __mutex_lock+0xa0/0x1270
> [   87.871593]  mutex_lock_nested+0x1c/0x28
> [   87.872093]  ioasid_free+0x28/0x48
> [   87.872493]  io_mm_free+0x1d0/0x608
> [   87.872993]  mmu_notifier_free_rcu+0x74/0xe8
> [   87.873393]  srcu_invoke_callbacks+0x1d0/0x2c8
> [   87.873893]  process_one_work+0x858/0x1880
> [   87.874393]  worker_thread+0x314/0xcd0
> [   87.874793]  kthread+0x318/0x400
> [   87.875293]  ret_from_fork+0x10/0x18
> 
You are right, I was reading call_srcu comments too fast. I guess
rcu callbacks are still in softirq not offloaded to kernel threads.

 *
 * The callback will be invoked from process context, but must
 *  nevertheless be fast and must not block.
 */
So even an atomic notifier works in principle, but it is not a good idea
since the callback may take a long time.

> > 
> > Anyway, if we have to use atomic. I tried atomic notifier first,
> > there are two subscribers to the free event on x86.
> > 1. IOMMU
> > 2. KVM
> > 
> > For #1, the problem is that in the free operation, VT-d driver
> > needs to do a lot of clean up in thread context.
> > - hold a mutex to traverse a list of devices
> > - clear PASID entry and flush cache
> > 
> > For #2, KVM might be able to deal with spinlocks for updating VMCS
> > PASID translation table. +Hao
> > 
> > Perhaps two solutions I can think of:
> > 1. Use a cyclic IOASID allocator. The main reason for cleaning up at
> > free is to prevent a race with IOASID alloc. Similar to PIDs, 2M IOASIDs
> > will take a long time to overflow. Then we can use an atomic notifier
> > and a deferred workqueue to do the IOMMU cleanup. The downside is a
> > large and growing PASID table; this may not be a performance issue
> > since it has a TLB.  
> 
> That might be a problem for SMMU, which has 1024 * 64kB leaf PASID
> tables, for a total of 64MB per endpoint if there is too much
> fragmentation in the IOASID space.
> 
OK. Not an option here :(
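For reference, the cyclic allocation scheme in option 1 can be sketched in userspace C (hypothetical `fake_*` names, loosely mirroring the idr_alloc_cyclic idea): a cursor advances past each allocated ID, so a freed ID is not handed out again until the cursor wraps — which is exactly the delayed-reuse property that makes deferred cleanup safe, and also the source of the table fragmentation objected to above.

```c
#include <stdbool.h>

#define FAKE_MAX_IDS 8

struct fake_cyclic {
    bool used[FAKE_MAX_IDS];
    int next;               /* cursor: where the next search starts */
};

/* Hand out the first free ID at or after the cursor, wrapping at the end. */
static int fake_cyclic_alloc(struct fake_cyclic *a)
{
    for (int i = 0; i < FAKE_MAX_IDS; i++) {
        int id = (a->next + i) % FAKE_MAX_IDS;
        if (!a->used[id]) {
            a->used[id] = true;
            a->next = (id + 1) % FAKE_MAX_IDS;
            return id;
        }
    }
    return -1;              /* ID space exhausted */
}

static void fake_cyclic_free(struct fake_cyclic *a, int id)
{
    a->used[id] = false;    /* not reused until the cursor wraps around */
}
```

With a real 2M-entry PASID space the cursor wraps so rarely that racing alloc against a deferred free becomes a non-issue, at the cost of the sparse, ever-advancing ID range noted for SMMU.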

> > 2. Let VFIO ensure free always happens after unbind. Then there is no
> > need to do cleanup. But that requires VFIO to keep track of all the
> > PASIDs within each VM. When the VM terminates, VFIO is responsible
> > for the clean up. That was Yi's original proposal. I also tried to
> > provide an 

Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-04-02 Thread Jacob Pan
On Wed, 1 Apr 2020 09:32:37 +0200
Auger Eric  wrote:

> > I didn't read through all comments. Here is a concern with this 2-D
> > table: the iommu cache type is defined as below, and I suppose there is
> > a problem here. If I'm using IOMMU_CACHE_INV_TYPE_PASID, it will
> > go beyond the 2-D table.
> > 
> > /* IOMMU paging structure cache */
> > #define IOMMU_CACHE_INV_TYPE_IOTLB  (1 << 0) /* IOMMU IOTLB */
> > #define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
> > #define IOMMU_CACHE_INV_TYPE_PASID  (1 << 2) /* PASID cache */
> > #define IOMMU_CACHE_INV_TYPE_NR (3)  
> oups indeed

I think it is not an issue, since we use the bit position, not the raw
cache type value, as the index into the 2-D array. Right?

for_each_set_bit(cache_type, 

 ret = to_vtd_granularity(cache_type, inv_info->granularity, &

static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
{

*vtd_granu = inv_type_granu_table[type][granu];
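The concern above and Jacob's answer hinge on how the 2-D table is indexed. A self-contained sketch of the bit-position indexing (the table values below are illustrative, not the real VT-d granularity encodings):

```c
#include <assert.h>

/* Cache type flags as defined in the uapi header quoted above. */
#define IOMMU_CACHE_INV_TYPE_IOTLB     (1 << 0) /* IOMMU IOTLB */
#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB (1 << 1) /* Device IOTLB */
#define IOMMU_CACHE_INV_TYPE_PASID     (1 << 2) /* PASID cache */
#define IOMMU_CACHE_INV_TYPE_NR        (3)
#define NR_GRANU 3

/* Toy stand-in for the VT-d 2-D granularity table. */
static const int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][NR_GRANU] = {
	{ 0, 1, 2 }, /* row 0: IOTLB */
	{ 0, 1, 2 }, /* row 1: device IOTLB */
	{ 0, 1, 0 }, /* row 2: PASID cache */
};

/* Mimics the for_each_set_bit() loop: the row index is the *bit position*
 * of each set flag, so IOMMU_CACHE_INV_TYPE_PASID (value 4, bit 2) maps
 * to row 2, which stays inside the NR=3 table. */
static int to_granu(unsigned long cache_bitmap, int granu, int *out, int max)
{
	int bit, n = 0;

	for (bit = 0; bit < IOMMU_CACHE_INV_TYPE_NR && n < max; bit++)
		if (cache_bitmap & (1UL << bit))
			out[n++] = inv_type_granu_table[bit][granu];
	return n;
}
```

Since the loop walks bit positions bounded by IOMMU_CACHE_INV_TYPE_NR, the raw flag value 4 never reaches the array subscript, which is why the lookup cannot run off the end of the table.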
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH] iommu: Fix the memory leak in dev_iommu_free()

2020-04-02 Thread Kevin Hao
In iommu_probe_device(), we would invoke dev_iommu_free() to free the
dev->iommu after the ->add_device() returns failure. But after commit
72acd9df18f1 ("iommu: Move iommu_fwspec to struct dev_iommu"), we also
need to free the iommu_fwspec before the dev->iommu is freed. This fixes
the following memory leak reported by kmemleak:
  unreferenced object 0x000bc836c700 (size 128):
comm "swapper/0", pid 1, jiffies 4294896304 (age 782.120s)
hex dump (first 32 bytes):
  00 00 00 00 00 00 00 00 d8 cd 9b ff 0b 00 ff ff  
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
backtrace:
  [] kmem_cache_alloc_trace+0x244/0x4b0
  [<0e560ac0>] iommu_fwspec_init+0x7c/0xb0
  [<75eda275>] of_iommu_xlate+0x80/0xe8
  [<728d6bf9>] of_pci_iommu_init+0xb0/0xb8
  [] pci_for_each_dma_alias+0x48/0x190
  [<6db6bbce>] of_iommu_configure+0x1ac/0x1d0
  [<634745f8>] of_dma_configure+0xdc/0x220
  [<2cbc8ba0>] pci_dma_configure+0x50/0x78
  [] really_probe+0x8c/0x340
  [] driver_probe_device+0x60/0xf8
  [<61bcdb51>] __device_attach_driver+0x8c/0xd0
  [<9b9ff58e>] bus_for_each_drv+0x80/0xd0
  [<4b9c8aa3>] __device_attach+0xec/0x148
  [] device_attach+0x1c/0x28
  [<5071e151>] pci_bus_add_device+0x58/0xd0
  [<2d4f87d1>] pci_bus_add_devices+0x40/0x90

Fixes: 72acd9df18f1 ("iommu: Move iommu_fwspec to struct dev_iommu")
Signed-off-by: Kevin Hao 
---
 drivers/iommu/iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2b471419e26c..54757c404866 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -170,6 +170,7 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
 
 static void dev_iommu_free(struct device *dev)
 {
+   iommu_fwspec_free(dev);
kfree(dev->iommu);
dev->iommu = NULL;
 }
-- 
2.25.1.377.g2d2118b814c1
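The ordering this patch establishes — release the nested fwspec before freeing the containing dev_iommu — can be sketched in a standalone way. The structs and counter below are toys standing in for the kernel objects and for kmemleak, not the real definitions:

```c
#include <assert.h>
#include <stdlib.h>

/* Counter standing in for kmemleak's view of outstanding allocations. */
static int live_allocs;

struct fwspec    { int dummy; };
struct dev_iommu { struct fwspec *fwspec; };
struct device    { struct dev_iommu *iommu; };

static void *counted_alloc(size_t n) { live_allocs++; return calloc(1, n); }
static void counted_free(void *p)    { if (p) { live_allocs--; free(p); } }

static void dev_iommu_setup(struct device *dev)
{
	dev->iommu = counted_alloc(sizeof(*dev->iommu));
	dev->iommu->fwspec = counted_alloc(sizeof(*dev->iommu->fwspec));
}

static void dev_iommu_free(struct device *dev)
{
	counted_free(dev->iommu->fwspec); /* the added iommu_fwspec_free() step */
	counted_free(dev->iommu);         /* then the container, as before */
	dev->iommu = NULL;
}
```

Without the first counted_free() the fwspec allocation would remain live after the container is gone, which is exactly the leak kmemleak reported.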



[PATCH] iommu/qcom:fix local_base status check

2020-04-02 Thread Tang Bin
Release resources when exiting on error.

Signed-off-by: Tang Bin 
---
 drivers/iommu/qcom_iommu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/qcom_iommu.c b/drivers/iommu/qcom_iommu.c
index 4328da0b0..c08aa9651 100644
--- a/drivers/iommu/qcom_iommu.c
+++ b/drivers/iommu/qcom_iommu.c
@@ -813,8 +813,11 @@ static int qcom_iommu_device_probe(struct platform_device 
*pdev)
qcom_iommu->dev = dev;
 
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-   if (res)
+   if (res) {
qcom_iommu->local_base = devm_ioremap_resource(dev, res);
+   if (IS_ERR(qcom_iommu->local_base))
+   return PTR_ERR(qcom_iommu->local_base);
+   }
 
qcom_iommu->iface_clk = devm_clk_get(dev, "iface");
if (IS_ERR(qcom_iommu->iface_clk)) {
-- 
2.20.1.windows.1
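The control flow the patch above fixes — treating a missing resource as optional but a failed mapping as fatal — can be sketched with a userspace mock of the kernel's ERR_PTR machinery (the macros below approximate include/linux/err.h; the probe function is a toy, not the driver's):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace mock of the kernel's error-pointer encoding: error codes
 * live in the top 4095 pointer values. */
#define MAX_ERRNO 4095
#define ERR_PTR(err) ((void *)(intptr_t)(err))
#define PTR_ERR(ptr) ((long)(intptr_t)(ptr))
#define IS_ERR(ptr)  ((uintptr_t)(ptr) >= (uintptr_t)-MAX_ERRNO)

/* Sketch of the fixed control flow: absence of the memory resource is
 * fine, but a mapping that returned an error pointer aborts the probe. */
static long probe_local_base(void *res, void *mapped, void **base_out)
{
	if (res) {
		if (IS_ERR(mapped))
			return PTR_ERR(mapped);
		*base_out = mapped;
	}
	return 0;
}
```

The bug the patch closes is exactly the missing IS_ERR() branch: devm_ioremap_resource() never returns NULL on failure, so storing its return value unchecked leaves an error pointer where a mapping is expected.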





Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-02 Thread Jean-Philippe Brucker
Hi Yi,

On Sun, Mar 22, 2020 at 05:31:58AM -0700, Liu, Yi L wrote:
> From: Liu Yi L 
> 
> For a long time, devices have had only one DMA address space from the
> platform IOMMU's point of view. This is true both on bare metal and for
> direct device assignment in virtualization. The reason is that the
> source ID of a DMA request in PCIe is the BDF (bus/device/function ID),
> which only allows device-granularity DMA isolation. However, this is
> changing with the latest advancements in I/O technology. More and more
> platform vendors are utilizing the PCIe PASID TLP prefix in DMA
> requests, giving devices multiple DMA address spaces identified by
> their individual PASIDs. For example, Shared Virtual Addressing (SVA,
> a.k.a. Shared Virtual Memory) lets a device access multiple process
> virtual address spaces by binding each address space to a PASID. The
> PASID is allocated in software and programmed into the device in a
> device-specific manner. Devices which support the PASID capability are
> called PASID-capable devices. If such devices are passed through to
> VMs, guest software is also able to bind guest process virtual address
> spaces on them. The guest can therefore reuse the bare metal
> programming model, which means guest software will also allocate a
> PASID and program it into the device directly. This is a dangerous
> situation since it allows PASID conflicts and unauthorized address
> space access.

It's worth noting that this applies to Intel VT-d with scalable mode, not
to IOMMUs that use one PASID space per VM.

> It would be safer to
> let the host intercept the guest software's PASID allocation, so that
> PASIDs are managed system-wide.
> 
> This patch adds a VFIO_IOMMU_PASID_REQUEST ioctl which aims to pass
> down PASID allocation/free requests from the virtual IOMMU. Since such
> requests are intended to be invoked by QEMU or other applications
> running in userspace, it is necessary to have a mechanism to prevent a
> single application from abusing the available PASIDs in the system.
> With that consideration, this patch tracks VFIO PASID allocation
> per-VM. There was a discussion about making the quota per assigned
> device, e.g. if a VM has many assigned devices, then it should get a
> larger quota. However, it is not clear how many PASIDs an assigned
> device will use; e.g. it is possible that a VM with multiple assigned
> devices requests few PASIDs. Therefore a per-VM quota is better.
> 
> This patch uses the struct mm pointer as a per-VM token. We also
> considered using the task structure pointer and the vfio_iommu
> structure pointer. However, the task structure is per-thread, so it
> cannot serve the per-VM PASID tracking purpose, while the vfio_iommu
> structure is visible only within vfio. Therefore the struct mm pointer
> was selected. This patch adds a structure vfio_mm. A vfio_mm is
> created when the first vfio container is opened by a VM, and is freed
> when the last vfio container is released. Each VM is assigned a PASID
> quota, so that it cannot request PASIDs beyond its quota. This patch
> adds a default quota of 1000. The quota can be tuned by the
> administrator; making the PASID quota tunable is added in another
> patch in this series.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Yi Sun 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio.c | 130 
> 
>  drivers/vfio/vfio_iommu_type1.c | 104 
>  include/linux/vfio.h|  20 +++
>  include/uapi/linux/vfio.h   |  41 +
>  4 files changed, 295 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DRIVER_VERSION   "0.3"
>  #define DRIVER_AUTHOR "Alex Williamson "
> @@ -46,6 +47,8 @@ static struct vfio {
>   struct mutexgroup_lock;
>   struct cdev group_cdev;
>   dev_t   group_devt;
> + struct list_headvfio_mm_list;
> + struct mutexvfio_mm_lock;
>   wait_queue_head_t   release_q;
>  } vfio;
>  
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum 
> vfio_notify_type type,
>  EXPORT_SYMBOL(vfio_unregister_notifier);
>  
>  /**
> + * VFIO_MM objects - create, release, get, put, search
> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> + struct vfio_mm *vmm;
> + struct vfio_mm_token *token;
> + int 
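The per-VM quota bookkeeping described in the commit message above can be sketched as follows. This is a hypothetical toy, not the vfio_mm structure from the patch:

```c
#include <assert.h>
#include <errno.h>

/* Default quota from the commit message; tunable in a later patch. */
#define VFIO_DEFAULT_PASID_QUOTA 1000

/* Toy per-VM token: one of these exists per VM (keyed by mm in the
 * real series), so all containers of a VM share one quota. */
struct vfio_mm {
	int pasid_quota; /* max PASIDs this VM may hold */
	int pasid_count; /* PASIDs currently allocated  */
};

static int vfio_mm_pasid_alloc(struct vfio_mm *vmm)
{
	if (vmm->pasid_count >= vmm->pasid_quota)
		return -ENOSPC; /* quota exhausted */
	vmm->pasid_count++;
	return 0;
}

static void vfio_mm_pasid_free(struct vfio_mm *vmm)
{
	if (vmm->pasid_count > 0)
		vmm->pasid_count--;
}
```

Keying the counter per VM rather than per device is exactly the design trade-off the commit message discusses: a device-level quota would need to guess how many PASIDs each assigned device consumes.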

Re: How to fix WARN from drivers/base/dd.c in next-20200401 if CONFIG_MODULES=y?

2020-04-02 Thread Robin Murphy

On 2020-04-02 11:16 am, Yoshihiro Shimoda wrote:

Hi John,

I found an issue after applied the following patches:
---
64c775f driver core: Rename deferred_probe_timeout and make it global
0e9f8d0 driver core: Remove driver_deferred_probe_check_state_continue()
bec6c0e pinctrl: Remove use of driver_deferred_probe_check_state_continue()
e2cec7d driver core: Set deferred_probe_timeout to a longer default if 
CONFIG_MODULES is set
c8c43ce driver core: Fix driver_deferred_probe_check_state() logic
---

Before these patches, on my environment [1], some device drivers
which have an iommus property output the following message when probing:

[3.05] ravb e680.ethernet: ignoring dependency for device, assuming 
no driver
[3.257174] ravb e680.ethernet eth0: Base address at 0xe680, 
2e:09:0a:02:eb:2d, IRQ 117.

So, since ravb driver is probed within 4 seconds, we can use NFS rootfs 
correctly.

However, after these patches are applied, drivers/base/dd.c outputs a
WARN because the patches always wait 30 seconds in of_iommu_configure()
when the IOMMU hardware is disabled. Also, since ravb cannot be probed
for 30 seconds, we cannot use the NFS rootfs anymore.
JFYI, I copied the kernel log to the end of this email.

I guess the patches will be merged into v5.7-rc1 because they have been
in linux-next since next-20200316, so I'd like to fix the issue somehow
in the v5.7-rcN cycle.


This already came up in a different context, and there's a proposal from 
John here:


https://lore.kernel.org/lkml/20200330202715.86609-1-john.stu...@linaro.org/

Robin.


Re: [PATCH 00/10] IOASID extensions for guest SVA

2020-04-02 Thread Jean-Philippe Brucker
On Wed, Apr 01, 2020 at 04:38:42PM -0700, Jacob Pan wrote:
> On Wed, 1 Apr 2020 16:03:01 +0200
> Jean-Philippe Brucker  wrote:
> 
> > Hi Jacob,
> > 
> > On Wed, Mar 25, 2020 at 10:55:21AM -0700, Jacob Pan wrote:
> > > IOASID was introduced in v5.5 as a generic kernel allocator service
> > > for both PCIe Process Address Space ID (PASID) and ARM SMMU's Sub
> > > Stream ID. In addition to basic ID allocation, ioasid_set was
> > > introduced as a token that is shared by a group of IOASIDs. This
> > > set token can be used for permission checking but lack of some
> > > features needed by guest Shared Virtual Address (SVA). In addition,
> > > IOASID support for life cycle management is needed among multiple
> > > users.
> > > 
> > > This patchset introduces two extensions to the IOASID code,
> > > 1. IOASID set operations
> > > 2. Notifications for IOASID state synchronization  
> > 
> > My main concern with this series is patch 7 changing the spinlock to a
> > mutex, which prevents SVA from calling ioasid_free() from the RCU
> > callback of MMU notifiers. Could we use atomic notifiers, or do the
> > FREE notification another way?
> > 
> Maybe I am looking at the wrong code, I thought
> mmu_notifier_ops.free_notifier() is called outside spinlock with
> call_srcu(), which will be invoked in the thread context.
> in mmu_notifier.c mmu_notifier_put()
>   spin_unlock(&mm->notifier_subscriptions->lock);
> 
>   call_srcu(&srcu, &subscription->rcu, mmu_notifier_free_rcu);

free_notifier() is called from RCU callback, and according to
Documentation/RCU/checklist.txt:

5.  If call_rcu() or call_srcu() is used, the callback function will
be called from softirq context.  In particular, it cannot block.

When applying the patch I get the sleep-in-atomic warning:

[   87.861793] BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:935
[   87.863293] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 74, name: 
kworker/6:1
[   87.863993] 2 locks held by kworker/6:1/74:
[   87.864493]  #0: ff885ac12538 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: 
process_one_work+0x740/0x1880
[   87.865593]  #1: ff88591efd30 
((work_completion)(>work)){+.+.}-{0:0}, at: process_one_work+0x740/0x1880
[   87.866993] CPU: 6 PID: 74 Comm: kworker/6:1 Not tainted 
5.6.0-next-20200331+ #121
[   87.867393] Hardware name: FVP Base (DT)
[   87.867893] Workqueue: rcu_gp srcu_invoke_callbacks
[   87.868393] Call trace:
[   87.868793]  dump_backtrace+0x0/0x310
[   87.869293]  show_stack+0x14/0x20
[   87.869693]  dump_stack+0x124/0x180
[   87.870193]  ___might_sleep+0x2ac/0x428
[   87.870693]  __might_sleep+0x88/0x168
[   87.871094]  __mutex_lock+0xa0/0x1270
[   87.871593]  mutex_lock_nested+0x1c/0x28
[   87.872093]  ioasid_free+0x28/0x48
[   87.872493]  io_mm_free+0x1d0/0x608
[   87.872993]  mmu_notifier_free_rcu+0x74/0xe8
[   87.873393]  srcu_invoke_callbacks+0x1d0/0x2c8
[   87.873893]  process_one_work+0x858/0x1880
[   87.874393]  worker_thread+0x314/0xcd0
[   87.874793]  kthread+0x318/0x400
[   87.875293]  ret_from_fork+0x10/0x18

> 
> Anyway, if we have to use atomic. I tried atomic notifier first, there
> are two subscribers to the free event on x86.
> 1. IOMMU
> 2. KVM
> 
> For #1, the problem is that in the free operation, VT-d driver
> needs to do a lot of clean up in thread context.
> - hold a mutex to traverse a list of devices
> - clear PASID entry and flush cache
> 
> For #2, KVM might be able to deal with spinlocks for updating VMCS
> PASID translation table. +Hao
> 
> Perhaps two solutions I can think of:
> 1. Use a cyclic IOASID allocator. The main reason for cleanup at free
> is to prevent races with IOASID alloc. Similar to PIDs, 2M IOASIDs
> will take a long time to overflow. Then we can use an atomic notifier
> and a deferred workqueue to do the IOMMU cleanup. The downside is a
> large and growing PASID table, which may not be a performance issue
> since it has a TLB.

That might be a problem for SMMU, which has 1024 * 64kB leaf PASID tables,
for a total of 64MB per endpoint if there is too much fragmentation in
the IOASID space.

> 2. Let VFIO ensure that free always happens after unbind. Then there
> is no need to do cleanup. But that requires VFIO to keep track of all
> the PASIDs within each VM. When the VM terminates, VFIO is responsible
> for the cleanup. That was Yi's original proposal. I also tried to
> provide an IOASID set iterator for VFIO to free the IOASIDs within
> each VM/set, but the private data belongs to the IOMMU driver.

Not really my place to comment on this, but I find it nicer to use the
same gpasid_unbind() path when VFIO frees a PASID as when the guest
explicitly unbinds before freeing. 

Thanks,
Jean
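Option 1 in the quoted message (a cyclic allocator) can be sketched as a toy. Unlike a real allocator it omits in-use tracking, but it shows why a freed ID is not immediately reused, which is what reduces the race window between free and a subsequent alloc:

```c
#include <assert.h>

/* Toy cyclic ID allocator: IDs are handed out in increasing order and
 * only wrap at the end of the range, so a just-freed ID cannot collide
 * with a fresh allocation until the whole space (2M PASIDs in practice)
 * wraps around. */
struct cyclic_ida {
	unsigned int min, max; /* inclusive range */
	unsigned int next;     /* next ID to hand out */
};

static unsigned int cyclic_alloc(struct cyclic_ida *ida)
{
	unsigned int id = ida->next;

	ida->next = (id == ida->max) ? ida->min : id + 1;
	return id;
}
```

A production version would additionally track which IDs are still live (as the kernel's cyclic IDA helpers do) so that a wrapped cursor skips over IDs that were never freed.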



How to fix WARN from drivers/base/dd.c in next-20200401 if CONFIG_MODULES=y?

2020-04-02 Thread Yoshihiro Shimoda
Hi John,

I found an issue after applied the following patches:
---
64c775f driver core: Rename deferred_probe_timeout and make it global
0e9f8d0 driver core: Remove driver_deferred_probe_check_state_continue()
bec6c0e pinctrl: Remove use of driver_deferred_probe_check_state_continue()
e2cec7d driver core: Set deferred_probe_timeout to a longer default if 
CONFIG_MODULES is set
c8c43ce driver core: Fix driver_deferred_probe_check_state() logic
---

Before these patches, on my environment [1], some device drivers
which have an iommus property output the following message when probing:

[3.05] ravb e680.ethernet: ignoring dependency for device, assuming 
no driver
[3.257174] ravb e680.ethernet eth0: Base address at 0xe680, 
2e:09:0a:02:eb:2d, IRQ 117.

So, since ravb driver is probed within 4 seconds, we can use NFS rootfs 
correctly.

However, after these patches are applied, drivers/base/dd.c outputs a
WARN because the patches always wait 30 seconds in of_iommu_configure()
when the IOMMU hardware is disabled. Also, since ravb cannot be probed
for 30 seconds, we cannot use the NFS rootfs anymore.
JFYI, I copied the kernel log to the end of this email.

I guess the patches will be merged into v5.7-rc1 because they have been
in linux-next since next-20200316, so I'd like to fix the issue somehow
in the v5.7-rcN cycle.

[1]
- R-Car H3 (r8a77951).
- Using defconfig of arch/arm64.
-- So, the IOMMU hardware of the environment is not enabled.

Best regards,
Yoshihiro Shimoda

--- log ---
[0.000162] NOTICE:  BL2: R-Car Gen3 Initial Program Loader(CA57) Rev.1.0.21
[0.005735] NOTICE:  BL2: PRR is R-Car H3 Ver.3.0
[0.010412] NOTICE:  BL2: Board is Salvator-XS Rev.1.0
[0.015532] NOTICE:  BL2: Boot device is HyperFlash(160MHz)
[0.021054] NOTICE:  BL2: LCM state is CM
[0.025100] NOTICE:  BL2: AVS setting succeeded. DVFS_SetVID=0x53
[0.031085] NOTICE:  BL2: CH0: 0x4 - 0x48000, 2 GiB
[0.036956] NOTICE:  BL2: CH1: 0x5 - 0x58000, 2 GiB
[0.042840] NOTICE:  BL2: CH2: 0x6 - 0x68000, 2 GiB
[0.048725] NOTICE:  BL2: CH3: 0x7 - 0x78000, 2 GiB
[0.054650] NOTICE:  BL2: DDR3200(rev.0.33)NOTICE:  [COLD_BOOT]NOTICE:  ..0
[0.100344] NOTICE:  BL2: DRAM Split is 4ch(DDR f)
[0.104839] NOTICE:  BL2: QoS is default setting(rev.0.07)
[0.110291] NOTICE:  BL2: DRAM refresh interval 1.95 usec
[0.115720] NOTICE:  BL2: v1.4(release):15dba6b
[0.120156] NOTICE:  BL2: Built : 02:55:27, Sep  4 2018
[0.125352] NOTICE:  BL2: Normal boot
[0.129004] NOTICE:  BL2: dst=0xe6322d00 src=0x818 len=512(0x200)
[0.135395] NOTICE:  BL2: dst=0x43f0 src=0x8180400 len=6144(0x1800)
[0.142003] NOTICE:  BL2: dst=0x4400 src=0x81c len=65536(0x1)
[0.149012] NOTICE:  BL2: dst=0x4410 src=0x820 len=1048576(0x10)
[0.159988] NOTICE:  BL2: dst=0x5000 src=0x864 len=1048576(0x10)


U-Boot 2015.04 (Sep 04 2018 - 10:09:18)

CPU: Renesas Electronics R8A7795 rev 3.0
Board: Salvator-X
I2C:   ready
DRAM:  7.9 GiB
Bank #0: 0x04800 - 0x0bfff, 1.9 GiB
Bank #1: 0x5 - 0x57fff, 2 GiB
Bank #2: 0x6 - 0x67fff, 2 GiB
Bank #3: 0x7 - 0x77fff, 2 GiB

MMC:   sh-sdhi: 0, sh-sdhi: 1, sh-sdhi: 2
In:serial
Out:   serial
Err:   serial
Net:   ravb
Hit any key to stop autoboot:  2 1 0 
ravb Waiting for PHY auto negotiation to complete. done
ravb: 1000Base/Full
Using ravb device
TFTP from server 192.168.10.111; our IP address is 192.168.10.224
Filename 'lava/775/tftp-deploy-8205xjm0/kernel/Image'.
Load address: 0x4808
Loading: * #

Re: [EXT] Re: [RFC PATCH v2] iommu/virtio: Use page size bitmap supported by endpoint

2020-04-02 Thread Jean-Philippe Brucker
On Thu, Apr 02, 2020 at 03:53:40AM +, Bharat Bhushan wrote:
> > > > +struct virtio_iommu_probe_pgsize_mask {
> > > > +   struct virtio_iommu_probe_property  head;
> > > > +   __u8reserved[4];
> > > > +   __u64   pgsize_bitmap;
> > 
> > Should be __le64
> 
> Based on the "iommu/virtio: Fix sparse warning" patch
> https://www.spinics.net/lists/linux-virtualization/msg41944.html I
> changed this to __u64 (not __le64).

Yes that one was only for the virtio config struct, to play nice with
other devices. We should still use __le for the other structures,
including probe properties.

Thanks,
Jean

> 
> Will keep __le64.
> 
> Thanks
> -Bharat
> 
> > 
> > Thanks,
> > Jean
> > 
> > > > +};
> > > > +
> > > >   #define VIRTIO_IOMMU_RESV_MEM_T_RESERVED  0
> > > >   #define VIRTIO_IOMMU_RESV_MEM_T_MSI   1
> > > >
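The reason __le64 matters for structures shared with the device is that the byte layout on the wire must not depend on host endianness. A portable stand-in for the serialization that cpu_to_le64() implies:

```c
#include <assert.h>
#include <stdint.h>

/* Serialize a 64-bit value into an explicit little-endian byte buffer.
 * On a little-endian host this matches memcpy(); on a big-endian host
 * it performs the byte swap that __le64 fields require. */
static void put_le64(uint8_t out[8], uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i)); /* byte 0 = least significant */
}
```

Declaring the field as plain __u64 would compile identically but lose the sparse annotation that catches a missing swap on big-endian hosts, which is why the probe properties keep __le64.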


RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-04-02 Thread Liu, Yi L
> From: Tian, Kevin 
> Sent: Thursday, April 2, 2020 10:12 AM
> To: Liu, Yi L ; alex.william...@redhat.com;
> Subject: RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> 
> > From: Liu, Yi L 
> > Sent: Wednesday, April 1, 2020 5:13 PM
> >
> > > From: Tian, Kevin 
> > > Sent: Monday, March 30, 2020 8:46 PM
> > > Subject: RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to
> > > host
> > >
> > > > From: Liu, Yi L 
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L 
> > > >
> > > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > > > hardware IOMMUs that have nesting DMA translation (a.k.a. dual
> > > > stage address translation). For such hardware IOMMUs, there are
> > > > two stages/levels of address translation, and software may let
> > > > userspace/VM own the first-level/stage-1 translation structures.
> > > > An example of such usage is vSVA (virtual Shared Virtual
> > > > Addressing). The VM owns the first-level/stage-1 translation
> > > > structures and binds the structures to the host, then the
> > > > hardware IOMMU uses nesting translation when doing DMA
> > > > translation for the devices behind such a hardware IOMMU.
> > > > This patch adds vfio support for binding guest translation (a.k.a
> > > > stage 1) structure to host iommu. And for
> > > > VFIO_TYPE1_NESTING_IOMMU, not only bind guest page table is
> > > > needed, it also requires to expose interface to guest for iommu
> > > > cache invalidation when guest modified the first-level/stage-1
> > > > translation structures since hardware needs to be notified to
> > > > flush stale iotlbs. This would be introduced in next patch.
> > > >
> > > > In this patch, guest page table bind and unbind are done by using
> > > > flags VFIO_IOMMU_BIND_GUEST_PGTBL and
> > > VFIO_IOMMU_UNBIND_GUEST_PGTBL
> > > > under IOCTL VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > > > struct iommu_gpasid_bind_data. Before binding guest page table to
> > > > host, VM should have got a PASID allocated by host via
> > > > VFIO_IOMMU_PASID_REQUEST.
> > > >
> > > > Bind guest translation structures (here is guest page table) to
> > > > host
> > >
> > > Bind -> Binding
> > got it.
> > > > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > >
> > > are -> is. and you already explained vSVA earlier.
> > oh yes, it is.
> > > >
> > > > Cc: Kevin Tian 
> > > > CC: Jacob Pan 
> > > > Cc: Alex Williamson 
> > > > Cc: Eric Auger 
> > > > Cc: Jean-Philippe Brucker 
> > > > Signed-off-by: Jean-Philippe Brucker 
> > > > Signed-off-by: Liu Yi L 
> > > > Signed-off-by: Jacob Pan 
> > > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 158
> > > > 
> > > >  include/uapi/linux/vfio.h   |  46 
> > > >  2 files changed, 204 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > b/drivers/vfio/vfio_iommu_type1.c index 82a9e0b..a877747 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
> > > > 
> > > > (!list_empty(&iommu->domain_list))
> > > >
> > > > +struct domain_capsule {
> > > > +   struct iommu_domain *domain;
> > > > +   void *data;
> > > > +};
> > > > +
> > > > +/* iommu->lock must be held */
> > > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > > + int (*fn)(struct device *dev, void *data),
> > > > + void *data)
> > > > +{
> > > > +   struct domain_capsule dc = {.data = data};
> > > > +   struct vfio_domain *d;
> > > > +   struct vfio_group *g;
> > > > +   int ret = 0;
> > > > +
> > > > +   list_for_each_entry(d, &iommu->domain_list, next) {
> > > > +   dc.domain = d->domain;
> > > > +   list_for_each_entry(g, >group_list, next) {
> > > > +   ret = iommu_group_for_each_dev(g->iommu_group,
> > > > +  &dc, fn);
> > > > +   if (ret)
> > > > +   break;
> > > > +   }
> > > > +   }
> > > > +   return ret;
> > > > +}
> > > > +
> > > >  static int put_pfn(unsigned long pfn, int prot);
> > > >
> > > >  /*
> > > > @@ -2314,6 +2341,88 @@ static int
> > > > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > > > return 0;
> > > >  }
> > > >
> > > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data) {
> > > > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > +   struct iommu_gpasid_bind_data *gbind_data =
> > > > +   (struct iommu_gpasid_bind_data *) dc->data;
> > > > +
> > >
> > > In Jacob's vSVA iommu series, [PATCH 06/11]:
> > >
> > > + /* REVISIT: upper layer/VFIO can track host process that 

Re: [PATCH] iommu/vt-d: add NUMA awareness to intel_alloc_coherent()

2020-04-02 Thread Christoph Hellwig
On Wed, Apr 01, 2020 at 03:53:38PM -0700, Eric Dumazet wrote:
> 
> 
> On 2/2/18 10:59 AM, Eric Dumazet wrote:
> > On Fri, Feb 2, 2018 at 10:53 AM, Christoph Hellwig  
> > wrote:
> >> I've got patches pending to replace all that code with
> >> dma_direct_alloc, which will do the right thing.  They were
> >> submitted for 4.16, and I will resend them after -rc1.
> > 
> > I see, thanks Christoph !
> > 
> 
> Hi Christoph 
> 
> It seems 4.16 has shipped ( :) ) , and intel_alloc_coherent() still has no 
> NUMA awareness.

Actually, that code went in and then got reverted again.

> Should I respin https://lore.kernel.org/patchwork/patch/884326/

Maybe.  We are still hoping to convert intel-iommu to the dma-iommu
framework, but I'm not sure how long that is going to take, so maybe
just respin it for now.


Re: [PATCH] iommu/qcom:fix local_base status check

2020-04-02 Thread Bjorn Andersson
On Wed 01 Apr 23:33 PDT 2020, Tang Bin wrote:

> Release resources when exiting on error.
> 

Reviewed-by: Bjorn Andersson 

Regards,
Bjorn

> Signed-off-by: Tang Bin 
> ---
>  drivers/iommu/qcom_iommu.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/qcom_iommu.c b/drivers/iommu/qcom_iommu.c
> index 4328da0b0..c08aa9651 100644
> --- a/drivers/iommu/qcom_iommu.c
> +++ b/drivers/iommu/qcom_iommu.c
> @@ -813,8 +813,11 @@ static int qcom_iommu_device_probe(struct 
> platform_device *pdev)
>   qcom_iommu->dev = dev;
>  
>   res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> - if (res)
> + if (res) {
>   qcom_iommu->local_base = devm_ioremap_resource(dev, res);
> + if (IS_ERR(qcom_iommu->local_base))
> + return PTR_ERR(qcom_iommu->local_base);
> + }
>  
>   qcom_iommu->iface_clk = devm_clk_get(dev, "iface");
>   if (IS_ERR(qcom_iommu->iface_clk)) {
> -- 
> 2.20.1.windows.1
> 
> 
> 