Re: [tip: x86/urgent] x86/dma: Tear down DMA ops on driver unbind

2021-04-19 Thread Jean-Philippe Brucker
On Sat, Apr 17, 2021 at 02:06:44PM +0200, Borislav Petkov wrote:
> Nope, sorry, no joy. Zapping it from tip.
> 
> With that patch, it fails booting on my test box with messages like
> (typing up from video I took):
> 
> ...
> ata: softreset failed (1st FIS failed)
> ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=...]
> ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=...]

Sorry about that, I only tested under QEMU. I'll try to reproduce this on
an AMD laptop.

Thanks,
Jean


[tip: x86/urgent] x86/dma: Tear down DMA ops on driver unbind

2021-04-15 Thread tip-bot2 for Jean-Philippe Brucker
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 9f8614f5567eb4e38579422d38a1bdfeeb648ffc
Gitweb:
https://git.kernel.org/tip/9f8614f5567eb4e38579422d38a1bdfeeb648ffc
Author: Jean-Philippe Brucker 
AuthorDate: Wed, 14 Apr 2021 10:26:34 +02:00
Committer: Borislav Petkov 
CommitterDate: Thu, 15 Apr 2021 10:27:29 +02:00

x86/dma: Tear down DMA ops on driver unbind

Since

  08a27c1c3ecf ("iommu: Add support to change default domain of an iommu group")

a user can switch a device between IOMMU and direct DMA through sysfs.
This doesn't work for AMD IOMMU at the moment because dev->dma_ops is
not cleared when switching from a DMA to an identity IOMMU domain. The
DMA layer thus attempts to use the dma-iommu ops on an identity domain,
causing an oops:

  # echo 0000:00:05.0 > /sys/bus/pci/drivers/e1000e/unbind
  # echo identity > /sys/bus/pci/devices/0000:00:05.0/iommu_group/type
  # echo 0000:00:05.0 > /sys/bus/pci/drivers/e1000e/bind
   ...
  BUG: kernel NULL pointer dereference, address: 0000000000000028
   ...
   Call Trace:
iommu_dma_alloc
e1000e_setup_tx_resources
e1000e_open

Implement arch_teardown_dma_ops() on x86 to clear the device's dma_ops
pointer during driver unbind.

 [ bp: Massage commit message. ]

Fixes: 08a27c1c3ecf ("iommu: Add support to change default domain of an iommu group")
Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20210414082633.877461-1-jean-phili...@linaro.org
---
 arch/x86/Kconfig  | 1 +
 arch/x86/kernel/pci-dma.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879..2c90f8d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
+   select ARCH_HAS_TEARDOWN_DMA_OPS    if IOMMU_DMA
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_DEBUG_WX
select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index de234e7..60a4ec2 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -154,3 +154,10 @@ static void via_no_dac(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
PCI_CLASS_BRIDGE_PCI, 8, via_no_dac);
 #endif
+
+#ifdef CONFIG_ARCH_HAS_TEARDOWN_DMA_OPS
+void arch_teardown_dma_ops(struct device *dev)
+{
+   set_dma_ops(dev, NULL);
+}
+#endif


[PATCH] x86/dma: Tear down DMA ops on driver unbind

2021-04-14 Thread Jean-Philippe Brucker
Since commit 08a27c1c3ecf ("iommu: Add support to change default domain
of an iommu group") a user can switch a device between IOMMU and direct
DMA through sysfs. This doesn't work for AMD IOMMU at the moment because
dev->dma_ops is not cleared when switching from a DMA to an identity
IOMMU domain. The DMA layer thus attempts to use the dma-iommu ops on an
identity domain, causing an oops:

  # echo 0000:00:05.0 > /sys/bus/pci/drivers/e1000e/unbind
  # echo identity > /sys/bus/pci/devices/0000:00:05.0/iommu_group/type
  # echo 0000:00:05.0 > /sys/bus/pci/drivers/e1000e/bind
   ...
  [  190.017587] BUG: kernel NULL pointer dereference, address: 0000000000000028
   ...
  [  190.027375] Call Trace:
  [  190.027561]  iommu_dma_alloc+0xd0/0x100
  [  190.027840]  e1000e_setup_tx_resources+0x56/0x90
  [  190.028173]  e1000e_open+0x75/0x5b0

Implement arch_teardown_dma_ops() on x86 to clear the device's dma_ops
pointer during driver unbind.

Fixes: 08a27c1c3ecf ("iommu: Add support to change default domain of an iommu group")
Signed-off-by: Jean-Philippe Brucker 
---
 arch/x86/Kconfig  | 1 +
 arch/x86/kernel/pci-dma.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..2c90f8de3e20 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
+   select ARCH_HAS_TEARDOWN_DMA_OPS    if IOMMU_DMA
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_DEBUG_WX
select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index de234e7a8962..60a4ec22d849 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -154,3 +154,10 @@ static void via_no_dac(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
PCI_CLASS_BRIDGE_PCI, 8, via_no_dac);
 #endif
+
+#ifdef CONFIG_ARCH_HAS_TEARDOWN_DMA_OPS
+void arch_teardown_dma_ops(struct device *dev)
+{
+   set_dma_ops(dev, NULL);
+}
+#endif
-- 
2.31.1



Re: [PATCH 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-04-09 Thread Jean-Philippe Brucker
On Thu, Apr 08, 2021 at 10:08:55AM -0700, Jacob Pan wrote:
> The void* drvdata parameter isn't really used in iommu_sva_bind_device()
> API,

Right, it used to be a cookie passed to the device driver in the exit_mm()
callback, but that went away with edcc40d2ab5f ("iommu: Remove
iommu_sva_ops::mm_exit()")

> the current IDXD code "borrows" the drvdata for a VT-d private flag
> for supervisor SVA usage.
> 
> Supervisor/Privileged mode request is a generic feature. It should be
> promoted from the VT-d vendor driver to the generic code.
> 
> This patch replaces void* drvdata with an unsigned int flags parameter
> and adjusts callers accordingly.

Thanks for cleaning this up. Making flags unsigned long seems more common
(I suggested int without thinking). But it doesn't matter much, we won't
get to 32 flags.

> 
> Link: https://lore.kernel.org/linux-iommu/YFhiMLR35WWMW%2FHu@myrica/
> Suggested-by: Jean-Philippe Brucker 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/dma/idxd/cdev.c |  2 +-
>  drivers/dma/idxd/init.c |  6 +++---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |  2 +-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  4 ++--
>  drivers/iommu/intel/Kconfig |  1 +
>  drivers/iommu/intel/svm.c   | 18 ++
>  drivers/iommu/iommu.c   |  9 ++---
>  drivers/misc/uacce/uacce.c  |  2 +-
>  include/linux/intel-iommu.h |  2 +-
>  include/linux/intel-svm.h   | 17 ++---
>  include/linux/iommu.h   | 19 ---
>  11 files changed, 40 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
> index 0db9b82..21ec82b 100644
> --- a/drivers/dma/idxd/cdev.c
> +++ b/drivers/dma/idxd/cdev.c
> @@ -103,7 +103,7 @@ static int idxd_cdev_open(struct inode *inode, struct file *filp)
>   filp->private_data = ctx;
>  
>   if (device_pasid_enabled(idxd)) {
> - sva = iommu_sva_bind_device(dev, current->mm, NULL);
> + sva = iommu_sva_bind_device(dev, current->mm, 0);
>   if (IS_ERR(sva)) {
>   rc = PTR_ERR(sva);
>   dev_err(dev, "pasid allocation failed: %d\n", rc);
> diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
> index 085a0c3..cdc85f1 100644
> --- a/drivers/dma/idxd/init.c
> +++ b/drivers/dma/idxd/init.c
> @@ -300,13 +300,13 @@ static struct idxd_device *idxd_alloc(struct pci_dev *pdev)
>  
>  static int idxd_enable_system_pasid(struct idxd_device *idxd)
>  {
> - int flags;
> + unsigned int flags;
>   unsigned int pasid;
>   struct iommu_sva *sva;
>  
> - flags = SVM_FLAG_SUPERVISOR_MODE;
> + flags = IOMMU_SVA_BIND_SUPERVISOR;
>  
> - sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, &flags);
> + sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, flags);
>   if (IS_ERR(sva)) {
>   dev_warn(&idxd->pdev->dev,
>"iommu sva bind failed: %ld\n", PTR_ERR(sva));
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> index bb251ca..23e287e 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> @@ -354,7 +354,7 @@ __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
>  }
>  
>  struct iommu_sva *
> -arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
> +arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, unsigned int flags)

Could you add a check on flags:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index bb251cab61f3..145ceb2fc5da 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -354,12 +354,15 @@ __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
 }

 struct iommu_sva *
-arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
+arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, unsigned int flags)
 {
struct iommu_sva *handle;
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

+   if (flags)
+   return ERR_PTR(-EINVAL);
+
if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
return ERR_PTR(-EINVAL);



>  {
>   struct iommu_sva *handle;
>   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> diff -

Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-09 Thread Jean-Philippe Brucker
On Thu, Apr 08, 2021 at 10:08:56AM -0700, Jacob Pan wrote:
> diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
> index bd41405..bd99f6b 100644
> --- a/drivers/iommu/iommu-sva-lib.c
> +++ b/drivers/iommu/iommu-sva-lib.c
> @@ -12,27 +12,33 @@ static DECLARE_IOASID_SET(iommu_sva_pasid);
>  
>  /**
>   * iommu_sva_alloc_pasid - Allocate a PASID for the mm
> - * @mm: the mm
>   * @min: minimum PASID value (inclusive)
>   * @max: maximum PASID value (inclusive)
>   *
> - * Try to allocate a PASID for this mm, or take a reference to the existing one
> - * provided it fits within the [@min, @max] range. On success the PASID is
> - * available in mm->pasid, and must be released with iommu_sva_free_pasid().
> + * Try to allocate a PASID for the current mm, or take a reference to the
> + * existing one provided it fits within the [@min, @max] range. On success
> + * the PASID is available in the current mm->pasid, and must be released with
> + * iommu_sva_free_pasid().
>   * @min must be greater than 0, because 0 indicates an unused mm->pasid.
>   *
>   * Returns 0 on success and < 0 on error.
>   */
> -int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
> +int iommu_sva_alloc_pasid(ioasid_t min, ioasid_t max)
>  {
>   int ret = 0;
>   ioasid_t pasid;
> + struct mm_struct *mm;
>  
>   if (min == INVALID_IOASID || max == INVALID_IOASID ||
>   min == 0 || max < min)
>   return -EINVAL;
>  
> + mutex_lock(&iommu_sva_lock);
> + mm = get_task_mm(current);
> + if (!mm) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }

I still think it would be more elegant to keep the choice of context in
iommu_sva_bind_device() and pass it down to leaf functions such as
iommu_sva_alloc_pasid(). The patch is trying to solve two separate
problems:

* We don't have a use-case for binding the mm of a remote process (and
  it's supposedly difficult for device drivers to do it securely). So OK,
  we remove the mm argument from iommu_sva_bind_device() and use the
  current mm. But the IOMMU driver isn't going to do get_task_mm(current)
  every time it needs the mm being bound, it will take it from
  iommu_sva_bind_device(). Likewise iommu_sva_alloc_pasid() shouldn't need
  to bother with get_task_mm().

* cgroup accounting for IOASIDs needs to be on the current task. Removing
  the mm parameter from iommu_sva_alloc_pasid() doesn't help with that.
  Sure it indicates that iommu_sva_alloc_pasid() needs a specific task
  context but that's only for cgroup purpose, and I'd rather pass the
  cgroup down from iommu_sva_bind_device() anyway (but am fine with
  keeping it within ioasid_alloc() for now). Plus it's an internal helper,
  easy for us to check that the callers are doing the right thing.
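
To illustrate, a rough sketch of what I mean (the ops indirection below
follows today's iommu.c, but treat the whole thing as a sketch rather than
a concrete proposal):

struct iommu_sva *iommu_sva_bind_device(struct device *dev, unsigned int flags)
{
        struct iommu_sva *handle;
        /* Resolve the context once here, not in every leaf helper */
        struct mm_struct *mm = get_task_mm(current);

        if (!mm)
                return ERR_PTR(-EINVAL);

        /* The IOMMU driver still receives the mm explicitly */
        handle = dev->bus->iommu_ops->sva_bind(dev, mm, flags);
        mmput(mm);
        return handle;
}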

>   if (mm->pasid) {
>   if (mm->pasid >= min && mm->pasid <= max)
>   ioasid_get(mm->pasid);
> @@ -45,22 +51,32 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
>   else
>   mm->pasid = pasid;
>   }
> + mmput(mm);
> +out_unlock:
>   mutex_unlock(&iommu_sva_lock);
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);
>  
>  /**
> - * iommu_sva_free_pasid - Release the mm's PASID
> + * iommu_sva_free_pasid - Release the current mm's PASID
>   * @mm: the mm
>   *
>   * Drop one reference to a PASID allocated with iommu_sva_alloc_pasid()
>   */
> -void iommu_sva_free_pasid(struct mm_struct *mm)
> +void iommu_sva_free_pasid(void)
>  {
> + struct mm_struct *mm;
> +
>   mutex_lock(&iommu_sva_lock);
> + mm = get_task_mm(current);
> + if (!mm)
> + goto out_unlock;
> +

More importantly, could we at least dissociate free_pasid() from the
current process?  Otherwise drivers can't clean up from a workqueue (as
amdkfd does) or from an rcu callback. Given that iommu_sva_unbind_device()
takes the SVA handle owned by whomever did bind(), there shouldn't be any
security issue. For the cgroup problem, ioasid.c could internally keep
track of the cgroup used during allocation rather than assuming the
context of ioasid_put() is the same as that of ioasid_get().
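
Concretely, keeping the mm parameter - essentially the code before this
patch - avoids any dependency on current:

void iommu_sva_free_pasid(struct mm_struct *mm)
{
        mutex_lock(&iommu_sva_lock);
        /* The caller passes the mm it bound, whatever its own context */
        if (ioasid_put(mm->pasid))
                mm->pasid = 0;
        mutex_unlock(&iommu_sva_lock);
}
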

>   if (ioasid_put(mm->pasid))
>   mm->pasid = 0;
> + mmput(mm);
> +out_unlock:
>   mutex_unlock(&iommu_sva_lock);
>  }
>  EXPORT_SYMBOL_GPL(iommu_sva_free_pasid);
> diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva-lib.h
> index b40990a..278b8b4 100644
> --- a/drivers/iommu/iommu-sva-lib.h
> +++ b/drivers/iommu/iommu-sva-lib.h
> @@ -8,8 +8,8 @@
>  #include 
>  #include 
>  
> -int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max);
> -void iommu_sva_free_pasid(struct mm_struct *mm);
> +int iommu_sva_alloc_pasid(ioasid_t min, ioasid_t max);
> +void iommu_sva_free_pasid(void);
>  struct mm_struct *iommu_sva_find(ioasid_t pasid);
>  
>  #endif /* _IOMMU_SVA_LIB_H */
> diff --git a/drivers/iommu/iommu.c 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-08 Thread Jean-Philippe Brucker
On Wed, Apr 07, 2021 at 04:36:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 07, 2021 at 08:43:50PM +0200, Jean-Philippe Brucker wrote:
> 
> > * Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
> >   No operation available since we don't know what the device and IOMMU
> >   capabilities are.
> >
> > * Attach the handle to a VF. With VFIO that would be
> >   VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
> >   with the handle, and decide which operations are available.
> 
> Right, this is basically the point, - the VFIO container (/dev/vfio)
> and the /dev/ioasid we are talking about have a core of
> similarity. ioasid is the generalized, modernized, and cross-subsystem
> version of the same idea. Instead of calling it "vfio container" we
> call it something that evokes the idea of controlling the iommu.
> 
> The issue is to separate /dev/vfio generic functionality from vfio and
> share it with every subsystem.
> 
> It may be that /dev/vfio and /dev/ioasid end up sharing a lot of code,
> with a different IOCTL interface around it. The vfio_iommu_driver_ops
> is not particularly VFIOy.
> 
> Creating /dev/ioasid may primarily start as a code reorganization
> exercise.
> 
> > * With a map/unmap vIOMMU (or shadow mappings), a single translation level
> >   is supported. With a nesting vIOMMU, we're populating the level-2
> >   translation (some day maybe by binding the KVM page tables, but
> >   currently with map/unmap ioctl).
> > 
> >   Single-level translation needs single VF per container. 
> 
> Really? Why?

The vIOMMU is started in bypass, so the device can do DMA to the GPA space
until the guest configures the vIOMMU, at which point each VF is either
kept in bypass or gets new DMA mappings, which requires the host to tear
down the bypass mappings and set up the guest mappings on a per-VF basis
(I'm not considering nesting translation in the host kernel for this,
because it's not supported by all pIOMMUs and is expensive in terms of TLB
and pinned memory). So keeping a single VF per container is simpler, but
there are certainly other programming models possible.

Thanks,
Jean



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jean-Philippe Brucker
On Wed, Apr 07, 2021 at 08:17:50AM +, Tian, Kevin wrote:
> btw this discussion was raised when discussing the I/O page fault handling
> process. Currently the IOMMU layer implements a per-device fault reporting
> mechanism, which requires VFIO to register a handler to receive all faults
> on its device and then forward to ioasid if it's due to 1st-level. Possibly
> it makes more sense to convert it into a per-pgtable reporting scheme, and
> then the owner of each pgtable should register its own handler.

Maybe, but you do need device information in there, since that's how the
fault is reported to the guest and how the response is routed back to the
faulting device (only PASID+PRGI would cause aliasing). And we need to
report non-recoverable faults, as well as recoverable ones without PASID,
once we hand control of level-1 page tables to guests.

> It means for 1) VFIO will register a 2nd-level pgtable handler while
> /dev/ioasid will register a 1st-level pgtable handler, while for 3)
> /dev/ioasid will register handlers for both 1st-level and 2nd-level
> pgtable. Jean? also want to know your thoughts...

Moving all IOMMU controls to /dev/ioasid rather than splitting them is
probably better. Hopefully the implementation can reuse most of
vfio_iommu_type1.

I'm trying to sketch what may work for Arm, if we have to reuse
/dev/ioasid to avoid duplication of fault and inval queues:

* Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
  No operation available since we don't know what the device and IOMMU
  capabilities are.

* Attach the handle to a VF. With VFIO that would be
  VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
  with the handle, and decide which operations are available.

* With a map/unmap vIOMMU (or shadow mappings), a single translation level
  is supported. With a nesting vIOMMU, we're populating the level-2
  translation (some day maybe by binding the KVM page tables, but
  currently with map/unmap ioctl).

  Single-level translation needs single VF per container. Two level would
  allow sharing stage-2 between multiple VFs, though it's a pain to define
  and implement.

* Without a vIOMMU or if the vIOMMU starts in bypass, populate the
  container page tables.

Start the guest.

* With a map/unmap vIOMMU, guest creates mappings, userspace populates the
  page tables with map/unmap ioctl.

  It would be possible to add a PASID mode there: guest requests an
  address space with a specific PASID, userspace derives an IOASID handle
  from the container handle and populate that address space with map/unmap
  ioctl. That would enable PASID on sub-VF assignment, which requires the
  host to control which PASID is programmed into the VF (with
  DEVICE_ALLOW_IOASID, I guess). And either the host allocates the PASID
  in this case (which isn't supported by a vSMMU) or we have to do a
  vPASID -> pPASID. I don't know if it's worth the effort.

Or
* With a nesting vIOMMU, the guest attaches a PASID table to a VF,
  userspace issues a SET_PASID_TABLE ioctl on the container handle. If
  we support multiple VFs per container, we first need to derive a child
  container from the main one and the device, then attach the PASID table.

  Guest programs the PASID table, sends invalidations when removing
  mappings which are relayed to the host on the child container. Page
  faults and response queue would be per container, so if multiple VF per
  container, we could have one queue for the parent (level-2 faults) and
  one for each child (level-1 faults).

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jean-Philippe Brucker
On Thu, Apr 01, 2021 at 07:04:01AM +, Liu, Yi L wrote:
> > - how about AMD and ARM's vSVA support? Their PASID allocation and page
> > table
> >   happens within guest. They only need to bind the guest PASID table to
> > host.

In this case each VM has its own IOASID space, and the host IOASID
allocator doesn't participate. Plus this only makes sense when assigning a
whole VF to a guest, and VFIO is the tool for this. So I wouldn't shoehorn
those ops into /dev/ioasid, though we do need a transport for invalidate
commands.

> >   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
> >   to correct me)
> > - this per-ioasid SVA operations is not aligned with the native SVA usage
> >   model. Native SVA bind is per-device.

Bare-metal SVA doesn't need /dev/ioasid either. A program uses a device
handle to either ask whether SVA is enabled, or to enable it explicitly.
With or without /dev/ioasid, that step is required. OpenCL uses the first
method - automatically enable "fine-grain system SVM" if available, and
provide a flag to userspace.

So userspace does not need to know about PASID. It's only one method for
doing SVA (some GPUs are context-switching page tables instead).
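
For reference, the OpenCL 2.0 check looks roughly like this (standard
clGetDeviceInfo() usage):

        cl_device_svm_capabilities caps;
        cl_int err = clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES,
                                     sizeof(caps), &caps, NULL);
        if (err == CL_SUCCESS && (caps & CL_DEVICE_SVM_FINE_GRAIN_SYSTEM))
                /* malloc'ed pointers can be passed directly to kernels */;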

> After reading your reply in 
> https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t
> So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
> doesn't suit your idea. I draft below skeleton to see if our mind is the
> same. But I still believe there is an open on how to fit ARM and AMD's
> vSVA support in this the per-ioasid SVA operation model. thoughts?
> 
> +------------------------------+------------------------------------------------+
> |  userspace                   |   kernel space                                 |
> +------------------------------+------------------------------------------------+
> | ioasid_fd =                  | /dev/ioasid does below:                        |
> | open("/dev/ioasid", O_RDWR); |   struct ioasid_fd_ctx {                       |
> |                              |       struct list_head ioasid_list;            |
> |                              |       ...                                      |
> |                              |   } ifd_ctx; // ifd_ctx is per ioasid_fd       |
> +------------------------------+------------------------------------------------+
> | ioctl(ioasid_fd,             | /dev/ioasid does below:                        |
> |       ALLOC, &ioasid);       |   struct ioasid_data {                         |
> |                              |       ioasid_t ioasid;                         |
> |                              |       struct list_head device_list;            |
> |                              |       struct list_head next;                   |
> |                              |       ...                                      |
> |                              |   } id_data; // id_data is per ioasid          |
> |                              |                                                |
> |                              |   list_add(&id_data.next,                      |
> |                              |            &ifd_ctx.ioasid_list);              |
> +------------------------------+------------------------------------------------+
> | ioctl(device_fd,             | VFIO does below:                               |
> |       DEVICE_ALLOW_IOASID,   | 1) get ioasid_fd, check if ioasid_fd is valid  |
> |       ioasid_fd,             | 2) check if ioasid is allocated from ioasid_fd |
> |       ioasid);               | 3) register device/domain info to /dev/ioasid, |
> |                              |    tracked in id_data.device_list              |
> |                              | 4) record the ioasid in VFIO's per-device      |
> |                              |    ioasid list for future security check       |
> +------------------------------+------------------------------------------------+
> | ioctl(ioasid_fd,             | /dev/ioasid does below:                        |
> |       BIND_PGTBL,            | 1) find ioasid's id_data                       |
> |       pgtbl_data,            | 2) loop the id_data.device_list and tell iommu |
> |       ioasid);               |    give ioasid access to the devices           |
> +------------------------------+------------------------------------------------+
> | ioctl(ioasid_fd,             | /dev/ioasid does below:                        |
> |       UNBIND_PGTBL,          | 1) find ioasid's id_data                       |
> |       ioasid);               | 2) loop the id_data.device_list and tell iommu |
> |                              |    clear ioasid access to the devices          |
> +------------------------------+------------------------------------------------+
> | ioctl(device_fd,             | VFIO does below:                               |
> |      DEVICE_DISALLOW_IOASID, | 1) check if ioasid is associated in VFIO's     |
> |  

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jean-Philippe Brucker
On Tue, Mar 30, 2021 at 10:07:55AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 09:06:42AM +0100, Jean-Philippe Brucker wrote:
> 
> > It's not inconceivable to have a control queue doing DMA tagged with
> > PASID. The devices I know either use untagged DMA, or have a choice to use
> > a PASID.
> 
> I don't think we should encourage that. A PASID and all the related is
> so expensive compared to just doing normal untagged kernel DMA.

How is it expensive?  Low number of PASIDs, or slowing down DMA
transactions?  PASIDs aren't a scarce resource on Arm systems, they have
almost 1M unused PASIDs per VM.

Thanks,
Jean

> I assume HW has these features because virtualization use cases might
> use them, eg by using mdev to assign a command queue - then it would
> need be be contained by a PASID.
> 
> Jason


Re: [PATCH RFC v1 02/15] iommu: Add a simple PASID table library

2021-03-29 Thread Jean-Philippe Brucker
On Fri, Mar 12, 2021 at 06:17:55PM +0530, Vivek Kumar Gautam wrote:
> > Regarding the overall design, I was initially assigning page directories
> > instead of whole PASID tables, which would simplify the driver and host
> > implementation. A major complication, however, is SMMUv3 accesses PASID
> > tables using a guest-physical address, so there is a messy negotiation
> > needed between host and guest when the host needs to allocate PASID
> > tables. Plus vSMMU needs PASID table assignment, so that's what the host
> > driver will implement.
> 
> By assigning the page directories, you mean setting up just the stage-1 page
> table ops, and passing that information to the host using ATTACH_TABLE?

Yes. And we can support nested translation with SMMUv2 that way. But with
SMMUv3 the guest has to manage the whole PASID table.

> Right now when using kvmtool, the struct iommu_pasid_table_config is
> populated with the correct information, and this whole memory is mapped
> between host and guest by creating a mem bank using
> kvm__for_each_mem_bank().
> Did I get you or did I fail terribly in understanding the point you are
> making here?

Makes sense

Thanks,
Jean


Re: [PATCH RFC v1 15/15] iommu/virtio: Update fault type and reason info for viommu fault

2021-03-29 Thread Jean-Philippe Brucker
On Fri, Mar 12, 2021 at 06:39:05PM +0530, Vivek Kumar Gautam wrote:
> To complete the page request we would also need to send the response back to
> the host from virtio backend when handling page request. So the virtio
> command should also be accompanied with a vfio api to send the page request
> response back to the host. Isn't it?
> This is where the host smmuv3 can send PRI_RESP command to the device to
> complete the page fault.

It looks like Eric already has this in the VFIO series:
https://lore.kernel.org/linux-iommu/20210223210625.604517-14-eric.au...@redhat.com/

Thanks,
Jean


Re: [PATCH RFC v1 13/15] iommu/virtio: Attach Arm PASID tables when available

2021-03-29 Thread Jean-Philippe Brucker
On Fri, Mar 12, 2021 at 06:59:17PM +0530, Vivek Kumar Gautam wrote:
> > > + /* XXX HACK: set feature bit ARM_SMMU_FEAT_2_LVL_CDTAB */
> > > + pst_cfg->vendor.cfg.feat_flag |= (1 << 1);
> > 
> > Oh right, this flag is missing. I'll add
> > 
> >#define VIRTIO_IOMMU_PST_ARM_SMMU3_F_CD2L (1ULL << 1)
> > 
> > to the spec.
> 
> Regarding this Eric pointed out [1] in my other patch about the scalability
> of the approach where we keep adding flags in 'iommu_nesting_info'
> corresponding to the arm-smmu-v3 capabilities. I guess the same goes to
> these flags in virtio.
> May be the 'iommu_nesting_info' can have a bitmap with the caps for vendor
> specific features, and here we can add the related flags?

Something like that, but I'd keep separate arch-specific structs. Vt-d
reports the capability registers directly through iommu_nesting_info [2].
We could do the same for Arm, copy sanitized values of IDR0..5 into
struct iommu_nesting_info_arm_smmuv3.

I've avoided doing that for virtio-iommu because every field needs a
description in the spec. So where possible I used generic properties that
apply to any architecture, such as page, PASID and address size. What's
left is the minimum arch-specific information to get nested translation
going, leaving out a lot of properties such as big-endian and 32-bit,
which can be added later if needed. The Arm specific properties are split
into page table and pasid table information. Page table info should work
for both SMMUv2 and v3 (where they correspond to an SMMU_IDRx field that
constrains a context descriptor field.) I should move BTM in there since
it's supported by SMMUv2.

Thanks,
Jean

[2] 
https://lore.kernel.org/linux-iommu/20210302203545.436623-11-yi.l@intel.com/


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-26 Thread Jean-Philippe Brucker
On Thu, Mar 25, 2021 at 02:16:45PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 10:02:36AM -0700, Jacob Pan wrote:
> > Hi Jean-Philippe,
> > 
> > On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
> >  wrote:
> > 
> > > On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> > > > Hi Jason,
> > > > 
> > > > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe 
> > > > wrote: 
> > > > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:  
> > > > > > > Also wondering about device driver allocating auxiliary domains
> > > > > > > for their private use, to do iommu_map/unmap on private PASIDs (a
> > > > > > > clean replacement to super SVA, for example). Would that go
> > > > > > > through the same path as /dev/ioasid and use the cgroup of
> > > > > > > current task?
> > > > > >
> > > > > > For the in-kernel private use, I don't think we should restrict
> > > > > > based on cgroup, since there is no affinity to user processes. I
> > > > > > also think the PASID allocation should just use kernel API instead
> > > > > > of /dev/ioasid. Why would user space need to know the actual PASID
> > > > > > # for device private domains? Maybe I missed your idea?
> > > > > 
> > > > > There is not much in the kernel that isn't triggered by a process, I
> > > > > would be careful about the idea that there is a class of users that
> > > > > can consume a cgroup controlled resource without being inside the
> > > > > cgroup.
> > > > > 
> > > > > We've got into trouble before overlooking this and with something
> > > > > greenfield like PASID it would be best built in to the API to prevent
> > > > > a mistake. eg accepting a cgroup or process input to the allocator.
> > > > >   
> > > > Make sense. But I think we only allow charging the current cgroup, how
> > > > about I add the following to ioasid_alloc():
> > > > 
> > > > misc_cg = get_current_misc_cg();
> > > > ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > > > if (ret) {
> > > > put_misc_cg(misc_cg);
> > > > return ret;
> > > > }  
> > > 
> > > Does that allow PASID allocation during driver probe, in kernel_init or
> > > modprobe context?
> > > 
> > Good point. Yes, you can get cgroup subsystem state in kernel_init for
> > charging/uncharging. I would think module_init should work also since it is
> > after kernel_init. I have tried the following:
> > static int __ref kernel_init(void *unused)
> >  {
> > int ret;
> > +   struct cgroup_subsys_state *css;
> > +   css = task_get_css(current, pids_cgrp_id);
> > 
> > But that would imply:
> > 1. IOASID has to be built-in, not as module

If IOASID is a module, the device driver will probe once the IOMMU module
is available, which I think always happens in probe deferral kworker.

> > 2. IOASIDs charged on PID1/init would not subject to cgroup limit since it
> > will be in the root cgroup and we don't support migration nor will migrate.
> > 
> > Then it comes back to the question of why do we try to limit in-kernel
> > users per cgroup if we can't enforce these cases.

It may be better to explicitly pass a cgroup during allocation as Jason
suggested. That way anyone using the API will have to be aware of this and
pass the root cgroup if that's what they want.
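
For example (hypothetical signature - the current ioasid_alloc() takes no
cgroup argument):

        ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
                              ioasid_t max, struct misc_cg *cg,
                              void *private);

A driver allocating from probe or kernel_init context would then have to
state explicitly that it wants the root cgroup (or no charging at all),
rather than silently charging whatever current happens to be.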

> Are these real use cases? Why would a driver binding to a device
> create a single kernel pasid at bind time? Why wouldn't it use
> untagged DMA?

It's not inconceivable to have a control queue doing DMA tagged with
PASID. The devices I know either use untagged DMA, or have a choice to use
a PASID. We're not outright forbidding PASID allocation at boot (I don't
think we can or should) and we won't be able to check every use of the
API, so I'm trying to figure out whether it will always default to root
cgroup, or crash in some corner case.

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-25 Thread Jean-Philippe Brucker
On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
> > the SMMU, but I think we need to clean the current usage)
> > 
> You mean move #define SVM_FLAG_SUPERVISOR_MODE out of Intel code to be a
> generic flag in iommu-sva-lib.h called IOMMU_SVA_BIND_SUPERVISOR?

Yes, though it would need to be in iommu.h since it's used by device
drivers

> > Also wondering about device driver allocating auxiliary domains for their
> > private use, to do iommu_map/unmap on private PASIDs (a clean replacement
> > to super SVA, for example). Would that go through the same path as
> > /dev/ioasid and use the cgroup of current task?
> >
> For the in-kernel private use, I don't think we should restrict based on
> cgroup, since there is no affinity to user processes. I also think the
> PASID allocation should just use kernel API instead of /dev/ioasid. Why
> would user space need to know the actual PASID # for device private domains?
> Maybe I missed your idea?

No that's my bad, I didn't get the role of /dev/ioasid. Let me give the
series a proper read.

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-25 Thread Jean-Philippe Brucker
On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > > > Also wondering about device driver allocating auxiliary domains for
> > > > their private use, to do iommu_map/unmap on private PASIDs (a clean
> > > > replacement to super SVA, for example). Would that go through the
> > > > same path as /dev/ioasid and use the cgroup of current task?  
> > >
> > > For the in-kernel private use, I don't think we should restrict based on
> > > cgroup, since there is no affinity to user processes. I also think the
> > > PASID allocation should just use kernel API instead of /dev/ioasid. Why
> > > would user space need to know the actual PASID # for device private
> > > domains? Maybe I missed your idea?  
> > 
> > There is not much in the kernel that isn't triggered by a process, I
> > would be careful about the idea that there is a class of users that
> > can consume a cgroup controlled resource without being inside the
> > cgroup.
> > 
> > We've got into trouble before overlooking this and with something
> > greenfield like PASID it would be best built in to the API to prevent
> > a mistake. eg accepting a cgroup or process input to the allocator.
> > 
> Make sense. But I think we only allow charging the current cgroup, how about
> I add the following to ioasid_alloc():
> 
>   misc_cg = get_current_misc_cg();
>   ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
>   if (ret) {
>   put_misc_cg(misc_cg);
>   return ret;
>   }

Does that allow PASID allocation during driver probe, in kernel_init or
modprobe context?

Thanks,
Jean



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-22 Thread Jean-Philippe Brucker
On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe  wrote:
> 
> > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:  
> > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > > >   
> > > > > Although there is no use for it at the moment (only two upstream
> > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > like the client-server model where the privileged process does
> > > > > bind() and programs the hardware queue on behalf of the client
> > > > > process.  
> > > > 
> > This creates a lot of complexity, how does process A get a secure
> > > > reference to B? How does it access the memory in B to setup the HW?  
> > > 
> > > mm_access() for example, and passing addresses via IPC  
> > 
> > I'd rather the source process establish its own PASID and then pass
> > the rights to use it to some other process via FD passing than try to
> > go the other way. There are lots of security questions with something
> > like mm_access.
> > 
> 
> Thank you all for the input, it sounds like we are OK to remove mm argument
> from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for now?

Fine by me. By the way the IDXD currently missues the bind API for
supervisor PASID, and the drvdata parameter isn't otherwise used. This
would be a good occasion to clean both. The new bind prototype could be:

struct iommu_sva *iommu_sva_bind_device(struct device *dev, int flags)

And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
the SMMU, but I think we need to clean the current usage)
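
e.g. (sketch):

/*
 * Generic replacement for Intel's SVM_FLAG_SUPERVISOR_MODE: bind the device
 * to init_mm rather than a user address space.
 */
#define IOMMU_SVA_BIND_SUPERVISOR       (1 << 0)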

> 
> Let me try to summarize PASID allocation as below:
> 
> Interfaces    | Usage          | Limit  | bind¹ | User visible
> --------------+----------------+--------+-------+-------------
> /dev/ioasid²  | G-SVA/IOVA     | cgroup | No    | Yes
> char dev³     | SVA            | cgroup | Yes   | No
> iommu driver  | default PASID  | no     | No    | No

Is this PASID #0?

> 
> kernel        | super SVA      | no     | yes   | No
> 

Also wondering about device driver allocating auxiliary domains for their
private use, to do iommu_map/unmap on private PASIDs (a clean replacement
to super SVA, for example). Would that go through the same path as
/dev/ioasid and use the cgroup of current task?

Thanks,
Jean

> 
> ¹ Allocated during SVA bind
> ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
>   ownership is assigned to the process that does the allocation.
> ³ Include uacce, other private device driver char dev such as idxd
> 
> Currently, the proposed /dev/ioasid interface does not map individual PASID
> with an FD. The FD is at the ioasid_set granularity and bond to the current
> mm. We could extend the IOCTLs to cover individual PASID-FD passing case
> when use cases arise. Would this work?
> 
> Thanks,
> 
> Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jean-Philippe Brucker
On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> 
> > Although there is no use for it at the moment (only two upstream users and
> > it looks like amdkfd always uses current too), I quite like the
> > client-server model where the privileged process does bind() and programs
> > the hardware queue on behalf of the client process.
> 
> This creates a lot of complexity, how does process A get a secure
> reference to B? How does it access the memory in B to setup the HW?

mm_access() for example, and passing addresses via IPC

> Why do we need separation anyhow? SVM devices are supposed to be
> secure or they shouldn't do SVM.

Right

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jean-Philippe Brucker
Hi Jacob,

On Thu, Mar 18, 2021 at 05:22:34PM -0700, Jacob Pan wrote:
> Hi Jean,
> 
> Slightly off the title. As we are moving to use cgroup to limit PASID
> allocations, it would be much simpler if we enforce on the current task.

Yes I think we should do that. Is there a problem with charging the
process that does the PASID allocation even if the PASID indexes some
other mm?

> However, iommu_sva_alloc_pasid() takes an mm_struct pointer as argument
> which implies it can be something other the the current task mm. So far all
> kernel callers use current task mm. Is there a use case for doing PASID
> allocation on behalf of another mm? If not, can we remove the mm argument?

This would effectively remove the mm parameter from
iommu_sva_bind_device(). I'm not opposed to that, but reintroducing it
later will be difficult if IOMMU drivers start assuming that the bound mm
is from current.

Although there is no use for it at the moment (only two upstream users and
it looks like amdkfd always uses current too), I quite like the
client-server model where the privileged process does bind() and programs
the hardware queue on behalf of the client process.

Thanks,
Jean



Re: [PATCH 1/1] drm/amdkfd: fix build error with AMD_IOMMU_V2=m

2021-03-09 Thread Jean-Philippe Brucker
Hi Felix,

On Tue, Mar 09, 2021 at 11:30:19AM -0500, Felix Kuehling wrote:
> > I think the proper fix would be to not rely on custom hooks into a 
> > particular
> > IOMMU driver, but to instead ensure that the amdgpu driver can do everything
> > it needs through the regular linux/iommu.h interfaces. I realize this
> > is more work,
> > but I wonder if you've tried that, and why it didn't work out.
> 
> As far as I know this hasn't been tried. I see that intel-iommu has its
> own SVM thing, which seems to be similar to what our IOMMUv2 does. I
> guess we'd have to abstract that into a common API.

The common API was added in 26b25a2b98e4 and implemented by the Intel
driver in 064a57d7ddfc. To support it an IOMMU driver implements new IOMMU
ops:
.dev_has_feat()
.dev_feat_enabled()
.dev_enable_feat()
.dev_disable_feat()
.sva_bind()
.sva_unbind()
.sva_get_pasid()

And a device driver calls iommu_dev_enable_feature(IOMMU_DEV_FEAT_SVA)
followed by iommu_sva_bind_device().
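
Roughly, a device driver using this API does:

        ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA);
        if (ret)
                return ret;

        handle = iommu_sva_bind_device(dev, current->mm, NULL);
        if (IS_ERR(handle))
                return PTR_ERR(handle);

        /* program the PASID into the device's queues */
        pasid = iommu_sva_get_pasid(handle);
        ...
        iommu_sva_unbind_device(handle);
        iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);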

If I remember correctly the biggest obstacle for KFD is the PASID
allocation, done by the GPU driver instead of the IOMMU driver, but there
may be others.

Thanks,
Jean


Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller

2021-03-05 Thread Jean-Philippe Brucker
On Fri, Mar 05, 2021 at 09:30:49AM +0100, Jean-Philippe Brucker wrote:
> That works but isn't perfect, because the hardware resource of shared
> address spaces can be much lower than the PID limit - 16k ASIDs on Arm. To

Sorry I meant 16-bit here - 64k

Thanks,
Jean


Re: [PATCH v12 03/13] vfio: VFIO_IOMMU_SET_MSI_BINDING

2021-03-05 Thread Jean-Philippe Brucker
Hi,

On Tue, Feb 23, 2021 at 10:06:15PM +0100, Eric Auger wrote:
> This patch adds the VFIO_IOMMU_SET_MSI_BINDING ioctl which aim
> to (un)register the guest MSI binding to the host. This latter
> then can use those stage 1 bindings to build a nested stage
> binding targeting the physical MSIs.

Now that RMR is in the IORT spec, could it be used for the nested MSI
problem?  For virtio-iommu tables I was planning to do it like this:

MSI is mapped at stage-2 with an arbitrary IPA->doorbell PA. We report
this IPA to userspace through iommu_groups/X/reserved_regions. No change
there. Then to the guest we report a reserved identity mapping at IPA
(using RMR, an equivalent DT binding, or probed RESV_MEM for
virtio-iommu). The guest creates that mapping at stage-1, and that's it.
Unless I overlooked something we'd only reuse existing infrastructure and
avoid the SET_MSI_BINDING interface.

Thanks,
Jean


Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller

2021-03-05 Thread Jean-Philippe Brucker
On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
> Hi Jean-Philippe,
> 
> On Thu, 4 Mar 2021 10:49:37 +0100, Jean-Philippe Brucker
>  wrote:
> 
> > On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> > > Hi Jacob,
> > > 
> > > On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
> > >  wrote:
> > >   
> > > > Hi Tejun,
> > > > 
> > > > On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo  wrote:
> > > >   
> > > > > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:
> > > > > > IOASIDs are used to associate DMA requests with virtual address
> > > > > > spaces. They are a system-wide limited resource made available to
> > > > > > the userspace applications. Let it be VMs or user-space device
> > > > > > drivers.
> > > > > > 
> > > > > > This RFC patch introduces a cgroup controller to address the
> > > > > > following problems:
> > > > > > 1. Some user applications exhaust all the available IOASIDs thus
> > > > > > depriving others of the same host.
> > > > > > 2. System admins need to provision VMs based on their needs for
> > > > > > IOASIDs, e.g. the number of VMs with assigned devices that perform
> > > > > > DMA requests with PASID.  
> > > > > 
> > > > > Please take a look at the proposed misc controller:
> > > > > 
> > > > >  http://lkml.kernel.org/r/20210302081705.1990283-2-vipi...@google.com
> > > > > 
> > > > > Would that fit your bill?
> > > > The interface definitely can be reused. But IOASID has a different
> > > > behavior in terms of migration and ownership checking. I guess SEV key
> > > > IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> > > > solved by adding
> > > > +   .can_attach = ioasids_can_attach,
> > > > +   .cancel_attach  = ioasids_cancel_attach,
> > > > Let me give it a try and come back.
> > > >   
> > > While I am trying to fit the IOASIDs cgroup in to the misc cgroup
> > > proposal. I'd like to have a direction check on whether this idea of
> > > using cgroup for IOASID/PASID resource management is viable.  
> > 
> > Yes, even for host SVA it would be good to have a cgroup. Currently the
> > number of shared address spaces is naturally limited by number of
> > processes, which can be controlled with rlimit and cgroup. But on Arm the
> > hardware limit on shared address spaces is 64k (number of ASIDs), easily
> > exhausted with the default PASID and PID limits. So a cgroup for managing
> > this resource is more than welcome.
> > 
> > It looks like your current implementation is very dependent on
> > IOASID_SET_TYPE_MM?  I'll need to do more reading about cgroup to see how
> > easily it can be adapted to host SVA which uses IOASID_SET_TYPE_NULL.
> > 
> Right, I was assuming have three use cases of IOASIDs:
> 1. host supervisor SVA (not a concern, just one init_mm to bind)
> 2. host user SVA, either one IOASID per process or perhaps some private
> IOASID for private address space
> 3. VM use for guest SVA, each IOASID is bound to a guest process
> 
> My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which is
> allocated by the new /dev/ioasid interface.
> 
> For #2, I was thinking you can limit the host process via PIDs cgroup? i.e.
> limit fork.

That works but isn't perfect, because the hardware resource of shared
address spaces can be much lower than the PID limit - 16k ASIDs on Arm. To
allow an admin to fairly distribute that resource we could introduce
another cgroup just to limit the number of shared address spaces, but
limiting the number of IOASIDs does the trick.

> So the host IOASIDs are currently allocated from the system pool
> with quota of chosen by iommu_sva_init() in my patch, 0 means unlimited use
> whatever is available. https://lkml.org/lkml/2021/2/28/18

Yes that's sensible, but it would be good to plan the cgroup user
interface to work for #2 as well, even if we don't implement it right
away.

Thanks,
Jean


Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller

2021-03-04 Thread Jean-Philippe Brucker
On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> Hi Jacob,
> 
> On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
>  wrote:
> 
> > Hi Tejun,
> > 
> > On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo  wrote:
> > 
> > > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:  
> > > > IOASIDs are used to associate DMA requests with virtual address
> > > > spaces. They are a system-wide limited resource made available to the
> > > > userspace applications. Let it be VMs or user-space device drivers.
> > > > 
> > > > This RFC patch introduces a cgroup controller to address the following
> > > > problems:
> > > > 1. Some user applications exhaust all the available IOASIDs thus
> > > > depriving others of the same host.
> > > > 2. System admins need to provision VMs based on their needs for
> > > > IOASIDs, e.g. the number of VMs with assigned devices that perform
> > > > DMA requests with PASID.
> > > 
> > > Please take a look at the proposed misc controller:
> > > 
> > >  http://lkml.kernel.org/r/20210302081705.1990283-2-vipi...@google.com
> > > 
> > > Would that fit your bill?  
> > The interface definitely can be reused. But IOASID has a different
> > behavior in terms of migration and ownership checking. I guess SEV key
> > IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> > solved by adding
> > +   .can_attach = ioasids_can_attach,
> > +   .cancel_attach  = ioasids_cancel_attach,
> > Let me give it a try and come back.
> > 
> While I am trying to fit the IOASIDs cgroup in to the misc cgroup proposal.
> I'd like to have a direction check on whether this idea of using cgroup for
> IOASID/PASID resource management is viable.

Yes, even for host SVA it would be good to have a cgroup. Currently the
number of shared address spaces is naturally limited by number of
processes, which can be controlled with rlimit and cgroup. But on Arm the
hardware limit on shared address spaces is 64k (number of ASIDs), easily
exhausted with the default PASID and PID limits. So a cgroup for managing
this resource is more than welcome.

It looks like your current implementation is very dependent on
IOASID_SET_TYPE_MM?  I'll need to do more reading about cgroup to see how
easily it can be adapted to host SVA which uses IOASID_SET_TYPE_NULL.

Thanks,
Jean


Re: [PATCH RFC v1 13/15] iommu/virtio: Attach Arm PASID tables when available

2021-03-03 Thread Jean-Philippe Brucker
On Fri, Jan 15, 2021 at 05:43:40PM +0530, Vivek Gautam wrote:
[...]
> +static int viommu_setup_pgtable(struct viommu_endpoint *vdev,
> + struct viommu_domain *vdomain)
> +{
> + int ret, id;
> + u32 asid;
> + enum io_pgtable_fmt fmt;
> + struct io_pgtable_ops *ops = NULL;
> + struct viommu_dev *viommu = vdev->viommu;
> + struct virtio_iommu_probe_table_format *desc = vdev->pgtf;
> + struct iommu_pasid_table *tbl = vdomain->pasid_tbl;
> + struct iommu_vendor_psdtable_cfg *pst_cfg;
> + struct arm_smmu_cfg_info *cfgi;
> + struct io_pgtable_cfg cfg = {
> + .iommu_dev  = viommu->dev->parent,
> + .tlb= &viommu_flush_ops,
> + .pgsize_bitmap  = vdev->pgsize_mask ? vdev->pgsize_mask :
> +   vdomain->domain.pgsize_bitmap,
> + .ias= (vdev->input_end ? ilog2(vdev->input_end) :
> +       ilog2(vdomain->domain.geometry.aperture_end)) + 1,
> + .oas= vdev->output_bits,
> + };
> +
> + if (!desc)
> + return -EINVAL;
> +
> + if (!vdev->output_bits)
> + return -ENODEV;
> +
> + switch (le16_to_cpu(desc->format)) {
> + case VIRTIO_IOMMU_FOMRAT_PGTF_ARM_LPAE:
> + fmt = ARM_64_LPAE_S1;
> + break;
> + default:
> + dev_err(vdev->dev, "unsupported page table format 0x%x\n",
> + le16_to_cpu(desc->format));
> + return -EINVAL;
> + }
> +
> + if (vdomain->mm.ops) {
> + /*
> +  * TODO: attach additional endpoint to the domain. Check that
> +  * the config is sane.
> +  */
> + return -EEXIST;
> + }
> +
> + vdomain->mm.domain = vdomain;
> + ops = alloc_io_pgtable_ops(fmt, &cfg, &vdomain->mm);
> + if (!ops)
> + return -ENOMEM;
> +
> + pst_cfg = &tbl->cfg;
> + cfgi = &pst_cfg->vendor.cfg;
> + id = ida_simple_get(&asid_ida, 1, 1 << desc->asid_bits, GFP_KERNEL);
> + if (id < 0) {
> + ret = id;
> + goto err_free_pgtable;
> + }
> +
> + asid = id;
> + ret = iommu_psdtable_prepare(tbl, pst_cfg, &cfg, asid);
> + if (ret)
> + goto err_free_asid;
> +
> + /*
> +  * Strange to setup an op here?
> +  * cd-lib is the actual user of sync op, and therefore the platform
> +  * drivers should assign this sync/maintenance ops as per need.
> +  */
> + tbl->ops->sync = viommu_flush_pasid;

But this function deals with page tables, not pasid tables. As said on an
earlier patch, the TLB flush ops should probably be passed during table
registration - those ops are global so should really be const.
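
i.e. something like this at table allocation time (sketch - the psdtable
structures are from this series, the allocation helper name is made up):

        /* Global, constant maintenance ops for the virtio-iommu backend */
        static const struct iommu_psdtable_ops viommu_psdtable_ops = {
                .sync = viommu_flush_pasid,
        };

        /* Passed once when allocating the PASID table, instead of
         * patching tbl->ops->sync afterwards.
         */
        tbl = iommu_alloc_pasid_table(fmt, dev, &viommu_psdtable_ops);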

> +
> + /* Right now only PASID 0 supported ?? */
> + ret = iommu_psdtable_write(tbl, pst_cfg, 0, &cfgi->s1_cfg->cd);
> + if (ret)
> + goto err_free_asid;
> +
> + vdomain->mm.ops = ops;
> + dev_dbg(vdev->dev, "using page table format 0x%x\n", fmt);
> +
> + return 0;
> +
> +err_free_asid:
> + ida_simple_remove(&asid_ida, asid);
> +err_free_pgtable:
> + free_io_pgtable_ops(ops);
> + return ret;
> +}
> +
> +static int viommu_config_arm_pst(struct iommu_vendor_psdtable_cfg *pst_cfg,
> +  struct virtio_iommu_req_attach_pst_arm *req)
> +{
> + struct arm_smmu_s1_cfg *s1_cfg = pst_cfg->vendor.cfg.s1_cfg;
> +
> + if (!s1_cfg)
> + return -ENODEV;
> +
> + req->format = cpu_to_le16(VIRTIO_IOMMU_FORMAT_PSTF_ARM_SMMU_V3);
> + req->s1fmt  = s1_cfg->s1fmt;
> + req->s1dss  = VIRTIO_IOMMU_PSTF_ARM_SMMU_V3_DSS_0;
> + req->s1contextptr = cpu_to_le64(pst_cfg->base);
> + req->s1cdmax= cpu_to_le32(s1_cfg->s1cdmax);
> +
> + return 0;
> +}
> +
> +static int viommu_config_pst(struct iommu_vendor_psdtable_cfg *pst_cfg,
> +  void *req, enum pasid_table_fmt fmt)
> +{
> + int ret;
> +
> + switch (fmt) {
> + case PASID_TABLE_ARM_SMMU_V3:
> + ret = viommu_config_arm_pst(pst_cfg, req);
> + break;
> + default:
> + ret = -EINVAL;
> + WARN_ON(1);
> + }
> +
> + return ret;
> +}
> +
> +static int viommu_prepare_arm_pst(struct viommu_endpoint *vdev,
> +   struct iommu_vendor_psdtable_cfg *pst_cfg)
> +{
> + struct virtio_iommu_probe_table_format *pgtf = vdev->pgtf;
> + struct arm_smmu_cfg_info *cfgi = &pst_cfg->vendor.cfg;
> + struct arm_smmu_s1_cfg *cfg;
> +
> + /* Some sanity checks */
> + if (pgtf->asid_bits != 8 && pgtf->asid_bits != 16)
> + return -EINVAL;

No need for this, the next patch checks the asid size in viommu_config_arm_pgt()

> +
> + cfg = devm_kzalloc(pst_cfg->iommu_dev, sizeof(cfg), GFP_KERNEL);
> + if (!cfg)
> + return -ENOMEM;
> +
> + cfgi->s1_cfg = cfg;
> + cfg->s1cdmax = vdev->pasid_bits;
> + cfg->cd.asid = 

Re: [PATCH RFC v1 15/15] iommu/virtio: Update fault type and reason info for viommu fault

2021-03-03 Thread Jean-Philippe Brucker
On Fri, Jan 15, 2021 at 05:43:42PM +0530, Vivek Gautam wrote:
> Fault type information can tell about a page request fault or
> an unreceoverable fault, and further additions to fault reasons
> and the related PASID information can help in handling faults
> efficiently.
> 
> Signed-off-by: Vivek Gautam 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Michael S. Tsirkin 
> Cc: Robin Murphy 
> Cc: Jean-Philippe Brucker 
> Cc: Eric Auger 
> Cc: Alex Williamson 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Liu Yi L 
> Cc: Lorenzo Pieralisi 
> Cc: Shameerali Kolothum Thodi 
> ---
>  drivers/iommu/virtio-iommu.c  | 27 +--
>  include/uapi/linux/virtio_iommu.h | 13 -
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> index 9cc3d35125e9..10ef9e98214a 100644
> --- a/drivers/iommu/virtio-iommu.c
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -652,9 +652,16 @@ static int viommu_fault_handler(struct viommu_dev *viommu,
>   char *reason_str;
>  
>   u8 reason   = fault->reason;
> + u16 type= fault->flt_type;
>   u32 flags   = le32_to_cpu(fault->flags);
>   u32 endpoint= le32_to_cpu(fault->endpoint);
>   u64 address = le64_to_cpu(fault->address);
> + u32 pasid   = le32_to_cpu(fault->pasid);
> +
> + if (type == VIRTIO_IOMMU_FAULT_F_PAGE_REQ) {
> + dev_info(viommu->dev, "Page request fault - unhandled\n");
> + return 0;
> + }
>  
>   switch (reason) {
>   case VIRTIO_IOMMU_FAULT_R_DOMAIN:
> @@ -663,6 +670,21 @@ static int viommu_fault_handler(struct viommu_dev *viommu,
>   case VIRTIO_IOMMU_FAULT_R_MAPPING:
>   reason_str = "page";
>   break;
> + case VIRTIO_IOMMU_FAULT_R_WALK_EABT:
> + reason_str = "page walk external abort";
> + break;
> + case VIRTIO_IOMMU_FAULT_R_PTE_FETCH:
> + reason_str = "pte fetch";
> + break;
> + case VIRTIO_IOMMU_FAULT_R_PERMISSION:
> + reason_str = "permission";
> + break;
> + case VIRTIO_IOMMU_FAULT_R_ACCESS:
> + reason_str = "access";
> + break;
> + case VIRTIO_IOMMU_FAULT_R_OOR_ADDRESS:
> + reason_str = "output address";
> + break;
>   case VIRTIO_IOMMU_FAULT_R_UNKNOWN:
>   default:
>   reason_str = "unknown";
> @@ -671,8 +693,9 @@ static int viommu_fault_handler(struct viommu_dev *viommu,
>  
>   /* TODO: find EP by ID and report_iommu_fault */
>   if (flags & VIRTIO_IOMMU_FAULT_F_ADDRESS)
> - dev_err_ratelimited(viommu->dev, "%s fault from EP %u at %#llx [%s%s%s]\n",
> - reason_str, endpoint, address,
> + dev_err_ratelimited(viommu->dev,
> + "%s fault from EP %u PASID %u at %#llx [%s%s%s]\n",
> + reason_str, endpoint, pasid, address,
>   flags & VIRTIO_IOMMU_FAULT_F_READ ? "R" : "",
>   flags & VIRTIO_IOMMU_FAULT_F_WRITE ? "W" : "",
>   flags & VIRTIO_IOMMU_FAULT_F_EXEC ? "X" : "");
> diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
> index 608c8d642e1f..a537d82777f7 100644
> --- a/include/uapi/linux/virtio_iommu.h
> +++ b/include/uapi/linux/virtio_iommu.h
> @@ -290,19 +290,30 @@ struct virtio_iommu_req_invalidate {
>  #define VIRTIO_IOMMU_FAULT_R_UNKNOWN 0
>  #define VIRTIO_IOMMU_FAULT_R_DOMAIN  1
>  #define VIRTIO_IOMMU_FAULT_R_MAPPING 2
> +#define VIRTIO_IOMMU_FAULT_R_WALK_EABT   3
> +#define VIRTIO_IOMMU_FAULT_R_PTE_FETCH   4
> +#define VIRTIO_IOMMU_FAULT_R_PERMISSION  5
> +#define VIRTIO_IOMMU_FAULT_R_ACCESS  6
> +#define VIRTIO_IOMMU_FAULT_R_OOR_ADDRESS 7
>  
>  #define VIRTIO_IOMMU_FAULT_F_READ(1 << 0)
>  #define VIRTIO_IOMMU_FAULT_F_WRITE   (1 << 1)
>  #define VIRTIO_IOMMU_FAULT_F_EXEC(1 << 2)
>  #define VIRTIO_IOMMU_FAULT_F_ADDRESS (1 << 8)
>  
> +#define VIRTIO_IOMMU_FAULT_F_DMA_UNRECOV 1
> +#define VIRTIO_IOMMU_FAULT_F_PAGE_REQ2

Currently all reported faults are unrecoverable, so to be consistent
DMA_UNRECOV should be 0. But I'd prefer ha

Re: [PATCH RFC v1 09/15] iommu/virtio: Update table format probing header

2021-03-03 Thread Jean-Philippe Brucker
On Fri, Jan 15, 2021 at 05:43:36PM +0530, Vivek Gautam wrote:
> Add info about asid_bits and additional flags to table format
> probing header.
> 
> Signed-off-by: Vivek Gautam 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Michael S. Tsirkin 
> Cc: Robin Murphy 
> Cc: Jean-Philippe Brucker 
> Cc: Eric Auger 
> Cc: Alex Williamson 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Liu Yi L 
> Cc: Lorenzo Pieralisi 
> Cc: Shameerali Kolothum Thodi 
> ---
>  include/uapi/linux/virtio_iommu.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
> index 43821e33e7af..8a0624bab4b2 100644
> --- a/include/uapi/linux/virtio_iommu.h
> +++ b/include/uapi/linux/virtio_iommu.h
> @@ -169,7 +169,10 @@ struct virtio_iommu_probe_pasid_size {
>  struct virtio_iommu_probe_table_format {
>   struct virtio_iommu_probe_property  head;
>   __le16  format;
> - __u8    reserved[2];
> + __le16  asid_bits;
> +
> + __le32  flags;

This struct should only contain the head and format fields. asid and flags
should go in a specialized structure - virtio_iommu_probe_pgt_arm64 in the
latest spec draft, where I dropped the asid_bits field in favor of an
"ASID16" flag.

Thanks,
Jean



Re: [PATCH RFC v1 08/15] iommu: Add asid_bits to arm smmu-v3 stage1 table info

2021-03-03 Thread Jean-Philippe Brucker
On Fri, Jan 15, 2021 at 05:43:35PM +0530, Vivek Gautam wrote:
> asid_bits data is required to prepare stage-1 tables for arm-smmu-v3.
> 
> Signed-off-by: Vivek Gautam 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Cc: Jean-Philippe Brucker 
> Cc: Eric Auger 
> Cc: Alex Williamson 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Liu Yi L 
> Cc: Lorenzo Pieralisi 
> Cc: Shameerali Kolothum Thodi 
> ---
>  include/uapi/linux/iommu.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 082d758dd016..96abbfc7c643 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -357,7 +357,7 @@ struct iommu_pasid_smmuv3 {
>   __u32   version;
>   __u8    s1fmt;
>   __u8    s1dss;
> - __u8    padding[2];
> + __u16   asid_bits;

Is this used anywhere?  This struct is passed from host userspace to host
kernel to attach the PASID table, so I don't think it needs an asid_bits
field.

Thanks,
Jean



Re: [PATCH RFC v1 06/15] iommu/virtio: Add headers for table format probing

2021-03-03 Thread Jean-Philippe Brucker
On Fri, Jan 15, 2021 at 05:43:33PM +0530, Vivek Gautam wrote:
> From: Jean-Philippe Brucker 
> 
> Add required UAPI defines for probing table format for underlying
> iommu hardware. The device may provide information about hardware
> tables and additional capabilities for each device.
> This allows the guest to correctly fabricate stage-1 page tables.
> 
> Signed-off-by: Jean-Philippe Brucker 
> [Vivek: Use a single "struct virtio_iommu_probe_table_format" rather
> than separate structures for page table and pasid table format.

Makes sense. I've integrated that into the spec draft, added more precise
documentation and modified some of the definitions.

The current draft is here:
https://jpbrucker.net/virtio-iommu/spec/virtio-iommu-v0.13.pdf
Posted on the list here
https://lists.oasis-open.org/archives/virtio-dev/202102/msg00012.html

>   Also update commit message.]
> Signed-off-by: Vivek Gautam 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Michael S. Tsirkin 
> Cc: Robin Murphy 
> Cc: Jean-Philippe Brucker 
> Cc: Eric Auger 
> Cc: Alex Williamson 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Liu Yi L 
> Cc: Lorenzo Pieralisi 
> Cc: Shameerali Kolothum Thodi 
> ---
>  include/uapi/linux/virtio_iommu.h | 44 ++-
>  1 file changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
> index 237e36a280cb..43821e33e7af 100644
> --- a/include/uapi/linux/virtio_iommu.h
> +++ b/include/uapi/linux/virtio_iommu.h
> @@ -2,7 +2,7 @@
>  /*
>   * Virtio-iommu definition v0.12
>   *
> - * Copyright (C) 2019 Arm Ltd.
> + * Copyright (C) 2019-2021 Arm Ltd.

Not strictly necessary. But if you're modifying this comment please also
remove the "v0.12" above.

>   */
>  #ifndef _UAPI_LINUX_VIRTIO_IOMMU_H
>  #define _UAPI_LINUX_VIRTIO_IOMMU_H
> @@ -111,6 +111,12 @@ struct virtio_iommu_req_unmap {
>  
>  #define VIRTIO_IOMMU_PROBE_T_NONE0
>  #define VIRTIO_IOMMU_PROBE_T_RESV_MEM1
> +#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK  2
> +#define VIRTIO_IOMMU_PROBE_T_INPUT_RANGE 3
> +#define VIRTIO_IOMMU_PROBE_T_OUTPUT_SIZE 4
> +#define VIRTIO_IOMMU_PROBE_T_PASID_SIZE  5
> +#define VIRTIO_IOMMU_PROBE_T_PAGE_TABLE_FMT  6
> +#define VIRTIO_IOMMU_PROBE_T_PASID_TABLE_FMT 7

Since there is a single struct we can have a single
VIRTIO_IOMMU_PROBE_T_TABLE_FORMAT.

>  
>  #define VIRTIO_IOMMU_PROBE_T_MASK0xfff
>  
> @@ -130,6 +136,42 @@ struct virtio_iommu_probe_resv_mem {
>   __le64  end;
>  };
>  
> +struct virtio_iommu_probe_page_size_mask {
> + struct virtio_iommu_probe_property  head;
> + __u8    reserved[4];
> + __le64  mask;
> +};
> +
> +struct virtio_iommu_probe_input_range {
> + struct virtio_iommu_probe_property  head;
> + __u8    reserved[4];
> + __le64  start;
> + __le64  end;
> +};
> +
> +struct virtio_iommu_probe_output_size {
> + struct virtio_iommu_probe_property  head;
> + __u8    bits;
> + __u8    reserved[3];
> +};
> +
> +struct virtio_iommu_probe_pasid_size {
> + struct virtio_iommu_probe_property  head;
> + __u8    bits;
> + __u8    reserved[3];
> +};
> +
> +/* Arm LPAE page table format */
> +#define VIRTIO_IOMMU_FOMRAT_PGTF_ARM_LPAE1

s/FOMRAT/FORMAT

> +/* Arm smmu-v3 type PASID table format */
> +#define VIRTIO_IOMMU_FORMAT_PSTF_ARM_SMMU_V3 2

These should go with the Arm-specific definitions in patches 11 and 14.

Thanks,
Jean

> +
> +struct virtio_iommu_probe_table_format {
> + struct virtio_iommu_probe_property  head;
> + __le16  format;
> + __u8    reserved[2];
> +};
> +
>  struct virtio_iommu_req_probe {
>   struct virtio_iommu_req_headhead;
>   __le32  endpoint;
> -- 
> 2.17.1
> 


Re: [PATCH RFC v1 05/15] iommu/arm-smmu-v3: Set sync op from consumer driver of cd-lib

2021-03-03 Thread Jean-Philippe Brucker
On Fri, Jan 15, 2021 at 05:43:32PM +0530, Vivek Gautam wrote:
> The change allows different consumers of arm-smmu-v3-cd-lib to set
> their respective sync op for pasid entries.
> 
> Signed-off-by: Vivek Gautam 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Cc: Jean-Philippe Brucker 
> Cc: Eric Auger 
> Cc: Alex Williamson 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Liu Yi L 
> Cc: Lorenzo Pieralisi 
> Cc: Shameerali Kolothum Thodi 
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c | 1 -
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c| 7 +++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
> index ec37476c8d09..acaa09acecdd 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
> @@ -265,7 +265,6 @@ struct iommu_vendor_psdtable_ops arm_cd_table_ops = {
>   .free= arm_smmu_free_cd_tables,
>   .prepare = arm_smmu_prepare_cd,
>   .write   = arm_smmu_write_ctx_desc,
> - .sync= arm_smmu_sync_cd,
>  };
>  
>  struct iommu_pasid_table *arm_smmu_register_cd_table(struct device *dev,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2f86c6ac42b6..0c644be22b4b 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1869,6 +1869,13 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>   if (ret)
>   goto out_free_cd_tables;
>  
> + /*
> +  * Strange to setup an op here?
> +  * cd-lib is the actual user of sync op, and therefore the platform
> +  * drivers should assign this sync/maintenance ops as per need.
> +  */
> + tbl->ops->sync = arm_smmu_sync_cd;
> +

Modifying a static struct from here doesn't feel right. I think the
interface should be roughly similar to io-pgtable since the principle is
the same. So the sync() op should be separate from arm_cd_table_ops since
it's a callback into the driver. Maybe pass it to
iommu_register_pasid_table().
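
Roughly what I have in mind (untested sketch, the registration prototype is
invented for illustration):

struct iommu_pasid_table *
iommu_register_pasid_table(enum pasid_table_fmt fmt, struct device *dev,
			   void (*sync)(void *cookie, int ssid, bool leaf),
			   void *cookie);

	/* In arm_smmu_domain_finalise_s1(), instead of patching the ops: */
	tbl = iommu_register_pasid_table(PASID_TABLE_ARM_SMMU_V3, smmu->dev,
					 arm_smmu_sync_cd, smmu_domain);

That keeps arm_cd_table_ops const and lets each consumer provide its own
maintenance callback, like io-pgtable does with struct iommu_flush_ops.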

Thanks,
Jean


Re: [PATCH RFC v1 04/15] iommu/arm-smmu-v3: Update CD base address info for user-space

2021-03-03 Thread Jean-Philippe Brucker
On Fri, Jan 15, 2021 at 05:43:31PM +0530, Vivek Gautam wrote:
> Update base address information in vendor pasid table info to pass that
> to user-space for stage1 table management.
> 
> Signed-off-by: Vivek Gautam 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Cc: Jean-Philippe Brucker 
> Cc: Eric Auger 
> Cc: Alex Williamson 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Liu Yi L 
> Cc: Lorenzo Pieralisi 
> Cc: Shameerali Kolothum Thodi 
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
> index 8a7187534706..ec37476c8d09 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
> @@ -55,6 +55,9 @@ static __le64 *arm_smmu_get_cd_ptr(struct iommu_vendor_psdtable_cfg *pst_cfg,
>   if (arm_smmu_alloc_cd_leaf_table(dev, l1_desc))
>   return NULL;
>  
> + if (s1cfg->s1fmt == STRTAB_STE_0_S1FMT_LINEAR)
> + pst_cfg->base = l1_desc->l2ptr_dma;
> +

This isn't the right place, because this path allocates second-level
tables for two-level tables. I don't think we need pst_cfg->base at all,
because for both linear and two-level tables, the base pointer is in
cdcfg->cdtab_dma, which can be read directly.
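
Something like this, if a helper is even needed (sketch, the name is made up):

static dma_addr_t arm_smmu_cd_table_base(struct arm_smmu_ctx_desc_cfg *cdcfg)
{
	/* cdtab_dma is the top-level table for both linear and 2-level formats */
	return cdcfg->cdtab_dma;
}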

Thanks,
Jean

>   l1ptr = cdcfg->cdtab + idx * CTXDESC_L1_DESC_DWORDS;
>   arm_smmu_write_cd_l1_desc(l1ptr, l1_desc);
>   /* An invalid L1CD can be cached */
> @@ -211,6 +214,9 @@ static int arm_smmu_alloc_cd_tables(struct iommu_vendor_psdtable_cfg *pst_cfg)
>   goto err_free_l1;
>   }
>  
> + if (s1cfg->s1fmt == STRTAB_STE_0_S1FMT_64K_L2)
> + pst_cfg->base = cdcfg->cdtab_dma;
> +
>   return 0;
>  
>  err_free_l1:
> -- 
> 2.17.1
> 


Re: [PATCH RFC v1 02/15] iommu: Add a simple PASID table library

2021-03-03 Thread Jean-Philippe Brucker
Hi Vivek,

Thanks again for working on this. I have a few comments but it looks
sensible overall.

Regarding the overall design, I was initially assigning page directories
instead of whole PASID tables, which would simplify the driver and host
implementation. A major complication, however, is that SMMUv3 accesses PASID
tables using a guest-physical address, so there is a messy negotiation
needed between host and guest when the host needs to allocate PASID
tables. Plus vSMMU needs PASID table assignment, so that's what the host
driver will implement.

On Fri, Jan 15, 2021 at 05:43:29PM +0530, Vivek Gautam wrote:
> Add a small API in iommu subsystem to handle PASID table allocation
> requests from different consumer drivers, such as a paravirtualized
> iommu driver. The API provides ops for allocating and freeing PASID
> table, writing to it and managing the table caches.
> 
> This library also provides for registering a vendor API that attaches
> to these ops. The vendor APIs would eventually perform arch level
> implementations for these PASID tables.

Although Arm might be the only vendor to ever use this, I think the
abstraction makes sense and isn't too invasive. Even if we called directly
into the SMMU driver from the virtio one, we'd still need patch 3 and
separate TLB invalidation ops.

> Signed-off-by: Vivek Gautam 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Cc: Jean-Philippe Brucker 
> Cc: Eric Auger 
> Cc: Alex Williamson 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Liu Yi L 
> Cc: Lorenzo Pieralisi 
> Cc: Shameerali Kolothum Thodi 
> ---
>  drivers/iommu/iommu-pasid-table.h | 134 ++
>  1 file changed, 134 insertions(+)
>  create mode 100644 drivers/iommu/iommu-pasid-table.h
> 
> diff --git a/drivers/iommu/iommu-pasid-table.h b/drivers/iommu/iommu-pasid-table.h
> new file mode 100644
> index ..bd4f57656f67
> --- /dev/null
> +++ b/drivers/iommu/iommu-pasid-table.h
> @@ -0,0 +1,134 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PASID table management for the IOMMU
> + *
> + * Copyright (C) 2021 Arm Ltd.
> + */
> +#ifndef __IOMMU_PASID_TABLE_H
> +#define __IOMMU_PASID_TABLE_H
> +
> +#include 
> +
> +#include "arm/arm-smmu-v3/arm-smmu-v3.h"
> +
> +enum pasid_table_fmt {
> + PASID_TABLE_ARM_SMMU_V3,
> + PASID_TABLE_NUM_FMTS,
> +};
> +
> +/**
> + * struct arm_smmu_cfg_info - arm-smmu-v3 specific configuration data
> + *
> + * @s1_cfg: arm-smmu-v3 stage1 config data
> + * @feat_flag: features supported by arm-smmu-v3 implementation
> + */
> +struct arm_smmu_cfg_info {
> + struct arm_smmu_s1_cfg  *s1_cfg;
> + u32 feat_flag;
> +};
> +
> +/**
> + * struct iommu_vendor_psdtable_cfg - Configuration data for PASID tables
> + *
> + * @iommu_dev: device performing the DMA table walks
> + * @fmt: The PASID table format
> + * @base: DMA address of the allocated table, set by the vendor driver
> + * @cfg: arm-smmu-v3 specific config data
> + */
> +struct iommu_vendor_psdtable_cfg {
> + struct device   *iommu_dev;
> + enum pasid_table_fmt    fmt;
> + dma_addr_t  base;
> + union {
> + struct arm_smmu_cfg_info    cfg;

For the union to be extensible, that field should be called "arm" or
something like that.

Thanks,
Jean

> + } vendor;
> +};
> +
> +struct iommu_vendor_psdtable_ops;
> +
> +/**
> + * struct iommu_pasid_table - describes a set of PASID tables
> + *
> + * @cookie: An opaque token provided by the IOMMU driver and passed back to any
> + * callback routine.
> + * @cfg: A copy of the PASID table configuration
> + * @ops: The PASID table operations in use for this set of page tables
> + */
> +struct iommu_pasid_table {
> + void    *cookie;
> + struct iommu_vendor_psdtable_cfg    cfg;
> + struct iommu_vendor_psdtable_ops    *ops;
> +};
> +
> +#define pasid_table_cfg_to_table(pst_cfg) \
> + container_of((pst_cfg), struct iommu_pasid_table, cfg)
> +
> +struct iommu_vendor_psdtable_ops {
> + int (*alloc)(struct iommu_vendor_psdtable_cfg *cfg);
> + void (*free)(struct iommu_vendor_psdtable_cfg *cfg);
> + void (*prepare)(struct iommu_vendor_psdtable_cfg *cfg,
> + struct io_pgtable_cfg *pgtbl_cfg, u32 asid);
> + int (*write)(struct iommu_vendor_psdtable_cfg *cfg, int ssid,
> +  void *cookie);
> + void (*sync)(void *cookie, int ssid, bool leaf);
> +};
> +
> +static inline int iommu_psdtable_alloc(struct iommu_pasid_table *tbl,
> +struct iommu_vendor_psdtable_cfg *cfg)
> +

Re: [PATCH v6 08/12] fork: Clear PASID for new mm

2021-03-02 Thread Jean-Philippe Brucker
On Mon, Mar 01, 2021 at 03:00:11PM -0800, Jacob Pan wrote:
> > functionality is not a problem without this patch on x86. But I think
> I feel the reason that x86 doesn't care is that mm->pasid is not used
> unless bind_mm is called.

I think vt-d also maintains the global_svm_list, which tells whether a
PASID is already allocated for the mm. The SMMU driver relies only on
iommu_sva_alloc_pasid() for this.

> For the fork children, even if mm->pasid is non-zero,
> it has no effect since it is not loaded onto MSRs.
> Perhaps you could also add a check or WARN_ON(!mm->pasid) in load_pasid()?
> 
> > we do need to have this patch in the kernel because PASID is per addr
> > space and two addr spaces shouldn't have the same PASID.
> > 
> Agreed.
> 
> > Who will accept this patch?

It's not clear from get_maintainers.pl, but I guess it should go via
linux-mm. Since the list wasn't cc'd on the original patch, I resent it.

Thanks,
Jean


[PATCH] mm/fork: Clear PASID for new mm

2021-03-02 Thread Jean-Philippe Brucker
From: Fenghua Yu 

When a new mm is created, its PASID should be cleared, i.e. the PASID is
initialized to its init state 0 on both ARM and X86.

Reviewed-by: Tony Luck 
Signed-off-by: Fenghua Yu 
Signed-off-by: Jean-Philippe Brucker 
---
This patch was part of the series introducing mm->pasid, but got lost
along the way [1]. It still makes sense to have it, because each address
space has a different PASID. And the IOMMU code in iommu_sva_alloc_pasid()
expects the pasid field of a new mm struct to be cleared.

[1] 
https://lore.kernel.org/linux-iommu/ydgh53acqht+t...@otcwcpicx3.sc.intel.com/
---
 include/linux/mm_types.h | 1 +
 kernel/fork.c| 8 
 2 files changed, 9 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0974ad501a47..6613b26a8894 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 #endif
 #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
 
+#define INIT_PASID 0
 
 struct address_space;
 struct mem_cgroup;
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..808af2cc8ab6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -994,6 +994,13 @@ static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
 #endif
 }
 
+static void mm_init_pasid(struct mm_struct *mm)
+{
+#ifdef CONFIG_IOMMU_SUPPORT
+   mm->pasid = INIT_PASID;
+#endif
+}
+
 static void mm_init_uprobes_state(struct mm_struct *mm)
 {
 #ifdef CONFIG_UPROBES
@@ -1024,6 +1031,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm_init_cpumask(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);
+   mm_init_pasid(mm);
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
-- 
2.30.1



Re: [PATCH v6 08/12] fork: Clear PASID for new mm

2021-02-24 Thread Jean-Philippe Brucker
Hi Fenghua,

[Trimmed the Cc list]

On Mon, Jul 13, 2020 at 04:48:03PM -0700, Fenghua Yu wrote:
> When a new mm is created, its PASID should be cleared, i.e. the PASID is
> initialized to its init state 0 on both ARM and X86.

I just noticed this patch was dropped in v7, and am wondering whether we
could still upstream it. Does x86 need a child with a new address space
(!CLONE_VM) to inherit the PASID of the parent?  That doesn't make much
sense with regard to IOMMU structures - same PASID indexing multiple PGDs?

Currently iommu_sva_alloc_pasid() assumes mm->pasid is always initialized
to 0 and fails on forked tasks. I'm trying to figure out how to fix this.
Could we clear the pasid on fork or does it break the x86 model?
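
For context, the allocation path currently does roughly this (paraphrased,
not the exact code):

	mutex_lock(&iommu_sva_lock);
	if (mm->pasid) {
		/*
		 * Assumes a previous bind already allocated this PASID for
		 * this mm. A forked child inherits the parent's nonzero
		 * pasid and ends up here without ever having bound.
		 */
		ioasid_get(mm->pasid);
	} else {
		pasid = ioasid_alloc(&iommu_sva_pasid, min, max, mm);
		if (pasid == INVALID_IOASID)
			ret = -ENOMEM;
		else
			mm->pasid = pasid;
	}
	mutex_unlock(&iommu_sva_lock);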

Thanks,
Jean

> 
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
> v2:
> - Add this patch to initialize PASID value for a new mm.
> 
>  include/linux/mm_types.h | 2 ++
>  kernel/fork.c| 8 
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d61285cfe027..d60d2ec10881 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -22,6 +22,8 @@
>  #endif
>  #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
>  
> +/* Initial PASID value is 0. */
> +#define INIT_PASID   0
>  
>  struct address_space;
>  struct mem_cgroup;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 142b23645d82..43b5f112604d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1007,6 +1007,13 @@ static void mm_init_owner(struct mm_struct *mm, struct 
> task_struct *p)
>  #endif
>  }
>  
> +static void mm_init_pasid(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_IOMMU_SUPPORT
> + mm->pasid = INIT_PASID;
> +#endif
> +}
> +
>  static void mm_init_uprobes_state(struct mm_struct *mm)
>  {
>  #ifdef CONFIG_UPROBES
> @@ -1035,6 +1042,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
> struct task_struct *p,
>   mm_init_cpumask(mm);
>   mm_init_aio(mm);
>   mm_init_owner(mm, p);
> + mm_init_pasid(mm);
>   RCU_INIT_POINTER(mm->exe_file, NULL);
>   mmu_notifier_subscriptions_init(mm);
>   init_tlb_flush_pending(mm);
> -- 
> 2.19.1
> 


Re: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-02-05 Thread Jean-Philippe Brucker
Hi,

On Thu, Feb 04, 2021 at 06:52:10AM +, Tian, Kevin wrote:
> > >>> The static pinning and mapping problem in VFIO and possible solutions
> > >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > >>> page fault support for VFIO devices. Different from those relatively
> > >>> complicated software approaches such as presenting a vIOMMU that
> > >> provides
> > >>> the DMA buffer information (might include para-virtualized
> > optimizations),

I'm curious about the performance difference between this and the
map/unmap vIOMMU, as well as the coIOMMU. This is probably a lot faster
but those don't depend on IOPF which is a pretty rare feature at the
moment.

[...]
> > > In reality, many
> > > devices allow I/O faulting only in selective contexts. However, there
> > > is no standard way (e.g. PCISIG) for the device to report whether
> > > arbitrary I/O fault is allowed. Then we may have to maintain device
> > > specific knowledge in software, e.g. in an opt-in table to list devices
> > > which allows arbitrary faults. For devices which only support selective
> > > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > > or a mdev wrapper) might be necessary to help lock down non-faultable
> > > mappings and then enable faulting on the rest mappings.
> > 
> > For devices which only support selective faulting, they could tell it to the
> > IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
> 
> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> selectively page-pinning. The matter is that 'they' imply some device
> specific logic to decide which pages must be pinned and such knowledge
> is outside of VFIO.
> 
> From enabling p.o.v we could possibly do it in a phased approach. First
> handles devices which tolerate arbitrary DMA faults, and then extends
> to devices with selective-faulting. The former is simpler, but with one
> main open whether we want to maintain such device IDs in a static
> table in VFIO or rely on some hints from other components (e.g. PF
> driver in VF assignment case). Let's see how Alex thinks about it.

Do you think selective-faulting will be the norm, or only a problem for
initial IOPF implementations?  To me it's the selective-faulting kind of
device that will be the odd one out, but that's pure speculation. Either
way maintaining a device list seems like a pain.

[...]
> Yes, it's in plan but just not happened yet. We are still focusing on guest
> SVA part thus only the 1st-level page fault (+Yi/Jacob). It's always welcomed 
> to collaborate/help if you have time. 

By the way the current fault report API is missing a way to invalidate
partial faults: when the IOMMU device's PRI queue overflows, it may
auto-respond to page request groups that were already partially reported
by the IOMMU driver. Upon detecting an overflow, the IOMMU driver needs to
tell all fault consumers to discard their partial groups.
iopf_queue_discard_partial() [1] does this for the internal IOPF handler
but we have nothing for the lower-level fault handler at the moment. And
it gets more complicated when injecting IOPFs to guests, we'd need a
mechanism to recall partial groups all the way through kernel->userspace
and userspace->guest.
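
To illustrate, the lower-level handler would need to keep something like
this per device, and flush it on overflow (entirely hypothetical, names
invented):

struct iopf_partial_tracker {
	struct list_head	partial;	/* faults seen without LAST_PAGE */
	struct mutex		lock;
};

static void iopf_discard_partial_groups(struct iopf_partial_tracker *t)
{
	struct iopf_fault *iopf, *next;

	mutex_lock(&t->lock);
	list_for_each_entry_safe(iopf, next, &t->partial, list)
		kfree(iopf);
	INIT_LIST_HEAD(&t->partial);
	mutex_unlock(&t->lock);
}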

Shenming suggests [2] to also use the IOPF handler for IOPFs managed by
device drivers. It's worth considering in my opinion because we could hold
partial groups within the kernel and only report full groups to device
drivers (and guests). In addition we'd consolidate tracking of IOPFs,
since they're done both by iommu_report_device_fault() and the IOPF
handler at the moment.

Note that I plan to upstream the IOPF patch [1] as is because it was
already in good shape for 5.12, and consolidating the fault handler will
require some thinking.

Thanks,
Jean


[1] 
https://lore.kernel.org/linux-iommu/20210127154322.3959196-7-jean-phili...@linaro.org/
[2] 
https://lore.kernel.org/linux-iommu/f79f06be-e46b-a65a-3951-3e7dbfa66...@huawei.com/


Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Jean-Philippe Brucker
Hi Keqian,

On Fri, Feb 05, 2021 at 05:13:50PM +0800, Keqian Zhu wrote:
> > We need to accommodate the firmware override as well if we need this to be 
> > meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
> > stack[1].
> Robin, Thanks for pointing it out.
> 
> Jean, I see that the IORT HTTU flag overrides the hardware register info 
> unconditionally. I have some concern about it:
> 
> If the override flag has HTTU but hardware doesn't support it, then driver 
> will use this feature but receive access fault or permission fault from SMMU 
> unexpectedly.
> 1) If IOPF is not supported, then kernel can not work normally.
> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
> based dma dirty tracking (this series).
> 
> As the IORT spec doesn't give an explicit explanation for HTTU override, can 
> we comprehend it as a mask for HTTU related hardware register?

To me "Overrides the value of SMMU_IDR0.HTTU" is clear enough: disregard
the value of SMMU_IDR0.HTTU and use the one specified by IORT instead. And
that's both ways, since there is no validity mask for the IORT value: if
there is an IORT table, always ignore SMMU_IDR0.HTTU.

That's how the SMMU driver implements the COHACC bit, which has the same
wording in IORT. So I think we should implement HTTU the same way.
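
For reference, this is how the driver handles COHACC today, and HTTU would
follow the same shape (the HTTU flag and feature bits below are hypothetical,
taken from the posted patches rather than mainline):

	/* In arm_smmu_device_acpi_probe() */
	if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
		smmu->features |= ARM_SMMU_FEAT_COHERENCY;

	/* Hypothetical HTTU equivalent: the IORT value wins both ways */
	smmu->features &= ~(ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD);
	if (iort_smmu->flags & ACPI_IORT_SMMU_V3_HTTU_OVERRIDE)
		smmu->features |= ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD;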

One complication is that there is no equivalent override for device tree.
I think it can be added later if necessary, because unlike IORT it can be
tri-state (property not present, overridden positive, overridden negative).

Thanks,
Jean



Re: [PATCH v13 03/15] iommu/arm-smmu-v3: Maintain a SID->device structure

2021-02-01 Thread Jean-Philippe Brucker
On Mon, Feb 01, 2021 at 08:26:41PM +0800, Keqian Zhu wrote:
> > +static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
> > + struct arm_smmu_master *master)
> > +{
> > +   int i;
> > +   int ret = 0;
> > +   struct arm_smmu_stream *new_stream, *cur_stream;
> > +   struct rb_node **new_node, *parent_node = NULL;
> > +   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
> > +
> > +   master->streams = kcalloc(fwspec->num_ids,
> > + sizeof(struct arm_smmu_stream), GFP_KERNEL);
> > +   if (!master->streams)
> > +   return -ENOMEM;
> > +   master->num_streams = fwspec->num_ids;
> This is not rolled back on failure.

No need, the caller frees master

> > +
> > +   mutex_lock(&smmu->streams_mutex);
> > +   for (i = 0; i < fwspec->num_ids && !ret; i++) {
> Checking ret here makes it hard to decide the start index of the rollback.
> 
> If we fail at here, then start index is (i-2).
> If we fail in the loop, then start index is (i-1).
> 
[...]
> > +   if (ret) {
> > +   for (; i > 0; i--)
> should be (i >= 0)?
> And the start index seems not correct.

Indeed, this whole bit is wrong. I'll fix it while resending the IOPF
series.

Thanks,
Jean



Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread Jean-Philippe Brucker
On Mon, Jan 18, 2021 at 10:55:52AM +, John Garry wrote:
> On 18/01/2021 10:08, Jean-Philippe Brucker wrote:
> > > > Any idea why that's happening?  This fix seems ok but if we're expecting
> > > > allocation failures for the loaded magazine then we could easily get it
> > > > for cpu_rcaches too, and get a similar abort at runtime.
> > > It's not specifically that we expect them (allocation failures for the
> > > loaded magazine), rather we should make safe against it.
> > > 
> > > So could you be more specific in your concern for the cpu_rcache failure?
> > > cpu_rcache magazine assignment comes from this logic.
> > If this fails:
> > 
> > drivers/iommu/iova.c:847: rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
> > 
> > then we'll get an Oops in __iova_rcache_get(). So if we're making the
> > module safer against magazine allocation failure, shouldn't we also
> > protect against cpu_rcaches allocation failure?
> 
> Ah, gotcha. So we have the WARN there, but that's not much use as this would
> still crash, as you say.
> 
> So maybe we can embed the cpu rcaches in iova_domain struct, to avoid the
> separate (failable) cpu rcache allocation.
> 
> Alternatively, we could add NULL checks __iova_rcache_get() et al for this
> allocation failure but that's not preferable as it's fastpath.
> 
> Finally so we could pass back an error code from init_iova_rcache() to its
> only caller, init_iova_domain(); but that has multiple callers and would
> need to be fixed up.
> 
> Not sure which is best or on other options.

I would have initially gone with option 2, which seems the simplest, but I
don't have a setup to measure that overhead.
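
i.e. something like this at the top of __iova_rcache_get() and
__iova_rcache_insert() (sketch; it does add a test to the fast path):

	if (!rcache->cpu_rcaches)
		return 0;	/* percpu alloc failed, fall back to the rbtree */
	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);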

Thanks,
Jean


Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread Jean-Philippe Brucker
On Mon, Jan 18, 2021 at 09:24:17AM +, John Garry wrote:
> On 15/01/2021 17:30, Jean-Philippe Brucker wrote:
> > On Thu, Dec 10, 2020 at 02:23:08AM +0800, John Garry wrote:
> > > A similar crash to the following could be observed if initial CPU rcache
> > > magazine allocations fail in init_iova_rcaches():
> > 
> 
> thanks for having a look
> 
> > Any idea why that's happening?  This fix seems ok but if we're expecting
> > allocation failures for the loaded magazine then we could easily get it
> > for cpu_rcaches too, and get a similar abort at runtime.
> 
> It's not specifically that we expect them (allocation failures for the
> loaded magazine), rather we should make safe against it.
> 
> So could you be more specific in your concern for the cpu_rcache failure?
> cpu_rcache magazine assignment comes from this logic.

If this fails:

drivers/iommu/iova.c:847: rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());

then we'll get an Oops in __iova_rcache_get(). So if we're making the
module safer against magazine allocation failure, shouldn't we also
protect against cpu_rcaches allocation failure?

Thanks,
Jean

> 
> Anyway, logic like "if not full" or "if not empty" is poor as the outcome
> for NULL is ambiguous (maybe there's a better word) and the code is not safe
> against it, and so I replace with "if space" or "if have an IOVA",
> respectively.
> 
> Thanks,
> John
> 


Re: [PATCH v4 3/3] iommu/iova: Flush CPU rcache for when a depot fills

2021-01-15 Thread Jean-Philippe Brucker
On Thu, Dec 10, 2020 at 02:23:09AM +0800, John Garry wrote:
> Leizhen reported some time ago that IOVA performance may degrade over time
> [0], but unfortunately his solution to fix this problem was not given
> attention.
> 
> To summarize, the issue is that as time goes by, the CPU rcache and depot
> rcache continue to grow. As such, IOVA RB tree access time also continues
> to grow.
> 
> At a certain point, a depot may become full, and also some CPU rcaches may
> also be full when inserting another IOVA is attempted. For this scenario,
> currently the "loaded" CPU rcache is freed and a new one is created. This
> freeing means that many IOVAs in the RB tree need to be freed, which
> makes IO throughput performance fall off a cliff in some storage scenarios:
> 
> Jobs: 12 (f=12): [] [0.0% done] [6314MB/0KB/0KB /s] [1616K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [5669MB/0KB/0KB /s] [1451K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6673MB/0KB/0KB /s] [1708K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6761MB/0KB/0KB /s] [1731K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6685MB/0KB/0KB /s] [1711K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6178MB/0KB/0KB /s] [1582K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [6731MB/0KB/0KB /s] [1723K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [2387MB/0KB/0KB /s] [611K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [2689MB/0KB/0KB /s] [688K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [2278MB/0KB/0KB /s] [583K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [1288MB/0KB/0KB /s] [330K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [1632MB/0KB/0KB /s] [418K/0/0 iops]
> Jobs: 12 (f=12): [] [0.0% done] [1765MB/0KB/0KB /s] [452K/0/0 iops]
> 
> And continue in this fashion, without recovering. Note that in this
> example it was required to wait 16 hours for this to occur. Also note that
> IO throughput also gradually becomes more unstable leading up to
> this point.
> 
> This problem is only seen for non-strict mode. For strict mode, the rcaches
> stay quite compact.

It would be good to understand why the rcache doesn't stabilize. It could be
a bug, or it might just need some tuning.

In strict mode, if a driver does Alloc-Free-Alloc and the first alloc
misses the rcache, the second allocation hits it. The same sequence in
non-strict mode misses the cache twice, because the IOVA is added to the
flush queue on Free. 

So rather than AFAFAF.. we get AAA..FFF.., only once the fq_timer triggers
or the FQ is full. Interestingly the FQ size is 2x IOVA_MAG_SIZE, so we
could allocate 2 magazines worth of fresh IOVAs before alloc starts
hitting the cache. If a job allocates more than that, some magazines are
going to the depot, and with multi-CPU jobs those will get used on other
CPUs during the next alloc bursts, causing the progressive increase in
rcache consumption. I wonder if setting IOVA_MAG_SIZE > IOVA_FQ_SIZE helps
reuse of IOVAs?
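
For reference, the constants behind that arithmetic, if I'm reading the
code right:

	/* include/linux/iova.h */
	#define IOVA_FQ_SIZE	256	/* entries per flush queue */

	/* drivers/iommu/iova.c */
	#define IOVA_MAG_SIZE	128	/* IOVAs per rcache magazine */

so one full flush queue can release enough IOVAs to fill both the loaded
and prev magazines of a CPU before any of them get reused.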

Then again I haven't worked out the details, might be entirely wrong. I'll
have another look next week.

Thanks,
Jean

> As a solution to this issue, judge that the IOVA caches have grown too big
> when cached magazines need to be freed, and just flush all the CPU rcaches
> instead.
> 
> The depot rcaches, however, are not flushed, as they can be used to
> immediately replenish active CPUs.
> 
> In future, some IOVA compaction could be implemented to solve the
> instability issue, which I figure could be quite complex to implement.
> 
> [0] 
> https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leiz...@huawei.com/
> 
> Analyzed-by: Zhen Lei 
> Reported-by: Xiang Chen 
> Tested-by: Xiang Chen 
> Signed-off-by: John Garry 
> Reviewed-by: Zhen Lei 
> ---
>  drivers/iommu/iova.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 732ee687e0e2..39b7488de8bb 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -841,7 +841,6 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>struct iova_rcache *rcache,
>unsigned long iova_pfn)
>  {
> - struct iova_magazine *mag_to_free = NULL;
>   struct iova_cpu_rcache *cpu_rcache;
>   bool can_insert = false;
>   unsigned long flags;
> @@ -863,13 +862,12 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>   if (cpu_rcache->loaded)
>   

Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-15 Thread Jean-Philippe Brucker
On Thu, Dec 10, 2020 at 02:23:08AM +0800, John Garry wrote:
> A similar crash to the following could be observed if initial CPU rcache
> magazine allocations fail in init_iova_rcaches():

Any idea why that's happening?  This fix seems ok but if we're expecting
allocation failures for the loaded magazine then we could easily get it
for cpu_rcaches too, and get a similar abort at runtime.

Thanks,
Jean

> Unable to handle kernel NULL pointer dereference at virtual address 
> 
> Mem abort info:
> 
>   free_iova_fast+0xfc/0x280
>   iommu_dma_free_iova+0x64/0x70
>   __iommu_dma_unmap+0x9c/0xf8
>   iommu_dma_unmap_sg+0xa8/0xc8
>   dma_unmap_sg_attrs+0x28/0x50
>   cq_thread_v3_hw+0x2dc/0x528
>   irq_thread_fn+0x2c/0xa0
>   irq_thread+0x130/0x1e0
>   kthread+0x154/0x158
>   ret_from_fork+0x10/0x34
> 
> The issue is that expression !iova_magazine_full(NULL) evaluates true; this
> falls over in __iova_rcache_insert() when we attempt to cache a mag and
> cpu_rcache->loaded == NULL:
> 
> if (!iova_magazine_full(cpu_rcache->loaded)) {
>   can_insert = true;
> ...
> 
> if (can_insert)
>   iova_magazine_push(cpu_rcache->loaded, iova_pfn);
> 
> As above, can_insert is evaluated true, which it shouldn't be, and we try
> to insert pfns in a NULL mag, which is not safe.
> 
> To avoid this, stop using double-negatives, like !iova_magazine_full() and
> !iova_magazine_empty(), and use positive tests, like
> iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
> can safely deal with cpu_rcache->{loaded, prev} = NULL.
> 
> Signed-off-by: John Garry 
> Tested-by: Xiang Chen 
> Reviewed-by: Zhen Lei 

> ---
>  drivers/iommu/iova.c | 29 +
>  1 file changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index cf1aacda2fe4..732ee687e0e2 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -767,14 +767,18 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
>   mag->size = 0;
>  }
>  
> -static bool iova_magazine_full(struct iova_magazine *mag)
> +static bool iova_magazine_has_space(struct iova_magazine *mag)
>  {
> - return (mag && mag->size == IOVA_MAG_SIZE);
> + if (!mag)
> + return false;
> + return mag->size < IOVA_MAG_SIZE;
>  }
>  
> -static bool iova_magazine_empty(struct iova_magazine *mag)
> +static bool iova_magazine_has_pfns(struct iova_magazine *mag)
>  {
> - return (!mag || mag->size == 0);
> + if (!mag)
> + return false;
> + return mag->size;
>  }
>  
>  static unsigned long iova_magazine_pop(struct iova_magazine *mag,
> @@ -783,7 +787,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>   int i;
>   unsigned long pfn;
>  
> - BUG_ON(iova_magazine_empty(mag));
> + BUG_ON(!iova_magazine_has_pfns(mag));
>  
>   /* Only fall back to the rbtree if we have no suitable pfns at all */
>   for (i = mag->size - 1; mag->pfns[i] > limit_pfn; i--)
> @@ -799,7 +803,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>  
>  static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
>  {
> - BUG_ON(iova_magazine_full(mag));
> + BUG_ON(!iova_magazine_has_space(mag));
>  
>   mag->pfns[mag->size++] = pfn;
>  }
> @@ -845,9 +849,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>   cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>   spin_lock_irqsave(&cpu_rcache->lock, flags);
>  
> - if (!iova_magazine_full(cpu_rcache->loaded)) {
> + if (iova_magazine_has_space(cpu_rcache->loaded)) {
>   can_insert = true;
> - } else if (!iova_magazine_full(cpu_rcache->prev)) {
> + } else if (iova_magazine_has_space(cpu_rcache->prev)) {
>   swap(cpu_rcache->prev, cpu_rcache->loaded);
>   can_insert = true;
>   } else {
> @@ -856,8 +860,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>   if (new_mag) {
>   spin_lock(&rcache->lock);
>   if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> - rcache->depot[rcache->depot_size++] =
> - cpu_rcache->loaded;
> + if (cpu_rcache->loaded)
> + rcache->depot[rcache->depot_size++] =
> + cpu_rcache->loaded;
>   } else {
>   mag_to_free = cpu_rcache->loaded;
>   }
> @@ -908,9 +913,9 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
>   cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>   spin_lock_irqsave(&cpu_rcache->lock, flags);
>  
> - if (!iova_magazine_empty(cpu_rcache->loaded)) {
> + if (iova_magazine_has_pfns(cpu_rcache->loaded)) {
>   has_pfn = true;
> - } else 

Re: [PATCH v4 1/3] iommu/iova: Add free_all_cpu_cached_iovas()

2021-01-15 Thread Jean-Philippe Brucker
On Thu, Dec 10, 2020 at 02:23:07AM +0800, John Garry wrote:
> Add a helper function to free the CPU rcache for all online CPUs.
> 
> There also exists a function of the same name in
> drivers/iommu/intel/iommu.c, but the parameters are different, and there
> should be no conflict.
> 
> Signed-off-by: John Garry 
> Tested-by: Xiang Chen 
> Reviewed-by: Zhen Lei 

Reviewed-by: Jean-Philippe Brucker 

(unless we find a better solution for patch 3)

> ---
>  drivers/iommu/iova.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index f9c35852018d..cf1aacda2fe4 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -238,6 +238,14 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>   return -ENOMEM;
>  }
>  
> +static void free_all_cpu_cached_iovas(struct iova_domain *iovad)
> +{
> + unsigned int cpu;
> +
> + for_each_online_cpu(cpu)
> + free_cpu_cached_iovas(cpu, iovad);
> +}
> +
>  static struct kmem_cache *iova_cache;
>  static unsigned int iova_cache_users;
>  static DEFINE_MUTEX(iova_cache_mutex);
> @@ -435,15 +443,12 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
>  retry:
>   new_iova = alloc_iova(iovad, size, limit_pfn, true);
>   if (!new_iova) {
> - unsigned int cpu;
> -
>   if (!flush_rcache)
>   return 0;
>  
>   /* Try replenishing IOVAs by flushing rcache. */
>   flush_rcache = false;
> - for_each_online_cpu(cpu)
> - free_cpu_cached_iovas(cpu, iovad);
> + free_all_cpu_cached_iovas(iovad);
>   free_global_cached_iovas(iovad);
>   goto retry;
>   }
> -- 
> 2.26.2
> 
> ___
> iommu mailing list
> io...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v13 07/15] iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs

2021-01-14 Thread Jean-Philippe Brucker
Hi Eric,

On Thu, Jan 14, 2021 at 05:58:27PM +0100, Auger Eric wrote:
> >>  The uacce-devel branches from
> >>> https://github.com/Linaro/linux-kernel-uadk do provide this at the moment
> >>> (they track the latest sva/zip-devel branch
> >>> https://jpbrucker.net/git/linux/ which is roughly based on mainline.)
> As I plan to respin shortly, please could you confirm the best branch to
> rebase on still is that one (uacce-devel from the linux-kernel-uadk git
> repo). Is it up to date? Commits seem to be quite old there.

Right, I meant the uacce-devel-X branches. The uacce-devel-5.11 branch
currently has the latest patches.

Thanks,
Jean


Re: [PATCH bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-13 Thread Jean-Philippe Brucker
On Wed, Jan 13, 2021 at 10:21:31AM +, Qais Yousef wrote:
> On 01/12/21 12:07, Andrii Nakryiko wrote:
> > > > > $ sudo ./test_progs -v -t module_attach
> > > >
> > > > use -vv when debugging stuff like that with test_progs, it will output
> > > > libbpf detailed logs, that often are very helpful
> > >
> > > I tried that but it didn't help me. Full output is here
> > >
> > > https://paste.debian.net/1180846
> > >
> > 
> > It did help a bit for me to make sure that you have bpf_testmod
> > properly loaded and its BTF was found, so the problem is somewhere
> > else. Also, given load succeeded and attach failed with OPNOTSUPP, I
> > suspect you are missing some of FTRACE configs, which seems to be
> > missing from selftests's config as well. Check that you have
> > CONFIG_FTRACE=y and CONFIG_DYNAMIC_FTRACE=y, and you might need some
> > more. See [0] for a real config we are using to run all tests in
> > libbpf CI. If you figure out what you were missing, please also
> > contribute a patch to selftests' config.
> > 
> >   [0] 
> > https://github.com/libbpf/libbpf/blob/master/travis-ci/vmtest/configs/latest.config
> 
> Yeah that occurred to me too. I do have all necessary FTRACE options enabled,
> including DYNAMIC_FTRACE. I think I did try enabling fault injection too just
> in case. I have CONFIG_FAULT_INJECTION=y and 
> CONFIG_FUNCTION_ERROR_INJECTION=y.

Could it come from the lack of fentry support on arm64 (or are you testing
on x86)? Since the arm64 JIT doesn't have trampoline support at the moment, a
lot of bpf selftests fail with ENOTSUPP.

Thanks,
Jean


Re: [PATCH] PCI: Add a quirk to enable SVA for HiSilicon chip

2021-01-13 Thread Jean-Philippe Brucker
On Wed, Jan 13, 2021 at 08:05:11PM +0800, Zhangfei Gao wrote:
> > > + /* Device-tree can set the stall property */
> > > + if (!pdev->dev.of_node &&
> > > + device_add_properties(&pdev->dev, properties))
> > Does this mean "dma-can-stall" *can* be set via DT, and if it is, this
> > quirk is not needed?  So is this quirk basically a workaround for an
> > old or broken DT?
> The quirk is still needed for uefi case, since uefi can not describe the
> endpoints (peripheral devices).

Yes, this comment isn't very clear. How about
/*
 * Set the dma-can-stall property on ACPI platforms. Device tree
 * can set it directly.
 */ 

> > 
> > > + pci_warn(pdev, "could not add stall property");
> > > +}
> > > +
> > Remove this blank line to follow the style of the rest of the file.
> > 
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_HUAWEI, 0xa250, quirk_huawei_pcie_sva);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_HUAWEI, 0xa251, quirk_huawei_pcie_sva);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_HUAWEI, 0xa255, quirk_huawei_pcie_sva);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_HUAWEI, 0xa256, quirk_huawei_pcie_sva);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_HUAWEI, 0xa258, quirk_huawei_pcie_sva);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_HUAWEI, 0xa259, quirk_huawei_pcie_sva);
> > > +
> > >   /*
> > >* It's possible for the MSI to get corrupted if SHPC and ACPI are used
> > >* together on certain PXH-based systems.
> 
> How about changes like this
> 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 68f53f7..886ea26 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2466,6 +2466,9 @@ static int arm_smmu_enable_pasid(struct
> arm_smmu_master *master)
>  if (num_pasids <= 0)
>      return num_pasids;
> 
> +    if (master->stall_enabled)
> +        pdev->pasid_no_tlp = 1;
> +

From the SMMU perspective there is no relation between stall and pasid, so
I don't think this makes a lot of sense. Could we instead set pasid_no_tlp
for the list of device IDs above?
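
i.e. keep it in the quirk itself (sketch):

static void quirk_huawei_pcie_sva(struct pci_dev *pdev)
{
	/* These endpoints use stall and never send PASID-prefixed TLPs */
	pdev->pasid_no_tlp = 1;

	/* ... existing dma-can-stall property setup ... */
}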

I agree with splitting the patches. PASID support for SMMUv3 is upstream,
but the introduction of dma-can-stall, which this depends on, is still
pending on the list.

Thanks,
Jean


Re: [PATCH AUTOSEL 5.4 08/10] selftests/bpf: Fix array access with signed variable test

2020-12-20 Thread Jean-Philippe Brucker
Hi,

On Sat, Dec 19, 2020 at 10:34:55PM -0500, Sasha Levin wrote:
> From: Jean-Philippe Brucker 
> 
> [ Upstream commit 77ce220c0549dcc3db8226c61c60e83fc59dfafc ]
> 
> The test fails because of a recent fix to the verifier, even though this

That fix is commit b02709587ea3 ("bpf: Fix propagation of 32-bit signed
bounds from 64-bit bounds.") upstream, which only needed backport to 5.9.
So although backporting this patch to 5.4 shouldn't break anything, I
wouldn't bother. 

Thanks,
Jean

> program is valid. In details what happens is:
> 
> 7: (61) r1 = *(u32 *)(r0 +0)
> 
> Load a 32-bit value, with signed bounds [S32_MIN, S32_MAX]. The bounds
> of the 64-bit value are [0, U32_MAX]...
> 
> 8: (65) if r1 s> 0x goto pc+1
> 
> ... therefore this is always true (the operand is sign-extended).
> 
> 10: (b4) w2 = 11
> 11: (6d) if r2 s> r1 goto pc+1
> 
> When true, the 64-bit bounds become [0, 10]. The 32-bit bounds are still
> [S32_MIN, 10].
> 
> 13: (64) w1 <<= 2
> 
> Because this is a 32-bit operation, the verifier propagates the new
> 32-bit bounds to the 64-bit ones, and the knowledge gained from insn 11
> is lost.
> 
> 14: (0f) r0 += r1
> 15: (7a) *(u64 *)(r0 +0) = 4
> 
> Then the verifier considers r0 unbounded here, rejecting the test. To
> make the test work, change insn 8 to check the sign of the 32-bit value.
> 
> Signed-off-by: Jean-Philippe Brucker 
> Acked-by: John Fastabend 
> Signed-off-by: Alexei Starovoitov 
> Signed-off-by: Sasha Levin 
> ---
>  tools/testing/selftests/bpf/verifier/array_access.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/bpf/verifier/array_access.c b/tools/testing/selftests/bpf/verifier/array_access.c
> index f3c33e128709b..a80d806ead15f 100644
> --- a/tools/testing/selftests/bpf/verifier/array_access.c
> +++ b/tools/testing/selftests/bpf/verifier/array_access.c
> @@ -68,7 +68,7 @@
>   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
>   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 9),
>   BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
> - BPF_JMP_IMM(BPF_JSGT, BPF_REG_1, 0x, 1),
> + BPF_JMP32_IMM(BPF_JSGT, BPF_REG_1, 0x, 1),
>   BPF_MOV32_IMM(BPF_REG_1, 0),
>   BPF_MOV32_IMM(BPF_REG_2, MAX_ENTRIES),
>   BPF_JMP_REG(BPF_JSGT, BPF_REG_2, BPF_REG_1, 1),
> -- 
> 2.27.0
> 


Re: [PATCH v13 07/15] iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs

2020-12-04 Thread Jean-Philippe Brucker
Hi Shameer,

On Thu, Dec 03, 2020 at 06:42:57PM +, Shameerali Kolothum Thodi wrote:
> Hi Jean/zhangfei,
> Is it possible to have a branch with minimum required SVA/UACCE related 
> patches
> that are already public and can be a "stable" candidate for future respin of 
> Eric's series?
> Please share your thoughts.

By "stable" you mean a fixed branch with the latest SVA/UACCE patches
based on mainline?  The uacce-devel branches from
https://github.com/Linaro/linux-kernel-uadk do provide this at the moment
(they track the latest sva/zip-devel branch
https://jpbrucker.net/git/linux/ which is roughly based on mainline.)

Thanks,
Jean



Re: [PATCH v6 2/5] iommu: Use bus iommu ops for aux related callback

2020-10-30 Thread Jean-Philippe Brucker
On Fri, Oct 30, 2020 at 05:55:53AM +, Tian, Kevin wrote:
> > From: Lu Baolu 
> > Sent: Friday, October 30, 2020 12:58 PM
> > 
> > The aux-domain apis were designed for macro driver where the subdevices
> > are created and used inside a device driver. Use the device's bus iommu
> > ops instead of that in iommu domain for various callbacks.
> 
> IIRC there are only two users on these apis. One is VFIO, and the other
> is on the ARM side (not checked in yet). Jean, can you help confirm 
> whether ARM-side usage still relies on aux apis even with this change?

No, I have something out of tree but no plan to upstream it anymore, and
the SMMUv2 implementation is out as well:

https://lore.kernel.org/linux-iommu/20200713173556.gc3...@jcrouse1-lnx.qualcomm.com/

> If no, possibly they can be removed completely?

No objection from me. They can be added back later (I still believe adding
PASID to the DMA API would be nice to have once more HW implements it).

Thanks,
Jean


Re: [PATCH v3 01/14] docs: Document IO Address Space ID (IOASID) APIs

2020-10-30 Thread Jean-Philippe Brucker
On Mon, Oct 26, 2020 at 02:05:06PM -0700, Jacob Pan wrote:
> > This looks good to me, with small comments below.
> > 
> Can I add your Reviewed-by tag after addressing the comments?

Yes sure, this took forever to review so I'm happy not to do another pass :)


> > > +Each IOASID set is created with a token, which can be one of the
> > > +following token types:
> > > +
> > > + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)  
> > 
> > Maybe NULL isn't the best name then. NONE?
> > 
> Agreed, 'NONE' makes more sense.

Although patch 5 only allows a NULL token for this type. So the name seems
fine, you could just fix this description.


> > > +IOASID core has the notion of "custom allocator" such that guest can
> > > +register virtual command allocator that precedes the default one.  
> > 
> > "Supersedes", rather than "precedes"?
> > 
> My understanding is that 'supersede' means replace something but 'precede'
> means get in front of something. I do want to emphasis that the custom
> allocator takes precedence over the default allocator.

Right it's ambiguous. The custom allocator does entirely replace the
allocation action, but the default one is still used for storage. Anyway,
you can leave this.


> > > +Let's examine the IOASID life cycle again when free happens *before*
> > > +unbind. This could be a result of misbehaving guests or crash. Assuming
> > > +VFIO cannot enforce unbind->free order. Notice that the setup part up
> > > +until step #12 is identical to the normal case, the flow below starts
> > > +with step 13.
> > > +
> > > +::
> > > +
> > > + VFIOIOMMUKVMVDCMIOASID   Ref
> > > +   ..
> > > +   13  GUEST STARTS DMA --
> > > +   14  *GUEST MISBEHAVES!!!* 
> > > +   15 ioasid_free()
> > > +   16 ioasid_notify(FREE)
> > > +   17 mark_free_pending (1)
> > 
> > Could we use superscript ¹²³⁴ for footnotes? These look like function
> > parameters
> > 
> yes, much better
> 
> > > +   18  kvm_nb_handler(FREE)
> > > +   19  vmcs_update_atomic()
> > > +   20  ioasid_put_locked()   ->   3
> > > +   21   vdcm_nb_handler(FREE)
> > > +   22   iomm_nb_handler(FREE)
> > 
> > iommu_nb_handler
> > 
> got it
> 
> > > +   23 ioasid_free() returns(2)  schedule_work()   2  
> > 
> > I completely lost track here, couldn't figure out in which direction to
> > read the diagram. What work is scheduled?
> The time line goes downward but we only control the notification order in
> terms of when the events are received. Some completions are async thus out
> of order done by work items. The only in-order completion is the KVM update
> of its PASID translation table.
> 
> After #23, the async works are scheduled to complete clean up work outside
> the spinlock(held by the caller of the atomic notifier).
> 
> Any suggestions to improve the readability of the time line?

Maybe explain what happens from line 23: ioasid_free() schedules... a FREE
notification? Which happens on line 24 (corresponding to the second
schedule_work()?) and is handled by (a) VDCM to clear the device context
and (b) IOMMU to clear the PASID context, both ending up dropping their
ref.

> 
> > Why does the IOMMU driver drop
> > its reference to the IOASID before unbind_gpasid()?
> > 
> This is the exception case where userspace issues IOASID free before
> unbind_gpasid(). The equivalent of unbind is performed in the IOASID_FREE
> notification handler. In IOASID_FREE handler, reference is dropped and
> private data deleted. After that, if unbind comes to IOMMU driver, it will
> not find IOASID private data therefore just return.

Right ok. As you noted below the damage is caused by and limited to the
guest, so I think it's fine.

> 
> > > +   24   schedule_work()   vdev_clear_wk(hpasid)
> > > +   25   teardown_pasid_wk()
> > > +   26   ioasid_put() ->   1
> > > +   27   ioasid_put()   0
> > > +   28 Reclaimed
> > > +   29 unbind_gpasid()
> > > +   30   iommu_unbind()->ioasid_find() Fails (3)
> > > +   -- New Life Cycle Begin 
> > > +
> > > +Note:
> > > +
> > > +1. By marking IOASID FREE_PENDING at step #17, no new references can be
> > > +   held. ioasid_get/find() will return -ENOENT;  
> > 
> > s/held/taken
> > 
> Got it.
> 
> > Thanks,
> > Jean
> > 
> > > +2. After step #23, all events can go out of order. Shall not affect
> > > +   the outcome.
> > > +3. IOMMU driver fails to find private data for unbinding. If unbind is
> > > +   called after the same IOASID is 

Re: [PATCH v1 2/3] iommu: Fix an issue in iommu_page_response() flags check

2020-10-28 Thread Jean-Philippe Brucker
Hi,

On Wed, Oct 28, 2020 at 09:36:57AM +0800, Yi Sun wrote:
> From: Jacob Pan 
> 
> The original code fails when LAST_PAGE is set in flags.

LAST_PAGE is not documented to be a valid flag for page_response.
So isn't failing the right thing to do?

> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Yi Sun 
> ---
>  drivers/iommu/iommu.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 8c470f4..053cec3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1200,9 +1200,11 @@ int iommu_page_response(struct device *dev,
>   return -EINVAL;
>  
>   if (msg->version != IOMMU_PAGE_RESP_VERSION_1 ||
> - msg->flags & ~IOMMU_PAGE_RESP_PASID_VALID)
> + !(msg->flags & IOMMU_PAGE_RESP_PASID_VALID)) {

It should be OK not to have PASID_VALID set; we're just checking for
undefined flags here.
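
For instance, keeping the mask-based style scales as more response flags get
defined (IOMMU_PAGE_RESP_VALID_FLAGS is my name for the combined mask, not
an existing define):

	#define IOMMU_PAGE_RESP_VALID_FLAGS	IOMMU_PAGE_RESP_PASID_VALID

	if (msg->version != IOMMU_PAGE_RESP_VERSION_1 ||
	    msg->flags & ~IOMMU_PAGE_RESP_VALID_FLAGS)
		return -EINVAL;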

Thanks,
Jean

> + dev_warn_ratelimited(dev, "%s:Invalid ver %x: flags %x\n",
> + __func__, msg->version, msg->flags);
>   return -EINVAL;
> -
> + }
>   /* Only send response if there is a fault report pending */
>   mutex_lock(&param->fault_param->lock);
>   if (list_empty(&param->fault_param->faults)) {
> -- 
> 2.7.4
> 


Re: [PATCH v6 13/29] arm64/build: Assert for unwanted sections

2020-10-27 Thread Jean-Philippe Brucker
Hi,

On Mon, Oct 26, 2020 at 06:38:46PM +0100, Ard Biesheuvel wrote:
> > > > Note that even on plain be2881824ae9eb92, I get:
> > > >
> > > > aarch64-linux-gnu-ld: Unexpected GOT/PLT entries detected!
> > > > aarch64-linux-gnu-ld: Unexpected run-time procedure linkages 
> > > > detected!
> > > >
> > > > The parent commit obviously doesn't show that (but probably still has
> > > > the problem).
> >
> > Reverting both
> > b3e5d80d0c48c0cc ("arm64/build: Warn on orphan section placement")
> > be2881824ae9eb92 ("arm64/build: Assert for unwanted sections")
> > seems to solve my problems, without any ill effects?
> >
> 
> I cannot reproduce the issue here with my distro GCC+binutils (Debian 8.3.0)

I have the same problem with one of my debug configs and Linux v5.10-rc1,
and can reproduce with the Debian 8.3.0 toolchain, by using the arm64
defconfig and disabling CONFIG_MODULES:

ld -EL -maarch64elf --no-undefined -X -z norelro -shared -Bsymbolic -z notext 
--no-apply-dynamic-relocs --fix-cortex-a53-843419 --orphan-handling=warn 
--build-id=sha1 --strip-debug -o .tmp_vmlinux.kallsyms1 -T 
./arch/arm64/kernel/vmlinux.lds --whole-archive arch/arm64/kernel/head.o 
init/built-in.a usr/built-in.a arch/arm64/built-in.a kernel/built-in.a 
certs/built-in.a mm/built-in.a fs/built-in.a ipc/built-in.a security/built-in.a 
crypto/built-in.a block/built-in.a arch/arm64/lib/built-in.a lib/built-in.a 
drivers/built-in.a sound/built-in.a net/built-in.a virt/built-in.a 
--no-whole-archive --start-group arch/arm64/lib/lib.a lib/lib.a 
./drivers/firmware/efi/libstub/lib.a --end-group
ld: Unexpected GOT/PLT entries detected!
ld: Unexpected run-time procedure linkages detected!

Adding -fno-pie to this command doesn't fix the problem.

Note that when cross-building with GCC 10.2 and binutils 2.35.1 I also
get several "aarch64-linux-gnu-ld: warning: -z norelro ignored" in
addition to the error, but I don't get that warning with the 8.3.0
toolchain.
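
For reference, the errors come from the assertions that be2881824ae9eb92
added to arch/arm64/kernel/vmlinux.lds.S. From memory they look roughly like
this (do double-check the exact form in the tree):

	ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
	ASSERT(SIZEOF(.got.plt) == 0 || SIZEOF(.got.plt) == 0x18,
	       "Unexpected GOT/PLT entries detected!")

so the failure means the linker emitted non-empty .plt/.got.plt sections
when linking vmlinux.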

Thanks,
Jean

> 
> The presence of .data.rel.ro and .got.plt sections suggests that the
> toolchain is using -fpie and/or -z relro to build shared objects
> rather than a fully linked bare metal binary.
> 
> Which toolchain are you using? Does adding -fno-pie to the compiler
> command line and/or adding -z norelro to the linker command line make
> any difference?
> 


Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-23 Thread Jean-Philippe Brucker
On Mon, Oct 19, 2020 at 02:16:08PM -0700, Raj, Ashok wrote:
> Hi Jean
> 
> On Mon, Oct 19, 2020 at 04:08:24PM +0200, Jean-Philippe Brucker wrote:
> > On Sat, Oct 17, 2020 at 04:25:25AM -0700, Raj, Ashok wrote:
> > > > For devices that *don't* use a stop marker, the PCIe spec says 
> > > > (10.4.1.2):
> > > > 
> > > >   To stop [using a PASID] without using a Stop Marker Message, the
> > > >   function shall:
> > > >   1. Stop queueing new Page Request Messages for this PASID.
> > > 
> > > The device driver would need to tell stop sending any new PR's.
> > > 
> > > >   2. Finish transmitting any multi-page Page Request Messages for this
> > > >  PASID (i.e. send the Page Request Message with the L bit Set).
> > > >   3. Wait for PRG Response Messages associated any outstanding Page
> > > >  Request Messages for the PASID.
> > > > 
> > > > So they have to flush their PR themselves. And since the device driver
> > > > completes this sequence before calling unbind(), then there shouldn't be
> > > > any outstanding PR for the PASID, and unbind() doesn't need to flush,
> > > > right?
> > > 
> > > I can see how the device can complete #2,3 above. But the device driver
> > > isn't the one managing page-responses right. So in order for the device to
> > > know the above sequence is complete, it would need to get some assist from
> > > IOMMU driver?
> > 
> > No the device driver just waits for the device to indicate that it has
> > completed the sequence. That's what the magic stop-PASID mechanism
> > described by PCIe does. In 6.20.1 "Managing PASID TLP Prefix Usage" it
> > says:
> 
> The goal is we do this when the device is in a messed-up state. So we can't
> take for granted that the device is operating properly, which is why we are
> going to whack the device with an flr().
> 
> The only thing that's supposed to work without a brain freeze is the
> invalidation logic. Spec requires devices to respond to invalidations even 
> when
> they are in the process of flr().
> 
> So when IOMMU does an invalidation wait with a Page-Drain, IOMMU waits till
> the response for that arrives before completing the descriptor. Due to 
> the posted semantics it will ensure any PR's issued and in the fabric are 
> flushed 
> out to memory. 
> 
> I suppose you can wait for the device to vouch for all responses, but that
> is assuming the device is still functioning properly. Given that we use it
> in two places,
> 
> * Reclaiming a PASID - only during a tear down sequence, skipping it
>   doesn't really buy us much.

Yes I was only wondering about normal PASID reclaim operations, in
unbind(). Agreed that for FLR we need to properly clean the queue, though
I do need to do more thinking about this.

Anyway, having a full priq drain in unbind() isn't harmful, just
unnecessary delay in my opinion. I'll drop these patches for now but
thanks for the discussion.

Thanks,
Jean

> * During FLR this can't be skipped anyway due to the above sequence
>   requirement. 
> 
> > 
> > "A Function must have a mechanism to request that it gracefully stop using
> >  a specific PASID. This mechanism is device specific but must satisfy the
> >  following rules:
> >  [...]
> >  * When the stop request mechanism indicates completion, the Function has:
> >[...]
> >* Complied with additional rules described in Address Translation
> >  Services (Chapter 10 [10.4.1.2 quoted above]) if Address Translations
> >  or Page Requests were issued on the behalf of this PASID."
> > 
> > So after the device driver initiates this mechanism in the device, the
> > device must be able to indicate completion of the mechanism, which
> > includes completing all in-flight Page Requests. At that point the device
> > driver can call unbind() knowing there is no pending PR for this PASID.
> > 
> 
> Cheers,
> Ashok


Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-23 Thread Jean-Philippe Brucker
On Mon, Oct 19, 2020 at 11:33:16AM -0700, Jacob Pan wrote:
> Hi Jean-Philippe,
> 
> On Mon, 19 Oct 2020 16:08:24 +0200, Jean-Philippe Brucker
>  wrote:
> 
> > On Sat, Oct 17, 2020 at 04:25:25AM -0700, Raj, Ashok wrote:
> > > > For devices that *don't* use a stop marker, the PCIe spec says
> > > > (10.4.1.2):
> > > > 
> > > >   To stop [using a PASID] without using a Stop Marker Message, the
> > > >   function shall:
> > > >   1. Stop queueing new Page Request Messages for this PASID.  
> > > 
> > > The device driver would need to tell stop sending any new PR's.
> > >   
> > > >   2. Finish transmitting any multi-page Page Request Messages for this
> > > >  PASID (i.e. send the Page Request Message with the L bit Set).
> > > >   3. Wait for PRG Response Messages associated any outstanding Page
> > > >  Request Messages for the PASID.
> > > > 
> > > > So they have to flush their PR themselves. And since the device driver
> > > > completes this sequence before calling unbind(), then there shouldn't
> > > > be any outstanding PR for the PASID, and unbind() doesn't need to
> > > > flush, right?  
> > > 
> > > I can see how the device can complete #2,3 above. But the device driver
> > > isn't the one managing page-responses right. So in order for the device
> > > to know the above sequence is complete, it would need to get some
> > > assist from IOMMU driver?  
> > 
> > No the device driver just waits for the device to indicate that it has
> > completed the sequence. That's what the magic stop-PASID mechanism
> > described by PCIe does. In 6.20.1 "Managing PASID TLP Prefix Usage" it
> > says:
> > 
> > "A Function must have a mechanism to request that it gracefully stop using
> >  a specific PASID. This mechanism is device specific but must satisfy the
> >  following rules:
> >  [...]
> >  * When the stop request mechanism indicates completion, the Function has:
> >[...]
> >* Complied with additional rules described in Address Translation
> >  Services (Chapter 10 [10.4.1.2 quoted above]) if Address Translations
> >  or Page Requests were issued on the behalf of this PASID."
> > 
> > So after the device driver initiates this mechanism in the device, the
> > device must be able to indicate completion of the mechanism, which
> > includes completing all in-flight Page Requests. At that point the device
> > driver can call unbind() knowing there is no pending PR for this PASID.
> > 
> In step #3, I think it is possible that the device driver received a page
> response as part of the auto page response, so it may not be guaranteed
> that all the in-flight PRQs have completed inside the IOMMU. Therefore, a
> drain is _always_ needed to be sure?

So I don't think that's a problem, because non-last PRs are not handled by
IOPF until the last PR is received. And when the IOMMU driver detects that
the queue has been overflowing and could have auto-responded to the last PR,
we discard any pending non-last PRs. I couldn't find a leak here.

Thanks,
Jean



Re: [PATCH v3 11/14] iommu/ioasid: Support mm type ioasid_set notifications

2020-10-23 Thread Jean-Philippe Brucker
", set->id);
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);

Needs to be in the common path, before this block

> + if (!curr) {
> + ret = -ENOMEM;
> + goto exit_unlock;
> + }
> + curr->token = mm;
> + curr->nb = nb;
> + curr->active = true;
> + curr->set = set;
> +
> + /* Set already created, add to the notifier chain */
> + atomic_notifier_chain_register(&set->nh, nb);
> + /*
> +  * Do not hold a reference, if the set gets destroyed, the nb
> +  * entry will be marked inactive.
> +  */
> + ioasid_set_put(set);
> + }
> +
> + list_add(&curr->list, &ioasid_nb_pending_list);
> +
> +exit_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
> +
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct 
> notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + list_del(&curr->list);
> + spin_unlock(&ioasid_nb_lock);
> + if (curr->active) {
> + atomic_notifier_chain_unregister(&curr->set->nh,
> +  nb);

I think it's possible for ioasid_set_put_locked() to free the set right
after we release the lock, then this unregister() will be use-after-free.
Best keep holding the lock for this.

> + }
> + kfree(curr);
> + return;
> + }
> + }
> + pr_warn("No ioasid set found for mm token %llx\n",  (u64)mm);

Use %p

Overall I think error messages in this series could be demoted to
pr_debug, since they're useful when developing users of this API, but
figuring out whether they can be user-triggered and need care is too much
work.
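
e.g. for the warning above, that would become:

	pr_debug("No ioasid set found for mm token %p\n", mm);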

Thanks,
Jean

> + spin_unlock(&ioasid_nb_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
> +
>  MODULE_AUTHOR("Jean-Philippe Brucker ");
>  MODULE_AUTHOR("Jacob Pan ");
>  MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 1b551c99d568..c6cc855aadb6 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -132,6 +132,9 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
>  void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>   void (*fn)(ioasid_t id, void *data),
>   void *data);
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block 
> *nb);
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct 
> notifier_block *nb);
> +
>  #else /* !CONFIG_IOASID */
>  static inline void ioasid_install_capacity(ioasid_t total)
>  {
> @@ -238,5 +241,17 @@ static inline void ioasid_set_for_each_ioasid(struct 
> ioasid_set *sdata,
> void *data)
>  {
>  }
> +
> +static inline int ioasid_register_notifier_mm(struct mm_struct *mm,
> +   struct notifier_block *nb)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline void ioasid_unregister_notifier_mm(struct mm_struct *mm,
> +  struct notifier_block *nb)
> +{
> +}
> +
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> -- 
> 2.7.4
> 


Re: [PATCH v3 10/14] iommu/ioasid: Introduce notification APIs

2020-10-23 Thread Jean-Philippe Brucker
@@ -436,6 +496,7 @@ void ioasid_detach_spid(ioasid_t ioasid)
>   goto done_unlock;
>   }
>   data->spid = INVALID_IOASID;
> + ioasid_notify(data, IOASID_NOTIFY_UNBIND, IOASID_NOTIFY_FLAG_SET);
>  
>  done_unlock:
>   spin_unlock(&ioasid_allocator_lock);
> @@ -469,6 +530,28 @@ static inline bool ioasid_set_is_valid(struct ioasid_set 
> *set)
>   return xa_load(&ioasid_sets, set->id) == set;
>  }
>  
> +static void ioasid_add_pending_nb(struct ioasid_set *set)
> +{
> + struct ioasid_set_nb *curr;
> +
> + if (set->type != IOASID_SET_TYPE_MM)
> + return;
> +
> + /*
> +  * Check if there are any pending nb requests for the given token, if so
> +  * add them to the notifier chain.
> +  */
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token && !curr->active) {
> + atomic_notifier_chain_register(&set->nh, curr->nb);
> + curr->set = set;
> + curr->active = true;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +}
> +
>  /**
>   * ioasid_set_alloc - Allocate a new IOASID set for a given token
>   *
> @@ -556,6 +639,13 @@ struct ioasid_set *ioasid_set_alloc(void *token, 
> ioasid_t quota, int type)
>   set->quota = quota;
>   set->id = id;
> + refcount_set(&set->ref, 1);
> + ATOMIC_INIT_NOTIFIER_HEAD(&set->nh);
> +
> + /*
> +  * Check if there are any pending nb requests for the given token, if so
> +  * add them to the notifier chain.
> +  */
> + ioasid_add_pending_nb(set);
>  
>   /*
>* Per set XA is used to store private IDs within the set, get ready
> @@ -641,6 +731,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t 
> min, ioasid_t max,
>   goto exit_free;
>   }
>   set->nr_ioasids++;
> + ioasid_notify(data, IOASID_NOTIFY_ALLOC, IOASID_NOTIFY_FLAG_SET);
>   goto done_unlock;
>  exit_free:
>   kfree(data);
> @@ -687,6 +778,8 @@ static void ioasid_free_locked(struct ioasid_set *set, 
> ioasid_t ioasid)
>   return;
>  
>   data->state = IOASID_STATE_FREE_PENDING;
> + ioasid_notify(data, IOASID_NOTIFY_FREE,
> +   IOASID_NOTIFY_FLAG_ALL | IOASID_NOTIFY_FLAG_ALL);
> + if (!refcount_dec_and_test(&data->users))
>   return;
>  
> @@ -726,6 +819,7 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);
>  
>  static void ioasid_set_put_locked(struct ioasid_set *set)
>  {
> + struct ioasid_set_nb *curr;
>   struct ioasid_data *entry;
>   unsigned long index;
>  
> @@ -757,6 +851,16 @@ static void ioasid_set_put_locked(struct ioasid_set *set)
>   /* Return the quota back to system pool */
>   ioasid_capacity_avail += set->quota;
>  
> + /* Restore pending status of the set NBs */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token) {
> + if (curr->active)
> + curr->active = false;
> + else
> + pr_warn("Set token exists but not active!\n");
> + }
> + }
> +
>   /*
>* Token got released right away after the ioasid_set is freed.
>* If a new set is created immediately with the newly released token,
> @@ -778,7 +882,9 @@ static void ioasid_set_put_locked(struct ioasid_set *set)
>  void ioasid_set_put(struct ioasid_set *set)
>  {
> + spin_lock(&ioasid_allocator_lock);
> + spin_lock(&ioasid_nb_lock);
> + ioasid_set_put_locked(set);
> + spin_unlock(&ioasid_nb_lock);
> + spin_unlock(&ioasid_allocator_lock);
>  }
>  EXPORT_SYMBOL_GPL(ioasid_set_put);
> @@ -980,6 +1086,41 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t 
> ioasid,
>  }
>  EXPORT_SYMBOL_GPL(ioasid_find);
>  
> +int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block 
> *nb)

A comment might be useful, explaining that @set is optional and that the
caller should set a priority within ioasid_notifier_prios in @nb.
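
Something like this, for example:

	/**
	 * ioasid_register_notifier - Register a notifier block on IOASID events
	 * @set: the ioasid_set to listen on, or NULL for the global chain
	 * @nb:  notifier block to call back; the caller should pick a priority
	 *       from enum ioasid_notifier_prios so handlers run in a fixed order
	 */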

> +{
> + if (set)
> + return atomic_notifier_chain_register(&set->nh, nb);
> + else
> + return atomic_notifier_chain_register(&ioasid_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier);
> +

Here as well, a comment saying that a reference to the set must be held,
though maybe that's obvious.

Thanks,
Jean

> +void ioasid_unregister_notifier(struct ioasid_set *set,
> +         struct notifier_block *nb)
> +{
> + struct 

Re: [PATCH v3 09/14] iommu/ioasid: Introduce ioasid_set private ID

2020-10-23 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:36PM -0700, Jacob Pan wrote:
> When an IOASID set is used for guest SVA, each VM will acquire its
> ioasid_set for IOASID allocations. IOASIDs within the VM must have a
> host/physical IOASID backing; the mapping between guest and host IOASIDs
> can be non-identical. IOASID set private ID (SPID) is introduced in this
> patch to be used as guest IOASID. However, the concept of ioasid_set
> specific namespace is generic, thus named SPID.
> 
> As the SPID namespace is within the IOASID set, the IOASID core can provide
> lookup services in both directions. An SPID may not be available when its
> IOASID is allocated; the mapping between SPID and IOASID is usually
> established when a guest page table is bound to a host PASID.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 102 
> +
>  include/linux/ioasid.h |  19 +
>  2 files changed, 121 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 828cc44b1b1c..378fef8f23d9 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -26,6 +26,7 @@ enum ioasid_state {
>   * struct ioasid_data - Meta data about ioasid
>   *
>   * @id:  Unique ID
> + * @spid:Private ID unique within a set
>   * @users:   Number of active users
>   * @state:   Track state of the IOASID
>   * @set: ioasid_set of the IOASID belongs to
> @@ -34,6 +35,7 @@ enum ioasid_state {
>   */
>  struct ioasid_data {
>   ioasid_t id;
> + ioasid_t spid;
>   refcount_t users;
>   enum ioasid_state state;
>   struct ioasid_set *set;
> @@ -363,6 +365,105 @@ void ioasid_detach_data(ioasid_t ioasid)
>  }
>  EXPORT_SYMBOL_GPL(ioasid_detach_data);
>  
> +static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t 
> spid)
> +{
> + ioasid_t ioasid = INVALID_IOASID;
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (!xa_load(&ioasid_sets, set->id)) {
> + pr_warn("Invalid set\n");

Could use ioasid_set_is_valid(), and perhaps a WARN_ON() instead of
pr_warn() since this is a programming error.

> + goto done;

Or just return INVALID_IOASID

> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + if (spid == entry->spid) {
> + refcount_inc(&entry->users);
> + ioasid = index;

break

> + }
> + }
> +done:
> + return ioasid;
> +}
> +
> +/**
> + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
> + *
> + * @ioasid: the system-wide IOASID to attach
> + * @spid:   the ioasid_set private ID of @ioasid
> + *
> + * After attching SPID, future lookup can be done via ioasid_find_by_spid().

attaching

> + */
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + struct ioasid_data *data;
> + int ret = 0;
> +
> + if (spid == INVALID_IOASID)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_allocator_lock);
> + data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!data) {
> + pr_err("No IOASID entry %d to attach SPID %d\n",
> + ioasid, spid);
> + ret = -ENOENT;
> + goto done_unlock;
> + }
> + /* Check if SPID is unique within the set */
> + if (ioasid_find_by_spid_locked(data->set, spid) != INVALID_IOASID) {

We need an additional parameter to ioasid_find_by_spid_locked(), telling
it not to take a reference to the conflicting entry. Here we return with
the reference, which will never be released.
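
Folding this and the comments above together, the helper could look like
(just a sketch, untested):

	static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set,
						   ioasid_t spid, bool get)
	{
		ioasid_t ioasid = INVALID_IOASID;
		struct ioasid_data *entry;
		unsigned long index;

		if (WARN_ON(!ioasid_set_is_valid(set)))
			return INVALID_IOASID;

		xa_for_each(&set->xa, index, entry) {
			if (spid == entry->spid) {
				if (get)
					refcount_inc(&entry->users);
				ioasid = index;
				break;
			}
		}
		return ioasid;
	}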

> + ret = -EINVAL;
> + goto done_unlock;
> + }
> + data->spid = spid;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
> +
> +void ioasid_detach_spid(ioasid_t ioasid)

Could add a small comment to this public function

Thanks,
Jean

> +{
> + struct ioasid_data *data;
> +
> + spin_lock(&ioasid_allocator_lock);
> + data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!data || data->spid == INVALID_IOASID) {
> + pr_err("Invalid IOASID entry %d to detach\n", ioasid);
> + goto done_unlock;
> + }
> + data->spid = INVALID_IOASID;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_detach_spid);
> +
> +/**
> + * ioasid_find_by_spid - Find the system-wide IOASID by a set private ID and
> + * its set.
> + *
> + * @set: the ioasid_set to search within
> + * @spid:the set private ID
> + *
> + * Given a set private ID and its IOASID set, find the system-wide IOASID. 
> Take
> + * a reference upon finding the matching IOASID. Return INVALID_IOASID if the
> + * IOASID is not found in the set or the set is not valid.
> + */
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +{
> + ioasid_t ioasid;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid = ioasid_find_by_spid_locked(set, spid);
> + 

Re: [PATCH v3 07/14] iommu/ioasid: Add an iterator API for ioasid_set

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:34PM -0700, Jacob Pan wrote:
> Users of an ioasid_set may not keep track of all the IOASIDs allocated
> under the set. When collective actions are needed for each IOASID, it
> is useful to iterate over all the IOASIDs within the set. For example,
> when the ioasid_set is freed, the user might perform the same cleanup
> operation on each IOASID.
> 
> This patch adds an API to iterate all the IOASIDs within the set.
> 
> Signed-off-by: Jacob Pan 

Could add a short description of the function parameters, but

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/ioasid.c | 17 +
>  include/linux/ioasid.h |  9 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index cf8c7d34e2de..9628e78b2ab4 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -701,6 +701,23 @@ int ioasid_adjust_set(struct ioasid_set *set, int quota)
>  EXPORT_SYMBOL_GPL(ioasid_adjust_set);
>  
>  /**
> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
> + *
> + * Caller must hold a reference of the set and handles its own locking.
> + */
> +void ioasid_set_for_each_ioasid(struct ioasid_set *set,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + xa_for_each(&set->xa, index, entry)
> + fn(index, data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
> +
> +/**
>   * ioasid_find - Find IOASID data
>   * @set: the IOASID set
>   * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 0a5e82148eb9..aab58bc26714 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -75,6 +75,9 @@ int ioasid_register_allocator(struct ioasid_allocator_ops 
> *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
>  void ioasid_detach_data(ioasid_t ioasid);
> +void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data);
>  #else /* !CONFIG_IOASID */
>  static inline void ioasid_install_capacity(ioasid_t total)
>  {
> @@ -131,5 +134,11 @@ static inline int ioasid_attach_data(ioasid_t ioasid, 
> void *data)
>  static inline void ioasid_detach_data(ioasid_t ioasid)
>  {
>  }
> +
> +static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> +   void (*fn)(ioasid_t id, void 
> *data),
> +   void *data)
> +{
> +}
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> -- 
> 2.7.4
> 


Re: [PATCH v3 08/14] iommu/ioasid: Add reference couting functions

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:35PM -0700, Jacob Pan wrote:
> There can be multiple users of an IOASID; each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 117 
> +
>  include/linux/ioasid.h |  24 ++
>  2 files changed, 141 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 9628e78b2ab4..828cc44b1b1c 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -16,8 +16,26 @@ static ioasid_t ioasid_capacity = PCI_PASID_MAX;
>  static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
>  static DEFINE_XARRAY_ALLOC(ioasid_sets);
>  
> +enum ioasid_state {
> + IOASID_STATE_INACTIVE,
> + IOASID_STATE_ACTIVE,
> + IOASID_STATE_FREE_PENDING,
> +};
> +
> +/**
> + * struct ioasid_data - Meta data about ioasid
> + *
> + * @id:  Unique ID
> + * @users:   Number of active users
> + * @state:   Track state of the IOASID
> + * @set: ioasid_set of the IOASID belongs to
> + * @private: Private data associated with the IOASID
> + * @rcu: For free after RCU grace period
> + */
>  struct ioasid_data {
>   ioasid_t id;
> + refcount_t users;
> + enum ioasid_state state;
>   struct ioasid_set *set;
>   void *private;
>   struct rcu_head rcu;
> @@ -511,6 +529,8 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t 
> min, ioasid_t max,
>   goto exit_free;
>   }
>   data->id = id;
> + data->state = IOASID_STATE_ACTIVE;
> + refcount_set(&data->users, 1);
>  
>   /* Store IOASID in the per set data */
> + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> @@ -560,6 +580,14 @@ static void ioasid_free_locked(struct ioasid_set *set, 
> ioasid_t ioasid)
> + if (WARN_ON(!xa_load(&ioasid_sets, data->set->id)))
>   return;
>  
> + /* Free is already in progress */
> + if (data->state == IOASID_STATE_FREE_PENDING)
> + return;

But the previous call to ioasid_free_locked() dropped a reference, then
returned because more refs were held. Shouldn't this call also
dec_and_test() the reference and call ioasid_do_free_locked() if
necessary?
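
Something like this is what I had in mind (only a sketch):

	/* A free is already in progress, but this caller held a ref too */
	if (data->state == IOASID_STATE_FREE_PENDING) {
		if (refcount_dec_and_test(&data->users))
			ioasid_do_free_locked(data);
		return;
	}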

> +
> + data->state = IOASID_STATE_FREE_PENDING;
> + if (!refcount_dec_and_test(&data->users))
> + return;
> +
>   ioasid_do_free_locked(data);
>  }
>  
> @@ -717,6 +745,95 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *set,
>  }
>  EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>  
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);

Strange placement of the %u

> + return -EBUSY;
> + }
> +
> + /* Check set ownership if the set is non-null */
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID %u outside the set\n", ioasid);
> + /* data found but does not belong to the set */
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);

If this is public facing, let's add a lockdep_assert_held() to make sure
they do hold the right lock. Same for ioasid_put_locked().
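
For example, at the top of each:

	lockdep_assert_held(&ioasid_allocator_lock);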

Thanks,
Jean

> +
> +/**
> + * ioasid_get - Obtain a reference to an ioasid
> + * @set: the ioasid_set to check permission against if not NULL
> + * @ioasid:  the ID to remove
> + *
> + *
> + * Return: 0 on success, error if failed.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return false;
> + }
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID %u outside the set\n", ioasid);
> + return false;
> + }
> + if (!refcount_dec_and_test(&data->users))
> + return false;
> +
> + ioasid_do_free_locked(data);
> +
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Release a reference to an ioasid
> + * @set: the ioasid_set 

Re: [PATCH v3 06/14] iommu/ioasid: Introduce API to adjust the quota of an ioasid_set

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:33PM -0700, Jacob Pan wrote:
> Each ioasid_set is given a quota during allocation. As system
> administrators balance resources among VMs, we shall support the
> adjustment of quota at runtime. The new quota cannot be less than the
> outstanding IOASIDs already allocated within the set. The extra quota
> will be returned to the system-wide IOASID pool if the new quota is
> smaller than the existing one.
> 
> Signed-off-by: Jacob Pan 

Minor comments below, but

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/ioasid.c | 47 +++
>  include/linux/ioasid.h |  6 ++
>  2 files changed, 53 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 61e25c2375ab..cf8c7d34e2de 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -654,6 +654,53 @@ void ioasid_set_put(struct ioasid_set *set)
>  EXPORT_SYMBOL_GPL(ioasid_set_put);
>  
>  /**
> + * ioasid_adjust_set - Adjust the quota of an IOASID set
> + * @set: IOASID set to be assigned
> + * @quota:   Quota allowed in this set
> + *
> + * Return 0 on success. If the new quota is smaller than the number of
> + * IOASIDs already allocated, -EINVAL will be returned. No change will be
> + * made to the existing quota.
> + */
> +int ioasid_adjust_set(struct ioasid_set *set, int quota)

@quota probably doesn't need to be signed (since it's the same as an
ioasid_t, which is unsigned).

> +{
> + int ret = 0;
> +
> + if (quota <= 0)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_allocator_lock);
> + if (set->nr_ioasids > quota) {
> + pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
> + quota, set->nr_ioasids);
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> +
> + if ((quota > set->quota) &&
> + (quota - set->quota > ioasid_capacity_avail)) {

misaligned

> + ret = -ENOSPC;
> + goto done_unlock;
> + }
> +
> + /* Return the delta back to system pool */
> + ioasid_capacity_avail += set->quota - quota;
> +
> + /*
> +  * May have a policy to prevent giving all available IOASIDs
> +  * to one set. But we don't enforce here, it should be in the
> +  * upper layers.
> +  */

I think here would be OK to check about fairness. But anyway, we don't
have to worry about this yet, so I'd just drop the comment.

Thanks,
Jean

> + set->quota = quota;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> +
> +/**
>   * ioasid_find - Find IOASID data
>   * @set: the IOASID set
>   * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 1ae213b660f0..0a5e82148eb9 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -62,6 +62,7 @@ struct ioasid_allocator_ops {
>  void ioasid_install_capacity(ioasid_t total);
>  ioasid_t ioasid_get_capacity(void);
>  struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type);
> +int ioasid_adjust_set(struct ioasid_set *set, int quota);
>  void ioasid_set_get(struct ioasid_set *set);
>  void ioasid_set_put(struct ioasid_set *set);
>  
> @@ -99,6 +100,11 @@ static inline struct ioasid_set *ioasid_set_alloc(void 
> *token, ioasid_t quota, i
>   return ERR_PTR(-ENOTSUPP);
>  }
>  
> +static inline int ioasid_adjust_set(struct ioasid_set *set, int quota)
> +{
> + return -ENOTSUPP;
> +}
> +
>  static inline void ioasid_set_put(struct ioasid_set *set)
>  {
>  }
> -- 
> 2.7.4
> 


Re: [PATCH v3 05/14] iommu/ioasid: Redefine IOASID set and allocation APIs

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:32PM -0700, Jacob Pan wrote:
> ioasid_set was introduced as an arbitrary token that is shared by a
> group of IOASIDs. For example, two IOASIDs allocated via the same
> ioasid_set pointer belong to the same set.
> 
> For guest SVA usages, system-wide IOASID resources need to be
> partitioned such that each VM can have its own quota and being managed
> separately. ioasid_set is the perfect candidate for meeting such
> requirements. This patch redefines and extends ioasid_set with the
> following new fields:
> - Quota
> - Reference count
> - Storage of its namespace
> - The token is now stored in the ioasid_set with types
> 
> Basic ioasid_set level APIs are introduced that wire up these new data.
> Existing users of IOASID APIs are converted where a host IOASID set is
> allocated for bare-metal usage.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel/iommu.c |  26 +++--
>  drivers/iommu/intel/pasid.h |   1 +
>  drivers/iommu/intel/svm.c   |  25 ++--
>  drivers/iommu/ioasid.c  | 277 
> 
>  include/linux/ioasid.h  |  53 +++--
>  5 files changed, 333 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index e7bcb299e51e..872391890323 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -104,6 +104,9 @@
>   */
>  #define INTEL_IOMMU_PGSIZES  (~0xFFFUL)
>  
> +/* PASIDs used by host SVM */
> +struct ioasid_set *host_pasid_set;
> +
>  static inline int agaw_to_level(int agaw)
>  {
>   return agaw + 2;
> @@ -3147,8 +3150,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, 
> void *data)
>* Sanity check the ioasid owner is done at upper layer, e.g. VFIO
>* We can only free the PASID when all the devices are unbound.
>*/
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> + pr_err("IOASID %d to be freed but not in system set\n", ioasid);
>   return;
>   }
>   vcmd_free_pasid(iommu, ioasid);
> @@ -,8 +3336,17 @@ static int __init init_dmars(void)
>   goto free_iommu;
>  
>   /* PASID is needed for scalable mode irrespective to SVM */
> - if (intel_iommu_sm)
> + if (intel_iommu_sm) {
>   ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + host_pasid_set = ioasid_set_alloc(NULL, PID_MAX_DEFAULT,
> +   IOASID_SET_TYPE_NULL);
> + if (IS_ERR_OR_NULL(host_pasid_set)) {
> + pr_err("Failed to allocate host PASID set %lu\n",

Printing a negative value as unsigned

> + PTR_ERR(host_pasid_set));
> + intel_iommu_sm = 0;
> + }
> + }
>  
>   /*
>* for each drhd
> @@ -3381,7 +3393,7 @@ static int __init init_dmars(void)
>   disable_dmar_iommu(iommu);
>   free_dmar_iommu(iommu);
>   }
> -
> + ioasid_set_put(host_pasid_set);
>   kfree(g_iommus);
>  
>  error:
> @@ -5163,7 +5175,7 @@ static void auxiliary_unlink_device(struct dmar_domain 
> *domain,
>   domain->auxd_refcnt--;
>  
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>  }
>  
>  static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5181,7 +5193,7 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   int pasid;
>  
>   /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
>pci_max_pasids(to_pci_dev(dev)) - 1,
>NULL);
>   if (pasid == INVALID_IOASID) {
> @@ -5224,7 +5236,7 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   spin_unlock(&iommu->lock);
>   spin_unlock_irqrestore(&device_domain_lock, flags);
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>  
>   return ret;
>  }
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index c9850766c3a9..ccdc23446015 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry 
> *pte)
>  }
>  
>  extern u32 intel_pasid_max_id;
> +extern struct ioasid_set *host_pasid_set;
>  int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
>  void intel_pasid_free_id(int pasid);
>  void 

Re: [PATCH v3 04/14] iommu/ioasid: Support setting system-wide capacity

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:31PM -0700, Jacob Pan wrote:
> IOASID is a system-wide resource that could vary on different systems.
> The default capacity is 20 bits as defined in the PCI-E specifications.
> This patch adds a function to allow adjusting system IOASID capacity.
> For VT-d this is set during boot as part of the Intel IOMMU
> initialization.
> 
> Signed-off-by: Jacob Pan 

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/intel/iommu.c |  5 +
>  drivers/iommu/ioasid.c  | 20 
>  include/linux/ioasid.h  | 11 +++
>  3 files changed, 36 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 18ed3b3c70d7..e7bcb299e51e 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

(not in alphabetical order)

>  #include 
>  #include 
>  #include 
> @@ -3331,6 +3332,10 @@ static int __init init_dmars(void)
>   if (ret)
>   goto free_iommu;
>  
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm)
> + ioasid_install_capacity(intel_pasid_max_id);
> +
>   /*
>* for each drhd
>*   enable fault log
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 6cfbdfb492e0..4277cb17e15b 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -10,6 +10,10 @@
>  #include 
>  #include 
>  
> +/* Default to PCIe standard 20 bit PASID */
> +#define PCI_PASID_MAX 0x100000
> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
>  struct ioasid_data {
>   ioasid_t id;
>   struct ioasid_set *set;
> @@ -17,6 +21,22 @@ struct ioasid_data {
>   struct rcu_head rcu;
>  };
>  
> +void ioasid_install_capacity(ioasid_t total)
> +{
> + if (ioasid_capacity && ioasid_capacity != PCI_PASID_MAX) {
> + pr_warn("IOASID capacity is already set.\n");
> + return;
> + }
> + ioasid_capacity = ioasid_capacity_avail = total;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> +
> +ioasid_t ioasid_get_capacity(void)
> +{
> + return ioasid_capacity;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> +
>  /*
>   * struct ioasid_allocator_data - Internal data structure to hold information
>   * about an allocator. There are two types of allocators:
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index c7f649fa970a..7fc320656be2 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -32,6 +32,8 @@ struct ioasid_allocator_ops {
>  #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
>  
>  #if IS_ENABLED(CONFIG_IOASID)
> +void ioasid_install_capacity(ioasid_t total);
> +ioasid_t ioasid_get_capacity(void);
>  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private);
>  void ioasid_free(ioasid_t ioasid);
> @@ -42,6 +44,15 @@ void ioasid_unregister_allocator(struct 
> ioasid_allocator_ops *allocator);
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
>  void ioasid_detach_data(ioasid_t ioasid);
>  #else /* !CONFIG_IOASID */
> +static inline void ioasid_install_capacity(ioasid_t total)
> +{
> +}
> +
> +static inline ioasid_t ioasid_get_capacity(void)
> +{
> + return 0;
> +}
> +
>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>   ioasid_t max, void *private)
>  {
> -- 
> 2.7.4
> 


Re: [PATCH v3 03/14] iommu/ioasid: Add a separate function for detach data

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:30PM -0700, Jacob Pan wrote:
> IOASID private data can be cleared by ioasid_attach_data() with a NULL
> data pointer. A common use case is for a caller to free the data
> afterward. ioasid_attach_data() calls synchronize_rcu() before return
> such that free data can be sure without outstanding readers.
> However, since synchronize_rcu() may sleep, ioasid_attach_data() cannot
> be used under spinlocks.
> 
> This patch adds ioasid_detach_data() as a separate API where
> synchronize_rcu() is called only in this case. ioasid_attach_data() can
> then be used under spinlocks. In addition, this change makes the API
> symmetrical.
> 
> Signed-off-by: Jacob Pan 

A typo below, but

Reviewed-by: Jean-Philippe Brucker 
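
For context, the reader side that the final synchronize_rcu() protects is
the ioasid_find() path, roughly:

	rcu_read_lock();
	ioasid_data = xa_load(&active_allocator->xa, ioasid);
	if (ioasid_data)
		priv = rcu_dereference(ioasid_data->private);
	rcu_read_unlock();

so readers are guaranteed to be done with the old private pointer before
the caller frees it.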

> ---
>  drivers/iommu/intel/svm.c |  4 ++--
>  drivers/iommu/ioasid.c| 54 
> ++-
>  include/linux/ioasid.h|  5 -
>  3 files changed, 50 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 2c5645f0737a..06a16bee7b65 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -398,7 +398,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
> struct device *dev,
>   list_add_rcu(&sdev->list, &svm->devs);
>   out:
>   if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
> - ioasid_attach_data(data->hpasid, NULL);
> + ioasid_detach_data(data->hpasid);
>   kfree(svm);
>   }
>  
> @@ -441,7 +441,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>* the unbind, IOMMU driver will get notified
>* and perform cleanup.
>*/
> - ioasid_attach_data(pasid, NULL);
> + ioasid_detach_data(pasid);
>   kfree(svm);
>   }
>   }
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f63af07acd5..6cfbdfb492e0 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -272,24 +272,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
>  
>   spin_lock(&ioasid_allocator_lock);
>   ioasid_data = xa_load(&active_allocator->xa, ioasid);
> - if (ioasid_data)
> - rcu_assign_pointer(ioasid_data->private, data);
> - else
> +
> + if (!ioasid_data) {
>   ret = -ENOENT;
> - spin_unlock(&ioasid_allocator_lock);
> + goto done_unlock;
> + }
>  
> - /*
> -  * Wait for readers to stop accessing the old private data, so the
> -  * caller can free it.
> -  */
> - if (!ret)
> - synchronize_rcu();
> + if (ioasid_data->private) {
> + ret = -EBUSY;
> + goto done_unlock;
> + }
> + rcu_assign_pointer(ioasid_data->private, data);
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
>  
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(ioasid_attach_data);
>  
>  /**
> + * ioasid_detach_data - Clear the private data of an ioasid
> + *
> + * @ioasid: the IOASIDD to clear private data

IOASID

> + */
> +void ioasid_detach_data(ioasid_t ioasid)
> +{
> + struct ioasid_data *ioasid_data;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!ioasid_data) {
> + pr_warn("IOASID %u not found to detach data from\n", ioasid);
> + goto done_unlock;
> + }
> +
> + if (ioasid_data->private) {
> + rcu_assign_pointer(ioasid_data->private, NULL);
> + goto done_unlock;
> + }
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + /*
> +  * Wait for readers to stop accessing the old private data,
> +  * so the caller can free it.
> +  */
> + synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(ioasid_detach_data);
> +
> +/**
>   * ioasid_alloc - Allocate an IOASID
>   * @set: the IOASID set
>   * @min: the minimum ID (inclusive)
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 9c44947a68c8..c7f649fa970a 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -40,7 +40,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
> -
> +void ioasid_detach_data(ioasid_t ioasid);
>  #else /* !CONFIG_IOASID */
>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>   ioasid_t max, void *private)
> @@ -72,5 +72,8 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void 
> *data)
>   return -ENOTSUPP;
>  }
>  
> +static inline void ioasid_detach_data(ioasid_t ioasid)
> +{
> +}
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> -- 
> 2.7.4
> 


Re: [PATCH v3 01/14] docs: Document IO Address Space ID (IOASID) APIs

2020-10-20 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:28PM -0700, Jacob Pan wrote:
> IOASID is used to identify address spaces that can be targeted by device
> DMA. It is a system-wide resource that is essential to its many users.
> This document is an attempt to help developers from all vendors navigate
> the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
> (SIOV) enabled platforms are the primary users of IOASID. Examples of
> how SIOV components interact with IOASID APIs are provided, as many of the
> APIs are driven by requirements from SIOV.
> 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Randy Dunlap 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Wu Hao 
> Signed-off-by: Jacob Pan 

This looks good to me, with small comments below.

> ---
>  Documentation/driver-api/ioasid.rst | 648 
> 
>  1 file changed, 648 insertions(+)
>  create mode 100644 Documentation/driver-api/ioasid.rst
> 
> diff --git a/Documentation/driver-api/ioasid.rst 
> b/Documentation/driver-api/ioasid.rst
> new file mode 100644
> index ..7f8e702997ab
> --- /dev/null
> +++ b/Documentation/driver-api/ioasid.rst
> @@ -0,0 +1,648 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. ioasid:
> +
> +===================
> +IO Address Space ID
> +===================
> +
> +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> +SMMU SubstreamID. An IOASID identifies an address space that DMA
> +requests can target.
> +
> +The primary use cases for IOASID are Shared Virtual Address (SVA) and
> +multiple IOVA spaces per device. However, the requirements for IOASID
> +management can vary among hardware architectures.
> +
> +For baremetal IOVA, IOASID #0 is used for DMA request without

   bare metal

> +PASID, even though some architectures such as VT-d also offer
> +the flexibility of using any PASID for DMA requests without PASID.
> +PASID #0 is reserved and not allocated from any ioasid_set.
> +
> +Multiple IOVA spaces per device are mapped to auxiliary domains which
> +can be used for mediated device assignment with and without a virtual
> +IOMMU (vIOMMU). An IOASID is allocated for each auxiliary domain as default
> +PASID. Without vIOMMU, default IOASID is used for DMA map/unmap
> +APIs. With vIOMMU, default IOASID is used for guest IOVA where DMA
> +request with PASID is required for the device. The reason is that
> +there is only one PASID #0 per device, e.g. VT-d, RID_PASID is per PCI

   on VT-d

> +device.
> +
> +This document covers the generic features supported by IOASID
> +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
> +based platforms as the first example.
> +
> +.. contents:: :local:
> +
> +Glossary
> +
> +PASID - Process Address Space ID
> +
> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> +SubstreamID in SMMU)
> +
> +SVA/SVM - Shared Virtual Addressing/Memory
> +
> +ENQCMD - Intel X86 ISA for efficient workqueue submission [1]
> +!!!TODO: Link to Spec at the bottom

Yes, or maybe hyperlinks at the end of this section would be better. There
are references and lists all over the document so keeping things as close
as possible avoids confusion.

> +
> +DSA - Intel Data Streaming Accelerator [2]
> +
> +VDCM - Virtual Device Composition Module [3]
> +
> +SIOV - Intel Scalable IO Virtualization
> +
> +
> +Key Concepts
> +
> +
> +IOASID Set
> +----------
> +An IOASID set is a group of IOASIDs allocated from the system-wide
> +IOASID pool. Refer to IOASID set APIs for more details.

 IOASID Set Level APIs

> +
> +IOASID set is particularly useful for guest SVA where each guest could
> +have its own IOASID set for security and efficiency reasons.
> +
> +IOASID Set Private ID (SPID)
> +
> +Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> +single system-wide IOASID. Conversely, each IOASID may be associated
> +with an alias ID, local to the IOASID set, named SPID.
> +SPIDs can be used as guest IOASIDs where each guest could do
> +IOASID allocation from its own pool and map them to host physical
> +IOASIDs. SPIDs are particularly useful for supporting live migration
> +where decoupling guest and host physical resources is necessary.
> +
> +For example, two VMs can both allocate guest PASID/SPID #101 but map to
> +different host PASIDs #201 and #202 respectively as shown in the
> +diagram below.
> +::
> +
> + .------------------.    .------------------.
> + |       VM 1       |    |       VM 2       |
> + |                  |    |                  |
> + |------------------|    |------------------|
> + | GPASID/SPID 101  |    | GPASID/SPID 101  |
> + '------------------'    '------------------'    Guest
> + ___________|_______________________|___________
> +            |                       |       Host
> +            v                       v
> + .------------------.
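
To make the example concrete, the flow for one VM with the APIs from this
series would be something like (vm_set, pasid_max and priv are
placeholders):

	ioasid_t hpasid;

	hpasid = ioasid_alloc(vm_set, PASID_MIN, pasid_max, priv);

	/* Record the guest's own number (SPID 101) against the host PASID */
	ioasid_attach_spid(hpasid, 101);

	/* Later, translate guest PASID 101 back to the host PASID */
	hpasid = ioasid_find_by_spid(vm_set, 101);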

Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-19 Thread Jean-Philippe Brucker
On Sat, Oct 17, 2020 at 04:25:25AM -0700, Raj, Ashok wrote:
> > For devices that *don't* use a stop marker, the PCIe spec says (10.4.1.2):
> > 
> >   To stop [using a PASID] without using a Stop Marker Message, the
> >   function shall:
> >   1. Stop queueing new Page Request Messages for this PASID.
> 
> The device driver would need to tell stop sending any new PR's.
> 
> >   2. Finish transmitting any multi-page Page Request Messages for this
> >  PASID (i.e. send the Page Request Message with the L bit Set).
> >   3. Wait for PRG Response Messages associated any outstanding Page
> >  Request Messages for the PASID.
> > 
> > So they have to flush their PR themselves. And since the device driver
> > completes this sequence before calling unbind(), then there shouldn't be
> > any outstanding PR for the PASID, and unbind() doesn't need to flush,
> > right?
> 
> I can see how the device can complete #2,3 above. But the device driver
> isn't the one managing page-responses right. So in order for the device to
> know the above sequence is complete, it would need to get some assist from
> IOMMU driver?

No the device driver just waits for the device to indicate that it has
completed the sequence. That's what the magic stop-PASID mechanism
described by PCIe does. In 6.20.1 "Managing PASID TLP Prefix Usage" it
says:

"A Function must have a mechanism to request that it gracefully stop using
 a specific PASID. This mechanism is device specific but must satisfy the
 following rules:
 [...]
 * When the stop request mechanism indicates completion, the Function has:
   [...]
   * Complied with additional rules described in Address Translation
 Services (Chapter 10 [10.4.1.2 quoted above]) if Address Translations
 or Page Requests were issued on the behalf of this PASID."

So after the device driver initiates this mechanism in the device, the
device must be able to indicate completion of the mechanism, which
includes completing all in-flight Page Requests. At that point the device
driver can call unbind() knowing there is no pending PR for this PASID.
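
In other words the driver-side tear-down can be summed up as (the
device-side helpers below are hypothetical, since the stop mechanism is
device-specific):

	/* Ask the device to gracefully stop using the PASID (PCIe 6.20.1) */
	mydev_request_stop_pasid(mdev, pasid);

	/*
	 * Completion implies steps 1-3 of 10.4.1.2 are done: no Page
	 * Request is in flight anymore for this PASID.
	 */
	wait_event(mdev->wq, mydev_pasid_stopped(mdev, pasid));

	/* So it is safe to unbind without flushing the PRI queue */
	iommu_sva_unbind_device(handle);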

Thanks,
Jean

> 
> How does the driver know that everything the host received has been
> responded back to the device?
> 
> > 
> > > I'm not sure how other IOMMUs behave. When there is no space in
> > > the PRQ, the IOMMU auto-responds to the device. This puts the device in a
> > > while (1) loop. The fake successful response will let the device do an ATS
> > > lookup, and that would fail, forcing the device to do another PRQ.
> > 
> > But in the sequence above, step 1 should ensure that the device will not
> > send another PR for any successful response coming back at step 3.
> 
> True, but there could be some page-request in flight on its way to the
> IOMMU. By draining and getting that round trip back to the IOMMU we
> guarantee things in flight are flushed to the PRQ after that Drain completes.
> > 
> > So I agree with the below if we suspect there could be pending PR, but
> > given that pending PRs are a stop marker thing and we don't know of any device
> > using stop markers, I wondered why I bothered implementing PRIq flush at
> > all for SMMUv3, hence this RFC.
> > 
> 
> Cheers,
> Ashok


Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-16 Thread Jean-Philippe Brucker
On Thu, Oct 15, 2020 at 11:22:11AM -0700, Raj, Ashok wrote:
> Hi Jean
> 
> + Baolu who is looking into this.
> 
> 
> On Thu, Oct 15, 2020 at 11:00:27AM +0200, Jean-Philippe Brucker wrote:
> > Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
> > whether the PRI queue needs flushing. When looking at the PCIe spec
> > again I noticed that most of the time the SMMUv3 driver doesn't actually
> > need to flush the PRI queue. Does this make sense for Intel VT-d as well
> > or did I overlook something?
> > 
> > Before calling iommu_sva_unbind_device(), device drivers must stop the
> > device from using the PASID. For PCIe devices, that consists of
> > completing any pending DMA, and completing any pending page request
> > unless the device uses Stop Markers. So unless the device uses Stop
> > Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
> > means completing all stall events, so we never need to flush the event
> > queue.
> 
> I don't think this is true. Baolu is working on an enhancement to this,
> I'll quickly summarize this below:
> 
> Stop markers are weird, I'm not certain there is any device today that
> sends STOP markers. Even if they did, markers don't have a required
> response, they are fire and forget from the device pov.

Definitely agree with this, and I hope none will implement stop marker.
For devices that *don't* use a stop marker, the PCIe spec says (10.4.1.2):

  To stop [using a PASID] without using a Stop Marker Message, the
  function shall:
  1. Stop queueing new Page Request Messages for this PASID.
  2. Finish transmitting any multi-page Page Request Messages for this
 PASID (i.e. send the Page Request Message with the L bit Set).
  3. Wait for PRG Response Messages associated any outstanding Page
 Request Messages for the PASID.

So they have to flush their PR themselves. And since the device driver
completes this sequence before calling unbind(), then there shouldn't be
any outstanding PR for the PASID, and unbind() doesn't need to flush,
right?

> I'm not sure how other IOMMUs behave. When there is no space in
> the PRQ, the IOMMU auto-responds to the device. This puts the device in a
> while (1) loop. The fake successful response will let the device do an ATS
> lookup, and that would fail, forcing the device to do another PRQ.

But in the sequence above, step 1 should ensure that the device will not
send another PR for any successful response coming back at step 3.

So I agree with the below if we suspect there could be pending PR, but
given that pending PRs are a stop marker thing and we don't know of any device
using stop markers, I wondered why I bothered implementing PRIq flush at
all for SMMUv3, hence this RFC.

Thanks,
Jean


> The idea
> is somewhere there the OS has repeated the others and this will find a way
> in the PRQ. The point is this is less reliable and can't be the only
> indication. PRQ draining has a specific sequence. 
> 
> The detailed steps are outlined in 
> "Chapter 7.10 "Software Steps to Drain Page Requests & Responses"
> 
> - Submit invalidation wait with fence flag to ensure all prior
>   invalidations are processed.
> - submit iotlb followed by devtlb invalidation
> - Submit invalidation wait with page-drain to make sure any page-requests
>   issued by the device are flushed when this invalidation wait completes.
> - If during the above process there was a queue overflow SW can assume no
>   outstanding page-requests are there. If we had a queue full condition,
>   then sw must repeat step 2,3 above.
> 
> 
> To that extent the proposal is as follows: names are suggestive :-) I'm
> making this up as I go!
> 
> - iommu_stop_page_req() - Kernel needs to make sure we respond to all
>   outstanding requests (since we can't drop responses) 
> >   - Ensure we respond immediately for anything that comes before the drain
> sequence completes
> - iommu_drain_page_req() - Which does the above invalidation with
>   Page_drain set.
> 
> Once the driver has performed a reset and before issuing any new request,
> it does iommu_resume_page_req() this will ensure we start processing
> incoming page-req after this point.
> 
> 
> > 
> > First patch adds flags to unbind(), and the second one lets device
> > drivers tell whether the PRI queue needs to be flushed.
> > 
> > Other remarks:
> > 
> > * The PCIe spec (see quote on patch 2) says that the device signals
> >   whether it has sent a Stop Marker or not during Stop PASID. In reality
> >   it's unlikely that a given device will sometimes use one stop method
> >   and sometimes the other, so it could be a device-wide flag rather than
> >   passing it at each unbind().

[RFC PATCH 2/2] iommu: Add IOMMU_UNBIND_FAULT_PENDING flag

2020-10-15 Thread Jean-Philippe Brucker
IOMMU drivers only need to flush their PRI queue when faults might be
pending. According to the PCIe spec (quoted below) this only happens
when using the "Stop Marker" method. Otherwise the function waits for
pending faults before signaling to the device driver that it can
unbind().

Add the IOMMU_UNBIND_FAULT_PENDING flag to unbind(), to tell the IOMMU
driver whether it's worth flushing the queue.
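
For illustration, a driver whose device reports that it will send a Stop
Marker would then call (hypothetical caller, not part of this patch):

	iommu_sva_unbind_device(handle, IOMMU_UNBIND_FAULT_PENDING);

while a driver that waits for all PRG responses before stopping the PASID
keeps passing 0.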

Signed-off-by: Jean-Philippe Brucker 
---
 include/linux/iommu.h | 31 +++
 drivers/iommu/intel/svm.c |  3 ++-
 drivers/iommu/iommu.c |  5 -
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 26c1358a2a37..fd9630b1240d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -163,6 +163,37 @@ enum iommu_dev_features {
 
 #define IOMMU_PASID_INVALID	(-1U)
 
+/*
+ * Indicate that a device stops using a PASID by issuing a Stop Marker Message.
+ * From PCIe 4.0r1.0, 10.4.1.2 Managing PASID TLP Prefix Usage:
+ *
+ * "To stop without using a Stop Marker Message, the Function shall:
+ *  1. Stop queueing new Page Request Messages for this PASID.
+ *  2. Finish transmitting any multi-page Page Request Messages for this PASID
+ * (i.e. send the Page Request Message with the L bit Set).
+ *  3. Wait for PRG Response Messages associated with any outstanding Page Request
+ * Messages for the PASID.
+ *  4. Indicate that the PASID has stopped using a device specific mechanism.
+ * This mechanism must indicate that a Stop Marker Message will not be
+ * generated.
+ *  To stop with the use of a Stop Marker Message the Function shall:
+ * [1. and 2. are the same]
+ *  3. Internally mark all outstanding Page Request Messages for this PASID as
+ * stale. PRG Response Messages associated with these requests will return
+ * Page Request Allocation credits and PRG Index values but are otherwise
+ * ignored.
+ *  4. Indicate that the PASID has stopped using a device specific mechanism.
+ * This mechanism must indicate that a Stop Marker Message will be
+ * generated.
+ *  5. Send a Stop Marker Message to indicate to the host that all subsequent
+ * Page Request Messages for this PASID are for a new use of the PASID
+ * value."
+ *
+ * If the device indicates that the Stop Marker Message will be generated, the
+ * device driver should set the IOMMU_UNBIND_FAULT_PENDING flag.
+ */
+#define IOMMU_UNBIND_FAULT_PENDING (1UL << 0)
+
 #ifdef CONFIG_IOMMU_API
 
 /**
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 700b05612af9..aa1fcb66fa95 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -680,7 +680,8 @@ static int intel_svm_unbind_mm(struct device *dev, u32 pasid,
 * hard to be as defensive as we might like. */
intel_pasid_tear_down_entry(iommu, dev,
svm->pasid, false);
-   intel_svm_drain_prq(dev, svm->pasid);
+   if (flags & IOMMU_UNBIND_FAULT_PENDING)
+   intel_svm_drain_prq(dev, svm->pasid);
kfree_rcu(sdev, rcu);
 
	if (list_empty(&svm->devs)) {
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 741c463095a8..eede0592a2c0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2995,7 +2995,10 @@ EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
  *
  * Put reference to a bond between device and address space. The device should
  * not be issuing any more transaction for this PASID. All outstanding page
- * requests for this PASID must have been flushed to the IOMMU.
+ * requests for this PASID must have been completed, or flushed to the IOMMU. If
+ * they have not been completed, for example when using a Stop Marker Message to
+ * stop PASID in a PCIe device, then the caller must set the flag
+ * %IOMMU_UNBIND_FAULT_PENDING.
  *
  * Returns 0 on success, or an error value
  */
-- 
2.28.0



[RFC PATCH 1/2] iommu: Add flags to sva_unbind()

2020-10-15 Thread Jean-Philippe Brucker
Provide a way for device drivers to tell IOMMU drivers about the device
state and the cleanup work to be done, when unbinding. No functional
change.

Signed-off-by: Jean-Philippe Brucker 
---
 include/linux/intel-iommu.h | 2 +-
 include/linux/iommu.h   | 7 ---
 drivers/iommu/intel/svm.c   | 7 ---
 drivers/iommu/iommu.c   | 5 +++--
 drivers/misc/uacce/uacce.c  | 4 ++--
 5 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index fbf5b3e7707e..5b66b23d591d 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -747,7 +747,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 int intel_svm_unbind_gpasid(struct device *dev, u32 pasid);
 struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm,
 void *drvdata);
-void intel_svm_unbind(struct iommu_sva *handle);
+void intel_svm_unbind(struct iommu_sva *handle, unsigned long flags);
 u32 intel_svm_get_pasid(struct iommu_sva *handle);
 int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
struct iommu_page_response *msg);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b95a6f8db6ff..26c1358a2a37 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -285,7 +285,7 @@ struct iommu_ops {
 
struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm,
  void *drvdata);
-   void (*sva_unbind)(struct iommu_sva *handle);
+   void (*sva_unbind)(struct iommu_sva *handle, unsigned long flags);
u32 (*sva_get_pasid)(struct iommu_sva *handle);
 
int (*page_response)(struct device *dev,
@@ -636,7 +636,7 @@ int iommu_aux_get_pasid(struct iommu_domain *domain, struct device *dev);
 struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm,
void *drvdata);
-void iommu_sva_unbind_device(struct iommu_sva *handle);
+void iommu_sva_unbind_device(struct iommu_sva *handle, unsigned long flags);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
 #else /* CONFIG_IOMMU_API */
@@ -1026,7 +1026,8 @@ iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void *drvdata)
return NULL;
 }
 
-static inline void iommu_sva_unbind_device(struct iommu_sva *handle)
+static inline void iommu_sva_unbind_device(struct iommu_sva *handle,
+  unsigned long flags)
 {
 }
 
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index f1861fa3d0e4..700b05612af9 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -651,7 +651,8 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
 }
 
 /* Caller must hold pasid_mutex */
-static int intel_svm_unbind_mm(struct device *dev, u32 pasid)
+static int intel_svm_unbind_mm(struct device *dev, u32 pasid,
+  unsigned long flags)
 {
struct intel_svm_dev *sdev;
struct intel_iommu *iommu;
@@ -1091,13 +1092,13 @@ intel_svm_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
return sva;
 }
 
-void intel_svm_unbind(struct iommu_sva *sva)
+void intel_svm_unbind(struct iommu_sva *sva, unsigned long flags)
 {
struct intel_svm_dev *sdev;
 
	mutex_lock(&pasid_mutex);
sdev = to_intel_svm_dev(sva);
-   intel_svm_unbind_mm(sdev->dev, sdev->pasid);
+   intel_svm_unbind_mm(sdev->dev, sdev->pasid, flags);
	mutex_unlock(&pasid_mutex);
 }
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c470f451a32..741c463095a8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2991,6 +2991,7 @@ EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
 /**
  * iommu_sva_unbind_device() - Remove a bond created with iommu_sva_bind_device
  * @handle: the handle returned by iommu_sva_bind_device()
+ * @flags: IOMMU_UNBIND_* flags
  *
  * Put reference to a bond between device and address space. The device should
  * not be issuing any more transaction for this PASID. All outstanding page
@@ -2998,7 +2999,7 @@ EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
  *
  * Returns 0 on success, or an error value
  */
-void iommu_sva_unbind_device(struct iommu_sva *handle)
+void iommu_sva_unbind_device(struct iommu_sva *handle, unsigned long flags)
 {
struct iommu_group *group;
struct device *dev = handle->dev;
@@ -3012,7 +3013,7 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
return;
 
	mutex_lock(&group->mutex);
-   ops->sva_unbind(handle);
+   ops->sva_unbind(handle, flags);
	mutex_unlock(&group->mutex);
 
iommu_group_put(group);
diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index 56dd98ab5a81..0800566a6656 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uac

[RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-15 Thread Jean-Philippe Brucker
Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
whether the PRI queue needs flushing. When looking at the PCIe spec
again I noticed that most of the time the SMMUv3 driver doesn't actually
need to flush the PRI queue. Does this make sense for Intel VT-d as well
or did I overlook something?

Before calling iommu_sva_unbind_device(), device drivers must stop the
device from using the PASID. For PCIe devices, that consists of
completing any pending DMA, and completing any pending page request
unless the device uses Stop Markers. So unless the device uses Stop
Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
means completing all stall events, so we never need to flush the event
queue.

First patch adds flags to unbind(), and the second one lets device
drivers tell whether the PRI queue needs to be flushed.

Other remarks:

* The PCIe spec (see quote on patch 2) says that the device signals
  whether it has sent a Stop Marker or not during Stop PASID. In reality
  it's unlikely that a given device will sometimes use one stop method
  and sometimes the other, so it could be a device-wide flag rather than
  passing it at each unbind(). I don't want to speculate too much about
  future implementation so I prefer having the flag in unbind().

* In patch 1, uacce passes 0 to unbind(). To pass the right flag I'm
  thinking that uacce->ops->stop_queue(), which tells the device driver
  to stop DMA, should return whether faults are pending (see the sketch
  below). This can be added later once uacce has an actual PCIe user,
  but we need to remember to do it.
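
A minimal sketch of that, assuming a stop_queue() return value that
doesn't exist yet:

	/* Hypothetical: stop_queue() returns true if faults may be pending */
	faults_pending = q->uacce->ops->stop_queue(q);
	iommu_sva_unbind_device(q->handle,
				faults_pending ? IOMMU_UNBIND_FAULT_PENDING : 0);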

Jean-Philippe Brucker (2):
  iommu: Add flags to sva_unbind()
  iommu: Add IOMMU_UNBIND_FAULT_PENDING flag

 include/linux/intel-iommu.h |  2 +-
 include/linux/iommu.h   | 38 ++---
 drivers/iommu/intel/svm.c   | 10 ++
 drivers/iommu/iommu.c   | 10 +++---
 drivers/misc/uacce/uacce.c  |  4 ++--
 5 files changed, 51 insertions(+), 13 deletions(-)

-- 
2.28.0



Re: [PATCH v11 5/6] iommu/uapi: Handle data and argsz filled by users

2020-09-25 Thread Jean-Philippe Brucker
On Thu, Sep 24, 2020 at 12:24:19PM -0700, Jacob Pan wrote:
> IOMMU user APIs are responsible for processing user data. This patch
> changes the interface such that user pointers can be passed into IOMMU
> code directly. Separate kernel APIs without user pointers are introduced
> for in-kernel users of the UAPI functionality.
> 
> IOMMU UAPI data has a user filled argsz field which indicates the data
> length of the structure. User data is not trusted, argsz must be
> validated based on the current kernel data size, mandatory data size,
> and feature flags.
> 
> User data may also be extended, resulting in possible argsz increase.
> Backward compatibility is ensured based on size and flags (or
> the functional equivalent fields) checking.
> 
> This patch adds sanity checks in the IOMMU layer. In addition to argsz,
> reserved/unused fields in padding, flags, and version are also checked.
> Details are documented in Documentation/userspace-api/iommu.rst
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 

Reviewed-by: Jean-Philippe Brucker 

Some comments below in case you're resending, but nothing important.

> ---
>  drivers/iommu/iommu.c  | 199 +++--
>  include/linux/iommu.h  |  28 +--
>  include/uapi/linux/iommu.h |   1 +
>  3 files changed, 212 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 4ae02291ccc2..5c1b7ae48aae 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1961,34 +1961,219 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +/*
> + * Check flags and other user provided data for valid combinations. We also
> + * make sure no reserved fields or unused flags are set. This is to ensure
> + * not breaking userspace in the future when these fields or flags are used.
> + */
> +static int iommu_check_cache_invl_data(struct iommu_cache_invalidate_info *info)
> +{
> + u32 mask;
> + int i;
> +
> + if (info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> + return -EINVAL;
> +
> + mask = (1 << IOMMU_CACHE_INV_TYPE_NR) - 1;
> + if (info->cache & ~mask)
> + return -EINVAL;
> +
> + if (info->granularity >= IOMMU_INV_GRANU_NR)
> + return -EINVAL;
> +
> + switch (info->granularity) {
> + case IOMMU_INV_GRANU_ADDR:
> + if (info->cache & IOMMU_CACHE_INV_TYPE_PASID)
> + return -EINVAL;
> +
> + mask = IOMMU_INV_ADDR_FLAGS_PASID |
> + IOMMU_INV_ADDR_FLAGS_ARCHID |
> + IOMMU_INV_ADDR_FLAGS_LEAF;
> +
> + if (info->granu.addr_info.flags & ~mask)
> + return -EINVAL;
> + break;
> + case IOMMU_INV_GRANU_PASID:
> + mask = IOMMU_INV_PASID_FLAGS_PASID |
> + IOMMU_INV_PASID_FLAGS_ARCHID;
> + if (info->granu.pasid_info.flags & ~mask)
> + return -EINVAL;
> +
> + break;
> + case IOMMU_INV_GRANU_DOMAIN:
> + if (info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB)
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + /* Check reserved padding fields */
> + for (i = 0; i < sizeof(info->padding); i++) {
> + if (info->padding[i])
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
>  int iommu_uapi_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> - struct iommu_cache_invalidate_info *inv_info)
> + void __user *uinfo)
>  {
> + struct iommu_cache_invalidate_info inv_info = { 0 };
> + u32 minsz;
> + int ret = 0;

nit: no need to initialize it

> +
>   if (unlikely(!domain->ops->cache_invalidate))
>   return -ENODEV;
>  
> - return domain->ops->cache_invalidate(domain, dev, inv_info);
> + /*
> +  * No new spaces can be added before the variable sized union, the
> +  * minimum size is the offset to the union.
> +  */
> + minsz = offsetof(struct iommu_cache_invalidate_info, granu);

Why not use offsetofend() to avoid naming the unions?
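
Assuming the padding array stays the last field before the union, that
would be something like:

	minsz = offsetofend(struct iommu_cache_invalidate_info, padding);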

> +
> + /* Copy minsz from user to get flags and argsz */
> + if (copy_from_user(&inv_info, uinfo, minsz))
> + return -EFAULT;
> +
> + /* Fields before variable size union is mandatory */
>

Re: [PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-16 Thread Jean-Philippe Brucker
On Wed, Sep 16, 2020 at 03:39:37PM +0300, Yauheni Kaliuta wrote:
> If you start to amend extables, could you consider a change like
> 
> 05a68e892e89 ("s390/kernel: expand exception table logic to allow new 
> handling options")
> 
> and implementation of BPF_PROBE_MEM then?

Commit 800834285361 ("bpf, arm64: Add BPF exception tables") should have
taken care of that, no?

Thanks,
Jean


Re: [PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-15 Thread Jean-Philippe Brucker
On Tue, Sep 15, 2020 at 02:11:03PM +0100, Will Deacon wrote:
> > ret = build_insn(insn, ctx, extra_pass);
> > if (ret > 0) {
> > i++;
> > if (ctx->image == NULL)
> > -   ctx->offset[i] = ctx->idx;
> > +   ctx->offset[i] = ctx->offset[i - 1];
> 
> Does it matter that we set the offset for both halves of a 16-byte BPF
> instruction? I think that's a change in behaviour here.

After testing this patch a bit, I think setting only the first slot should
be sufficient, and we can drop these two lines. It does make a minor
difference, because although the BPF verifier normally rejects a program
that jumps into the middle of a 16-byte instruction, it can validate it in
some cases: 

BPF_LD_IMM64(BPF_REG_0, 2)  // 16-byte immediate load
BPF_JMP_IMM(BPF_JLE, BPF_REG_0, 1, -2)  // If r0 <= 1, branch to
BPF_EXIT_INSN() //  the middle of the insn

The verifier detects that the condition is always false and doesn't follow
the branch, hands the program to the JIT. So if we don't set the second
slot, then we generate an invalid branch offset. But that doesn't really
matter as the branch is never taken.
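
The offset bookkeeping in build_body() would then be roughly (sketch,
untested):

	for (i = 0; i < prog->len; i++) {
		const struct bpf_insn *insn = &prog->insnsi[i];
		int ret;

		if (ctx->image == NULL)
			ctx->offset[i] = ctx->idx;
		ret = build_insn(insn, ctx, extra_pass);
		if (ret > 0) {
			/* 16-byte insn: leave the second offset slot unset */
			i++;
			continue;
		}
		if (ret)
			return ret;
	}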

Thanks,
Jean


Re: [PATCH v2 6/9] iommu/ioasid: Introduce notification APIs

2020-08-25 Thread Jean-Philippe Brucker
= nb) {
> + pr_warn("Cannot unregister NB from pending list\n");
> + spin_unlock(&ioasid_nb_lock);
> + return;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +
> + if (set)
> + atomic_notifier_chain_unregister(&set->nh, nb);
> + else
> + atomic_notifier_chain_unregister(&ioasid_chain, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
> +
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> + struct ioasid_set *set;
> + int ret = 0;
> +
> + if (!mm)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_nb_lock);
> +
> + /* Check for duplicates, nb is unique per set */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + }
> +
> + /* Check if the token has an existing set */
> + set = ioasid_find_mm_set(mm);

Seems to be a deadlock here, as ioasid_find_mm_set() grabs
ioasid_allocator_lock while holding ioasid_nb_lock, and
ioasid_set_put/get_locked() grabs ioasid_nb_lock while holding
ioasid_allocator_lock.
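
That is:

	CPU0 (register_notifier_mm)        CPU1 (set_put/get_locked)
	  lock(ioasid_nb_lock)               lock(ioasid_allocator_lock)
	  lock(ioasid_allocator_lock)        lock(ioasid_nb_lock)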

> + if (IS_ERR_OR_NULL(set)) {

Looks a bit off, maybe we can check !set since ioasid_find_mm_set()
doesn't return errors.

> + /* Add to the rsvd list as inactive */
> + curr->active = false;

curr isn't valid here

> + } else {
> + /* REVISIT: Only register empty set for now. Can add an option
> +  * in the future to playback existing PASIDs.
> +  */
> + if (set->nr_ioasids) {
> + pr_warn("IOASID set %d not empty\n", set->sid);
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);

As a side-note, I think there's too much atomic allocation in this file,
I'd like to try and rework the locking once it stabilizes and I find some
time. Do you remember why ioasid_allocator_lock needed to be a spinlock?

> + if (!curr) {
> + ret = -ENOMEM;
> + goto exit_unlock;
> + }
> + curr->token = mm;
> + curr->nb = nb;
> + curr->active = true;
> + curr->set = set;
> +
> + /* Set already created, add to the notifier chain */
> + atomic_notifier_chain_register(&set->nh, nb);
> + /*
> +  * Do not hold a reference, if the set gets destroyed, the nb
> +  * entry will be marked inactive.
> +  */
> + ioasid_set_put(set);
> + }
> +
> + list_add(&curr->list, &ioasid_nb_pending_list);
> +
> +exit_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
> +
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + list_del(&curr->list);
> + goto exit_free;
> + }
> + }
> + pr_warn("No ioasid set found for mm token %llx\n",  (u64)mm);
> + goto done_unlock;
> +
> +exit_free:
> + if (curr->active) {
> + pr_debug("mm set active, unregister %llx\n",
> + (u64)mm);

%px shows raw pointers, but I would drop this altogether or use %p.

> + atomic_notifier_chain_unregister(&curr->set->nh, nb);
> + }
> + kfree(curr);
> +done_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
> +
> +/**
> + * ioasid_notify - Send notification on a given IOASID for status change.
> + * Used by publishers when the status change may affect
> + * subscriber's internal state.
> + *
> + * @ioasid:  The IOASID to which the notification will send
> + * @cmd: The notification event
> + * @flags:   Special instructions, e.g. notify with a set or global

Describe valid values for @cmd and @flags?  I guess this function
shouldn't accept IOASID_ALLOC, IOASID_FREE etc

> + */
> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags)
> +{
> + struct ioasid_data *ioas

Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

2020-08-25 Thread Jean-Philippe Brucker
On Fri, Aug 21, 2020 at 09:35:14PM -0700, Jacob Pan wrote:
> When an IOASID set is used for guest SVA, each VM will acquire its
> ioasid_set for IOASID allocations. IOASIDs within the VM must have a
> host/physical IOASID backing, mapping between guest and host IOASIDs can
> be non-identical. IOASID set private ID (SPID) is introduced in this
> patch to be used as guest IOASID. However, the concept of ioasid_set
> specific namespace is generic, thus named SPID.
> 
> As SPID namespace is within the IOASID set, the IOASID core can provide
> lookup services at both directions. SPIDs may not be allocated when its
> IOASID is allocated, the mapping between SPID and IOASID is usually
> established when a guest page table is bound to a host PASID.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 54 ++
>  include/linux/ioasid.h | 12 +++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f31d63c75b1..c0aef38a4fde 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -21,6 +21,7 @@ enum ioasid_state {
>   * struct ioasid_data - Meta data about ioasid
>   *
>   * @id:  Unique ID
> + * @spid:   Private ID unique within a set
>   * @users   Number of active users
>   * @state   Track state of the IOASID
>   * @set     Meta data of the set this IOASID belongs to
> @@ -29,6 +30,7 @@ enum ioasid_state {
>   */
>  struct ioasid_data {
>   ioasid_t id;
> + ioasid_t spid;
>   struct ioasid_set *set;
>   refcount_t users;
>   enum ioasid_state state;
> @@ -326,6 +328,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
>  EXPORT_SYMBOL_GPL(ioasid_attach_data);
>  
>  /**
> + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
> + *
> + * @ioasid: the ID to attach
> + * @spid:   the ioasid_set private ID of @ioasid
> + *
> + * For IOASID that is already allocated, private ID within the set can be
> + * attached via this API. Future lookup can be done via ioasid_find.

via ioasid_find_by_spid()?

> + */
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + struct ioasid_data *ioasid_data;
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!ioasid_data) {
> + pr_err("No IOASID entry %d to attach SPID %d\n",
> + ioasid, spid);
> + ret = -ENOENT;
> + goto done_unlock;
> + }
> + ioasid_data->spid = spid;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
> +
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)

Maybe add a bit of documentation as this is public-facing.

> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (!xa_load(&ioasid_sets, set->sid)) {
> + pr_warn("Invalid set\n");
> + return INVALID_IOASID;
> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + if (spid == entry->spid) {
> + pr_debug("Found ioasid %lu by spid %u\n", index, spid);
> + refcount_inc(&entry->users);

Nothing prevents ioasid_free() from concurrently dropping the refcount to
zero and calling ioasid_do_free(). The caller will later call ioasid_put()
on a stale/reallocated index.
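
Something like refcount_inc_not_zero(), treating a failure as not-found,
would close the race (untested):

	if (spid == entry->spid) {
		if (!refcount_inc_not_zero(&entry->users))
			break;	/* concurrently freed */
		return index;
	}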

> + return index;
> + }
> + }
> + return INVALID_IOASID;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> +
> +/**
>   * ioasid_alloc - Allocate an IOASID
>   * @set: the IOASID set
>   * @min: the minimum ID (inclusive)
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 310abe4187a3..d4b3e83672f6 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -73,6 +73,8 @@ bool ioasid_is_active(ioasid_t ioasid);
>  
>  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
>  int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>  void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> @@ -136,5 +138,15 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
>   return -ENOTSUPP;
>  }
>  
> +static inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +{
> + return -ENOTSUPP;

INVALID_IOASID

Thanks,
Jean

> +}
> +
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> -- 
> 2.7.4
> 


Re: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

2020-08-25 Thread Jean-Philippe Brucker
On Mon, Aug 24, 2020 at 10:26:55AM +0800, Lu Baolu wrote:
> Hi Jacob,
> 
> On 8/22/20 12:35 PM, Jacob Pan wrote:
> > There can be multiple users of an IOASID, each user could have hardware
> > contexts associated with the IOASID. In order to align lifecycles,
> > reference counting is introduced in this patch. It is expected that when
> > an IOASID is being freed, each user will drop a reference only after its
> > context is cleared.
> > 
> > Signed-off-by: Jacob Pan 
[...]
> > +/**
> >* ioasid_find - Find IOASID data
> >* @set: the IOASID set
> >* @ioasid: the IOASID to find
> 
> Do you need to increase the refcount of the found ioasid and ask the
> caller to drop it after use? Otherwise, the ioasid might be freed
> elsewhere.

ioasid_find() takes a getter function as parameter, which ensures that the
returned data is valid. It fetches the IOASID data under rcu_read_lock()
and calls the getter on the private data (for example mmget_not_zero() for
bare-metal SVA). Given that, I don't think returning with a reference to
the IOASID is necessary. The IOASID may be freed once ioasid_find()
returns but not the returned data.
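
For instance the bare-metal SVA getter is essentially the following
(simplified sketch, the function name is made up):

	/* @data is the mm_struct bound to the PASID */
	static bool sva_get_mm(void *data)
	{
		struct mm_struct *mm = data;

		return mmget_not_zero(mm);
	}

	...
	mm = ioasid_find(set, pasid, sva_get_mm);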

Thanks,
Jean


Re: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

2020-08-25 Thread Jean-Philippe Brucker
On Fri, Aug 21, 2020 at 09:35:13PM -0700, Jacob Pan wrote:
> There can be multiple users of an IOASID, each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 113 +
>  include/linux/ioasid.h |   4 ++
>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index f73b3dbfc37a..5f31d63c75b1 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -717,6 +717,119 @@ int ioasid_set_for_each_ioasid(struct ioasid_set *set,
>  EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>  
>  /**
> + * IOASID refcounting rules
> + * - ioasid_alloc() set initial refcount to 1
> + *
> + * - ioasid_free() decrement and test refcount.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + *
> + * If refcount is non-zero, mark IOASID as IOASID_STATE_FREE_PENDING.
> + * No new reference can be added. The IOASID is not returned to the pool
> + * for reuse.
> + * After free, ioasid_get() will return error but ioasid_find() and other
> + * non refcount adding APIs will continue to work until the last reference
> + * is dropped
> + *
> + * - ioasid_get() get a reference on an active IOASID
> + *
> + * - ioasid_put() decrement and test refcount of the IOASID.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + * Do nothing if refcount is non-zero.
> + *
> + * - ioasid_find() does not take reference, caller must hold reference
> + *
> + * ioasid_free() can be called multiple times without error until all refs are
> + * dropped.
> + */

Since you already document this in ioasid.rst, I'm not sure the comment
is necessary. Maybe the doc for _free/_put would be better in the
function's documentation.
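
For reference, a user following these rules would do something like this
(sketch, the hw context helpers are made up):

	/* Hold a reference as long as a HW context uses the IOASID */
	if (ioasid_get(set, ioasid))
		return -EINVAL;
	program_hw_context(ioasid);
	...
	clear_hw_context(ioasid);
	ioasid_put(set, ioasid);	/* may free the IOASID */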

> +
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);
> + return -EBUSY;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID not in set%u\n", ioasid);
> + /* data found but does not belong to the set */
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);

Is it necessary to export the *_locked variant?  Who'd call them and how
would they acquire the lock?

> +
> +/**
> + * ioasid_get - Obtain a reference of an ioasid
> + * @set
> + * @ioasid

Can be dropped. The doc checker will throw a warning, though.

> + *
> + * Check set ownership if @set is non-null.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret = 0;

No need to initialize ret

> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID not in the set %u\n", ioasid);
> + return;
> + }
> +
> + if (!refcount_dec_and_test(&data->users)) {
> + pr_debug("%s: IOASID %d has %d remaining users\n",
> + __func__, ioasid, refcount_read(&data->users));
> + return;
> + }
> + ioasid_do_free(data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Drop a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_put_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put);
> +
> +/**
>   * ioasid_find - Find IOASID data
>   * @set: the IOASID set
>   * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h 

Re: [PATCH v2 2/9] iommu/ioasid: Rename ioasid_set_data()

2020-08-24 Thread Jean-Philippe Brucker
On Fri, Aug 21, 2020 at 09:35:11PM -0700, Jacob Pan wrote:
> Rename ioasid_set_data() to ioasid_attach_data() to avoid confusion with
> struct ioasid_set. ioasid_set is a group of IOASIDs that share a common
> token.
> 
> Signed-off-by: Jacob Pan 

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/intel/svm.c | 6 +++---
>  drivers/iommu/ioasid.c| 6 +++---
>  include/linux/ioasid.h| 4 ++--
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index b6972dca2ae0..37a9beabc0ca 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -342,7 +342,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
>   svm->gpasid = data->gpasid;
>   svm->flags |= SVM_FLAG_GUEST_PASID;
>   }
> - ioasid_set_data(data->hpasid, svm);
> + ioasid_attach_data(data->hpasid, svm);
>   INIT_LIST_HEAD_RCU(>devs);
>   mmput(svm->mm);
>   }
> @@ -394,7 +394,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
>   list_add_rcu(&sdev->list, &svm->devs);
>   out:
>   if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
> - ioasid_set_data(data->hpasid, NULL);
> + ioasid_attach_data(data->hpasid, NULL);
>   kfree(svm);
>   }
>  
> @@ -437,7 +437,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>* the unbind, IOMMU driver will get notified
>* and perform cleanup.
>*/
> - ioasid_set_data(pasid, NULL);
> + ioasid_attach_data(pasid, NULL);
>   kfree(svm);
>   }
>   }
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 0f8dd377aada..5f63af07acd5 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -258,14 +258,14 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *ops)
>  EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
>  
>  /**
> - * ioasid_set_data - Set private data for an allocated ioasid
> + * ioasid_attach_data - Set private data for an allocated ioasid
>   * @ioasid: the ID to set data
>   * @data:   the private data
>   *
>   * For IOASID that is already allocated, private data can be set
>   * via this API. Future lookup can be done via ioasid_find.
>   */
> -int ioasid_set_data(ioasid_t ioasid, void *data)
> +int ioasid_attach_data(ioasid_t ioasid, void *data)
>  {
>   struct ioasid_data *ioasid_data;
>   int ret = 0;
> @@ -287,7 +287,7 @@ int ioasid_set_data(ioasid_t ioasid, void *data)
>  
>   return ret;
>  }
> -EXPORT_SYMBOL_GPL(ioasid_set_data);
> +EXPORT_SYMBOL_GPL(ioasid_attach_data);
>  
>  /**
>   * ioasid_alloc - Allocate an IOASID
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 6f000d7a0ddc..9c44947a68c8 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -39,7 +39,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
>  int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> -int ioasid_set_data(ioasid_t ioasid, void *data);
> +int ioasid_attach_data(ioasid_t ioasid, void *data);
>  
>  #else /* !CONFIG_IOASID */
>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> @@ -67,7 +67,7 @@ static inline void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator)
>  {
>  }
>  
> -static inline int ioasid_set_data(ioasid_t ioasid, void *data)
> +static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
>  {
>   return -ENOTSUPP;
>  }
> -- 
> 2.7.4
> 


Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

2020-08-24 Thread Jean-Philippe Brucker
On Fri, Aug 21, 2020 at 09:35:12PM -0700, Jacob Pan wrote:
> ioasid_set was introduced as an arbitrary token that is shared by a
> group of IOASIDs. For example, if IOASID #1 and #2 are allocated via the
> same ioasid_set*, they are viewed as to belong to the same set.
> 
> For guest SVA usages, system-wide IOASID resources need to be
> partitioned such that VMs can have its own quota and being managed
> separately. ioasid_set is the perfect candidate for meeting such
> requirements. This patch redefines and extends ioasid_set with the
> following new fields:
> - Quota
> - Reference count
> - Storage of its namespace
> - The token is stored in the new ioasid_set but with optional types
> 
> ioasid_set level APIs are introduced that wires up these new data.
> Existing users of IOASID APIs are converted where a host IOASID set is
> allocated for bare-metal usage.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel/iommu.c |  27 ++-
>  drivers/iommu/intel/pasid.h |   1 +
>  drivers/iommu/intel/svm.c   |   8 +-
>  drivers/iommu/ioasid.c  | 390 +---
>  include/linux/ioasid.h  |  82 --
>  5 files changed, 465 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index a3a0b5c8921d..5813eeaa5edb 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -103,6 +104,9 @@
>   */
>  #define INTEL_IOMMU_PGSIZES  (~0xFFFUL)
>  
> +/* PASIDs used by host SVM */
> +struct ioasid_set *host_pasid_set;
> +
>  static inline int agaw_to_level(int agaw)
>  {
>   return agaw + 2;
> @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
>* Sanity check the ioasid owner is done at upper layer, e.g. VFIO
>* We can only free the PASID when all the devices are unbound.
>*/
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> + pr_err("Cannot free IOASID %d, not in system set\n", ioasid);
>   return;
>   }
>   vcmd_free_pasid(iommu, ioasid);
> @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
>   if (ret)
>   goto free_iommu;
>  
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm) {
> + ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + host_pasid_set = ioasid_alloc_set(NULL, PID_MAX_DEFAULT,
> + IOASID_SET_TYPE_NULL);
> + if (IS_ERR_OR_NULL(host_pasid_set)) {
> + pr_err("Failed to enable host PASID allocator %lu\n",
> + PTR_ERR(host_pasid_set));
> + intel_iommu_sm = 0;
> + }
> + }
> +
>   /*
>* for each drhd
>*   enable fault log
> @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
>   domain->auxd_refcnt--;
>  
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>  }
>  
>  static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
>   int pasid;
>  
>   /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
>pci_max_pasids(to_pci_dev(dev)) - 1,
>NULL);
>   if (pasid == INVALID_IOASID) {
> @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
>   spin_unlock(&iommu->lock);
>   spin_unlock_irqrestore(&device_domain_lock, flags);
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>  
>   return ret;
>  }
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index c9850766c3a9..ccdc23446015 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
>  }
>  
>  extern u32 intel_pasid_max_id;
> +extern struct ioasid_set *host_pasid_set;
>  int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
>  void intel_pasid_free_id(int pasid);
>  void *intel_pasid_lookup_id(int pasid);
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 

Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

2020-08-24 Thread Jean-Philippe Brucker
On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
> IOASID is used to identify address spaces that can be targeted by device
> DMA. It is a system-wide resource that is essential to its many users.
> This document is an attempt to help developers from all vendors navigate
> the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
> (SIOV) enabled platforms are the primary users of IOASID. Examples of
> how SIOV components interact with IOASID APIs are provided in that many
> APIs are driven by the requirements from SIOV.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Wu Hao 
> Signed-off-by: Jacob Pan 
> ---
>  Documentation/ioasid.rst | 618 +++
>  1 file changed, 618 insertions(+)
>  create mode 100644 Documentation/ioasid.rst
> 
> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst

Thanks for writing this up. Should it go to Documentation/driver-api/, or
Documentation/driver-api/iommu/? I think this also needs to Cc
linux-...@vger.kernel.org and cor...@lwn.net

> new file mode 100644
> index ..b6a8cdc885ff
> --- /dev/null
> +++ b/Documentation/ioasid.rst
> @@ -0,0 +1,618 @@
> +.. ioasid:
> +
> +===================
> +IO Address Space ID
> +===================
> +
> +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> +SMMU sub-stream ID. An IOASID identifies an address space that DMA

"SubstreamID"

> +requests can target.
> +
> +The primary use cases for IOASID are Shared Virtual Address (SVA) and
> +IO Virtual Address (IOVA). However, the requirements for IOASID

IOVA alone isn't a use case, maybe "multiple IOVA spaces per device"?

> +management can vary among hardware architectures.
> +
> +This document covers the generic features supported by IOASID
> +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
> +based platforms as the first example.
> +
> +.. contents:: :local:
> +
> +Glossary
> +========
> +PASID - Process Address Space ID
> +
> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> +sub-stream ID in SMMU)

"SubstreamID"

> +
> +SVA/SVM - Shared Virtual Addressing/Memory
> +
> +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]

Maybe drop the "New", to keep the documentation perennial. It might be
good to add internal links here to the specifications URLs at the bottom.

> +
> +DSA - Intel Data Streaming Accelerator [2]
> +
> +VDCM - Virtual device composition module [3]
> +
> +SIOV - Intel Scalable IO Virtualization
> +
> +
> +Key Concepts
> +============
> +
> +IOASID Set
> +----------
> +An IOASID set is a group of IOASIDs allocated from the system-wide
> +IOASID pool. An IOASID set is created and can be identified by a
> +token of u64. Refer to IOASID set APIs for more details.

Identified either by an u64 or an mm_struct, right?  Maybe just drop the
second sentence if it's detailed in the IOASID set section below.

> +
> +IOASID set is particularly useful for guest SVA where each guest could
> +have its own IOASID set for security and efficiency reasons.
> +
> +IOASID Set Private ID (SPID)
> +----------------------------
> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
> +system-wide IOASID but the namespace of SPID is within its IOASID
> +set.

The intro isn't super clear. Perhaps this is simpler:
"Each IOASID set has a private namespace of SPIDs. An SPID maps to a
single system-wide IOASID."

> SPIDs can be used as guest IOASIDs where each guest could do
> +IOASID allocation from its own pool and map them to host physical
> +IOASIDs. SPIDs are particularly useful for supporting live migration
> +where decoupling guest and host physical resources are necessary.
> +
> +For example, two VMs can both allocate guest PASID/SPID #101 but map to
> +different host PASIDs #201 and #202 respectively as shown in the
> +diagram below.
> +::
> +
> + .------------------.    .------------------.
> + |       VM 1       |    |       VM 2       |
> + |                  |    |                  |
> + |------------------|    |------------------|
> + | GPASID/SPID 101  |    | GPASID/SPID 101  |
> + '------------------'    '------------------'    Guest
> + __________|_______________________|_____________
> +           |                       |              Host
> +           v                       v
> + .------------------.    .------------------.
> + | Host IOASID 201  |    | Host IOASID 202  |
> + '------------------'    '------------------'
> + |   IOASID set 1   |    |   IOASID set 2   |
> + '------------------'    '------------------'
> +
> +Guest PASID is treated as IOASID set private ID (SPID) within an
> +IOASID set, mappings between guest and host IOASIDs are stored in the
> +set for inquiry.
> +
> +IOASID APIs
> +===========
> +To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
> +serve the following functionalities:
> +
> +  - IOASID allocation/Free
> +  - Group 

Re: [PATCH v2 12/24] virtio_iommu: correct tags for config space fields

2020-08-04 Thread Jean-Philippe Brucker
On Mon, Aug 03, 2020 at 04:59:27PM -0400, Michael S. Tsirkin wrote:
> Since this is a modern-only device,
> tag config space fields as having little endian-ness.
> 
> Signed-off-by: Michael S. Tsirkin 

Reviewed-by: Jean-Philippe Brucker 

And tested with the latest sparse

> ---
>  include/uapi/linux/virtio_iommu.h | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
> index 48e3c29223b5..237e36a280cb 100644
> --- a/include/uapi/linux/virtio_iommu.h
> +++ b/include/uapi/linux/virtio_iommu.h
> @@ -18,24 +18,24 @@
>  #define VIRTIO_IOMMU_F_MMIO  5
>  
>  struct virtio_iommu_range_64 {
> - __u64   start;
> - __u64   end;
> + __le64  start;
> + __le64  end;
>  };
>  
>  struct virtio_iommu_range_32 {
> - __u32   start;
> - __u32   end;
> + __le32  start;
> + __le32  end;
>  };
>  
>  struct virtio_iommu_config {
>   /* Supported page sizes */
> - __u64   page_size_mask;
> + __le64  page_size_mask;
>   /* Supported IOVA range */
>   struct virtio_iommu_range_64input_range;
>   /* Max domain ID size */
>   struct virtio_iommu_range_32domain_range;
>   /* Probe buffer size */
> - __u32   probe_size;
> + __le32  probe_size;
>  };
>  
>  /* Request types */
> -- 
> MST
> 


Re: [PATCH bpf-next 1/1] arm64: bpf: Add BPF exception tables

2020-07-30 Thread Jean-Philippe Brucker
On Thu, Jul 30, 2020 at 09:47:39PM +0200, Daniel Borkmann wrote:
> On 7/30/20 4:22 PM, Jean-Philippe Brucker wrote:
> > On Thu, Jul 30, 2020 at 08:28:56AM -0400, Qian Cai wrote:
> > > On Tue, Jul 28, 2020 at 05:21:26PM +0200, Jean-Philippe Brucker wrote:
> > > > When a tracing BPF program attempts to read memory without using the
> > > > bpf_probe_read() helper, the verifier marks the load instruction with
> > > > the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize
> > > > this flag it falls back to the interpreter.
> > > > 
> > > > Add support for BPF_PROBE_MEM, by appending an exception table to the
> > > > BPF program. If the load instruction causes a data abort, the fixup
> > > > infrastructure finds the exception table and fixes up the fault, by
> > > > clearing the destination register and jumping over the faulting
> > > > instruction.
> > > > 
> > > > To keep the compact exception table entry format, inspect the pc in
> > > > fixup_exception(). A more generic solution would add a "handler" field
> > > > to the table entry, like on x86 and s390.
> > > > 
> > > > Signed-off-by: Jean-Philippe Brucker 
> > > 
> > > This will fail to compile on arm64,
> > > 
> > > https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config
> > > 
> > > arch/arm64/mm/extable.o: In function `fixup_exception':
> > > arch/arm64/mm/extable.c:19: undefined reference to `arm64_bpf_fixup_exception'
> > 
> > Thanks for the report, I attached a fix. Daniel, can I squash it and
> > resend as v2 or is it too late?
> 
> If you want I can squash your attached snippet into the original patch of
> yours. If you want to send a v2 that is fine as well of course. Let me know.

Yes please squash it into the original patch, sorry for the mess

Thanks,
Jean


Re: [PATCH bpf-next 1/1] arm64: bpf: Add BPF exception tables

2020-07-30 Thread Jean-Philippe Brucker
On Thu, Jul 30, 2020 at 08:28:56AM -0400, Qian Cai wrote:
> On Tue, Jul 28, 2020 at 05:21:26PM +0200, Jean-Philippe Brucker wrote:
> > When a tracing BPF program attempts to read memory without using the
> > bpf_probe_read() helper, the verifier marks the load instruction with
> > the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize
> > this flag it falls back to the interpreter.
> > 
> > Add support for BPF_PROBE_MEM, by appending an exception table to the
> > BPF program. If the load instruction causes a data abort, the fixup
> > infrastructure finds the exception table and fixes up the fault, by
> > clearing the destination register and jumping over the faulting
> > instruction.
> > 
> > To keep the compact exception table entry format, inspect the pc in
> > fixup_exception(). A more generic solution would add a "handler" field
> > to the table entry, like on x86 and s390.
> > 
> > Signed-off-by: Jean-Philippe Brucker 
> 
> This will fail to compile on arm64,
> 
> https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config
> 
> arch/arm64/mm/extable.o: In function `fixup_exception':
> arch/arm64/mm/extable.c:19: undefined reference to `arm64_bpf_fixup_exception'

Thanks for the report, I attached a fix. Daniel, can I squash it and
resend as v2 or is it too late?

I'd be more confident if my patches sat a little longer on the list so
arm64 folks have a chance to review them. This isn't my first silly
mistake...

Thanks,
Jean
From 17d0f041b57903cb2657dde15559cd1923498337 Mon Sep 17 00:00:00 2001
From: Jean-Philippe Brucker 
Date: Thu, 30 Jul 2020 14:45:44 +0200
Subject: [PATCH] arm64: bpf: Fix build for !CONFIG_BPF_JIT

Add a stub for arm64_bpf_fixup_exception() when CONFIG_BPF_JIT isn't
enabled, and avoid the fixup in this case.

Signed-off-by: Jean-Philippe Brucker 
---
 arch/arm64/include/asm/extable.h | 9 +
 arch/arm64/mm/extable.c  | 3 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
index bcee40df1586..840a35ed92ec 100644
--- a/arch/arm64/include/asm/extable.h
+++ b/arch/arm64/include/asm/extable.h
@@ -22,8 +22,17 @@ struct exception_table_entry
 
 #define ARCH_HAS_RELATIVE_EXTABLE
 
+#ifdef CONFIG_BPF_JIT
 int arm64_bpf_fixup_exception(const struct exception_table_entry *ex,
  struct pt_regs *regs);
+#else /* !CONFIG_BPF_JIT */
+static inline
+int arm64_bpf_fixup_exception(const struct exception_table_entry *ex,
+ struct pt_regs *regs)
+{
+   return 0;
+}
+#endif /* !CONFIG_BPF_JIT */
 
 extern int fixup_exception(struct pt_regs *regs);
 #endif
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 1f42991cacdd..eee1732ab6cd 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -14,7 +14,8 @@ int fixup_exception(struct pt_regs *regs)
if (!fixup)
return 0;
 
-   if (regs->pc >= BPF_JIT_REGION_START &&
+   if (IS_ENABLED(CONFIG_BPF_JIT) &&
+   regs->pc >= BPF_JIT_REGION_START &&
regs->pc < BPF_JIT_REGION_END)
return arm64_bpf_fixup_exception(fixup, regs);
 
-- 
2.27.0



Re: [PATCH v5 03/15] iommu/smmu: Report empty domain nesting info

2020-07-17 Thread Jean-Philippe Brucker
On Thu, Jul 16, 2020 at 10:38:17PM +0200, Auger Eric wrote:
> Hi Jean,
> 
> On 7/16/20 5:39 PM, Jean-Philippe Brucker wrote:
> > On Tue, Jul 14, 2020 at 10:12:49AM +, Liu, Yi L wrote:
> >>> Have you verified that this doesn't break the existing usage of
> >>> DOMAIN_ATTR_NESTING in drivers/vfio/vfio_iommu_type1.c?
> >>
> >> I didn't have ARM machine on my hand. But I contacted with Jean
> >> Philippe, he confirmed no compiling issue. I didn't see any code
> >> getting DOMAIN_ATTR_NESTING attr in current 
> >> drivers/vfio/vfio_iommu_type1.c.
> >> What I'm adding is to call iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING)
> >> and won't fail if the iommu_domain_get_attr() returns 0. This patch
> >> returns an empty nesting info for DOMAIN_ATTR_NESTING and return
> >> value is 0 if no error. So I guess it won't fail nesting for ARM.
> > 
> > I confirm that this series doesn't break the current support for
> > VFIO_IOMMU_TYPE1_NESTING with an SMMUv3. That said...
> > 
> > If the SMMU does not support stage-2 then there is a change in behavior
> > (untested): after the domain is silently switched to stage-1 by the SMMU
> > driver, VFIO will now query nesting info and obtain -ENODEV. Instead of
> > succeeding as before, the VFIO ioctl will now fail. I believe that's a fix
> > rather than a regression, it should have been like this since the
> > beginning. No known userspace has been using VFIO_IOMMU_TYPE1_NESTING so
> > far, so I don't think it should be a concern.
> But as Yi mentioned ealier, in the current vfio code there is no
> DOMAIN_ATTR_NESTING query yet.

That's why something that would have succeeded before will now fail:
Before this series, if user asked for a VFIO_IOMMU_TYPE1_NESTING, it would
have succeeded even if the SMMU didn't support stage-2, as the driver
would have silently fallen back on stage-1 mappings (which work exactly
the same as stage-2-only since there was no nesting supported). After the
series, we do check for DOMAIN_ATTR_NESTING so if user asks for
VFIO_IOMMU_TYPE1_NESTING and the SMMU doesn't support stage-2, the ioctl
fails.

I believe it's a good fix and completely harmless, but wanted to make sure
no one objects because it's an ABI change.

Thanks,
Jean

> In my SMMUV3 nested stage series, I added
> such a query in vfio-pci.c to detect if I need to expose a fault region
> but I already test both the returned value and the output arg. So to me
> there is no issue with that change.
> > 
> > And if userspace queries the nesting properties using the new ABI
> > introduced in this patchset, it will obtain an empty struct. I think
> > that's acceptable, but it may be better to avoid adding the nesting cap if
> > @format is 0?
> agreed
> 
> Thanks
> 
> Eric
> > 
> > Thanks,
> > Jean
> > 
> >>
> >> @Eric, how about your opinion? your dual-stage vSMMU support may
> >> also share the vfio_iommu_type1.c code.
> >>
> >> Regards,
> >> Yi Liu
> >>
> >>> Will
> > 
> 


Re: [PATCH v5 03/15] iommu/smmu: Report empty domain nesting info

2020-07-16 Thread Jean-Philippe Brucker
On Tue, Jul 14, 2020 at 10:12:49AM +, Liu, Yi L wrote:
> > Have you verified that this doesn't break the existing usage of
> > DOMAIN_ATTR_NESTING in drivers/vfio/vfio_iommu_type1.c?
> 
> I didn't have ARM machine on my hand. But I contacted with Jean
> Philippe, he confirmed no compiling issue. I didn't see any code
> getting DOMAIN_ATTR_NESTING attr in current drivers/vfio/vfio_iommu_type1.c.
> What I'm adding is to call iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING)
> and won't fail if the iommu_domain_get_attr() returns 0. This patch
> returns an empty nesting info for DOMAIN_ATTR_NESTING and return
> value is 0 if no error. So I guess it won't fail nesting for ARM.

I confirm that this series doesn't break the current support for
VFIO_IOMMU_TYPE1_NESTING with an SMMUv3. That said...

If the SMMU does not support stage-2 then there is a change in behavior
(untested): after the domain is silently switched to stage-1 by the SMMU
driver, VFIO will now query nesting info and obtain -ENODEV. Instead of
succeeding as before, the VFIO ioctl will now fail. I believe that's a fix
rather than a regression, it should have been like this since the
beginning. No known userspace has been using VFIO_IOMMU_TYPE1_NESTING so
far, so I don't think it should be a concern.

And if userspace queries the nesting properties using the new ABI
introduced in this patchset, it will obtain an empty struct. I think
that's acceptable, but it may be better to avoid adding the nesting cap if
@format is 0?
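
Something like the following in the cap reporting path, using this
series' field name (sketch):

	if (!nesting_info->format)
		return 0;	/* don't expose the nesting cap */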

Thanks,
Jean

> 
> @Eric, how about your opinion? your dual-stage vSMMU support may
> also share the vfio_iommu_type1.c code.
> 
> Regards,
> Yi Liu
> 
> > Will


Re: [PATCH v2 3/4] iommu/vt-d: Report page request faults for guest SVA

2020-07-07 Thread Jean-Philippe Brucker
On Mon, Jul 06, 2020 at 08:25:34AM +0800, Lu Baolu wrote:
> A pasid might be bound to a page table from a VM guest via the iommu
> ops.sva_bind_gpasid. In this case, when a DMA page fault is detected
> on the physical IOMMU, we need to inject the page fault request into
> the guest. After the guest completes handling the page fault, a page
> response need to be sent back via the iommu ops.page_response().
> 
> This adds support to report a page request fault. Any external module
> which is interested in handling this fault should register a notifier
> callback.
> 
> Co-developed-by: Jacob Pan 
> Signed-off-by: Jacob Pan 
> Co-developed-by: Liu Yi L 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> ---
[...]
> +static int
> +intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
> +{
> + struct iommu_fault_event event;
> + u8 bus, devfn;
> +
> + memset(&event, 0, sizeof(struct iommu_fault_event));
> + bus = PCI_BUS_NUM(desc->rid);
> + devfn = desc->rid & 0xff;
> +
> + /* Fill in event data for device specific processing */
> + event.fault.type = IOMMU_FAULT_PAGE_REQ;
> + event.fault.prm.addr = desc->addr;
> + event.fault.prm.pasid = desc->pasid;
> + event.fault.prm.grpid = desc->prg_index;
> + event.fault.prm.perm = prq_to_iommu_prot(desc);
> +
> + /*
> +  * Set the last-page-in-group bit if private data is present:
> +  * a page response is then required, just as it is for LPIG.
> +  */
> + if (desc->lpig)
> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> + if (desc->pasid_present)
> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;

Do you also need to set IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID?  I added
the flag to deal with devices that do not want a PASID value in their PRI
response (bit 15 in the PCIe Page Request Status Register):
https://lore.kernel.org/linux-iommu/20200616144712.748818-1-jean-phili...@linaro.org/
(applied by Joerg for v5.9)

Grepping for pci_prg_resp_pasid_required() in intel/iommu.c it seems to
currently reject devices that do not want a PASID in a PRI response, so I
think you can set this flag unconditionally for now.
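
Concretely, something like this next to the PASID_VALID flag (sketch):

	if (desc->pasid_present) {
		event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
		/*
		 * Unconditional for now: the driver rejects devices that
		 * don't want a PASID in the PRI response.
		 */
		event.fault.prm.flags |= IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID;
	}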

Thanks,
Jean

> + if (desc->priv_data_present) {
> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> + memcpy(event.fault.prm.private_data, desc->priv_data,
> +sizeof(desc->priv_data));
> + }
> +
> + return iommu_report_device_fault(dev, &event);
> +}


Re: [PATCH v2 1/6] iommu/arm-smmu: Add auxiliary domain support for arm-smmuv2

2020-07-07 Thread Jean-Philippe Brucker
Hi Jordan,

On Fri, Jun 26, 2020 at 02:04:09PM -0600, Jordan Crouse wrote:
> Support auxiliary domains for arm-smmu-v2 to initialize and support
> multiple pagetables for a single SMMU context bank. Since the smmu-v2
> hardware doesn't have any built in support for switching the pagetable
> base it is left as an exercise to the caller to actually use the pagetable.
> 
> Aux domains are supported if split pagetable (TTBR1) support has been
> enabled on the master domain.  Each auxiliary domain will reuse the
> configuration of the master domain. By default a domain with TTBR1
> support will have the TTBR0 region disabled so the first attached aux
> domain will enable the TTBR0 region in the hardware and conversely the
> last domain to be detached will disable TTBR0 translations.  All subsequent
> auxiliary domains create a pagetable but do not touch the hardware.
> 
> The leaf driver will be able to query the physical address of the
> pagetable with the DOMAIN_ATTR_PTBASE attribute so that it can use the
> address with whatever means it has to switch the pagetable base.
> 
> Following is a pseudo code example of how a domain can be created
> 
>  /* Check to see if aux domains are supported */
>  if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
>domain = iommu_domain_alloc(...);
> 

The device driver should also call iommu_dev_enable_feature() before using
the AUX feature. I see that you implement them as NOPs and in this case
the GPU is tightly coupled with the SMMU so interoperability between
different IOMMU and device drivers doesn't matter much, but I think it's
still a good idea to follow the same patterns in all drivers to make
future work on the core IOMMU easier.
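
In the pseudo-code style of the commit message, the example above would
become (sketch):

 if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
   if (iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX))
 return FAIL;

   domain = iommu_domain_alloc(...);
   if (iommu_aux_attach_device(domain, dev))
 return FAIL;
 }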

>if (iommu_aux_attach_device(domain, dev))
>return FAIL;
> 
>   /* Save the base address of the pagetable for use by the driver */
>   iommu_domain_get_attr(domain, DOMAIN_ATTR_PTBASE, &ptbase);
>  }
> 
> Then 'domain' can be used like any other iommu domain to map and
> unmap iova addresses in the pagetable.
> 
> Signed-off-by: Jordan Crouse 
> ---
> 
>  drivers/iommu/arm-smmu.c | 219 ---
>  drivers/iommu/arm-smmu.h |   1 +
>  2 files changed, 204 insertions(+), 16 deletions(-)
[...]
> @@ -1653,6 +1836,10 @@ static struct iommu_ops arm_smmu_ops = {
>   .get_resv_regions   = arm_smmu_get_resv_regions,
>   .put_resv_regions   = generic_iommu_put_resv_regions,
>   .def_domain_type= arm_smmu_def_domain_type,
> + .dev_has_feat   = arm_smmu_dev_has_feat,
> + .dev_enable_feat= arm_smmu_dev_enable_feat,
> + .dev_disable_feat   = arm_smmu_dev_disable_feat,
> + .aux_attach_dev = arm_smmu_aux_attach_dev,

To be complete, this also needs the dev_feat_enabled() and aux_detach_dev() ops.

Thanks,
Jean


Re: [PATCH v3 02/14] iommu: Report domain nesting info

2020-06-26 Thread Jean-Philippe Brucker
On Wed, Jun 24, 2020 at 01:55:15AM -0700, Liu Yi L wrote:
> IOMMUs that support nesting translation need to report the capability info
> to userspace, e.g. the format of first level/stage paging structures.
> 
> This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
> nesting info after setting DOMAIN_ATTR_NESTING.
> 
> v2 -> v3:
> *) remove cap/ecap_mask in iommu_nesting_info.
> *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> *) return an empty iommu_nesting_info for SMMU drivers per Jean'
>suggestion.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/arm-smmu-v3.c | 29 --
>  drivers/iommu/arm-smmu.c| 29 --

Looks reasonable to me. Please move the SMMU changes to a separate patch
and Cc the SMMU maintainers:

Cc: Will Deacon 
Cc: Robin Murphy 

Thanks,
Jean

>  include/uapi/linux/iommu.h  | 59 +
>  3 files changed, 113 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f578677..0c45d4d 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -3019,6 +3019,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>   return group;
>  }
>  
> +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
> + void *data)
> +{
> + struct iommu_nesting_info *info = (struct iommu_nesting_info *) data;
> + u32 size;
> +
> + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + return -ENODEV;
> +
> + size = sizeof(struct iommu_nesting_info);
> +
> + /*
> +  * If the provided buffer size doesn't match, return 0 and
> +  * report the expected buffer size to the caller.
> +  */
> + if (info->size != size) {
> + info->size = size;
> + return 0;
> + }
> +
> + /* report an empty iommu_nesting_info for now */
> + memset(info, 0x0, size);
> + info->size = size;
> + return 0;
> +}
> +
>  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   enum iommu_attr attr, void *data)
>  {
> @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   case IOMMU_DOMAIN_UNMANAGED:
>   switch (attr) {
>   case DOMAIN_ATTR_NESTING:
> - *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
> - return 0;
> + return arm_smmu_domain_nesting_info(smmu_domain, data);
>   default:
>   return -ENODEV;
>   }
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 243bc4c..908607d 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -1506,6 +1506,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>   return group;
>  }
>  
> +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
> + void *data)
> +{
> + struct iommu_nesting_info *info = (struct iommu_nesting_info *) data;
> + u32 size;
> +
> + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + return -ENODEV;
> +
> + size = sizeof(struct iommu_nesting_info);
> +
> + /*
> +  * If the provided buffer size doesn't match, return 0 and
> +  * report the expected buffer size to the caller.
> +  */
> + if (info->size != size) {
> + info->size = size;
> + return 0;
> + }
> +
> + /* report an empty iommu_nesting_info for now */
> + memset(info, 0x0, size);
> + info->size = size;
> + return 0;
> +}
> +
>  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   enum iommu_attr attr, void *data)
>  {
> @@ -1515,8 +1541,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   case IOMMU_DOMAIN_UNMANAGED:
>   switch (attr) {
>   case DOMAIN_ATTR_NESTING:
> - *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
> - return 0;
> + return arm_smmu_domain_nesting_info(smmu_domain, data);
>   default:
>

Re: [PATCH v2 02/15] iommu: Report domain nesting info

2020-06-17 Thread Jean-Philippe Brucker
[+ Will and Robin]

Hi Yi,

On Thu, Jun 11, 2020 at 05:15:21AM -0700, Liu Yi L wrote:
> IOMMUs that support nesting translation need to report the capability info
> to userspace, e.g. the format of first level/stage paging structures.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
> @Jean, Eric: nesting was introduced for ARM, but it looks like there is
> no actual user of it, right? So I'm wondering if we can reuse
> DOMAIN_ATTR_NESTING to retrieve nesting info. What are your opinions?

Sure, I think we could rework the getters for DOMAIN_ATTR_NESTING since
they aren't used, but we do need to keep the setters as is.

Before attaching a domain, VFIO sets DOMAIN_ATTR_NESTING if userspace
requested a VFIO_TYPE1_NESTING_IOMMU container. This is necessary for the
SMMU driver to know how to attach later, but at that point we don't know
whether the SMMU does support nesting (since the domain isn't attached to
any endpoint). During attach, the SMMU driver adapts to the SMMU's
capabilities, and may well fallback to one stage if the SMMU doesn't
support nesting.

VFIO should check after attaching that the nesting attribute held, by
calling iommu_domain_get_attr(NESTING). At the moment it does not, and
since your 03/15 patch does that with additional info, I agree with
reusing DOMAIN_ATTR_NESTING instead of adding DOMAIN_ATTR_NESTING_INFO.

However it requires changing the get_attr(NESTING) implementations in both
SMMU drivers as a precursor of this series, to avoid breaking
VFIO_TYPE1_NESTING_IOMMU on Arm. Since we haven't yet defined the
nesting_info structs for SMMUv2 and v3, I suppose we could return an empty
struct iommu_nesting_info for now?

> 
>  include/linux/iommu.h  |  1 +
>  include/uapi/linux/iommu.h | 34 ++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 78a26ae..f6e4b49 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -126,6 +126,7 @@ enum iommu_attr {
>   DOMAIN_ATTR_FSL_PAMUV1,
>   DOMAIN_ATTR_NESTING,/* two stages of translation */
>   DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> + DOMAIN_ATTR_NESTING_INFO,
>   DOMAIN_ATTR_MAX,
>  };
>  
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 303f148..02eac73 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -332,4 +332,38 @@ struct iommu_gpasid_bind_data {
>   };
>  };
>  
> +struct iommu_nesting_info {
> + __u32   size;
> + __u32   format;

What goes into format? And flags? This structure needs some documentation.

Thanks,
Jean

> + __u32   features;
> +#define IOMMU_NESTING_FEAT_SYSWIDE_PASID (1 << 0)
> +#define IOMMU_NESTING_FEAT_BIND_PGTBL(1 << 1)
> +#define IOMMU_NESTING_FEAT_CACHE_INVLD   (1 << 2)
> + __u32   flags;
> + __u8data[];
> +};
> +
> +/*
> + * @flags:   VT-d specific flags. Currently reserved for future
> + *   extension.
> + * @addr_width:  The output addr width of first level/stage translation
> + * @pasid_bits:  Maximum supported PASID bits, 0 represents no PASID
> + *   support.
> + * @cap_reg: Describe basic capabilities as defined in VT-d capability
> + *   register.
> + * @cap_mask:Mark valid capability bits in @cap_reg.
> + * @ecap_reg:Describe the extended capabilities as defined in VT-d
> + *   extended capability register.
> + * @ecap_mask:   Mark the valid capability bits in @ecap_reg.
> + */
> +struct iommu_nesting_info_vtd {
> + __u32   flags;
> + __u16   addr_width;
> + __u16   pasid_bits;
> + __u64   cap_reg;
> + __u64   cap_mask;
> + __u64   ecap_reg;
> + __u64   ecap_mask;
> +};
> +
>  #endif /* _UAPI_IOMMU_H */
> -- 
> 2.7.4
> 


Re: [PATCH] uacce: remove uacce_vma_fault

2020-06-17 Thread Jean-Philippe Brucker
On Mon, Jun 15, 2020 at 09:55:57PM +0800, Zhangfei Gao wrote:
> Fix a NULL pointer error when removing uacce's parent module while the
> app is running. SIGBUS is already reported by do_page_fault, so
> uacce_vma_fault is not needed. If vma_fault is provided, vmf->page has
> to be filled in as well, as required by __do_fault.
> 
> Reported-by: Jean-Philippe Brucker 
> Signed-off-by: Zhangfei Gao 

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/misc/uacce/uacce.c | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index 107028e..aa91f69 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -179,14 +179,6 @@ static int uacce_fops_release(struct inode *inode, 
> struct file *filep)
>   return 0;
>  }
>  
> -static vm_fault_t uacce_vma_fault(struct vm_fault *vmf)
> -{
> - if (vmf->flags & (FAULT_FLAG_MKWRITE | FAULT_FLAG_WRITE))
> - return VM_FAULT_SIGBUS;
> -
> - return 0;
> -}
> -
>  static void uacce_vma_close(struct vm_area_struct *vma)
>  {
>   struct uacce_queue *q = vma->vm_private_data;
> @@ -199,7 +191,6 @@ static void uacce_vma_close(struct vm_area_struct *vma)
>  }
>  
>  static const struct vm_operations_struct uacce_vm_ops = {
> - .fault = uacce_vma_fault,
>   .close = uacce_vma_close,
>  };
>  
> -- 
> 2.7.4
> 


Re: [PATCH v2 08/12] mm: Define pasid in mm

2020-06-16 Thread Jean-Philippe Brucker
On Fri, Jun 12, 2020 at 05:41:29PM -0700, Fenghua Yu wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 64ede5f150dc..5778db3aa42d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -538,6 +538,10 @@ struct mm_struct {
>   atomic_long_t hugetlb_usage;
>  #endif
>   struct work_struct async_put_work;
> +
> +#ifdef CONFIG_PCI_PASID

Non-PCI devices can also use a PASID (e.g. Arm's SubstreamID). How about
CONFIG_IOMMU_SUPPORT?
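
i.e. (sketch of the suggested change to the quoted hunk):

#ifdef CONFIG_IOMMU_SUPPORT
	unsigned int pasid;
#endif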

Thanks,
Jean

> + unsigned int pasid;
> +#endif


Re: [RFC PATCH] PCI: Remove End-End TLP as PASID dependency

2020-06-10 Thread Jean-Philippe Brucker
On Wed, Jun 10, 2020 at 12:18:14PM +0800, Zhangfei Gao wrote:
> Some platform devices appear as PCI and have PCI cfg space,
> but are actually on the AMBA bus.
> They can support PASID via the SMMU stall feature, but do not
> support TLP prefixes since they are not real PCI devices.
> So remove TLP prefixes as a PASID dependency.
> 
> Signed-off-by: Zhangfei Gao 
> ---
>  drivers/pci/ats.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index 390e92f..8e31278 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -344,9 +344,6 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
>   if (WARN_ON(pdev->pasid_enabled))
>   return -EBUSY;
>  
> - if (!pdev->eetlp_prefix_path)
> - return -EINVAL;
> -

This check is useful, and follows the PCI specification (4.0r1.0
2.2.10.2 End-End TLP Prefix Processing: "Software should ensure that TLPs
containing End-End TLP Prefixes are not sent to components that do not
support them.")

Why not set the eetlp_prefix_path bit from a PCI quirk?  Unlike the stall
problem from the other thread, this one looks like a simple design mistake
that can be fixed easily in future iterations of the platform: just set
the "End-End TLP Prefix Supported" bit in the Device Capability 2 Register
of all bridges.
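
A sketch of such a quirk (vendor/device IDs are made-up placeholders;
the real quirk would match the affected platform devices; needs
<linux/pci.h>):

static void quirk_eetlp_prefix(struct pci_dev *pdev)
{
	/*
	 * The internal links to this device do carry End-End TLP
	 * Prefixes, even though the bridges don't advertise it.
	 */
	pdev->eetlp_prefix_path = 1;
}
DECLARE_PCI_FIXUP_FINAL(0x1234, 0x5678, quirk_eetlp_prefix);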

Thanks,
Jean


[PATCH] tracing/probe: Fix bpf_task_fd_query() for kprobes and uprobes

2020-06-08 Thread Jean-Philippe Brucker
Commit 60d53e2c3b75 ("tracing/probe: Split trace_event related data from
trace_probe") removed the trace_[ku]probe structure from the
trace_event_call->data pointer. As bpf_get_[ku]probe_info() were
forgotten in that change, fix them now. These functions are currently
only used by the bpf_task_fd_query() syscall handler to collect
information about a perf event.

Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from 
trace_probe")
Signed-off-by: Jean-Philippe Brucker 
---
Found while trying to run the task_fd_query BPF sample. I intend to try
and move that sample to kselftests since it seems like a useful
regression test.
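
For reference, a minimal user-space sketch of the affected call (error
handling elided; needs <linux/bpf.h>, <string.h>, <sys/syscall.h> and
<unistd.h>; perf_fd is assumed to be a kprobe or uprobe perf event fd):

	char buf[256];
	union bpf_attr attr;
	int err;

	memset(&attr, 0, sizeof(attr));
	attr.task_fd_query.pid = pid;
	attr.task_fd_query.fd = perf_fd;
	attr.task_fd_query.buf = (__u64)(unsigned long)buf;
	attr.task_fd_query.buf_len = sizeof(buf);

	/* Before this fix, the kernel read a stale tp_event->data pointer
	 * instead of the trace probe, and the query failed */
	err = syscall(__NR_bpf, BPF_TASK_FD_QUERY, &attr, sizeof(attr));
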
---
 kernel/trace/trace_kprobe.c | 2 +-
 kernel/trace/trace_uprobe.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 35989383ae113..8eeb95e04bf52 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1629,7 +1629,7 @@ int bpf_get_kprobe_info(const struct perf_event *event, 
u32 *fd_type,
if (perf_type_tracepoint)
tk = find_trace_kprobe(pevent, group);
else
-   tk = event->tp_event->data;
+   tk = trace_kprobe_primary_from_call(event->tp_event);
if (!tk)
return -EINVAL;
 
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2a8e8e9c1c754..fdd47f99b18fd 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1412,7 +1412,7 @@ int bpf_get_uprobe_info(const struct perf_event *event, 
u32 *fd_type,
if (perf_type_tracepoint)
tu = find_probe_event(pevent, group);
else
-   tu = event->tp_event->data;
+   tu = trace_uprobe_primary_from_call(event->tp_event);
if (!tu)
return -EINVAL;
 
-- 
2.27.0



Re: [RFC PATCH] iommu/arm-smmu: Add module parameter to set msi iova address

2020-05-28 Thread Jean-Philippe Brucker
[+ Shameer]

On Thu, May 28, 2020 at 09:43:46AM +0200, Auger Eric wrote:
> Hi,
> 
> On 5/28/20 9:23 AM, Jean-Philippe Brucker wrote:
> > On Thu, May 28, 2020 at 10:45:14AM +0530, Srinath Mannam wrote:
> >> On Wed, May 27, 2020 at 11:00 PM Robin Murphy  wrote:
> >>>
> >> Thanks Robin for your quick response.
> >>> On 2020-05-27 17:03, Srinath Mannam wrote:
> >>>> This patch gives the provision to change the default value of the MSI
> >>>> IOVA base to a platform-suitable IOVA using a module parameter. The
> >>>> presently hardcoded MSI IOVA base may not be within the accessible
> >>>> IOVA ranges of the platform.
> >>>
> >>> That in itself doesn't seem entirely unreasonable; IIRC the current
> >>> address is just an arbitrary choice to fit nicely into Qemu's memory
> >>> map, and there was always the possibility that it wouldn't suit 
> >>> everything.
> >>>
> >>>> Since commit aadad097cd46 ("iommu/dma: Reserve IOVA for PCIe inaccessible
> >>>> DMA address"), inaccessible IOVA address ranges parsed from dma-ranges
> >>>> property are reserved.
> > 
> > I don't understand why we only reserve the PCIe windows for DMA domains.
> > Shouldn't VFIO also prevent userspace from mapping them?
> 
> VFIO prevents userspace from DMA mapping iovas within reserved regions:
> 9b77e5c79840  vfio/type1: check dma map request is within a valid iova range

Right but I was asking specifically about the IOVA reservation introduced
by commit aadad097cd46. They are not registered as reserved regions within
the IOMMU core, they are only taken into account by dma-iommu.c when
creating a DMA domain. As VFIO uses UNMANAGED domains, it isn't aware of
those regions and they won't be seen by vfio_iommu_resv_exclude().

It looks like the PCIe regions used to be common until cd2c9fcf5c66
("iommu/dma: Move PCI window region reservation back into dma specific
path.") But I couldn't find the justification for this commit.

The thing is, if VFIO isn't aware of the reserved PCIe windows, then
allowing VFIO or userspace to choose MSI_IOVA_BASE won't solve the problem
reported by Srinath, because they could well choose an IOVA within the
PCIe window...

Thanks,
Jean

> but it does not prevent the SW MSI region chosen by the kernel from
> colliding with other reserved regions (esp. PCIe host bridge windows).
> 
>   If they were
> > part of the common reserved regions then we could have VFIO choose a
> > SW_MSI region among the remaining free space.
> As Robin said this was the initial chosen approach
> [PATCH 10/10] vfio: allow the user to register reserved iova range for
> MSI mapping
> https://patchwork.kernel.org/patch/8121641/
> 
> Some additional background about why the static SW MSI region chosen by
> the kernel was later chosen:
> Summary of LPC guest MSI discussion in Santa Fe (was: Re: [RFC 0/8] KVM
> PCIe/MSI passthrough on ARM/ARM64 (Alt II))
> https://lists.linuxfoundation.org/pipermail/iommu/2016-November/019060.html
> 
> Thanks
> 
> Eric
> 
> 
>  It would just need a
> > different way of asking the IOMMU driver if a SW_MSI is needed, for
> > example with a domain attribute.
> > 
> > Thanks,
> > Jean
> > 
> >>>
> >>> That, however, doesn't seem to fit here; iommu-dma maps MSI doorbells
> >>> dynamically, so they aren't affected by reserved regions any more than
> >>> regular DMA pages are. In fact, it explicitly ignores the software MSI
> >>> region, since as the comment says, it *is* the software that manages 
> >>> those.
> >> Yes you are right, we don't see any issues with kernel drivers(PCI EP) 
> >> because
> >> MSI IOVA allocated dynamically by honouring reserved regions same as DMA 
> >> pages.
> >>>
> >>> The MSI_IOVA_BASE region exists for VFIO, precisely because in that case
> >>> the kernel *doesn't* control the address space, but still needs some way
> >>> to steal a bit of it for MSIs that the guest doesn't necessarily know
> >>> about, and give userspace a fighting chance of knowing what it's taken.
> >>> I think at the time we discussed the idea of adding something to the
> >>> VFIO uapi such that userspace could move this around if it wanted or
> >>> needed to, but decided we could live without that initially. Perhaps now
> >>> the time has come?
> >> Yes, we see issues only with user-space drivers(DPDK) in which 
> >> MSI_IOVA_BASE
> >> region is considered to map MSI registers. This patch helps us to

Re: [RFC PATCH] iommu/arm-smmu: Add module parameter to set msi iova address

2020-05-28 Thread Jean-Philippe Brucker
On Thu, May 28, 2020 at 10:45:14AM +0530, Srinath Mannam wrote:
> On Wed, May 27, 2020 at 11:00 PM Robin Murphy  wrote:
> >
> Thanks Robin for your quick response.
> > On 2020-05-27 17:03, Srinath Mannam wrote:
> > > This patch gives the provision to change the default value of the MSI IOVA
> > > base to a platform-suitable IOVA using a module parameter. The presently
> > > hardcoded MSI IOVA base may not be within the accessible IOVA ranges of the
> > > platform.
> >
> > That in itself doesn't seem entirely unreasonable; IIRC the current
> > address is just an arbitrary choice to fit nicely into Qemu's memory
> > map, and there was always the possibility that it wouldn't suit everything.
> >
> > > Since commit aadad097cd46 ("iommu/dma: Reserve IOVA for PCIe inaccessible
> > > DMA address"), inaccessible IOVA address ranges parsed from dma-ranges
> > > property are reserved.

I don't understand why we only reserve the PCIe windows for DMA domains.
Shouldn't VFIO also prevent userspace from mapping them?  If they were
part of the common reserved regions then we could have VFIO choose a
SW_MSI region among the remaining free space. It would just need a
different way of asking the IOMMU driver if a SW_MSI is needed, for
example with a domain attribute.
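
For example (sketch; DOMAIN_ATTR_SW_MSI and the VFIO helper below are
hypothetical, nothing like this exists today):

	bool needs_sw_msi = false;
	dma_addr_t base;

	iommu_domain_get_attr(domain, DOMAIN_ATTR_SW_MSI, &needs_sw_msi);
	if (needs_sw_msi)
		/* Pick a window that avoids all reserved regions */
		base = vfio_find_free_iova(iommu, MSI_IOVA_LENGTH);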

Thanks,
Jean

> >
> > That, however, doesn't seem to fit here; iommu-dma maps MSI doorbells
> > dynamically, so they aren't affected by reserved regions any more than
> > regular DMA pages are. In fact, it explicitly ignores the software MSI
> > region, since as the comment says, it *is* the software that manages those.
> Yes you are right, we don't see any issues with kernel drivers(PCI EP) because
> MSI IOVA allocated dynamically by honouring reserved regions same as DMA 
> pages.
> >
> > The MSI_IOVA_BASE region exists for VFIO, precisely because in that case
> > the kernel *doesn't* control the address space, but still needs some way
> > to steal a bit of it for MSIs that the guest doesn't necessarily know
> > about, and give userspace a fighting chance of knowing what it's taken.
> > I think at the time we discussed the idea of adding something to the
> > VFIO uapi such that userspace could move this around if it wanted or
> > needed to, but decided we could live without that initially. Perhaps now
> > the time has come?
> Yes, we see issues only with user-space drivers(DPDK) in which MSI_IOVA_BASE
> region is considered to map MSI registers. This patch helps us to fix the 
> issue.
> 
> Thanks,
> Srinath.
> >
> > Robin.
> >
> > > If any platform has a limitation accessing the default MSI IOVA, then it can
> > > be changed using "arm-smmu.msi_iova_base=0xa000" command line 
> > > argument.
> > >
> > > Signed-off-by: Srinath Mannam 
> > > ---
> > >   drivers/iommu/arm-smmu.c | 5 -
> > >   1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > > index 4f1a350..5e59c9d 100644
> > > --- a/drivers/iommu/arm-smmu.c
> > > +++ b/drivers/iommu/arm-smmu.c
> > > @@ -72,6 +72,9 @@ static bool disable_bypass =
> > >   module_param(disable_bypass, bool, S_IRUGO);
> > >   MODULE_PARM_DESC(disable_bypass,
> > >   "Disable bypass streams such that incoming transactions from 
> > > devices that are not attached to an iommu domain will report an abort 
> > > back to the device and will not be allowed to pass through the SMMU.");
> > > +static unsigned long msi_iova_base = MSI_IOVA_BASE;
> > > +module_param(msi_iova_base, ulong, S_IRUGO);
> > > +MODULE_PARM_DESC(msi_iova_base, "msi iova base address.");
> > >
> > >   struct arm_smmu_s2cr {
> > >   struct iommu_group  *group;
> > > @@ -1566,7 +1569,7 @@ static void arm_smmu_get_resv_regions(struct device 
> > > *dev,
> > >   struct iommu_resv_region *region;
> > >   int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
> > >
> > > - region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
> > > + region = iommu_alloc_resv_region(msi_iova_base, MSI_IOVA_LENGTH,
> > >prot, IOMMU_RESV_SW_MSI);
> > >   if (!region)
> > >   return;
> > >


Re: [PATCH v6] iommu/virtio: Use page size bitmap supported by endpoint

2020-05-14 Thread Jean-Philippe Brucker
On Thu, May 14, 2020 at 05:31:00AM -0400, Michael S. Tsirkin wrote:
> On Thu, May 14, 2020 at 01:22:37PM +0530, Bharat Bhushan wrote:
> > Different endpoints can support different page sizes: probe the
> > endpoint for its specific page sizes, otherwise use the
> > global page sizes.
> > 
> > A device attached to a domain should support at least the
> > domain's supported page sizes. If the device supports more
> > page sizes than the domain, the device is limited
> > to using the domain's supported page sizes only.
> 
> OK so I am just trying to figure it out.
> Before the patch, we always use the domain supported page sizes
> right?
> 
> With the patch, we still do, but we also probe and
> validate that device supports all domain page sizes,
> if it does not then we fail to attach the device.

Generally there is one endpoint per domain. Linux creates the domains and
decides which endpoint goes in which domain. It puts multiple endpoints in
a domain in two cases:

* If endpoints cannot be isolated from each other by the IOMMU, for
  example if ACS isolation isn't enabled in PCIe. In that case endpoints
  are in the same IOMMU group, and therefore contained in the same domain.
  This is more of a quirk for broken hardware, and isn't much of a concern
  for virtualization because it's easy for the hypervisor to present
  endpoints isolated from each other.

* If userspace wants to put endpoints in the same VFIO container, then
  VFIO first attempts to put them in the same IOMMU domain, and falls back
  to multiple domains if that fails. That case is just a luxury and we
  shouldn't over-complicate the driver to cater for this.

So the attach checks don't need to be that complicated. Checking that the
page masks are exactly the same should be enough.
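
Something like (sketch, using the names from this patch):

	/* At attach time: endpoint and domain masks must match exactly */
	if (vdev->pgsize_bitmap != vdomain->domain.pgsize_bitmap)
		return -EINVAL;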

> This seems like a lot of effort for little benefit, can't
> the hypervisor simply make sure the endpoints support the
> iommu page sizes for us?

I tend to agree, it's not very likely that we'll have a configuration with
different page sizes between physical and virtual endpoints.

If there is a way for QEMU to simply reject VFIO devices that don't use
the same page mask as what's configured globally, let's do that instead of
introducing the page_size_mask property.

> > @@ -615,7 +636,7 @@ static int viommu_domain_finalise(struct 
> > viommu_endpoint *vdev,
> > struct viommu_dev *viommu = vdev->viommu;
> > struct viommu_domain *vdomain = to_viommu_domain(domain);
> >  
> > -   viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap);
> > +   viommu_page_size = 1UL << __ffs(vdev->pgsize_bitmap);
> > if (viommu_page_size > PAGE_SIZE) {
> > dev_err(vdev->dev,
> > "granule 0x%lx larger than system page size 0x%lx\n",
> 
> 
> Looks like this is messed up on 32 bit: e.g. 0x1 will try to do
> 1UL << -1, which is undefined behaviour. Which is btw already messed up
> wrt viommu->pgsize_bitmap, but that's not a reason to propagate
> the error.

Realistically we're not going to have a page granule larger than 4G; it's
going to be 4k or 64k. But we can add a check that truncates the
page_size_mask to 32-bit and makes sure that it's non-null.
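
The check could look like (sketch):

	/* Only granules representable in an unsigned long on 32-bit */
	u32 pgsizes = vdev->pgsize_bitmap & GENMASK(31, 0);

	if (!pgsizes)
		return -EINVAL;

	viommu_page_size = 1UL << __ffs(pgsizes);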

> > +struct virtio_iommu_probe_pgsize_mask {
> > +   struct virtio_iommu_probe_property  head;
> > +   __u8reserved[4];
> > +   /* Same format as virtio_iommu_config::page_size_mask */
> 
> It's actually slightly different in that
> this must be a superset of domain page size mask, right?

No, it overrides the global mask.

> > +   __le64  pgsize_bitmap;

Bharat, please rename this to page_size_mask for consistency

Thanks,
Jean

> > +};
> > +
> >  #define VIRTIO_IOMMU_RESV_MEM_T_RESERVED   0
> >  #define VIRTIO_IOMMU_RESV_MEM_T_MSI1
> >  
> > -- 
> > 2.17.1
> 


Re: [EXT] Re: [PATCH v5] iommu/virtio: Use page size bitmap supported by endpoint

2020-05-13 Thread Jean-Philippe Brucker
On Wed, May 13, 2020 at 09:15:22AM +, Bharat Bhushan wrote:
> Hi Jean,
> 
> > -Original Message-
> > From: Michael S. Tsirkin 
> > Sent: Wednesday, May 6, 2020 5:53 AM
> > To: Bharat Bhushan 
> > Cc: jean-phili...@linaro.org; j...@8bytes.org; jasow...@redhat.com;
> > virtualizat...@lists.linux-foundation.org; io...@lists.linux-foundation.org;
> > linux-kernel@vger.kernel.org; eric.auger@gmail.com; 
> > eric.au...@redhat.com
> > Subject: [EXT] Re: [PATCH v5] iommu/virtio: Use page size bitmap supported 
> > by
> > endpoint
[...]
> > > +struct virtio_iommu_probe_pgsize_mask {
> > > + struct virtio_iommu_probe_property  head;
> > > + __u8reserved[4];
> > > + __le64  pgsize_bitmap;
> > > +};
> > > +
> > 
> > This is UAPI. Document the format of pgsize_bitmap please.
> 
> I do not see uapi documentation in the "Documentation" folder for other
> data structures in this file; am I missing something?

I'm not the one requesting this change, but I'm guessing you should add a
comment in this header, above pgsize_bitmap (which should actually be
called page_size_mask to follow the spec). I guess I'd add:

/* Same format as virtio_iommu_config::page_size_mask */

for this field, and then maybe change the comment of virtio_iommu_config,
currently "/* Supported page sizes */", to something more precise such as:

/*
 * Bitmap of supported page sizes. The least significant bit indicates the
 * smallest granularity and the other bits are hints indicating optimal
 * block sizes.
 */

But I don't know how much should be spelled out here, since those details
are available in the spec.

Thanks,
Jean


Re: [PATCH v5] iommu/virtio: Use page size bitmap supported by endpoint

2020-05-12 Thread Jean-Philippe Brucker
On Tue, May 12, 2020 at 10:53:39AM -0400, Michael S. Tsirkin wrote:
> >  static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
> >struct virtio_iommu_probe_resv_mem *mem,
> >size_t len)
> > @@ -499,6 +513,9 @@ static int viommu_probe_endpoint(struct viommu_dev 
> > *viommu, struct device *dev)
> > case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
> > ret = viommu_add_resv_mem(vdev, (void *)prop, len);
> > break;
> > +   case VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK:
> > +   ret = viommu_set_pgsize_bitmap(vdev, (void *)prop, len);
> > +   break;
> > default:
> > dev_err(dev, "unknown viommu prop 0x%x\n", type);
> > }
> 
> So given this is necessary early in boot, how about we
> add this in the config space? And maybe ACPI too ...

A page_size_mask field is already in the config space and applies to all
endpoints. Here we add a property to override the global value for
individual endpoints. It can be necessary when mixing physical
(pass-through) and virtual endpoints under the same virtio-iommu device.
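
For reference, the property handler boils down to something like this
(sketch reconstructed from the quoted hunks, not the literal patch):

static int viommu_set_pgsize_bitmap(struct viommu_endpoint *vdev,
				    struct virtio_iommu_probe_pgsize_mask *mask,
				    size_t len)
{
	if (len < sizeof(*mask))
		return -EINVAL;

	/* Override the global config-space mask for this endpoint */
	vdev->pgsize_bitmap = le64_to_cpu(mask->pgsize_bitmap);
	return 0;
}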

Thanks,
Jean


Re: [PATCH] iommu/virtio: reverse arguments to list_add

2020-05-06 Thread Jean-Philippe Brucker
On Tue, May 05, 2020 at 08:47:47PM +0200, Julia Lawall wrote:
> Elsewhere in the file, there is a list_for_each_entry with
> &vdev->resv_regions as the second argument, suggesting that
> &vdev->resv_regions is the list head.  So exchange the
> arguments on the list_add call to put the list head in the
> second argument.
> 
> Fixes: 2a5a31487445 ("iommu/virtio: Add probe request")
> Signed-off-by: Julia Lawall 

Thanks for the fix. The reason this hasn't blown up so far is that
iommu_alloc_resv_region() initializes region->list, but adding more than
one item would break the list.
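
To spell out the failure mode (illustration, not from the patch):
iommu_alloc_resv_region() does INIT_LIST_HEAD(&region->list), so the
reversed call

	list_add(&vdev->resv_regions, &region->list);

treats the head itself as the new entry and splices it into the
region's freshly initialized list. With one region that happens to
produce a valid two-node list, but a second list_add() re-links the
head into the new region's list without unlinking it from the first,
so traversal from vdev->resv_regions only ever finds the last region.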

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/virtio-iommu.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> index d5cac4f46ca5..4e1d11af23c8 100644
> --- a/drivers/iommu/virtio-iommu.c
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -453,7 +453,7 @@ static int viommu_add_resv_mem(struct viommu_endpoint 
> *vdev,
>   if (!region)
>   return -ENOMEM;
>  
> - list_add(&vdev->resv_regions, &region->list);
> + list_add(&region->list, &vdev->resv_regions);
>   return 0;
>  }
>  
> 


Re: [PATCH v3 3/4] iommu/ioasid: Add custom allocators

2019-10-02 Thread Jean-Philippe Brucker
Hi Jacob,

There seems to be a mix-up here: the changes from your v2 are lost and
patches 1 and 3 are back to v1. Assuming this isn't intended, I'll
review v2 of this patch since it looked good to me overall.

Thanks,
Jean



Re: [PATCH 2/4] iommu: Add I/O ASID allocator

2019-09-20 Thread Jean-Philippe Brucker
On Wed, Sep 18, 2019 at 04:26:32PM -0700, Jacob Pan wrote:
> From: Jean-Philippe Brucker 
> 
> Some devices might support multiple DMA address spaces, in particular
> those that have the PCI PASID feature. PASID (Process Address Space ID)
> allows sharing process address spaces with devices (SVA), partitioning a
> device into VM-assignable entities (VFIO mdev), or simply providing
> multiple DMA address spaces to kernel drivers. Add a global PASID
> allocator usable by different drivers at the same time. Name it I/O ASID
> to avoid confusion with ASIDs allocated by arch code, which are usually
> a separate ID space.
> 
> The IOASID space is global. Each device can have its own PASID space,
> but by convention the IOMMU ended up having a global PASID space, so
> that with SVA, each mm_struct is associated to a single PASID.
> 
> The allocator is primarily used by the IOMMU subsystem, but on rare occasions
> drivers would like to allocate PASIDs for devices that aren't managed by
> an IOMMU, using the same ID space as IOMMU.
> 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Jacob Pan 

Reviewed-by: Jean-Philippe Brucker 


Re: [PATCH 8/8] iommu/arm-smmu-v3: Add support for PCI PASID

2019-09-19 Thread Jean-Philippe Brucker
On Mon, Jul 08, 2019 at 09:58:16AM +0200, Auger Eric wrote:
> > +   ret = pci_enable_pasid(pdev, features);
> > +   if (!ret)
> > +   master->ssid_bits = min_t(u8, ilog2(num_pasids),
> > + master->smmu->ssid_bits);
> I don't really get why this setting is conditional on the success of
> pci_enable_pasid() and not on num_pasids > 0.

num_pasids only contains the value of the PCIe PASID capability. If
pci_enable_pasid() fails then we want to leave master->ssid_bits at 0 so
that we report to users that SVA and AUXD aren't supported.

> If it fails the ssid_bits is set to min(smmu->ssid_bits,
> fwspec->num_pasid_bits) anyway.
>
> > +   return ret;
> > +}
> > +
> > +static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
> > +{
> > +   struct pci_dev *pdev;
> > +
> > +   if (!dev_is_pci(master->dev))
> > +   return;
> > +
> > +   pdev = to_pci_dev(master->dev);
> > +
> > +   if (!pdev->pasid_enabled)
> > +   return;
> > +
> > +   pci_disable_pasid(pdev);
> > +   master->ssid_bits = 0;
> in case of a platform device you leave the ssid_bits to a value != 0. Is
> that what you want?

Yes, this is only for PCI devices; there is no standard way of disabling
PASID in platform devices. We just take whatever the firmware gives us.

> > +}
> > +
> >  static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> >  {
> > unsigned long flags;
> > @@ -2413,6 +2456,9 @@ static int arm_smmu_add_device(struct device *dev)
> >  
> > master->ssid_bits = min(smmu->ssid_bits, fwspec->num_pasid_bits);
> >  
> > +   /* Note that PASID must be enabled before, and disabled after ATS */
> > +   arm_smmu_enable_pasid(master);
> In case the call fails, don't you want to handle the error and reset the
> ssid_bits?

This function fails if the device doesn't support PASID, and we leave
ssid_bits at 0. That said, I think it would be nicer to move the above
line (the one that deals with fwspec) into arm_smmu_enable_pasid().
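
Something like this, perhaps (untested sketch based on the hunks quoted
above; the PCI helpers are from the PCI core):

static void arm_smmu_enable_pasid(struct arm_smmu_master *master)
{
	int features, num_pasids;
	struct pci_dev *pdev;
	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);

	/* Default limit: what firmware and the SMMU advertise */
	master->ssid_bits = min(master->smmu->ssid_bits,
				fwspec->num_pasid_bits);

	if (!dev_is_pci(master->dev))
		return;

	pdev = to_pci_dev(master->dev);

	features = pci_pasid_features(pdev);
	if (features < 0)
		return;

	num_pasids = pci_max_pasids(pdev);
	if (num_pasids <= 0)
		return;

	/* Narrow further if the PCIe PASID capability is smaller */
	if (!pci_enable_pasid(pdev, features))
		master->ssid_bits = min_t(u8, ilog2(num_pasids),
					  master->ssid_bits);
}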

Thanks,
Jean

