Re: [PATCH v14 07/13] iommu/smmuv3: Implement cache_invalidate

2021-03-31 Thread Zenghui Yu

Hi Eric,

On 2021/2/24 4:56, Eric Auger wrote:

+static int
+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+			  struct iommu_cache_invalidate_info *inv_info)
+{
+   struct arm_smmu_cmdq_ent cmd = {.opcode = CMDQ_OP_TLBI_NSNH_ALL};
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -EINVAL;
+
+   if (!smmu)
+   return -EINVAL;
+
+   if (inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+   return -EINVAL;
+
+   if (inv_info->cache & IOMMU_CACHE_INV_TYPE_PASID ||


I didn't find any code where we would emulate the CFGI_CD{_ALL} commands
for the guest and invalidate the stale CD entries on the physical side. Is
the PASID-cache invalidation type designed for that purpose?


+   inv_info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB) {
+   return -ENOENT;
+   }
+
+   if (!(inv_info->cache & IOMMU_CACHE_INV_TYPE_IOTLB))
+   return -EINVAL;
+
+   /* IOTLB invalidation */
+
+   switch (inv_info->granularity) {
+   case IOMMU_INV_GRANU_PASID:
+   {
+   struct iommu_inv_pasid_info *info =
+   &inv_info->granu.pasid_info;
+
+   if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID)
+   return -ENOENT;
+   if (!(info->flags & IOMMU_INV_PASID_FLAGS_ARCHID))
+   return -EINVAL;
+
+   __arm_smmu_tlb_inv_context(smmu_domain, info->archid);
+   return 0;
+   }
+   case IOMMU_INV_GRANU_ADDR:
+   {
+   struct iommu_inv_addr_info *info = &inv_info->granu.addr_info;
+   size_t size = info->nb_granules * info->granule_size;
+   bool leaf = info->flags & IOMMU_INV_ADDR_FLAGS_LEAF;
+
+   if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID)
+   return -ENOENT;
+
+   if (!(info->flags & IOMMU_INV_ADDR_FLAGS_ARCHID))
+   break;
+
+   arm_smmu_tlb_inv_range_domain(info->addr, size,
+                                 info->granule_size, leaf,
+                                 info->archid, smmu_domain);
+
+   arm_smmu_cmdq_issue_sync(smmu);


There is no need to issue one more SYNC.


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Wednesday, March 31, 2021 8:41 PM
> 
> On Wed, Mar 31, 2021 at 07:38:36AM +, Liu, Yi L wrote:
> 
> > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > the VM should be able to be shared by all assigned devices for the VM.
> > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > be per-device.
> 
> It is not *per-device*, it is *per-ioasid*.
>
> And as /dev/ioasid is an interface for controlling multiple ioasids,
> there is no issue with also multiplexing the page table manipulation for
> multiple ioasids as well.
> 
> What you should do next is sketch out in some RFC the exact ioctls
> each FD would have and show how the parts I outlined would work and
> point out any remaining gaps.
> 
> The device FD is something like the vfio_device FD from VFIO, it has
> *nothing* to do with PASID beyond having a single ioctl to authorize
> the device to use the PASID. All control of the PASID is in
> /dev/ioasid.

Good to see this reply. Your idea is much clearer to me now. If I'm getting
you correctly, I think the skeleton is something like below:

1) userspace opens /dev/ioasid; an ioasid is allocated along with a
   per-ioasid context that can be used for page table binding and cache
   invalidation, and an ioasid FD is returned to userspace.
2) userspace passes the ioasid FD to VFIO, letting it be associated with a
   device FD (like the vfio_device FD).
3) userspace binds a page table on the ioasid FD with the page table info.
4) userspace unbinds the page table on the ioasid FD.
5) userspace de-associates the ioasid FD and the device FD.

Does the above suit your outline?
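To make the sequence concrete, here is a rough userspace sketch of steps
1)-5). Every ioctl name and argument below is a hypothetical placeholder
(none of this uAPI exists yet), invented only to illustrate the flow:

/* Hypothetical sketch of steps 1)-5); no such ioctls exist today. */
#include <fcntl.h>
#include <sys/ioctl.h>

#define IOASID_BIND_PGTBL            _IO('i', 1)  /* hypothetical */
#define IOASID_UNBIND_PGTBL          _IO('i', 2)  /* hypothetical */
#define VFIO_DEVICE_ATTACH_IOASID_FD _IO(';', 20) /* hypothetical */
#define VFIO_DEVICE_DETACH_IOASID_FD _IO(';', 21) /* hypothetical */

int main(void)
{
	/* 1) open allocates an ioasid + per-ioasid context, FD returned */
	int ioasid_fd = open("/dev/ioasid", O_RDWR);
	/* a vfio_device-style device FD (path is illustrative) */
	int device_fd = open("/dev/vfio/devices/vfio0", O_RDWR);

	/* 2) associate the ioasid FD with the device FD */
	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID_FD, &ioasid_fd);
	/* 3) bind the page table on the ioasid FD */
	ioctl(ioasid_fd, IOASID_BIND_PGTBL, /* page table info */ 0);
	/* ... guest runs; cache invalidation also goes via ioasid_fd ... */
	/* 4) unbind the page table on the ioasid FD */
	ioctl(ioasid_fd, IOASID_UNBIND_PGTBL, 0);
	/* 5) de-associate the ioasid FD and the device FD */
	ioctl(device_fd, VFIO_DEVICE_DETACH_IOASID_FD, &ioasid_fd);
	return 0;
}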

If yes, I still have the below concerns and would like to hear your opinion.
- the ioasid FD and device association will happen at runtime instead of
  just in the setup phase.
- how about AMD's and ARM's vSVA support? Their PASID allocation and page
  table management happen within the guest. They only need to bind the
  guest PASID table to the host. The above model seems unable to fit them.
  (Jean, Eric, Jacob please feel free to correct me)
- these per-ioasid SVA operations are not aligned with the native SVA usage
  model. Native SVA bind is per-device.

Regards,
Yi Liu


Re: [RFC PATCH 0/5] Optimization for unmapping iommu mapped buffers

2021-03-31 Thread chenxiang (M)

Hi Isaac,


On 2021/3/31 11:00, Isaac J. Manjarres wrote:

When unmapping a buffer from an IOMMU domain, the IOMMU framework unmaps
the buffer at a granule of the largest page size that is supported by
the IOMMU hardware and fits within the buffer. For every block that
is unmapped, the IOMMU framework will call into the IOMMU driver, and
then the io-pgtable framework to walk the page tables to find the entry
that corresponds to the IOVA, and then unmaps the entry.

This can be suboptimal in scenarios where a buffer or a piece of a
buffer can be split into several contiguous page blocks of the same size.
For example, consider an IOMMU that supports 4 KB page blocks, 2 MB page
blocks, and 1 GB page blocks, and a buffer that is 4 MB in size is being
unmapped at IOVA 0. The current call-flow will result in 4 indirect calls,
and 2 page table walks, to unmap 2 entries that are next to each other in
the page-tables, when both entries could have been unmapped in one shot
by clearing both page table entries in the same call.

These patches add an unmap_pages callback to the io-pgtable code and IOMMU
drivers, which unmaps an IOVA range consisting of a number of pages of the
same page size that is supported by the IOMMU hardware, and allows multiple
entries to be cleared in the same set of indirect calls. The reason for
introducing unmap_pages as a new callback, rather than changing the
existing unmap callback, is to give other IOMMU drivers/io-pgtable formats
time to move to it, so that the transition to this approach can be done
piecemeal.
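As a rough illustration of the batching this enables, a minimal sketch of
the kind of core loop unmap_pages makes possible (illustrative only, not
the series' actual implementation; pick_pgsize() stands in for the core's
existing page-size selection logic):

#include <linux/iommu.h>

/* stand-in for the core's largest-fitting-page-size helper */
static size_t pick_pgsize(struct iommu_domain *domain, unsigned long iova,
			  size_t size);

static size_t unmap_batched(struct iommu_domain *domain, unsigned long iova,
			    size_t size, struct iommu_iotlb_gather *gather)
{
	size_t unmapped = 0;

	while (unmapped < size) {
		size_t pgsize = pick_pgsize(domain, iova, size - unmapped);
		/* consecutive, equally sized blocks become one call */
		size_t pgcount = (size - unmapped) / pgsize;
		size_t ret;

		ret = domain->ops->unmap_pages(domain, iova, pgsize,
					       pgcount, gather);
		if (!ret)
			break;
		iova += ret;
		unmapped += ret;
	}
	return unmapped;
}

For the 4 MB example above, this becomes a single unmap_pages() call with
pgsize = 2 MB and pgcount = 2, instead of two indirect calls and two page
table walks.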

The same optimization is applicable to mapping buffers; however, the
error handling in the io-pgtable layer couldn't be done cleanly, as we
would need to invoke iommu_unmap to unmap the parts of the buffer that
were mapped and then do any TLB maintenance, which seemed like a
layering violation.

Any feedback is very much appreciated.


I applied this patchset, implemented it for SMMUv3 on my Kunpeng ARM64
platform, and tested the map/unmap latency with the dma_map_benchmark tool
(./dma_map_benchmark -g xxx); it improves the unmap latency (us)
considerably. Maybe you can also add the SMMUv3 implementation. The test
results are as follows:


                       latency of map/unmap (us)
                    before opt          after opt
g=1    (4K size)    0.1/0.7             0.1/0.7
g=2    (8K size)    0.2/1.4             0.2/0.8
g=4    (16K size)   0.3/2.7             0.3/0.9
g=8    (32K size)   0.5/5.4             0.5/1.2
g=16   (64K size)   1/10.7              1/1.8
g=32   (128K size)  1.8/21.4            1.8/2.8
g=64   (256K size)  3.6/42.9            3.6/5.1
g=128  (512K size)  7/85.6              7/8.6
g=256  (1M size)    13.9/171.1          13.9/15.5
g=512  (2M size)    0.2/0.7             0.2/0.9
g=1024 (4M size)    0.3/1.5             0.3/1.1


The change to SMMUv3 for the test is as follows:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a..e0268b1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2292,6 +2292,20 @@ static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
 
 	return ops->unmap(ops, iova, size, gather);
 }
 
+static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned long iova,
+				   size_t pgsize, size_t pgcount,
+				   struct iommu_iotlb_gather *gather)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+
+	if (!ops)
+		return 0;
+
+	return ops->unmap_pages(ops, iova, pgsize, pgcount, gather);
+}
+
 static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
 {
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2613,6 +2627,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.attach_dev		= arm_smmu_attach_dev,
 	.map			= arm_smmu_map,
 	.unmap			= arm_smmu_unmap,
+	.unmap_pages		= arm_smmu_unmap_pages,
 	.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
 	.iotlb_sync		= arm_smmu_iotlb_sync,
 	.iova_to_phys		= arm_smmu_iova_to_phys,



Thanks,
Isaac

Isaac J. Manjarres (5):
   iommu/io-pgtable: Introduce unmap_pages() as a page table op
   iommu: Add an unmap_pages() op for IOMMU drivers
   iommu: Add support for the unmap_pages IOMMU callback
   iommu/io-pgtable-arm: Implement arm_lpae_unmap_pages()
   iommu/arm-smmu: Implement the unmap_pages IOMMU driver callback

  drivers/iommu/arm/arm-smmu/arm-smmu.c |  19 +
  drivers/iommu/io-pgtable-arm.c| 114 +-
  drivers/iommu/iommu.c |  44 --
  include/linux/io-pgtable.h|   4 +
  include/linux/iommu.h |   4 +
  5 files changed, 159 insertions(+), 26 deletions(-)





Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe  wrote:
> 
> > > > Get rid of the ioasid set.
> > > >
> > > > Each driver has its own list of allowed ioasids.  
> >  [...]  
> > 
> > The /dev/ioasid FD replaces this security check. By becoming FD
> > centric you don't need additional kernel security objects.
> > 
> > Any process with access to the /dev/ioasid FD is allowed to control
> > those PASIDs. The separation between VMs falls naturally from the
> > separation of FDs without creating additional, complicated security
> > infrastructure in the kernel.
> > 
> > This is why all APIs must be FD focused, and you need to have a
> > logical layering of responsibility.
> > 
> >  Allocate a /dev/ioasid FD
> >  Allocate PASIDs inside the FD
> >  Assign memory to the PASIDS
> > 
> >  Open a device FD, eg from VFIO or VDP
> >  Instruct the device FD to authorize the device to access PASID A in
> >  an ioasid FD
> How do we know user provided PASID A was allocated by the ioasid FD?

You pass in the ioasid FD and use a 'get pasid from fdno' API to
extract the required kernel structure.

> Shouldn't we validate user input by tracking which PASIDs are
> allocated by which ioasid FD?

Yes, but it is integral to the ioasid FD, not something separated.

> > VFIO extracts some kernel representation of the ioasid from the ioasid
> > fd using an API
> > 
> This lookup API seems to be asking for a per-ioasid-FD storage array. Today,
> the ioasid_set is per-mm and contains an XArray.

Right, put the xarray per FD. A set per mm is fairly nonsensical, we
don't use the mm as that kind of security key.
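For illustration, a minimal sketch of the per-FD xarray idea (the struct
and field names are invented, not existing code):

#include <linux/types.h>
#include <linux/xarray.h>

/* Hypothetical per-FD state for /dev/ioasid */
struct ioasid_fd_ctx {
	struct xarray pasids;		/* pasid -> per-pasid state */
};

/* the 'get pasid from fdno'-style lookup VFIO would use */
static void *ioasid_fd_find(struct ioasid_fd_ctx *ctx, u32 pasid)
{
	return xa_load(&ctx->pasids, pasid);	/* NULL if not this FD's */
}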

> Since each VM (KVM) can only open one ioasid FD, this per-FD array
> would be equivalent to the per-mm ioasid_set, right?

Why only one?  Each interaction with the other FDs should include the
PASID/FD pair. There is no restriction to just one.

> > VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> > IOMMU that this PCI device is allowed to use this PASID'
>
> Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I thought
> the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls. Or we can
> skip this step for now and wait for the user to do SVA bind.

I'm not sure what you are asking.

Possibly some of the IOMMU API will need a bit of adjusting to make
things split.

The act of programming the page tables and the act of authorizing a
PCI BDF to use a PASID are distinct things with two different IOCTLs.
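A sketch of that split, with all names and layouts hypothetical, invented
only to show that the two operations are separate:

#include <linux/ioctl.h>
#include <linux/types.h>

struct ioasid_bind_pgtbl {		/* hypothetical */
	__u32 pasid;
	__u64 pgtbl_ptr;		/* page table to program */
};

struct vfio_pasid_auth {		/* hypothetical */
	__s32 ioasid_fd;		/* presenting both FDs is the check */
	__u32 pasid;
};

/* On the ioasid FD: program the page tables for a PASID */
#define IOASID_BIND_PGTBL	_IOW('i', 1, struct ioasid_bind_pgtbl)

/* On the device FD (e.g. vfio_device): authorize a BDF to use a PASID */
#define VFIO_DEVICE_PASID_AUTH	_IOW(';', 2, struct vfio_pasid_auth)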

iommu_uapi_sva_bind_gpasid() is never called by anything, and its
uAPI is never implemented.

Joerg? Why did you merge dead uapi and dead code?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe  wrote:

> > > Get rid of the ioasid set.
> > >
> > > Each driver has its own list of allowed ioasids.  
>  [...]  
> 
> The /dev/ioasid FD replaces this security check. By becoming FD
> centric you don't need additional kernel security objects.
> 
> Any process with access to the /dev/ioasid FD is allowed to control
> those PASIDs. The separation between VMs falls naturally from the
> separation of FDs without creating additional, complicated security
> infrastructure in the kernel.
> 
> This is why all APIs must be FD focused, and you need to have a
> logical layering of responsibility.
> 
>  Allocate a /dev/ioasid FD
>  Allocate PASIDs inside the FD
>  Assign memory to the PASIDS
> 
>  Open a device FD, eg from VFIO or VDP
>  Instruct the device FD to authorize the device to access PASID A in
>  an ioasid FD
How do we know user provided PASID A was allocated by the ioasid FD?
Shouldn't we validate user input by tracking which PASIDs are allocated by
which ioasid FD? This is one reason why we have ioasid_set and its xarray.

>* Prior to being authorized the device will have NO access to any
>  PASID
>* Presenting BOTH the device FD and the ioasid FD to the kernel
>  is the security check. Any process with both FDs is allowed to
>  make the connection. This is normal Unix FD centric thinking.
> 
> > > Register a ioasid in the driver's list by passing the fd and ioasid #
> > >  
> > 
> > The fd here is a device fd. Am I right?   
> 
> It would be the vfio_device FD, for instance, and a VFIO IOCTL.
> 
> > If yes, your idea is that an ioasid is allocated via /dev/ioasid and
> > associated with a device fd via either a VFIO or vDPA ioctl, right?
> > Sorry, I may be asking silly questions but really need to ensure we
> > are talking on the same page.  
> 
> Yes, this is right
> 
> > > No listening to events. A simple understandable security model.  
> > 
> > For this suggestion, I have a little bit of concern that we may have an
> > A-B/B-A lock sequence issue, since it requires the /dev/ioasid (if it
> > supports this) to call back into VFIO/VDPA to check if the ioasid has
> > been registered to the device FD and record it in the per-device list,
> > right? Let's have more discussion based on the skeleton sent by Kevin.  
> 
> Callbacks would be backwards.
> 
> User calls vfio with vfio_device fd and dev/ioasid fd
> 
> VFIO extracts some kernel representation of the ioasid from the ioasid
> fd using an API
> 
This lookup API seems to be asking for a per-ioasid-FD storage array. Today,
the ioasid_set is per-mm and contains an XArray. Since each VM (KVM) can only
open one ioasid FD, this per-FD array would be equivalent to the per-mm
ioasid_set, right?

> VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> IOMMU that this PCI device is allowed to use this PASID'
> 
Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I thought
the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls. Or we can
skip this step for now and wait for the user to do SVA bind.

> VFIO mdev drivers then record that the PASID is allowed in its own
> device specific struct for later checking during other system calls.


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 15:33:24 -0300, Jason Gunthorpe  wrote:

> On Wed, Mar 31, 2021 at 11:20:30AM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe 
> > wrote: 
> > > > > We should try to avoid hidden behind the scenes kernel
> > > > > interconnections between subsystems.
> > > > > 
>  [...]  
>  [...]  
> > yes, this is done in this patchset.
> >   
>  [...]  
> > Just to clarify, you are saying (when FREE happens before proper
> > teardown) there is no need to proactively notify all users of the
> > IOASID to drop their reference. Instead, just wait for the other
> > parties to naturally close and drop their references. Am I
> > understanding you correctly?  
> 
> Yes. What are receivers going to do when you notify them anyhow? What
> will an mdev do? This is how you get into the crazy locking problems.
> 
The receivers perform cleanup work similar to a normal unbind: drain/abort
the PASID. Locking is an issue in that the atomic notifier runs under the
IOASID spinlock, so I provided a common ordered workqueue to let mdev
drivers queue cleanup work that cannot be done in atomic context. Not
ideal. We also need to prevent nested notifications in certain cases.

> It is an error for userspace to shut down like this; recover sensibly
> and don't crash the kernel. PCIe error TLPs are expected, suppress
> them. That is what we decided in the mmu notifier discussion.
> 
> > I feel having the notifications can add two values:
> > 1. Shorten the duration of errors (as you mentioned below), FD close can
> > take a long and unpredictable time. e.g. FD shared.  
> 
> Only if userspace exits in some uncontrolled way. In a controlled exit
> it can close all the FDs in the right order.
> 
> It is OK if userspace does something weird and ends up with disabled
> IOASIDs. It shouldn't do that if it cares.
> 
Agreed.

> > 2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.
> >  
> 
> This is a hard ask too; there is no natural ordering here that I can see.
> Obviously we want vcpu, mdev, iommu for qemu, but that doesn't seem to
> fall out unless we explicitly hard-wire it into the kernel.
> 
The ordering problem as I understand it is that it is difficult for KVM to
rendezvous all vCPUs before updating the PASID translation table. So there
could be an in-flight enqcmd with the stale PASID after the PASID table
update and refcount drop.

If KVM is the last one to drop the PASID refcount, the PASID could be
immediately reused and start a new life. The in-flight enqcmd with the
stale PASID could cause problems. The likelihood and window are very small.

If we ensure KVM does the PASID table update before the IOMMU and mdev
driver, the stale PASID in the in-flight enqcmd would be drained before
starting a new life.

Perhaps Yi and Kevin can explain this better.

> Doesn't kvm always kill the vCPU first based on the mmu notifier
> shooting down all the memory? IIRC this happens before FD close?
> 
I don't know the answer, Kevin & Yi?

> > > The duration between unmapping the ioasid and releasing all HW access
> > > will have HW see PCIE TLP errors due to the blocked access. If
> > > userspace messes up the order it is fine to cause this. We already had
> > > this discussion when talking about how to deal with process exit in the
> > > simple SVA case.  
> > Yes, we have disabled fault reporting during this period. The slight
> > difference vs. the simple SVA case is that KVM is also involved and
> > there might be an ordering requirement to stop the vCPU first.  
> 
> KVM can continue to use the PASIDs, they are parked and DMA is
> permanently blocked. When KVM reaches a natural point in its teardown
> it can release them.
> 
> If you have to stop the vcpu from an iommu notifier you are in the
> crazy locking world I mentioned. IMHO don't create exciting locking
> problems just to avoid PCI errors in uncontrolled shutdown.
> 
> Suppress the errors instead.
> 
I agree, this simplifies things a lot. Just need to clarify the in-flight
enqcmd case.

> Jason


Thanks,

Jacob


RE: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright

2021-03-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: fanghao (A)
> Sent: Tuesday, March 30, 2021 7:34 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua
> (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org;
> linux-kselft...@vger.kernel.org; fanghao (A) 
> Subject: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright
> 
> s/Hisilicon/HiSilicon/g.
> It should use capital S, according to
> https://www.hisilicon.com/en/terms-of-use.
> 

My bad. Thanks.

Acked-by: Barry Song 

> Signed-off-by: Hao Fang 
> ---
>  kernel/dma/map_benchmark.c  | 2 +-
>  tools/testing/selftests/dma/dma_map_benchmark.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e0e64f8..00d6549 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
>   */
> 
>  #define pr_fmt(fmt)  KBUILD_MODNAME ": " fmt
> diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c
> b/tools/testing/selftests/dma/dma_map_benchmark.c
> index fb23ce9..b492bed 100644
> --- a/tools/testing/selftests/dma/dma_map_benchmark.c
> +++ b/tools/testing/selftests/dma/dma_map_benchmark.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
>   */
> 
>  #include 
> --
> 2.8.1



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 11:20:30AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe  wrote:
> 
> > > > We should try to avoid hidden behind the scenes kernel
> > > > interconnections between subsystems.
> > > >   
> > > Can we, in the case of exceptions? Since all these IOCTLs are coming
> > > from unreliable user space, we must deal with all exceptions.
> > >
> > > For example, when the user closes the /dev/ioasid FD before (or without)
> > > the unbind IOCTL for VFIO and KVM, the kernel must do cleanup and
> > > coordinate among subsystems. In this patchset, we have a per-mm
> > > (ioasid_set) notifier to inform mdev and KVM to clean up and drop their
> > > refcounts. Do you have any suggestion on this?  
> > 
> > The ioasid should be a reference counted object.
> > 
> yes, this is done in this patchset.
> 
> > When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
> > and parks the PASID until *all* places release it. Upon a zero
> > refcount the PASID is recycled for future use.
> > 
> Just to clarify, you are saying (when FREE happens before proper
> teardown) there is no need to proactively notify all users of the IOASID to
> drop their reference. Instead, just wait for the other parties to naturally
> close and drop their references. Am I understanding you correctly?

Yes. What are receivers going to do when you notify them anyhow? What
will an mdev do? This is how you get into the crazy locking problems.

It is an error for userspace to shut down like this; recover sensibly
and don't crash the kernel. PCIe error TLPs are expected, suppress
them. That is what we decided in the mmu notifier discussion.

> I feel having the notifications can add two values:
> 1. Shorten the duration of errors (as you mentioned below), FD close can
> take a long and unpredictable time. e.g. FD shared.

Only if userspace exits in some uncontrolled way. In a controlled exit
it can close all the FDs in the right order.

It is OK if userspace does something weird and ends up with disabled
IOASIDs. It shouldn't do that if it cares.

> 2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.

This is a hard ask too; there is no natural ordering here that I can see.
Obviously we want vcpu, mdev, iommu for qemu, but that doesn't seem to
fall out unless we explicitly hard-wire it into the kernel.

Doesn't kvm always kill the vCPU first based on the mmu notifier
shooting down all the memory? IIRC this happens before FD close?

> > The duration between unmapping the ioasid and releasing all HW access
> > will have HW see PCIE TLP errors due to the blocked access. If
> > userspace messes up the order it is fine to cause this. We already had
> > this discussion when talking about how to deal with process exit in the
> > simple SVA case.
> Yes, we have disabled fault reporting during this period. The slight
> difference vs. the simple SVA case is that KVM is also involved and there
> might be an ordering requirement to stop the vCPU first.

KVM can continue to use the PASIDs, they are parked and DMA is
permanently blocked. When KVM reaches a natural point in its teardown
it can release them.

If you have to stop the vcpu from an iommu notifier you are in the
crazy locking world I mentioned. IMHO don't create exciting locking
problems just to avoid PCI errors in uncontrolled shutdown.

Suppress the errors instead.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe  wrote:

> > > We should try to avoid hidden behind the scenes kernel
> > > interconnections between subsystems.
> > >   
> > Can we, in the case of exceptions? Since all these IOCTLs are coming
> > from unreliable user space, we must deal with all exceptions.
> >
> > For example, when the user closes the /dev/ioasid FD before (or without)
> > the unbind IOCTL for VFIO and KVM, the kernel must do cleanup and
> > coordinate among subsystems. In this patchset, we have a per-mm
> > (ioasid_set) notifier to inform mdev and KVM to clean up and drop their
> > refcounts. Do you have any suggestion on this?  
> 
> The ioasid should be a reference counted object.
> 
yes, this is done in this patchset.

> When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
> and parks the PASID until *all* places release it. Upon a zero
> refcount the PASID is recycled for future use.
> 
Just to clarify, you are saying (when FREE happens before proper
teardown) there is no need to proactively notify all users of the IOASID to
drop their reference. Instead, just wait for the other parties to naturally
close and drop their references. Am I understanding you correctly?

I feel having the notifications can add two values:
1. Shorten the duration of errors (as you mentioned below), FD close can
take a long and unpredictable time. e.g. FD shared.
2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.

> The duration between unmapping the ioasid and releasing all HW access
> will have HW see PCIE TLP errors due to the blocked access. If
> userspace messes up the order it is fine to cause this. We already had
> this discussion when talking about how to deal with process exit in the
> simple SVA case.
Yes, we have disabled fault reporting during this period. The slight
difference vs. the simple SVA case is that KVM is also involved and there
might be an ordering requirement to stop the vCPU first.

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 09:34:57AM -0700, Jacob Pan wrote:

> "3.3 PASID translation
> To support PASID isolation for Shared Work Queues used by VMs, the CPU must
> provide a way for the PASID to be communicated to the device in the DMWr
> transaction. On Intel CPUs, the CPU provides a PASID translation table in
> the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
> instruction execution in a VM, the PASID translation table is used by the
> CPU to replace the guest PASID in the work descriptor with a host PASID
before the descriptor is sent to the device."

Yikes, a special ENQCMD table in the hypervisor!

Still, pass the /dev/ioasid into a KVM IOCTL and tell it to populate
this table. KVM only adds to the table when userspace presents a
/dev/ioasid FD.

> > Doesn't work submission go either to the mdev driver or through the
> > secure PASID of #1?
> 
> No, once a PASID is bound with IOMMU, KVM, and the mdev, work
> submission is all done in HW.  But I don't think this will change
> for either uAPI design.

The big note here is "only for things that use ENQCMD" and that is
basically nothing these days.

> > Everything should revolve around the /dev/ioasid FD. qemu should pass
> > it to all places that need to know about PASID's in the VM.
> 
> I guess we need to extend the KVM interface to support PASIDs. Our original
> intention was to avoid introducing new interfaces.

New features need new interfaces, especially if there is a security
sensitivity! KVM should *not* automatically opt into security
sensitive stuff without being explicitly told what to do.

Here you'd need to authorize *two* things for IDXD:
 - The mdev needs to be told it is allowed to use PASID, this tells
   the IOMMU driver to connect the pci device under the mdev
 - KVM needs to be told to populate a vPASID to the 'ENQCMD'
   security table translated to a physical PASID.

If qemu doesn't explicitly enable the ENQCMD security table it should
be *left disabled* by KVM - even if someone else is using PASID in the
same process. And the API should be narrow like this just to the
EQNCMD table as who knows what will come down the road, or how it will
work.

Having a PASID wrongly leak out into the VM would be a security
disaster. Be explicit.
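A sketch of what those two explicit steps might look like from qemu's
side. Every name here is hypothetical (KVM has no such ioctl); the point
is only that both opt-ins are explicit:

#include <sys/ioctl.h>

#define VFIO_DEVICE_PASID_AUTH	 _IO(';', 2)	/* hypothetical */
#define KVM_ENABLE_ENQCMD_TABLE	 _IO('k', 99)	/* hypothetical */

/* qemu-side sketch: authorize the mdev, then explicitly opt KVM in */
static void authorize_idxd_pasid(int mdev_fd, int kvm_vm_fd,
				 int ioasid_fd, unsigned int vpasid)
{
	struct { int ioasid_fd; unsigned int vpasid; } args = {
		ioasid_fd, vpasid	/* hypothetical layout */
	};

	/* 1. allow the mdev to use the PASID (IOMMU side) */
	ioctl(mdev_fd, VFIO_DEVICE_PASID_AUTH, &ioasid_fd);

	/* 2. explicitly populate the ENQCMD vPASID -> pPASID entry;
	 * the table stays disabled unless qemu makes this call
	 */
	ioctl(kvm_vm_fd, KVM_ENABLE_ENQCMD_TABLE, &args);
}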

> > We should try to avoid hidden behind the scenes kernel
> > interconnections between subsystems.
> > 
> Can we, in the case of exceptions? Since all these IOCTLs are coming
> from unreliable user space, we must deal with all exceptions.
>
> For example, when the user closes the /dev/ioasid FD before (or without)
> the unbind IOCTL for VFIO and KVM, the kernel must do cleanup and
> coordinate among subsystems. In this patchset, we have a per-mm (ioasid_set)
> notifier to inform mdev and KVM to clean up and drop their refcounts.
> Do you have any suggestion on this?

The ioasid should be a reference counted object.

When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
and parks the PASID until *all* places release it. Upon a zero
refcount the PASID is recycled for future use.
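A minimal sketch of that park-until-zero-refcount model, assuming a plain
refcount_t (names invented, not the ioasid allocator's actual code):

#include <linux/refcount.h>
#include <linux/types.h>

struct ioasid_obj {			/* hypothetical */
	u32 pasid;
	refcount_t refs;
	bool dma_blocked;
};

static void recycle_pasid(u32 pasid);	/* hypothetical allocator hook */

static void ioasid_obj_put(struct ioasid_obj *obj)
{
	/* only the final user's put returns the PASID for reuse */
	if (refcount_dec_and_test(&obj->refs))
		recycle_pasid(obj->pasid);
}

static void ioasid_obj_destroy(struct ioasid_obj *obj)
{
	obj->dma_blocked = true;	/* block DMA, park the PASID */
	ioasid_obj_put(obj);		/* drop the creator's reference */
}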

The duration between unmapping the ioasid and releasing all HW access
will have HW see PCIE TLP errors due to the blocked access. If
userspace messes up the order it is fine to cause this. We already had
this discussion when talking about how to deal with process exit in the
simple SVA case.

> Let me try to paraphrase: you are suggesting common helper code and data
> format, but still driver-specific storage of the mapping, correct?

The driver just needs to hold the datastructure in its memory.

Like an xarray, the driver can have an xarray inside its struct
device, but the xarray library provides all the implementation.
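For example, a sketch of that layering with the xarray library doing the
heavy lifting (the struct and helper names are invented):

#include <linux/types.h>
#include <linux/xarray.h>

struct my_mdev {			/* hypothetical driver struct */
	struct xarray allowed_pasids;	/* driver-owned, library-implemented */
};

static int my_mdev_allow_pasid(struct my_mdev *m, u32 pasid)
{
	/* record a presence marker, checked on later system calls */
	return xa_err(xa_store(&m->allowed_pasids, pasid,
			       xa_mk_value(1), GFP_KERNEL));
}

static bool my_mdev_pasid_allowed(struct my_mdev *m, u32 pasid)
{
	return xa_load(&m->allowed_pasids, pasid) != NULL;
}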

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 09:28:05 -0300, Jason Gunthorpe  wrote:

> On Tue, Mar 30, 2021 at 05:10:41PM -0700, Jacob Pan wrote:
>  [...]  
>  [...]  
>  [...]  
> > This requires the mdev driver to obtain a list of allowed PASIDs
> > (possibly during PASID bind time) prior to doing enforcement. IMHO,
> > the PASID enforcement points are:
> > 1. During WQ configuration (e.g. programming the MSI)
> > 2. During work submission
> > 
> > For the VT-d shared workqueue, there is no way to enforce #2 in the mdev
> > driver, in that the PASID is obtained from the PASID MSR of the CPU and
> > submitted without driver involvement.  
> 
> I assume that the PASID MSR is privileged and only qemu can program
> it? Otherwise this seems like a security problem.
> 
yes.

> If qemu controls it then the idxd userspace driver in qemu must ensure
> it is only ever programmed to an authorized PASID.
> 
it is ensured for #1.

> > The enforcement for #2 is in the KVM PASID translation table, which
> > is per VM.  
> 
> I don't understand why KVM gets involved in PASID??
> 
Here is an excerpt from the SIOV spec.
https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html

"3.3 PASID translation
To support PASID isolation for Shared Work Queues used by VMs, the CPU must
provide a way for the PASID to be communicated to the device in the DMWr
transaction. On Intel CPUs, the CPU provides a PASID translation table in
the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
instruction execution in a VM, the PASID translation table is used by the
CPU to replace the guest PASID in the work descriptor with a host PASID
before the descriptor is sent to the device."

> Doesn't work submission go either to the mdev driver or through the
> secure PASID of #1?
> 
No, once a PASID is bound with IOMMU, KVM, and the mdev, work submission is
all done in HW.
But I don't think this will change for either uAPI design.

> > For our current VFIO mdev model, binding the guest page table does not
> > involve the mdev driver. So this is a gap we must fill, i.e. include a
> > callback from the mdev driver?  
> 
> No not a callback, tell the mdev driver with a VFIO IOCTL that it is
> authorized to use a specific PASID because the vIOMMU was told to
> allow it by the guest kernel. Simple and straightforward.
> 
Make sense.

> > > ioasid_set doesn't seem to help at all, certainly not as a concept
> > > tied to /dev/ioasid.
> > >   
> > Yes, we can take the security role off ioasid_set once we have a per-mdev
> > list. However, ioasid_set being a per-VM/mm entity also bridges
> > communications among kernel subsystems that don't have a direct call
> > path, e.g. KVM, VDCM and IOMMU.  
> 
> Everything should revolve around the /dev/ioasid FD. qemu should pass
> it to all places that need to know about PASID's in the VM.
> 
I guess we need to extend the KVM interface to support PASIDs. Our original
intention was to avoid introducing new interfaces.

> We should try to avoid hidden behind the scenes kernel
> interconnections between subsystems.
> 
Can we, in the case of exceptions? Since all these IOCTLs are coming from
unreliable user space, we must deal with all exceptions.

For example, when the user closes the /dev/ioasid FD before (or without) the
unbind IOCTL for VFIO and KVM, the kernel must do cleanup and coordinate among
subsystems. In this patchset, we have a per-mm (ioasid_set) notifier to inform
mdev and KVM to clean up and drop their refcounts. Do you have any suggestion
on this?

> 
> > > So when you 'allow' a mdev to access a PASID you want to say:
> > >  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> > >   
> 
> > Host and guest PASID value, as well as device info are available through
> > iommu_uapi_sva_bind_gpasid(), we just need to feed that info to mdev
> > driver.  
> 
> You need that IOCTL to exist on the *mdev driver*. It is a VFIO ioctl,
> not a iommu or ioasid or sva IOCTL.
>
OK. A separate IOCTL and separate step.

> > > That seems like a good helper library to provide for drivers to use,
> > > but it should be a construct entirely contained in the driver.  
> > Why? Would it be cleaner if it were in common code?  
> 
> No, it is the "mid layer" problematic design.
> 
> Having the iommu layer store driver-specific data on behalf of a
> driver will just make a mess. Use the natural layering we have and
> store driver specific data in the driver structs.
> 
> Add a library to help build the datastructure if necessary.
> 
Let me try to paraphrase: you are suggesting common helper code and data
format, but still driver-specific storage of the mapping, correct?

Will try this out, seems cleaner.

> Jason


Thanks,

Jacob


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-16 15:38, Christoph Hellwig wrote:
[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f1e38526d5bd40..996dfdf9d375dd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2017,7 +2017,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
-	if (smmu_domain->non_strict)
+	if (!iommu_get_dma_strict())


As Will raised, this also needs to be checking "domain->type == 
IOMMU_DOMAIN_DMA" to maintain equivalent behaviour to the attribute code 
below.



 		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
 
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 
@@ -2449,52 +2449,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }
 
-static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
-				    enum iommu_attr attr, void *data)
-{
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-
-	switch (domain->type) {
-	case IOMMU_DOMAIN_DMA:
-		switch (attr) {
-		case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-			*(int *)data = smmu_domain->non_strict;
-			return 0;
-		default:
-			return -ENODEV;
-		}
-		break;
-	default:
-		return -EINVAL;
-	}
-}

[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f985817c967a25..edb1de479dd1a7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -668,7 +668,6 @@ struct arm_smmu_domain {
 	struct mutex			init_mutex; /* Protects smmu pointer */
 
 	struct io_pgtable_ops		*pgtbl_ops;
-	bool				non_strict;
 	atomic_t			nr_ats_masters;
 
 	enum arm_smmu_domain_stage	stage;

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 0aa6d667274970..3dde22b1f8ffb0 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -761,6 +761,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (!iommu_get_dma_strict())


Ditto here.

Sorry for not spotting that sooner :(

Robin.


+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+
 	if (smmu->impl && smmu->impl->init_context) {
 		ret = smmu->impl->init_context(smmu_domain, &pgtbl_cfg, dev);
 		if (ret)



Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-31 16:32, Will Deacon wrote:

On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote:

On 2021-03-31 12:49, Will Deacon wrote:

On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:

On 2021-03-30 14:58, Will Deacon wrote:

On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
 missing bits]
Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/amd/iommu.c   | 23 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
 drivers/iommu/dma-iommu.c   |  9 +--
 drivers/iommu/intel/iommu.c | 64 -
 drivers/iommu/iommu.c   | 27 ++---
 include/linux/iommu.h   |  4 +-
 8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of stuffing a
global property into a domain attribute then pulling it out again in the
illusion that it was in any way per-domain. We're still checking
dev_is_untrusted() before making an actual decision, and it's not like we
can't add more factors at that point if we want to.


Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.


But previously, SMMU only ever saw the global policy piped through the
domain attribute by iommu_group_alloc_default_domain(), so there's no
functional change there.


For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.


Eh? This is only relevant to DMA domains anyway. Flush queues are part of
the IOVA allocator that VFIO doesn't even use. It's always been the case
that unmanaged domains only use strict invalidation.


Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets
IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is
true, no? In which case, that will get set for page-tables corresponding
to unmanaged domains as well as DMA domains when it is enabled. That didn't
happen before because you couldn't set the attribute for unmanaged domains.

What am I missing?


Oh cock... sorry, all this time I've been saying what I *expect* it to 
do, while overlooking the fact that the IO_PGTABLE_QUIRK_NON_STRICT 
hunks were the bits I forgot to write and Christoph had to fix up. 
Indeed, those should be checking the domain type too to preserve the 
existing behaviour. Apologies for the confusion.


Robin.


Obviously some of the above checks could be factored out into some kind of
iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
to keep in sync. Or maybe we just allow iommu-dma to set
IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
treating that as a generic thing now.


I think a helper that takes a domain would be a good starting point.


You mean device, right? The one condition we currently have is at the device
level, and there's really nothing inherent to the domain itself that matters
(since the type is implicitly IOMMU_DOMAIN_DMA to even care about this).


Device would probably work too; you'd pass the first device to attach to the
domain when querying this from the SMMU driver, I suppose.

Will



Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Will Deacon
On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote:
> On 2021-03-31 12:49, Will Deacon wrote:
> > On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:
> > > On 2021-03-30 14:58, Will Deacon wrote:
> > > > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:
> > > > > On 2021-03-30 14:11, Will Deacon wrote:
> > > > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> > > > > > > From: Robin Murphy 
> > > > > > > 
> > > > > > > Instead make the global iommu_dma_strict parameter in iommu.c
> > > > > > > canonical by
> > > > > > > exporting helpers to get and set it and use those directly in the 
> > > > > > > drivers.
> > > > > > > 
> > > > > > > This makes sure that the iommu.strict parameter also works for the 
> > > > > > > AMD and
> > > > > > > Intel IOMMU drivers on x86.  As those default to lazy flushing a 
> > > > > > > new
> > > > > > > IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
> > > > > > > represent the default if not overridden by an explicit parameter.
> > > > > > > 
> > > > > > > Signed-off-by: Robin Murphy .
> > > > > > > [ported on top of the other iommu_attr changes and added a few 
> > > > > > > small
> > > > > > > missing bits]
> > > > > > > Signed-off-by: Christoph Hellwig 
> > > > > > > ---
> > > > > > > drivers/iommu/amd/iommu.c   | 23 +---
> > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 
> > > > > > > +---
> > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> > > > > > > drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
> > > > > > > drivers/iommu/dma-iommu.c   |  9 +--
> > > > > > > drivers/iommu/intel/iommu.c | 64 
> > > > > > > -
> > > > > > > drivers/iommu/iommu.c   | 27 ++---
> > > > > > > include/linux/iommu.h   |  4 +-
> > > > > > > 8 files changed, 40 insertions(+), 165 deletions(-)
> > > > > > 
> > > > > > I really like this cleanup, but I can't help wonder if it's going 
> > > > > > in the
> > > > > > wrong direction. With SoCs often having multiple IOMMU instances 
> > > > > > and a
> > > > > > distinction between "trusted" and "untrusted" devices, then having 
> > > > > > the
> > > > > > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
> > > > > > unreasonable to me, but this change makes it a global property.
> > > > > 
> > > > > The intent here was just to streamline the existing behaviour of 
> > > > > stuffing a
> > > > > global property into a domain attribute then pulling it out again in 
> > > > > the
> > > > > illusion that it was in any way per-domain. We're still checking
> > > > > dev_is_untrusted() before making an actual decision, and it's not 
> > > > > like we
> > > > > can't add more factors at that point if we want to.
> > > > 
> > > > Like I say, the cleanup is great. I'm just wondering whether there's a
> > > > better way to express the complicated logic to decide whether or not to 
> > > > use
> > > > the flush queue than what we end up with:
> > > > 
> > > > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
> > > > domain->ops->flush_iotlb_all && !iommu_get_dma_strict())
> > > > 
> > > > which is mixing up globals, device properties and domain properties. The
> > > > result is that the driver code ends up just using the global to 
> > > > determine
> > > > whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table 
> > > > code,
> > > > which is a departure from the current way of doing things.
> > > 
> > > But previously, SMMU only ever saw the global policy piped through the
> > > domain attribute by iommu_group_alloc_default_domain(), so there's no
> > > functional change there.
> > 
> > For DMA domains sure, but I don't think that's the case for unmanaged
> > domains such as those used by VFIO.
> 
> Eh? This is only relevant to DMA domains anyway. Flush queues are part of
> the IOVA allocator that VFIO doesn't even use. It's always been the case
> that unmanaged domains only use strict invalidation.

Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets
IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is
true, no? In which case, that will get set for page-tables corresponding
to unmanaged domains as well as DMA domains when it is enabled. That didn't
happen before because you couldn't set the attribute for unmanaged domains.

What am I missing?

> > > Obviously some of the above checks could be factored out into some kind of
> > > iommu_use_flush_queue() helper that IOMMU drivers can also call if they 
> > > need
> > > to keep in sync. Or maybe we just allow iommu-dma to set
> > > IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if 
> > > we're
> > > treating that as a generic thing now.
> > 
> > I think a helper that takes a domain would be a good starting point.

Re: [PATCH 2/3] tracing: Use pr_crit() instead of long fancy messages

2021-03-31 Thread Steven Rostedt
On Wed, 31 Mar 2021 11:31:03 +0200
Geert Uytterhoeven  wrote:

> This reduces kernel size by ca. 0.5 KiB.

If you are worried about size, disable tracing and it will go away
entirely. 0.5KiB is a drop in the bucket compared to what tracing adds in
size overhead.

Sorry, but NAK.

This has been very successful in stopping people from adding trace_printk()
to the kernel, and I like to keep it that way.

-- Steve


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-31 12:49, Will Deacon wrote:

On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:

On 2021-03-30 14:58, Will Deacon wrote:

On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
missing bits]
Signed-off-by: Christoph Hellwig 
---
drivers/iommu/amd/iommu.c   | 23 +---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
drivers/iommu/dma-iommu.c   |  9 +--
drivers/iommu/intel/iommu.c | 64 -
drivers/iommu/iommu.c   | 27 ++---
include/linux/iommu.h   |  4 +-
8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of stuffing a
global property into a domain attribute then pulling it out again in the
illusion that it was in any way per-domain. We're still checking
dev_is_untrusted() before making an actual decision, and it's not like we
can't add more factors at that point if we want to.


Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.


But previously, SMMU only ever saw the global policy piped through the
domain attribute by iommu_group_alloc_default_domain(), so there's no
functional change there.


For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.


Eh? This is only relevant to DMA domains anyway. Flush queues are part 
of the IOVA allocator that VFIO doesn't even use. It's always been the 
case that unmanaged domains only use strict invalidation.



Obviously some of the above checks could be factored out into some kind of
iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
to keep in sync. Or maybe we just allow iommu-dma to set
IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
treating that as a generic thing now.


I think a helper that takes a domain would be a good starting point.


You mean device, right? The one condition we currently have is at the 
device level, and there's really nothing inherent to the domain itself 
that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care 
about this).
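For example, a rough sketch of such a helper, folding the quoted condition
into one place (not code from the series; the exact factoring was still
under discussion):

/* Sketch only: centralize the flush-queue decision discussed above. */
static bool iommu_use_flush_queue(struct device *dev,
				  struct iommu_domain *domain)
{
	/* flush queues are part of the IOVA allocator: DMA domains only */
	if (domain->type != IOMMU_DOMAIN_DMA)
		return false;
	/* untrusted devices always get strict invalidation */
	if (dev && dev_is_untrusted(dev))
		return false;
	/* deferred invalidation needs a way to flush everything */
	if (!domain->ops->flush_iotlb_all)
		return false;
	return !iommu_get_dma_strict();
}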


Another idea that's just come to mind is now that IOMMU_DOMAIN_DMA has a 
standard meaning, maybe we could split out a separate 
IOMMU_DOMAIN_DMA_STRICT type such that it can all propagate from 
iommu_get_def_domain_type()? That feels like it might be quite 
promising, but I'd still do it as an improvement on top of this patch, 
since it's beyond just cleaning up the abuse of domain attributes to 
pass a command-line option around.


Robin.


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 07:38:36AM +, Liu, Yi L wrote:

> The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> the VM should be able to be shared by all assigned devices for the VM.
> But the SVA operations (bind/unbind page table, cache_invalidate) should
> be per-device.

It is not *per-device*, it is *per-ioasid*.

And as /dev/ioasid is an interface for controlling multiple ioasids,
there is no issue with also multiplexing the page table manipulation for
multiple ioasids as well.

What you should do next is sketch out in some RFC the exact ioctls
each FD would have and show how the parts I outlined would work and
point out any remaining gaps.

The device FD is something like the vfio_device FD from VFIO, it has
*nothing* to do with PASID beyond having a single ioctl to authorize
the device to use the PASID. All control of the PASID is in
/dev/ioasid.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 07:41:40AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 9:28 PM
> > 
> > On Tue, Mar 30, 2021 at 04:14:58AM +, Tian, Kevin wrote:
> > 
> > > One correction. The mdev should still construct the list of allowed
> > > PASID's as you said (by listening to IOASID_BIND/UNBIND event), in
> > > addition to the ioasid set maintained per VM (updated when a PASID is
> > > allocated/freed). The per-VM set is required for inter-VM isolation
> > > (verified when a pgtable is bound to the mdev/PASID), while the mdev's
> > > own list is necessary for intra-VM isolation when multiple mdevs are
> > > assigned to the same VM (verified before loading a PASID to the mdev).
> > > This series just handles the general part i.e. per-VM ioasid set and
> > > leaves the mdev's own list to be managed by the specific mdev driver
> > > which listens to various IOASID events).
> > 
> > This is better, but I don't understand why we need such a convoluted
> > design.
> > 
> > Get rid of the ioasid set.
> >
> > Each driver has its own list of allowed ioasids.
> 
First, I agree with you that it's necessary to have a per-device allowed
ioasid list. But besides that, I think we still need to ensure the ioasid
used by a VM is really allocated to this VM. A VM should not use an ioasid
allocated to another VM, right? Actually, this is the major intention for
introducing ioasid_set.

The /dev/ioasid FD replaces this security check. By becoming FD
centric you don't need additional kernel security objects.

Any process with access to the /dev/ioasid FD is allowed to control
those PASIDs. The separation between VMs falls naturally from the
separation of FDs without creating additional, complicated security
infrastructure in the kernel.

This is why all APIs must be FD focused, and you need to have a
logical layering of responsibility.

 Allocate a /dev/ioasid FD
 Allocate PASIDs inside the FD
 Assign memory to the PASIDS

 Open a device FD, eg from VFIO or VDP
 Instruct the device FD to authorize the device to access PASID A in
 an ioasid FD
   * Prior to being authorized the device will have NO access to any
 PASID
   * Presenting BOTH the device FD and the ioasid FD to the kernel
 is the security check. Any process with both FDs is allowed to
 make the connection. This is normal Unix FD centric thinking.
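
To make that concrete, a minimal userspace sketch of the flow above.
Every ioctl, struct and path name here is an invented placeholder -- no
such uAPI exists yet; only the FD layering is the point:

/* Hypothetical names: IOASID_ALLOC, IOASID_MAP_DMA,
 * VFIO_DEVICE_ATTACH_IOASID and both structs are made up. */
static int setup_one_pasid(void *buf, size_t len, unsigned long iova)
{
	int ioasid_fd = open("/dev/ioasid", O_RDWR);	/* one per VM */
	int pasid = ioctl(ioasid_fd, IOASID_ALLOC, 0);	/* PASID lives in the FD */

	struct ioasid_dma_map map = {
		.pasid = pasid, .iova = iova,
		.vaddr = (unsigned long)buf, .size = len,
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &map);		/* assign memory to it */

	int device_fd = open("/dev/vfio/devices/X", O_RDWR); /* path made up */

	struct vfio_ioasid_attach at = {
		.ioasid_fd = ioasid_fd, .pasid = pasid,
	};
	/* presenting BOTH FDs in one call *is* the security check */
	return ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at);
}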

> > Register an ioasid in the driver's list by passing the fd and ioasid #
> 
> The fd here is a device fd. Am I right? 

It would be the vfio_device FD, for instance, and a VFIO IOCTL.

> If yes, your idea is the ioasid is allocated via /dev/ioasid and
> associated with the device fd via either a VFIO or vDPA ioctl, right?
> Sorry, I may be asking silly questions, but I really need to ensure we
> are on the same page.

Yes, this is right

> > No listening to events. A simple understandable security model.
> 
> For this suggestion, I have a bit of a concern that we may have an A-B/B-A
> lock ordering issue, since it requires /dev/ioasid (if it supports this)
> to call back into VFIO/VDPA to check if the ioasid has been registered to
> the device FD and record it in the per-device list, right? Let's have more
> discussion based on the skeleton sent by Kevin.

Callbacks would be backwards.

User calls vfio with vfio_device fd and dev/ioasid fd

VFIO extracts some kernel representation of the ioasid from the ioasid
fd using an API

VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
IOMMU that this PCI device is allowed to use this PASID'

VFIO mdev drivers then record that the PASID is allowed in its own
device specific struct for later checking during other system calls.

No lock inversions. No callbacks. Why do we need callbacks?? Simplify.
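
Roughly, in kernel terms, a sketch of that call chain -- the
ioasid_ctx_*() and iommu_ioasid_authorize() helpers are invented names
for illustration, not an existing API:

/* Hypothetical VFIO ioctl handler implementing the layering above. */
static int vfio_device_attach_ioasid(struct vfio_device *vdev,
				     int ioasid_fd, u32 pasid)
{
	struct ioasid_ctx *ctx;
	int ret;

	/* extract the kernel representation from the user's FD */
	ctx = ioasid_ctx_get_from_fd(ioasid_fd);
	if (IS_ERR(ctx))
		return PTR_ERR(ctx);

	/* tell the IOMMU layer this device may use this PASID */
	ret = iommu_ioasid_authorize(vdev->dev, ctx, pasid);
	if (!ret)
		/* record it in the driver's own allowed list */
		ret = vfio_device_allow_pasid(vdev, pasid);

	ioasid_ctx_put(ctx);
	return ret;
}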

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 05:10:41PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Tue, 30 Mar 2021 10:43:13 -0300, Jason Gunthorpe  wrote:
> 
> > > If two mdevs from the same PF dev are assigned to two VMs, the PASID
> > > table will be shared. IOASID set ensures one VM cannot program another
> > > VM's PASIDs. I assume 'secure context' is per VM when it comes to host
> > > PASID.  
> > 
> > No, the mdev device driver must enforce this directly. It is the one
> > that programms the physical shared HW, it is the one that needs a list
> > of PASID's it is allowed to program *for each mdev*
> > 
> This requires the mdev driver to obtain a list of allowed PASIDs (possibly
> during PASID bind time) prior to doing enforcement. IMHO, the PASID
> enforcement points are:
> 1. During WQ configuration (e.g. programming the MSI)
> 2. During work submission
> 
> For VT-d shared workqueue, there is no way to enforce #2 in mdev driver in
> that the PASID is obtained from PASID MSR from the CPU and submitted w/o
> driver involvement.

I assume that the PASID MSR is privileged and only qemu can program
it? Otherwise this seems like a security problem.

If qemu controls it then the idxd userspace driver in qemu must ensure
it is only ever programmed to an authorized PASID.

> The enforcement for #2 is in the KVM PASID translation table, which
> is per VM.

I don't understand why KVM gets involved in PASID??

Doesn't work submission go either to the mdev driver or through the
secure PASID of #1?

> For our current VFIO mdev model, bind guest page table does not involve
> mdev driver. So this is a gap we must fill, i.e. include a callback from
> mdev driver?

No not a callback, tell the mdev driver with a VFIO IOCTL that it is
authorized to use a specific PASID because the vIOMMU was told to
allow it by the guest kernel. Simple and straightforward.

> > ioasid_set doesn't seem to help at all, certainly not as a concept
> > tied to /dev/ioasid.
> > 
> Yes, we can take the security role off ioasid_set once we have a per-mdev
> list. However, ioasid_set being a per-VM/mm entity also bridges
> communications among kernel subsystems that don't have a direct call path,
> e.g. KVM, VDCM and IOMMU.

Everything should revolve around the /dev/ioasid FD. qemu should pass
it to all places that need to know about PASID's in the VM.

We should try to avoid hidden, behind-the-scenes kernel
interconnections between subsystems.


> > So when you 'allow' a mdev to access a PASID you want to say:
> >  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> > 

> Host and guest PASID values, as well as device info, are available through
> iommu_uapi_sva_bind_gpasid(); we just need to feed that info to the mdev
> driver.

You need that IOCTL to exist on the *mdev driver*. It is a VFIO ioctl,
not an iommu or ioasid or sva IOCTL.
 
> > That seems like a good helper library to provide for drivers to use,
> > but it should be a construct entirely contained in the driver.
> Why? Would it be cleaner if it were in the common code?

No, it is the "mid layer" problematic design.

Having the iommu layer store driver-specific data on behalf of a
driver will just make a mess. Use the natural layering we have and
store driver specific data in the driver structs.

Add a library to help build the data structure if necessary.
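
For illustration, the driver-private piece can be as small as this --
all names (struct my_mdev and its fields included) are invented, and
the locking is kept deliberately simple:

struct mdev_pasid_entry {
	struct list_head node;
	u32 pasid;
};

static int mdev_allow_pasid(struct my_mdev *m, u32 pasid)
{
	struct mdev_pasid_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);

	if (!e)
		return -ENOMEM;
	e->pasid = pasid;
	mutex_lock(&m->pasid_lock);
	list_add(&e->node, &m->allowed_pasids);
	mutex_unlock(&m->pasid_lock);
	return 0;
}

/* checked on the WQ-configuration / work-submission paths */
static bool mdev_pasid_allowed(struct my_mdev *m, u32 pasid)
{
	struct mdev_pasid_entry *e;
	bool found = false;

	mutex_lock(&m->pasid_lock);
	list_for_each_entry(e, &m->allowed_pasids, node) {
		if (e->pasid == pasid) {
			found = true;
			break;
		}
	}
	mutex_unlock(&m->pasid_lock);
	return found;
}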

Jason


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Will Deacon
On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:
> On 2021-03-30 14:58, Will Deacon wrote:
> > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:
> > > On 2021-03-30 14:11, Will Deacon wrote:
> > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> > > > > From: Robin Murphy 
> > > > > 
> > > > > Instead make the global iommu_dma_strict parameter in iommu.c
> > > > > canonical by exporting helpers to get and set it and use those
> > > > > directly in the drivers.
> > > > > 
> > > > > This makes sure that the iommu.strict parameter also works for
> > > > > the AMD and Intel IOMMU drivers on x86.  As those default to lazy
> > > > > flushing, a new IOMMU_CMD_LINE_STRICT is used to turn the value
> > > > > into a tristate to represent the default if not overridden by an
> > > > > explicit parameter.
> > > > > 
> > > > > Signed-off-by: Robin Murphy .
> > > > > [ported on top of the other iommu_attr changes and added a few small
> > > > >missing bits]
> > > > > Signed-off-by: Christoph Hellwig 
> > > > > ---
> > > > >drivers/iommu/amd/iommu.c   | 23 +---
> > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
> > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> > > > >drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
> > > > >drivers/iommu/dma-iommu.c   |  9 +--
> > > > >drivers/iommu/intel/iommu.c | 64 -
> > > > >drivers/iommu/iommu.c   | 27 ++---
> > > > >include/linux/iommu.h   |  4 +-
> > > > >8 files changed, 40 insertions(+), 165 deletions(-)
> > > > 
> > > > I really like this cleanup, but I can't help wonder if it's going in the
> > > > wrong direction. With SoCs often having multiple IOMMU instances and a
> > > > distinction between "trusted" and "untrusted" devices, then having the
> > > > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
> > > > unreasonable to me, but this change makes it a global property.
> > > 
> > > The intent here was just to streamline the existing behaviour of stuffing 
> > > a
> > > global property into a domain attribute then pulling it out again in the
> > > illusion that it was in any way per-domain. We're still checking
> > > dev_is_untrusted() before making an actual decision, and it's not like we
> > > can't add more factors at that point if we want to.
> > 
> > Like I say, the cleanup is great. I'm just wondering whether there's a
> > better way to express the complicated logic to decide whether or not to use
> > the flush queue than what we end up with:
> > 
> > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
> > domain->ops->flush_iotlb_all && !iommu_get_dma_strict())
> > 
> > which is mixing up globals, device properties and domain properties. The
> > result is that the driver code ends up just using the global to determine
> > whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
> > which is a departure from the current way of doing things.
> 
> But previously, SMMU only ever saw the global policy piped through the
> domain attribute by iommu_group_alloc_default_domain(), so there's no
> functional change there.

For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.

> Obviously some of the above checks could be factored out into some kind of
> iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
> to keep in sync. Or maybe we just allow iommu-dma to set
> IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
> treating that as a generic thing now.

I think a helper that takes a domain would be a good starting point.
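
Something like the below, i.e. the existing iommu-dma condition hoisted
into core code. The name and exact signature are just a strawman; the
!cookie->fq_domain check would stay with the caller, since the cookie
is iommu-dma-internal:

bool iommu_use_flush_queue(struct iommu_domain *domain, struct device *dev)
{
	/* untrusted devices always get strict invalidation */
	if (dev && dev_is_untrusted(dev))
		return false;
	/* a flush queue is useless without a way to drain it */
	if (!domain->ops->flush_iotlb_all)
		return false;
	return !iommu_get_dma_strict();
}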

Will


Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable

2021-03-31 Thread Robin Murphy

On 2021-03-19 17:26, John Garry wrote:
[...]

> > > @@ -25,7 +25,8 @@ struct iova {
> > >   struct iova_magazine;
> > >   struct iova_cpu_rcache;
> > > -#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
> > > +#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
> > > +#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range size (in pages) */
> > 
> > No.
> > 
> > And why? If we're going to allocate massive caches anyway, whatever is
> > the point of *not* using all of them?




> I wanted to keep the same effective threshold for devices today, unless
> set otherwise.
> 
> The reason is that I didn't know if a blanket increase would cause
> regressions, and I was taking the super-safe road. Specifically some
> systems may be very IOVA space limited, and just work today by not
> caching large IOVAs.


alloc_iova_fast() will already clear out the caches if space is running 
low, so the caching itself shouldn't be an issue.


> And in the precursor thread you wrote "We can't arbitrarily *increase*
> the scope of caching once a domain is active due to the size-rounding-up
> requirement, which would be prohibitive to larger allocations if applied
> universally" (sorry for quoting)
> 
> I took the last part to mean that we shouldn't apply this increase in
> threshold globally.


I meant we can't increase the caching threshold as-is once the domain is 
in use, because that could result in odd-sized IOVAs previously 
allocated above the old threshold being later freed back into caches, 
then causing havoc the next time they get allocated (because they're not 
as big as the actual size being asked for). However, trying to address 
that by just size-aligning everything even above the caching threshold 
is liable to waste too much space on IOVA-constrained systems (e.g. a 
single 4K video frame may be ~35MB - rounding that up to 64MB each time 
would be hard to justify).
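
A concrete instance of that failure mode, assuming the existing
order_base_2() bucketing in the rcache insert path (numbers picked for
illustration only):

/* Sizes in pages, thresholds as log2:
 *   old threshold = 6: a 40-page alloc is above it -> not rounded up
 *   threshold later raised to 10; the 40-page IOVA is freed:
 *       order_base_2(40) == 6 -> inserted into the order-6 magazine
 *   next 64-page alloc hits that magazine -> gets a 40-page region
 */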


It follows from that that there's really no point in decoupling the 
rounding-up threshold from the actual caching threshold - you get all 
the wastage (both IOVA space and actual memory for the cache arrays) for 
no obvious benefit.


> > It only makes sense for a configuration knob to affect the actual
> > rcache and depot allocations - that way, big high-throughput systems
> > with plenty of memory can spend it on better performance, while small
> > systems - that often need IOMMU scatter-gather precisely *because*
> > memory is tight and thus easily fragmented - don't have to pay the
> > (not insignificant) cost for caches they don't need.
> 
> So do you suggest to just make IOVA_RANGE_CACHE_MAX_SIZE a Kconfig option?


Again, I'm less convinced by Kconfig since I imagine many people tuning 
server-class systems for their own particular workloads will be running 
standard enterprise distros, so I think end-user-accessible knobs will 
be the most valuable. That's not to say that a Kconfig option to set the 
default state of a command-line option (as we do elsewhere) won't be 
useful for embedded users, cloud providers, etc., just that I'm not sure 
it's worth it being the *only* option.
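
The usual pattern would look something like this (all names invented
for illustration; the option string mirrors the existing iommu.strict
early_param handling):

config IOMMU_DEFAULT_IOVA_RCACHE_ORDER
	int "Default log2 of the largest cached IOVA size (in pages)"
	default 6
	help
	  Sets the default for the iommu.iova_rcache_order= command-line
	  option; distros keep the default, embedded configs bake in
	  their own.

and in the IOVA code:

static unsigned int iova_rcache_order __read_mostly =
	CONFIG_IOMMU_DEFAULT_IOVA_RCACHE_ORDER;

static int __init iommu_set_iova_rcache_order(char *str)
{
	/* command-line value overrides the Kconfig default */
	return kstrtouint(str, 0, &iova_rcache_order);
}
early_param("iommu.iova_rcache_order", iommu_set_iova_rcache_order);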


Robin.

[PATCH 1/3] iommu: Use pr_crit() instead of long fancy messages

2021-03-31 Thread Geert Uytterhoeven
While long fancy messages have a higher probability of being seen than
small messages, they may scroll off the screen fast, if visible at all,
and may still be missed.  In addition, they increase boot time and
kernel size.

The correct mechanism to increase the importance of a kernel message is
not to draw fancy boxes with more text, but to shout louder, i.e. to
increase the message's reporting level.  Making sure the administrator
of the system is aware of such a message is a system policy, and is the
responsibility of a user-space log daemon.

Fix this by increasing the reporting level from KERN_WARNING to
KERN_CRIT, and removing irrelevant text and graphics.

This reduces kernel size by ca. 0.5 KiB.

Fixes: bad614b24293ae46 ("iommu: Enable debugfs exposure of IOMMU driver internals")
Signed-off-by: Geert Uytterhoeven 
---
 drivers/iommu/iommu-debugfs.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/iommu-debugfs.c b/drivers/iommu/iommu-debugfs.c
index f0354894209648fd..c3306998de0687bd 100644
--- a/drivers/iommu/iommu-debugfs.c
+++ b/drivers/iommu/iommu-debugfs.c
@@ -32,20 +32,9 @@ void iommu_debugfs_setup(void)
 {
if (!iommu_debugfs_dir) {
iommu_debugfs_dir = debugfs_create_dir("iommu", NULL);
-   pr_warn("\n");
-   pr_warn("*************************************************************\n");
-   pr_warn("**     NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE    **\n");
-   pr_warn("**                                                         **\n");
-   pr_warn("**  IOMMU DebugFS SUPPORT HAS BEEN ENABLED IN THIS KERNEL  **\n");
-   pr_warn("**                                                         **\n");
-   pr_warn("** This means that this kernel is built to expose internal **\n");
-   pr_warn("** IOMMU data structures, which may compromise security on **\n");
-   pr_warn("** your system.                                            **\n");
-   pr_warn("**                                                         **\n");
-   pr_warn("** If you see this message and you are not debugging the   **\n");
-   pr_warn("** kernel, report this immediately to your vendor!         **\n");
-   pr_warn("**                                                         **\n");
-   pr_warn("**     NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE    **\n");
-   pr_warn("*************************************************************\n");
+   pr_crit("IOMMU DebugFS SUPPORT HAS BEEN ENABLED IN THIS KERNEL\n");
+   pr_crit("This means that this kernel is built to expose internal\n");
+   pr_crit("IOMMU data structures, which may compromise security on\n");
+   pr_crit("your system.\n");
}
 }
-- 
2.25.1



[PATCH 2/3] tracing: Use pr_crit() instead of long fancy messages

2021-03-31 Thread Geert Uytterhoeven
While long fancy messages have a higher probability of being seen than
small messages, they may scroll off the screen fast, if visible at all,
and may still be missed.  In addition, they increase boot time and
kernel size.

The correct mechanism to increase the importance of a kernel message is
not to draw fancy boxes with more text, but to shout louder, i.e. to
increase the message's reporting level.  Making sure the administrator
of the system is aware of such a message is a system policy, and is the
responsibility of a user-space log daemon.

Fix this by increasing the reporting level from KERN_WARNING to
KERN_CRIT, and removing irrelevant text and graphics.

This reduces kernel size by ca. 0.5 KiB.

Fixes: 2184db46e425c2b8 ("tracing: Print nasty banner when trace_printk() is in use")
Signed-off-by: Geert Uytterhoeven 
---
 kernel/trace/trace.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index eccb4e1187cc788e..b3a93aff01045923 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3175,20 +3175,9 @@ void trace_printk_init_buffers(void)
 
/* trace_printk() is for debug use only. Don't use it in production. */
 
-   pr_warn("\n");
-   pr_warn("**\n");
-   pr_warn("**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n");
-   pr_warn("**  **\n");
-   pr_warn("** trace_printk() being used. Allocating extra memory.  **\n");
-   pr_warn("**  **\n");
-   pr_warn("** This means that this is a DEBUG kernel and it is **\n");
-   pr_warn("** unsafe for production use.   **\n");
-   pr_warn("**  **\n");
-   pr_warn("** If you see this message and you are not debugging**\n");
-   pr_warn("** the kernel, report this immediately to your vendor!  **\n");
-   pr_warn("**  **\n");
-   pr_warn("**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n");
-   pr_warn("**\n");
+   pr_crit("trace_printk() being used. Allocating extra memory.\n");
+   pr_crit("This means that this is a DEBUG kernel and it is\n");
+   pr_crit("unsafe for production use.\n");
 
/* Expand the buffers to set size */
tracing_update_buffers();
-- 
2.25.1



[PATCH 3/3] lib/vsprintf: Use pr_crit() instead of long fancy messages

2021-03-31 Thread Geert Uytterhoeven
While long fancy messages have a higher probability of being seen than
small messages, they may scroll off the screen fast, if visible at all,
and may still be missed.  In addition, they increase boot time and
kernel size.

The correct mechanism to increase the importance of a kernel message is
not to draw fancy boxes with more text, but to shout louder, i.e. to
increase the message's reporting level.  Making sure the administrator
of the system is aware of such a message is a system policy, and is the
responsibility of a user-space log daemon.

Fix this by increasing the reporting level from KERN_WARNING to
KERN_CRIT, and removing irrelevant text and graphics.

This reduces kernel size by ca. 0.5 KiB.

Fixes: 5ead723a20e0447b ("lib/vsprintf: no_hash_pointers prints all addresses as unhashed")
Signed-off-by: Geert Uytterhoeven 
---
 lib/vsprintf.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 9b423359bb6433d3..0293f1b89064b287 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2193,20 +2193,9 @@ static int __init no_hash_pointers_enable(char *str)
 
no_hash_pointers = true;
 
-   pr_warn("**\n");
-   pr_warn("**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n");
-   pr_warn("**  **\n");
-   pr_warn("** This system shows unhashed kernel memory addresses   **\n");
-   pr_warn("** via the console, logs, and other interfaces. This**\n");
-   pr_warn("** might reduce the security of your system.**\n");
-   pr_warn("**  **\n");
-   pr_warn("** If you see this message and you are not debugging**\n");
-   pr_warn("** the kernel, report this immediately to your system   **\n");
-   pr_warn("** administrator!   **\n");
-   pr_warn("**  **\n");
-   pr_warn("**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n");
-   pr_warn("**\n");
-
+   pr_crit("This system shows unhashed kernel memory addresses\n");
+   pr_crit("via the console, logs, and other interfaces. This\n");
+   pr_crit("might reduce the security of your system.\n");
return 0;
 }
 early_param("no_hash_pointers", no_hash_pointers_enable);
-- 
2.25.1



[PATCH 0/3] Use pr_crit() instead of long fancy messages

2021-03-31 Thread Geert Uytterhoeven
Hi all,

While long fancy messages have a higher probability of being seen than
small messages, they may scroll off the screen fast, if visible at all,
and may still be missed.  In addition, they increase boot time and
kernel size.

The correct mechanism to increase the importance of a kernel message is
not to draw fancy boxes with more text, but to shout louder, i.e. to
increase the message's reporting level.  Making sure the administrator
of the system is aware of such a message is a system policy, and is the
responsibility of a user-space log daemon.

This series increases the reporting level of such messages from
KERN_WARNING to KERN_CRIT, and removes irrelevant text and graphics.
It was made against linux-next, but applies to v5.12-rc5 with an offset
for the last patch.

Each of these patches reduces kernel size by ca. 0.5 KiB.

Thanks for your comments!

Geert Uytterhoeven (3):
  iommu: Use pr_crit() instead of long fancy messages
  tracing: Use pr_crit() instead of long fancy messages
  lib/vsprintf: Use pr_crit() instead of long fancy messages

 drivers/iommu/iommu-debugfs.c | 19 ---
 kernel/trace/trace.c  | 17 +++--
 lib/vsprintf.c| 17 +++--
 3 files changed, 10 insertions(+), 43 deletions(-)

-- 
2.25.1

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH] clk: Align provider-specific CLK_* bit definitions

2021-03-31 Thread Geert Uytterhoeven
The definition of CLK_MULTIPLIER_ROUND_CLOSEST is not aligned to the two
bit definitions next to it.  A deeper inspection reveals that the
alignment of CLK_MULTIPLIER_ROUND_CLOSEST does match the most common
alignment.

Align the bit definitions for the various provider types throughout the
file at 40 columns, to increase uniformity.

Signed-off-by: Geert Uytterhoeven 
---
 include/linux/clk-provider.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/clk-provider.h b/include/linux/clk-provider.h
index 58f6fe866ae9b797..8f37829565f9640e 100644
--- a/include/linux/clk-provider.h
+++ b/include/linux/clk-provider.h
@@ -342,7 +342,7 @@ struct clk_fixed_rate {
unsigned long   flags;
 };
 
-#define CLK_FIXED_RATE_PARENT_ACCURACY BIT(0)
+#define CLK_FIXED_RATE_PARENT_ACCURACY BIT(0)
 
 extern const struct clk_ops clk_fixed_rate_ops;
 struct clk_hw *__clk_hw_register_fixed_rate(struct device *dev,
@@ -984,8 +984,8 @@ struct clk_fractional_divider {
 
 #define to_clk_fd(_hw) container_of(_hw, struct clk_fractional_divider, hw)
 
-#define CLK_FRAC_DIVIDER_ZERO_BASEDBIT(0)
-#define CLK_FRAC_DIVIDER_BIG_ENDIANBIT(1)
+#define CLK_FRAC_DIVIDER_ZERO_BASEDBIT(0)
+#define CLK_FRAC_DIVIDER_BIG_ENDIANBIT(1)
 
 extern const struct clk_ops clk_fractional_divider_ops;
 struct clk *clk_register_fractional_divider(struct device *dev,
@@ -1033,9 +1033,9 @@ struct clk_multiplier {
 
 #define to_clk_multiplier(_hw) container_of(_hw, struct clk_multiplier, hw)
 
-#define CLK_MULTIPLIER_ZERO_BYPASS BIT(0)
+#define CLK_MULTIPLIER_ZERO_BYPASS BIT(0)
 #define CLK_MULTIPLIER_ROUND_CLOSEST   BIT(1)
-#define CLK_MULTIPLIER_BIG_ENDIAN  BIT(2)
+#define CLK_MULTIPLIER_BIG_ENDIAN  BIT(2)
 
 extern const struct clk_ops clk_multiplier_ops;
 
-- 
2.25.1



Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-03-31 Thread Robin Murphy

On 2021-03-22 15:01, John Garry wrote:

> On 19/03/2021 19:20, Robin Murphy wrote:
> 
> Hi Robin,
> 
> > > So then we have the issue of how to dynamically increase this rcache
> > > threshold. The problem is that we may have many devices associated with
> > > the same domain. So, in theory, we can't assume that when we increase
> > > the threshold that some other device will try to fast free an IOVA which
> > > was allocated prior to the increase and was not rounded up.
> > > 
> > > I'm very open to better (or less bad) suggestions on how to do this ...
> > 
> > ...but yes, regardless of exactly where it happens, rounding up or not
> > is the problem for rcaches in general. I've said several times that my
> > preferred approach is to not change it that dynamically at all, but
> > instead treat it more like we treat the default domain type.
> 
> Can you remind me of that idea? I don't remember you mentioning using
> default domain handling as a reference in any context.


Sorry if the phrasing was unclear there - the allusion to default 
domains is new, it just occurred to me that what we do there is in fact 
fairly close to what I've suggested previously for this. In that case, 
we have a global policy set by the command line, which *can* be 
overridden per-domain via sysfs at runtime, provided the user is willing 
to tear the whole thing down. Using a similar approach here would give a 
fair degree of flexibility but still mean that changes never have to be 
made dynamically to a live domain.
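
For reference, the existing default-domain flow this is modelled on --
group number and driver name are made up, and the rcache knob on the
last line is purely hypothetical:

  # quiesce the group by unbinding its drivers
  echo 0000:03:00.0 > /sys/bus/pci/drivers/some-driver/unbind
  # change the policy for the group's default domain
  echo DMA > /sys/kernel/iommu_groups/7/type
  echo 0000:03:00.0 > /sys/bus/pci/drivers/some-driver/bind

  # a per-domain rcache limit could follow the same pattern:
  echo 10 > /sys/kernel/iommu_groups/7/iova_rcache_order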


Robin.


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:43 PM
[..]
> No, the mdev device driver must enforce this directly. It is the one
> that programms the physical shared HW, it is the one that needs a list
> of PASID's it is allowed to program *for each mdev*
> 
> ioasid_set doesn't seem to help at all, certainly not as a concept
> tied to /dev/ioasid.
> 

As replied in another thread, we introduced ioasid_set with the
motivation of having per-VM ioasid tracking, which is required when user
space tries to bind an ioasid to a device: we should ensure the ioasid
being used was actually allocated to that VM, otherwise we may suffer
inter-VM ioasid problems. It does not necessarily have to be ioasid_set,
but per-VM ioasid tracking is necessary.

Regards,
Yi Liu


RE: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()

2021-03-31 Thread Salil Mehta
(+) Correction below; sorry for the typo in the earlier post.

> From: iommu [mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of
> Robin Murphy
> Sent: Friday, March 19, 2021 5:00 PM
> To: John Garry ; j...@8bytes.org; w...@kernel.org;
> j...@linux.ibm.com; martin.peter...@oracle.com; h...@lst.de;
> m.szyprow...@samsung.com
> Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org;
> linux-s...@vger.kernel.org; Linuxarm 
> Subject: Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
> 
> On 2021-03-19 13:25, John Garry wrote:
> > Add a function to allow the max size which we want to optimise DMA mappings
> > for.
> 
> It seems neat in theory - particularly for packet-based interfaces that
> might have a known fixed size of data unit that they're working on at
> any given time - but aren't there going to be many cases where the
> driver has no idea because it depends on whatever size(s) of request
> userspace happens to throw at it? Even if it does know the absolute
> maximum size of thing it could ever transfer, that could be
> impractically large in areas like video/AI/etc., so it could still be
> hard to make a reasonable decision.


This is also the case for networking workloads where we have MTU set but
actual packet sizes might vary.

> 
> Being largely workload-dependent is why I still think this should be a
> command-line or sysfs tuneable - we could set the default based on how
> much total memory is available, but ultimately it's the end user who
> knows what the workload is going to be and what they care about
> optimising for.
> 
> Another thought (which I'm almost reluctant to share) is that I would
> *love* to try implementing a self-tuning strategy that can detect high
> contention on particular allocation sizes and adjust the caches on the
> fly, but I can easily imagine that having enough inherent overhead to
> end up being an impractical (but fun) waste of time.


This might be particularly useful for NICs, where packet sizes vary
from 64B to 9K. But without an optimal strategy, this can affect the
performance of networking workloads.


> 
> Robin.
> 
> > Signed-off-by: John Garry 
> > ---
> >   drivers/iommu/dma-iommu.c   |  2 +-
> >   include/linux/dma-map-ops.h |  1 +
> >   include/linux/dma-mapping.h |  5 +
> >   kernel/dma/mapping.c| 11 +++
> >   4 files changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index a5dfbd6c0496..d35881fcfb9c 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
> > return (dma_addr_t)iova << shift;
> >   }
> >
> > -__maybe_unused
> >   static void iommu_dma_set_opt_size(struct device *dev, size_t size)
> >   {
> > struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > @@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> > .map_resource   = iommu_dma_map_resource,
> > .unmap_resource = iommu_dma_unmap_resource,
> > .get_merge_boundary = iommu_dma_get_merge_boundary,
> > +   .set_max_opt_size   = iommu_dma_set_opt_size,
> >   };
> >
> >   /*
> > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > index 51872e736e7b..fed7a183b3b9 100644
> > --- a/include/linux/dma-map-ops.h
> > +++ b/include/linux/dma-map-ops.h
> > @@ -64,6 +64,7 @@ struct dma_map_ops {
> > u64 (*get_required_mask)(struct device *dev);
> > size_t (*max_mapping_size)(struct device *dev);
> > unsigned long (*get_merge_boundary)(struct device *dev);
> > +   void (*set_max_opt_size)(struct device *dev, size_t size);
> >   };
> >
> >   #ifdef CONFIG_DMA_OPS
> > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> > index 2a984cb4d1e0..91fe770145d4 100644
> > --- a/include/linux/dma-mapping.h
> > +++ b/include/linux/dma-mapping.h
> > @@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev);
> >   size_t dma_max_mapping_size(struct device *dev);
> >   bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
> >   unsigned long dma_get_merge_boundary(struct device *dev);
> > +void dma_set_max_opt_size(struct device *dev, size_t size);
> >   #else /* CONFIG_HAS_DMA */
> >   static inline dma_addr_t dma_map_page_attrs(struct device *dev,
> > struct page *page, size_t offset, size_t size,
> > @@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct device *dev)
> >   {
> > return 0;
> >   }
> > +static inline void dma_set_max_opt_size(struct device *dev, size_t size)
> > +{
> > +}
> > +
> >   #endif /* CONFIG_HAS_DMA */
> >
> >   struct page *dma_alloc_pages(struct device *dev, size_t size,
> > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > index b6a633679933..59e6acb1c471 100644
> > --- a/kernel/dma/mapping.c
> > +++ b/kernel/dma/mapping.c
> > @@ -608,3 +608,14 @@ unsigned

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:28 PM
> 
> On Tue, Mar 30, 2021 at 04:14:58AM +, Tian, Kevin wrote:
> 
> > One correction. The mdev should still construct the list of allowed
> > PASID's as you said (by listening to IOASID_BIND/UNBIND event), in
> > addition to the ioasid set maintained per VM (updated when a PASID is
> > allocated/freed). The per-VM set is required for inter-VM isolation
> > (verified when a pgtable is bound to the mdev/PASID), while the mdev's
> > own list is necessary for intra-VM isolation when multiple mdevs are
> > assigned to the same VM (verified before loading a PASID to the mdev).
> > This series just handles the general part i.e. per-VM ioasid set and
> > leaves the mdev's own list to be managed by specific mdev driver which
> > listens to various IOASID events).
> 
> This is better, but I don't understand why we need such a convoluted
> design.
> 
> Get rid of the ioasid set.
>
> Each driver has its own list of allowed ioasids.

First, I agree with you it's necessary to have a per-device allowed ioasid
list. But besides that, I think we still need to ensure the ioasid used by a
VM is really allocated to this VM. A VM should not use an ioasid allocated
to another VM, right? Actually, this is the major motivation for introducing
ioasid_set.

> Register an ioasid in the driver's list by passing the fd and ioasid #

The fd here is a device fd. Am I right? If yes, your idea is the ioasid is
allocated via /dev/ioasid and associated with the device fd via either a
VFIO or vDPA ioctl, right? Sorry, I may be asking silly questions, but I
really need to ensure we are on the same page.

> No listening to events. A simple understandable security model.

For this suggestion, I have a bit of a concern that we may have an A-B/B-A
lock ordering issue, since it requires /dev/ioasid (if it supports this)
to call back into VFIO/VDPA to check if the ioasid has been registered to
the device FD and record it in the per-device list, right? Let's have more
discussion based on the skeleton sent by Kevin.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> On Tue, Mar 30, 2021 at 01:37:05AM +, Tian, Kevin wrote:
[...]
> > Hi, Jason,
> >
> > Actually above is a major open while we are refactoring vSVA uAPI toward
> > this direction. We have two concerns about merging /dev/ioasid with
> > /dev/sva, and would like to hear your thought whether they are valid.
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> quemu should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 
> > Second, ioasid is managed per process/VM while pgtable binding is a
> > device-wise operation.  The userspace flow looks like below for an integral
> > /dev/ioasid interface:
> >
> > - ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
> > - ioasid_fd = open(/dev/ioasid)
> > - ioctl(ioasid_fd, IOASID_GET_USVA_FD, &sva_fd) //an empty context
> > - ioctl(device->fd, VFIO_DEVICE_SET_SVA, &sva_fd); //sva_fd ties to
> device
> > - ioctl(sva_fd, USVA_GET_INFO, &sva_info);
> > - ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, &ioasid);
> > - ioctl(sva_fd, USVA_BIND_PGTBL, &bind_data);
> > - ioctl(sva_fd, USVA_FLUSH_CACHE, &inv_info);
> > - ioctl(sva_fd, USVA_UNBIND_PGTBL, &unbind_data);
> > - ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, &sva_fd);
> > - close(sva_fd)
> > - close(ioasid_fd)
> >
> > Our hesitation here is based on one of your earlier comments that
> > you are not a fan of constructing fd's through ioctl. Are you OK with
> > above flow or have a better idea of handling it?
> 
> My reaction is to squash 'sva' and ioasid fds together, I can't see
> why you'd need two fds to manipulate a PASID.

The reason is the /dev/ioasid FD is per-VM, since the ioasid allocated to
the VM should be able to be shared by all assigned devices for the VM.
But the SVA operations (bind/unbind page table, cache_invalidate) should
be per-device. If we squash the two fds into one, then each vSVA ioctl
requires a device tag. I'm not sure that is good. To me, it looks better
to have an SVA FD associated with a device FD, so that any ioctl on it is
at the device level. This also benefits ARM's and AMD's vSVA support,
since they bind the guest PASID table to the host instead of binding
guest page tables to specific PASIDs.

Regards,
Yi Liu