Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Shenming Lu
On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> 
>> The drivers register per-page-table fault handlers to /dev/ioasid, which
>> will then register itself to the iommu core to listen for and route the
>> per-device I/O page faults.
> 
> I'm still confused why drivers need fault handlers at all?

Essentially it is userspace that needs the fault handlers: one case is
to deliver the faults to the vIOMMU, and another is to enable IOPF on
the GPA address space for on-demand paging. It seems that both could be
specified in/through the IOASID_ALLOC ioctl?
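
For illustration, here is a minimal sketch of how both cases might be
expressed at allocation time. Every name below (the flags, the struct,
the fault_fd field) is hypothetical and not part of the proposal:

#include <stdint.h>

#define IOASID_FLAG_USER_FAULTS  (1u << 0) /* route faults to userspace (vIOMMU case) */
#define IOASID_FLAG_ENABLE_IOPF  (1u << 1) /* on-demand paging on the GPA space */

struct ioasid_alloc_request {
	uint32_t argsz;
	uint32_t flags;    /* IOASID_FLAG_* above */
	int32_t  fault_fd; /* e.g. an eventfd signalled when a fault is queued */
	uint32_t pad;
};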

Thanks,
Shenming



Re: [PATCH v3 0/7] iommu: Allow IOVA rcache range be configured

2021-06-01 Thread Lu Baolu

On 6/1/21 10:29 PM, John Garry wrote:

For streaming DMA mappings involving an IOMMU, where the IOVA length
regularly exceeds the IOVA rcache upper limit (meaning that they are not
cached), performance can be reduced.

This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
from last rb tree node if iova search fails"), as discussed at [0].

IOVAs which cannot be cached are highly involved in the IOVA ageing issue,
as discussed at [1].

This series allows the IOVA rcache range to be configured, so that we may
cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used
indirectly to configure the IOVA rcache range:
/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated in the same way as the IOMMU group default domain
type, i.e. the only device in the group must first be unbound from its
device driver.
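
As a concrete illustration, the flow could look like the sketch below
(assuming group 0 holds a single PCI device 0000:81:00.0 bound to a
hypothetical driver "foo"; the rebind step and the byte unit are my
assumptions, mirroring how default domain type changes work):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd >= 0) {
		write(fd, val, strlen(val));
		close(fd);
	}
}

int main(void)
{
	/* 1. unbind the only device in the group */
	write_str("/sys/bus/pci/drivers/foo/unbind", "0000:81:00.0");
	/* 2. raise the IOVA rcache upper limit via the new file */
	write_str("/sys/kernel/iommu_groups/0/max_opt_dma_size", "262144");
	/* 3. rebind so the re-allocated default domain picks it up */
	write_str("/sys/bus/pci/drivers/foo/bind", "0000:81:00.0");
	return 0;
}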


Could you explain why this requires a singleton group and driver
unbinding if the user only wants to increase the upper limit? I haven't
dived into the details yet; sorry if this is a silly question.

Best regards,
baolu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Lu Baolu

On 6/2/21 1:26 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:


This version only covers 1) and 4). Do you think we need to support 2),
3) and beyond?


Yes, absolutely. The API should be flexible enough to specify the
creation of all future page table formats we'd want to have and all HW
specific details on those formats.


OK, let's continue along the same line.


If so, it seems that we need some in-kernel helpers and uAPIs to
support pre-installing a page table to IOASID.


Not sure what this means..


Sorry that I didn't make this clear.

Let me restate the page table types as I see them:

 1) IOMMU format page table (a.k.a. iommu_domain)
 2) user application CPU page table (SVA for example)
 3) KVM EPT (future option)
 4) VM guest managed page table (nesting mode)

Each type of page table should be able to be associated with its IOASID.
We have a BIND protocol for 4); we explicitly allocate an iommu_domain
for 1). But we don't have a clear definition for 2), 3) and others. I
think it's necessary to clearly define a time point and kAPI name
between IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
opportunity to associate their page table with the allocated IOASID
before attaching the page table to the real IOMMU hardware.
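
A purely hypothetical sketch of such a kAPI (none of these names exist;
only ioasid_t comes from include/linux/ioasid.h) might look like:

enum ioasid_pgtable_type {
	IOASID_PGTABLE_IOMMU,	/* 1) iommu_domain */
	IOASID_PGTABLE_CPU_SVA,	/* 2) user application CPU page table */
	IOASID_PGTABLE_KVM_EPT,	/* 3) KVM EPT (future option) */
	IOASID_PGTABLE_GUEST,	/* 4) guest-managed, nesting mode */
};

struct ioasid_pgtable_ops {
	/* the provider of the page table handles its page faults */
	int (*handle_fault)(void *pgtable, unsigned long addr, int flags);
};

/* called after IOASID_ALLOC but before IOASID_ATTACH */
int ioasid_set_pgtable(ioasid_t ioasid, enum ioasid_pgtable_type type,
		       void *pgtable, const struct ioasid_pgtable_ops *ops);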

I/O page fault handling is similar. The provider of the page table
should take the responsibility to handle the possible page faults.

Could this answer the question "I'm still confused why drivers need
fault handlers at all?" from the thread below?

https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/#m15def9e8b236dfcf97e21c8e9f8a58da214e3691




 From this point of view an IOASID is actually not just a variant of
iommu_domain, but an I/O page table representation in a broader
sense.


Yes, and things need to evolve in a staged way. The ioctl API should
have room for this growth, but you need to start out with something
constrained enough to actually implement, then figure out how to grow
from there.


Yes, agreed. I just think about it from the perspective of a design
document.

Best regards,
baolu


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, June 2, 2021 6:22 AM
> 
> On Tue, 1 Jun 2021 07:01:57 +
> "Tian, Kevin"  wrote:
> >
> > I summarized five opens here, about:
> >
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> >
> ...
> >
> > For 5), I'd expect Alex to chime in. Per my understanding, it looks
> > like the original purpose of this protocol is not about I/O address
> > space. It's for KVM to know whether any device is assigned to this VM
> > and then do something special (e.g. posted interrupt, EPT cache
> > attribute, etc.).
> 
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is

invlpg -> wbinvd :)

> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.

so the symbol_get thing is not about the protocol itself. Whatever
protocol is defined, as long as kvm needs to call a vfio or ioasid
helper function, we need to define a proper way to do it. Jason, what's
your opinion on an alternative option, since you dislike symbol_get?

> 
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
> 
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.
> 
> > Because KVM deduces some policy based on the fact of an assigned
> > device, it needs to hold a reference to the related vfio group. This
> > part is irrelevant to this RFC.
> 
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.

One open is whether kvm should hold a device reference or an IOASID
reference. For DMA coherence, it only matters whether the assigned
devices are coherent or not (not a specific address space). For kvmgt,
it is for recording the kvm pointer in the mdev driver to do write
protection. For ppc, it does relate to a specific I/O page table.

Then I feel only a part of the protocol will be moved to /dev/ioasid and
something will still remain between kvm and vfio?

> 
> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. It looks like it
> > also leverages this protocol to do iommu group attach:
> > kvm_spapr_tce_attach_iommu_group. I don't know why it's done through
> > KVM instead of VFIO uAPI in the first place.
> 
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM
> needs
> to know how to handle those for in-kernel acceleration.  Thanks,
> 

ok.

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, June 2, 2021 1:57 AM
> 
> On Tue, Jun 01, 2021 at 08:38:00AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired 
> > > > w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > > >
> > > > ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a non-virtualization
> > > > scenario.
> > >
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> >
> > Jason, want to confirm here. Per earlier discussion we remain under
> > the impression that you want VFIO to be a pure device driver, thus
> > container/group are used only for legacy applications.
> 
> Let me call this a "nice wish".
> 
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
> 

OK, got your point. So let's start by keeping this room open. For new
sub-systems like vDPA, they don't need to invent a group fd uAPI and can
just leave it to their users to meet the group limitation. For an
existing sub-system, i.e. VFIO, it could keep a stronger group
enforcement uAPI like today. One day we may revisit it, if the simple
policy works well for all other new sub-systems.

> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

Yes, this is guaranteed. /dev/ioasid uAPI has no group concept.

> 
> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
> 
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.
> 
> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

yes

> 
> I didn't try to chart this out carefully.
> 
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface,
> so it would be quite reasonable to harmonize these somehow.

Possibly multiple IOASIDs, as a VFIO container can hold incompatible
devices today. Suppose helper functions will be provided for the VFIO
container to create IOASIDs and then use map/unmap to manage its I/O
page table. This is the shim iommu driver concept from the previous
discussion between you and Alex.

This can be done at a later stage. Let's focus on /dev/ioasid uAPI, and
bear some code duplication between it and vfio type1 for now. 
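
(As a loose sketch of that shim idea - every helper and struct name
below is invented, only struct vfio_iommu_type1_dma_map is the real
type1 uAPI type - the MAP_DMA path would simply forward into /dev/ioasid
helpers:

#include <linux/vfio.h>

/* invented container state: one container may be backed by several
 * IOASIDs, since it can hold mutually incompatible devices today */
struct vfio_shim_container {
	int ioasid;
};

extern int ioasid_map_pages(int ioasid, unsigned long iova,
			    unsigned long vaddr, size_t size); /* invented */

static int vfio_type1_shim_map_dma(struct vfio_shim_container *c,
				   struct vfio_iommu_type1_dma_map *map)
{
	return ioasid_map_pages(c->ioasid, map->iova, map->vaddr, map->size);
}

with unmap handled symmetrically.)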

> 
> But that is too complicated and far out for me at least to guess on at
> this point..

We're working on a prototype in parallel with this discussion. Based on
this work we'll figure out the best way to start.

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, June 2, 2021 1:42 AM
> 
> On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Saturday, May 29, 2021 1:36 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > >
> > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > software nesting. With hardware support the child and parent I/O page
> > > > tables are walked consecutively by the IOMMU to form a nested translation.
> > > > When it's implemented in software, the ioasid driver is responsible for
> > > > merging the two-level mappings into a single-level shadow I/O page table.
> > > > Software nesting requires both child/parent page tables operated through
> > > > the dma mapping protocol, so any change in either level can be captured
> > > > by the kernel to update the corresponding shadow mapping.
> > >
> > > Why? A SW emulation could do this synchronization during invalidation
> > > processing if invalidation contained an IOVA range.
> >
> > In this proposal we differentiate between host-managed and user-
> > managed I/O page tables. If host-managed, the user is expected to use
> > map/unmap cmd explicitly upon any change required on the page table.
> > If user-managed, the user first binds its page table to the IOMMU and
> > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > not required when changing a PTE from non-present to present).
> >
> > We expect user to use map+unmap and bind+invalidate respectively
> > instead of mixing them together. Following this policy, map+unmap
> > must be used in both levels for software nesting, so changes in either
> > level are captured timely to synchronize the shadow mapping.
> 
> map+unmap or bind+invalidate is a policy of the IOASID itself set when
> it is created. If you put two different types in a tree then each IOASID
> must continue to use its own operation mode.
> 
> I don't see a reason to force all IOASIDs in a tree to be consistent??

only for software nesting. With hardware support the parent uses map
while the child uses bind.

Yes, the policy is specified per IOASID. But if the policy violates the
requirement of a specific nesting mode, then nesting should fail.

> 
> A software emulated two level page table where the leaf level is a
> bound page table in guest memory should continue to use
> bind/invalidate to maintain the guest page table IOASID even though it
> is a SW construct.

with software nesting the leaf should be a host-managed page table
(or metadata). A bind/invalidate protocol doesn't require the user
to notify the kernel of every page table change. But for software
nesting the kernel must know about every change to timely update the
shadow/merged mapping; otherwise DMA may hit a stale mapping.
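
To make the stale-mapping hazard concrete, here is a toy userspace
model (plain C, not kernel code, all names invented) of a two-level
software nest, where the shadow table is the composition of the child
(IOVA->GPA) and parent (GPA->HPA) mappings:

#include <stdio.h>

#define NPAGES 8

/* toy one-entry-per-page "tables": value is the target page, -1 = not present */
static int child[NPAGES];  /* IOVA -> GPA, the leaf level */
static int parent[NPAGES]; /* GPA  -> HPA, the host-managed level */
static int shadow[NPAGES]; /* IOVA -> HPA, what the IOMMU actually walks */

/* the kernel can only keep this in sync if it observes every change
 * at either level - hence map/unmap (not bind/invalidate) on both */
static void update_shadow(void)
{
	int iova;

	for (iova = 0; iova < NPAGES; iova++) {
		int gpa = child[iova];

		shadow[iova] = (gpa >= 0) ? parent[gpa] : -1;
	}
}

int main(void)
{
	int i;

	for (i = 0; i < NPAGES; i++)
		child[i] = parent[i] = shadow[i] = -1;

	parent[2] = 7;   /* GPA 2 -> HPA 7 */
	child[0] = 2;    /* IOVA 0 -> GPA 2 */
	update_shadow(); /* shadow[0] == 7 */

	child[0] = -1;   /* leaf unmaps IOVA 0; if the kernel never sees
			  * this, shadow[0] still points at HPA 7 and DMA
			  * hits a stale mapping */
	update_shadow();
	printf("shadow[0] = %d\n", shadow[0]); /* -1 after resync */
	return 0;
}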

> 
> The GPA level should use map/unmap because it is a kernel owned page
> table

yes, this is always true.

> 
> Though how to efficiently mix map/unmap on the GPA when there are SW
> nested levels below it looks to be quite challenging.
> 

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, June 2, 2021 4:29 AM
> 
> On Tue, Jun 01, 2021 at 07:01:57AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Saturday, May 29, 2021 4:03 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > > /dev/ioasid provides a unified interface for managing I/O page
> > > > tables for devices assigned to userspace. Device passthrough
> > > > frameworks (VFIO, vDPA, etc.) are expected to use this interface
> > > > instead of creating their own logic to isolate untrusted device
> > > > DMAs initiated by userspace.
> > >
> > > It is very long, but I think this has turned out quite well. It
> > > certainly matches the basic sketch I had in my head when we were
> > > talking about how to create vDPA devices a few years ago.
> > >
> > > When you get down to the operations they all seem pretty common
> > > sense and straightforward. Create an IOASID. Connect to a device.
> > > Fill the IOASID with pages somehow. Worry about PASID labeling.
> > >
> > > It really is critical to get all the vendor IOMMU people to go over it
> > > and see how their HW features map into this.
> > >
> >
> > Agree. btw I feel it might be good to have several design opens
> > centrally discussed after going through all the comments. Otherwise
> > they may be buried in different sub-threads and potentially with
> > insufficient care (especially for people who haven't completed the
> > reading).
> >
> > I summarized five opens here, about:
> >
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> >
> > For 1), two alternative names are mentioned: /dev/iommu and
> > /dev/ioas. I don't have a strong preference and would like to hear
> > votes from all stakeholders. /dev/iommu is slightly better imho for
> > two reasons. First, per AMD's presentation in last KVM forum they
> > implement vIOMMU in hardware thus need to support user-managed
> > domains. An iommu uAPI notation might make more sense moving
> > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > IOASID_ALLOC_IOASID. :)
> 
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time. We've also talked about this as /dev/sva for a while and
> now /dev/ioasid
> 
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
> 
> Then we don't have naming aliasing with kernel constructs.
> 
> > For 2), Jason prefers to not blocking it if no kernel design reason. If
> > one device is allowed to bind multiple IOASID fd's, the main problem
> > is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
> > and giova_ioasid created in fd2 and then nesting them together (and
> 
> Huh? This can't happen
> 
> Creating an IOASID is an operation on the /dev/ioasid FD. We won't
> provide APIs to create a tree of IOASIDs outside a single FD container.
> 
> If a device can consume multiple IOASID's it doesn't care how many or
> what /dev/ioasid FDs they come from.

OK, this implies that if a user inadvertently creates an intended
parent/child via different fd's then the operation will simply fail.
More specifically, take ARM's case as an example. There is only a single
2nd-level I/O page table per device (nested by multiple 1st-level
tables). Say the user has already created a gpa_ioasid for a device via
fd1. Now he binds the device to fd2, intending to enable vSVA, which
requires nested translation and thus needs a parent created via fd2.
This parent creation will simply fail in the IOMMU layer because the
2nd-level (via fd1) is already installed for this device.

> 
> > To the other end there was also thought whether we should make
> > a single I/O address space per IOASID fd. This was discussed in previous
> > thread that #fd's are insufficient to afford theoretical 1M's address
> > spaces per device. But let's have another revisit and draw a clear
> > conclusion whether this option is viable.
> 
> I had remarks on this, I think per-fd doesn't work
> 
> > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > specify a device label. This label will be recorded in /dev/iommu to
> > serve per-device invalidation request from and report per-device
> > fault data to the user.
> 
> I wonder which of the user providing a 64 bit cookie or the kernel
> returning a small IDA is the best choice here? Both have merits
> depending on what qemu needs..

Yes, either way can work. I don't have a strong preference. Jean?

> 
> > In addition, vPASID (if provided by user) will
> > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > is conducted properly. e.g. invalidation request from user carries
> > a vPASID which must be converted into pPASID before calling iommu
> > driver. Vice versa for raw fault data which carries pPASID while the
> > user expects a vPASID.

Re: [RFC PATCH V3 09/11] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-06-01 Thread Boris Ostrovsky


On 5/30/21 11:06 AM, Tianyu Lan wrote:
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
>  
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -   NULL,
> +   hyperv_swiotlb_detect,
> pci_xen_swiotlb_init,
> NULL);


Could you explain this change?


-boris





[PATCH v4 4/6] iommu/arm-smmu-qcom: Add stall support

2021-06-01 Thread Rob Clark
From: Rob Clark 

Add, via the adreno-smmu-priv interface, a way for the GPU to request
the SMMU to stall translation on faults, and then later resume the
translation, either retrying or terminating the current translation.

This will be used on the GPU side to "freeze" the GPU while we snapshot
useful state for devcoredump.

Signed-off-by: Rob Clark 
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++
 include/linux/adreno-smmu-priv.h   |  7 +
 2 files changed, 40 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index b2e31ea84128..61fc645c1325 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -13,6 +13,7 @@ struct qcom_smmu {
struct arm_smmu_device smmu;
bool bypass_quirk;
u8 bypass_cbndx;
+   u32 stall_enabled;
 };
 
 static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
@@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
 static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
u32 reg)
 {
+   struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
+
/*
 * On the GPU device we want to process subsequent transactions after a
 * fault to keep the GPU from hanging
 */
reg |= ARM_SMMU_SCTLR_HUPCF;
 
+   if (qsmmu->stall_enabled & BIT(idx))
+   reg |= ARM_SMMU_SCTLR_CFCFG;
+
arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
@@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
 }
 
+static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
+{
+   struct arm_smmu_domain *smmu_domain = (void *)cookie;
+   struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+   struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
+
+   if (enabled)
+   qsmmu->stall_enabled |= BIT(cfg->cbndx);
+   else
+   qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
+}
+
+static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
+{
+   struct arm_smmu_domain *smmu_domain = (void *)cookie;
+   struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   u32 reg = 0;
+
+   if (terminate)
+   reg |= ARM_SMMU_RESUME_TERMINATE;
+
+   arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
+   priv->set_stall = qcom_adreno_smmu_set_stall;
+   priv->resume_translation = qcom_adreno_smmu_resume_translation;
 
return 0;
 }
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index 53fe32fb9214..c637e0997f6d 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
  * TTBR0 translation is enabled with the specified cfg
  * @get_fault_info: Called by the GPU fault handler to get information about
  *  the fault
+ * @set_stall: Configure whether stall on fault (CFCFG) is enabled.  Call
+ * before set_ttbr0_cfg().  If stalling on fault is enabled,
+ * the GPU driver must call resume_translation()
+ * @resume_translation: Resume translation after a fault
+ *
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -60,6 +65,8 @@ struct adreno_smmu_priv {
 const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
 int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
 void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
+void (*set_stall)(const void *cookie, bool enabled);
+void (*resume_translation)(const void *cookie, bool terminate);
 };
 
 #endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1
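
For context, a rough sketch of how the GPU side is expected to use these
two hooks (illustrative only, not the actual drm/msm code;
snapshot_gpu_state_for_devcoredump() stands in for the real crashdump
logic):

#include <linux/adreno-smmu-priv.h>

extern void snapshot_gpu_state_for_devcoredump(void); /* placeholder */

static void gpu_enable_stall(struct adreno_smmu_priv *priv)
{
	/* per the kerneldoc above, call before set_ttbr0_cfg() */
	priv->set_stall(priv->cookie, true);
}

static void gpu_fault_worker(struct adreno_smmu_priv *priv)
{
	/* translation is stalled, so the GPU is effectively frozen */
	snapshot_gpu_state_for_devcoredump();

	/* mandatory when stalling is enabled; true = terminate the
	 * faulting transaction rather than retry it */
	priv->resume_translation(priv->cookie, true);
}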



[PATCH v4 2/6] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info

2021-06-01 Thread Rob Clark
From: Jordan Crouse 

Add a callback in adreno-smmu-priv to read interesting SMMU
registers to provide an opportunity for a richer debug experience
in the GPU driver.

Signed-off-by: Jordan Crouse 
Signed-off-by: Rob Clark 
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 
 drivers/iommu/arm/arm-smmu/arm-smmu.h  |  2 ++
 include/linux/adreno-smmu-priv.h   | 31 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..b2e31ea84128 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
+static void qcom_adreno_smmu_get_fault_info(const void *cookie,
+   struct adreno_smmu_fault_info *info)
+{
+   struct arm_smmu_domain *smmu_domain = (void *)cookie;
+   struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
+   info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
+   info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
+   info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
+   info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
+   info->ttbr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
+   info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
priv->cookie = smmu_domain;
priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
+   priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
 
return 0;
 }
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index c31a59d35c64..84c21c4b0691 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
 #define ARM_SMMU_CB_FSYNR0 0x68
 #define ARM_SMMU_FSYNR0_WNRBIT(4)
 
+#define ARM_SMMU_CB_FSYNR1 0x6c
+
 #define ARM_SMMU_CB_S1_TLBIVA  0x600
 #define ARM_SMMU_CB_S1_TLBIASID0x610
 #define ARM_SMMU_CB_S1_TLBIVAL 0x620
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index a889f28afb42..53fe32fb9214 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -8,6 +8,32 @@
 
 #include <linux/io-pgtable.h>
 
+/**
+ * struct adreno_smmu_fault_info - container for key fault information
+ *
+ * @far: The faulting IOVA from ARM_SMMU_CB_FAR
+ * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
+ * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
+ * @fsr: The fault status from ARM_SMMU_CB_FSR
+ * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
+ * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR1
+ * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
+ *
+ * This struct passes back key page fault information to the GPU driver
+ * through the get_fault_info function pointer.
+ * The GPU driver can use this information to print informative
+ * log messages and provide deeper GPU specific insight into the fault.
+ */
+struct adreno_smmu_fault_info {
+   u64 far;
+   u64 ttbr0;
+   u32 contextidr;
+   u32 fsr;
+   u32 fsynr0;
+   u32 fsynr1;
+   u32 cbfrsynra;
+};
+
 /**
  * struct adreno_smmu_priv - private interface between adreno-smmu and GPU
  *
@@ -17,6 +43,8 @@
  * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank.  A
  * NULL config disables TTBR0 translation, otherwise
  * TTBR0 translation is enabled with the specified cfg
+ * @get_fault_info: Called by the GPU fault handler to get information about
+ *  the fault
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -31,6 +59,7 @@ struct adreno_smmu_priv {
 const void *cookie;
 const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
 int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
+void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
 };
 
-#endif /* __ADRENO_SMMU_PRIV_H */
\ No newline at end of file
+#endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1
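
As a usage illustration (not the actual drm/msm handler; the function
name here is invented), a GPU fault handler could pull and log these
registers like so:

#include <linux/adreno-smmu-priv.h>
#include <linux/printk.h>

static void gpu_log_fault_info(struct adreno_smmu_priv *priv,
			       unsigned long iova)
{
	struct adreno_smmu_fault_info info;

	priv->get_fault_info(priv->cookie, &info);

	pr_err("gpu fault: iova=%#lx far=%#llx ttbr0=%#llx fsr=%#x fsynr0=%#x fsynr1=%#x cbfrsynra=%#x\n",
	       iova,
	       (unsigned long long)info.far,
	       (unsigned long long)info.ttbr0,
	       info.fsr, info.fsynr0, info.fsynr1, info.cbfrsynra);
}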


[PATCH v4 1/6] iommu/arm-smmu: Add support for driver IOMMU fault handlers

2021-06-01 Thread Rob Clark
From: Jordan Crouse 

Call report_iommu_fault() to allow upper-level drivers to register their
own fault handlers.

Signed-off-by: Jordan Crouse 
Signed-off-by: Rob Clark 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..b4b32d31fc06 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct arm_smmu_device *smmu = smmu_domain->smmu;
int idx = smmu_domain->cfg.cbndx;
+   int ret;
 
fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
if (!(fsr & ARM_SMMU_FSR_FAULT))
@@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
 
-   dev_err_ratelimited(smmu->dev,
-   "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
+   ret = report_iommu_fault(domain, NULL, iova,
+   fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+
+   if (ret == -ENOSYS)
+   dev_err_ratelimited(smmu->dev,
+   "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
fsr, iova, fsynr, cbfrsynra, idx);
 
arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
-- 
2.31.1
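
For reference, a sketch of how an upper-level driver would hook in via
the existing iommu_set_fault_handler() API (the my_gpu type and handler
body are invented for illustration). With this patch, the "Unhandled
context fault" print only fires when report_iommu_fault() returns
-ENOSYS, i.e. when no handler is registered:

#include <linux/iommu.h>

struct my_gpu {
	struct device *dev; /* placeholder driver context */
};

static int my_gpu_iommu_fault(struct iommu_domain *domain,
			      struct device *dev, unsigned long iova,
			      int flags, void *token)
{
	struct my_gpu *gpu = token;

	dev_warn(gpu->dev, "iommu %s fault at %#lx\n",
		 (flags & IOMMU_FAULT_WRITE) ? "write" : "read", iova);

	return 0; /* handled: suppresses the ratelimited dev_err */
}

static void my_gpu_init_fault_handling(struct iommu_domain *domain,
				       struct my_gpu *gpu)
{
	iommu_set_fault_handler(domain, my_gpu_iommu_fault, gpu);
}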



[PATCH v4 0/6] iommu/arm-smmu: adreno-smmu page fault handling

2021-06-01 Thread Rob Clark
From: Rob Clark 

This picks up an earlier series[1] from Jordan, and adds additional
support needed to generate GPU devcore dumps on iova faults.  Original
description:

This is a patch stack that adds an Adreno-GPU-specific handler for
pagefaults. The first patch starts by wiring up report_iommu_fault for
arm-smmu. The next patch adds an adreno-smmu-priv function hook to
capture a handful of important debugging registers such as TTBR0,
CONTEXTIDR, FSYNR0 and others. This is used by the third patch to print
more detailed information on a page fault, such as the TTBR0 for the
pagetable that caused the fault and the source of the fault as
determined by a combination of the FSYNR1 register and an internal GPU
register.

This code provides a solid base that we can expand on later for even more
extensive GPU side page fault debugging capabilities.

v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
resume translation after it has had a chance to snapshot the GPU's
state
v3: Always clear FSR even if the target driver is going to handle resume
v2: Fix comment wording and function pointer check per Rob Clark

[1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcro...@codeaurora.org/

Jordan Crouse (3):
  iommu/arm-smmu: Add support for driver IOMMU fault handlers
  iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
info
  drm/msm: Improve the a6xx page fault handler

Rob Clark (3):
  iommu/arm-smmu-qcom: Add stall support
  drm/msm: Add crashdump support for stalled SMMU
  drm/msm: devcoredump iommu fault support

 drivers/gpu/drm/msm/adreno/a2xx_gpu.c   |   2 +-
 drivers/gpu/drm/msm/adreno/a3xx_gpu.c   |   2 +-
 drivers/gpu/drm/msm/adreno/a4xx_gpu.c   |   2 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c   |   9 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c   | 101 +++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.h   |   2 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  43 +++--
 drivers/gpu/drm/msm/adreno/adreno_gpu.c |  15 +++
 drivers/gpu/drm/msm/msm_debugfs.c   |   2 +-
 drivers/gpu/drm/msm/msm_gem.h   |   1 +
 drivers/gpu/drm/msm/msm_gem_submit.c|   1 +
 drivers/gpu/drm/msm/msm_gpu.c   |  55 ++-
 drivers/gpu/drm/msm/msm_gpu.h   |  19 +++-
 drivers/gpu/drm/msm/msm_gpummu.c|   5 +
 drivers/gpu/drm/msm/msm_iommu.c |  22 -
 drivers/gpu/drm/msm/msm_mmu.h   |   5 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |   9 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.h   |   2 +
 include/linux/adreno-smmu-priv.h|  38 +++-
 20 files changed, 354 insertions(+), 31 deletions(-)

-- 
2.31.1



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Alex Williamson
On Tue, 1 Jun 2021 07:01:57 +
"Tian, Kevin"  wrote:
> 
> I summarized five opens here, about:
> 
> 1)  Finalizing the name to replace /dev/ioasid;
> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> 3)  Carry device information in invalidation/fault reporting uAPI;
> 4)  What should/could be specified when allocating an IOASID;
> 5)  The protocol between vfio group and kvm;
> 
...
> 
> For 5), I'd expect Alex to chime in. Per my understanding, it looks like
> the original purpose of this protocol is not about I/O address space. It's
> for KVM to know whether any device is assigned to this VM and then
> do something special (e.g. posted interrupt, EPT cache attribute, etc.).

Right, the original use case was for KVM to determine whether it needs
to emulate invlpg, so it needs to be aware when an assigned device is
present and be able to test if DMA for that device is cache coherent.
The user, QEMU, creates a KVM "pseudo" device representing the vfio
group, providing the file descriptor of that group to show ownership.
The ugly symbol_get code is to avoid hard module dependencies, ie. the
kvm module should not pull in or require the vfio module, but vfio will
be present if attempting to register this device.

With kvmgt, the interface also became a way to register the kvm pointer
with vfio for the translation mentioned elsewhere in this thread.

The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
page table so that it can handle iotlb programming from pre-registered
memory without trapping out to userspace.

> Because KVM deduces some policy based on the fact of an assigned device,
> it needs to hold a reference to the related vfio group. This part is
> irrelevant to this RFC.

All of these use cases are related to the IOMMU, whether DMA is
coherent, translating device IOVA to GPA, and an acceleration path to
emulate IOMMU programming in kernel... they seem pretty relevant.

> But ARM's VMID usage is related to I/O address space thus needs some
> consideration. Another strange thing is about PPC. It looks like it also
> leverages this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> group. I don't know why it's done through KVM instead of VFIO uAPI in
> the first place.

AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
to know how to handle those for in-kernel acceleration.  Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 07:01:57AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Saturday, May 29, 2021 4:03 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > /dev/ioasid provides a unified interface for managing I/O page tables for
> > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > vDPA,
> > > etc.) are expected to use this interface instead of creating their own 
> > > logic to
> > > isolate untrusted device DMAs initiated by userspace.
> > 
> > It is very long, but I think this has turned out quite well. It
> > certainly matches the basic sketch I had in my head when we were
> > talking about how to create vDPA devices a few years ago.
> > 
> > When you get down to the operations they all seem pretty common sense
> > and straightforward. Create an IOASID. Connect to a device. Fill the
> > IOASID with pages somehow. Worry about PASID labeling.
> > 
> > It really is critical to get all the vendor IOMMU people to go over it
> > and see how their HW features map into this.
> > 
> 
> Agree. btw I feel it might be good to have several design opens 
> centrally discussed after going through all the comments. Otherwise 
> they may be buried in different sub-threads and potentially with 
> insufficient care (especially for people who haven't completed the
> reading).
> 
> I summarized five opens here, about:
> 
> 1)  Finalizing the name to replace /dev/ioasid;
> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> 3)  Carry device information in invalidation/fault reporting uAPI;
> 4)  What should/could be specified when allocating an IOASID;
> 5)  The protocol between vfio group and kvm;
> 
> For 1), two alternative names are mentioned: /dev/iommu and 
> /dev/ioas. I don't have a strong preference and would like to hear 
> votes from all stakeholders. /dev/iommu is slightly better imho for 
> two reasons. First, per AMD's presentation in last KVM forum they 
> implement vIOMMU in hardware thus need to support user-managed 
> domains. An iommu uAPI notation might make more sense moving 
> forward. Second, it makes later uAPI naming easier as 'IOASID' can 
> be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of 
> IOASID_ALLOC_IOASID. :)

I think two years ago I suggested /dev/iommu and it didn't go very far
at the time. We've also talked about this as /dev/sva for a while and
now /dev/ioasid

I think /dev/iommu is fine, and call the things inside them IOAS
objects.

Then we don't have naming aliasing with kernel constructs.
 
> For 2), Jason prefers to not blocking it if no kernel design reason. If 
> one device is allowed to bind multiple IOASID fd's, the main problem
> is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1 
> and giova_ioasid created in fd2 and then nesting them together (and

Huh? This can't happen

Creating an IOASID is an operation on the /dev/ioasid FD. We won't
provide APIs to create a tree of IOASID's outside a single FD container.

If a device can consume multiple IOASID's it doesn't care how many or
what /dev/ioasid FDs they come from.

> To the other end there was also thought whether we should make
> a single I/O address space per IOASID fd. This was discussed in previous
> thread that #fd's are insufficient to afford theoretical 1M's address
> spaces per device. But let's have another revisit and draw a clear
> conclusion whether this option is viable.

I had remarks on this, I think per-fd doesn't work
 
> This implies that VFIO_BOUND_IOASID will be extended to allow user
> specify a device label. This label will be recorded in /dev/iommu to
> serve per-device invalidation request from and report per-device 
> fault data to the user.

I wonder which of the user providing a 64 bit cookie or the kernel
returning a small IDA is the best choice here? Both have merits
depending on what qemu needs..

> In addition, vPASID (if provided by user) will
> be also recorded in /dev/iommu so vPASID<->pPASID conversion 
> is conducted properly. e.g. invalidation request from user carries
> a vPASID which must be converted into pPASID before calling iommu
> driver. Vice versa for raw fault data which carries pPASID while the
> user expects a vPASID.

I don't think the PASID should be returned at all. It should return
the IOASID number in the FD and/or a u64 cookie associated with that
IOASID. Userspace should figure out what the IOASID & device
combination means.

> Seems to close this design open we have to touch the kAPI design. and 
> Joerg's input is highly appreciated here.

uAPI is forever, the kAPI is constantly changing. I always dislike
warping the uAPI based on the current kAPI situation.

Jason


Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes

2021-06-01 Thread Nadav Amit


> On Jun 1, 2021, at 10:27 AM, Robin Murphy  wrote:
> 
> On 2021-06-01 17:39, Nadav Amit wrote:
>>> On Jun 1, 2021, at 8:59 AM, Robin Murphy  wrote:
>>> 
>>> On 2021-05-02 07:59, Nadav Amit wrote:
>>>> From: Nadav Amit 
>>>> Some IOMMU architectures perform invalidations regardless of the page
>>>> size. In such architectures there is no need to sync when the page size
>>>> changes or to regard pgsize when making interim flush in
>>>> iommu_iotlb_gather_add_page().
>>>> Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide
>>>> whether gather's pgsize is tracked and triggers a flush.
>>> 
>>> I've objected before[1], and I'll readily object again ;)
>>> 
>>> I still think it's very silly to add a bunch of indirection all over the 
>>> place to make a helper function not do the main thing it's intended to help 
>>> with. If you only need trivial address gathering then it's far simpler to 
>>> just implement trivial address gathering. I suppose if you really want to 
>>> you could factor out another helper to share the 5 lines of code which 
>>> ended up in mtk-iommu (see commit f21ae3b10084).
>> Thanks, Robin.
>> I read your comments but I cannot fully understand the alternative that you 
>> propose, although I do understand your objection to the indirection 
>> “ignore_gather_pgsize”. Would it be ok if “ignore_gather_pgsize" was 
>> provided as an argument for iommu_iotlb_gather_add_page()?
> 
> No, I mean if iommu_iotlb_gather_add_page() doesn't have the behaviour your 
> driver wants, just don't call it. Write or factor out a suitable helper that 
> *does* do what you want and call that, or just implement the logic directly 
> inline. Indirect argument or not, it just doesn't make much sense to have a 
> helper function call which says "do this except don't do most of it".
> 
>> In general, I can live without this patch. It probably would have negligent 
>> impact on performance anyhow.
> 
> As I implied, it sounds like your needs are the same as the Mediatek driver 
> had, so you could readily factor out a new page-size-agnostic gather helper 
> from that. I fully support making the functional change to amd-iommu 
> *somehow* - nobody likes unnecessary syncs - just not with this particular 
> implementation :)

Hm… to avoid code duplication I need to extract some common code into
another function.

Does the following resemble what you had in mind (untested)?

-- >8 --

Subject: [PATCH] iommu: add iommu_iotlb_gather_add_page_ignore_pgsize()

---
 drivers/iommu/mtk_iommu.c |  7 ++---
 include/linux/iommu.h | 55 ++-
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e168a682806a..5890e745bed3 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -520,12 +520,9 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
  struct iommu_iotlb_gather *gather)
 {
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
-   unsigned long end = iova + size - 1;

-   if (gather->start > iova)
-   gather->start = iova;
-   if (gather->end < end)
-   gather->end = end;
+   iommu_iotlb_gather_update_range(gather, iova, size);
+
return dom->iop->unmap(dom->iop, iova, size, gather);
 }

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9ca6e6b8084d..037434b6eb4c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -535,29 +535,58 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
iommu_iotlb_gather_init(iotlb_gather);
 }

-static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+static inline
+void iommu_iotlb_gather_update_range(struct iommu_iotlb_gather *gather,
+unsigned long iova, size_t size)
+{
+   unsigned long start = iova, end = start + size - 1;
+
+   if (gather->end < end)
+   gather->end = end;
+
+   if (gather->start > start)
+   gather->start = start;
+
+   gather->pgsize = size;
+}
+
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   return iova + size < gather->start || iova > gather->end + 1;
+}
+
+static inline
+void iommu_iotlb_gather_add_page_ignore_pgsize(struct iommu_domain *domain,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size)
 {
-   unsigned long start = iova, end = start + size - 1;
+   /*
+* Only if the new page is disjoint from the current range, then sync
+* the TLB so that the gather structure can be rewritten.
+*/
+   if (iommu_iotlb_gather_is_disjoint(gather, iova, size) && gather->pgsize)
+   iommu_iotlb_sync(domain, gather);
+
+   iommu_iotlb_gather_update_range(gather, iova, size);
+}

Re: [PATCH 1/1] dma-contiguous: return early for dt case in dma_contiguous_reserve

2021-06-01 Thread Robin Murphy

On 2021-05-31 10:21, Dong Aisheng wrote:

On Tue, May 18, 2021 at 7:29 PM Dong Aisheng  wrote:


dma_contiguous_reserve() aims to support the cmdline case for CMA memory
reservation. But if the user defines reserved memory in DT,
'dma_contiguous_default_area' will not be 0, and it is then meaningless
to continue running dma_contiguous_reserve(). So return early if
'dma_contiguous_default_area' is nonzero.

Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Dong Aisheng 


Gently ping


The commit message is still wrong, and I still think the change doesn't 
achieve anything meaningful. This code is hard to make sense of either 
way because the crucial interplay between size_cmdline and 
dma_contiguous_default_area is hidden somewhere else entirely, and it 
would take a much more significant refactoring to clear that up.


Robin.



Regards
Aisheng


---
  kernel/dma/contiguous.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 3d63d91cba5c..ebade9f43eff 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -171,6 +171,9 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 phys_addr_t selected_limit = limit;
 bool fixed = false;

+   if (dma_contiguous_default_area)
+   return;
+
 pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);

 if (size_cmdline != -1) {
@@ -191,7 +194,7 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
  #endif
 }

-   if (selected_size && !dma_contiguous_default_area) {
+   if (selected_size) {
 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
  (unsigned long)selected_size / SZ_1M);

--
2.25.1




Re: Regression 5.12.0-rc4 net: ice: significant throughput drop

2021-06-01 Thread Jussi Maki
Hi Robin,

On Tue, Jun 1, 2021 at 2:39 PM Robin Murphy  wrote:
> >> The regression shows as a significant drop in throughput as measured
> >> with "super_netperf" [0],
> >> with measured bandwidth of ~95Gbps before and ~35Gbps after:
>
> I guess that must be the difference between using the flush queue
> vs. strict invalidation. On closer inspection, it seems to me that
> there's a subtle pre-existing bug in the AMD IOMMU driver, in that
> amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api()
> has called bus_set_iommu(). Does the patch below work?

Thanks for the quick response & patch. I tried it out and indeed it
does solve the issue:

# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu+ #31 SMP Tue Jun 1 17:12:57
UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
95341.2

root@zh-lab-node-3:~# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu-unpatched #32 SMP Tue Jun 1
17:29:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
33989.5


Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-01 Thread Thierry Reding
On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> > On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> > > From: Thierry Reding 
> > > 
> > > Hi,
> > > 
> > > this is a set of patches that is the result of earlier discussions
> > > regarding early identity mappings that are needed to avoid SMMU faults
> > > during early boot.
> > > 
> > > The goal here is to avoid early identity mappings altogether and instead
> > > postpone the need for the identity mappings to when devices are attached
> > > to the SMMU. This works by making the SMMU driver coordinate with the
> > > memory controller driver on when to start enforcing SMMU translations.
> > > This makes Tegra behave in a more standard way and pushes the code to
> > > deal with the Tegra-specific programming into the NVIDIA SMMU
> > > implementation.
> > > 
> > > Compared to the original version of these patches, I've split the
> > > preparatory work into a separate patch series because it became very
> > > large and will be mostly uninteresting for this audience.
> > > 
> > > Patch 1 provides a mechanism to program SID overrides at runtime. Patch
> > > 2 updates the ARM SMMU device tree bindings to include the Tegra186
> > > compatible string as suggested by Robin during review.
> > > 
> > > Patches 3 and 4 create the fundamentals in the SMMU driver to support
> > > this and also make this functionality available on Tegra186. Patch 5
> > > hooks the ARM SMMU up to the memory controller so that the memory client
> > > stream ID overrides can be programmed at the right time.
> > > 
> > > Patch 6 extends this mechanism to Tegra186 and patches 7-9 enable all of
> > > this through device tree updates. Patch 10 is included here to show how
> > > SMMU will be enabled for display controllers. However, it cannot be
> > > applied yet because the code to create identity mappings for potentially
> > > live framebuffers hasn't been merged yet.
> > > 
> > > The end result is that various peripherals will have SMMU enabled, while
> > > the display controllers will keep using passthrough, as initially set up
> > > by firmware. Once the device tree bindings have been accepted and the
> > > SMMU driver has been updated to create identity mappings for the display
> > > controllers, they can be hooked up to the SMMU and the code in this
> > > series will automatically program the SID overrides to enable SMMU
> > > translations at the right time.
> > > 
> > > Note that the series creates a compile time dependency between the
> > > memory controller and IOMMU trees. If it helps I can provide a branch
> > > for each tree, modelling the dependency, once the series has been
> > > reviewed.
> > > 
> > > Changes in v2:
> > > - split off the preparatory work into a separate series (that needs to
> > >   be applied first)
> > > - address review comments by Robin
> > > 
> > > Thierry
> > > 
> > > Thierry Reding (10):
> > >   memory: tegra: Implement SID override programming
> > >   dt-bindings: arm-smmu: Add Tegra186 compatible string
> > >   iommu/arm-smmu: Implement ->probe_finalize()
> > >   iommu/arm-smmu: tegra: Detect number of instances at runtime
> > >   iommu/arm-smmu: tegra: Implement SID override programming
> > >   iommu/arm-smmu: Use Tegra implementation on Tegra186
> > >   arm64: tegra: Use correct compatible string for Tegra186 SMMU
> > >   arm64: tegra: Hook up memory controller to SMMU on Tegra186
> > >   arm64: tegra: Enable SMMU support on Tegra194
> > >   arm64: tegra: Enable SMMU support for display on Tegra194
> > > 
> > >  .../devicetree/bindings/iommu/arm,smmu.yaml   |  11 +-
> > >  arch/arm64/boot/dts/nvidia/tegra186.dtsi  |   4 +-
> > >  arch/arm64/boot/dts/nvidia/tegra194.dtsi  | 166 ++
> > >  drivers/iommu/arm/arm-smmu/arm-smmu-impl.c|   3 +-
> > >  drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c  |  90 --
> > >  drivers/iommu/arm/arm-smmu/arm-smmu.c |  13 ++
> > >  drivers/iommu/arm/arm-smmu/arm-smmu.h |   1 +
> > >  drivers/memory/tegra/mc.c |   9 +
> > >  drivers/memory/tegra/tegra186.c   |  72 
> > >  include/soc/tegra/mc.h|   3 +
> > >  10 files changed, 349 insertions(+), 23 deletions(-)
> > 
> > Will, Robin,
> > 
> > do you have any more comments on the ARM SMMU bits of this series? If
> > not, can you guys provide an Acked-by so that Krzysztof can pick this
> > (modulo the DT patches) up into the memory-controller tree for v5.14?
> > 
> > I'll send out a v3 with the bisectability fix that Krishna pointed
> > out.
> 
> Probably best if I queue 3-6 on a separate branch once you send a v3,
> then Krzysztof can pull that in if he needs it.

Patch 5 has a build-time dependency on patch 1, so they need to go in
together. The reason why I suggested Krzysztof pick these up is because
there is a restructuring series that this depends on, which will go 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 08:38:00AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Saturday, May 29, 2021 3:59 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > >
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > >   device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > >
> > > As explained earlier, one IOASID fd is sufficient for all intended use 
> > > cases:
> > >
> > >   ioasid_fd = open("/dev/ioasid", mode);
> > >
> > > For simplicity below examples are all made for the virtualization story.
> > > They are representative and could be easily adapted to a 
> > > non-virtualization
> > > scenario.
> > 
> > For others, I don't think this is *strictly* necessary, we can
> > probably still get to the device_fd using the group_fd and fit in
> > /dev/ioasid. It does make the rest of this more readable though.
> 
> Jason, want to confirm here. Per earlier discussion we remain under the
> impression that you want VFIO to be a pure device driver, thus
> container/group are used only for legacy applications.

Let me call this a "nice wish".

If you get to a point where you hard need this, then identify the hard
requirement and let's do it, but I wouldn't bloat this already large
project unnecessarily.

Similarly I wouldn't depend on the group fd existing in this design
so it could be changed later.

> From this comment are you suggesting that VFIO can still keep
> container/ group concepts and user just deprecates the use of vfio
> iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> simple policy that an IOASID will reject cmd if partially-attached
> group exists)?

I would say no on the container. /dev/ioasid == the container, having
two competing objects at once in a single process is just a mess.

Keeping the group fd requires charting a path through the ioctls
where the container is not used and /dev/ioasid is subbed in
using the same device-FD-specific IOCTLs you show here.

I didn't try to chart this out carefully.
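
Though as a first sketch, reusing the device fd open from your section 5
and the VFIO_BIND_IOASID_FD/VFIO_ATTACH_IOASID calls from the RFC's
flows (names and args are placeholders, nothing here is settled):

	ioasid_fd = open("/dev/ioasid", O_RDWR);
	device_fd = open("/dev/vfio/devices/dev1", O_RDWR);

	/* bind to the IOASID fd where the container fd used to go */
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &ioasid_fd);

	/* one IOASID for the GPA space; DMA maps go via /dev/ioasid */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	ioctl(device_fd, VFIO_ATTACH_IOASID, &gpa_ioasid);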

Also, ultimately, something needs to be done about compatibility with
the vfio container fd. It looks clear enough to me that the VFIO
container FD is just a single IOASID using a special ioctl interface,
so it would be quite reasonable to harmonize these somehow.

But that is too complicated and far out for me at least to guess on at
this point..

> > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > there any scenario where we want different vpasid's for the same
> > IOASID? I guess it is OK like this. Hum.
> 
> Yes, it's completely sane that the guest links an I/O page table to 
> different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> that when multiple devices share an I/O page table they must use
> the same PASID#. 

Ok..

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu: Print default strict or lazy mode at init time

2021-06-01 Thread John Garry


pr_info("DMA domain default TLB invalidation policy: %s mode %s\n",
   iommu_dma_strict ? "strict" : "lazy",
    (iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
    "(set via kernel command line)" : "");

I think it's worth mentioning "default" somewhere, as not all IOMMUs 
or devices will use lazy mode even if it's default.


But that's part of what I think is misleading - I boot and see that the 
default is something, so I reboot with iommu.strict to explicitly set it 
the other way, but now that's the default... huh?


The way I see it, we're saying what the current IOMMU API policy is - 
the value of iommu_dma_strict at any given time is fact - but we're not 
necessarily saying how widely that policy is enforced. We similarly 
report the type for default domains from global policy even though that 
may also be overridden per-group by drivers and/or userspace later; we 
don't say it's the *default* default domain type.


I think that this is the behavior a user would understand from that 
message.


However, from a glance at the Intel IOMMU driver, it seems possible to 
change the default domain type after iommu_subsys_init().




However, having now debugged the AMD issue from another thread, I think 
doing this at subsys_initcall is in fact going to be too early to be 
meaningful, since it ignores drivers' ability to change the global 
policy :(


A user may still learn the IOMMU group domain type from sysfs. There is 
no such thing for the TLB invalidation mode - how about adding a file for 
this? It would be useful.
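
Untested sketch of such a file, following the existing iommu_group
attribute pattern (the attribute name is my invention, and it assumes
the global iommu_dma_strict is what applies to the group):

static ssize_t iommu_group_show_dma_mode(struct iommu_group *group,
					 char *buf)
{
	/* report the API-wide policy; per-driver overrides not visible */
	return sprintf(buf, "%s\n", iommu_dma_strict ? "strict" : "lazy");
}

static IOMMU_GROUP_ATTR(dma_mode, 0444, iommu_group_show_dma_mode, NULL);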


Thanks,
John
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Saturday, May 29, 2021 1:36 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > 
> > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > software nesting. With hardware support the child and parent I/O page
> > > tables are walked consecutively by the IOMMU to form a nested translation.
> > > When it's implemented in software, the ioasid driver is responsible for
> > > merging the two-level mappings into a single-level shadow I/O page table.
> > > Software nesting requires both child/parent page tables operated through
> > > the dma mapping protocol, so any change in either level can be captured
> > > by the kernel to update the corresponding shadow mapping.
> > 
> > Why? A SW emulation could do this synchronization during invalidation
> > processing if invalidation contained an IOVA range.
> 
> In this proposal we differentiate between host-managed and user-
> managed I/O page tables. If host-managed, the user is expected to use
> map/unmap cmd explicitly upon any change required on the page table. 
> If user-managed, the user first binds its page table to the IOMMU and 
> then use invalidation cmd to flush iotlb when necessary (e.g. typically
> not required when changing a PTE from non-present to present).
> 
> We expect user to use map+unmap and bind+invalidate respectively
> instead of mixing them together. Following this policy, map+unmap
> must be used in both levels for software nesting, so changes in either 
> level are captured timely to synchronize the shadow mapping.

map+unmap or bind+invalidate is a policy of the IOASID itself set when
it is created. If you put two different types in a tree then each IOASID
must continue to use its own operation mode.

I don't see a reason to force all IOASIDs in a tree to be consistent??

A software emulated two level page table where the leaf level is a
bound page table in guest memory should continue to use
bind/invalidate to maintain the guest page table IOASID even though it
is a SW construct.

The GPA level should use map/unmap because it is a kernel owned page
table
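
Roughly, with the command names taken from the RFC's list (arguments
elided; any mismatch with the proposal is my shorthand):

	/* GPA level: kernel-owned I/O page table, host-managed mode */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	ioctl(ioasid_fd, IOASID_MAP_DMA, &map);		/* map.ioasid = gpa_ioasid */

	/* guest level: guest-owned page table, user-managed mode */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, gpa_ioasid);
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind);	/* bind.ioasid = giova_ioasid */
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv);/* on guest PTE change */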

Though how to efficiently mix map/unmap on the GPA when there are SW
nested levels below it looks to be quite challenging.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 12:04:00PM +, Parav Pandit wrote:
> 
> 
> > From: Jason Gunthorpe 
> > Sent: Monday, May 31, 2021 11:43 PM
> > 
> > On Mon, May 31, 2021 at 05:37:35PM +, Parav Pandit wrote:
> > 
> > > In that case, can it be a new system call? Why does it have to be under
> > /dev/ioasid?
> > > For example few years back such system call mpin() thought was proposed
> > in [1].
> > 
> > Reference counting of the overall pins are required
> > 
> > So when a pinned pages is incorporated into an IOASID page table in a later
> > IOCTL it means it cannot be unpinned while the IOASID page table is using 
> > it.
> Ok, but can't it use the same refcount that the mmu uses?

Manipulating that refcount is part of the overhead that is trying to
be avoided here, plus ensuring that the pinned pages accounting
doesn't get out of sync with the actual account of pinned pages!

> > Then the ioasid's interval tree would be mapped into a page table tree in HW
> > format.

> Does it mean in simple use case [1], second level page table copy is
> maintained in the IOMMU side via map interface?  I hope not. It
> should use the same as what mmu uses, right?

Not a full page by page copy, but some interval reference.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:

> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults. 

I'm still confused why drivers need fault handlers at all?

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
 
> We can open up to ~0U file descriptors, I don't see why we need to restrict
> it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select don't work at all
anymore and IIRC there are more complications.

A huge number of FDs for typical usages should be avoided.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Parav Pandit
> From: Tian, Kevin 
> Sent: Thursday, May 27, 2021 1:28 PM

> 5.6. I/O page fault
> +++
> 
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid,
> pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
> information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), 
> which
> is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates an user fault_data (ioasid, addr), links it
> to the shared ring buffer and triggers eventfd to userspace;
> 
> -   Upon received event, Qemu needs to find the virtual routing information
> (v_rid + v_pasid) of the device attached to the faulting ioasid. If there 
> are
> multiple, pick a random one. This should be fine since the purpose is to
> fix the I/O page table on the guest;
> 
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
> carrying the virtual fault data (v_rid, v_pasid, addr);
> 
Why does it have to be through the vIOMMU?
For a VFIO PCI device, have you considered reusing the same PRI interface to 
inject page faults into the guest?
This eliminates any new v_rid.
It will also route the page fault request and response through the right vfio 
device.

> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
> then sends a page response with virtual completion data (v_rid, v_pasid,
> response_code) to vIOMMU;
> 
What about fixing up the fault in the MMU page table as well in the guest?
Or did you mean both when you said "updates the I/O page table" above?

It is unclear to me whether a single nested page table is maintained or two 
(one for CR3 references and another for the IOMMU).
Can you please clarify?

> -   Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> complete the pending fault;
> 
For a VFIO PCI device, a virtual PRI request/response interface would do; it 
can be a generic interface among multiple vIOMMUs.

> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};
>
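
For illustration, the representations in the flow above could be sketched
as below - the uAPI is explicitly TBD, so the field layout is only a guess:

struct iommu_fault_raw {		/* as seen by the host IOMMU driver */
	__u32 rid;			/* requester ID */
	__u32 pasid;
	__u64 addr;
};

struct ioasid_fault_event {		/* put on the shared ring for user */
	__u32 ioasid;
	__u64 addr;
};

struct ioasid_fault_response {		/* completion via /dev/ioasid ioctl */
	__u32 ioasid;
	__u32 response_code;		/* success/invalid/failure */
};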
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:

> For the case of 1M, I would like to know what's the use case for a single
> process to handle 1M+ address spaces?

For some scenarios every guest PASID will require an IOASID ID # so
there is a large enough demand that FDs alone are not a good fit.

Further, there are global container-wide properties that are hard to
carry over to a multi-FD model, like the attachment of devices to the
container at startup.

> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
> 
> If the container and address space is 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in
the container for a parent/child type relationship

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes

2021-06-01 Thread Robin Murphy

On 2021-06-01 17:39, Nadav Amit wrote:




On Jun 1, 2021, at 8:59 AM, Robin Murphy  wrote:

On 2021-05-02 07:59, Nadav Amit wrote:

From: Nadav Amit 
Some IOMMU architectures perform invalidations regardless of the page
size. In such architectures there is no need to sync when the page size
changes or to regard pgsize when making interim flush in
iommu_iotlb_gather_add_page().
Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide
whether gather's pgsize is tracked and triggers a flush.


I've objected before[1], and I'll readily object again ;)

I still think it's very silly to add a bunch of indirection all over the place 
to make a helper function not do the main thing it's intended to help with. If 
you only need trivial address gathering then it's far simpler to just implement 
trivial address gathering. I suppose if you really want to you could factor out 
another helper to share the 5 lines of code which ended up in mtk-iommu (see 
commit f21ae3b10084).


Thanks, Robin.

I read your comments but I cannot fully understand the alternative that you propose, 
although I do understand your objection to the indirection “ignore_gather_pgsize”. 
Would it be ok if “ignore_gather_pgsize” was provided as an argument for 
iommu_iotlb_gather_add_page()?


No, I mean if iommu_iotlb_gather_add_page() doesn't have the behaviour 
your driver wants, just don't call it. Write or factor out a suitable 
helper that *does* do what you want and call that, or just implement the 
logic directly inline. Indirect argument or not, it just doesn't make 
much sense to have a helper function call which says "do this except 
don't do most of it".



In general, I can live without this patch. It probably would have negligible 
impact on performance anyhow.


As I implied, it sounds like your needs are the same as the Mediatek 
driver had, so you could readily factor out a new page-size-agnostic 
gather helper from that. I fully support making the functional change to 
amd-iommu *somehow* - nobody likes unnecessary syncs - just not with 
this particular implementation :)
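
For reference, the factored-out helper would be on the order of the
below - the body mirrors what mtk-iommu ended up with in f21ae3b10084,
only the name is made up here:

static void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gather,
					 unsigned long iova, size_t size)
{
	unsigned long end = iova + size - 1;

	/* just grow the tracked range; never sync mid-gather */
	if (gather->start > iova)
		gather->start = iova;
	if (gather->end < end)
		gather->end = end;
}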


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:

> This version only covers 1) and 4). Do you think we need to support 2),
> 3) and beyond? 

Yes absolutely. The API should be flexible enough to specify the
creation of all future page table formats we'd want to have and all HW
specific details on those formats.

> If so, it seems that we need some in-kernel helpers and uAPIs to
> support pre-installing a page table to IOASID. 

Not sure what this means..

> From this point of view an IOASID is actually not just a variant of
> iommu_domain, but an I/O page table representation in a broader
> sense.

Yes, and things need to evolve in a staged way. The ioctl API should
have room for this growth but you need to start out with something
constrained enough to actually implement, then figure out how to grow
from there.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 11:08:53AM +0800, Lu Baolu wrote:
> On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
> > > > device bind should fail if the device somehow isn't compatible with
> > > > the scheme the user is trying to use.
> > > yeah, I guess you mean to fail the device attach when the IOASID is a
> > > nesting IOASID but the device is behind an iommu without nesting support.
> > > right?
> > Right..
> 
> Just want to confirm...
> 
> Does this mean that we only support hardware nesting and don't want to
> have soft nesting (shadowed page table in kernel) in IOASID?

No, the uAPI presents a contract, if the kernel can fulfill the
contract then it should be supported.

If you want SW nesting then the kernel has to have the SW support for
it or fail.

At least for the purposes of the document I wouldn't delve too much deeper
into that question.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu: Print default strict or lazy mode at init time

2021-06-01 Thread Robin Murphy

On 2021-06-01 16:50, John Garry wrote:

On 01/06/2021 10:09, Robin Murphy wrote:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 808ab70d5df5..f25fae62f077 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -138,6 +138,11 @@ static int __init iommu_subsys_init(void)
  (iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ?
  "(set via kernel command line)" : "");
+    pr_info("Default DMA domain mode: %s %s\n",


Nit: I think this might be a little unclear for end-users - *I'm* not 
even sure whether "Default" here is meant to refer to the mode setting 
itself or to default domains (of DMA type). Maybe something like "DMA 
domain TLB invalidation policy"? Certainly it seems like a good idea 
to explicitly mention invalidation to correlate with the documentation 
of the "iommu.strict" parameter.


Ack to the general idea though.


ok, so I'll go with this:

pr_info("DMA domain default TLB invalidation policy: %s mode %s\n",
   iommu_dma_strict ? "strict" : "lazy",
    (iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
    "(set via kernel command line)" : "");

I think it's worth mentioning "default" somewhere, as not all IOMMUs or 
devices will use lazy mode even if it's default.


But that's part of what I think is misleading - I boot and see that the 
default is something, so I reboot with iommu.strict to explicitly set it 
the other way, but now that's the default... huh?


The way I see it, we're saying what the current IOMMU API policy is - 
the value of iommu_dma_strict at any given time is fact - but we're not 
necessarily saying how widely that policy is enforced. We similarly 
report the type for default domains from global policy even though that 
may also be overridden per-group by drivers and/or userspace later; we 
don't say it's the *default* default domain type.


However, having now debugged the AMD issue from another thread, I think 
doing this at subsys_initcall is in fact going to be too early to be 
meaningful, since it ignores drivers' ability to change the global policy :(


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes

2021-06-01 Thread Nadav Amit


> On Jun 1, 2021, at 8:59 AM, Robin Murphy  wrote:
> 
> On 2021-05-02 07:59, Nadav Amit wrote:
>> From: Nadav Amit 
>> Some IOMMU architectures perform invalidations regardless of the page
>> size. In such architectures there is no need to sync when the page size
>> changes or to regard pgsize when making interim flush in
>> iommu_iotlb_gather_add_page().
>> Add an "ignore_gather_pgsize" property for each IOMMU-ops to decide
>> whether gather's pgsize is tracked and triggers a flush.
> 
> I've objected before[1], and I'll readily object again ;)
> 
> I still think it's very silly to add a bunch of indirection all over the 
> place to make a helper function not do the main thing it's intended to help 
> with. If you only need trivial address gathering then it's far simpler to 
> just implement trivial address gathering. I suppose if you really want to you 
> could factor out another helper to share the 5 lines of code which ended up 
> in mtk-iommu (see commit f21ae3b10084).

Thanks, Robin.

I read your comments but I cannot fully understand the alternative that you 
propose, although I do understand your objection to the indirection 
“ignore_gather_pgsize”. Would it be ok if “ignore_gather_pgsize” was provided 
as an argument for iommu_iotlb_gather_add_page()?

In general, I can live without this patch. It probably would have negligible 
impact on performance anyhow.

Regards,
Nadav


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes

2021-06-01 Thread Robin Murphy

On 2021-05-02 07:59, Nadav Amit wrote:

From: Nadav Amit 

Some IOMMU architectures perform invalidations regardless of the page
size. In such architectures there is no need to sync when the page size
changes or to regard pgsize when making interim flush in
iommu_iotlb_gather_add_page().

Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide
whether gather's pgsize is tracked and triggers a flush.


I've objected before[1], and I'll readily object again ;)

I still think it's very silly to add a bunch of indirection all over the 
place to make a helper function not do the main thing it's intended to 
help with. If you only need trivial address gathering then it's far 
simpler to just implement trivial address gathering. I suppose if you 
really want to you could factor out another helper to share the 5 lines 
of code which ended up in mtk-iommu (see commit f21ae3b10084).


Robin.

[1] 
https://lore.kernel.org/linux-iommu/49bae447-d662-e6cf-7500-ab78e3b75...@arm.com/



Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
  drivers/iommu/amd/iommu.c | 1 +
  include/linux/iommu.h | 3 ++-
  2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b8cabbbeed71..1849b53f2149 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2215,6 +2215,7 @@ const struct iommu_ops amd_iommu_ops = {
.put_resv_regions = generic_iommu_put_resv_regions,
.is_attach_deferred = amd_iommu_is_attach_deferred,
.pgsize_bitmap  = AMD_IOMMU_PGSIZES,
+   .ignore_gather_pgsize = true,
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
.iotlb_sync = amd_iommu_iotlb_sync,
.def_domain_type = amd_iommu_def_domain_type,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..1fb2695418e9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -284,6 +284,7 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
  
  	unsigned long pgsize_bitmap;

+   bool ignore_gather_pgsize;
struct module *owner;
  };
  
@@ -508,7 +509,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,

 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
+   if ((gather->pgsize != size && !domain->ops->ignore_gather_pgsize) ||
end + 1 < gather->start || start > gather->end + 1) {
if (gather->pgsize)
iommu_iotlb_sync(domain, gather);


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu: Print default strict or lazy mode at init time

2021-06-01 Thread John Garry

On 01/06/2021 10:09, Robin Murphy wrote:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 808ab70d5df5..f25fae62f077 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -138,6 +138,11 @@ static int __init iommu_subsys_init(void)
  (iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ?
  "(set via kernel command line)" : "");
+    pr_info("Default DMA domain mode: %s %s\n",


Nit: I think this might be a little unclear for end-users - *I'm* not 
even sure whether "Default" here is meant to refer to the mode setting 
itself or to default domains (of DMA type). Maybe something like "DMA 
domain TLB invalidation policy"? Certainly it seems like a good idea to 
explicitly mention invalidation to correlate with the documentation of 
the "iommu.strict" parameter.


Ack to the general idea though.


ok, so I'll go with this:

pr_info("DMA domain default TLB invalidation policy: %s mode %s\n",
  iommu_dma_strict ? "strict" : "lazy",
   (iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
   "(set via kernel command line)" : "");

I think it's worth mentioning "default" somewhere, as not all IOMMUs or 
devices will use lazy mode even if it's default.


Thanks,
John
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v3 7/7] dma-iommu: Use init_iova_domain_ext() for IOVA domain init

2021-06-01 Thread John Garry
Pass the max opt iova len to init the IOVA domain, if set.

Signed-off-by: John Garry 
---
 drivers/iommu/dma-iommu.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f6d3302bb829..37765d540dc9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -333,6 +333,8 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
struct iommu_dma_cookie *cookie = domain->iova_cookie;
unsigned long order, base_pfn;
struct iova_domain *iovad;
+   size_t max_opt_dma_size;
+   unsigned long iova_len;
 
if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)
return -EINVAL;
@@ -366,7 +368,18 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
return 0;
}
 
-   init_iova_domain(iovad, 1UL << order, base_pfn);
+   max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group);
+
+   if (max_opt_dma_size) {
+   unsigned long shift = __ffs(1UL << order);
+
+   iova_len = max_opt_dma_size >> shift;
+   iova_len = roundup_pow_of_two(iova_len);
+   } else {
+   iova_len = 0;
+   }
+
+   init_iova_domain_ext(iovad, 1UL << order, base_pfn, iova_len);
 
if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict(domain)) {
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 6/7] iommu: Allow max opt DMA len be set for a group via sysfs

2021-06-01 Thread John Garry
Add support to allow the maximum optimised DMA len to be set for an IOMMU
group via sysfs.

This is much the same as the method used to change the default domain type
for a group.

Signed-off-by: John Garry 
---
 drivers/iommu/iommu.c | 51 +--
 include/linux/iommu.h |  6 +
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8bf2abb3d4c1..ea2bdd1c4f4e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -45,6 +45,7 @@ struct iommu_group {
struct iommu_domain *default_domain;
struct iommu_domain *domain;
struct list_head entry;
+   size_t max_opt_dma_size;
 };
 
 struct group_device {
@@ -86,6 +87,9 @@ static int iommu_create_device_direct_mappings(struct 
iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
  const char *buf, size_t count);
+static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group,
+ const char *buf,
+ size_t count);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)  \
 struct iommu_group_attribute iommu_group_attr_##_name =\
@@ -554,6 +558,12 @@ static ssize_t iommu_group_show_type(struct iommu_group 
*group,
return strlen(type);
 }
 
+static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group,
+char *buf)
+{
+   return sprintf(buf, "%zu\n", group->max_opt_dma_size);
+}
+
 static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL);
 
 static IOMMU_GROUP_ATTR(reserved_regions, 0444,
@@ -562,6 +572,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444,
 static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type,
iommu_group_store_type);
 
+static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size,
+   iommu_group_store_max_opt_dma_size);
+
 static void iommu_group_release(struct kobject *kobj)
 {
struct iommu_group *group = to_iommu_group(kobj);
@@ -648,6 +661,10 @@ struct iommu_group *iommu_group_alloc(void)
if (ret)
return ERR_PTR(ret);
 
+   ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size);
+   if (ret)
+   return ERR_PTR(ret);
+
pr_debug("Allocated group %d\n", group->id);
 
return group;
@@ -2279,6 +2296,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device 
*dev)
return dev->iommu_group->default_domain;
 }
 
+size_t iommu_group_get_max_opt_dma_size(struct iommu_group *group)
+{
+   return group->max_opt_dma_size;
+}
+
 /*
  * IOMMU groups are really the natural working unit of the IOMMU, but
  * the IOMMU API works on domains and devices.  Bridge that gap by
@@ -3045,12 +3067,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);
  *  hasn't changed after the caller has called this function)
  * @type: The type of the new default domain that gets associated with the 
group
  * @new: Allocate new default domain, keeping same type when no type passed
+ * @max_opt_dma_size: If set, set the IOMMU group max_opt_dma_size when success
  *
  * Returns 0 on success and error code on failure
  *
  */
 static int iommu_change_dev_def_domain(struct iommu_group *group,
-  struct device *prev_dev, int type, bool new)
+  struct device *prev_dev, int type, bool new,
+  unsigned long max_opt_dma_size)
 {
struct iommu_domain *prev_dom;
struct group_device *grp_dev;
@@ -3146,6 +3170,9 @@ static int iommu_change_dev_def_domain(struct iommu_group 
*group,
 
group->domain = group->default_domain;
 
+   if (max_opt_dma_size)
+   group->max_opt_dma_size = max_opt_dma_size;
+
/*
 * Release the mutex here because ops->probe_finalize() call-back of
 * some vendor IOMMU drivers calls arm_iommu_attach_device() which
@@ -3272,7 +3299,7 @@ static int iommu_group_store_type_cb(const char *buf,
else
return -EINVAL;
 
-   return iommu_change_dev_def_domain(group, dev, type, false);
+   return iommu_change_dev_def_domain(group, dev, type, false, 0);
 }
 
 static ssize_t iommu_group_store_type(struct iommu_group *group,
@@ -3281,3 +3308,23 @@ static ssize_t iommu_group_store_type(struct iommu_group 
*group,
return iommu_group_store_common(group, buf, count,
iommu_group_store_type_cb);
 }
+
+static int iommu_group_store_max_opt_dma_size_cb(const char *buf,
+struct iommu_group *group,
+struct device *dev)
+{
+   

[PATCH v3 5/7] iova: Add init_iova_domain_ext()

2021-06-01 Thread John Garry
Add an extended version of init_iova_domain() which accepts a max opt
IOVA length argument, and use it to set the rcache range.

This can be combined with the normal version later.

Signed-off-by: John Garry 
---
 drivers/iommu/iova.c | 31 ---
 include/linux/iova.h |  9 +
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 95892a0433cc..ae4901073a98 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -23,7 +23,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad,
 static unsigned long iova_rcache_get(struct iova_domain *iovad,
 unsigned long size,
 unsigned long limit_pfn);
-static void init_iova_rcaches(struct iova_domain *iovad);
+static void init_iova_rcaches(struct iova_domain *iovad, unsigned long iova_len);
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void fq_destroy_all_entries(struct iova_domain *iovad);
@@ -46,9 +46,9 @@ static struct iova *to_iova(struct rb_node *node)
return rb_entry(node, struct iova, node);
 }
 
-void
-init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-   unsigned long start_pfn)
+static void
+__init_iova_domain(struct iova_domain *iovad, unsigned long granule,
+   unsigned long start_pfn, unsigned long iova_len)
 {
/*
 * IOVA granularity will normally be equal to the smallest
@@ -71,7 +71,21 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, &iovad->cpuhp_dead);
-   init_iova_rcaches(iovad);
+   init_iova_rcaches(iovad, iova_len);
+}
+
+void
+init_iova_domain_ext(struct iova_domain *iovad, unsigned long granule,
+   unsigned long start_pfn, unsigned long iova_len)
+{
+   __init_iova_domain(iovad, granule, start_pfn, iova_len);
+}
+
+void
+init_iova_domain(struct iova_domain *iovad, unsigned long granule,
+   unsigned long start_pfn)
+{
+   __init_iova_domain(iovad, granule, start_pfn, 0);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
 
@@ -883,14 +897,17 @@ bool iova_domain_len_is_cached(struct iova_domain *iovad, 
unsigned long iova_len
return iova_len_to_rcache_max(iova_len) == iovad->rcache_max_size;
 }
 
-static void init_iova_rcaches(struct iova_domain *iovad)
+static void init_iova_rcaches(struct iova_domain *iovad, unsigned long iova_len)
 {
struct iova_cpu_rcache *cpu_rcache;
struct iova_rcache *rcache;
unsigned int cpu;
int i;
 
-   iovad->rcache_max_size = IOVA_RANGE_CACHE_MAX_SIZE;
+   if (iova_len)
+   iovad->rcache_max_size = iova_len_to_rcache_max(iova_len);
+   else
+   iovad->rcache_max_size = IOVA_RANGE_CACHE_MAX_SIZE;
 
iovad->rcaches = kcalloc(iovad->rcache_max_size,
 sizeof(*iovad->rcaches), GFP_KERNEL);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 04cc8eb6de38..cfe416b6a8c7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -154,6 +154,8 @@ struct iova *reserve_iova(struct iova_domain *iovad, 
unsigned long pfn_lo,
unsigned long pfn_hi);
 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
unsigned long start_pfn);
+void init_iova_domain_ext(struct iova_domain *iovad, unsigned long granule,
+   unsigned long start_pfn, unsigned long iova_len);
 int init_iova_flush_queue(struct iova_domain *iovad,
  iova_flush_cb flush_cb, iova_entry_dtor entry_dtor);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
@@ -222,6 +224,13 @@ static inline void init_iova_domain(struct iova_domain 
*iovad,
 {
 }
 
+static inline void init_iova_domain_ext(struct iova_domain *iovad,
+   unsigned long granule,
+   unsigned long start_pfn,
+   unsigned long iova_len)
+{
+}
+
 static inline int init_iova_flush_queue(struct iova_domain *iovad,
iova_flush_cb flush_cb,
iova_entry_dtor entry_dtor)
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 4/7] iova: Add iova_domain_len_is_cached()

2021-06-01 Thread John Garry
Add a function to check whether an IOVA domain currently caches a given
upper IOVA len exactly.

Signed-off-by: John Garry 
---
 drivers/iommu/iova.c | 11 +++
 include/linux/iova.h |  8 +++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 0e4c0e55178a..95892a0433cc 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -872,6 +872,17 @@ static void iova_magazine_push(struct iova_magazine *mag, 
unsigned long pfn)
mag->pfns[mag->size++] = pfn;
 }
 
+static unsigned long iova_len_to_rcache_max(unsigned long iova_len)
+{
+   return order_base_2(iova_len) + 1;
+}
+
+/* Test if iova_len range cached upper limit matches that of IOVA domain */
+bool iova_domain_len_is_cached(struct iova_domain *iovad, unsigned long iova_len)
+{
+   return iova_len_to_rcache_max(iova_len) == iovad->rcache_max_size;
+}
+
 static void init_iova_rcaches(struct iova_domain *iovad)
 {
struct iova_cpu_rcache *cpu_rcache;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 9974e1d3e2bc..04cc8eb6de38 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -136,7 +136,8 @@ static inline unsigned long iova_pfn(struct iova_domain 
*iovad, dma_addr_t iova)
 #if IS_ENABLED(CONFIG_IOMMU_IOVA)
 int iova_cache_get(void);
 void iova_cache_put(void);
-
+bool iova_domain_len_is_cached(struct iova_domain *iovad,
+  unsigned long iova_len);
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
@@ -158,6 +159,11 @@ int init_iova_flush_queue(struct iova_domain *iovad,
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 #else
+static inline bool iova_domain_len_is_cached(struct iova_domain *iovad,
+unsigned long iova_len)
+{
+   return false;
+}
 static inline int iova_cache_get(void)
 {
return -ENOTSUPP;
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 3/7] iommu: Allow iommu_change_dev_def_domain() realloc default domain for same type

2021-06-01 Thread John Garry
Allow iommu_change_dev_def_domain() to create a new default domain, keeping
the same type as current when type is unset.

Also remove the comment about the function's purpose, which would otherwise
become stale.

Signed-off-by: John Garry 
---
 drivers/iommu/iommu.c | 54 ++-
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4d12b607918c..8bf2abb3d4c1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3036,6 +3036,7 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle)
 }
 EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);
 
 /*
  * Changes the default domain of an iommu group that has *only* one device
  *
@@ -3043,16 +3044,13 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);
  * @prev_dev: The device in the group (this is used to make sure that the 
device
  *  hasn't changed after the caller has called this function)
  * @type: The type of the new default domain that gets associated with the 
group
+ * @new: Allocate new default domain, keeping same type when no type passed
  *
  * Returns 0 on success and error code on failure
  *
- * Note:
- * 1. Presently, this function is called only when user requests to change the
- *group's default domain type through 
/sys/kernel/iommu_groups//type
- *Please take a closer look if intended to use for other purposes.
  */
 static int iommu_change_dev_def_domain(struct iommu_group *group,
-  struct device *prev_dev, int type)
+  struct device *prev_dev, int type, bool new)
 {
struct iommu_domain *prev_dom;
struct group_device *grp_dev;
@@ -3105,28 +3103,32 @@ static int iommu_change_dev_def_domain(struct 
iommu_group *group,
goto out;
}
 
-   dev_def_dom = iommu_get_def_domain_type(dev);
-   if (!type) {
+   if (new && !type) {
+   type = prev_dom->type;
+   } else {
+   dev_def_dom = iommu_get_def_domain_type(dev);
+   if (!type) {
+   /*
+* If the user hasn't requested any specific type of domain and
+* if the device supports both the domains, then default to the
+* domain the device was booted with
+*/
+   type = dev_def_dom ? : iommu_def_domain_type;
+   } else if (dev_def_dom && type != dev_def_dom) {
+   dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n",
+   iommu_domain_type_str(type));
+   ret = -EINVAL;
+   goto out;
+   }
+
/*
-* If the user hasn't requested any specific type of domain and
-* if the device supports both the domains, then default to the
-* domain the device was booted with
+* Switch to a new domain only if the requested domain type is different
+* from the existing default domain type
 */
-   type = dev_def_dom ? : iommu_def_domain_type;
-   } else if (dev_def_dom && type != dev_def_dom) {
-   dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n",
-   iommu_domain_type_str(type));
-   ret = -EINVAL;
-   goto out;
-   }
-
-   /*
-* Switch to a new domain only if the requested domain type is different
-* from the existing default domain type
-*/
-   if (prev_dom->type == type) {
-   ret = 0;
-   goto out;
+   if (prev_dom->type == type) {
+   ret = 0;
+   goto out;
+   }
}
 
/* Sets group->default_domain to the newly allocated domain */
@@ -3270,7 +3272,7 @@ static int iommu_group_store_type_cb(const char *buf,
else
return -EINVAL;
 
-   return iommu_change_dev_def_domain(group, dev, type);
+   return iommu_change_dev_def_domain(group, dev, type, false);
 }
 
 static ssize_t iommu_group_store_type(struct iommu_group *group,
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 1/7] iommu: Refactor iommu_group_store_type()

2021-06-01 Thread John Garry
Function iommu_group_store_type() supports changing the default domain
of an IOMMU group.

Many conditions need to be satisfied and steps taken for this action to be
successful.

Satisfying these conditions and taking these steps will also be required for
setting other IOMMU group attributes, so factor the code into a common part
and a part specific to updating the IOMMU group attribute.

No functional change intended.

Some code comments are tidied up also.

Signed-off-by: John Garry 
---
 drivers/iommu/iommu.c | 73 +++
 1 file changed, 46 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 808ab70d5df5..4d12b607918c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3169,20 +3169,23 @@ static int iommu_change_dev_def_domain(struct 
iommu_group *group,
 }
 
 /*
- * Changing the default domain through sysfs requires the users to ubind the
- * drivers from the devices in the iommu group. Return failure if this doesn't
- * meet.
+ * Changing the default domain or any other IOMMU group attribute through sysfs
+ * requires the users to unbind the drivers from the devices in the IOMMU group.
+ * Return failure if this precondition is not met.
  *
  * We need to consider the race between this and the device release path.
  * device_lock(dev) is used here to guarantee that the device release path
  * will not be entered at the same time.
  */
-static ssize_t iommu_group_store_type(struct iommu_group *group,
- const char *buf, size_t count)
+static ssize_t iommu_group_store_common(struct iommu_group *group,
+   const char *buf, size_t count,
+   int (*cb)(const char *buf,
+ struct iommu_group *group,
+ struct device *dev))
 {
struct group_device *grp_dev;
struct device *dev;
-   int ret, req_type;
+   int ret;
 
if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
return -EACCES;
@@ -3190,25 +3193,16 @@ static ssize_t iommu_group_store_type(struct 
iommu_group *group,
if (WARN_ON(!group))
return -EINVAL;
 
-   if (sysfs_streq(buf, "identity"))
-   req_type = IOMMU_DOMAIN_IDENTITY;
-   else if (sysfs_streq(buf, "DMA"))
-   req_type = IOMMU_DOMAIN_DMA;
-   else if (sysfs_streq(buf, "auto"))
-   req_type = 0;
-   else
-   return -EINVAL;
-
/*
 * Lock/Unlock the group mutex here before device lock to
-* 1. Make sure that the iommu group has only one device (this is a
+* 1. Make sure that the IOMMU group has only one device (this is a
 *prerequisite for step 2)
 * 2. Get struct *dev which is needed to lock device
 */
mutex_lock(&group->mutex);
if (iommu_group_device_count(group) != 1) {
mutex_unlock(&group->mutex);
-   pr_err_ratelimited("Cannot change default domain: Group has more than one device\n");
+   pr_err_ratelimited("Cannot change IOMMU group default domain attribute: Group has more than one device\n");
return -EINVAL;
}
 
@@ -3220,16 +3214,16 @@ static ssize_t iommu_group_store_type(struct 
iommu_group *group,
/*
 * Don't hold the group mutex because taking group mutex first and then
 * the device lock could potentially cause a deadlock as below. Assume
-* two threads T1 and T2. T1 is trying to change default domain of an
-* iommu group and T2 is trying to hot unplug a device or release [1] VF
-* of a PCIe device which is in the same iommu group. T1 takes group
-* mutex and before it could take device lock assume T2 has taken device
-* lock and is yet to take group mutex. Now, both the threads will be
-* waiting for the other thread to release lock. Below, lock order was
-* suggested.
+* two threads, T1 and T2. T1 is trying to change default domain
+* attribute of an IOMMU group and T2 is trying to hot unplug a device
+* or release [1] VF of a PCIe device which is in the same IOMMU group.
+* T1 takes the group mutex and before it could take device lock T2 may
+* have taken device lock and is yet to take group mutex. Now, both the
+* threads will be waiting for the other thread to release lock. Below,
+* lock order was suggested.
 * device_lock(dev);
 *  mutex_lock(&group->mutex);
-*  iommu_change_dev_def_domain();
+*  cb->iommu_change_dev_def_domain(); [example cb]
 *  mutex_unlock(&group->mutex);
 * device_unlock(dev);
 *
@@ -3243,7 +3237,7 @@ static ssize_t iommu_group_store_type(struct iommu_group 
*group,
 */
mutex_unlock(&group->mutex);
 
-   /* 

[PATCH v3 2/7] iova: Allow rcache range upper limit to be flexible

2021-06-01 Thread John Garry
Some LLDs may request DMA mappings whose IOVA length exceeds that of the
current rcache upper limit.

This means that allocations for those IOVAs will never be cached, and
always must be allocated and freed from the RB tree per DMA mapping cycle.
This has a significant effect on performance, more so since commit
4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
fails"), as discussed at [0].

As a first step towards allowing the rcache range upper limit be
configured, hold this value in the IOVA rcache structure, and allocate
the rcaches separately.

[0] 
https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Signed-off-by: John Garry 
---
 drivers/iommu/dma-iommu.c |  2 +-
 drivers/iommu/iova.c  | 23 +--
 include/linux/iova.h  |  4 ++--
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..f6d3302bb829 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -432,7 +432,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
 * rounding up anything cacheable to make sure that can't happen. The
 * order of the unadjusted size will still match upon freeing.
 */
-   if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+   if (iova_len < (1 << (iovad->rcache_max_size - 1)))
iova_len = roundup_pow_of_two(iova_len);
 
dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b7ecd5b08039..0e4c0e55178a 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -15,6 +15,8 @@
 /* The anchor node sits above the top of the usable address space */
 #define IOVA_ANCHOR~0UL
 
+#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
+
 static bool iova_rcache_insert(struct iova_domain *iovad,
   unsigned long pfn,
   unsigned long size);
@@ -877,7 +879,14 @@ static void init_iova_rcaches(struct iova_domain *iovad)
unsigned int cpu;
int i;
 
-   for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+   iovad->rcache_max_size = IOVA_RANGE_CACHE_MAX_SIZE;
+
+   iovad->rcaches = kcalloc(iovad->rcache_max_size,
+sizeof(*iovad->rcaches), GFP_KERNEL);
+   if (!iovad->rcaches)
+   return;
+
+   for (i = 0; i < iovad->rcache_max_size; ++i) {
rcache = &iovad->rcaches[i];
spin_lock_init(&rcache->lock);
rcache->depot_size = 0;
@@ -952,7 +961,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, 
unsigned long pfn,
 {
unsigned int log_size = order_base_2(size);
 
-   if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
+   if (log_size >= iovad->rcache_max_size)
return false;
 
return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
@@ -1008,7 +1017,7 @@ static unsigned long iova_rcache_get(struct iova_domain 
*iovad,
 {
unsigned int log_size = order_base_2(size);
 
-   if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
+   if (log_size >= iovad->rcache_max_size)
return 0;
 
return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size);
@@ -1024,7 +1033,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
unsigned int cpu;
int i, j;
 
-   for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+   for (i = 0; i < iovad->rcache_max_size; ++i) {
rcache = &iovad->rcaches[i];
for_each_possible_cpu(cpu) {
cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
@@ -1035,6 +1044,8 @@ static void free_iova_rcaches(struct iova_domain *iovad)
for (j = 0; j < rcache->depot_size; ++j)
iova_magazine_free(rcache->depot[j]);
}
+
+   kfree(iovad->rcaches);
 }
 
 /*
@@ -1047,7 +1058,7 @@ static void free_cpu_cached_iovas(unsigned int cpu, 
struct iova_domain *iovad)
unsigned long flags;
int i;
 
-   for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+   for (i = 0; i < iovad->rcache_max_size; ++i) {
rcache = &iovad->rcaches[i];
cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
spin_lock_irqsave(&cpu_rcache->lock, flags);
@@ -1066,7 +1077,7 @@ static void free_global_cached_iovas(struct iova_domain 
*iovad)
unsigned long flags;
int i, j;
 
-   for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+   for (i = 0; i < iovad->rcache_max_size; ++i) {
rcache = &iovad->rcaches[i];
spin_lock_irqsave(&rcache->lock, flags);
for (j = 0; j < rcache->depot_size; ++j) {
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 71d8a2de6635..9974e1d3e2bc 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -25,7 +25,6 

[PATCH v3 0/7] iommu: Allow IOVA rcache range be configured

2021-06-01 Thread John Garry
For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
exceeds the IOVA rcache upper limit (meaning that they are not cached),
performance can be reduced. 

This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
from last rb tree node if iova search fails"), as discussed at [0].

IOVAs which cannot be cached are highly involved in the IOVA ageing issue,
as discussed at [1].

This series allows the IOVA rcache range be configured, so that we may
cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used
indirectly to configure the IOVA rcache range:
/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated the same way as the IOMMU group default domain type,
i.e. the only device in the group must be unbound first.
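
For example, with a 4K IOVA granule, writing 1048576 (1M) to this file
gives, per iova_len_to_rcache_max() and iommu_dma_init_domain() in this
series: iova_len = roundup_pow_of_two(1048576 >> 12) = 256 pages, so
rcache_max_size = order_base_2(256) + 1 = 9, and IOVA lens up to 256
pages (1M) are then cached, versus up to 32 pages (128K) with the
default IOVA_RANGE_CACHE_MAX_SIZE of 6.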

The inspiration here comes from block layer request queue sysfs
"optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size

Some figures for storage scenario (when increasing IOVA rcache range to
cover all DMA mapping sizes from the LLD):
v5.13-rc1 baseline: 1200K IOPS
With series:1800K IOPS

All above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in
all scenarios.

[0] 
https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] 
https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/

Differences to v2:
- Drop DMA mapping API to allow LLD set this for now
- Update default domain immediately, instead of in reprobe
- Fix build warning

Differences to v1:
- Many
- Change method to not operate on a 'live' IOMMU domain:
- rather, force the device driver to be re-probed once
  dma_max_opt_size is set, and then reconfigure a new IOMMU group
- Add iommu sysfs max_dma_opt_size file, and allow updating same as how
  group type is changed 

John Garry (7):
  iommu: Refactor iommu_group_store_type()
  iova: Allow rcache range upper limit to be flexible
  iommu: Allow iommu_change_dev_def_domain() realloc default domain for
same type
  iova: Add iova_domain_len_is_cached()
  iova: Add init_iova_domain_ext()
  iommu: Allow max opt DMA len be set for a group via sysfs
  dma-iommu: Use init_iova_domain_ext() for IOVA domain init

 drivers/iommu/dma-iommu.c |  17 +++-
 drivers/iommu/iommu.c | 172 ++
 drivers/iommu/iova.c  |  63 +++---
 include/linux/iommu.h |   6 ++
 include/linux/iova.h  |  21 -
 5 files changed, 210 insertions(+), 69 deletions(-)

-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Shenming Lu
On 2021/6/1 20:30, Lu Baolu wrote:
> On 2021/6/1 15:15, Shenming Lu wrote:
>> On 2021/6/1 13:10, Lu Baolu wrote:
>>> Hi Shenming,
>>>
>>> On 6/1/21 12:31 PM, Shenming Lu wrote:
 On 2021/5/27 15:58, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own 
> logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API 
> provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
>  
> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
>
 [..]

> /*
>     * Page fault report and response
>     *
>     * This is TBD. Can be added after other parts are cleared up. Likely 
> it
>     * will be a ring buffer shared between user/kernel, an eventfd to 
> notify
>     * the user and an ioctl to complete the fault.
>     *
>     * The fault data is per I/O address space, i.e.: IOASID + 
> faulting_addr
>     */
 Hi,

 It seems that the ioasid has different usage in different situation, it 
 could
 be directly used in the physical routing, or just a virtual handle that 
 indicates
 a page table or a vPASID table (such as the GPA address space, in the 
 simple
 passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
 Substream ID), right?

 And Baolu suggested that since one device might consume multiple page 
 tables,
 it's more reasonable to have one fault handler per page table. By this, do 
 we
 have to maintain such an ioasid info list in the IOMMU layer?
>>> As discussed earlier, the I/O page fault and cache invalidation paths
>>> will have "device labels" so that the information could be easily
>>> translated and routed.
>>>
>>> So it's likely the per-device fault handler registering API in iommu
>>> core can be kept, but /dev/ioasid will be grown with a layer to
>>> translate and propagate I/O page fault information to the right
>>> consumers.
>> Yeah, having a general preprocessing of the faults in IOASID seems to be
>> a doable direction. But since there may be more than one consumer at the
>> same time, who is responsible for registering the per-device fault handler?
> 
> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults. This is just a top level thought. Haven't gone
> through the details yet. Need to wait and see what /dev/ioasid finally
> looks like.
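> 
> A rough sketch of that layering, reusing today's
> iommu_register_device_fault_handler() with an ioasid-level dispatch on
> top (the ioasid-side names below are invented):
> 
> static int ioasid_fault_handler(struct iommu_fault *fault, void *data)
> {
> 	struct ioasid_dev *idev = data;
> 	/* translate the raw (rid, pasid) routing to the faulting IOASID */
> 	u32 ioasid = ioasid_lookup(idev, fault->prm.pasid);
> 
> 	/* route to whichever consumer owns that I/O page table */
> 	return ioasid_deliver_fault(ioasid, fault);
> }
> 
> /* done once per device at attach time */
> iommu_register_device_fault_handler(dev, ioasid_fault_handler, idev);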

OK. And it needs to be confirmed by Jean since we might migrate the code from
io-pgfault.c to IOASID... Anyway, finalize /dev/ioasid first.  Thanks,

Shenming

> 
> Best regards,
> baolu
> .
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote:
> On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > 
> > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created 
> > > > > per
> > > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > > protection scope as VF associated to it. 
> > > > 
> > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > 
> > > It's only fake if you start with a narrow view of what a group is. 
> > 
> > A group is connected to drivers/iommu. A group object without *any*
> > relation to drivers/iommu is just a complete fiction, IMHO.
> 
> That might be where we differ.  As I've said, by group I'm primarily
> meaning the fundamental hardware unit of isolation.  *Usually* that's
> determined by the capabilities of an IOMMU, but in some cases it might
> not be.  In either case, the boundaries still matter.

As in my other email we absolutely need a group concept, it is just a
question of how the user API is designed around it.

> > The group mdev implicitly creates is just a fake proxy that comes
> > along with mdev API. It doesn't do anything and it doesn't mean
> > anything.
> 
> But.. the case of multiple mdevs managed by a single PCI device with
> an internal IOMMU also exists, and then the mdev groups are *not*
> proxies but true groups independent of the parent device.  Which means
> that the group structure of mdevs can vary, which is an argument *for*
> keeping it, not against.

If VFIO becomes more "vfio_device" centric then the vfio_device itself
has some properties. One of those can be "is it inside a drivers/iommu
group, or not?".

If the vfio_device is not using a drivers/iommu IOMMU interface then
it can just have no group at all - no reason to lie. This would mean
that the device has perfect isolation.

What I don't like is forcing certain things depending on how the
vfio_device was created - for instance forcing an IOMMU group as part
of it, and forcing an ugly "SW IOMMU" mode in the container, only as
part of mdev_device.

These should all be properties of the vfio_device itself.
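To sketch what that could mean, in C (all names here are hypothetical,
not existing VFIO structures):

	/* hypothetical sketch only, not the existing VFIO data model */
	struct iommu_group;                 /* opaque, from drivers/iommu */

	struct vfio_device_isolation {
		struct iommu_group *group;  /* NULL if no drivers/iommu group */
		bool self_isolated;         /* device isolates its own DMA */
	};

A device with group == NULL and self_isolated == true would be the
"perfect isolation" case above, with no reason to expose a group FD.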

Again this is all about the group fd - and how to fit in with the
/dev/ioasid proposal from Kevin:

https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/

Focusing on vfio_device and skipping the group fd smooths out some
rough edges.

Code wise we are not quite there, but I have mapped out eliminating
the group from the vfio_device centric API and a few other places it
has crept in.

The group can exist in the background to enforce security without
being a cornerstone of the API design.

Jason


Re: Regression 5.12.0-rc4 net: ice: significant throughput drop

2021-06-01 Thread Robin Murphy

On 2021-06-01 07:57, Daniel Borkmann wrote:

[ ping Robin / Joerg, +Cc Christoph ]


Sorry, I was off on Friday on top of the Bank Holiday yesterday.


On 5/28/21 10:34 AM, Jussi Maki wrote:

Hi all,

While measuring the impact of a kernel patch on our lab machines I stumbled
upon a performance regression affecting the 100Gbit ICE nic and bisected it
from range v5.11.1..v5.13-rc3 to the commit:
a250c23f15c2 iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

Both recent bpf-next (d6a6a55518) and linux-stable (c4681547bc) are
affected by the issue.

The regression shows as a significant drop in throughput as measured
with "super_netperf" [0],
with measured bandwidth of ~95Gbps before and ~35Gbps after:


I guess that must be the difference between using the flush queue
vs. strict invalidation. On closer inspection, it seems to me that
there's a subtle pre-existing bug in the AMD IOMMU driver, in that
amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api()
has called bus_set_iommu(). Does the patch below work?

Robin.

->8-

Subject: [PATCH] iommu/amd: Tidy up DMA ops init

Now that DMA ops are part of the core API via iommu-dma, fold the
vestigial remains of the IOMMU_DMA_OPS init state into the IOMMU API
phase, and clean up a few other leftovers. This should also close the
race window wherein bus_set_iommu() effectively makes the DMA ops state
visible before its nominal initialisation, which since commit
a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE") can now
lead to the wrong flush queue policy being picked.

Reported-by: Jussi Maki 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/amd/amd_iommu.h |  2 --
 drivers/iommu/amd/init.c  |  5 -
 drivers/iommu/amd/iommu.c | 29 -
 3 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 55dd38d814d9..416815a525d6 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -11,8 +11,6 @@
 
 #include "amd_iommu_types.h"
 
-extern int amd_iommu_init_dma_ops(void);
-extern int amd_iommu_init_passthrough(void);
 extern irqreturn_t amd_iommu_int_thread(int irq, void *data);
 extern irqreturn_t amd_iommu_int_handler(int irq, void *data);
 extern void amd_iommu_apply_erratum_63(u16 devid);
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index d006724f4dc2..a418bf560a4b 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -231,7 +231,6 @@ enum iommu_init_state {
IOMMU_ENABLED,
IOMMU_PCI_INIT,
IOMMU_INTERRUPTS_EN,
-   IOMMU_DMA_OPS,
IOMMU_INITIALIZED,
IOMMU_NOT_FOUND,
IOMMU_INIT_ERROR,
@@ -2895,10 +2894,6 @@ static int __init state_next(void)
init_state = ret ? IOMMU_INIT_ERROR : IOMMU_INTERRUPTS_EN;
break;
case IOMMU_INTERRUPTS_EN:
-   ret = amd_iommu_init_dma_ops();
-   init_state = ret ? IOMMU_INIT_ERROR : IOMMU_DMA_OPS;
-   break;
-   case IOMMU_DMA_OPS:
init_state = IOMMU_INITIALIZED;
break;
case IOMMU_INITIALIZED:
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 80e8e1916dd1..20f7d141ea53 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -30,7 +30,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -1771,13 +1770,22 @@ void amd_iommu_domain_update(struct protection_domain *domain)
amd_iommu_domain_flush_complete(domain);
 }
 
+static void __init amd_iommu_init_dma_ops(void)
+{
+   swiotlb = (iommu_default_passthrough() || sme_me_mask) ? 1 : 0;
+
+   if (amd_iommu_unmap_flush)
+   pr_info("IO/TLB flush on unmap enabled\n");
+   else
+   pr_info("Lazy IO/TLB flushing enabled\n");
+   iommu_set_dma_strict(amd_iommu_unmap_flush);
+}
+
 int __init amd_iommu_init_api(void)
 {
int ret, err = 0;
 
-	ret = iova_cache_get();
-	if (ret)
-		return ret;
+   amd_iommu_init_dma_ops();
 
 	err = bus_set_iommu(&pci_bus_type, &amd_iommu_ops);

if (err)
@@ -1794,19 +1802,6 @@ int __init amd_iommu_init_api(void)
return 0;
 }
 
-int __init amd_iommu_init_dma_ops(void)
-{
-	swiotlb = (iommu_default_passthrough() || sme_me_mask) ? 1 : 0;
-
-   if (amd_iommu_unmap_flush)
-   pr_info("IO/TLB flush on unmap enabled\n");
-   else
-   pr_info("Lazy IO/TLB flushing enabled\n");
-   iommu_set_dma_strict(amd_iommu_unmap_flush);
-   return 0;
-
-}
-
 /*
  *
  * The following functions belong to the exported interface of AMD IOMMU
--
2.21.0.dirty



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Lu Baolu

On 2021/6/1 15:15, Shenming Lu wrote:

On 2021/6/1 13:10, Lu Baolu wrote:

Hi Shenming,

On 6/1/21 12:31 PM, Shenming Lu wrote:

On 2021/5/27 15:58, Tian, Kevin wrote:

/dev/ioasid provides a unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.

This proposal describes the uAPI of /dev/ioasid and also sample sequences
with VFIO as example in typical usages. The driver-facing kernel API provided
by the iommu layer is still TBD, which can be discussed after consensus is
made on this uAPI.

It's based on a lengthy discussion starting from here:
 https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/

It ends up to be a long writing due to many things to be summarized and
non-trivial effort required to connect them into a complete proposal.
Hope it provides a clean base to converge.


[..]


/*
    * Page fault report and response
    *
    * This is TBD. Can be added after other parts are cleared up. Likely it
    * will be a ring buffer shared between user/kernel, an eventfd to notify
    * the user and an ioctl to complete the fault.
    *
    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
    */

Hi,

It seems that the ioasid has different usages in different situations: it could
be directly used in the physical routing, or be just a virtual handle that
indicates
a page table or a vPASID table (such as the GPA address space, in the simple
passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
Substream ID), right?

And Baolu suggested that since one device might consume multiple page tables,
it's more reasonable to have one fault handler per page table. By this, do we
have to maintain such an ioasid info list in the IOMMU layer?

As discussed earlier, the I/O page fault and cache invalidation paths
will have "device labels" so that the information could be easily
translated and routed.

So it's likely the per-device fault handler registering API in iommu
core can be kept, but /dev/ioasid will be grown with a layer to
translate and propagate I/O page fault information to the right
consumers.

Yeah, having a general preprocessing of the faults in IOASID seems to be
a doable direction. But since there may be more than one consumer at the
same time, who is responsible for registering the per-device fault handler?


The drivers register per page table fault handlers to /dev/ioasid which
will then register itself to iommu core to listen and route the per-
device I/O page faults. This is just a top level thought. Haven't gone
through the details yet. Need to wait and see what /dev/ioasid finally
looks like.
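
As a rough illustration of that direction (every name below is
hypothetical; only iommu_register_device_fault_handler() exists in the
tree today):

	/* hypothetical sketch, kernel side; u32/u64 from <linux/types.h> */
	struct ioasid_fault {
		u32 ioasid;     /* which I/O page table faulted */
		u32 dev_label;  /* user-assigned label of the faulting device */
		u64 addr;       /* faulting I/O virtual address */
	};

	typedef int (*ioasid_fault_handler_t)(struct ioasid_fault *fault,
					      void *data);

	/* a page table provider (vIOMMU, GPA space, ...) registers per IOASID */
	int ioasid_register_fault_handler(u32 ioasid,
					  ioasid_fault_handler_t handler,
					  void *data);

/dev/ioasid itself would then call iommu_register_device_fault_handler()
once per attached device and route each fault to the handler of the page
table that the faulting (RID, PASID) resolves to.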

Best regards,
baolu

Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-01 Thread Will Deacon
On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> > From: Thierry Reding 
> > 
> > Hi,
> > 
> > this is a set of patches that is the result of earlier discussions
> > regarding early identity mappings that are needed to avoid SMMU faults
> > during early boot.
> > 
> > The goal here is to avoid early identity mappings altogether and instead
> > postpone the need for the identity mappings to when devices are attached
> > to the SMMU. This works by making the SMMU driver coordinate with the
> > memory controller driver on when to start enforcing SMMU translations.
> > This makes Tegra behave in a more standard way and pushes the code to
> > deal with the Tegra-specific programming into the NVIDIA SMMU
> > implementation.
> > 
> > Compared to the original version of these patches, I've split the
> > preparatory work into a separate patch series because it became very
> > large and will be mostly uninteresting for this audience.
> > 
> > Patch 1 provides a mechanism to program SID overrides at runtime. Patch
> > 2 updates the ARM SMMU device tree bindings to include the Tegra186
> > compatible string as suggested by Robin during review.
> > 
> > Patches 3 and 4 create the fundamentals in the SMMU driver to support
> > this and also make this functionality available on Tegra186. Patch 5
> > hooks the ARM SMMU up to the memory controller so that the memory client
> > stream ID overrides can be programmed at the right time.
> > 
> > Patch 6 extends this mechanism to Tegra186 and patches 7-9 enable all of
> > this through device tree updates. Patch 10 is included here to show how
> > SMMU will be enabled for display controllers. However, it cannot be
> > applied yet because the code to create identity mappings for potentially
> > live framebuffers hasn't been merged yet.
> > 
> > The end result is that various peripherals will have SMMU enabled, while
> > the display controllers will keep using passthrough, as initially set up
> > by firmware. Once the device tree bindings have been accepted and the
> > SMMU driver has been updated to create identity mappings for the display
> > controllers, they can be hooked up to the SMMU and the code in this
> > series will automatically program the SID overrides to enable SMMU
> > translations at the right time.
> > 
> > Note that the series creates a compile time dependency between the
> > memory controller and IOMMU trees. If it helps I can provide a branch
> > for each tree, modelling the dependency, once the series has been
> > reviewed.
> > 
> > Changes in v2:
> > - split off the preparatory work into a separate series (that needs to
> >   be applied first)
> > - address review comments by Robin
> > 
> > Thierry
> > 
> > Thierry Reding (10):
> >   memory: tegra: Implement SID override programming
> >   dt-bindings: arm-smmu: Add Tegra186 compatible string
> >   iommu/arm-smmu: Implement ->probe_finalize()
> >   iommu/arm-smmu: tegra: Detect number of instances at runtime
> >   iommu/arm-smmu: tegra: Implement SID override programming
> >   iommu/arm-smmu: Use Tegra implementation on Tegra186
> >   arm64: tegra: Use correct compatible string for Tegra186 SMMU
> >   arm64: tegra: Hook up memory controller to SMMU on Tegra186
> >   arm64: tegra: Enable SMMU support on Tegra194
> >   arm64: tegra: Enable SMMU support for display on Tegra194
> > 
> >  .../devicetree/bindings/iommu/arm,smmu.yaml   |  11 +-
> >  arch/arm64/boot/dts/nvidia/tegra186.dtsi  |   4 +-
> >  arch/arm64/boot/dts/nvidia/tegra194.dtsi  | 166 ++
> >  drivers/iommu/arm/arm-smmu/arm-smmu-impl.c|   3 +-
> >  drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c  |  90 --
> >  drivers/iommu/arm/arm-smmu/arm-smmu.c |  13 ++
> >  drivers/iommu/arm/arm-smmu/arm-smmu.h |   1 +
> >  drivers/memory/tegra/mc.c |   9 +
> >  drivers/memory/tegra/tegra186.c   |  72 
> >  include/soc/tegra/mc.h|   3 +
> >  10 files changed, 349 insertions(+), 23 deletions(-)
> 
> Will, Robin,
> 
> do you have any more comments on the ARM SMMU bits of this series? If
> not, can you guys provide an Acked-by so that Krzysztof can pick this
> (modulo the DT patches) up into the memory-controller tree for v5.14?
> 
> I'll send out a v3 with the bisectibilitiy fix that Krishna pointed
> out.

Probably best if I queue 3-6 on a separate branch once you send a v3,
then Krzysztof can pull that in if he needs it.

Will


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Parav Pandit



> From: Jason Gunthorpe 
> Sent: Monday, May 31, 2021 11:43 PM
> 
> On Mon, May 31, 2021 at 05:37:35PM +, Parav Pandit wrote:
> 
> > In that case, can it be a new system call? Why does it have to be under
> > /dev/ioasid? For example, a few years back such a system call, mpin(),
> > was proposed in [1].
> 
> Reference counting of the overall pins is required
> 
> So when a pinned pages is incorporated into an IOASID page table in a later
> IOCTL it means it cannot be unpinned while the IOASID page table is using it.
OK, but can't it use the same refcount that the mmu uses?

> 
> This is some trick to organize the pinning into groups and then refcount each
> group, thus avoiding needing per-page refcounts.
Pinned page refcount is already maintained by the mmu without ioasid, isn't it?

> 
> The data structure would be an interval tree of pins in general
> 
> The ioasid itself would have an interval tree of its own mappings, each entry
> in this tree would reference count against an element in the above tree
> 
> Then the ioasid's interval tree would be mapped into a page table tree in HW
> format.
Does it mean that in the simple use case [1], a second-level page table copy
is maintained on the IOMMU side via the map interface?
I hope not. It should use the same as what the mmu uses, right?

[1] one SIOV/ADI device assigned with one PASID and mapped in guest VM

> 
> The redundant storage is needed to keep track of the referencing and the
> CPU page table values for later unpinning.
> 
> Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Lu Baolu

Hi Jason,

On 2021/5/29 7:36, Jason Gunthorpe wrote:

/*
   * Bind an user-managed I/O page table with the IOMMU
   *
   * Because user page table is untrusted, IOASID nesting must be enabled
   * for this ioasid so the kernel can enforce its DMA isolation policy
   * through the parent ioasid.
   *
   * Pgtable binding protocol is different from DMA mapping. The latter
   * has the I/O page table constructed by the kernel and updated
   * according to user MAP/UNMAP commands. With pgtable binding the
   * whole page table is created and updated by userspace, thus different
   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
   *
   * Because the page table is directly walked by the IOMMU, the user
   * must  use a format compatible to the underlying hardware. It can
   * check the format information through IOASID_GET_INFO.
   *
   * The page table is bound to the IOMMU according to the routing
   * information of each attached device under the specified IOASID. The
   * routing information (RID and optional PASID) is registered when a
   * device is attached to this IOASID through VFIO uAPI.
   *
   * Input parameters:
   *- child_ioasid;
   *- address of the user page table;
   *- formats (vendor, address_width, etc.);
   *
   * Return: 0 on success, -errno on failure.
   */
#define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
#define IOASID_UNBIND_PGTABLE   _IO(IOASID_TYPE, IOASID_BASE + 10)

Also feels backwards, why wouldn't we specify this, and the required
page table format, during alloc time?



Thinking of the required page table format, perhaps we should shed more
light on the page table of an IOASID. So far, an IOASID might represent
one of the following page tables (might be more):

 1) an IOMMU format page table (a.k.a. iommu_domain)
 2) a user application CPU page table (SVA for example)
 3) a KVM EPT (future option)
 4) a VM guest managed page table (nesting mode)

This version only covers 1) and 4). Do you think we need to support 2),
3) and beyond? If so, it seems that we need some in-kernel helpers and
uAPIs to support pre-installing a page table to IOASID. From this point
of view an IOASID is actually not just a variant of iommu_domain, but an
I/O page table representation in a broader sense.
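
For instance, the provider and format could be specified at allocation
time with something like this (a hypothetical extension, just following
the RFC's style):

	/* hypothetical, for discussion only */
	enum ioasid_pgtable_type {
		IOASID_PGTABLE_KERNEL,   /* 1) kernel-managed iommu_domain */
		IOASID_PGTABLE_CPU_SVA,  /* 2) share a process CPU page table */
		IOASID_PGTABLE_KVM_EPT,  /* 3) share a KVM EPT (future) */
		IOASID_PGTABLE_USER,     /* 4) guest/user-managed (nesting) */
	};

	struct ioasid_alloc_data {
		__u32 type;         /* enum ioasid_pgtable_type */
		__u32 format;       /* vendor/width, checked via IOASID_GET_INFO */
		__u64 pgtable_ptr;  /* page table base, where applicable */
	};

This way the page table provider is associated with the IOASID before any
device attach, and the provider naturally becomes the owner of fault
handling for that page table.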

Best regards,
baolu


Re: [PATCH] iommu: Print default strict or lazy mode at init time

2021-06-01 Thread Robin Murphy

On 2021-05-28 14:37, John Garry wrote:

As well as the default domain type, it's useful to know whether strict
or lazy mode is default for DMA domains, so add this info in a separate
print.

Signed-off-by: John Garry 

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 808ab70d5df5..f25fae62f077 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -138,6 +138,11 @@ static int __init iommu_subsys_init(void)
(iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ?
"(set via kernel command line)" : "");
  
+	pr_info("Default DMA domain mode: %s %s\n",


Nit: I think this might be a little unclear for end-users - *I'm* not 
even sure whether "Default" here is meant to refer to the mode setting 
itself or to default domains (of DMA type). Maybe something like "DMA 
domain TLB invalidation policy"? Certainly it seems like a good idea to 
explicitly mention invalidation to correlate with the documentation of 
the "iommu.strict" parameter.


Ack to the general idea though.

Thanks,
Robin.


+   iommu_dma_strict ? "strict" : "lazy",
+   (iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
+   "(set via kernel command line)" : "");
+
return 0;
  }
  subsys_initcall(iommu_subsys_init);
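
For concreteness, the suggested wording would make the print read
something like (sketch only, not a final patch):

	pr_info("DMA domain TLB invalidation policy: %s mode %s\n",
		iommu_dma_strict ? "strict" : "lazy",
		(iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
			"(set via kernel command line)" : "");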




Re: [PATCH 2/2] iommu: Drop unnecessary of_iommu.h includes

2021-06-01 Thread Robin Murphy

On 2021-05-27 20:37, Rob Herring wrote:

The only place of_iommu.h is needed is in drivers/of/device.c. Remove it
from everywhere else.


Of course, this was from the OF_IOMMU_DECLARE() business, which is all
long gone now.


Reviewed-by: Robin Murphy 


Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Joerg Roedel 
Cc: Rob Clark 
Cc: Marek Szyprowski 
Cc: Krzysztof Kozlowski 
Cc: Bjorn Andersson 
Cc: Yong Wu 
Cc: Matthias Brugger 
Cc: Heiko Stuebner 
Cc: Jean-Philippe Brucker 
Cc: Frank Rowand 
Cc: linux-arm-ker...@lists.infradead.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Rob Herring 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 -
  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 1 -
  drivers/iommu/arm/arm-smmu/qcom_iommu.c | 1 -
  drivers/iommu/exynos-iommu.c| 1 -
  drivers/iommu/ipmmu-vmsa.c  | 1 -
  drivers/iommu/msm_iommu.c   | 1 -
  drivers/iommu/mtk_iommu.c   | 1 -
  drivers/iommu/mtk_iommu_v1.c| 1 -
  drivers/iommu/omap-iommu.c  | 1 -
  drivers/iommu/rockchip-iommu.c  | 1 -
  drivers/iommu/virtio-iommu.c| 1 -
  drivers/of/platform.c   | 1 -
  12 files changed, 12 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 54b2f27b81d4..2ddc3cd5a7d1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -23,7 +23,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..dba15f312cbd 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -31,7 +31,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 4294abe389b2..021cf8f65ffc 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -25,7 +25,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 7623d8c371f5..d0fbf1d10e18 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -17,7 +17,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index aaa6a4d59057..51ea6f00db2f 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -19,7 +19,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 7880f307cb2d..3a38352b603f 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -18,7 +18,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  
  #include 

  #include 
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e06b8a0e2b56..6f7c69688ce2 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -19,7 +19,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 5915d7b38211..778e66f5f1aa 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -22,7 +22,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 26e517eb0dd3..91749654fd49 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -22,7 +22,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 7a2932772fdf..bb50e015b1d5 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -21,7 +21,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 7c02481a81b4..d9f46f2c3058 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -14,7 +14,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 
diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 25d448f5af91..74afbb7a4f5e 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -17,7 +17,6 @@
  #include 
  #include 
  #include 
-#include <linux/of_iommu.h>
  #include 
  #include 
  #include 




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Wang


On 2021/6/1 2:16 PM, Tian, Kevin wrote:

From: Jason Wang
Sent: Tuesday, June 1, 2021 2:07 PM

On 2021/6/1 1:42 PM, Tian, Kevin wrote:

From: Jason Wang
Sent: Tuesday, June 1, 2021 1:30 PM

On 2021/6/1 1:23 PM, Lu Baolu wrote:

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:

2) If yes, what's the reason for not simply use the fd opened from
/dev/ioas. (This is the question that is not answered) and what
happens
if we call GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?

oh, missed this question in prior reply. Personally, no special reason
yet. But using ID may give us opportunity to customize the

management

of the handle. For one, better lookup efficiency by using xarray to
store the allocated IDs. For two, could categorize the allocated IDs
(parent or nested). GET_INFO just works with an input FD and an ID.

I'm not sure I get this, for nesting cases you can still make the
child an fd.

And a question still, under what case we need to create multiple
ioasids on a single ioasid fd?

One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple
IOASID FDs" for such a case.

Right, that's exactly my question. The latter seems much easier to
understand and implement.


A simple reason discussed in the previous thread - there could be 1M
I/O address spaces per device while #FDs are a precious resource.


Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with the fd semantic, you can do a lot of other stuff: close on
exec, passing via SCM_RIGHTS.

yes, fd has its merits.


For the case of 1M, I would like to know what's the use case for a
single process to handle 1M+ address spaces?

This single process is Qemu with an assigned device. Within the guest
there could be many guest processes. Though in reality I haven't seen
such 1M processes on a single device; better not to restrict it in the uAPI?



Sorry, I don't get it here.

We can open up to ~0U file descriptors, I don't see why we need to 
restrict it in uAPI.


Thanks







So this RFC treats the fd as a container of address spaces, each tagged
by an IOASID.


If the container and address space is 1:1 then the container seems useless.


yes, with 1:1 the container is useless. But here it's assumed 1:M, and then
even a single fd is sufficient for all intended usages.

Thanks
Kevin



RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, May 29, 2021 3:59 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use 
> > cases:
> >
> > ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, I want to confirm here. Per earlier discussion we were left with
the impression that you want VFIO to be a pure device driver, with
container/group used only for legacy applications. From this comment are
you suggesting that VFIO can still keep the container/group concepts and
the user just deprecates the use of the vfio iommu uAPI (e.g.
VFIO_SET_IOMMU) by using /dev/ioasid (which has a simple policy that an
IOASID will reject commands if a partially-attached group exists)?

> 
> 
> > Three types of IOASIDs are considered:
> >
> > gpa_ioasid[1...N]:  for GPA address space
> > giova_ioasid[1...N]:for guest IOVA address space
> > gva_ioasid[1...N]:  for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> > INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> > /* Bind device to IOASID fd */
> > device_fd = open("/dev/vfio/devices/dev1", mode);
> > ioasid_fd = open("/dev/ioasid", mode);
> > ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > /* Attach device to IOASID */
> > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup GPA mapping */
> > dma_map = {
> > .ioasid = gpa_ioasid;
> > .iova   = 0;// GPA
> > .vaddr  = 0x4000;   // HVA
> > .size   = 1GB;
> > };
> > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned with more than dev1, user follows above sequence
> > to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> > address space cross all assigned devices.
> 
> eg
> 
>   device2_fd = open("/dev/vfio/devices/dev1", mode);
>   ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>   ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> Right?

Exactly, except a small typo ('dev1' -> 'dev2'). :)

> 
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > 
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu need to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> > device_fd1 = open("/dev/vfio/devices/dev1", mode);
> > device_fd2 = open("/dev/vfio/devices/dev2", mode);
> > ioasid_fd = open("/dev/ioasid", mode);
> > ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> > ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > /* pre-register the virtual address range for accounting */
> > mem_info = { .vaddr = 0x4000; .size = 1GB };
> > ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> > /* Attach dev1 and dev2 to gpa_ioasid */
> > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup GPA mapping */
> > dma_map = {
> > .ioasid = gpa_ioasid;
> > .iova 

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, May 29, 2021 1:36 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> 
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
> 
> Why? A SW emulation could do this synchronization during invalidation
> processing if invalidation contained an IOVA range.

In this proposal we differentiate between host-managed and user-
managed I/O page tables. If host-managed, the user is expected to use
map/unmap cmd explicitly upon any change required on the page table. 
If user-managed, the user first binds its page table to the IOMMU and 
then use invalidation cmd to flush iotlb when necessary (e.g. typically
not required when changing a PTE from non-present to present).

We expect the user to use map+unmap and bind+invalidate respectively
instead of mixing them together. Following this policy, map+unmap
must be used at both levels for software nesting, so changes at either
level are captured in time to synchronize the shadow mapping.
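
Expressed with the uAPI commands, the two protocols look roughly like
below (IOASID_DMA_UNMAP and IOASID_INVALIDATE_CACHE are placeholder
names here):

	/* host-managed: every page table change goes through the kernel */
	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
	ioctl(ioasid_fd, IOASID_DMA_UNMAP, &dma_unmap);

	/* user-managed: bind once, then only invalidate the iotlb */
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);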

> 
> I think this document would be stronger to include some "Rational"
> statements in key places
> 

Sure. I tried to provide rationale as much as possible but sometimes 
it's lost in a complex context like this. :)

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 11:55:00PM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/27/2021 10:30 AM, David Gibson wrote:
> > On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> > > 
> > > 
> > > On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > 
> > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created 
> > > > > per
> > > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > > protection scope as VF associated to it.
> > > > 
> > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > 
> > > > Only the VF's real IOMMU group should be used to model an iommu domain
> > > > linked to a VF. Injecting fake groups that are proxies for real groups
> > > > only opens the possibility of security problems like David is
> > > > concerned with.
> > > > 
> > > 
> > > I think this security issue should be addressed by letting mdev device
> > > inherit its parent's iommu_group, i.e. VF's iommu_group here.
> > 
> > No, that doesn't work.  AIUI part of the whole point of mdevs is to
> > allow chunks of a single PCI function to be handed out to different
> > places, because they're isolated from each other not by the system
> > IOMMU, but by a combination of MMU hardware in the hardware (e.g. in a
> > GPU card) and software in the mdev driver.
> 
> That's correct for non-iommu backed mdev devices.
> 
> > If mdevs inherited the
> > group of their parent device they wouldn't count as isolated from each
> > other, which they should.
> > 
> 
> For iommu backed mdev devices for SRIOV, where there can be single mdev
> device for its parent, here parent device is VF, there can't be multiple
> mdev devices associated with that VF. In this case mdev can inherit the
> group of parent device.

Ah, yes, if there's just one mdev for the PCI function, and the
function doesn't have an internal memory protection unit then this
makes sense.

Which means we *do* have at least two meaningfully different group
configurations for mdev:
  * mdev is in a singleton group independent of the parent PCI device
  * mdev shares a group with its parent PCI device

Which means even in the case of mdevs, the group structure is *not* a
meaningless fiction.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 04:06:20PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote:
> 
> > > > If the physical device had a bug which meant the mdevs *weren't*
> > > > properly isolated from each other, then those mdevs would share a
> > > > group, and you *would* care about it.  Depending on how the isolation
> > > > failed the mdevs might or might not also share a group with the parent
> > > > physical device.
> > > 
> > > That isn't a real scenario.. mdevs that can't be isolated just
> > > wouldn't be useful to exist
> > 
> > Really?  So what do you do when you discover some mdevs you thought
> > were isolated actually aren't due to a hardware bug?  Drop support
> > from the driver entirely?  In which case what do you say to the people
> > who understandably complain "but... we had all the mdevs in one guest
> > anyway, we don't care if they're not isolated"?
> 
> I've never said to eliminate groups entirely. 
> 
> What I'm saying is that all the cases we have for mdev today do not
> require groups, but are forced to create a fake group anyhow just to
> satisfy the odd VFIO requirement to have a group FD.
> 
> If some future mdev needs groups then sure, add the appropriate group
> stuff.
> 
> But that doesn't effect the decision to have a VFIO group FD, or not.
> 
> > > > It ensures that they're parked at the moment the group moves from
> > > > kernel to userspace ownership, but it can't prevent dpdk from
> > > > accessing and unparking those devices via peer to peer DMA.
> > > 
> > > Right, and adding all this group stuff did nothing to alert the poor
> > > admin that is running DPDK to this risk.
> > 
> > Didn't it?  Seems to me the admin that in order to give the group to
> > DPDK, the admin had to find and unbind all the things in it... so is
> > therefore aware that they're giving everything in it to DPDK.
> 
> Again, I've never said the *group* should be removed. I'm only
> concerned about the *group FD*

Ok, that wasn't really clear to me.

I still wouldn't say the group for mdevs is a fiction though.. rather
that the group device used for (no internal IOMMU case) mdevs is just
plain wrong.

> When the admin found and unbound they didn't use the *group FD* in any
> way.

No, they are likely to have changed permissions on the group device
node as part of the process, though.

> > > You put the same security labels you'd put on the group to the devices
> > > that consitute the group. It is only more tricky in the sense that the
> > > script that would have to do this will need to do more than ID the
> > > group to label but also ID the device members of the group and label
> > > their char nodes.
> > 
> > Well, I guess, if you take the view that root is allowed to break the
> > kernel.  I tend to prefer that although root can obviously break the
> > kernel if they intend do, we should make it hard to do by accident -
> > which in this case would mean the kernel *enforcing* that the devices
> > in the group have the same security labels, which I can't really see
> > how to do without an exposed group.
> 
> How is this "break the kernel"? It has nothing to do with the
> kernel. Security labels are a user space concern.

*thinks*... yeah, ok, that was much too strong an assertion.  What I
was thinking of is the fact that this means that guarantees you'd
normally expect the kernel to enforce can be obviated by bad
configuration: chown-ing a device to root doesn't actually protect it
if there's another device in the same group exposed to other users.

But I guess you could say the same about, say, an unauthenticated nbd
export of a root-owned block device, so I guess that's not something
the kernel can reasonably enforce.


Ok.. you might be finally convincing me, somewhat.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > 
> > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > protection scope as VF associated to it. 
> > > 
> > > This doesn't require, and certainly shouldn't create, a fake group.
> > 
> > It's only fake if you start with a narrow view of what a group is. 
> 
> A group is connected to drivers/iommu. A group object without *any*
> relation to drivers/iommu is just a complete fiction, IMHO.

That might be where we differ.  As I've said, by group I'm primarily
meaning the fundamental hardware unit of isolation.  *Usually* that's
determined by the capabilities of an IOMMU, but in some cases it might
not be.  In either case, the boundaries still matter.

> > > Only the VF's real IOMMU group should be used to model an iommu domain
> > > linked to a VF. Injecting fake groups that are proxies for real groups
> > > only opens the possibility of security problems like David is
> > > concerned with.
> > 
> > It's not a proxy for a real group, it's a group of its own.  If you
> > discover that (due to a hardware bug, for example) the mdev is *not*
> 
> What Kirti is talking about here is the case where a mdev is wrapped
> around a VF and the DMA isolation stems directly from the SRIOV VF's
> inherent DMA isolation, not anything the mdev wrapper did.
> 
> The group providing the isolation is the VF's group.

Yes, in that case the mdev absolutely should be in the VF's group -
having its own group is not just messy but incorrect.

> The group mdev implicitly creates is just a fake proxy that comes
> along with mdev API. It doesn't do anything and it doesn't mean
> anything.

But.. the case of multiple mdevs managed by a single PCI device with
an internal IOMMU also exists, and then the mdev groups are *not*
proxies but true groups independent of the parent device.  Which means
that the group structure of mdevs can vary, which is an argument *for*
keeping it, not against.

> > properly isolated from its parent PCI device, then both the mdev
> > virtual device *and* the physical PCI device are in the same group.
> > Groups including devices of different types and on different buses
> > were considered from the start, and are precedented, if rare.
> 
> This is far too theoretical for me. A security broken mdev is
> functionally useless.

Is it, though?  Again, I'm talking about the case of multiple mdevs
with a single parent device (because that's the only case I was aware
of until recently).  Isolation comes from a device-internal
IOMMU... that turns out to be broken.  But if your security domain
happens to include all the mdevs on the device anyway, then you don't
care.

Are you really going to say people can't use their fancy hardware in
this mode because it has a security flaw that's not relevant to their
usecase?


And then.. there's Kirti's case.  In that case the mdev should belong
to its parent PCI device's group since that's what's providing
isolation.  But in that case the parent device can be in a
multi-device group for any of the usual reasons (PCIe-to-PCI bridge,
PCIe switch with broken ACS, multifunction device with crosstalk).
Which means the mdev also shares a group with those other device.  So
again, the group structure matters and is not a fiction.

> We don't need to support it, and we don't need complicated software to
> model it.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Saturday, May 29, 2021 12:23 AM
> >
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
> 
> Is there an advantage to moving software nesting into the kernel?
> We could just have the guest do its usual combined map/unmap on the child
> fd
> 

There are at least two intended usages:

1) From previous discussion it looks like PPC's window-based scheme can be
better supported with software nesting (a shared IOVA address space as the
parent, shared by all devices, nested by multiple windows as the children,
per-device);

2) Some mdev drivers (e.g. kvmgt) may want to do write-protection on
guest data structures (base address programmed into a mediated MMIO
register). The base address is an IOVA while the KVM page-tracking API is
based on GPA; nesting allows finding the GPA according to the IOVA.
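
A rough sketch of usage 2), with invented helper names:

	/* the guest programs an IOVA into the mediated MMIO register */
	u64 iova = vgpu_read_mediated_reg(vgpu);          /* hypothetical */
	u64 gpa;

	/* nesting lets the kernel resolve IOVA -> GPA via the parent */
	if (ioasid_iova_to_gpa(giova_ioasid, iova, &gpa)) /* hypothetical */
		return -EFAULT;

	/* gpa can now be fed to KVM's page-tracking to write-protect it */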

Thanks
Kevin


Re: Regression 5.12.0-rc4 net: ice: significant throughput drop

2021-06-01 Thread Daniel Borkmann

[ ping Robin / Joerg, +Cc Christoph ]

On 5/28/21 10:34 AM, Jussi Maki wrote:

Hi all,

While measuring the impact of a kernel patch on our lab machines I stumbled upon
a performance regression affecting the 100Gbit ICE nic and bisected it
from range v5.11.1..v5.13-rc3 to the commit:
a250c23f15c2 iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

Both recent bpf-next (d6a6a55518) and linux-stable (c4681547bc) are
affected by the issue.

The regression shows as a significant drop in throughput as measured
with "super_netperf" [0],
with measured bandwidth of ~95Gbps before and ~35Gbps after:

commit 3189713a1b84 (a250c23^):
$ ./super_netperf 32 -H 172.18.0.2 -l 10
97379.8

commit a250c23f15c2:
$ ./super_netperf 32 -H 172.18.0.2 -l 10
34097.5

The pair of test machines have this hardware:
CPU: AMD Ryzen 9 3950X 16-Core Processor
MB: X570 AORUS MASTER
0a:00.0 Ethernet controller [0200]: Intel Corporation Ethernet
Controller E810-C for QSFP [8086:1592] (rev 02)
Kernel config: https://gist.github.com/joamaki/9ee11294c78a8dd2776041f67e5620e7

[0] https://github.com/borkmann/stuff/blob/master/super_netperf





Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Shenming Lu
On 2021/6/1 13:10, Lu Baolu wrote:
> Hi Shenming,
> 
> On 6/1/21 12:31 PM, Shenming Lu wrote:
>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>> /dev/ioasid provides a unified interface for managing I/O page tables for
>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>> etc.) are expected to use this interface instead of creating their own 
>>> logic to
>>> isolate untrusted device DMAs initiated by userspace.
>>>
>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>> with VFIO as example in typical usages. The driver-facing kernel API 
>>> provided
>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>> made on this uAPI.
>>>
>>> It's based on a lengthy discussion starting from here:
>>> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/
>>>
>>> It ends up to be a long writing due to many things to be summarized and
>>> non-trivial effort required to connect them into a complete proposal.
>>> Hope it provides a clean base to converge.
>>>
>>
>> [..]
>>
>>>
>>> /*
>>>    * Page fault report and response
>>>    *
>>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>    * the user and an ioctl to complete the fault.
>>>    *
>>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>    */
>>
>> Hi,
>>
>> It seems that the ioasid has different usages in different situations: it could
>> be directly used in the physical routing, or be just a virtual handle that
>> indicates
>> a page table or a vPASID table (such as the GPA address space, in the simple
>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>> Substream ID), right?
>>
>> And Baolu suggested that since one device might consume multiple page tables,
>> it's more reasonable to have one fault handler per page table. By this, do we
>> have to maintain such an ioasid info list in the IOMMU layer?
> 
> As discussed earlier, the I/O page fault and cache invalidation paths
> will have "device labels" so that the information could be easily
> translated and routed.
> 
> So it's likely the per-device fault handler registering API in iommu
> core can be kept, but /dev/ioasid will be grown with a layer to
> translate and propagate I/O page fault information to the right
> consumers.

Yeah, having a general preprocessing of the faults in IOASID seems to be
a doable direction. But since there may be more than one consumer at the
same time, who is responsible for registering the per-device fault handler?

Thanks,
Shenming

> 
> If things evolve in this way, probably the SVA I/O page fault also needs
> to be ported to /dev/ioasid.
> 
>>
>> Then if we add host IOPF support (for the GPA address space) in the future
>> (I have sent a series for this but it aimed for VFIO, I will convert it for
>> IOASID later [1] :-)), how could we find the handler for the received fault
>> event which only contains a Stream ID... Do we also have to maintain a
>> dev(vPASID)->ioasid mapping in the IOMMU layer?
>>
>> [1] https://lore.kernel.org/patchwork/cover/1410223/
> 
> Best regards,
> baolu
> .

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, May 29, 2021 4:03 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > /dev/ioasid provides a unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own 
> > logic to
> > isolate untrusted device DMAs initiated by userspace.
> 
> It is very long, but I think this has turned out quite well. It
> certainly matches the basic sketch I had in my head when we were
> talking about how to create vDPA devices a few years ago.
> 
> When you get down to the operations they all seem pretty common sense
> and straightfoward. Create an IOASID. Connect to a device. Fill the
> IOASID with pages somehow. Worry about PASID labeling.
> 
> It really is critical to get all the vendor IOMMU people to go over it
> and see how their HW features map into this.
> 

Agree. BTW, I feel it might be good to have several design opens
discussed centrally after going through all the comments. Otherwise
they may be buried in different sub-threads and potentially handled with
insufficient care (especially for people who haven't completed the
reading).

I summarized five opens here, about:

1)  Finalizing the name to replace /dev/ioasid;
2)  Whether one device is allowed to bind to multiple IOASID fd's;
3)  Carry device information in invalidation/fault reporting uAPI;
4)  What should/could be specified when allocating an IOASID;
5)  The protocol between vfio group and kvm;

For 1), two alternative names are mentioned: /dev/iommu and 
/dev/ioas. I don't have a strong preference and would like to hear 
votes from all stakeholders. /dev/iommu is slightly better imho for 
two reasons. First, per AMD's presentation in last KVM forum they 
implement vIOMMU in hardware thus need to support user-managed 
domains. An iommu uAPI notation might make more sense moving 
forward. Second, it makes later uAPI naming easier as 'IOASID' can 
be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of 
IOASID_ALLOC_IOASID. :)

Another naming open is about IOASID (the software handle for ioas) 
and the associated hardware ID (PASID or substream ID). Jason thought
PASID is defined more from the SVA angle while ARM's convention sounds
clearer from the device p.o.v. Following this direction, SID/SSID will be
used to replace RID/PASID in this RFC (possibly also implying that
the kernel IOASID allocator should be renamed to an SSID allocator).
I don't have a better alternative. If no one objects, I'll change to this
new naming in the next version.

For 2), Jason prefers not to block it if there is no kernel design reason.
If one device is allowed to bind multiple IOASID fd's, the main problem
is cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
and giova_ioasid created in fd2 and then nesting them together (and
whether any cross-fd notification is required when handling invalidation
etc.). We thought that this just adds some complexity while we are not
sure about the value of supporting it (when one fd can already afford all
discussed usages). Therefore this RFC proposes that a device be bound
to at most one IOASID fd. Does this rationale make sense?

At the other end there was also the thought of whether we should make
it a single I/O address space per IOASID fd. It was discussed in a previous
thread that #fds are insufficient to afford the theoretical 1M address
spaces per device. But let's revisit this and draw a clear conclusion on
whether this option is viable.

For 3), Jason/Jean both think it's cleaner to carry device info in the 
uAPI. Actually this was one option we developed in earlier internal
versions of this RFC. Later on we changed it to the current way based
on a misinterpretation of previous discussion. Thinking more, we will
adopt this suggestion in the next version, due to both efficiency (I/O page
fault is already a long path) and security reasons (some faults are
unrecoverable, thus the faulting device must be identified/isolated).

This implies that VFIO_BOUND_IOASID will be extended to allow user
specify a device label. This label will be recorded in /dev/iommu to
serve per-device invalidation request from and report per-device 
fault data to the user. In addition, vPASID (if provided by user) will
be also recorded in /dev/iommu so vPASID<->pPASID conversion 
is conducted properly. e.g. invalidation request from user carries
a vPASID which must be converted into pPASID before calling iommu
driver. Vice versa for raw fault data which carries pPASID while the
user expects a vPASID.
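
For illustration, the record behind this could be as simple as
(hypothetical sketch):

	/* hypothetical per-device record kept by /dev/iommu */
	struct iommu_dev_label {
		u32 label;   /* device label supplied at bind time */
		u32 vpasid;  /* PASID as seen by the user/guest */
		u32 ppasid;  /* PASID actually programmed into hardware */
	};

	/* invalidation: user supplies vPASID, kernel converts to pPASID */
	/* fault report: hardware supplies pPASID, kernel reports vPASID  */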

For 4), there are two options for specifying the IOASID attributes:

In this RFC, an IOASID has no attributes before it's attached to any
device. After device attach, the user queries capability/format info
about the IOMMU to which the device belongs, and then calls
different ioctl commands to set the attributes for an IOASID (e.g.
map/unmap, bind/unbind 

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Tian, Kevin
> From: Jason Wang
> Sent: Tuesday, June 1, 2021 2:07 PM
> 
> On 2021/6/1 1:42 PM, Tian, Kevin wrote:
> >> From: Jason Wang
> >> Sent: Tuesday, June 1, 2021 1:30 PM
> >>
> >> On 2021/6/1 1:23 PM, Lu Baolu wrote:
> >>> Hi Jason W,
> >>>
> >>> On 6/1/21 1:08 PM, Jason Wang wrote:
> >> 2) If yes, what's the reason for not simply use the fd opened from
> >> /dev/ioas. (This is the question that is not answered) and what
> >> happens
> >> if we call GET_INFO for the ioasid_fd?
> >> 3) If not, how GET_INFO work?
> > oh, missed this question in prior reply. Personally, no special reason
> > yet. But using ID may give us opportunity to customize the
> management
> > of the handle. For one, better lookup efficiency by using xarray to
> > store the allocated IDs. For two, could categorize the allocated IDs
> > (parent or nested). GET_INFO just works with an input FD and an ID.
> 
>  I'm not sure I get this, for nesting cases you can still make the
>  child an fd.
> 
>  And a question still, under what case we need to create multiple
>  ioasids on a single ioasid fd?
> >>> One possible situation where multiple IOASIDs per FD could be used is
> >>> that devices with different underlying IOMMU capabilities are sharing a
> >>> single FD. In this case, only devices with consistent underlying IOMMU
> >>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
> >>> be applied.
> >>>
> >>> Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple
> >>> IOASID FDs" for such a case.
> >>
> >> Right, that's exactly my question. The latter seems much easier to
> >> understand and implement.
> >>
> > A simple reason discussed in the previous thread - there could be 1M
> > I/O address spaces per device while #FDs are a precious resource.
> 
> 
> Is the concern for ulimit or performance? Note that we had
> 
> #define NR_OPEN_MAX ~0U
> 
> And with the fd semantic, you can do a lot of other stuff: close on
> exec, passing via SCM_RIGHTS.

yes, fd has its merits.

> 
> For the case of 1M, I would like to know what's the use case for a
> single process to handle 1M+ address spaces?

This single process is Qemu with an assigned device. Within the guest 
there could be many guest processes. Though in reality I haven't seen
such 1M processes on a single device; better not to restrict it in the uAPI?

> 
> 
> > So this RFC treats the fd as a container of address spaces, each tagged
> > by an IOASID.
> 
> 
> If the container and address space is 1:1 then the container seems useless.
> 

yes, with 1:1 the container is useless. But here it's assumed 1:M, and then
even a single fd is sufficient for all intended usages.

Thanks
Kevin

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Wang


On 2021/6/1 1:42 PM, Tian, Kevin wrote:

From: Jason Wang
Sent: Tuesday, June 1, 2021 1:30 PM

On 2021/6/1 1:23 PM, Lu Baolu wrote:

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:

2) If yes, what's the reason for not simply use the fd opened from
/dev/ioas. (This is the question that is not answered) and what
happens
if we call GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?

oh, missed this question in prior reply. Personally, no special reason
yet. But using ID may give us opportunity to customize the management
of the handle. For one, better lookup efficiency by using xarray to
store the allocated IDs. For two, could categorize the allocated IDs
(parent or nested). GET_INFO just works with an input FD and an ID.


I'm not sure I get this, for nesting cases you can still make the
child an fd.

And a question still, under what case we need to create multiple
ioasids on a single ioasid fd?

One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple
IOASID FDs" for such a case.


Right, that's exactly my question. The latter seems much easier to
understand and implement.


A simple reason discussed in the previous thread - there could be 1M
I/O address spaces per device while #FDs are a precious resource.



Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with the fd semantic, you can do a lot of other stuff: close on
exec, passing via SCM_RIGHTS.


For the case of 1M, I would like to know what's the use case for a 
single process to handle 1M+ address spaces?




So this RFC treats the fd as a container of address spaces, each tagged
by an IOASID.



If the container and address space is 1:1 then the container seems useless.

Thanks




Thanks
Kevin

