Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>
>> The drivers register per page table fault handlers to /dev/ioasid which
>> will then register itself to iommu core to listen and route the per-
>> device I/O page faults.
>
> I'm still confused why drivers need fault handlers at all?

Essentially it is userspace that needs the fault handlers. One case is to deliver the faults to the vIOMMU, and another case is to enable IOPF on the GPA address space for on-demand paging. It seems that both could be specified in/through the IOASID_ALLOC ioctl?

Thanks,
Shenming

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 0/7] iommu: Allow IOVA rcache range be configured
On 6/1/21 10:29 PM, John Garry wrote:
> For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
> exceeds the IOVA rcache upper limit (meaning that they are not cached),
> performance can be reduced. This is much more pronounced from commit
> 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
> fails"), as discussed at [0].
>
> IOVAs which cannot be cached are highly involved in the IOVA ageing issue,
> as discussed at [1].
>
> This series allows the IOVA rcache range to be configured, so that we may
> cache all IOVAs per domain, thus improving performance.
>
> A new IOMMU group sysfs file is added - max_opt_dma_size - which is used
> indirectly to configure the IOVA rcache range:
> /sys/kernel/iommu_groups/X/max_opt_dma_size
>
> This file is updated the same way as the IOMMU group default domain type,
> i.e. the only device in the group must be unbound first.

Could you explain why it requires a singleton group and driver unbinding if the user only wants to increase the upper limit? I haven't dived into the details yet, sorry if this is a silly question.

Best regards,
baolu
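For reference, the configuration sequence described in the cover letter would look roughly like the following. The group index, device address, and driver name are purely illustrative, and the size value is just an example:

```shell
# Hypothetical example: IOMMU group 5 contains only device 0000:81:00.0,
# currently bound to the nvme driver.

# Unbind the (only) device in the group first, as when changing the
# group's default domain type:
echo 0000:81:00.0 > /sys/bus/pci/drivers/nvme/unbind

# Raise the rcache upper limit so that e.g. 1 MB streaming mappings
# are also cached:
echo 1048576 > /sys/kernel/iommu_groups/5/max_opt_dma_size

# Rebind so the domain is rebuilt with the new rcache range:
echo 0000:81:00.0 > /sys/bus/pci/drivers/nvme/bind
```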
Re: [RFC] /dev/ioasid uAPI proposal
On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
>> This version only covers 1) and 4). Do you think we need to support
>> 2), 3) and beyond?
>
> Yes absolutely. The API should be flexible enough to specify the
> creation of all future page table formats we'd want to have and all HW
> specific details on those formats.

OK, let's stay on the same line of thought.

>> If so, it seems that we need some in-kernel helpers and uAPIs to
>> support pre-installing a page table to IOASID.
>
> Not sure what this means..

Sorry that I didn't make this clear. Let me bring back the page table types as I see them:

1) IOMMU format page table (a.k.a. iommu_domain)
2) user application CPU page table (SVA for example)
3) KVM EPT (future option)
4) VM guest managed page table (nesting mode)

Each type of page table should be able to be associated with its IOASID. We have a BIND protocol for 4); we explicitly allocate an iommu_domain for 1). But we don't have a clear definition for 2), 3) and others. I think it's necessary to clearly define a time point and kAPI name between IOASID_ALLOC and IOASID_ATTACH, so that other modules have the opportunity to associate their page table with the allocated IOASID before attaching the page table to the real IOMMU hardware.

I/O page fault handling is similar. The provider of the page table should take the responsibility to handle the possible page faults. Could this answer the question of "I'm still confused why drivers need fault handlers at all?" in the thread below?

https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/#m15def9e8b236dfcf97e21c8e9f8a58da214e3691

>> From this point of view an IOASID is actually not just a variant of
>> iommu_domain, but an I/O page table representation in a broader sense.
>
> Yes, and things need to evolve in a staged way.
>
> The ioctl API should have room for this growth but you need to start
> out with something constrained enough to actually implement, then
> figure out how to grow from there.

Yes, agreed. I just think about it from the perspective of a design document.

Best regards,
baolu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Alex Williamson
> Sent: Wednesday, June 2, 2021 6:22 AM
>
> On Tue, 1 Jun 2021 07:01:57 +0000
> "Tian, Kevin" wrote:
> >
> > I summarized five opens here, about:
> >
> > 1) Finalizing the name to replace /dev/ioasid;
> > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > 3) Carry device information in invalidation/fault reporting uAPI;
> > 4) What should/could be specified when allocating an IOASID;
> > 5) The protocol between vfio group and kvm;
> >
> ...
> >
> > For 5), I'd expect Alex to chime in. Per my understanding, the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
>
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is

invlpg -> wbinvd :)

> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.

So the symbol_get thing is not about the protocol itself. Whatever protocol is defined, as long as kvm needs to call a vfio or ioasid helper function, we need to define a proper way to do it. Jason, what's your opinion on an alternative option, since you dislike symbol_get?

> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
>
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.
>
> > Because KVM deduces some policy based on the fact of assigned devices,
> > it needs to hold a reference to the related vfio group. This part is
> > irrelevant to this RFC.
>
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.

One open is whether kvm should hold a device reference or an IOASID reference. For DMA coherence, it only matters whether assigned devices are coherent or not (not a specific address space). For kvmgt, it is about recording the kvm pointer in the mdev driver to do write protection. For ppc, it does relate to a specific I/O page table. Then I feel only part of the protocol will be moved to /dev/ioasid and something will still remain between kvm and vfio?

> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. It also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_group.
> > I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
>
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
> to know how to handle those for in-kernel acceleration. Thanks,

ok.

Thanks
Kevin
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe
> Sent: Wednesday, June 2, 2021 1:57 AM
>
> On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be
> > > > acquired w/o going through the legacy container/group interface.
> > > > For illustration purposes those devices are just called dev[1...N]:
> > > >
> > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended
> > > > use cases:
> > > >
> > > > ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity the examples below are all made for the
> > > > virtualization story. They are representative and could be easily
> > > > adapted to a non-virtualization scenario.
> > >
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> >
> > Jason, want to confirm here. Per earlier discussion we retain the
> > impression that you want VFIO to be a pure device driver, with
> > container/group used only for legacy applications.
>
> Let me call this a "nice wish".
>
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.

OK, got your point. So let's start by keeping this room. New sub-systems like vDPA don't need to invent a group fd uAPI and can just leave their users to meet the group limitation. An existing sub-system, i.e. VFIO, could keep a stronger group enforcement uAPI like today. One day, we may revisit this if the simple policy works well for all other new sub-systems.

> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

Yes, this is guaranteed. The /dev/ioasid uAPI has no group concept.

> > From this comment are you suggesting that VFIO can still keep the
> > container/group concepts and the user just deprecates the use of the
> > vfio iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has
> > a simple policy that an IOASID will reject cmds if a partially-attached
> > group exists)?
>
> I would say no on the container. /dev/ioasid == the container; having
> two competing objects at once in a single process is just a mess.
>
> If the group fd can be kept it requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

yes

> I didn't try to chart this out carefully.
>
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface,
> so it would be quite reasonable to harmonize these somehow.

Possibly multiple IOASIDs, as a VFIO container may hold incompatible devices today. Suppose helper functions will be provided for the VFIO container to create an IOASID and then use map/unmap to manage its I/O page table. This is the shim iommu driver concept in the previous discussion between you and Alex. This can be done at a later stage. Let's focus on the /dev/ioasid uAPI, and bear some code duplication between it and vfio type1 for now.

> But that is too complicated and far out for me at least to guess on at
> this point..

We're working on a prototype in parallel with this discussion. Based on this work we'll figure out what's the best way to start.

Thanks
Kevin
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe
> Sent: Wednesday, June 2, 2021 1:42 AM
>
> On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Saturday, May 29, 2021 1:36 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > >
> > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > software nesting. With hardware support the child and parent I/O
> > > > page tables are walked consecutively by the IOMMU to form a nested
> > > > translation. When it's implemented in software, the ioasid driver
> > > > is responsible for merging the two-level mappings into a
> > > > single-level shadow I/O page table. Software nesting requires both
> > > > child/parent page tables to be operated through the dma mapping
> > > > protocol, so any change in either level can be captured by the
> > > > kernel to update the corresponding shadow mapping.
> > >
> > > Why? A SW emulation could do this synchronization during invalidation
> > > processing if the invalidation contained an IOVA range.
> >
> > In this proposal we differentiate between host-managed and user-
> > managed I/O page tables. If host-managed, the user is expected to use
> > map/unmap cmds explicitly upon any change required on the page table.
> > If user-managed, the user first binds its page table to the IOMMU and
> > then uses invalidation cmds to flush the iotlb when necessary (e.g.
> > typically not required when changing a PTE from non-present to
> > present).
> >
> > We expect the user to use map+unmap and bind+invalidate respectively
> > instead of mixing them together. Following this policy, map+unmap
> > must be used in both levels for software nesting, so changes in either
> > level are captured timely to synchronize the shadow mapping.
>
> map+unmap or bind+invalidate is a policy of the IOASID itself, set when
> it is created. If you put two different types in a tree then each IOASID
> must continue to use its own operation mode.
>
> I don't see a reason to force all IOASIDs in a tree to be consistent??

Only for software nesting. With hardware support the parent uses map while the child uses bind. Yes, the policy is specified per IOASID. But if the policy violates the requirement of a specific nesting mode, then nesting should fail.

> A software emulated two level page table where the leaf level is a
> bound page table in guest memory should continue to use
> bind/invalidate to maintain the guest page table IOASID even though it
> is a SW construct.

With software nesting the leaf should be a host-managed page table (or metadata). A bind/invalidate protocol doesn't require the user to notify the kernel of every page table change. But for software nesting the kernel must know every change to timely update the shadow/merged mapping, otherwise DMA may hit a stale mapping.

> The GPA level should use map/unmap because it is a kernel owned page
> table

Yes, this is always true.

> Though how to efficiently mix map/unmap on the GPA when there are SW
> nested levels below it looks to be quite challenging.

Thanks
Kevin
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe
> Sent: Wednesday, June 2, 2021 4:29 AM
>
> On Tue, Jun 01, 2021 at 07:01:57AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Saturday, May 29, 2021 4:03 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > /dev/ioasid provides a unified interface for managing I/O page
> > > > tables for devices assigned to userspace. Device passthrough
> > > > frameworks (VFIO, vDPA, etc.) are expected to use this interface
> > > > instead of creating their own logic to isolate untrusted device
> > > > DMAs initiated by userspace.
> > >
> > > It is very long, but I think this has turned out quite well. It
> > > certainly matches the basic sketch I had in my head when we were
> > > talking about how to create vDPA devices a few years ago.
> > >
> > > When you get down to the operations they all seem pretty common
> > > sense and straightforward. Create an IOASID. Connect to a device.
> > > Fill the IOASID with pages somehow. Worry about PASID labeling.
> > >
> > > It really is critical to get all the vendor IOMMU people to go over
> > > it and see how their HW features map into this.
> >
> > Agree. btw I feel it might be good to have several design opens
> > centrally discussed after going through all the comments. Otherwise
> > they may be buried in different sub-threads and potentially get
> > insufficient care (especially from people who haven't completed the
> > reading).
> >
> > I summarized five opens here, about:
> >
> > 1) Finalizing the name to replace /dev/ioasid;
> > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > 3) Carry device information in invalidation/fault reporting uAPI;
> > 4) What should/could be specified when allocating an IOASID;
> > 5) The protocol between vfio group and kvm;
> >
> > For 1), two alternative names are mentioned: /dev/iommu and
> > /dev/ioas. I don't have a strong preference and would like to hear
> > votes from all stakeholders. /dev/iommu is slightly better imho for
> > two reasons. First, per AMD's presentation at the last KVM forum they
> > implement vIOMMU in hardware thus need to support user-managed
> > domains. An iommu uAPI notation might make more sense moving
> > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > always be put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > IOASID_ALLOC_IOASID. :)
>
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time. We've also talked about this as /dev/sva for a while and
> now /dev/ioasid.
>
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
>
> Then we don't have naming aliasing with kernel constructs.
>
> > For 2), Jason prefers not to block it absent a kernel design reason.
> > If one device is allowed to bind multiple IOASID fd's, the main
> > problem is about cross-fd IOASID nesting, e.g. having gpa_ioasid
> > created in fd1 and giova_ioasid created in fd2 and then nesting them
> > together (and
>
> Huh? This can't happen
>
> Creating an IOASID is an operation on the /dev/ioasid FD. We won't
> provide APIs to create a tree of IOASIDs outside a single FD container.
>
> If a device can consume multiple IOASIDs it doesn't care how many or
> what /dev/ioasid FDs they come from.

OK, this implies that if one user inadvertently creates an intended parent/child via different fd's then the operation will simply fail. More specifically, take ARM's case for example. There is only a single 2nd-level I/O page table per device (nested by multiple 1st-level tables). Say the user already creates a gpa_ioasid for a device via fd1. Now he binds the device to fd2, intending to enable vSVA which requires nested translation and thus needs to create a parent via fd2. This parent creation will simply fail in the IOMMU layer because the 2nd-level (via fd1) is already installed for this device.

> > To the other end there was also the thought whether we should make
> > a single I/O address space per IOASID fd. This was discussed in a
> > previous thread; #fd's are insufficient to afford theoretical 1M's
> > of address spaces per device. But let's have another revisit and draw
> > a clear conclusion on whether this option is viable.
>
> I had remarks on this, I think per-fd doesn't work
>
> > This implies that VFIO_BOUND_IOASID will be extended to allow the
> > user to specify a device label. This label will be recorded in
> > /dev/iommu to serve per-device invalidation requests from and report
> > per-device fault data to the user.
>
> I wonder which of the user providing a 64 bit cookie or the kernel
> returning a small IDA is the best choice here? Both have merits
> depending on what qemu needs..

Yes, either way can work. I don't have a strong preference. Jean?

> > In addition, the vPASID (if provided by the user) will
> > also be recorded in /dev/iommu so vPASID<->pPASID conversion
> > is conducted properly, e.g. an invalidation request from the user
> > carries a vPASID
Re: [RFC PATCH V3 09/11] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM
On 5/30/21 11:06 AM, Tianyu Lan wrote:
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
>
>  IOMMU_INIT_FINISH(2,
> -		  NULL,
> +		  hyperv_swiotlb_detect,
> 		  pci_xen_swiotlb_init,
> 		  NULL);

Could you explain this change?

-boris
[PATCH v4 4/6] iommu/arm-smmu-qcom: Add stall support
From: Rob Clark

Add, via the adreno-smmu-priv interface, a way for the GPU to request
the SMMU to stall translation on faults, and then later resume the
translation, either retrying or terminating the current translation.

This will be used on the GPU side to "freeze" the GPU while we snapshot
useful state for devcoredump.

Signed-off-by: Rob Clark
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++
 include/linux/adreno-smmu-priv.h           |  7 +
 2 files changed, 40 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index b2e31ea84128..61fc645c1325 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -13,6 +13,7 @@ struct qcom_smmu {
 	struct arm_smmu_device smmu;
 	bool bypass_quirk;
 	u8 bypass_cbndx;
+	u32 stall_enabled;
 };
 
 static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
@@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
 static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 		u32 reg)
 {
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
+
 	/*
 	 * On the GPU device we want to process subsequent transactions after a
 	 * fault to keep the GPU from hanging
 	 */
 	reg |= ARM_SMMU_SCTLR_HUPCF;
 
+	if (qsmmu->stall_enabled & BIT(idx))
+		reg |= ARM_SMMU_SCTLR_CFCFG;
+
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
@@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
 	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
 }
 
+static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
+
+	if (enabled)
+		qsmmu->stall_enabled |= BIT(cfg->cbndx);
+	else
+		qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
+}
+
+static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	u32 reg = 0;
+
+	if (terminate)
+		reg |= ARM_SMMU_RESUME_TERMINATE;
+
+	arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
 	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
+	priv->set_stall = qcom_adreno_smmu_set_stall;
+	priv->resume_translation = qcom_adreno_smmu_resume_translation;
 
 	return 0;
 }
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index 53fe32fb9214..c637e0997f6d 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
  *                 TTBR0 translation is enabled with the specified cfg
  * @get_fault_info: Called by the GPU fault handler to get information about
  *                  the fault
+ * @set_stall:     Configure whether stall on fault (CFCFG) is enabled. Call
+ *                 before set_ttbr0_cfg(). If stalling on fault is enabled,
+ *                 the GPU driver must call resume_translation()
+ * @resume_translation: Resume translation after a fault
+ *
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance. This is by necessity, as the GPU is directly
@@ -60,6 +65,8 @@ struct adreno_smmu_priv {
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
     void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
+    void (*set_stall)(const void *cookie, bool enabled);
+    void (*resume_translation)(const void *cookie, bool terminate);
 };
 
 #endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1
[PATCH v4 2/6] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
From: Jordan Crouse

Add a callback in adreno-smmu-priv to read interesting SMMU registers
to provide an opportunity for a richer debug experience in the GPU
driver.

Signed-off-by: Jordan Crouse
Signed-off-by: Rob Clark
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17
 drivers/iommu/arm/arm-smmu/arm-smmu.h      |  2 ++
 include/linux/adreno-smmu-priv.h           | 31 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..b2e31ea84128 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
+static void qcom_adreno_smmu_get_fault_info(const void *cookie,
+		struct adreno_smmu_fault_info *info)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
+	info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
+	info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
+	info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
+	info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
+	info->ttbr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
+	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->cookie = smmu_domain;
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
+	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
 
 	return 0;
 }
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index c31a59d35c64..84c21c4b0691 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
 #define ARM_SMMU_CB_FSYNR0		0x68
 #define ARM_SMMU_FSYNR0_WNR		BIT(4)
 
+#define ARM_SMMU_CB_FSYNR1		0x6c
+
 #define ARM_SMMU_CB_S1_TLBIVA		0x600
 #define ARM_SMMU_CB_S1_TLBIASID		0x610
 #define ARM_SMMU_CB_S1_TLBIVAL		0x620
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index a889f28afb42..53fe32fb9214 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -8,6 +8,32 @@
 
 #include <linux/io-pgtable.h>
 
+/**
+ * struct adreno_smmu_fault_info - container for key fault information
+ *
+ * @far: The faulting IOVA from ARM_SMMU_CB_FAR
+ * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
+ * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
+ * @fsr: The fault status from ARM_SMMU_CB_FSR
+ * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
+ * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR1
+ * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
+ *
+ * This struct passes back key page fault information to the GPU driver
+ * through the get_fault_info function pointer.  The GPU driver can use
+ * this information to print informative log messages and provide deeper
+ * GPU specific insight into the fault.
+ */
+struct adreno_smmu_fault_info {
+	u64 far;
+	u64 ttbr0;
+	u32 contextidr;
+	u32 fsr;
+	u32 fsynr0;
+	u32 fsynr1;
+	u32 cbfrsynra;
+};
+
 /**
  * struct adreno_smmu_priv - private interface between adreno-smmu and GPU
  *
@@ -17,6 +43,8 @@
  * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank. A
  *                 NULL config disables TTBR0 translation, otherwise
  *                 TTBR0 translation is enabled with the specified cfg
+ * @get_fault_info: Called by the GPU fault handler to get information about
+ *                  the fault
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance. This is by necessity, as the GPU is directly
@@ -31,6 +59,7 @@ struct adreno_smmu_priv {
     const void *cookie;
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
+    void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
 };
 
-#endif /* __ADRENO_SMMU_PRIV_H */
\ No newline at end of file
+#endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1
[PATCH v4 1/6] iommu/arm-smmu: Add support for driver IOMMU fault handlers
From: Jordan Crouse

Call report_iommu_fault() to allow upper-level drivers to register
their own fault handlers.

Signed-off-by: Jordan Crouse
Signed-off-by: Rob Clark
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..b4b32d31fc06 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	int idx = smmu_domain->cfg.cbndx;
+	int ret;
 
 	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
 	if (!(fsr & ARM_SMMU_FSR_FAULT))
@@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
 	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
 
-	dev_err_ratelimited(smmu->dev,
-	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
+	ret = report_iommu_fault(domain, NULL, iova,
+		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+
+	if (ret == -ENOSYS)
+		dev_err_ratelimited(smmu->dev,
+		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
 			fsr, iova, fsynr, cbfrsynra, idx);
 
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
-- 
2.31.1
[PATCH v4 0/6] iommu/arm-smmu: adreno-smmu page fault handling
From: Rob Clark

This picks up an earlier series[1] from Jordan, and adds additional
support needed to generate GPU devcore dumps on iova faults.

Original description:

This is a stack to add an Adreno GPU specific handler for pagefaults.
The first patch starts by wiring up report_iommu_fault for arm-smmu.
The next patch adds an adreno-smmu-priv function hook to capture a
handful of important debugging registers such as TTBR0, CONTEXTIDR,
FSYNR0 and others. This is used by the third patch to print more
detailed information on page faults, such as the TTBR0 for the
pagetable that caused the fault and the source of the fault as
determined by a combination of the FSYNR1 register and an internal GPU
register.

This code provides a solid base that we can expand on later for even
more extensive GPU side page fault debugging capabilities.

v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
    resume translation after it has had a chance to snapshot the GPU's
    state
v3: Always clear FSR even if the target driver is going to handle resume
v2: Fix comment wording and function pointer check per Rob Clark

[1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcro...@codeaurora.org/

Jordan Crouse (3):
  iommu/arm-smmu: Add support for driver IOMMU fault handlers
  iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
    info
  drm/msm: Improve the a6xx page fault handler

Rob Clark (3):
  iommu/arm-smmu-qcom: Add stall support
  drm/msm: Add crashdump support for stalled SMMU
  drm/msm: devcoredump iommu fault support

 drivers/gpu/drm/msm/adreno/a2xx_gpu.c       |   2 +-
 drivers/gpu/drm/msm/adreno/a3xx_gpu.c       |   2 +-
 drivers/gpu/drm/msm/adreno/a4xx_gpu.c       |   2 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |   9 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 101 +++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.h       |   2 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  43 +++--
 drivers/gpu/drm/msm/adreno/adreno_gpu.c     |  15 +++
 drivers/gpu/drm/msm/msm_debugfs.c           |   2 +-
 drivers/gpu/drm/msm/msm_gem.h               |   1 +
 drivers/gpu/drm/msm/msm_gem_submit.c        |   1 +
 drivers/gpu/drm/msm/msm_gpu.c               |  55 ++-
 drivers/gpu/drm/msm/msm_gpu.h               |  19 +++-
 drivers/gpu/drm/msm/msm_gpummu.c            |   5 +
 drivers/gpu/drm/msm/msm_iommu.c             |  22 -
 drivers/gpu/drm/msm/msm_mmu.h               |   5 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |   9 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.h       |   2 +
 include/linux/adreno-smmu-priv.h            |  38 +++-
 20 files changed, 354 insertions(+), 31 deletions(-)

-- 
2.31.1
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, 1 Jun 2021 07:01:57 + "Tian, Kevin" wrote: > > I summarized five opens here, about: > > 1) Finalizing the name to replace /dev/ioasid; > 2) Whether one device is allowed to bind to multiple IOASID fd's; > 3) Carry device information in invalidation/fault reporting uAPI; > 4) What should/could be specified when allocating an IOASID; > 5) The protocol between vfio group and kvm; > ... > > For 5), I'd expect Alex to chime in. Per my understanding looks the > original purpose of this protocol is not about I/O address space. It's > for KVM to know whether any device is assigned to this VM and then > do something special (e.g. posted interrupt, EPT cache attribute, etc.). Right, the original use case was for KVM to determine whether it needs to emulate invlpg, so it needs to be aware when an assigned device is present and be able to test if DMA for that device is cache coherent. The user, QEMU, creates a KVM "pseudo" device representing the vfio group, providing the file descriptor of that group to show ownership. The ugly symbol_get code is to avoid hard module dependencies, ie. the kvm module should not pull in or require the vfio module, but vfio will be present if attempting to register this device. With kvmgt, the interface also became a way to register the kvm pointer with vfio for the translation mentioned elsewhere in this thread. The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU page table so that it can handle iotlb programming from pre-registered memory without trapping out to userspace. > Because KVM deduces some policy based on the fact of assigned device, > it needs to hold a reference to related vfio group. this part is irrelevant > to this RFC. All of these use cases are related to the IOMMU, whether DMA is coherent, translating device IOVA to GPA, and an acceleration path to emulate IOMMU programming in kernel... they seem pretty relevant. > But ARM's VMID usage is related to I/O address space thus needs some > consideration. 
Another strange thing is about PPC. Looks it also leverages > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_ > group. I don't know why it's done through KVM instead of VFIO uAPI in > the first place. AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs to know how to handle those for in-kernel acceleration. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 07:01:57AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Saturday, May 29, 2021 4:03 AM > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > /dev/ioasid provides an unified interface for managing I/O page tables for > > > devices assigned to userspace. Device passthrough frameworks (VFIO, > > vDPA, > > > etc.) are expected to use this interface instead of creating their own > > > logic to > > > isolate untrusted device DMAs initiated by userspace. > > > > It is very long, but I think this has turned out quite well. It > > certainly matches the basic sketch I had in my head when we were > > talking about how to create vDPA devices a few years ago. > > > > When you get down to the operations they all seem pretty common sense > > and straightfoward. Create an IOASID. Connect to a device. Fill the > > IOASID with pages somehow. Worry about PASID labeling. > > > > It really is critical to get all the vendor IOMMU people to go over it > > and see how their HW features map into this. > > > > Agree. btw I feel it might be good to have several design opens > centrally discussed after going through all the comments. Otherwise > they may be buried in different sub-threads and potentially with > insufficient care (especially for people who haven't completed the > reading). > > I summarized five opens here, about: > > 1) Finalizing the name to replace /dev/ioasid; > 2) Whether one device is allowed to bind to multiple IOASID fd's; > 3) Carry device information in invalidation/fault reporting uAPI; > 4) What should/could be specified when allocating an IOASID; > 5) The protocol between vfio group and kvm; > > For 1), two alternative names are mentioned: /dev/iommu and > /dev/ioas. I don't have a strong preference and would like to hear > votes from all stakeholders. /dev/iommu is slightly better imho for > two reasons. 
> First, per AMD's presentation in last KVM forum they
> implement vIOMMU in hardware thus need to support user-managed
> domains. An iommu uAPI notation might make more sense moving
> forward. Second, it makes later uAPI naming easier as 'IOASID' can
> be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> IOASID_ALLOC_IOASID. :)

I think two years ago I suggested /dev/iommu and it didn't go very far at the time. We've also talked about this as /dev/sva for a while, and now /dev/ioasid.

I think /dev/iommu is fine, and call the things inside them IOAS objects. Then we don't have naming aliasing with kernel constructs.

> For 2), Jason prefers to not blocking it if no kernel design reason. If
> one device is allowed to bind multiple IOASID fd's, the main problem
> is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
> and giova_ioasid created in fd2 and then nesting them together (and

Huh? This can't happen.

Creating an IOASID is an operation on the /dev/ioasid FD. We won't provide APIs to create a tree of IOASIDs outside a single FD container.

If a device can consume multiple IOASIDs it doesn't care how many or what /dev/ioasid FDs they come from.

> To the other end there was also thought whether we should make
> a single I/O address space per IOASID fd. This was discussed in previous
> thread that #fd's are insufficient to afford theoretical 1M's address
> spaces per device. But let's have another revisit and draw a clear
> conclusion whether this option is viable.

I had remarks on this, I think per-fd doesn't work.

> This implies that VFIO_BOUND_IOASID will be extended to allow user
> specify a device label. This label will be recorded in /dev/iommu to
> serve per-device invalidation request from and report per-device
> fault data to the user.

I wonder which of the user providing a 64-bit cookie or the kernel returning a small IDA is the best choice here? Both have merits depending on what qemu needs..
> In addition, vPASID (if provided by user) will
> be also recorded in /dev/iommu so vPASID<->pPASID conversion
> is conducted properly. e.g. invalidation request from user carries
> a vPASID which must be converted into pPASID before calling iommu
> driver. Vice versa for raw fault data which carries pPASID while the
> user expects a vPASID.

I don't think the PASID should be returned at all. It should return the IOASID number in the FD and/or a u64 cookie associated with that IOASID. Userspace should figure out what the IOASID & device combination means.

> Seems to close this design open we have to touch the kAPI design. and
> Joerg's input is highly appreciated here.

uAPI is forever, the kAPI is constantly changing. I always dislike warping the uAPI based on the current kAPI situation.

Jason
Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes
> On Jun 1, 2021, at 10:27 AM, Robin Murphy wrote: > > On 2021-06-01 17:39, Nadav Amit wrote: >>> On Jun 1, 2021, at 8:59 AM, Robin Murphy wrote: >>> >>> On 2021-05-02 07:59, Nadav Amit wrote: From: Nadav Amit Some IOMMU architectures perform invalidations regardless of the page size. In such architectures there is no need to sync when the page size changes or to regard pgsize when making interim flush in iommu_iotlb_gather_add_page(). Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide whether gather's pgsize is tracked and triggers a flush. >>> >>> I've objected before[1], and I'll readily object again ;) >>> >>> I still think it's very silly to add a bunch of indirection all over the >>> place to make a helper function not do the main thing it's intended to help >>> with. If you only need trivial address gathering then it's far simpler to >>> just implement trivial address gathering. I suppose if you really want to >>> you could factor out another helper to share the 5 lines of code which >>> ended up in mtk-iommu (see commit f21ae3b10084). >> Thanks, Robin. >> I read your comments but I cannot fully understand the alternative that you >> propose, although I do understand your objection to the indirection >> “ignore_gather_pgsize”. Would it be ok if “ignore_gather_pgsize" was >> provided as an argument for iommu_iotlb_gather_add_page()? > > No, I mean if iommu_iotlb_gather_add_page() doesn't have the behaviour your > driver wants, just don't call it. Write or factor out a suitable helper that > *does* do what you want and call that, or just implement the logic directly > inline. Indirect argument or not, it just doesn't make much sense to have a > helper function call which says "do this except don't do most of it". > >> In general, I can live without this patch. It probably would have negligent >> impact on performance anyhow. 
> > As I implied, it sounds like your needs are the same as the Mediatek driver
> > had, so you could readily factor out a new page-size-agnostic gather helper
> > from that. I fully support making the functional change to amd-iommu
> > *somehow* - nobody likes unnecessary syncs - just not with this particular
> > implementation :)

Hm… to avoid code duplication I need to extract some common code into another function. Does the following resemble what you had in mind (untested)?

-- >8 --
Subject: [PATCH] iommu: add iommu_iotlb_gather_add_page_ignore_pgsize()
---
 drivers/iommu/mtk_iommu.c |  7 ++---
 include/linux/iommu.h     | 55 ++-
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e168a682806a..5890e745bed3 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -520,12 +520,9 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
 			      struct iommu_iotlb_gather *gather)
 {
 	struct mtk_iommu_domain *dom = to_mtk_domain(domain);
-	unsigned long end = iova + size - 1;
 
-	if (gather->start > iova)
-		gather->start = iova;
-	if (gather->end < end)
-		gather->end = end;
+	iommu_iotlb_gather_update_range(gather, iova, size);
+
 	return dom->iop->unmap(dom->iop, iova, size, gather);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9ca6e6b8084d..037434b6eb4c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -535,29 +535,58 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
 	iommu_iotlb_gather_init(iotlb_gather);
 }
 
-static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+static inline
+void iommu_iotlb_gather_update_range(struct iommu_iotlb_gather *gather,
+				     unsigned long iova, size_t size)
+{
+	unsigned long start = iova, end = start + size - 1;
+
+	if (gather->end < end)
+		gather->end = end;
+
+	if (gather->start > start)
+		gather->start = start;
+
+	gather->pgsize = size;
+}
+
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+				    unsigned long iova, size_t size)
+{
+	return iova + size < gather->start || iova > gather->end + 1;
+}
+
+static inline
+void iommu_iotlb_gather_add_page_ignore_pgsize(struct iommu_domain *domain,
 					       struct iommu_iotlb_gather *gather,
 					       unsigned long iova, size_t size)
 {
-	unsigned long start = iova, end = start + size - 1;
+	/*
+	 * Only if the new page is disjoint from the current range, then sync
+	 * the TLB so that the gather structure can be rewritten.
+	 */
+	if (iommu_iotlb_gather_is_disjoint(gather, iova, size) && gather->pgsize)
+		iommu_iotlb_sync(domain, gather);
+
+
Re: [PATCH 1/1] dma-contiguous: return early for dt case in dma_contiguous_reserve
On 2021-05-31 10:21, Dong Aisheng wrote:
> On Tue, May 18, 2021 at 7:29 PM Dong Aisheng wrote:
>> dma_contiguous_reserve() aims to support cmdline case for CMA memory
>> reserve. But if users define reserved memory in DT,
>> 'dma_contiguous_default_area' will not be 0, then it's meaningless to
>> continue to run dma_contiguous_reserve(). So we return early if detect
>> 'dma_contiguous_default_area' is unzero.
>>
>> Cc: Christoph Hellwig
>> Cc: Marek Szyprowski
>> Cc: Robin Murphy
>> Signed-off-by: Dong Aisheng
>
> Gently ping

The commit message is still wrong, and I still think the change doesn't achieve anything meaningful. This code is hard to make sense of either way because the crucial interplay between size_cmdline and dma_contiguous_default_area is hidden somewhere else entirely, and it would take a much more significant refactoring to clear that up.

Robin.

> Regards
> Aisheng
>
>> ---
>>  kernel/dma/contiguous.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
>> index 3d63d91cba5c..ebade9f43eff 100644
>> --- a/kernel/dma/contiguous.c
>> +++ b/kernel/dma/contiguous.c
>> @@ -171,6 +171,9 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>  	phys_addr_t selected_limit = limit;
>>  	bool fixed = false;
>>
>> +	if (dma_contiguous_default_area)
>> +		return;
>> +
>>  	pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);
>>
>>  	if (size_cmdline != -1) {
>> @@ -191,7 +194,7 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>  #endif
>>  	}
>>
>> -	if (selected_size && !dma_contiguous_default_area) {
>> +	if (selected_size) {
>>  		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>  			 (unsigned long)selected_size / SZ_1M);
>>
>> --
>> 2.25.1
Re: Regression 5.12.0-rc4 net: ice: significant throughput drop
Hi Robin,

On Tue, Jun 1, 2021 at 2:39 PM Robin Murphy wrote:
> >> The regression shows as a significant drop in throughput as measured
> >> with "super_netperf" [0],
> >> with measured bandwidth of ~95Gbps before and ~35Gbps after:
>
> I guess that must be the difference between using the flush queue
> vs. strict invalidation. On closer inspection, it seems to me that
> there's a subtle pre-existing bug in the AMD IOMMU driver, in that
> amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api()
> has called bus_set_iommu(). Does the patch below work?

Thanks for the quick response & patch. I tried it out and indeed it does solve the issue:

# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu+ #31 SMP Tue Jun 1 17:12:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
95341.2
root@zh-lab-node-3:~# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu-unpatched #32 SMP Tue Jun 1 17:29:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
33989.5
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: > > On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: > > > From: Thierry Reding > > > > > > Hi, > > > > > > this is a set of patches that is the result of earlier discussions > > > regarding early identity mappings that are needed to avoid SMMU faults > > > during early boot. > > > > > > The goal here is to avoid early identity mappings altogether and instead > > > postpone the need for the identity mappings to when devices are attached > > > to the SMMU. This works by making the SMMU driver coordinate with the > > > memory controller driver on when to start enforcing SMMU translations. > > > This makes Tegra behave in a more standard way and pushes the code to > > > deal with the Tegra-specific programming into the NVIDIA SMMU > > > implementation. > > > > > > Compared to the original version of these patches, I've split the > > > preparatory work into a separate patch series because it became very > > > large and will be mostly uninteresting for this audience. > > > > > > Patch 1 provides a mechanism to program SID overrides at runtime. Patch > > > 2 updates the ARM SMMU device tree bindings to include the Tegra186 > > > compatible string as suggested by Robin during review. > > > > > > Patches 3 and 4 create the fundamentals in the SMMU driver to support > > > this and also make this functionality available on Tegra186. Patch 5 > > > hooks the ARM SMMU up to the memory controller so that the memory client > > > stream ID overrides can be programmed at the right time. > > > > > > Patch 6 extends this mechanism to Tegra186 and patches 7-9 enable all of > > > this through device tree updates. Patch 10 is included here to show how > > > SMMU will be enabled for display controllers. 
However, it cannot be > > > applied yet because the code to create identity mappings for potentially > > > live framebuffers hasn't been merged yet. > > > > > > The end result is that various peripherals will have SMMU enabled, while > > > the display controllers will keep using passthrough, as initially set up > > > by firmware. Once the device tree bindings have been accepted and the > > > SMMU driver has been updated to create identity mappings for the display > > > controllers, they can be hooked up to the SMMU and the code in this > > > series will automatically program the SID overrides to enable SMMU > > > translations at the right time. > > > > > > Note that the series creates a compile time dependency between the > > > memory controller and IOMMU trees. If it helps I can provide a branch > > > for each tree, modelling the dependency, once the series has been > > > reviewed. > > > > > > Changes in v2: > > > - split off the preparatory work into a separate series (that needs to > > > be applied first) > > > - address review comments by Robin > > > > > > Thierry > > > > > > Thierry Reding (10): > > > memory: tegra: Implement SID override programming > > > dt-bindings: arm-smmu: Add Tegra186 compatible string > > > iommu/arm-smmu: Implement ->probe_finalize() > > > iommu/arm-smmu: tegra: Detect number of instances at runtime > > > iommu/arm-smmu: tegra: Implement SID override programming > > > iommu/arm-smmu: Use Tegra implementation on Tegra186 > > > arm64: tegra: Use correct compatible string for Tegra186 SMMU > > > arm64: tegra: Hook up memory controller to SMMU on Tegra186 > > > arm64: tegra: Enable SMMU support on Tegra194 > > > arm64: tegra: Enable SMMU support for display on Tegra194 > > > > > > .../devicetree/bindings/iommu/arm,smmu.yaml | 11 +- > > > arch/arm64/boot/dts/nvidia/tegra186.dtsi | 4 +- > > > arch/arm64/boot/dts/nvidia/tegra194.dtsi | 166 ++ > > > drivers/iommu/arm/arm-smmu/arm-smmu-impl.c| 3 +- > > > 
drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c | 90 -- > > > drivers/iommu/arm/arm-smmu/arm-smmu.c | 13 ++ > > > drivers/iommu/arm/arm-smmu/arm-smmu.h | 1 + > > > drivers/memory/tegra/mc.c | 9 + > > > drivers/memory/tegra/tegra186.c | 72 > > > include/soc/tegra/mc.h| 3 + > > > 10 files changed, 349 insertions(+), 23 deletions(-) > > > > Will, Robin, > > > > do you have any more comments on the ARM SMMU bits of this series? If > > not, can you guys provide an Acked-by so that Krzysztof can pick this > > (modulo the DT patches) up into the memory-controller tree for v5.14? > > > > I'll send out a v3 with the bisectibilitiy fix that Krishna pointed > > out. > > Probably best if I queue 3-6 on a separate branch once you send a v3, > then Krzysztof can pull that in if he needs it. Patch 5 has a build-time dependency on patch 1, so they need to go in together. The reason why I suggested Krzysztof pick these up is because there is a restructuring series that this depends on, which will go
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 08:38:00AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Saturday, May 29, 2021 3:59 AM > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > > > 5. Use Cases and Flows > > > > > > Here assume VFIO will support a new model where every bound device > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o > > > going through legacy container/group interface. For illustration purpose > > > those devices are just called dev[1...N]: > > > > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode); > > > > > > As explained earlier, one IOASID fd is sufficient for all intended use > > > cases: > > > > > > ioasid_fd = open("/dev/ioasid", mode); > > > > > > For simplicity below examples are all made for the virtualization story. > > > They are representative and could be easily adapted to a > > > non-virtualization > > > scenario. > > > > For others, I don't think this is *strictly* necessary, we can > > probably still get to the device_fd using the group_fd and fit in > > /dev/ioasid. It does make the rest of this more readable though. > > Jason, want to confirm here. Per earlier discussion we remain an > impression that you want VFIO to be a pure device driver thus > container/group are used only for legacy application. Let me call this a "nice wish". If you get to a point where you hard need this, then identify the hard requirement and let's do it, but I wouldn't bloat this already large project unnecessarily. Similarly I wouldn't depend on the group fd existing in this design so it could be changed later. > From this comment are you suggesting that VFIO can still keep > container/ group concepts and user just deprecates the use of vfio > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a > simple policy that an IOASID will reject cmd if partially-attached > group exists)? I would say no on the container. 
/dev/ioasid == the container; having two competing objects at once in a single process is just a mess.

If the group fd is to be kept, it requires charting a path through the ioctls where the container is not used and /dev/ioasid is sub'd in using the same device-FD-specific IOCTLs you show here. I didn't try to chart this out carefully.

Also, ultimately, something needs to be done about compatibility with the vfio container fd. It looks clear enough to me that the VFIO container FD is just a single IOASID using a special ioctl interface, so it would be quite reasonable to harmonize these somehow. But that is too complicated and far out for me at least to guess on at this point..

> > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > there any scenario where we want different vpasid's for the same
> > IOASID? I guess it is OK like this. Hum.
>
> Yes, it's completely sane that the guest links a I/O page table to
> different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> that when multiple devices share an I/O page table they must use
> the same PASID#.

Ok..

Jason
Re: [PATCH] iommu: Print default strict or lazy mode at init time
pr_info("DMA domain default TLB invalidation policy: %s mode %s\n", iommu_dma_strict ? "strict" : "lazy", (iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ? "(set via kernel command line)" : ""); I think it's worth mentioning "default" somewhere, as not all IOMMUs or devices will use lazy mode even if it's default. But that's part of what I think is misleading - I boot and see that the default is something, so I reboot with iommu.strict to explicitly set it the other way, but now that's the default... huh? The way I see it, we're saying what the current IOMMU API policy is - the value of iommu_dma_strict at any given time is fact - but we're not necessarily saying how widely that policy is enforced. We similarly report the type for default domains from global policy even though that may also be overridden per-group by drivers and/or userspace later; we don't say it's the *default* default domain type. I think that is this is the behavior a user would understand from that message. However from a glance at the intel IOMMU driver, it seems possible to change default domain type after iommu_subsys_init(). However, having now debugged the AMD issue from another thread, I think doing this at subsys_initcall is in fact going to be too early to be meaningful, since it ignores drivers' ability to change the global policy :( A user may still learn the IOMMU group domain type from sysfs. There is no such thing for TLB invalidation mode - how about add a file for this? It would be useful. Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Saturday, May 29, 2021 1:36 AM > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > > IOASID nesting can be implemented in two ways: hardware nesting and > > > software nesting. With hardware support the child and parent I/O page > > > tables are walked consecutively by the IOMMU to form a nested translation. > > > When it's implemented in software, the ioasid driver is responsible for > > > merging the two-level mappings into a single-level shadow I/O page table. > > > Software nesting requires both child/parent page tables operated through > > > the dma mapping protocol, so any change in either level can be captured > > > by the kernel to update the corresponding shadow mapping. > > > > Why? A SW emulation could do this synchronization during invalidation > > processing if invalidation contained an IOVA range. > > In this proposal we differentiate between host-managed and user- > managed I/O page tables. If host-managed, the user is expected to use > map/unmap cmd explicitly upon any change required on the page table. > If user-managed, the user first binds its page table to the IOMMU and > then use invalidation cmd to flush iotlb when necessary (e.g. typically > not required when changing a PTE from non-present to present). > > We expect user to use map+unmap and bind+invalidate respectively > instead of mixing them together. Following this policy, map+unmap > must be used in both levels for software nesting, so changes in either > level are captured timely to synchronize the shadow mapping. map+unmap or bind+invalidate is a policy of the IOASID itself set when it is created. If you put two different types in a tree then each IOASID must continue to use its own operation mode. I don't see a reason to force all IOASIDs in a tree to be consistent?? 
A software-emulated two-level page table, where the leaf level is a bound page table in guest memory, should continue to use bind/invalidate to maintain the guest page table IOASID even though it is a SW construct.

The GPA level should use map/unmap because it is a kernel-owned page table.

Though how to efficiently mix map/unmap on the GPA when there are SW nested levels below it looks to be quite challenging.

Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 12:04:00PM +, Parav Pandit wrote: > > > > From: Jason Gunthorpe > > Sent: Monday, May 31, 2021 11:43 PM > > > > On Mon, May 31, 2021 at 05:37:35PM +, Parav Pandit wrote: > > > > > In that case, can it be a new system call? Why does it have to be under > > /dev/ioasid? > > > For example few years back such system call mpin() thought was proposed > > in [1]. > > > > Reference counting of the overall pins are required > > > > So when a pinned pages is incorporated into an IOASID page table in a later > > IOCTL it means it cannot be unpinned while the IOASID page table is using > > it. > Ok. but cant it use the same refcount of that mmu uses? Manipulating that refcount is part of the overhead that is trying to be avoided here, plus ensuring that the pinned pages accounting doesn't get out of sync with the actual account of pinned pages! > > Then the ioasid's interval tree would be mapped into a page table tree in HW > > format. > Does it mean in simple use case [1], second level page table copy is > maintained in the IOMMU side via map interface? I hope not. It > should use the same as what mmu uses, right? Not a full page by page copy, but some interval reference. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults.

I'm still confused why drivers need fault handlers at all?

Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
> We can open up to ~0U file descriptors, I don't see why we need to restrict
> it in uAPI.

There are significant problems with such large file descriptor tables. High FD numbers mean things like select don't work at all anymore, and IIRC there are more complications. A huge number of FDs for typical usages should be avoided.

Jason
RE: [RFC] /dev/ioasid uAPI proposal
> From: Tian, Kevin > Sent: Thursday, May 27, 2021 1:28 PM > 5.6. I/O page fault > +++ > > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver > to guest IOMMU driver and backwards). > > - Host IOMMU driver receives a page request with raw fault_data {rid, > pasid, addr}; > > - Host IOMMU driver identifies the faulting I/O page table according to > information registered by IOASID fault handler; > > - IOASID fault handler is called with raw fault_data (rid, pasid, addr), > which > is saved in ioasid_data->fault_data (used for response); > > - IOASID fault handler generates an user fault_data (ioasid, addr), links it > to the shared ring buffer and triggers eventfd to userspace; > > - Upon received event, Qemu needs to find the virtual routing information > (v_rid + v_pasid) of the device attached to the faulting ioasid. If there > are > multiple, pick a random one. This should be fine since the purpose is to > fix the I/O page table on the guest; > > - Qemu generates a virtual I/O page fault through vIOMMU into guest, > carrying the virtual fault data (v_rid, v_pasid, addr); > Why does it have to be through vIOMMU? For a VFIO PCI device, have you considered to reuse the same PRI interface to inject page fault in the guest? This eliminates any new v_rid. It will also route the page fault request and response through the right vfio device. > - Guest IOMMU driver fixes up the fault, updates the I/O page table, and > then sends a page response with virtual completion data (v_rid, v_pasid, > response_code) to vIOMMU; > What about fixing up the fault for mmu page table as well in guest? Or you meant both when above you said "updates the I/O page table"? It is unclear to me that if there is single nested page table maintained or two (one for cr3 references and other for iommu). Can you please clarify? 
> - Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> complete the pending fault;
>

For the VFIO PCI device, once a virtual PRI request/response interface is done, it can be a generic interface among multiple vIOMMUs.

> - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};
>
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:
> For the case of 1M, I would like to know what's the use case for a single
> process to handle 1M+ address spaces?

For some scenarios every guest PASID will require an IOASID ID #, so there is
a large enough demand that FDs alone are not a good fit.

Further, there are global container-wide properties that are hard to carry
over to a multi-FD model, like the attachment of devices to the container at
startup.

> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
>
> If the container and address space is 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in the
container for a parent/child type relationship.

Jason
Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes
On 2021-06-01 17:39, Nadav Amit wrote:
>> On Jun 1, 2021, at 8:59 AM, Robin Murphy wrote:
>>
>> On 2021-05-02 07:59, Nadav Amit wrote:
>>> From: Nadav Amit
>>>
>>> Some IOMMU architectures perform invalidations regardless of the page
>>> size. In such architectures there is no need to sync when the page size
>>> changes or to regard pgsize when making an interim flush in
>>> iommu_iotlb_gather_add_page(). Add an "ignore_gather_pgsize" property
>>> for each IOMMU-ops to decide whether gather's pgsize is tracked and
>>> triggers a flush.
>>
>> I've objected before[1], and I'll readily object again ;)
>>
>> I still think it's very silly to add a bunch of indirection all over the
>> place to make a helper function not do the main thing it's intended to
>> help with. If you only need trivial address gathering then it's far
>> simpler to just implement trivial address gathering. I suppose if you
>> really want to you could factor out another helper to share the 5 lines
>> of code which ended up in mtk-iommu (see commit f21ae3b10084).
>
> Thanks, Robin.
>
> I read your comments but I cannot fully understand the alternative that
> you propose, although I do understand your objection to the indirection
> "ignore_gather_pgsize". Would it be OK if "ignore_gather_pgsize" was
> provided as an argument to iommu_iotlb_gather_add_page()?

No, I mean if iommu_iotlb_gather_add_page() doesn't have the behaviour your
driver wants, just don't call it. Write or factor out a suitable helper that
*does* do what you want and call that, or just implement the logic directly
inline. Indirect argument or not, it just doesn't make much sense to have a
helper function call which says "do this except don't do most of it".

> In general, I can live without this patch. It probably would have a
> negligible impact on performance anyhow.

As I implied, it sounds like your needs are the same as the Mediatek driver
had, so you could readily factor out a new page-size-agnostic gather helper
from that.

I fully support making the functional change to amd-iommu *somehow* - nobody
likes unnecessary syncs - just not with this particular implementation :)

Robin.
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> This version only covers 1) and 4). Do you think we need to support 2),
> 3) and beyond?

Yes, absolutely. The API should be flexible enough to specify the creation of
all future page table formats we'd want to have and all HW-specific details
of those formats.

> If so, it seems that we need some in-kernel helpers and uAPIs to
> support pre-installing a page table to IOASID.

Not sure what this means..

> From this point of view an IOASID is actually not just a variant of
> iommu_domain, but an I/O page table representation in a broader
> sense.

Yes, and things need to evolve in a staged way. The ioctl API should have
room for this growth, but you need to start out with something constrained
enough to actually implement and then figure out how to grow from there.

Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 11:08:53AM +0800, Lu Baolu wrote:
> On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
> > > > device bind should fail if the device somehow isn't compatible with
> > > > the scheme the user is trying to use.
> > > yeah, I guess you mean to fail the device attach when the IOASID is a
> > > nesting IOASID but the device is behind an iommu without nesting
> > > support. right?
> > Right..
>
> Just want to confirm...
>
> Does this mean that we only support hardware nesting and don't want to
> have soft nesting (shadowed page table in kernel) in IOASID?

No, the uAPI presents a contract; if the kernel can fulfill the contract then
it should be supported. If you want SW nesting then the kernel has to have
the SW support for it, or fail.

At least for the purposes of the document I wouldn't delve too much deeper
into that question.

Jason
Re: [PATCH] iommu: Print default strict or lazy mode at init time
On 2021-06-01 16:50, John Garry wrote:
> On 01/06/2021 10:09, Robin Murphy wrote:
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index 808ab70d5df5..f25fae62f077 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -138,6 +138,11 @@ static int __init iommu_subsys_init(void)
>>>  		(iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ?
>>>  			"(set via kernel command line)" : "");
>>>
>>> +	pr_info("Default DMA domain mode: %s %s\n",
>>
>> Nit: I think this might be a little unclear for end-users - *I'm* not
>> even sure whether "Default" here is meant to refer to the mode setting
>> itself or to default domains (of DMA type). Maybe something like "DMA
>> domain TLB invalidation policy"? Certainly it seems like a good idea to
>> explicitly mention invalidation to correlate with the documentation of
>> the "iommu.strict" parameter. Ack to the general idea though.
>
> ok, so I'll go with this:
>
> pr_info("DMA domain default TLB invalidation policy: %s mode %s\n",
> 	iommu_dma_strict ? "strict" : "lazy",
> 	(iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
> 		"(set via kernel command line)" : "");
>
> I think it's worth mentioning "default" somewhere, as not all IOMMUs or
> devices will use lazy mode even if it's default.

But that's part of what I think is misleading - I boot and see that the
default is something, so I reboot with iommu.strict to explicitly set it the
other way, but now that's the default... huh?

The way I see it, we're saying what the current IOMMU API policy is - the
value of iommu_dma_strict at any given time is fact - but we're not
necessarily saying how widely that policy is enforced. We similarly report
the type for default domains from global policy even though that may also be
overridden per-group by drivers and/or userspace later; we don't say it's
the *default* default domain type.

However, having now debugged the AMD issue from another thread, I think doing
this at subsys_initcall is in fact going to be too early to be meaningful,
since it ignores drivers' ability to change the global policy :(

Robin.
Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes
> On Jun 1, 2021, at 8:59 AM, Robin Murphy wrote:
>
> On 2021-05-02 07:59, Nadav Amit wrote:
>> From: Nadav Amit
>>
>> Some IOMMU architectures perform invalidations regardless of the page
>> size. In such architectures there is no need to sync when the page size
>> changes or to regard pgsize when making an interim flush in
>> iommu_iotlb_gather_add_page().
>> Add an "ignore_gather_pgsize" property for each IOMMU-ops to decide
>> whether gather's pgsize is tracked and triggers a flush.
>
> I've objected before[1], and I'll readily object again ;)
>
> I still think it's very silly to add a bunch of indirection all over the
> place to make a helper function not do the main thing it's intended to
> help with. If you only need trivial address gathering then it's far
> simpler to just implement trivial address gathering. I suppose if you
> really want to you could factor out another helper to share the 5 lines
> of code which ended up in mtk-iommu (see commit f21ae3b10084).

Thanks, Robin.

I read your comments but I cannot fully understand the alternative that you
propose, although I do understand your objection to the indirection
"ignore_gather_pgsize". Would it be OK if "ignore_gather_pgsize" was provided
as an argument to iommu_iotlb_gather_add_page()?

In general, I can live without this patch. It probably would have a
negligible impact on performance anyhow.

Regards,
Nadav
Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes
On 2021-05-02 07:59, Nadav Amit wrote:
> From: Nadav Amit
>
> Some IOMMU architectures perform invalidations regardless of the page
> size. In such architectures there is no need to sync when the page size
> changes or to regard pgsize when making an interim flush in
> iommu_iotlb_gather_add_page().
> Add an "ignore_gather_pgsize" property for each IOMMU-ops to decide
> whether gather's pgsize is tracked and triggers a flush.

I've objected before[1], and I'll readily object again ;)

I still think it's very silly to add a bunch of indirection all over the
place to make a helper function not do the main thing it's intended to help
with. If you only need trivial address gathering then it's far simpler to
just implement trivial address gathering. I suppose if you really want to you
could factor out another helper to share the 5 lines of code which ended up
in mtk-iommu (see commit f21ae3b10084).

Robin.

[1] https://lore.kernel.org/linux-iommu/49bae447-d662-e6cf-7500-ab78e3b75...@arm.com/

> Cc: Joerg Roedel
> Cc: Will Deacon
> Cc: Jiajun Cao
> Cc: iommu@lists.linux-foundation.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Nadav Amit
> ---
>  drivers/iommu/amd/iommu.c | 1 +
>  include/linux/iommu.h | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index b8cabbbeed71..1849b53f2149 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2215,6 +2215,7 @@ const struct iommu_ops amd_iommu_ops = {
>  	.put_resv_regions = generic_iommu_put_resv_regions,
>  	.is_attach_deferred = amd_iommu_is_attach_deferred,
>  	.pgsize_bitmap	= AMD_IOMMU_PGSIZES,
> +	.ignore_gather_pgsize = true,
>  	.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>  	.iotlb_sync = amd_iommu_iotlb_sync,
>  	.def_domain_type = amd_iommu_def_domain_type,
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 32d448050bf7..1fb2695418e9 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -284,6 +284,7 @@ struct iommu_ops {
>  	int (*def_domain_type)(struct device *dev);
>
>  	unsigned long pgsize_bitmap;
> +	bool ignore_gather_pgsize;
>  	struct module *owner;
>  };
>
> @@ -508,7 +509,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
>  	 * a different granularity, then sync the TLB so that the gather
>  	 * structure can be rewritten.
>  	 */
> -	if (gather->pgsize != size ||
> +	if ((gather->pgsize != size && !domain->ops->ignore_gather_pgsize) ||
>  	    end + 1 < gather->start || start > gather->end + 1) {
>  		if (gather->pgsize)
>  			iommu_iotlb_sync(domain, gather);
Re: [PATCH] iommu: Print default strict or lazy mode at init time
On 01/06/2021 10:09, Robin Murphy wrote:
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 808ab70d5df5..f25fae62f077 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -138,6 +138,11 @@ static int __init iommu_subsys_init(void)
>>  		(iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ?
>>  			"(set via kernel command line)" : "");
>>
>> +	pr_info("Default DMA domain mode: %s %s\n",
>
> Nit: I think this might be a little unclear for end-users - *I'm* not even
> sure whether "Default" here is meant to refer to the mode setting itself
> or to default domains (of DMA type). Maybe something like "DMA domain TLB
> invalidation policy"? Certainly it seems like a good idea to explicitly
> mention invalidation to correlate with the documentation of the
> "iommu.strict" parameter. Ack to the general idea though.

ok, so I'll go with this:

pr_info("DMA domain default TLB invalidation policy: %s mode %s\n",
	iommu_dma_strict ? "strict" : "lazy",
	(iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
		"(set via kernel command line)" : "");

I think it's worth mentioning "default" somewhere, as not all IOMMUs or
devices will use lazy mode even if it's default.

Thanks,
John
[PATCH v3 7/7] dma-iommu: Use init_iova_domain_ext() for IOVA domain init
Pass the max opt iova len to init the IOVA domain, if set.

Signed-off-by: John Garry
---
 drivers/iommu/dma-iommu.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f6d3302bb829..37765d540dc9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -333,6 +333,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	unsigned long order, base_pfn;
 	struct iova_domain *iovad;
+	size_t max_opt_dma_size;
+	unsigned long iova_len;

 	if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)
 		return -EINVAL;
@@ -366,7 +368,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 		return 0;
 	}

-	init_iova_domain(iovad, 1UL << order, base_pfn);
+	max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group);
+
+	if (max_opt_dma_size) {
+		unsigned long shift = __ffs(1UL << order);
+
+		iova_len = max_opt_dma_size >> shift;
+		iova_len = roundup_pow_of_two(iova_len);
+	} else {
+		iova_len = 0;
+	}
+
+	init_iova_domain_ext(iovad, 1UL << order, base_pfn, iova_len);

 	if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
 	    domain->ops->flush_iotlb_all && !iommu_get_dma_strict(domain)) {
--
2.26.2
[PATCH v3 6/7] iommu: Allow max opt DMA len be set for a group via sysfs
Add support to allow the maximum optimised DMA len be set for an IOMMU group
via sysfs. This is much the same as the method to change the default domain
type for a group.

Signed-off-by: John Garry
---
 drivers/iommu/iommu.c | 51 +--
 include/linux/iommu.h | 6 +
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8bf2abb3d4c1..ea2bdd1c4f4e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -45,6 +45,7 @@ struct iommu_group {
 	struct iommu_domain *default_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
+	size_t max_opt_dma_size;
 };

 struct group_device {
@@ -86,6 +87,9 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group,
+						  const char *buf,
+						  size_t count);

 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =	\
@@ -554,6 +558,12 @@ static ssize_t iommu_group_show_type(struct iommu_group *group,
 	return strlen(type);
 }

+static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group,
+						 char *buf)
+{
+	return sprintf(buf, "%zu\n", group->max_opt_dma_size);
+}
+
 static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL);

 static IOMMU_GROUP_ATTR(reserved_regions, 0444,
@@ -562,6 +572,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444,
 static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type,
 			iommu_group_store_type);

+static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size,
+			iommu_group_store_max_opt_dma_size);
+
 static void iommu_group_release(struct kobject *kobj)
 {
 	struct iommu_group *group = to_iommu_group(kobj);
@@ -648,6 +661,10 @@ struct iommu_group *iommu_group_alloc(void)
 	if (ret)
 		return ERR_PTR(ret);

+	ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size);
+	if (ret)
+		return ERR_PTR(ret);
+
 	pr_debug("Allocated group %d\n", group->id);

 	return group;
@@ -2279,6 +2296,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device *dev)
 	return dev->iommu_group->default_domain;
 }

+size_t iommu_group_get_max_opt_dma_size(struct iommu_group *group)
+{
+	return group->max_opt_dma_size;
+}
+
 /*
  * IOMMU groups are really the natural working unit of the IOMMU, but
  * the IOMMU API works on domains and devices. Bridge that gap by
@@ -3045,12 +3067,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);
  *	 hasn't changed after the caller has called this function)
  * @type: The type of the new default domain that gets associated with the group
  * @new: Allocate new default domain, keeping same type when no type passed
+ * @max_opt_dma_size: If set, set the IOMMU group max_opt_dma_size when success
  *
  * Returns 0 on success and error code on failure
  *
  */
 static int iommu_change_dev_def_domain(struct iommu_group *group,
-				       struct device *prev_dev, int type, bool new)
+				       struct device *prev_dev, int type, bool new,
+				       unsigned long max_opt_dma_size)
 {
 	struct iommu_domain *prev_dom;
 	struct group_device *grp_dev;
@@ -3146,6 +3170,9 @@ static int iommu_change_dev_def_domain(struct iommu_group *group,

 	group->domain = group->default_domain;

+	if (max_opt_dma_size)
+		group->max_opt_dma_size = max_opt_dma_size;
+
 	/*
 	 * Release the mutex here because ops->probe_finalize() call-back of
 	 * some vendor IOMMU drivers calls arm_iommu_attach_device() which
@@ -3272,7 +3299,7 @@ static int iommu_group_store_type_cb(const char *buf,
 	else
 		return -EINVAL;

-	return iommu_change_dev_def_domain(group, dev, type, false);
+	return iommu_change_dev_def_domain(group, dev, type, false, 0);
 }

 static ssize_t iommu_group_store_type(struct iommu_group *group,
@@ -3281,3 +3308,23 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 	return iommu_group_store_common(group, buf, count,
 					iommu_group_store_type_cb);
 }
+
+static int iommu_group_store_max_opt_dma_size_cb(const char *buf,
+						 struct iommu_group *group,
+						 struct device *dev)
+{
+
[PATCH v3 5/7] iova: Add init_iova_domain_ext()
Add an extended version of init_iova_domain() which accepts a max opt iova
length argument, and use it to set the rcaches range.

This can be combined with the normal version later.

Signed-off-by: John Garry
---
 drivers/iommu/iova.c | 31 ---
 include/linux/iova.h | 9 +
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 95892a0433cc..ae4901073a98 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -23,7 +23,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad,
 static unsigned long iova_rcache_get(struct iova_domain *iovad,
 				     unsigned long size,
 				     unsigned long limit_pfn);
-static void init_iova_rcaches(struct iova_domain *iovad);
+static void init_iova_rcaches(struct iova_domain *iovad, unsigned long iova_len);
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void fq_destroy_all_entries(struct iova_domain *iovad);
@@ -46,9 +46,9 @@ static struct iova *to_iova(struct rb_node *node)
 	return rb_entry(node, struct iova, node);
 }

-void
-init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn)
+static void
+__init_iova_domain(struct iova_domain *iovad, unsigned long granule,
+	unsigned long start_pfn, unsigned long iova_len)
 {
 	/*
 	 * IOVA granularity will normally be equal to the smallest
@@ -71,7 +71,21 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
 	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
 	cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, &iovad->cpuhp_dead);
-	init_iova_rcaches(iovad);
+	init_iova_rcaches(iovad, iova_len);
+}
+
+void
+init_iova_domain_ext(struct iova_domain *iovad, unsigned long granule,
+	unsigned long start_pfn, unsigned long iova_len)
+{
+	__init_iova_domain(iovad, granule, start_pfn, iova_len);
+}
+
+void
+init_iova_domain(struct iova_domain *iovad, unsigned long granule,
+	unsigned long start_pfn)
+{
+	__init_iova_domain(iovad, granule, start_pfn, 0);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);

@@ -883,14 +897,17 @@ bool iova_domain_len_is_cached(struct iova_domain *iovad, unsigned long iova_len)
 	return iova_len_to_rcache_max(iova_len) == iovad->rcache_max_size;
 }

-static void init_iova_rcaches(struct iova_domain *iovad)
+static void init_iova_rcaches(struct iova_domain *iovad, unsigned long iova_len)
 {
 	struct iova_cpu_rcache *cpu_rcache;
 	struct iova_rcache *rcache;
 	unsigned int cpu;
 	int i;

-	iovad->rcache_max_size = IOVA_RANGE_CACHE_MAX_SIZE;
+	if (iova_len)
+		iovad->rcache_max_size = iova_len_to_rcache_max(iova_len);
+	else
+		iovad->rcache_max_size = IOVA_RANGE_CACHE_MAX_SIZE;

 	iovad->rcaches = kcalloc(iovad->rcache_max_size,
 				 sizeof(*iovad->rcaches), GFP_KERNEL);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 04cc8eb6de38..cfe416b6a8c7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -154,6 +154,8 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
 	unsigned long pfn_hi);
 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	unsigned long start_pfn);
+void init_iova_domain_ext(struct iova_domain *iovad, unsigned long granule,
+	unsigned long start_pfn, unsigned long iova_len);
 int init_iova_flush_queue(struct iova_domain *iovad,
 	iova_flush_cb flush_cb, iova_entry_dtor entry_dtor);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
@@ -222,6 +224,13 @@ static inline void init_iova_domain(struct iova_domain *iovad,
 {
 }

+static inline void init_iova_domain_ext(struct iova_domain *iovad,
+					unsigned long granule,
+					unsigned long start_pfn,
+					unsigned long iova_len)
+{
+}
+
 static inline int init_iova_flush_queue(struct iova_domain *iovad,
 					iova_flush_cb flush_cb,
 					iova_entry_dtor entry_dtor)
--
2.26.2
[PATCH v3 4/7] iova: Add iova_domain_len_is_cached()
Add a function to check whether an IOVA domain currently caches a given
upper IOVA len exactly.

Signed-off-by: John Garry
---
 drivers/iommu/iova.c | 11 +++
 include/linux/iova.h | 8 +++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 0e4c0e55178a..95892a0433cc 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -872,6 +872,17 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
 	mag->pfns[mag->size++] = pfn;
 }

+static unsigned long iova_len_to_rcache_max(unsigned long iova_len)
+{
+	return order_base_2(iova_len) + 1;
+}
+
+/* Test if iova_len range cached upper limit matches that of IOVA domain */
+bool iova_domain_len_is_cached(struct iova_domain *iovad, unsigned long iova_len)
+{
+	return iova_len_to_rcache_max(iova_len) == iovad->rcache_max_size;
+}
+
 static void init_iova_rcaches(struct iova_domain *iovad)
 {
 	struct iova_cpu_rcache *cpu_rcache;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 9974e1d3e2bc..04cc8eb6de38 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -136,7 +136,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 #if IS_ENABLED(CONFIG_IOMMU_IOVA)
 int iova_cache_get(void);
 void iova_cache_put(void);
-
+bool iova_domain_len_is_cached(struct iova_domain *iovad,
+			       unsigned long iova_len);
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
@@ -158,6 +159,11 @@ int init_iova_flush_queue(struct iova_domain *iovad,
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 #else
+static inline bool iova_domain_len_is_cached(struct iova_domain *iovad,
+					     unsigned long iova_len)
+{
+	return false;
+}
 static inline int iova_cache_get(void)
 {
 	return -ENOTSUPP;
--
2.26.2
[PATCH v3 3/7] iommu: Allow iommu_change_dev_def_domain() realloc default domain for same type
Allow iommu_change_dev_def_domain() to create a new default domain, keeping
the same type as the current domain when type is unset. Also remove the
comment about the function's purpose, which would otherwise become stale.

Signed-off-by: John Garry
---
 drivers/iommu/iommu.c | 54 ++-
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4d12b607918c..8bf2abb3d4c1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3036,6 +3036,7 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle)
 }
 EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);

 /*
  * Changes the default domain of an iommu group that has *only* one device
  *
@@ -3043,16 +3044,13 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);
  * @prev_dev: The device in the group (this is used to make sure that the device
  *	 hasn't changed after the caller has called this function)
  * @type: The type of the new default domain that gets associated with the group
+ * @new: Allocate new default domain, keeping same type when no type passed
  *
  * Returns 0 on success and error code on failure
  *
- * Note:
- * 1. Presently, this function is called only when user requests to change the
- *    group's default domain type through /sys/kernel/iommu_groups//type
- *    Please take a closer look if intended to use for other purposes.
  */
 static int iommu_change_dev_def_domain(struct iommu_group *group,
-				       struct device *prev_dev, int type)
+				       struct device *prev_dev, int type, bool new)
 {
 	struct iommu_domain *prev_dom;
 	struct group_device *grp_dev;
@@ -3105,28 +3103,32 @@ static int iommu_change_dev_def_domain(struct iommu_group *group,
 		goto out;
 	}

-	dev_def_dom = iommu_get_def_domain_type(dev);
-	if (!type) {
+	if (new && !type) {
+		type = prev_dom->type;
+	} else {
+		dev_def_dom = iommu_get_def_domain_type(dev);
+		if (!type) {
+			/*
+			 * If the user hasn't requested any specific type of domain and
+			 * if the device supports both the domains, then default to the
+			 * domain the device was booted with
+			 */
+			type = dev_def_dom ? : iommu_def_domain_type;
+		} else if (dev_def_dom && type != dev_def_dom) {
+			dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n",
+					    iommu_domain_type_str(type));
+			ret = -EINVAL;
+			goto out;
+		}
+
 		/*
-		 * If the user hasn't requested any specific type of domain and
-		 * if the device supports both the domains, then default to the
-		 * domain the device was booted with
+		 * Switch to a new domain only if the requested domain type is different
+		 * from the existing default domain type
 		 */
-		type = dev_def_dom ? : iommu_def_domain_type;
-	} else if (dev_def_dom && type != dev_def_dom) {
-		dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n",
-				    iommu_domain_type_str(type));
-		ret = -EINVAL;
-		goto out;
-	}
-
-	/*
-	 * Switch to a new domain only if the requested domain type is different
-	 * from the existing default domain type
-	 */
-	if (prev_dom->type == type) {
-		ret = 0;
-		goto out;
+		if (prev_dom->type == type) {
+			ret = 0;
+			goto out;
+		}
 	}

 	/* Sets group->default_domain to the newly allocated domain */
@@ -3270,7 +3272,7 @@ static int iommu_group_store_type_cb(const char *buf,
 	else
 		return -EINVAL;

-	return iommu_change_dev_def_domain(group, dev, type);
+	return iommu_change_dev_def_domain(group, dev, type, false);
 }

 static ssize_t iommu_group_store_type(struct iommu_group *group,
--
2.26.2
[PATCH v3 1/7] iommu: Refactor iommu_group_store_type()
Function iommu_group_store_type() supports changing the default domain of an
IOMMU group. Many conditions need to be satisfied and steps taken for this
action to be successful. Satisfying these conditions and taking these steps
will also be required for setting other IOMMU group attributes, so factor
the code into a common part and a part specific to updating the IOMMU group
attribute.

No functional change intended.

Some code comments are tidied up also.

Signed-off-by: John Garry
---
 drivers/iommu/iommu.c | 73 +++
 1 file changed, 46 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 808ab70d5df5..4d12b607918c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3169,20 +3169,23 @@ static int iommu_change_dev_def_domain(struct iommu_group *group,
 }

 /*
- * Changing the default domain through sysfs requires the users to ubind the
- * drivers from the devices in the iommu group. Return failure if this doesn't
- * meet.
+ * Changing the default domain or any other IOMMU group attribute through sysfs
+ * requires the users to unbind the drivers from the devices in the IOMMU group.
+ * Return failure if this precondition is not met.
  *
  * We need to consider the race between this and the device release path.
  * device_lock(dev) is used here to guarantee that the device release path
  * will not be entered at the same time.
  */
-static ssize_t iommu_group_store_type(struct iommu_group *group,
-				      const char *buf, size_t count)
+static ssize_t iommu_group_store_common(struct iommu_group *group,
+					const char *buf, size_t count,
+					int (*cb)(const char *buf,
+						  struct iommu_group *group,
+						  struct device *dev))
 {
 	struct group_device *grp_dev;
 	struct device *dev;
-	int ret, req_type;
+	int ret;

 	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
 		return -EACCES;
@@ -3190,25 +3193,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 	if (WARN_ON(!group))
 		return -EINVAL;

-	if (sysfs_streq(buf, "identity"))
-		req_type = IOMMU_DOMAIN_IDENTITY;
-	else if (sysfs_streq(buf, "DMA"))
-		req_type = IOMMU_DOMAIN_DMA;
-	else if (sysfs_streq(buf, "auto"))
-		req_type = 0;
-	else
-		return -EINVAL;
-
 	/*
 	 * Lock/Unlock the group mutex here before device lock to
-	 * 1. Make sure that the iommu group has only one device (this is a
+	 * 1. Make sure that the IOMMU group has only one device (this is a
 	 *    prerequisite for step 2)
 	 * 2. Get struct *dev which is needed to lock device
 	 */
 	mutex_lock(&group->mutex);
 	if (iommu_group_device_count(group) != 1) {
 		mutex_unlock(&group->mutex);
-		pr_err_ratelimited("Cannot change default domain: Group has more than one device\n");
+		pr_err_ratelimited("Cannot change IOMMU group default domain attribute: Group has more than one device\n");
 		return -EINVAL;
 	}

@@ -3220,16 +3214,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 	/*
 	 * Don't hold the group mutex because taking group mutex first and then
 	 * the device lock could potentially cause a deadlock as below. Assume
-	 * two threads T1 and T2. T1 is trying to change default domain of an
-	 * iommu group and T2 is trying to hot unplug a device or release [1] VF
-	 * of a PCIe device which is in the same iommu group. T1 takes group
-	 * mutex and before it could take device lock assume T2 has taken device
-	 * lock and is yet to take group mutex. Now, both the threads will be
-	 * waiting for the other thread to release lock. Below, lock order was
-	 * suggested.
+	 * two threads, T1 and T2. T1 is trying to change default domain
+	 * attribute of an IOMMU group and T2 is trying to hot unplug a device
+	 * or release [1] VF of a PCIe device which is in the same IOMMU group.
+	 * T1 takes the group mutex and before it could take device lock T2 may
+	 * have taken device lock and is yet to take group mutex. Now, both the
+	 * threads will be waiting for the other thread to release lock. Below,
+	 * lock order was suggested.
 	 *             device_lock(dev);
 	 *                     mutex_lock(&group->mutex);
-	 *                             iommu_change_dev_def_domain();
+	 *                             cb->iommu_change_dev_def_domain(); [example cb]
 	 *                     mutex_unlock(&group->mutex);
 	 *             device_unlock(dev);
 	 *
@@ -3243,7 +3237,7 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 	 */
 	mutex_unlock(&group->mutex);

-	/*
[PATCH v3 2/7] iova: Allow rcache range upper limit to be flexible
Some LLDs may request DMA mappings whose IOVA length exceeds that of the current rcache upper limit. This means that allocations for those IOVAs will never be cached, and always must be allocated and freed from the RB tree per DMA mapping cycle. This has a significant effect on performance, more so since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. As a first step towards allowing the rcache range upper limit to be configured, hold this value in the IOVA rcache structure, and allocate the rcaches separately. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c | 2 +- drivers/iommu/iova.c | 23 +-- include/linux/iova.h | 4 ++-- 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..f6d3302bb829 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -432,7 +432,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, * rounding up anything cacheable to make sure that can't happen. The * order of the unadjusted size will still match upon freeing. 
*/ - if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) + if (iova_len < (1 << (iovad->rcache_max_size - 1))) iova_len = roundup_pow_of_two(iova_len); dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit); diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b7ecd5b08039..0e4c0e55178a 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,6 +15,8 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR ~0UL +#define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */ + static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); @@ -877,7 +879,14 @@ static void init_iova_rcaches(struct iova_domain *iovad) unsigned int cpu; int i; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + iovad->rcache_max_size = IOVA_RANGE_CACHE_MAX_SIZE; + + iovad->rcaches = kcalloc(iovad->rcache_max_size, +sizeof(*iovad->rcaches), GFP_KERNEL); + if (!iovad->rcaches) + return; + + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; spin_lock_init(&rcache->lock); rcache->depot_size = 0; @@ -952,7 +961,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE) + if (log_size >= iovad->rcache_max_size) return false; return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn); @@ -1008,7 +1017,7 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE) + if (log_size >= iovad->rcache_max_size) return 0; return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size); @@ -1024,7 +1033,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; for_each_possible_cpu(cpu) { cpu_rcache = 
per_cpu_ptr(rcache->cpu_rcaches, cpu); @@ -1035,6 +1044,8 @@ static void free_iova_rcaches(struct iova_domain *iovad) for (j = 0; j < rcache->depot_size; ++j) iova_magazine_free(rcache->depot[j]); } + + kfree(iovad->rcaches); } /* @@ -1047,7 +1058,7 @@ static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) unsigned long flags; int i; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu); spin_lock_irqsave(&cpu_rcache->lock, flags); @@ -1066,7 +1077,7 @@ static void free_global_cached_iovas(struct iova_domain *iovad) unsigned long flags; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; spin_lock_irqsave(&rcache->lock, flags); for (j = 0; j < rcache->depot_size; ++j) { diff --git a/include/linux/iova.h b/include/linux/iova.h index 71d8a2de6635..9974e1d3e2bc 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -25,7 +25,6
[PATCH v3 0/7] iommu: Allow IOVA rcache range be configured
For streaming DMA mappings involving an IOMMU and whose IOVA lengths regularly exceed the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1]. This series allows the IOVA rcache range to be configured, so that we may cache all IOVAs per domain, thus improving performance. A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range: /sys/kernel/iommu_groups/X/max_opt_dma_size This file is updated in the same way as the IOMMU group default domain type, i.e. the only device in the group must be unbound first. The inspiration here comes from the block layer request queue sysfs "optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size Some figures for a storage scenario (when increasing the IOVA rcache range to cover all DMA mapping sizes from the LLD): v5.13-rc1 baseline: 1200K IOPS With series: 1800K IOPS All of the above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in all scenarios. 
[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ [1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/ Differences to v2: - Drop DMA mapping API to allow LLD to set this for now - Update default domain immediately, instead of in reprobe - Fix build warning Differences to v1: - Many - Change method to not operate on a 'live' IOMMU domain: - rather, force device driver to be re-probed once dma_max_opt_size is set, and reconfig a new IOMMU group then - Add iommu sysfs max_dma_opt_size file, and allow updating same as how group type is changed John Garry (7): iommu: Refactor iommu_group_store_type() iova: Allow rcache range upper limit to be flexible iommu: Allow iommu_change_dev_def_domain() realloc default domain for same type iova: Add iova_domain_len_is_cached() iova: Add init_iova_domain_ext() iommu: Allow max opt DMA len be set for a group via sysfs dma-iommu: Use init_iova_domain_ext() for IOVA domain init drivers/iommu/dma-iommu.c | 17 +++- drivers/iommu/iommu.c | 172 ++ drivers/iommu/iova.c | 63 +++--- include/linux/iommu.h | 6 ++ include/linux/iova.h | 21 - 5 files changed, 210 insertions(+), 69 deletions(-) -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 20:30, Lu Baolu wrote: > On 2021/6/1 15:15, Shenming Lu wrote: >> On 2021/6/1 13:10, Lu Baolu wrote: >>> Hi Shenming, >>> >>> On 6/1/21 12:31 PM, Shenming Lu wrote: On 2021/5/27 15:58, Tian, Kevin wrote: > /dev/ioasid provides an unified interface for managing I/O page tables for > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, > etc.) are expected to use this interface instead of creating their own > logic to > isolate untrusted device DMAs initiated by userspace. > > This proposal describes the uAPI of /dev/ioasid and also sample sequences > with VFIO as example in typical usages. The driver-facing kernel API > provided > by the iommu layer is still TBD, which can be discussed after consensus is > made on this uAPI. > > It's based on a lengthy discussion starting from here: > > https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ > > It ends up to be a long writing due to many things to be summarized and > non-trivial effort required to connect them into a complete proposal. > Hope it provides a clean base to converge. > [..] > /* > * Page fault report and response > * > * This is TBD. Can be added after other parts are cleared up. Likely > it > * will be a ring buffer shared between user/kernel, an eventfd to > notify > * the user and an ioctl to complete the fault. > * > * The fault data is per I/O address space, i.e.: IOASID + > faulting_addr > */ Hi, It seems that the ioasid has different usage in different situation, it could be directly used in the physical routing, or just a virtual handle that indicates a page table or a vPASID table (such as the GPA address space, in the simple passthrough case, the DMA input to IOMMU will just contain a Stream ID, no Substream ID), right? And Baolu suggested that since one device might consume multiple page tables, it's more reasonable to have one fault handler per page table. By this, do we have to maintain such an ioasid info list in the IOMMU layer? 
>>> As discussed earlier, the I/O page fault and cache invalidation paths >>> will have "device labels" so that the information could be easily >>> translated and routed. >>> >>> So it's likely the per-device fault handler registering API in iommu >>> core can be kept, but /dev/ioasid will be grown with a layer to >>> translate and propagate I/O page fault information to the right >>> consumers. >> Yeah, having a general preprocessing of the faults in IOASID seems to be >> a doable direction. But since there may be more than one consumer at the >> same time, who is responsible for registering the per-device fault handler? > > The drivers register per page table fault handlers to /dev/ioasid which > will then register itself to iommu core to listen and route the per- > device I/O page faults. This is just a top level thought. Haven't gone > through the details yet. Need to wait and see what /dev/ioasid finally > looks like. OK. And it needs to be confirmed by Jean since we might migrate the code from io-pgfault.c to IOASID... Anyway, finalize /dev/ioasid first. Thanks, Shenming > > Best regards, > baolu > .
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote: > On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote: > > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote: > > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote: > > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote: > > > > > > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created > > > > > per > > > > > VF (mdev device == VF device) then that mdev device has same iommu > > > > > protection scope as VF associated to it. > > > > > > > > This doesn't require, and certainly shouldn't create, a fake group. > > > > > > It's only fake if you start with a narrow view of what a group is. > > > > A group is connected to drivers/iommu. A group object without *any* > > relation to drivers/iommu is just a complete fiction, IMHO. > > That might be where we differ. As I've said, by group I'm primarily > meaning the fundamental hardware unit of isolation. *Usually* that's > determined by the capabilities of an IOMMU, but in some cases it might > not be. In either case, the boundaries still matter. As in my other email we absolutely need a group concept, it is just a question of how the user API is designed around it. > > The group mdev implicitly creates is just a fake proxy that comes > > along with mdev API. It doesn't do anything and it doesn't mean > > anything. > > But.. the case of multiple mdevs managed by a single PCI device with > an internal IOMMU also exists, and then the mdev groups are *not* > proxies but true groups independent of the parent device. Which means > that the group structure of mdevs can vary, which is an argument *for* > keeping it, not against. If VFIO becomes more "vfio_device" centric then the vfio_device itself has some properties. One of those can be "is it inside a drivers/iommu group, or not?". 
If the vfio_device is not using a drivers/iommu IOMMU interface then it can just have no group at all - no reason to lie. This would mean that the device has perfect isolation. What I don't like is forcing certain things depending on how the vfio_device was created - for instance forcing an IOMMU group as part and forcing an ugly "SW IOMMU" mode in the container only as part of mdev_device. These should all be properties of the vfio_device itself. Again this is all about the group fd - and how to fit in with the /dev/ioasid proposal from Kevin: https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/ Focusing on vfio_device and skipping the group fd smooths out some rough edges. Code wise we are not quite there, but I have mapped out eliminating the group from the vfio_device centric API and a few other places it has crept in. The group can exist in the background to enforce security without being a cornerstone of the API design. Jason
Re: Regression 5.12.0-rc4 net: ice: significant throughput drop
On 2021-06-01 07:57, Daniel Borkmann wrote: [ ping Robin / Joerg, +Cc Christoph ] Sorry, I was off on Friday on top of the Bank Holiday yesterday. On 5/28/21 10:34 AM, Jussi Maki wrote: Hi all, While measuring the impact of a kernel patch on our lab machines I stumbled upon a performance regression affecting the 100Gbit ICE nic and bisected it from range v5.11.1..v5.13-rc3 to the commit: a250c23f15c2 iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE Both recent bpf-next (d6a6a55518) and linux-stable (c4681547bc) are affected by the issue. The regression shows as a significant drop in throughput as measured with "super_netperf" [0], with measured bandwidth of ~95Gbps before and ~35Gbps after: I guess that must be the difference between using the flush queue vs. strict invalidation. On closer inspection, it seems to me that there's a subtle pre-existing bug in the AMD IOMMU driver, in that amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api() has called bus_set_iommu(). Does the patch below work? Robin. -- >8 -- Subject: [PATCH] iommu/amd: Tidy up DMA ops init Now that DMA ops are part of the core API via iommu-dma, fold the vestigial remains of the IOMMU_DMA_OPS init state into the IOMMU API phase, and clean up a few other leftovers. This should also close the race window wherein bus_set_iommu() effectively makes the DMA ops state visible before its nominal initialisation, which since commit a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE") can now lead to the wrong flush queue policy being picked. 
Reported-by: Jussi Maki Signed-off-by: Robin Murphy --- drivers/iommu/amd/amd_iommu.h | 2 -- drivers/iommu/amd/init.c | 5 - drivers/iommu/amd/iommu.c | 29 - 3 files changed, 12 insertions(+), 24 deletions(-) diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h index 55dd38d814d9..416815a525d6 100644 --- a/drivers/iommu/amd/amd_iommu.h +++ b/drivers/iommu/amd/amd_iommu.h @@ -11,8 +11,6 @@ #include "amd_iommu_types.h" -extern int amd_iommu_init_dma_ops(void); -extern int amd_iommu_init_passthrough(void); extern irqreturn_t amd_iommu_int_thread(int irq, void *data); extern irqreturn_t amd_iommu_int_handler(int irq, void *data); extern void amd_iommu_apply_erratum_63(u16 devid); diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c index d006724f4dc2..a418bf560a4b 100644 --- a/drivers/iommu/amd/init.c +++ b/drivers/iommu/amd/init.c @@ -231,7 +231,6 @@ enum iommu_init_state { IOMMU_ENABLED, IOMMU_PCI_INIT, IOMMU_INTERRUPTS_EN, - IOMMU_DMA_OPS, IOMMU_INITIALIZED, IOMMU_NOT_FOUND, IOMMU_INIT_ERROR, @@ -2895,10 +2894,6 @@ static int __init state_next(void) init_state = ret ? IOMMU_INIT_ERROR : IOMMU_INTERRUPTS_EN; break; case IOMMU_INTERRUPTS_EN: - ret = amd_iommu_init_dma_ops(); - init_state = ret ? IOMMU_INIT_ERROR : IOMMU_DMA_OPS; - break; - case IOMMU_DMA_OPS: init_state = IOMMU_INITIALIZED; break; case IOMMU_INITIALIZED: diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 80e8e1916dd1..20f7d141ea53 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -30,7 +30,6 @@ #include #include #include -#include #include #include #include @@ -1771,13 +1770,22 @@ void amd_iommu_domain_update(struct protection_domain *domain) amd_iommu_domain_flush_complete(domain); } +static void __init amd_iommu_init_dma_ops(void) +{ + swiotlb = (iommu_default_passthrough() || sme_me_mask) ? 
1 : 0; + + if (amd_iommu_unmap_flush) + pr_info("IO/TLB flush on unmap enabled\n"); + else + pr_info("Lazy IO/TLB flushing enabled\n"); + iommu_set_dma_strict(amd_iommu_unmap_flush); +} + int __init amd_iommu_init_api(void) { int ret, err = 0; - ret = iova_cache_get(); - if (ret) - return ret; + amd_iommu_init_dma_ops(); err = bus_set_iommu(&pci_bus_type, &amd_iommu_ops); if (err) @@ -1794,19 +1802,6 @@ int __init amd_iommu_init_api(void) return 0; } -int __init amd_iommu_init_dma_ops(void) -{ - swiotlb = (iommu_default_passthrough() || sme_me_mask) ? 1 : 0; - - if (amd_iommu_unmap_flush) - pr_info("IO/TLB flush on unmap enabled\n"); - else - pr_info("Lazy IO/TLB flushing enabled\n"); - iommu_set_dma_strict(amd_iommu_unmap_flush); - return 0; - -} - /* * * The following functions belong to the exported interface of AMD IOMMU -- 2.21.0.dirty
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 15:15, Shenming Lu wrote: On 2021/6/1 13:10, Lu Baolu wrote: Hi Shenming, On 6/1/21 12:31 PM, Shenming Lu wrote: On 2021/5/27 15:58, Tian, Kevin wrote: /dev/ioasid provides an unified interface for managing I/O page tables for devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, etc.) are expected to use this interface instead of creating their own logic to isolate untrusted device DMAs initiated by userspace. This proposal describes the uAPI of /dev/ioasid and also sample sequences with VFIO as example in typical usages. The driver-facing kernel API provided by the iommu layer is still TBD, which can be discussed after consensus is made on this uAPI. It's based on a lengthy discussion starting from here: https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ It ends up to be a long writing due to many things to be summarized and non-trivial effort required to connect them into a complete proposal. Hope it provides a clean base to converge. [..] /* * Page fault report and response * * This is TBD. Can be added after other parts are cleared up. Likely it * will be a ring buffer shared between user/kernel, an eventfd to notify * the user and an ioctl to complete the fault. * * The fault data is per I/O address space, i.e.: IOASID + faulting_addr */ Hi, It seems that the ioasid has different usage in different situation, it could be directly used in the physical routing, or just a virtual handle that indicates a page table or a vPASID table (such as the GPA address space, in the simple passthrough case, the DMA input to IOMMU will just contain a Stream ID, no Substream ID), right? And Baolu suggested that since one device might consume multiple page tables, it's more reasonable to have one fault handler per page table. By this, do we have to maintain such an ioasid info list in the IOMMU layer? 
As discussed earlier, the I/O page fault and cache invalidation paths will have "device labels" so that the information could be easily translated and routed. So it's likely the per-device fault handler registering API in iommu core can be kept, but /dev/ioasid will be grown with a layer to translate and propagate I/O page fault information to the right consumers. Yeah, having a general preprocessing of the faults in IOASID seems to be a doable direction. But since there may be more than one consumer at the same time, who is responsible for registering the per-device fault handler? The drivers register per page table fault handlers to /dev/ioasid which will then register itself to iommu core to listen and route the per- device I/O page faults. This is just a top level thought. Haven't gone through the details yet. Need to wait and see what /dev/ioasid finally looks like. Best regards, baolu
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: > On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: > > From: Thierry Reding > > > > Hi, > > > > this is a set of patches that is the result of earlier discussions > > regarding early identity mappings that are needed to avoid SMMU faults > > during early boot. > > > > The goal here is to avoid early identity mappings altogether and instead > > postpone the need for the identity mappings to when devices are attached > > to the SMMU. This works by making the SMMU driver coordinate with the > > memory controller driver on when to start enforcing SMMU translations. > > This makes Tegra behave in a more standard way and pushes the code to > > deal with the Tegra-specific programming into the NVIDIA SMMU > > implementation. > > > > Compared to the original version of these patches, I've split the > > preparatory work into a separate patch series because it became very > > large and will be mostly uninteresting for this audience. > > > > Patch 1 provides a mechanism to program SID overrides at runtime. Patch > > 2 updates the ARM SMMU device tree bindings to include the Tegra186 > > compatible string as suggested by Robin during review. > > > > Patches 3 and 4 create the fundamentals in the SMMU driver to support > > this and also make this functionality available on Tegra186. Patch 5 > > hooks the ARM SMMU up to the memory controller so that the memory client > > stream ID overrides can be programmed at the right time. > > > > Patch 6 extends this mechanism to Tegra186 and patches 7-9 enable all of > > this through device tree updates. Patch 10 is included here to show how > > SMMU will be enabled for display controllers. However, it cannot be > > applied yet because the code to create identity mappings for potentially > > live framebuffers hasn't been merged yet. 
> > > > The end result is that various peripherals will have SMMU enabled, while > > the display controllers will keep using passthrough, as initially set up > > by firmware. Once the device tree bindings have been accepted and the > > SMMU driver has been updated to create identity mappings for the display > > controllers, they can be hooked up to the SMMU and the code in this > > series will automatically program the SID overrides to enable SMMU > > translations at the right time. > > > > Note that the series creates a compile time dependency between the > > memory controller and IOMMU trees. If it helps I can provide a branch > > for each tree, modelling the dependency, once the series has been > > reviewed. > > > > Changes in v2: > > - split off the preparatory work into a separate series (that needs to > > be applied first) > > - address review comments by Robin > > > > Thierry > > > > Thierry Reding (10): > > memory: tegra: Implement SID override programming > > dt-bindings: arm-smmu: Add Tegra186 compatible string > > iommu/arm-smmu: Implement ->probe_finalize() > > iommu/arm-smmu: tegra: Detect number of instances at runtime > > iommu/arm-smmu: tegra: Implement SID override programming > > iommu/arm-smmu: Use Tegra implementation on Tegra186 > > arm64: tegra: Use correct compatible string for Tegra186 SMMU > > arm64: tegra: Hook up memory controller to SMMU on Tegra186 > > arm64: tegra: Enable SMMU support on Tegra194 > > arm64: tegra: Enable SMMU support for display on Tegra194 > > > > .../devicetree/bindings/iommu/arm,smmu.yaml | 11 +- > > arch/arm64/boot/dts/nvidia/tegra186.dtsi | 4 +- > > arch/arm64/boot/dts/nvidia/tegra194.dtsi | 166 ++ > > drivers/iommu/arm/arm-smmu/arm-smmu-impl.c| 3 +- > > drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c | 90 -- > > drivers/iommu/arm/arm-smmu/arm-smmu.c | 13 ++ > > drivers/iommu/arm/arm-smmu/arm-smmu.h | 1 + > > drivers/memory/tegra/mc.c | 9 + > > drivers/memory/tegra/tegra186.c | 72 > > include/soc/tegra/mc.h| 3 + > > 
10 files changed, 349 insertions(+), 23 deletions(-) > > Will, Robin, > > do you have any more comments on the ARM SMMU bits of this series? If > not, can you guys provide an Acked-by so that Krzysztof can pick this > (modulo the DT patches) up into the memory-controller tree for v5.14? > > I'll send out a v3 with the bisectability fix that Krishna pointed > out. Probably best if I queue 3-6 on a separate branch once you send a v3, then Krzysztof can pull that in if he needs it. Will
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Monday, May 31, 2021 11:43 PM > > On Mon, May 31, 2021 at 05:37:35PM +, Parav Pandit wrote: > > > In that case, can it be a new system call? Why does it have to be under > /dev/ioasid? > > For example few years back such system call mpin() thought was proposed > in [1]. > > Reference counting of the overall pins are required > > So when a pinned pages is incorporated into an IOASID page table in a later > IOCTL it means it cannot be unpinned while the IOASID page table is using it. OK, but can't it use the same refcount that the mmu uses? > > This is some trick to organize the pinning into groups and then refcount each > group, thus avoiding needing per-page refcounts. Pinned page refcount is already maintained by the mmu without ioasid, isn't it? > > The data structure would be an interval tree of pins in general > > The ioasid itself would have an interval tree of its own mappings, each entry > in this tree would reference count against an element in the above tree > > Then the ioasid's interval tree would be mapped into a page table tree in HW > format. Does it mean in simple use case [1], second level page table copy is maintained in the IOMMU side via map interface? I hope not. It should use the same as what mmu uses, right? [1] one SIOV/ADI device assigned with one PASID and mapped in guest VM > > The redundant storages are needed to keep track of the referencing and the > CPU page table values for later unpinning. > > Jason
Re: [RFC] /dev/ioasid uAPI proposal
Hi Jason, On 2021/5/29 7:36, Jason Gunthorpe wrote: /* * Bind an user-managed I/O page table with the IOMMU * * Because user page table is untrusted, IOASID nesting must be enabled * for this ioasid so the kernel can enforce its DMA isolation policy * through the parent ioasid. * * Pgtable binding protocol is different from DMA mapping. The latter * has the I/O page table constructed by the kernel and updated * according to user MAP/UNMAP commands. With pgtable binding the * whole page table is created and updated by userspace, thus different * set of commands are required (bind, iotlb invalidation, page fault, etc.). * * Because the page table is directly walked by the IOMMU, the user * must use a format compatible to the underlying hardware. It can * check the format information through IOASID_GET_INFO. * * The page table is bound to the IOMMU according to the routing * information of each attached device under the specified IOASID. The * routing information (RID and optional PASID) is registered when a * device is attached to this IOASID through VFIO uAPI. * * Input parameters: *- child_ioasid; *- address of the user page table; *- formats (vendor, address_width, etc.); * * Return: 0 on success, -errno on failure. */ #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9) #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10) Also feels backwards, why wouldn't we specify this, and the required page table format, during alloc time? Thinking of the required page table format, perhaps we should shed more light on the page table of an IOASID. So far, an IOASID might represent one of the following page tables (might be more): 1) an IOMMU format page table (a.k.a. iommu_domain) 2) a user application CPU page table (SVA for example) 3) a KVM EPT (future option) 4) a VM guest managed page table (nesting mode) This version only covers 1) and 4). Do you think we need to support 2), 3) and beyond? 
If so, it seems that we need some in-kernel helpers and uAPIs to support pre-installing a page table to IOASID. From this point of view an IOASID is actually not just a variant of iommu_domain, but an I/O page table representation in a broader sense. Best regards, baolu
Re: [PATCH] iommu: Print default strict or lazy mode at init time
On 2021-05-28 14:37, John Garry wrote: As well as the default domain type, it's useful to know whether strict or lazy mode is default for DMA domains, so add this info in a separate print. Signed-off-by: John Garry diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 808ab70d5df5..f25fae62f077 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -138,6 +138,11 @@ static int __init iommu_subsys_init(void) (iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ? "(set via kernel command line)" : ""); + pr_info("Default DMA domain mode: %s %s\n", Nit: I think this might be a little unclear for end-users - *I'm* not even sure whether "Default" here is meant to refer to the mode setting itself or to default domains (of DMA type). Maybe something like "DMA domain TLB invalidation policy"? Certainly it seems like a good idea to explicitly mention invalidation to correlate with the documentation of the "iommu.strict" parameter. Ack to the general idea though. Thanks, Robin. + iommu_dma_strict ? "strict" : "lazy", + (iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ? + "(set via kernel command line)" : ""); + return 0; } subsys_initcall(iommu_subsys_init);
Re: [PATCH 2/2] iommu: Drop unnecessary of_iommu.h includes
On 2021-05-27 20:37, Rob Herring wrote: The only place of_iommu.h is needed is in drivers/of/device.c. Remove it from everywhere else. Of course, this was from the OF_IOMMU_DECLARE() business, which is all long gone now. Reviewed-by: Robin Murphy Cc: Will Deacon Cc: Robin Murphy Cc: Joerg Roedel Cc: Rob Clark Cc: Marek Szyprowski Cc: Krzysztof Kozlowski Cc: Bjorn Andersson Cc: Yong Wu Cc: Matthias Brugger Cc: Heiko Stuebner Cc: Jean-Philippe Brucker Cc: Frank Rowand Cc: linux-arm-ker...@lists.infradead.org Cc: iommu@lists.linux-foundation.org Signed-off-by: Rob Herring ---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c       | 1 -
 drivers/iommu/arm/arm-smmu/qcom_iommu.c     | 1 -
 drivers/iommu/exynos-iommu.c                | 1 -
 drivers/iommu/ipmmu-vmsa.c                  | 1 -
 drivers/iommu/msm_iommu.c                   | 1 -
 drivers/iommu/mtk_iommu.c                   | 1 -
 drivers/iommu/mtk_iommu_v1.c                | 1 -
 drivers/iommu/omap-iommu.c                  | 1 -
 drivers/iommu/rockchip-iommu.c              | 1 -
 drivers/iommu/virtio-iommu.c                | 1 -
 drivers/of/platform.c                       | 1 -
 12 files changed, 12 deletions(-)
[unchanged #include context lines of each hunk were lost in archiving; only the removed include is shown]
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 54b2f27b81d4..2ddc3cd5a7d1 100644
@@ -23,7 +23,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..dba15f312cbd 100644
@@ -31,7 +31,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 4294abe389b2..021cf8f65ffc 100644
@@ -25,7 +25,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 7623d8c371f5..d0fbf1d10e18 100644
@@ -17,7 +17,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index aaa6a4d59057..51ea6f00db2f 100644
@@ -19,7 +19,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 7880f307cb2d..3a38352b603f 100644
@@ -18,7 +18,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e06b8a0e2b56..6f7c69688ce2 100644
@@ -19,7 +19,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 5915d7b38211..778e66f5f1aa 100644
@@ -22,7 +22,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 26e517eb0dd3..91749654fd49 100644
@@ -22,7 +22,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 7a2932772fdf..bb50e015b1d5 100644
@@ -21,7 +21,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 7c02481a81b4..d9f46f2c3058 100644
@@ -14,7 +14,6 @@
-#include <linux/of_iommu.h>
diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 25d448f5af91..74afbb7a4f5e 100644
@@ -17,7 +17,6 @@
-#include <linux/of_iommu.h>
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 2:16 PM, Tian, Kevin wrote: From: Jason Wang Sent: Tuesday, June 1, 2021 2:07 PM On 2021/6/1 1:42 PM, Tian, Kevin wrote: From: Jason Wang Sent: Tuesday, June 1, 2021 1:30 PM On 2021/6/1 1:23 PM, Lu Baolu wrote: Hi Jason W, On 6/1/21 1:08 PM, Jason Wang wrote: 2) If yes, what's the reason for not simply using the fd opened from /dev/ioas? (This is the question that is not answered.) And what happens if we call GET_INFO for the ioasid_fd? 3) If not, how does GET_INFO work? oh, missed this question in my prior reply. Personally, no special reason yet. But using an ID may give us the opportunity to customize the management of the handle. For one, better lookup efficiency by using an xarray to store the allocated IDs. For two, we could categorize the allocated IDs (parent or nested). GET_INFO just works with an input FD and an ID. I'm not sure I get this; for nesting cases you can still make the child an fd. And a question still: under what case do we need to create multiple ioasids on a single ioasid fd? One possible situation where multiple IOASIDs per FD could be used is that devices with different underlying IOMMU capabilities are sharing a single FD. In this case, only devices with consistent underlying IOMMU capabilities could be put in an IOASID, and multiple IOASIDs per FD could be applied. Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple IOASID FDs" for such a case. Right, that's exactly my question. The latter seems much easier to understand and implement. A simple reason discussed in a previous thread - there could be ~1M I/O address spaces per device while #FDs are a precious resource. Is the concern ulimit or performance? Note that we had #define NR_OPEN_MAX ~0U And with the fd semantics, you can do a lot of other stuff: close on exec, passing via SCM_RIGHTS. yes, fd has its merits. For the case of 1M, I would like to know what's the use case for a single process to handle 1M+ address spaces? This single process is Qemu with an assigned device. 
Within the guest there could be many guest processes. Though in reality I haven't seen such 1M processes on a single device; better not to restrict it in the uAPI? Sorry, I don't follow here. We can open up to ~0U file descriptors, so I don't see why we need to restrict it in the uAPI. Thanks So this RFC treats the fd as a container of address spaces, each tagged by an IOASID. If the container and address space are 1:1 then the container seems useless. yes, if 1:1 then the container is useless. But here it's assumed to be 1:M, so even a single fd is sufficient for all intended usages. Thanks Kevin
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Saturday, May 29, 2021 3:59 AM > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > 5. Use Cases and Flows > > > > Here assume VFIO will support a new model where every bound device > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o > > going through the legacy container/group interface. For illustration purposes > > those devices are just called dev[1...N]: > > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode); > > > > As explained earlier, one IOASID fd is sufficient for all intended use > > cases: > > > > ioasid_fd = open("/dev/ioasid", mode); > > > > For simplicity the examples below are all made for the virtualization story. > > They are representative and could be easily adapted to a non-virtualization > > scenario. > > For others, I don't think this is *strictly* necessary, we can > probably still get to the device_fd using the group_fd and fit in > /dev/ioasid. It does make the rest of this more readable though. Jason, I want to confirm here. Per earlier discussion we retained the impression that you want VFIO to be a pure device driver, with container/group used only for legacy applications. From this comment are you suggesting that VFIO can still keep the container/group concepts and the user just deprecates the use of the vfio iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a simple policy that an IOASID will reject a cmd if a partially-attached group exists)? > > > > Three types of IOASIDs are considered: > > > > gpa_ioasid[1...N]: for GPA address space > > giova_ioasid[1...N]: for guest IOVA address space > > gva_ioasid[1...N]: for guest CPU VA address space > > > > At least one gpa_ioasid must always be created per guest, while the other > > two are relevant as far as vIOMMU is concerned. > > > > Examples here apply to both pdev and mdev, if not explicitly marked out > > (e.g. in section 5.5). 
The VFIO device driver in the kernel will figure out the > > associated routing information in the attaching operation. > > > > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_ > > INFO are skipped in these examples. > > > > 5.1. A simple example > > ++ > > > > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address > > space is managed through the DMA mapping protocol: > > > > /* Bind device to IOASID fd */ > > device_fd = open("/dev/vfio/devices/dev1", mode); > > ioasid_fd = open("/dev/ioasid", mode); > > ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd); > > > > /* Attach device to IOASID */ > > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); > > at_data = { .ioasid = gpa_ioasid}; > > ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data); > > > > /* Setup GPA mapping */ > > dma_map = { > > .ioasid = gpa_ioasid; > > .iova = 0;// GPA > > .vaddr = 0x4000; // HVA > > .size = 1GB; > > }; > > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map); > > > > If the guest is assigned more than dev1, the user follows the above sequence > > to attach the other devices to the same gpa_ioasid, i.e. sharing the GPA > > address space across all assigned devices. > > eg > > device2_fd = open("/dev/vfio/devices/dev1", mode); > ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd); > ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data); > > Right? Exactly, except a small typo ('dev1' -> 'dev2'). :) > > > > > 5.2. Multiple IOASIDs (no nesting) > > > > > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially > > both devices are attached to gpa_ioasid. After boot the guest creates > > a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass- > > through mode (gpa_ioasid). > > > > Suppose IOASID nesting is not supported in this case. Qemu needs to > > generate shadow mappings in userspace for giova_ioasid (like how > > VFIO works today). 
> > > > To avoid duplicated locked page accounting, it's recommended to pre-register > > the virtual address range that will be used for DMA: > > > > device_fd1 = open("/dev/vfio/devices/dev1", mode); > > device_fd2 = open("/dev/vfio/devices/dev2", mode); > > ioasid_fd = open("/dev/ioasid", mode); > > ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd); > > ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd); > > > > /* pre-register the virtual address range for accounting */ > > mem_info = { .vaddr = 0x4000; .size = 1GB }; > > ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info); > > > > /* Attach dev1 and dev2 to gpa_ioasid */ > > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); > > at_data = { .ioasid = gpa_ioasid}; > > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data); > > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data); > > > > /* Setup GPA mapping */ > > dma_map = { > > .ioasid = gpa_ioasid; > > .iova
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Saturday, May 29, 2021 1:36 AM > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > IOASID nesting can be implemented in two ways: hardware nesting and > > software nesting. With hardware support the child and parent I/O page > > tables are walked consecutively by the IOMMU to form a nested translation. > > When it's implemented in software, the ioasid driver is responsible for > > merging the two-level mappings into a single-level shadow I/O page table. > > Software nesting requires both child/parent page tables operated through > > the dma mapping protocol, so any change in either level can be captured > > by the kernel to update the corresponding shadow mapping. > > Why? A SW emulation could do this synchronization during invalidation > processing if invalidation contained an IOVA range. In this proposal we differentiate between host-managed and user-managed I/O page tables. If host-managed, the user is expected to use the map/unmap cmd explicitly upon any change required on the page table. If user-managed, the user first binds its page table to the IOMMU and then uses the invalidation cmd to flush the iotlb when necessary (e.g. typically not required when changing a PTE from non-present to present). We expect the user to use map+unmap and bind+invalidate respectively instead of mixing them together. Following this policy, map+unmap must be used in both levels for software nesting, so changes in either level are captured in time to synchronize the shadow mapping. > > I think this document would be stronger to include some "Rationale" > statements in key places > Sure. I tried to provide rationale as much as possible but sometimes it's lost in a complex context like this. :) Thanks Kevin
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 11:55:00PM +0530, Kirti Wankhede wrote: > > > On 5/27/2021 10:30 AM, David Gibson wrote: > > On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote: > > > > > > > > > On 5/26/2021 1:22 AM, Jason Gunthorpe wrote: > > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote: > > > > > > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created > > > > > per > > > > > VF (mdev device == VF device) then that mdev device has same iommu > > > > > protection scope as VF associated to it. > > > > > > > > This doesn't require, and certainly shouldn't create, a fake group. > > > > > > > > Only the VF's real IOMMU group should be used to model an iommu domain > > > > linked to a VF. Injecting fake groups that are proxies for real groups > > > > only opens the possibility of security problems like David is > > > > concerned with. > > > > > > > > > > I think this security issue should be addressed by letting mdev device > > > inherit its parent's iommu_group, i.e. VF's iommu_group here. > > > > No, that doesn't work. AIUI part of the whole point of mdevs is to > > allow chunks of a single PCI function to be handed out to different > > places, because they're isolated from each other not by the system > > IOMMU, but by a combination of MMU hardware in the hardware (e.g. in a > > GPU card) and software in the mdev driver. > > That's correct for non-iommu backed mdev devices. > > > If mdevs inherited the > > group of their parent device they wouldn't count as isolated from each > > other, which they should. > > > > For iommu backed mdev devices for SRIOV, where there can be single mdev > device for its parent, here parent device is VF, there can't be multiple > mdev devices associated with that VF. In this case mdev can inherit the > group of parent device. Ah, yes, if there's just one mdev for the PCI function, and the function doesn't have an internal memory protection unit then this makes sense. 
Which means we *do* have at least two meaningfully different group configurations for mdev: * mdev is in a singleton group independent of the parent PCI device * mdev shares a group with its parent PCI device Which means even in the case of mdevs, the group structure is *not* a meaningless fiction. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 04:06:20PM -0300, Jason Gunthorpe wrote: > On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote: > > > > > If the physical device had a bug which meant the mdevs *weren't* > > > > properly isolated from each other, then those mdevs would share a > > > > group, and you *would* care about it. Depending on how the isolation > > > > failed the mdevs might or might not also share a group with the parent > > > > physical device. > > > > > > That isn't a real scenario.. mdevs that can't be isolated just > > > wouldn't be useful to exist > > > > Really? So what do you do when you discover some mdevs you thought > > were isolated actually aren't due to a hardware bug? Drop support > > from the driver entirely? In which case what do you say to the people > > who understandably complain "but... we had all the mdevs in one guest > > anyway, we don't care if they're not isolated"? > > I've never said to eliminate groups entirely. > > What I'm saying is that all the cases we have for mdev today do not > require groups, but are forced to create a fake group anyhow just to > satisfy the odd VFIO requirement to have a group FD. > > If some future mdev needs groups then sure, add the appropriate group > stuff. > > But that doesn't affect the decision to have a VFIO group FD, or not. > > > > > It ensures that they're parked at the moment the group moves from > > > > kernel to userspace ownership, but it can't prevent dpdk from > > > > accessing and unparking those devices via peer to peer DMA. > > > > > > Right, and adding all this group stuff did nothing to alert the poor > > > admin that is running DPDK to this risk. > > > > Didn't it? Seems to me that in order to give the group to > > DPDK, the admin had to find and unbind all the things in it... so is > > therefore aware that they're giving everything in it to DPDK. > > Again, I've never said the *group* should be removed. 
I'm only > concerned about the *group FD* Ok, that wasn't really clear to me. I still wouldn't say the group for mdevs is a fiction though.. rather that the group device used for (no internal IOMMU case) mdevs is just plain wrong. > When the admin found and unbound they didn't use the *group FD* in any > way. No, they are likely to have changed permissions on the group device node as part of the process, though. > > > You put the same security labels you'd put on the group to the devices > > > that consitute the group. It is only more tricky in the sense that the > > > script that would have to do this will need to do more than ID the > > > group to label but also ID the device members of the group and label > > > their char nodes. > > > > Well, I guess, if you take the view that root is allowed to break the > > kernel. I tend to prefer that although root can obviously break the > > kernel if they intend do, we should make it hard to do by accident - > > which in this case would mean the kernel *enforcing* that the devices > > in the group have the same security labels, which I can't really see > > how to do without an exposed group. > > How is this "break the kernel"? It has nothing to do with the > kernel. Security labels are a user space concern. *thinks*... yeah, ok, that was much too strong an assertion. What I was thinking of is the fact that this means that guarantees you'd normally expect the kernel to enforce can be obviated by bad configuration: chown-ing a device to root doesn't actually protect it if there's another device in the same group exposed to other users. But I guess you could say the same about, say, an unauthenticated nbd export of a root-owned block device, so I guess that's not something the kernel can reasonably enforce. Ok.. you might be finally convincing me, somewhat. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote: > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote: > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote: > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote: > > > > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per > > > > VF (mdev device == VF device) then that mdev device has same iommu > > > > protection scope as VF associated to it. > > > > > > This doesn't require, and certainly shouldn't create, a fake group. > > > > It's only fake if you start with a narrow view of what a group is. > > A group is connected to drivers/iommu. A group object without *any* > relation to drivers/iommu is just a complete fiction, IMHO. That might be where we differ. As I've said, by 'group' I primarily mean the fundamental hardware unit of isolation. *Usually* that's determined by the capabilities of an IOMMU, but in some cases it might not be. In either case, the boundaries still matter. > > > Only the VF's real IOMMU group should be used to model an iommu domain > > > linked to a VF. Injecting fake groups that are proxies for real groups > > > only opens the possibility of security problems like David is > > > concerned with. > > > > It's not a proxy for a real group, it's a group of its own. If you > > discover that (due to a hardware bug, for example) the mdev is *not* > > What Kirti is talking about here is the case where a mdev is wrapped > around a VF and the DMA isolation stems directly from the SRIOV VF's > inherent DMA isolation, not anything the mdev wrapper did. > > The group providing the isolation is the VF's group. Yes, in that case the mdev absolutely should be in the VF's group - having its own group is not just messy but incorrect. > The group mdev implicitly creates is just a fake proxy that comes > along with mdev API. It doesn't do anything and it doesn't mean > anything. But.. 
the case of multiple mdevs managed by a single PCI device with an internal IOMMU also exists, and then the mdev groups are *not* proxies but true groups independent of the parent device. Which means that the group structure of mdevs can vary, which is an argument *for* keeping it, not against. > > properly isolated from its parent PCI device, then both the mdev > > virtual device *and* the physical PCI device are in the same group. > > Groups including devices of different types and on different buses > > were considered from the start, and are precedented, if rare. > > This is far too theoretical for me. A security-broken mdev is > functionally useless. Is it, though? Again, I'm talking about the case of multiple mdevs with a single parent device (because that's the only case I was aware of until recently). Isolation comes from a device-internal IOMMU... that turns out to be broken. But if your security domain happens to include all the mdevs on the device anyway, then you don't care. Are you really going to say people can't use their fancy hardware in this mode because it has a security flaw that's not relevant to their usecase? And then.. there's Kirti's case. In that case the mdev should belong to its parent PCI device's group since that's what's providing isolation. But in that case the parent device can be in a multi-device group for any of the usual reasons (PCIe-to-PCI bridge, PCIe switch with broken ACS, multifunction device with crosstalk). Which means the mdev also shares a group with those other devices. So again, the group structure matters and is not a fiction. > We don't need to support it, and we don't need complicated software to > model it. > > Jason > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/~dgibson
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jean-Philippe Brucker > Sent: Saturday, May 29, 2021 12:23 AM > > > > IOASID nesting can be implemented in two ways: hardware nesting and > > software nesting. With hardware support the child and parent I/O page > > tables are walked consecutively by the IOMMU to form a nested translation. > > When it's implemented in software, the ioasid driver is responsible for > > merging the two-level mappings into a single-level shadow I/O page table. > > Software nesting requires both child/parent page tables operated through > > the dma mapping protocol, so any change in either level can be captured > > by the kernel to update the corresponding shadow mapping. > > Is there an advantage to moving software nesting into the kernel? > We could just have the guest do its usual combined map/unmap on the child > fd > There are at least two intended usages: 1) From previous discussion it looks like PPC's window-based scheme can be better supported with software nesting: a shared IOVA address space as the parent (shared by all devices), nested by multiple windows as the children (per-device); 2) Some mdev drivers (e.g. kvmgt) may want to do write-protection on guest data structures (a base address programmed to a mediated MMIO register). The base address is an IOVA while the KVM page-tracking API is based on GPA. Nesting allows finding the GPA corresponding to an IOVA. Thanks Kevin
Re: Regression 5.12.0-rc4 net: ice: significant throughput drop
[ ping Robin / Joerg, +Cc Christoph ] On 5/28/21 10:34 AM, Jussi Maki wrote: Hi all, While measuring the impact of a kernel patch on our lab machines I stumbled upon a performance regression affecting the 100Gbit ICE NIC and bisected it over the range v5.11.1..v5.13-rc3 to the commit: a250c23f15c2 iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE Both recent bpf-next (d6a6a55518) and linux-stable (c4681547bc) are affected by the issue. The regression shows as a significant drop in throughput as measured with "super_netperf" [0], with measured bandwidth of ~95Gbps before and ~35Gbps after: commit 3189713a1b84 (a250c23^): $ ./super_netperf 32 -H 172.18.0.2 -l 10 97379.8 commit a250c23f15c2: $ ./super_netperf 32 -H 172.18.0.2 -l 10 34097.5 The pair of test machines have this hardware: CPU: AMD Ryzen 9 3950X 16-Core Processor MB: X570 AORUS MASTER 0a:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller E810-C for QSFP [8086:1592] (rev 02) Kernel config: https://gist.github.com/joamaki/9ee11294c78a8dd2776041f67e5620e7 [0] https://github.com/borkmann/stuff/blob/master/super_netperf
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 13:10, Lu Baolu wrote: > Hi Shenming, > > On 6/1/21 12:31 PM, Shenming Lu wrote: >> On 2021/5/27 15:58, Tian, Kevin wrote: >>> /dev/ioasid provides a unified interface for managing I/O page tables for >>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, >>> etc.) are expected to use this interface instead of creating their own >>> logic to >>> isolate untrusted device DMAs initiated by userspace. >>> >>> This proposal describes the uAPI of /dev/ioasid and also sample sequences >>> with VFIO as an example in typical usages. The driver-facing kernel API >>> provided >>> by the iommu layer is still TBD, which can be discussed after consensus is >>> made on this uAPI. >>> >>> It's based on a lengthy discussion starting from here: >>> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ >>> >>> It ends up being a long write-up due to many things to be summarized and >>> the non-trivial effort required to connect them into a complete proposal. >>> Hope it provides a clean base to converge. >>> >> >> [..] >> >>> >>> /* >>> * Page fault report and response >>> * >>> * This is TBD. Can be added after other parts are cleared up. Likely it >>> * will be a ring buffer shared between user/kernel, an eventfd to notify >>> * the user and an ioctl to complete the fault. >>> * >>> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr >>> */ >> >> Hi, >> >> It seems that the ioasid has different usages in different situations: it could >> be directly used in the physical routing, or just be a virtual handle that >> indicates >> a page table or a vPASID table (such as the GPA address space, in the simple >> passthrough case, the DMA input to the IOMMU will just contain a Stream ID, no >> Substream ID), right? >> >> And Baolu suggested that since one device might consume multiple page tables, >> it's more reasonable to have one fault handler per page table. 
By this, do we >> have to maintain such an ioasid info list in the IOMMU layer? > > As discussed earlier, the I/O page fault and cache invalidation paths > will have "device labels" so that the information could be easily > translated and routed. > > So it's likely the per-device fault handler registration API in the iommu > core can be kept, but /dev/ioasid will grow a layer to > translate and propagate I/O page fault information to the right > consumers. Yeah, having a general preprocessing of the faults in IOASID seems to be a doable direction. But since there may be more than one consumer at the same time, who is responsible for registering the per-device fault handler? Thanks, Shenming > > If things evolve in this way, probably the SVA I/O page fault also needs > to be ported to /dev/ioasid. > >> >> Then if we add host IOPF support (for the GPA address space) in the future >> (I have sent a series for this but it aimed at VFIO, I will convert it for >> IOASID later [1] :-)), how could we find the handler for the received fault >> event which only contains a Stream ID... Do we also have to maintain a >> dev(vPASID)->ioasid mapping in the IOMMU layer? >> >> [1] https://lore.kernel.org/patchwork/cover/1410223/ > > Best regards, > baolu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Saturday, May 29, 2021 4:03 AM > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > /dev/ioasid provides a unified interface for managing I/O page tables for > > devices assigned to userspace. Device passthrough frameworks (VFIO, > vDPA, > > etc.) are expected to use this interface instead of creating their own > > logic to > > isolate untrusted device DMAs initiated by userspace. > > It is very long, but I think this has turned out quite well. It > certainly matches the basic sketch I had in my head when we were > talking about how to create vDPA devices a few years ago. > > When you get down to the operations they all seem pretty common sense > and straightforward. Create an IOASID. Connect to a device. Fill the > IOASID with pages somehow. Worry about PASID labeling. > > It really is critical to get all the vendor IOMMU people to go over it > and see how their HW features map into this. > Agree. BTW, I feel it might be good to have several design opens centrally discussed after going through all the comments. Otherwise they may be buried in different sub-threads and potentially handled with insufficient care (especially for people who haven't completed the reading). I summarized five opens here, about: 1) Finalizing the name to replace /dev/ioasid; 2) Whether one device is allowed to bind to multiple IOASID fd's; 3) Carrying device information in the invalidation/fault reporting uAPI; 4) What should/could be specified when allocating an IOASID; 5) The protocol between vfio group and kvm. For 1), two alternative names are mentioned: /dev/iommu and /dev/ioas. I don't have a strong preference and would like to hear votes from all stakeholders. /dev/iommu is slightly better imho for two reasons. First, per AMD's presentation in the last KVM forum they implement vIOMMU in hardware and thus need to support user-managed domains. An iommu uAPI notation might make more sense moving forward. 
Second, it makes later uAPI naming easier as 'IOASID' can always be put as an object, e.g. IOMMU_ALLOC_IOASID instead of IOASID_ALLOC_IOASID. :) Another naming open is about IOASID (the software handle for an ioas) and the associated hardware ID (PASID or substream ID). Jason thought PASID is defined more from an SVA angle while ARM's convention sounds clearer from a device p.o.v. Following this direction, SID/SSID will be used to replace RID/PASID in this RFC (possibly also implying that the kernel IOASID allocator should be renamed to an SSID allocator). I don't have a better alternative. If no one objects, I'll change to this new naming in the next version. For 2), Jason prefers not to block it if there is no kernel design reason. If one device is allowed to bind to multiple IOASID fd's, the main problem is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1 and giova_ioasid created in fd2 and then nesting them together (and whether any cross-fd notification is required when handling invalidation etc.). We thought that this just adds some complexity while we're not sure about the value of supporting it (when one fd can already afford all discussed usages). Therefore this RFC proposes that a device be bound to at most one IOASID fd. Does this rationale make sense? At the other end, there was also the thought of whether we should make a single I/O address space per IOASID fd. It was discussed in a previous thread that #fds are insufficient to afford the theoretical 1M address spaces per device. But let's have another revisit and draw a clear conclusion on whether this option is viable. For 3), Jason/Jean both think it's cleaner to carry device info in the uAPI. Actually this was one option we developed in earlier internal versions of this RFC. Later on we changed it to the current way based on a misinterpretation of previous discussion. 
Thinking more, we will adopt this suggestion in the next version, for both efficiency (I/O page fault handling is already a long path) and security reasons (some faults are unrecoverable, thus the faulting device must be identified/isolated).

This implies that VFIO_BOUND_IOASID will be extended to allow the user to specify a device label. This label will be recorded in /dev/iommu to serve per-device invalidation requests from, and report per-device fault data to, the user. In addition, the vPASID (if provided by the user) will also be recorded in /dev/iommu so that vPASID<->pPASID conversion is conducted properly, e.g. an invalidation request from the user carries a vPASID which must be converted into a pPASID before calling the iommu driver; vice versa for raw fault data, which carries a pPASID while the user expects a vPASID.

For 4), there are two options for specifying the IOASID attributes. In this RFC, an IOASID has no attribute before it's attached to any device. After device attach, the user queries capability/format info about the IOMMU which the device belongs to, and then calls different ioctl commands to set the attributes for the IOASID (e.g. map/unmap, bind/unbind, etc.)
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Wang
> Sent: Tuesday, June 1, 2021 2:07 PM
>
> On 2021/6/1 at 1:42 PM, Tian, Kevin wrote:
> >> From: Jason Wang
> >> Sent: Tuesday, June 1, 2021 1:30 PM
> >>
> >> On 2021/6/1 at 1:23 PM, Lu Baolu wrote:
> >>> Hi Jason W,
> >>>
> >>> On 6/1/21 1:08 PM, Jason Wang wrote:
> >> 2) If yes, what's the reason for not simply using the fd opened from
> >> /dev/ioas? (This is the question that is not answered.) And what
> >> happens if we call GET_INFO for the ioasid_fd?
> >> 3) If not, how does GET_INFO work?
> > oh, missed this question in prior reply. Personally, no special reason
> > yet. But using an ID may give us the opportunity to customize the
> > management of the handle. For one, better lookup efficiency by using an
> > xarray to store the allocated IDs. For two, we could categorize the
> > allocated IDs (parent or nested). GET_INFO just works with an input FD
> > and an ID.
>
> I'm not sure I get this; for nesting cases you can still make the
> child an fd.
>
> And a question still: under what case do we need to create multiple
> ioasids on a single ioasid fd?
>
> >>> One possible situation where multiple IOASIDs per FD could be used is
> >>> that devices with different underlying IOMMU capabilities are sharing a
> >>> single FD. In this case, only devices with consistent underlying IOMMU
> >>> capabilities could be put in an IOASID, and multiple IOASIDs per FD could
> >>> be applied.
> >>>
> >>> Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple
> >>> IOASID FDs" for such a case.
> >>
> >> Right, that's exactly my question. The latter seems much easier to
> >> understand and implement.
> >>
> > A simple reason discussed in the previous thread - there could be 1M
> > I/O address spaces per device while FDs are a precious resource.
>
> Is the concern for ulimit or performance? Note that we had
>
> #define NR_OPEN_MAX ~0U
>
> And with the fd semantics, you can do a lot of other stuff: close on
> exec, passing via SCM_RIGHTS.

Yes, fd has its merits.
>
> For the case of 1M, I would like to know what's the use case for a
> single process to handle 1M+ address spaces?

This single process is Qemu with an assigned device. Within the guest there could be many guest processes. Though in reality I haven't seen such 1M processes on a single device, better not to restrict it in the uAPI?

> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
>
> If the container and address space is 1:1 then the container seems useless.

Yes, if 1:1 then the container is useless. But here it's assumed 1:M, so even a single fd is sufficient for all intended usages.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 at 1:42 PM, Tian, Kevin wrote:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 1:30 PM
>>
>> On 2021/6/1 at 1:23 PM, Lu Baolu wrote:
>>> Hi Jason W,
>>>
>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>> 2) If yes, what's the reason for not simply using the fd opened from
>> /dev/ioas? (This is the question that is not answered.) And what
>> happens if we call GET_INFO for the ioasid_fd?
>> 3) If not, how does GET_INFO work?
> oh, missed this question in prior reply. Personally, no special reason
> yet. But using an ID may give us the opportunity to customize the
> management of the handle. For one, better lookup efficiency by using an
> xarray to store the allocated IDs. For two, we could categorize the
> allocated IDs (parent or nested). GET_INFO just works with an input FD
> and an ID.

I'm not sure I get this; for nesting cases you can still make the
child an fd.

And a question still: under what case do we need to create multiple
ioasids on a single ioasid fd?

>>> One possible situation where multiple IOASIDs per FD could be used is
>>> that devices with different underlying IOMMU capabilities are sharing a
>>> single FD. In this case, only devices with consistent underlying IOMMU
>>> capabilities could be put in an IOASID, and multiple IOASIDs per FD could
>>> be applied.
>>>
>>> Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple
>>> IOASID FDs" for such a case.
>>
>> Right, that's exactly my question. The latter seems much easier to
>> understand and implement.
>>
> A simple reason discussed in the previous thread - there could be 1M
> I/O address spaces per device while FDs are a precious resource.

Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with the fd semantics, you can do a lot of other stuff: close on
exec, passing via SCM_RIGHTS.

> So this RFC treats fd as a container of address spaces which is each
> tagged by an IOASID.

If the container and address space is 1:1 then the container seems useless.
Thanks

> Thanks
> Kevin