RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
> From: Jason Gunthorpe > Sent: Thursday, March 24, 2022 4:34 AM > > On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote: > > On Wed, 23 Mar 2022 16:34:39 -0300 > > Jason Gunthorpe wrote: > > > > > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote: > > > > On Fri, 18 Mar 2022 14:27:33 -0300 > > > > Jason Gunthorpe wrote: > > > > > > > > > +static int conv_iommu_prot(u32 map_flags) > > > > > +{ > > > > > + int iommu_prot; > > > > > + > > > > > + /* > > > > > + * We provide no manual cache coherency ioctls to userspace > and most > > > > > + * architectures make the CPU ops for cache flushing > privileged. > > > > > + * Therefore we require the underlying IOMMU to support > CPU coherent > > > > > + * operation. > > > > > + */ > > > > > + iommu_prot = IOMMU_CACHE; > > > > > > > > Where is this requirement enforced? AIUI we'd need to test > > > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like > > > > intel_iommu_map() simply drop the flag when not supported by HW. > > > > > > You are right, the correct thing to do is to fail device > > > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there, > > > however we can't do that because Intel abuses the meaning of > > > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop > behavior is > > > supported. > > > > > > I want Intel to split out their special no-snoop from IOMMU_CACHE and > > > IOMMU_CAP_CACHE_COHERENCY so these things have a consistent > meaning in > > > all iommu drivers. Once this is done vfio and iommufd should both > > > always set IOMMU_CACHE and refuse to work without > > > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of > an !IOMMU_CACHE > > > arch that does in fact work today with vfio, somehow, but I don't..) > > > > IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see > > lack of snoop-control support, causing us to have mixed coherent and > > non-coherent domains. 
I don't recall, if you go back far enough in VT-d > > history, whether the primary IOMMU might have lacked this support. So I > > think there are systems we care about with IOMMUs that can't enforce > > DMA coherency. > > > > As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and > all > > mappings make use of IOMMU_CACHE, then all DMA is coherent. Are you > > suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all > mappings > > are coherent regardless of mapping protection flags? What's the point > > of IOMMU_CACHE at that point? > > IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's > change. One nit (as I explained in the previous v1 discussion): it is not that Intel abuses this capability, since it was first introduced for Intel's force-snoop requirement. It is just that when its meaning was later changed to match what you describe below, Intel's original use was not caught and fixed properly. 😊 > > It only means normal DMAs issued in a normal way are coherent with the > CPU and do not require special cache flushing instructions. ie DMA > issued by a kernel driver using the DMA API. > > It does not mean that non-coherent DMA does not exist, or that > platform or device specific ways to trigger non-coherence are blocked. > > Stated another way, any platform that wires dev_is_dma_coherent() to > true, like all x86 does, must support IOMMU_CACHE and report > IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform > supports. The platform obviously declares it supports this in order to > support the in-kernel DMA API. This is a good explanation of IOMMU_CACHE. Given that, the intel-iommu driver should always report this capability and do nothing with IOMMU_CACHE, since coherency is already guaranteed by the arch. Actually, this is exactly what the AMD iommu driver does today. Then we'll introduce a new cap/prot specifically for enforcing snoop, as you suggested below. 
> > Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop > blocking' should be created to cover Intel's special need. From what I > know it is only implemented in the Intel driver and apparently only > for some IOMMUs connected to IGD. > > > > Yes, it was missed in the notes for vfio compat that Intel no-snoop is > > > not working currently, I fixed it. > > > > Right, I see it in the comments relative to extensions, but missed in > > the commit log. Thanks, > > Oh good, I remembered it was someplace.. > Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP
On Wed, Mar 23, 2022 at 08:54:08PM +, Robin Murphy wrote: > I'll admit I still never quite grasped the reason for also adding the > override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think > by that point we were increasingly tired and confused and starting to > second-guess ourselves (well, I was, at least). I don't think it's wrong > per se, but as I said I do think it can bite anyone who's been doing > dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 > alone turns out to work OK then I'd be inclined to try a partial revert of > just that one hunk. Agreed. Let's try that first. Oleksandr, can you try the patch below: diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 6db1c475ec827..6c350555e5a1c 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr, void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr, size_t size, enum dma_data_direction dir) { - /* -* Unconditional bounce is necessary to avoid corruption on -* sync_*_for_cpu or dma_ummap_* when the device didn't overwrite -* the whole lengt of the bounce buffer. -*/ - swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE); - BUG_ON(!valid_dma_direction(dir)); + if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) + swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE); + else + BUG_ON(dir != DMA_FROM_DEVICE); } void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr, ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Wang > Sent: Thursday, March 24, 2022 11:51 AM > > > > > > > > In the end vfio type1 will be replaced by iommufd compat layer. With > > that goal in mind iommufd has to inherit type1 behaviors. > > So the compatibility should be provided by the compat layer instead of > the core iommufd. > > And I wonder what happens if iommufd is used by other subsystems like > vDPA. Does it mean vDPA needs to inherit type1 behaviours? If yes, do > we need a per subsystem new uAPI to expose this capability? If yes, > why can't VFIO have such an API then we don't even need the compat > layer at all? > No, the compat layer is just for vfio. Other subsystems, including vDPA, are expected to use the iommufd uAPI directly, apart from having their own bind_iommufd and attach_ioas uAPIs to build the connection between device and iommufd/ioas. And the compat layer for vfio is just for transition purposes. Yi has demonstrated how vfio can follow what other subsystems are expected to do here: https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6 (specifically the commits related to "cover-letter: Adapting vfio-pci to iommufd") Thanks Kevin
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin wrote: > > > From: Jason Wang > > Sent: Thursday, March 24, 2022 10:57 AM > > > > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > > > > > From: Jason Wang > > > > Sent: Thursday, March 24, 2022 10:28 AM > > > > > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin > > wrote: > > > > > > > > > > > From: Jason Gunthorpe > > > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > > > Doesn't this difference need to be accounted in how libvirt > > > > > > > manages > > VM > > > > > > > resource limits? > > > > > > > > > > > > AFACIT, no, but it should be checked. > > > > > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > > > memory limits. > > > > > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > > > { > > > > > > rlim = tsk->signal->rlim + resource; > > > > > > [..] 
> > > > > > if (new_rlim) > > > > > > *rlim = *new_rlim; > > > > > > > > > > > > Which vfio reads back here: > > > > > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > > > > > And iommufd does the same read back: > > > > > > > > > > > > lock_limit = > > > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > > > PAGE_SHIFT; > > > > > > npages = pages->npinned - pages->last_npinned; > > > > > > do { > > > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > > > >locked_vm); > > > > > > new_pages = cur_pages + npages; > > > > > > if (new_pages > lock_limit) > > > > > > return -ENOMEM; > > > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > > > cur_pages, > > > > > >new_pages) != cur_pages); > > > > > > > > > > > > So it does work essentially the same. > > > > > > > > > > > > The difference is more subtle, iouring/etc puts the charge in the > > > > > > user > > > > > > so it is additive with things like iouring and additively spans all > > > > > > the users processes. > > > > > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > > > > > The user visible difference will be that a limit X that worked with > > > > > > VFIO may start to fail after a kernel upgrade as the charge > > > > > > accounting > > > > > > is now cross user and additive with things like iommufd. > > > > > > > > > > > > This whole area is a bit peculiar (eg mlock itself works > > > > > > differently), > > > > > > IMHO, but with most of the places doing pins voting to use > > > > > > user->locked_vm as the charge it seems the right path in today's > > > > > > kernel. 
> > > > > > > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > > > > > > > I plan to fix RDMA like this as well so at least we can have > > > > > > consistency within qemu. > > > > > > > > > > > > > > > > I have an impression that iommufd and vfio type1 must use > > > > > the same accounting scheme given the management stack > > > > > has no insight into qemu on which one is actually used thus > > > > > cannot adapt to the subtle difference in between. in this > > > > > regard either we start fixing vfio type1 to use user->locked_vm > > > > > now or have iommufd follow vfio type1 for upward compatibility > > > > > and then change them together at a later point. > > > > > > > > > > I prefer to the former as IMHO I don't know when will be a later > > > > > point w/o certain kernel changes to actually break the userspace > > > > > policy built on a wrong accounting scheme... > > > > > > > > I wonder if the kernel is the right place to do this. We have new uAPI > > > > > > I didn't get this. This thread is about that VFIO uses a wrong accounting > > > scheme and then the discussion is about the impact of fixing it to the > > > userspace. > > > > It's probably too late to fix the kernel considering it may break the > > userspace. > > > > > I didn't see the question on the right place part. > > > > I meant it would be easier
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Wang > Sent: Thursday, March 24, 2022 10:57 AM > > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > > > From: Jason Wang > > > Sent: Thursday, March 24, 2022 10:28 AM > > > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin > wrote: > > > > > > > > > From: Jason Gunthorpe > > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > > Doesn't this difference need to be accounted in how libvirt manages > VM > > > > > > resource limits? > > > > > > > > > > AFACIT, no, but it should be checked. > > > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > > memory limits. > > > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > > { > > > > > rlim = tsk->signal->rlim + resource; > > > > > [..] 
> > > > > if (new_rlim) > > > > > *rlim = *new_rlim; > > > > > > > > > > Which vfio reads back here: > > > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > > > And iommufd does the same read back: > > > > > > > > > > lock_limit = > > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > > PAGE_SHIFT; > > > > > npages = pages->npinned - pages->last_npinned; > > > > > do { > > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > > >locked_vm); > > > > > new_pages = cur_pages + npages; > > > > > if (new_pages > lock_limit) > > > > > return -ENOMEM; > > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > > cur_pages, > > > > >new_pages) != cur_pages); > > > > > > > > > > So it does work essentially the same. > > > > > > > > > > The difference is more subtle, iouring/etc puts the charge in the user > > > > > so it is additive with things like iouring and additively spans all > > > > > the users processes. > > > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > > > The user visible difference will be that a limit X that worked with > > > > > VFIO may start to fail after a kernel upgrade as the charge accounting > > > > > is now cross user and additive with things like iommufd. > > > > > > > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > > > IMHO, but with most of the places doing pins voting to use > > > > > user->locked_vm as the charge it seems the right path in today's > > > > > kernel. 
> > > > > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > > > > > I plan to fix RDMA like this as well so at least we can have > > > > > consistency within qemu. > > > > > > > > > > > > > I have an impression that iommufd and vfio type1 must use > > > > the same accounting scheme given the management stack > > > > has no insight into qemu on which one is actually used thus > > > > cannot adapt to the subtle difference in between. in this > > > > regard either we start fixing vfio type1 to use user->locked_vm > > > > now or have iommufd follow vfio type1 for upward compatibility > > > > and then change them together at a later point. > > > > > > > > I prefer to the former as IMHO I don't know when will be a later > > > > point w/o certain kernel changes to actually break the userspace > > > > policy built on a wrong accounting scheme... > > > > > > I wonder if the kernel is the right place to do this. We have new uAPI > > > > I didn't get this. This thread is about that VFIO uses a wrong accounting > > scheme and then the discussion is about the impact of fixing it to the > > userspace. > > It's probably too late to fix the kernel considering it may break the > userspace. > > > I didn't see the question on the right place part. > > I meant it would be easier to let userspace know the difference than > trying to fix or workaround in the kernel. Jason already plans to fix RDMA, which will also lead to user-visible impact when Qemu uses both VFIO and RDMA. Why is VFIO so special that it is left behind to carry the legacy misdesign? > > > > > > so management layer can know the difference
RE: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
> From: Jason Gunthorpe > Sent: Thursday, March 24, 2022 2:16 AM > > On Tue, Mar 22, 2022 at 04:15:44PM -0600, Alex Williamson wrote: > > > > +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova, > > > + unsigned long length, struct page **out_pages, bool write) > > > +{ > > > + unsigned long cur_iova = iova; > > > + unsigned long last_iova; > > > + struct iopt_area *area; > > > + int rc; > > > + > > > + if (!length) > > > + return -EINVAL; > > > + if (check_add_overflow(iova, length - 1, &last_iova)) > > > + return -EOVERFLOW; > > > + > > > + down_read(&iopt->iova_rwsem); > > > + for (area = iopt_area_iter_first(iopt, iova, last_iova); area; > > > + area = iopt_area_iter_next(area, iova, last_iova)) { > > > + unsigned long last = min(last_iova, iopt_area_last_iova(area)); > > > + unsigned long last_index; > > > + unsigned long index; > > > + > > > + /* Need contiguous areas in the access */ > > > + if (iopt_area_iova(area) < cur_iova || !area->pages) { > > ^^^ > > Should this be (cur_iova != iova && iopt_area_iova(area) < cur_iova)? > > Oh good eye > > That is a typo < should be >: > > if (iopt_area_iova(area) > cur_iova || !area->pages) { > > There are three boundary conditions here to worry about > - interval trees return any nodes that intersect the queried range >so the first found node can start after iova > > - There can be a gap in the intervals > > - The final area can end short of last_iova > > The last one is botched too and needs this: > for ... { ... > } > + if (cur_iova != last_iova) > + goto out_remove; > > The test suite isn't covering the boundary cases here yet, I added a > FIXME for this for now. > Another nit about below: + /* +* Can't cross areas that are not aligned to the system page +* size with this API. 
+*/ + if (cur_iova % PAGE_SIZE) { + rc = -EINVAL; + goto out_remove; + } Currently it's done after iopt_pages_add_user() but before cur_iova is adjusted, which implies the last add_user() will not be reverted if the check fails here. I suppose this should be checked at the start of the loop. Thanks Kevin
Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows
On Wed, Mar 23, 2022 at 07:55:23PM -0500, Rob Herring wrote: > On Wed, Mar 23, 2022 at 5:15 PM dann frazier > wrote: > > > > On Wed, Mar 23, 2022 at 09:49:04AM +, Marc Zyngier wrote: > > > On Tue, 22 Mar 2022 17:27:36 +, > > > Robin Murphy wrote: > > > > > > > > Originally, creating the dma_ranges resource list in pre-sorted fashion > > > > was the simplest and most efficient way to enforce the order required by > > > > iova_reserve_pci_windows(). However since then at least one PCI host > > > > driver is now re-sorting the list for its own probe-time processing, > > > > which doesn't seem entirely unreasonable, so that basic assumption no > > > > longer holds. Make iommu-dma robust and get the sort order it needs by > > > > explicitly sorting, which means we can also save the effort at creation > > > > time and just build the list in whatever natural order the DT had. > > > > > > > > Signed-off-by: Robin Murphy > > > > --- > > > > > > > > Looking at this area off the back of the XGene thread[1] made me realise > > > > that we need to do it anyway, regardless of whether it might also happen > > > > to restore the previous XGene behaviour or not. Presumably nobody's > > > > tried to use pcie-cadence-host behind an IOMMU yet... > > > > > > This definitely restores PCIe functionality on my Mustang (XGene-1). > > > Hopefully dann can comment on whether this addresses his own issue, as > > > his firmware is significantly different. > > > > Robin, Marc, > > > > Adding just this patch on top of v5.17 (w/ vendor dtb) isn't enough to > > fix m400 networking: > > I wouldn't expect it to given both the IB register selection changed > and the 2nd dma-ranges entry is ignored. > > Can you (and others) try out this branch: > > git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git xgene-pci-fix > > It should maintain the same IB register usage for both cases and > handle the error in 'dma-ranges'. 
Looks good Rob :) https://paste.ubuntu.com/p/zJF9PKhQpS/ -dann
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > From: Jason Wang > > Sent: Thursday, March 24, 2022 10:28 AM > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > > > resource limits? > > > > > > > > AFACIT, no, but it should be checked. > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > memory limits. > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > { > > > > rlim = tsk->signal->rlim + resource; > > > > [..] > > > > if (new_rlim) > > > > *rlim = *new_rlim; > > > > > > > > Which vfio reads back here: > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > And iommufd does the same read back: > > > > > > > > lock_limit = > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > PAGE_SHIFT; > > > > npages = pages->npinned - pages->last_npinned; > > > > do { > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > >locked_vm); > > > > new_pages = cur_pages + npages; > > > > if (new_pages > lock_limit) > > > > return -ENOMEM; > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > cur_pages, > > > >new_pages) != cur_pages); > > > > > > > > So it does work essentially the same. 
> > > > > > > > The difference is more subtle, iouring/etc puts the charge in the user > > > > so it is additive with things like iouring and additively spans all > > > > the users processes. > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > The user visible difference will be that a limit X that worked with > > > > VFIO may start to fail after a kernel upgrade as the charge accounting > > > > is now cross user and additive with things like iommufd. > > > > > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > > IMHO, but with most of the places doing pins voting to use > > > > user->locked_vm as the charge it seems the right path in today's > > > > kernel. > > > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > > > I plan to fix RDMA like this as well so at least we can have > > > > consistency within qemu. > > > > > > > > > > I have an impression that iommufd and vfio type1 must use > > > the same accounting scheme given the management stack > > > has no insight into qemu on which one is actually used thus > > > cannot adapt to the subtle difference in between. in this > > > regard either we start fixing vfio type1 to use user->locked_vm > > > now or have iommufd follow vfio type1 for upward compatibility > > > and then change them together at a later point. > > > > > > I prefer to the former as IMHO I don't know when will be a later > > > point w/o certain kernel changes to actually break the userspace > > > policy built on a wrong accounting scheme... > > > > I wonder if the kernel is the right place to do this. We have new uAPI > > I didn't get this. 
This thread is about that VFIO uses a wrong accounting > scheme and then the discussion is about the impact of fixing it to the > userspace. It's probably too late to fix the kernel considering it may break the userspace. > I didn't see the question on the right place part. I meant it would be easier to let userspace know the difference than trying to fix or workaround in the kernel. > > > so management layer can know the difference of the accounting in > > advance by > > > > -device vfio-pci,iommufd=on > > > > I suppose iommufd will be used once Qemu supports it, as long as > the compatibility opens that Jason/Alex discussed in another thread > are well addressed. It is not necessarily to be a control knob exposed > to the caller. It has a lot of implications if we do this, it means iommufd needs to inherit all the userspace noticeable behaviour as well as the "bugs" of VFIO. We know it's e
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Wang > Sent: Thursday, March 24, 2022 10:28 AM > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > > > From: Jason Gunthorpe > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > I'm still picking my way through the series, but the later compat > > > > interface doesn't mention this difference as an outstanding issue. > > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > > resource limits? > > > > > > AFACIT, no, but it should be checked. > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > memory limits. > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > { > > > rlim = tsk->signal->rlim + resource; > > > [..] > > > if (new_rlim) > > > *rlim = *new_rlim; > > > > > > Which vfio reads back here: > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > And iommufd does the same read back: > > > > > > lock_limit = > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > PAGE_SHIFT; > > > npages = pages->npinned - pages->last_npinned; > > > do { > > > cur_pages = atomic_long_read(&pages->source_user- > > > >locked_vm); > > > new_pages = cur_pages + npages; > > > if (new_pages > lock_limit) > > > return -ENOMEM; > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > cur_pages, > > >new_pages) != cur_pages); > > > > > > So it does work essentially the same. > > > > > > The difference is more subtle, iouring/etc puts the charge in the user > > > so it is additive with things like iouring and additively spans all > > > the users processes. 
> > > > > > However vfio is accounting only per-process and only for itself - no > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > The user visible difference will be that a limit X that worked with > > > VFIO may start to fail after a kernel upgrade as the charge accounting > > > is now cross user and additive with things like iommufd. > > > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > IMHO, but with most of the places doing pins voting to use > > > user->locked_vm as the charge it seems the right path in today's > > > kernel. > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > I plan to fix RDMA like this as well so at least we can have > > > consistency within qemu. > > > > > > > I have an impression that iommufd and vfio type1 must use > > the same accounting scheme given the management stack > > has no insight into qemu on which one is actually used thus > > cannot adapt to the subtle difference in between. in this > > regard either we start fixing vfio type1 to use user->locked_vm > > now or have iommufd follow vfio type1 for upward compatibility > > and then change them together at a later point. > > > > I prefer to the former as IMHO I don't know when will be a later > > point w/o certain kernel changes to actually break the userspace > > policy built on a wrong accounting scheme... > > I wonder if the kernel is the right place to do this. We have new uAPI I didn't get this. This thread is about that VFIO uses a wrong accounting scheme and then the discussion is about the impact of fixing it to the userspace. I didn't see the question on the right place part. 
> so management layer can know the difference of the accounting in > advance by > > -device vfio-pci,iommufd=on > I suppose iommufd will be used once Qemu supports it, as long as the open compatibility issues that Jason/Alex discussed in another thread are well addressed. It does not necessarily need to be a control knob exposed to the caller. Thanks Kevin
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > From: Jason Gunthorpe > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > I'm still picking my way through the series, but the later compat > > > interface doesn't mention this difference as an outstanding issue. > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > resource limits? > > > > AFACIT, no, but it should be checked. > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > memory limits. > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > { > > rlim = tsk->signal->rlim + resource; > > [..] > > if (new_rlim) > > *rlim = *new_rlim; > > > > Which vfio reads back here: > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > And iommufd does the same read back: > > > > lock_limit = > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > PAGE_SHIFT; > > npages = pages->npinned - pages->last_npinned; > > do { > > cur_pages = atomic_long_read(&pages->source_user- > > >locked_vm); > > new_pages = cur_pages + npages; > > if (new_pages > lock_limit) > > return -ENOMEM; > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > cur_pages, > >new_pages) != cur_pages); > > > > So it does work essentially the same. > > > > The difference is more subtle, iouring/etc puts the charge in the user > > so it is additive with things like iouring and additively spans all > > the users processes. > > > > However vfio is accounting only per-process and only for itself - no > > other subsystem uses locked as the charge variable for DMA pins. 
> > > > The user visible difference will be that a limit X that worked with > > VFIO may start to fail after a kernel upgrade as the charge accounting > > is now cross user and additive with things like iommufd. > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > IMHO, but with most of the places doing pins voting to use > > user->locked_vm as the charge it seems the right path in today's > > kernel. > > > > Ceratinly having qemu concurrently using three different subsystems > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > I plan to fix RDMA like this as well so at least we can have > > consistency within qemu. > > > > I have an impression that iommufd and vfio type1 must use > the same accounting scheme given the management stack > has no insight into qemu on which one is actually used thus > cannot adapt to the subtle difference in between. in this > regard either we start fixing vfio type1 to use user->locked_vm > now or have iommufd follow vfio type1 for upward compatibility > and then change them together at a later point. > > I prefer to the former as IMHO I don't know when will be a later > point w/o certain kernel changes to actually break the userspace > policy built on a wrong accounting scheme... I wonder if the kernel is the right place to do this. We have new uAPI so management layer can know the difference of the accounting in advance by -device vfio-pci,iommufd=on ? > > Thanks > Kevin > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Gunthorpe > Sent: Wednesday, March 23, 2022 12:15 AM > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > I'm still picking my way through the series, but the later compat > > interface doesn't mention this difference as an outstanding issue. > > Doesn't this difference need to be accounted in how libvirt manages VM > > resource limits? > > AFACIT, no, but it should be checked. > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > memory limits. > > Yes, and ulimit does work fully. prlimit adjusts the value: > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > struct rlimit *new_rlim, struct rlimit *old_rlim) > { > rlim = tsk->signal->rlim + resource; > [..] > if (new_rlim) > *rlim = *new_rlim; > > Which vfio reads back here: > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > And iommufd does the same read back: > > lock_limit = > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > PAGE_SHIFT; > npages = pages->npinned - pages->last_npinned; > do { > cur_pages = atomic_long_read(&pages->source_user- > >locked_vm); > new_pages = cur_pages + npages; > if (new_pages > lock_limit) > return -ENOMEM; > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > cur_pages, >new_pages) != cur_pages); > > So it does work essentially the same. > > The difference is more subtle, iouring/etc puts the charge in the user > so it is additive with things like iouring and additively spans all > the users processes. > > However vfio is accounting only per-process and only for itself - no > other subsystem uses locked as the charge variable for DMA pins. > > The user visible difference will be that a limit X that worked with > VFIO may start to fail after a kernel upgrade as the charge accounting > is now cross user and additive with things like iommufd. 
> > This whole area is a bit peculiar (eg mlock itself works differently), > IMHO, but with most of the places doing pins voting to use > user->locked_vm as the charge it seems the right path in today's > kernel. > > Certainly having qemu concurrently using three different subsystems > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > RLIMIT_MEMLOCK differently cannot be sane or correct. > > I plan to fix RDMA like this as well so at least we can have > consistency within qemu. > I have an impression that iommufd and vfio type1 must use the same accounting scheme, given the management stack has no insight into qemu on which one is actually used and thus cannot adapt to the subtle difference between them. In this regard, either we start fixing vfio type1 to use user->locked_vm now, or have iommufd follow vfio type1 for upward compatibility and then change them together at a later point. I prefer the former, as IMHO I don't know when that later point would be w/o certain kernel changes actually breaking the userspace policy built on a wrong accounting scheme... Thanks Kevin
Re: [PATCH v2 2/2] PCI: hv: Propagate coherence from VMbus device to PCI device
On Wed, Mar 23, 2022 at 01:31:12PM -0700, Michael Kelley wrote: > PCI pass-thru devices in a Hyper-V VM are represented as a VMBus > device and as a PCI device. The coherence of the VMbus device is > set based on the VMbus node in ACPI, but the PCI device has no > ACPI node and defaults to not hardware coherent. This results > in extra software coherence management overhead on ARM64 when > devices are hardware coherent. > > Fix this by setting up the PCI host bus so that normal > PCI mechanisms will propagate the coherence of the VMbus > device to the PCI device. There's no effect on x86/x64 where > devices are always hardware coherent. > > Signed-off-by: Michael Kelley Acked-by: Boqun Feng Regards, Boqun > --- > drivers/pci/controller/pci-hyperv.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/drivers/pci/controller/pci-hyperv.c > b/drivers/pci/controller/pci-hyperv.c > index ae0bc2f..88b3b56 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -3404,6 +3404,15 @@ static int hv_pci_probe(struct hv_device *hdev, > hbus->bridge->domain_nr = dom; > #ifdef CONFIG_X86 > hbus->sysdata.domain = dom; > +#elif defined(CONFIG_ARM64) > + /* > + * Set the PCI bus parent to be the corresponding VMbus > + * device. Then the VMbus device will be assigned as the > + * ACPI companion in pcibios_root_bridge_prepare() and > + * pci_dma_configure() will propagate device coherence > + * information to devices created on the bus. > + */ > + hbus->sysdata.parent = hdev->device.parent; > #endif > > hbus->hdev = hdev; > -- > 1.8.3.1 > > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows
On Tue, Mar 22, 2022 at 12:27 PM Robin Murphy wrote: > > Originally, creating the dma_ranges resource list in pre-sorted fashion > was the simplest and most efficient way to enforce the order required by > iova_reserve_pci_windows(). However since then at least one PCI host > driver is now re-sorting the list for its own probe-time processing, > which doesn't seem entirely unreasonable, so that basic assumption no > longer holds. Make iommu-dma robust and get the sort order it needs by > explicitly sorting, which means we can also save the effort at creation > time and just build the list in whatever natural order the DT had. > > Signed-off-by: Robin Murphy > --- > > Looking at this area off the back of the XGene thread[1] made me realise > that we need to do it anyway, regardless of whether it might also happen > to restore the previous XGene behaviour or not. Presumably nobody's > tried to use pcie-cadence-host behind an IOMMU yet... > > Boot-tested on Juno to make sure I hadn't got the sort comparison > backwards. > > Robin. 
> > [1] https://lore.kernel.org/linux-pci/20220321104843.949645-1-...@kernel.org/ > > drivers/iommu/dma-iommu.c | 13 - > drivers/pci/of.c | 7 +-- > 2 files changed, 13 insertions(+), 7 deletions(-) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index b22034975301..91d134c0c9b1 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -20,6 +20,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -414,6 +415,15 @@ static int cookie_init_hw_msi_region(struct > iommu_dma_cookie *cookie, > return 0; > } > > +static int iommu_dma_ranges_sort(void *priv, const struct list_head *a, > + const struct list_head *b) > +{ > + struct resource_entry *res_a = list_entry(a, typeof(*res_a), node); > + struct resource_entry *res_b = list_entry(b, typeof(*res_b), node); > + > + return res_a->res->start > res_b->res->start; > +} > + > static int iova_reserve_pci_windows(struct pci_dev *dev, > struct iova_domain *iovad) > { > @@ -432,6 +442,7 @@ static int iova_reserve_pci_windows(struct pci_dev *dev, > } > > /* Get reserved DMA windows from host bridge */ > + list_sort(NULL, &bridge->dma_ranges, iommu_dma_ranges_sort); > resource_list_for_each_entry(window, &bridge->dma_ranges) { > end = window->res->start - window->offset; > resv_iova: > @@ -440,7 +451,7 @@ static int iova_reserve_pci_windows(struct pci_dev *dev, > hi = iova_pfn(iovad, end); > reserve_iova(iovad, lo, hi); > } else if (end < start) { > - /* dma_ranges list should be sorted */ > + /* DMA ranges should be non-overlapping */ > dev_err(&dev->dev, > "Failed to reserve IOVA [%pa-%pa]\n", > &start, &end); > diff --git a/drivers/pci/of.c b/drivers/pci/of.c > index cb2e8351c2cc..d176b4bc6193 100644 > --- a/drivers/pci/of.c > +++ b/drivers/pci/of.c > @@ -393,12 +393,7 @@ static int devm_of_pci_get_host_bridge_resources(struct > device *dev, > goto failed; > } > > - /* Keep the resource list sorted */ > - resource_list_for_each_entry(entry, 
ib_resources)
> - if (entry->res->start > res->start)
> - break;
> -
> - pci_add_resource_offset(&entry->node, res,

entry is now unused and causes a warning.

> + pci_add_resource_offset(ib_resources, res,
> res->start - range.pci_addr);
> }
>
> --
> 2.28.0.dirty
>
Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows
On Wed, Mar 23, 2022 at 5:15 PM dann frazier wrote: > > On Wed, Mar 23, 2022 at 09:49:04AM +, Marc Zyngier wrote: > > On Tue, 22 Mar 2022 17:27:36 +, > > Robin Murphy wrote: > > > > > > Originally, creating the dma_ranges resource list in pre-sorted fashion > > > was the simplest and most efficient way to enforce the order required by > > > iova_reserve_pci_windows(). However since then at least one PCI host > > > driver is now re-sorting the list for its own probe-time processing, > > > which doesn't seem entirely unreasonable, so that basic assumption no > > > longer holds. Make iommu-dma robust and get the sort order it needs by > > > explicitly sorting, which means we can also save the effort at creation > > > time and just build the list in whatever natural order the DT had. > > > > > > Signed-off-by: Robin Murphy > > > --- > > > > > > Looking at this area off the back of the XGene thread[1] made me realise > > > that we need to do it anyway, regardless of whether it might also happen > > > to restore the previous XGene behaviour or not. Presumably nobody's > > > tried to use pcie-cadence-host behind an IOMMU yet... > > > > This definitely restores PCIe functionality on my Mustang (XGene-1). > > Hopefully dann can comment on whether this addresses his own issue, as > > his firmware is significantly different. > > Robin, Marc, > > Adding just this patch on top of v5.17 (w/ vendor dtb) isn't enough to > fix m400 networking: I wouldn't expect it to given both the IB register selection changed and the 2nd dma-ranges entry is ignored. Can you (and others) try out this branch: git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git xgene-pci-fix It should maintain the same IB register usage for both cases and handle the error in 'dma-ranges'. Rob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote: > My overall question here would be whether we can actually achieve a > compatibility interface that has sufficient feature transparency that we > can dump vfio code in favor of this interface, or will there be enough > niche use cases that we need to keep type1 and vfio containers around > through a deprecation process? Other than SPAPR, I think we can. > The locked memory differences for one seem like something that > libvirt wouldn't want hidden I'm first interested in understanding how this change becomes a real problem in practice that requires libvirt to do something different for vfio or iommufd. We can discuss it in the other thread. If this is the make-or-break point then I think we can deal with it either by going back to what vfio does now or perhaps some other friendly compat approach. > and we have questions regarding support for vaddr hijacking I'm not sure what vaddr hijacking is? Do you mean VFIO_DMA_MAP_FLAG_VADDR? There is a comment that outlines my plan to implement it in a functionally compatible way without the deadlock problem. I estimate this as a small project. > and different ideas how to implement dirty page tracking, I don't think this is a compatibility issue. No kernel today triggers qemu to use this feature as no kernel supports live migration. No existing qemu will trigger this feature with new kernels that support live migration v2. Therefore we can adjust qemu's dirty tracking at the same time we enable migration v2 in qemu. With Joao's work we are close to having a solid RFC to come with something that can be fully implemented. Hopefully we can agree on this soon enough that qemu can come with a full package of migration v2 support including the dirty tracking solution. > not to mention the missing features that are currently well used, > like p2p mappings, coherency tracking, mdev, etc. I consider these all mandatory things, they won't be left out. 
The reason they are not in the RFC is mostly because supporting them requires work outside just this iommufd area, and I'd like this series to remain self-contained. I've already got a draft to add DMABUF support to VFIO PCI which nicely solves the follow_pfn security problem; we want to do this for another reason already. I'm waiting for some testing feedback before posting it. I need some help from Daniel to make the DMABUF revoke semantics he and I have been talking about work. In the worst case we can copy the follow_pfn approach. Intel no-snoop is simple enough, it just needs some Intel cleanup parts. mdev will come along with the final VFIO integration; all the really hard parts are done already. The VFIO integration is a medium sized task overall. So, I'm not ready to give up yet :) > Where do we focus attention? Is symlinking device files our proposal > to userspace and is that something achievable, or do we want to use > this compatibility interface as a means to test the interface and > allow userspace to make use of it for transition, if their use cases > allow it, perhaps eventually performing the symlink after deprecation > and eventual removal of the vfio container and type1 code? Thanks, Symlinking device files is definitely just a suggested way to expedite testing. Things like qemu that are learning to use iommufd-only features should learn to directly open iommufd instead of the vfio container to activate those features. Looking long down the road I don't think we want to have type1 and iommufd code forever. So, I would like to make an option to compile out vfio container support entirely and have that option arrange for iommufd to provide the container device node itself. I think we can get there pretty quickly, or at least I haven't got anything that is scaring me a lot (beyond SPAPR, of course). For the dpdk/etc of the world I think we are already there. 
Jason
Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:
> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it support this in order to
> support the in-kernel DMA API.

That gives me a nice simple idea:

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 3c6b95ad026829..8366884df4a030 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include

 #include "iommufd_private.h"

@@ -61,6 +62,10 @@ struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
 	struct iommu_group *group;
 	int rc;

+	/* iommufd always uses IOMMU_CACHE */
+	if (!dev_is_dma_coherent(&pdev->dev))
+		return ERR_PTR(-EINVAL);
+
 	ictx = iommufd_fget(fd);
 	if (!ictx)
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84bbc..3d6df1ffbf93e6 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
 	 * We provide no manual cache coherency ioctls to userspace and most
 	 * architectures make the CPU ops for cache flushing privileged.
 	 * Therefore we require the underlying IOMMU to support CPU coherent
-	 * operation.
+	 * operation. Support for IOMMU_CACHE is enforced by the
+	 * dev_is_dma_coherent() test during bind.
 	 */
 	iommu_prot = IOMMU_CACHE;
 	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)

Looking at it I would say all the places that test IOMMU_CAP_CACHE_COHERENCY can be replaced with dev_is_dma_coherent() except for the one call in VFIO that is supporting the Intel no-snoop behavior. 
Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of abusing IOMMU_CACHE. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, 18 Mar 2022 14:27:36 -0300 Jason Gunthorpe wrote: > iommufd can directly implement the /dev/vfio/vfio container IOCTLs by > mapping them into io_pagetable operations. Doing so allows the use of > iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to > SET_CONTAINER using an iommufd instead of a container fd is a followup > series. > > Internally the compatibility API uses a normal IOAS object that, like > vfio, is automatically allocated when the first device is > attached. > > Userspace can also query or set this IOAS object directly using the > IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only > features while still using the VFIO style map/unmap ioctls. > > While this is enough to operate qemu, it is still a bit of a WIP with a > few gaps to be resolved: > > - Only the TYPE1v2 mode is supported where unmap cannot punch holes or > split areas. The old mode can be implemented with a new operation to > split an iopt_area into two without disturbing the iopt_pages or the > domains, then unmapping a whole area as normal. > > - Resource limits rely on memory cgroups to bound what userspace can do > instead of the module parameter dma_entry_limit. > > - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will > require some additional work to properly expose PFN lifecycle between > VFIO and iommufd > > - Various components of the mdev API are not completed yet > > - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not > implemented. > > - The 'dirty tracking' is not implemented > > - A full audit for pedantic compatibility details (eg errnos, etc) has > not yet been done > > - powerpc SPAPR is left out, as it is not connected to the iommu_domain > framework. My hope is that SPAPR will be moved into the iommu_domain > framework as a special HW specific type and I would expect power to > support the generic interface through a normal iommu_domain. 
My overall question here would be whether we can actually achieve a compatibility interface that has sufficient feature transparency that we can dump vfio code in favor of this interface, or will there be enough niche use cases that we need to keep type1 and vfio containers around through a deprecation process? The locked memory differences for one seem like something that libvirt wouldn't want hidden and we have questions regarding support for vaddr hijacking and different ideas how to implement dirty page tracking, not to mention the missing features that are currently well used, like p2p mappings, coherency tracking, mdev, etc. It seems like quite an endeavor to fill all these gaps, while at the same time QEMU will be working to move to use iommufd directly in order to gain all the new features. Where do we focus attention? Is symlinking device files our proposal to userspace and is that something achievable, or do we want to use this compatibility interface as a means to test the interface and allow userspace to make use of it for transition, if their use cases allow it, perhaps eventually performing the symlink after deprecation and eventual removal of the vfio container and type1 code? Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows
On Wed, Mar 23, 2022 at 09:49:04AM +, Marc Zyngier wrote: > On Tue, 22 Mar 2022 17:27:36 +, > Robin Murphy wrote: > > > > Originally, creating the dma_ranges resource list in pre-sorted fashion > > was the simplest and most efficient way to enforce the order required by > > iova_reserve_pci_windows(). However since then at least one PCI host > > driver is now re-sorting the list for its own probe-time processing, > > which doesn't seem entirely unreasonable, so that basic assumption no > > longer holds. Make iommu-dma robust and get the sort order it needs by > > explicitly sorting, which means we can also save the effort at creation > > time and just build the list in whatever natural order the DT had. > > > > Signed-off-by: Robin Murphy > > --- > > > > Looking at this area off the back of the XGene thread[1] made me realise > > that we need to do it anyway, regardless of whether it might also happen > > to restore the previous XGene behaviour or not. Presumably nobody's > > tried to use pcie-cadence-host behind an IOMMU yet... > > This definitely restores PCIe functionality on my Mustang (XGene-1). > Hopefully dann can comment on whether this addresses his own issue, as > his firmware is significantly different. Robin, Marc, Adding just this patch on top of v5.17 (w/ vendor dtb) isn't enough to fix m400 networking: https://paste.ubuntu.com/p/H5ZNbRvP8V/ -dann ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP
On 2022-03-23 19:16, Linus Torvalds wrote:
On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy wrote:
On 2022-03-23 17:27, Linus Torvalds wrote:
I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need. Which that "info leak" patch obviously breaks entirely.
Except that's the exact case which the new patch is addressing
Not "addressing". Breaking. Which is why it will almost certainly get reverted. Not doing DMA to the whole area seems to be quite the sane thing to do for things like network packets, and overwriting the part that didn't get DMA'd with zeroes seems to be exactly the wrong thing here. So the SG_IO - and other random untrusted block command sources - data leak will almost certainly have to be addressed differently. Possibly by simply allocating the area with GFP_ZERO to begin with.
Er, the point of the block layer case is that whole area *is* zeroed to begin with, and a latent memory corruption problem in SWIOTLB itself replaces those zeros with random other kernel data unexpectedly. Let me try illustrating some sequences for clarity...

Expected behaviour/without SWIOTLB:

                              Memory
                              ------
start                         12345678
dma_map(DMA_FROM_DEVICE)      no-op
device writes partial data    12ABC678    <- ABC
dma_unmap(DMA_FROM_DEVICE)    12ABC678

SWIOTLB previously:

                              Memory      Bounce buffer
                              ------      -------------
start                         12345678
dma_map(DMA_FROM_DEVICE)      no-op
device writes partial data    12345678    xxABCxxx    <- ABC
dma_unmap(DMA_FROM_DEVICE)    xxABCxxx <- xxABCxxx

SWIOTLB now:

                              Memory      Bounce buffer
                              ------      -------------
start                         12345678
dma_map(DMA_FROM_DEVICE)      12345678 -> 12345678
device writes partial data    12345678    12ABC678    <- ABC
dma_unmap(DMA_FROM_DEVICE)    12ABC678 <- 12ABC678

Now, sure we can prevent any actual information leakage by initialising the bounce buffer slot with zeros, but then we're just corrupting the not-written-to parts of the mapping with zeros instead of anyone else's old data. 
That's still fundamentally not OK. The only thing SWIOTLB can do to be correct is treat DMA_FROM_DEVICE as a read-modify-write of the entire mapping, because it has no way to know how much of it is actually going to be modified. I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least). I don't think it's wrong per se, but as I said I do think it can bite anyone who's been doing dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk. Thanks, Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote: > On Wed, 23 Mar 2022 16:34:39 -0300 > Jason Gunthorpe wrote: > > > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote: > > > On Fri, 18 Mar 2022 14:27:33 -0300 > > > Jason Gunthorpe wrote: > > > > > > > +static int conv_iommu_prot(u32 map_flags) > > > > +{ > > > > + int iommu_prot; > > > > + > > > > + /* > > > > +* We provide no manual cache coherency ioctls to userspace and > > > > most > > > > +* architectures make the CPU ops for cache flushing privileged. > > > > +* Therefore we require the underlying IOMMU to support CPU > > > > coherent > > > > +* operation. > > > > +*/ > > > > + iommu_prot = IOMMU_CACHE; > > > > > > Where is this requirement enforced? AIUI we'd need to test > > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like > > > intel_iommu_map() simply drop the flag when not supported by HW. > > > > You are right, the correct thing to do is to fail device > > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there, > > however we can't do that because Intel abuses the meaning of > > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is > > supported. > > > > I want Intel to split out their special no-snoop from IOMMU_CACHE and > > IOMMU_CAP_CACHE_COHERENCY so these things have a consisent meaning in > > all iommu drivers. Once this is done vfio and iommufd should both > > always set IOMMU_CACHE and refuse to work without > > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE > > arch that does in fact work today with vfio, somehow, but I don't..) > > IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see > lack of snoop-control support, causing us to have mixed coherent and > non-coherent domains. I don't recall if you go back far enough in VT-d > history if the primary IOMMU might have lacked this support. So I > think there are systems we care about with IOMMUs that can't enforce > DMA coherency. 
> > As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all > mappings make use of IOMMU_CACHE, then all DMA is coherent. Are you > suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings > are coherent regardless of mapping protection flags? What's the point > of IOMMU_CACHE at that point? IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's change. It only means normal DMAs issued in a normal way are coherent with the CPU and do not require special cache flushing instructions, i.e. DMA issued by a kernel driver using the DMA API. It does not mean that non-coherent DMA does not exist, or that platform or device specific ways to trigger non-coherence are blocked. Stated another way, any platform that wires dev_is_dma_coherent() to true, like all x86 does, must support IOMMU_CACHE and report IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform supports. The platform obviously declares it supports this in order to support the in-kernel DMA API. Thus, a new cap indicating that 'all DMA is coherent' or 'no-snoop blocking' should be created to cover Intel's special need. From what I know it is only implemented in the Intel driver and apparently only for some IOMMUs connected to IGD. > > Yes, it was missed in the notes for vfio compat that Intel no-snoop is > not working currently, I fixed it. > > Right, I see it in the comments relative to extensions, but missed in > the commit log. Thanks, Oh good, I remembered it was someplace. Jason
[PATCH v2 2/2] PCI: hv: Propagate coherence from VMbus device to PCI device
PCI pass-thru devices in a Hyper-V VM are represented as a VMBus device and as a PCI device. The coherence of the VMbus device is set based on the VMbus node in ACPI, but the PCI device has no ACPI node and defaults to not hardware coherent. This results in extra software coherence management overhead on ARM64 when devices are hardware coherent. Fix this by setting up the PCI host bus so that normal PCI mechanisms will propagate the coherence of the VMbus device to the PCI device. There's no effect on x86/x64 where devices are always hardware coherent. Signed-off-by: Michael Kelley --- drivers/pci/controller/pci-hyperv.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index ae0bc2f..88b3b56 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -3404,6 +3404,15 @@ static int hv_pci_probe(struct hv_device *hdev, hbus->bridge->domain_nr = dom; #ifdef CONFIG_X86 hbus->sysdata.domain = dom; +#elif defined(CONFIG_ARM64) + /* +* Set the PCI bus parent to be the corresponding VMbus +* device. Then the VMbus device will be assigned as the +* ACPI companion in pcibios_root_bridge_prepare() and +* pci_dma_configure() will propagate device coherence +* information to devices created on the bus. +*/ + hbus->sysdata.parent = hdev->device.parent; #endif hbus->hdev = hdev; -- 1.8.3.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 1/2] Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device
VMbus synthetic devices are not represented in the ACPI DSDT -- only the top level VMbus device is represented. As a result, on ARM64 coherence information in the _CCA method is not specified for synthetic devices, so they default to not hardware coherent. Drivers for some of these synthetic devices have been recently updated to use the standard DMA APIs, and they are incurring extra overhead of unneeded software coherence management. Fix this by propagating coherence information from the VMbus node in ACPI to the individual synthetic devices. There's no effect on x86/x64 where devices are always hardware coherent. Signed-off-by: Michael Kelley --- drivers/hv/hv_common.c | 11 +++ drivers/hv/vmbus_drv.c | 23 +++ include/asm-generic/mshyperv.h | 1 + 3 files changed, 35 insertions(+) diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c index 181d16b..820e814 100644 --- a/drivers/hv/hv_common.c +++ b/drivers/hv/hv_common.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -216,6 +217,16 @@ bool hv_query_ext_cap(u64 cap_query) } EXPORT_SYMBOL_GPL(hv_query_ext_cap); +void hv_setup_dma_ops(struct device *dev, bool coherent) +{ + /* +* Hyper-V does not offer a vIOMMU in the guest +* VM, so pass 0/NULL for the IOMMU settings +*/ + arch_setup_dma_ops(dev, 0, 0, NULL, coherent); +} +EXPORT_SYMBOL_GPL(hv_setup_dma_ops); + bool hv_is_hibernation_supported(void) { return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4); diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 12a2b37..2d2c54c 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -905,6 +905,14 @@ static int vmbus_probe(struct device *child_device) struct hv_device *dev = device_to_hv_device(child_device); const struct hv_vmbus_device_id *dev_id; + /* +* On ARM64, propagate the DMA coherence setting from the top level +* VMbus ACPI device to the child VMbus device being added here. 
+* On x86/x64 coherence is assumed and these calls have no effect. +*/ + hv_setup_dma_ops(child_device, + device_get_dma_attr(&hv_acpi_dev->dev) == DEV_DMA_COHERENT); + dev_id = hv_vmbus_get_id(drv, dev); if (drv->probe) { ret = drv->probe(dev, dev_id); @@ -2428,6 +2436,21 @@ static int vmbus_acpi_add(struct acpi_device *device) hv_acpi_dev = device; + /* +* Older versions of Hyper-V for ARM64 fail to include the _CCA +* method on the top level VMbus device in the DSDT. But devices +* are hardware coherent in all current Hyper-V use cases, so fix +* up the ACPI device to behave as if _CCA is present and indicates +* hardware coherence. +*/ + ACPI_COMPANION_SET(&device->dev, device); + if (IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED) && + device_get_dma_attr(&device->dev) == DEV_DMA_NOT_SUPPORTED) { + pr_info("No ACPI _CCA found; assuming coherent device I/O\n"); + device->flags.cca_seen = true; + device->flags.coherent_dma = true; + } + result = acpi_walk_resources(device->handle, METHOD_NAME__CRS, vmbus_walk_resources, NULL); diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index c08758b..c05d2ce 100644 --- a/include/asm-generic/mshyperv.h +++ b/include/asm-generic/mshyperv.h @@ -269,6 +269,7 @@ static inline int cpumask_to_vpset_noself(struct hv_vpset *vpset, u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size); void hyperv_cleanup(void); bool hv_query_ext_cap(u64 cap_query); +void hv_setup_dma_ops(struct device *dev, bool coherent); void *hv_map_memory(void *addr, unsigned long size); void hv_unmap_memory(void *addr); #else /* CONFIG_HYPERV */ -- 1.8.3.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 0/2] Fix coherence for VMbus and PCI pass-thru devices in Hyper-V VM
Hyper-V VMs have VMbus synthetic devices and PCI pass-thru devices that are added dynamically via the VMbus protocol and are not represented in the ACPI DSDT. Only the top level VMbus node exists in the DSDT. As such, on ARM64 these devices don't pick up coherence information and default to not hardware coherent. This results in extra software coherence management overhead, since the synthetic devices are always hardware coherent. PCI pass-thru devices are also hardware coherent in all current usage scenarios. Fix this by propagating coherence information from the top level VMbus node in the DSDT to all VMbus synthetic devices and PCI pass-thru devices. While finer-grained control would be better, basing the decision on the VMbus node in the DSDT gives us an escape path if a future scenario arises with devices that are not hardware coherent.

Robin Murphy -- I'm not ignoring your feedback about pci_dma_configure(), but I wanted to try this alternate approach where pci_dma_configure() works as is. If reviewers prefer modifying pci_dma_configure() to handle the Hyper-V specifics, I can go back to that.

Changes since v1:
* Use device_get_dma_attr() instead of acpi_get_dma_attr(), eliminating the need to export acpi_get_dma_attr() [Robin Murphy]
* Use arch_setup_dma_ops() to set device coherence [Robin Murphy]
* Move handling of missing _CCA to vmbus_acpi_add() so it is only done once
* Rework handling of PCI devices so existing code in pci_dma_configure() just works

Michael Kelley (2):
  Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device
  PCI: hv: Propagate coherence from VMbus device to PCI device

 drivers/hv/hv_common.c              | 11 +++
 drivers/hv/vmbus_drv.c              | 23 +++
 drivers/pci/controller/pci-hyperv.c |  9 +
 include/asm-generic/mshyperv.h      |  1 +
 4 files changed, 44 insertions(+)

--
1.8.3.1
Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
On Wed, 23 Mar 2022 16:34:39 -0300 Jason Gunthorpe wrote: > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote: > > On Fri, 18 Mar 2022 14:27:33 -0300 > > Jason Gunthorpe wrote: > > > > > +static int conv_iommu_prot(u32 map_flags) > > > +{ > > > + int iommu_prot; > > > + > > > + /* > > > + * We provide no manual cache coherency ioctls to userspace and most > > > + * architectures make the CPU ops for cache flushing privileged. > > > + * Therefore we require the underlying IOMMU to support CPU coherent > > > + * operation. > > > + */ > > > + iommu_prot = IOMMU_CACHE; > > > > Where is this requirement enforced? AIUI we'd need to test > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like > > intel_iommu_map() simply drop the flag when not supported by HW. > > You are right, the correct thing to do is to fail device > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there, > however we can't do that because Intel abuses the meaning of > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is > supported. > > I want Intel to split out their special no-snoop from IOMMU_CACHE and > IOMMU_CAP_CACHE_COHERENCY so these things have a consisent meaning in > all iommu drivers. Once this is done vfio and iommufd should both > always set IOMMU_CACHE and refuse to work without > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE > arch that does in fact work today with vfio, somehow, but I don't..) IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see lack of snoop-control support, causing us to have mixed coherent and non-coherent domains. I don't recall if you go back far enough in VT-d history if the primary IOMMU might have lacked this support. So I think there are systems we care about with IOMMUs that can't enforce DMA coherency. As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all mappings make use of IOMMU_CACHE, then all DMA is coherent. 
Are you suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings are coherent regardless of mapping protection flags? What's the point of IOMMU_CACHE at that point? > I added a fixme about this. > > > This also seems like an issue relative to vfio compatibility that I > > don't see mentioned in that patch. Thanks, > > Yes, it was missed in the notes for vfio compat that Intel no-snoop is > not working currently, I fixed it. Right, I see it in the comments relative to extensions, but missed in the commit log. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote: > On Fri, 18 Mar 2022 14:27:33 -0300 > Jason Gunthorpe wrote: > > > +static int conv_iommu_prot(u32 map_flags) > > +{ > > + int iommu_prot; > > + > > + /* > > +* We provide no manual cache coherency ioctls to userspace and most > > +* architectures make the CPU ops for cache flushing privileged. > > +* Therefore we require the underlying IOMMU to support CPU coherent > > +* operation. > > +*/ > > + iommu_prot = IOMMU_CACHE; > > Where is this requirement enforced? AIUI we'd need to test > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like > intel_iommu_map() simply drop the flag when not supported by HW. You are right, the correct thing to do is to fail device binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there, however we can't do that because Intel abuses the meaning of IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is supported. I want Intel to split out their special no-snoop from IOMMU_CACHE and IOMMU_CAP_CACHE_COHERENCY so these things have a consisent meaning in all iommu drivers. Once this is done vfio and iommufd should both always set IOMMU_CACHE and refuse to work without IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE arch that does in fact work today with vfio, somehow, but I don't..) I added a fixme about this. > This also seems like an issue relative to vfio compatibility that I > don't see mentioned in that patch. Thanks, Yes, it was missed in the notes for vfio compat that Intel no-snoop is not working currently, I fixed it. Thanks, Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
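[Editorial note: the attach-time enforcement described above -- failing device binding/attach outright when IOMMU_CAP_CACHE_COHERENCY is absent, rather than silently dropping IOMMU_CACHE at map time -- can be sketched as a small userspace model. All names here (model_iommu, model_attach) are invented for illustration and are not the real iommufd API:]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative flags modelled on the kernel's IOMMU_* prot bits. */
#define IOMMU_READ  (1 << 0)
#define IOMMU_WRITE (1 << 1)
#define IOMMU_CACHE (1 << 2)

/* Stand-in for the capability an IOMMU driver would report. */
struct model_iommu {
	bool cap_cache_coherency;
};

/*
 * Sketch of the enforcement being discussed: refuse the attach
 * entirely when the IOMMU cannot honor IOMMU_CACHE, instead of
 * silently dropping the flag the way intel_iommu_map() does today.
 */
static int model_attach(const struct model_iommu *iommu)
{
	if (!iommu->cap_cache_coherency)
		return -EINVAL;	/* fail binding/attach entirely */
	return 0;		/* all mappings may then use IOMMU_CACHE */
}
```

The point of the model is only the control flow: the capability check happens once, at attach, so conv_iommu_prot() can set IOMMU_CACHE unconditionally.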
Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP
On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy wrote: > > On 2022-03-23 17:27, Linus Torvalds wrote: > > > > I'm assuming that the ath9k issue is that it gives DMA mapping a big > > enough area to handle any possible packet size, and just expects - > > quite reasonably - smaller packets to only fill the part they need. > > > > Which that "info leak" patch obviously breaks entirely. > > Except that's the exact case which the new patch is addressing Not "addressing". Breaking. Which is why it will almost certainly get reverted. Not doing DMA to the whole area seems to be quite the sane thing to do for things like network packets, and overwriting the part that didn't get DMA'd with zeroes seems to be exactly the wrong thing here. So the SG_IO - and other random untrusted block command sources - data leak will almost certainly have to be addressed differently. Possibly by simply allocating the area with GFP_ZERO to begin with. Linus ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
On Fri, 18 Mar 2022 14:27:33 -0300 Jason Gunthorpe wrote:

> +static int conv_iommu_prot(u32 map_flags)
> +{
> +	int iommu_prot;
> +
> +	/*
> +	 * We provide no manual cache coherency ioctls to userspace and most
> +	 * architectures make the CPU ops for cache flushing privileged.
> +	 * Therefore we require the underlying IOMMU to support CPU coherent
> +	 * operation.
> +	 */
> +	iommu_prot = IOMMU_CACHE;

Where is this requirement enforced? AIUI we'd need to test IOMMU_CAP_CACHE_COHERENCY somewhere since functions like intel_iommu_map() simply drop the flag when not supported by HW.

This also seems like an issue relative to vfio compatibility that I don't see mentioned in that patch. Thanks,

Alex

> +	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> +		iommu_prot |= IOMMU_WRITE;
> +	if (map_flags & IOMMU_IOAS_MAP_READABLE)
> +		iommu_prot |= IOMMU_READ;
> +	return iommu_prot;
> +}
Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP
On 2022-03-23 17:27, Linus Torvalds wrote: On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko wrote: The following upstream commits: aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE" ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE break ath9k-based Wi-Fi access point for me. The AP emits beacons, but no client can connect to it, either from the very beginning, or shortly after start. These are the only symptoms I've noticed (i.e., no BUG/WARNING messages in `dmesg` etc). Funky, but clearly true: These commits appeared in v5.17 and v5.16.15, and both kernels are broken for me. I'm pretty confident these commits make the difference since I've built both v5.17 and v5.16.15 without them, and it fixed the issue. Can you double-check (or just explicitly confirm if you already did that test) that you need to revert *both* of those commits, and it's the later "rework" fix that triggers it? So, I do understand this might be an issue with regard to SG I/O handling in ath9k, hence relevant people in Cc. Yeah, almost certainly an ath9k bug, but a regression is a regression, so if people can't find the issue in ath9k, we'll have to revert those commits. Honestly, I personally think they were a bit draconian to begin with, and didn't limit their effects sufficiently. I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need. Which that "info leak" patch obviously breaks entirely. Except that's the exact case which the new patch is addressing - by copying the whole original area into the SWIOTLB bounce buffer to begin with, if we bounce the whole lot back after the device has only updated part of it, the non-updated parts now get overwritten with the same original contents, rather than whatever random crap happened to be left in the SWIOTLB buffer by its previous user. 
I'm extremely puzzled how any driver could somehow be dependent on non-device-written data getting replaced with random crap, given that it wouldn't happen with a real IOMMU, or if SWIOTLB just didn't need to bounce, and the data would hardly be deterministic either. I think I can see how aa6f8dcbab47 might increase the severity of a driver bug where it calls dma_sync_*_for_device() on part of a DMA_FROM_DEVICE mapping that the device *has* written to, without having called a corresponding dma_sync_*_for_cpu() first - previously that would have had no effect, but now SWIOTLB will effectively behave more like an eagerly-prefetching non-coherent cache and write back old data over new - but if ddbd89deb7d3 alone makes a difference then something really weird must be going on. Has anyone run a sanity check with CONFIG_DMA_API_DEBUG enabled to see if that flags anything up? Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
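[Editorial note: the behavioral change Robin describes -- the bounce buffer now being pre-filled with the original buffer's contents, so a full bounce-back after a partial device write restores the untouched bytes instead of leaking stale SWIOTLB contents -- can be demonstrated with a toy userspace model. This is purely illustrative; the real logic lives in kernel/dma/swiotlb.c and none of these names exist there:]

```c
#include <string.h>

#define BUF_LEN 8

/* Toy model of one SWIOTLB bounce slot. */
static unsigned char bounce[BUF_LEN];

/* Before commit ddbd89deb7d3/aa6f8dcbab47: a DMA_FROM_DEVICE map left
 * whatever the slot's previous user wrote in place. */
static void map_from_device_old(const unsigned char *orig)
{
	(void)orig;	/* original buffer deliberately not copied in */
}

/* After the fix: the whole original area is copied into the slot. */
static void map_from_device_new(const unsigned char *orig)
{
	memcpy(bounce, orig, BUF_LEN);
}

/* The device writes only the first n bytes of the slot. */
static void device_writes(unsigned int n, unsigned char val)
{
	memset(bounce, val, n);
}

/* Unmap bounces the entire slot back over the caller's buffer. */
static void unmap(unsigned char *buf)
{
	memcpy(buf, bounce, BUF_LEN);
}
```

With the old mapping, the tail of the caller's buffer ends up holding the previous slot user's data (the info leak); with the new mapping it holds the caller's original bytes, which is also why a driver that wrote into the mapped area behind the DMA API's back can now see its data overwritten.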
Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
On Tue, Mar 22, 2022 at 04:15:44PM -0600, Alex Williamson wrote: > > +static struct iopt_area * > > +iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages, > > + unsigned long iova, unsigned long start_byte, > > + unsigned long length, int iommu_prot, unsigned int flags) > > +{ > > + struct iopt_area *area; > > + int rc; > > + > > + area = kzalloc(sizeof(*area), GFP_KERNEL); > > + if (!area) > > + return ERR_PTR(-ENOMEM); > > + > > + area->iopt = iopt; > > + area->iommu_prot = iommu_prot; > > + area->page_offset = start_byte % PAGE_SIZE; > > + area->pages_node.start = start_byte / PAGE_SIZE; > > + if (check_add_overflow(start_byte, length - 1, &area->pages_node.last)) > > + return ERR_PTR(-EOVERFLOW); > > + area->pages_node.last = area->pages_node.last / PAGE_SIZE; > > + if (WARN_ON(area->pages_node.last >= pages->npages)) > > + return ERR_PTR(-EOVERFLOW); > > @area leaked in the above two error cases. Yes, thanks > > +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova, > > + unsigned long length, struct page **out_pages, bool write) > > +{ > > + unsigned long cur_iova = iova; > > + unsigned long last_iova; > > + struct iopt_area *area; > > + int rc; > > + > > + if (!length) > > + return -EINVAL; > > + if (check_add_overflow(iova, length - 1, &last_iova)) > > + return -EOVERFLOW; > > + > > + down_read(&iopt->iova_rwsem); > > + for (area = iopt_area_iter_first(iopt, iova, last_iova); area; > > +area = iopt_area_iter_next(area, iova, last_iova)) { > > + unsigned long last = min(last_iova, iopt_area_last_iova(area)); > > + unsigned long last_index; > > + unsigned long index; > > + > > + /* Need contiguous areas in the access */ > > + if (iopt_area_iova(area) < cur_iova || !area->pages) { > ^^^ > Should this be (cur_iova != iova && iopt_area_iova(area) < cur_iova)? 
Oh good eye That is a typo < should be >: if (iopt_area_iova(area) > cur_iova || !area->pages) { There are three boundary conditions here to worry about - interval trees return any nodes that intersect the queried range so the first found node can start after iova - There can be a gap in the intervals - The final area can end short of last_iova The last one is botched too and needs this: for ... { ... } + if (cur_iova != last_iova) + goto out_remove; The test suite isn't covering the boundary cases here yet, I added a FIXME for this for now. Thanks, Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
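[Editorial note: the three boundary conditions Jason lists -- the first returned area may start after the queried iova, there can be gaps between areas, and the final area can end short of last_iova -- can be captured in a small standalone check. This is an illustrative model, not the kernel's interval-tree walk; model_area and covers_contiguously are invented names:]

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for iopt_area: a [start, last] IOVA interval,
 * with areas assumed sorted and non-overlapping. */
struct model_area {
	unsigned long start;
	unsigned long last;	/* inclusive */
};

/*
 * Require that the areas cover [iova, last_iova] with no gap at the
 * front (first area starting late), in the middle, or at the tail
 * (last area ending early) -- the corrected logic described above.
 */
static bool covers_contiguously(const struct model_area *areas, size_t n,
				unsigned long iova, unsigned long last_iova)
{
	unsigned long cur_iova = iova;
	size_t i;

	for (i = 0; i < n; i++) {
		if (areas[i].start > cur_iova)
			return false;	/* gap before or between areas */
		if (areas[i].last >= last_iova)
			return true;	/* range fully covered */
		cur_iova = areas[i].last + 1;
	}
	return false;	/* final area ended short of last_iova */
}
```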
Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
On Wed, Mar 23, 2022 at 12:10:01PM -0600, Alex Williamson wrote: > > +EXPORT_SYMBOL_GPL(iommufd_bind_pci_device); > I'm stumped why this needs to be PCI specific. Anything beyond the RID > comment? Please enlighten. Thanks, The way it turned out in the end it is not for a good reason any more. I'll change it Jason
Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
On Fri, 18 Mar 2022 14:27:35 -0300 Jason Gunthorpe wrote: > +/** > + * iommufd_bind_pci_device - Bind a physical device to an iommu fd > + * @fd: iommufd file descriptor. > + * @pdev: Pointer to a physical PCI device struct > + * @id: Output ID number to return to userspace for this device > + * > + * A successful bind establishes an ownership over the device and returns > + * struct iommufd_device pointer, otherwise returns error pointer. > + * > + * A driver using this API must set driver_managed_dma and must not touch > + * the device until this routine succeeds and establishes ownership. > + * > + * Binding a PCI device places the entire RID under iommufd control. > + * > + * The caller must undo this with iommufd_unbind_device() > + */ > +struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev, > +u32 *id) > +{ > + struct iommufd_device *idev; > + struct iommufd_ctx *ictx; > + struct iommu_group *group; > + int rc; > + > + ictx = iommufd_fget(fd); > + if (!ictx) > + return ERR_PTR(-EINVAL); > + > + group = iommu_group_get(&pdev->dev); > + if (!group) { > + rc = -ENODEV; > + goto out_file_put; > + } > + > + /* > + * FIXME: Use a device-centric iommu api and this won't work with > + * multi-device groups > + */ > + rc = iommu_group_claim_dma_owner(group, ictx->filp); > + if (rc) > + goto out_group_put; > + > + idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE); > + if (IS_ERR(idev)) { > + rc = PTR_ERR(idev); > + goto out_release_owner; > + } > + idev->ictx = ictx; > + idev->dev = &pdev->dev; > + /* The calling driver is a user until iommufd_unbind_device() */ > + refcount_inc(&idev->obj.users); > + /* group refcount moves into iommufd_device */ > + idev->group = group; > + > + /* > + * If the caller fails after this success it must call > + * iommufd_unbind_device() which is safe since we hold this refcount. > + * This also means the device is a leaf in the graph and no other object > + * can take a reference on it. 
> + */ > + iommufd_object_finalize(ictx, &idev->obj); > + *id = idev->obj.id; > + return idev; > + > +out_release_owner: > + iommu_group_release_dma_owner(group); > +out_group_put: > + iommu_group_put(group); > +out_file_put: > + fput(ictx->filp); > + return ERR_PTR(rc); > +} > +EXPORT_SYMBOL_GPL(iommufd_bind_pci_device); I'm stumped why this needs to be PCI specific. Anything beyond the RID comment? Please enlighten. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP
On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko wrote: > > The following upstream commits: > > aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE" > ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE > > break ath9k-based Wi-Fi access point for me. The AP emits beacons, but > no client can connect to it, either from the very beginning, or > shortly after start. These are the only symptoms I've noticed (i.e., > no BUG/WARNING messages in `dmesg` etc). Funky, but clearly true: > These commits appeared in v5.17 and v5.16.15, and both kernels are > broken for me. I'm pretty confident these commits make the difference > since I've built both v5.17 and v5.16.15 without them, and it fixed > the issue. Can you double-check (or just explicitly confirm if you already did that test) that you need to revert *both* of those commits, and it's the later "rework" fix that triggers it? > So, I do understand this might be an issue with regard to SG I/O > handling in ath9k, hence relevant people in Cc. Yeah, almost certainly an ath9k bug, but a regression is a regression, so if people can't find the issue in ath9k, we'll have to revert those commits. Honestly, I personally think they were a bit draconian to begin with, and didn't limit their effects sufficiently. I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need. Which that "info leak" patch obviously breaks entirely. So my expectation is that this is something we'll just revert, but it would be really good to have the ath9k people double-check. Linus ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages
On Wed, Mar 23, 2022 at 04:37:50PM +0100, Niklas Schnelle wrote: > > +/* > > + * This holds a pinned page list for multiple areas of IO address space. > > The > > + * pages always originate from a linear chunk of userspace VA. Multiple > > + * io_pagetable's, through their iopt_area's, can share a single iopt_pages > > + * which avoids multi-pinning and double accounting of page consumption. > > + * > > + * indexes in this structure are measured in PAGE_SIZE units, are 0 based > > from > > + * the start of the uptr and extend to npages. pages are pinned dynamically > > + * according to the intervals in the users_itree and domains_itree, npages > > + * records the current number of pages pinned. > > This sounds wrong or at least badly named. If npages records the > current number of pages pinned then what does npinned record? Oh, it is a typo, the 2nd npages should be npinned, thanks Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v8 05/11] ACPI/IORT: Add a helper to retrieve RMR memory regions
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: 22 March 2022 19:09 > To: Shameerali Kolothum Thodi ; > linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org; > iommu@lists.linux-foundation.org > Cc: j...@solid-run.com; Linuxarm ; > steven.pr...@arm.com; Guohanjun (Hanjun Guo) ; > yangyicong ; sami.muja...@arm.com; > w...@kernel.org; wanghuiqiang > Subject: Re: [PATCH v8 05/11] ACPI/IORT: Add a helper to retrieve RMR > memory regions > > On 2022-02-21 15:43, Shameer Kolothum via iommu wrote: > > Add helper functions (iort_iommu_get/put_rmrs()) that > > retrieves/releases RMR memory descriptors associated > > with a given IOMMU. This will be used by IOMMU drivers > > to set up necessary mappings. > > > > Invoke it from the generic iommu helper functions. > > iommu_dma_get_resv_regions() already exists - please extend that rather > than adding a parallel implementation of the same thing but different. > IORT should export a single get_resv_regions helper which combines the > new RMRs with the existing MSI workaround, and a separate "do I need to > bypass this StreamID" helper for the SMMU drivers to call directly at > reset time, since the latter isn't really an iommu-dma responsibility. Right. I actually had couple of back and forth on the interfaces and settled on this mainly because it just requires a single interface from IORT for both get_resv_regions() and SMMU driver reset bypass path. ie, iort_iommu_get/put_rmrs()). But agree the above comment is also valid. > I'm happy to do that just by shuffling wrappers around for now - we can > come back and streamline the code properly afterwards - but the sheer > amount of indirection currently at play here is so hard to follow that > it's not even all that easy to see how it's crossing abstraction levels > improperly. Please find below the revised ones. Please take a look and let me know. 
Thanks, Shameer iommu-dma: void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list) { if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode)) iort_iommu_get_resv_regions(dev, list); } void iommu_dma_put_resv_regions(struct device *dev, struct list_head *list) { if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode)) iort_iommu_put_resv_regions(dev, list); generic_iommu_put_resv_regions(dev, list); } iort: void iort_iommu_get_resv_regions(struct device *dev, struct list_head *head) { struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); iort_iommu_msi_get_resv_regions(dev, head); iort_iommu_get_rmrs(fwspec->iommu_fwnode, dev, head); } void iort_iommu_put_resv_regions(struct device *dev, struct list_head *head) { ./*Free both RMRs and HW MSI ones */ } /* The below ones will be directly called from SMMU drivers during reset */ void iort_get_rmr_sids(struct fwnode_handle *iommu_fwnode, struct list_head *head) { iort_iommu_get_rmrs(iommu_fwnode, NULL, head); } } void iort_put_rmr_sids(struct fwnode_handle *iommu_fwnode, struct list_head *head) { iort_iommu_put_resv_regions(NULL, head); } ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v8 02/11] iommu: Introduce a union to struct iommu_resv_region
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: 22 March 2022 18:27 > To: Shameerali Kolothum Thodi ; > linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org; > iommu@lists.linux-foundation.org > Cc: Linuxarm ; lorenzo.pieral...@arm.com; > j...@8bytes.org; w...@kernel.org; wanghuiqiang > ; Guohanjun (Hanjun Guo) > ; steven.pr...@arm.com; sami.muja...@arm.com; > j...@solid-run.com; eric.au...@redhat.com; yangyicong > > Subject: Re: [PATCH v8 02/11] iommu: Introduce a union to struct > iommu_resv_region > > On 2022-02-21 15:43, Shameer Kolothum wrote: > > A union is introduced to struct iommu_resv_region to hold > > any firmware specific data. This is in preparation to add > > support for IORT RMR reserve regions and the union now holds > > the RMR specific information. > > > > Signed-off-by: Shameer Kolothum > > > --- > > include/linux/iommu.h | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > > index de0c57a567c8..b06952a75f95 100644 > > --- a/include/linux/iommu.h > > +++ b/include/linux/iommu.h > > @@ -126,6 +126,11 @@ enum iommu_resv_type { > > IOMMU_RESV_SW_MSI, > > }; > > > > +struct iommu_iort_rmr_data { > > + u32 *sids; /* Stream Ids associated with IORT RMR entry */ > > Please make this const. > > Further nit: capitalisation of "IDs" in the comment, otherwise I might > worry about the possibility of Stream Egos too :P True :). Will do that. Thanks, Shameer > > > + u32 num_sids; > > +}; > > + > > /** > >* struct iommu_resv_region - descriptor for a reserved memory region > >* @list: Linked list pointers > > @@ -133,6 +138,7 @@ enum iommu_resv_type { > >* @length: Length of the region in bytes > >* @prot: IOMMU Protection flags (READ/WRITE/...) 
> >* @type: Type of the reserved region > > + * @fw_data: FW specific reserved region data > > Nit: we've got plenty of room to spell out "Firmware-specific", and it > never hurts to make documentation as easy to read as possible. > > Thanks, > Robin. > > >*/ > > struct iommu_resv_region { > > struct list_headlist; > > @@ -140,6 +146,9 @@ struct iommu_resv_region { > > size_t length; > > int prot; > > enum iommu_resv_typetype; > > + union { > > + struct iommu_iort_rmr_data rmr; > > + } fw_data; > > }; > > > > /** ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages
On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote: > The top of the data structure provides an IO Address Space (IOAS) that is > similar to a VFIO container. The IOAS allows map/unmap of memory into > ranges of IOVA called iopt_areas. Domains and in-kernel users (like VFIO > mdevs) can be attached to the IOAS to access the PFNs that those IOVA > areas cover. > > The IO Address Space (IOAS) datastructure is composed of: > - struct io_pagetable holding the IOVA map > - struct iopt_areas representing populated portions of IOVA > - struct iopt_pages representing the storage of PFNs > - struct iommu_domain representing the IO page table in the system IOMMU > - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO >mdevs) > - struct xarray pinned_pfns holding a list of pages pinned by in-kernel >users > > This patch introduces the lowest part of the datastructure - the movement > of PFNs in a tiered storage scheme: > 1) iopt_pages::pinned_pfns xarray > 2) An iommu_domain > 3) The origin of the PFNs, i.e. the userspace pointer > > PFN have to be copied between all combinations of tiers, depending on the > configuration. > > The interface is an iterator called a 'pfn_reader' which determines which > tier each PFN is stored and loads it into a list of PFNs held in a struct > pfn_batch. > > Each step of the iterator will fill up the pfn_batch, then the caller can > use the pfn_batch to send the PFNs to the required destination. Repeating > this loop will read all the PFNs in an IOVA range. > > The pfn_reader and pfn_batch also keep track of the pinned page accounting. > > While PFNs are always stored and accessed as full PAGE_SIZE units the > iommu_domain tier can store with a sub-page offset/length to support > IOMMUs with a smaller IOPTE size than PAGE_SIZE. 
> > Signed-off-by: Jason Gunthorpe > --- > drivers/iommu/iommufd/Makefile | 3 +- > drivers/iommu/iommufd/io_pagetable.h| 101 > drivers/iommu/iommufd/iommufd_private.h | 20 + > drivers/iommu/iommufd/pages.c | 723 > 4 files changed, 846 insertions(+), 1 deletion(-) > create mode 100644 drivers/iommu/iommufd/io_pagetable.h > create mode 100644 drivers/iommu/iommufd/pages.c > > ---8<--- > + > +/* > + * This holds a pinned page list for multiple areas of IO address space. The > + * pages always originate from a linear chunk of userspace VA. Multiple > + * io_pagetable's, through their iopt_area's, can share a single iopt_pages > + * which avoids multi-pinning and double accounting of page consumption. > + * > + * indexes in this structure are measured in PAGE_SIZE units, are 0 based > from > + * the start of the uptr and extend to npages. pages are pinned dynamically > + * according to the intervals in the users_itree and domains_itree, npages > + * records the current number of pages pinned. This sounds wrong or at least badly named. If npages records the current number of pages pinned then what does npinned record? > + */ > +struct iopt_pages { > + struct kref kref; > + struct mutex mutex; > + size_t npages; > + size_t npinned; > + size_t last_npinned; > + struct task_struct *source_task; > + struct mm_struct *source_mm; > + struct user_struct *source_user; > + void __user *uptr; > + bool writable:1; > + bool has_cap_ipc_lock:1; > + > + struct xarray pinned_pfns; > + /* Of iopt_pages_user::node */ > + struct rb_root_cached users_itree; > + /* Of iopt_area::pages_node */ > + struct rb_root_cached domains_itree; > +}; > + ---8<--- ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
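[Editorial note: the pfn_reader/pfn_batch iteration the commit message describes -- each step fills a fixed-capacity batch with the next PFNs of an IOVA range, and the caller loops until the range is exhausted -- can be sketched as a toy model. The real reader additionally selects each PFN's storage tier (pinned_pfns xarray, iommu_domain, or the userspace pointer) and tracks pinned-page accounting; model_batch and model_reader_step are invented names:]

```c
#include <stddef.h>

/* Illustrative batch; the real struct pfn_batch packs runs of PFNs. */
#define BATCH_CAP 4

struct model_batch {
	unsigned long pfns[BATCH_CAP];
	size_t n;
};

/*
 * One iterator step: fill the batch with the next PFNs of the index
 * range [*cursor, last] (inclusive) and return how many were produced.
 * A return of 0 means the range has been fully read.
 */
static size_t model_reader_step(unsigned long *cursor, unsigned long last,
				struct model_batch *batch)
{
	batch->n = 0;
	while (*cursor <= last && batch->n < BATCH_CAP)
		batch->pfns[batch->n++] = (*cursor)++;
	return batch->n;
}
```

The caller's loop ("repeating this loop will read all the PFNs in an IOVA range") is then just: step, consume the batch, repeat until a step yields nothing.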
Re: [PATCH] iommu/arm: Expose ARM SMMUv3 related registers via sysfs
> On 23 Mar 2022, at 12:40, Robin Murphy wrote: > > On 2022-03-23 12:54, Miguel Luis wrote: >> Allows userspace to check for SMMUv3 features. > > What will userspace do with that information? > > It hardly matters what fancy new features might be present, if the kernel > and/or the abstracted interfaces available to userspace aren't using them. > Any functionality which is supported by a usable interface should ideally be > represented via that interface itself. > The inspiration was the same that Intel (cap/ecap) and AMD (cap/features) took exposing it's iommu feature registers on sysfs. It's an easy way to understand which features are supported by the hardware regardless of what the kernel supports. For example I could print the smmu->features and that would cover kernel supported features but wouldn't help when new hardware arrives to know which features are supported by the hardware. > Furthermore many of the raw SMMU features depend on other system components > and/or firmware, so the ID registers alone don't tell the full story anyway. > Would you mind to elaborate a bit more on that please? Would that mean that if a feature bit is set it doesn’t really tell that the feature is supported? >> Expose the following RO registers related to ARM SMMUv3 via sysfs: >> SMMU_IDR0 >> SMMU_IDR1 >> SMMU_IDR2 >> SMMU_IDR3 >> SMMU_IDR4 >> SMMU_IDR5 >> SMMU_IDR6 >> SMMU_IIDR >> SMMU_AIDR >> # find /sys | grep arm-iommu >> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu > > Nit: my main comment above notwithstanding, is this level of hierarchcy > meaningful or useful? "arm-iommu" isn't an established name for anything as > far as I'm aware. > I've followed the existent convention in other IOMMUs but I'm totally open to alternatives/suggestions. Miguel > Robin. 
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr5
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr3
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr1
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_aidr
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr6
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr4
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_iidr
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr2
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr0
>>
>> Signed-off-by: Miguel Luis
>> ---
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 199
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 +++
>>  2 files changed, 191 insertions(+), 35 deletions(-)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 6dc6d8b6b368..7f779d3f88f2 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -3424,17 +3424,16 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
>>  static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>  {
>> -	u32 reg;
>>  	bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
>>
>>  	/* IDR0 */
>> -	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
>> +	smmu->idr0 = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
>>
>>  	/* 2-level structures */
>> -	if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
>> +	if (FIELD_GET(IDR0_ST_LVL, smmu->idr0) == IDR0_ST_LVL_2LVL)
>>  		smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
>> -	if (reg & IDR0_CD2L)
>> +	if (smmu->idr0 & IDR0_CD2L)
>>  		smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
>>
>>  	/*
>> @@ -3442,7 +3441,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>  	 * We currently require the same endianness as the CPU, but this
>>  	 * could be changed later by adding a new IO_PGTABLE_QUIRK.
>>  	 */
>> -	switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
>> +	switch (FIELD_GET(IDR0_TTENDIAN, smmu->idr0)) {
>>  	case IDR0_TTENDIAN_MIXED:
>>  		smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
>>  		break;
>> @@ -3461,22 +3460,22 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>  	}
>>
>>  	/* Boolean feature flags */
>> -	if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
>> +	if (IS_ENABLED(CONFIG_PCI_PRI) && smmu->idr0 & IDR0_PRI)
>>  		smmu->features |= ARM_SMMU_FEAT_PRI;
>> -	if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
>> +	if (IS_ENABLED(CONFIG_PCI_ATS) && smmu->idr0 & IDR0_ATS)
>>  		smmu->features |= ARM_SMMU_FEAT_ATS;
>> -	if (reg & IDR0_SEV)
>> +	if (smmu->idr0 & IDR0_SEV)
>>  		smmu->features |= ARM_SMM
Re: [PATCH] iommu/arm: Expose ARM SMMUv3 related registers via sysfs
On 2022-03-23 12:54, Miguel Luis wrote:
> Allows userspace to check for SMMUv3 features.

What will userspace do with that information?

It hardly matters what fancy new features might be present, if the kernel
and/or the abstracted interfaces available to userspace aren't using them.
Any functionality which is supported by a usable interface should ideally be
represented via that interface itself.

Furthermore many of the raw SMMU features depend on other system components
and/or firmware, so the ID registers alone don't tell the full story anyway.

> Expose the following RO registers related to ARM SMMUv3 via sysfs:
> SMMU_IDR0
> SMMU_IDR1
> SMMU_IDR2
> SMMU_IDR3
> SMMU_IDR4
> SMMU_IDR5
> SMMU_IDR6
> SMMU_IIDR
> SMMU_AIDR
>
> # find /sys | grep arm-iommu
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu

Nit: my main comment above notwithstanding, is this level of hierarchy
meaningful or useful? "arm-iommu" isn't an established name for anything as
far as I'm aware.

Robin.

> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr5
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr3
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr1
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_aidr
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr6
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr4
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_iidr
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr2
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr0
>
> Signed-off-by: Miguel Luis
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 199
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 +++
>  2 files changed, 191 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 6dc6d8b6b368..7f779d3f88f2 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3424,17 +3424,16 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
>  static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  {
> -	u32 reg;
>  	bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
>
>  	/* IDR0 */
> -	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
> +	smmu->idr0 = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
>
>  	/* 2-level structures */
> -	if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
> +	if (FIELD_GET(IDR0_ST_LVL, smmu->idr0) == IDR0_ST_LVL_2LVL)
>  		smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
> -	if (reg & IDR0_CD2L)
> +	if (smmu->idr0 & IDR0_CD2L)
>  		smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
>
>  	/*
> @@ -3442,7 +3441,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  	 * We currently require the same endianness as the CPU, but this
>  	 * could be changed later by adding a new IO_PGTABLE_QUIRK.
>  	 */
> -	switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
> +	switch (FIELD_GET(IDR0_TTENDIAN, smmu->idr0)) {
>  	case IDR0_TTENDIAN_MIXED:
>  		smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
>  		break;
> @@ -3461,22 +3460,22 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  	}
>
>  	/* Boolean feature flags */
> -	if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
> +	if (IS_ENABLED(CONFIG_PCI_PRI) && smmu->idr0 & IDR0_PRI)
>  		smmu->features |= ARM_SMMU_FEAT_PRI;
> -	if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
> +	if (IS_ENABLED(CONFIG_PCI_ATS) && smmu->idr0 & IDR0_ATS)
>  		smmu->features |= ARM_SMMU_FEAT_ATS;
> -	if (reg & IDR0_SEV)
> +	if (smmu->idr0 & IDR0_SEV)
>  		smmu->features |= ARM_SMMU_FEAT_SEV;
> -	if (reg & IDR0_MSI) {
> +	if (smmu->idr0 & IDR0_MSI) {
>  		smmu->features |= ARM_SMMU_FEAT_MSI;
>  		if (coherent && !disable_msipolling)
>  			smmu->options |= ARM_SMMU_OPT_MSIPOLL;
>  	}
> -	if (reg & IDR0_HYP) {
> +	if (smmu->idr0 & IDR0_HYP) {
>  		smmu->features |= ARM_SMMU_FEAT_HYP;
>  		if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
>  			smmu->features |= ARM_SMMU_FEAT_E2H;
> @@ -3486,11 +3485,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  	 * The coherency feature as set by FW is used in preference to the ID
>  	 * register, but warn on mismatch.
>  	 */
> -	if (!!(reg & IDR0_COHACC) != coherent)
> +	if (!!(smmu->idr0 & IDR0_COHACC) != coherent)
>  		dev_warn(smmu->dev, "IDR0.COHACC overridden by FW configuration (%s)\
[PATCH] iommu/arm: Expose ARM SMMUv3 related registers via sysfs
Allows userspace to check for SMMUv3 features.

Expose the following RO registers related to ARM SMMUv3 via sysfs:
SMMU_IDR0
SMMU_IDR1
SMMU_IDR2
SMMU_IDR3
SMMU_IDR4
SMMU_IDR5
SMMU_IDR6
SMMU_IIDR
SMMU_AIDR

# find /sys | grep arm-iommu
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr5
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr3
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr1
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_aidr
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr6
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr4
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_iidr
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr2
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr0

Signed-off-by: Miguel Luis
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 199
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 +++
 2 files changed, 191 insertions(+), 35 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6dc6d8b6b368..7f779d3f88f2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3424,17 +3424,16 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
-	u32 reg;
 	bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;

 	/* IDR0 */
-	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
+	smmu->idr0 = readl_relaxed(smmu->base + ARM_SMMU_IDR0);

 	/* 2-level structures */
-	if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
+	if (FIELD_GET(IDR0_ST_LVL, smmu->idr0) == IDR0_ST_LVL_2LVL)
 		smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
-	if (reg & IDR0_CD2L)
+	if (smmu->idr0 & IDR0_CD2L)
 		smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;

 	/*
@@ -3442,7 +3441,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	 * We currently require the same endianness as the CPU, but this
 	 * could be changed later by adding a new IO_PGTABLE_QUIRK.
 	 */
-	switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
+	switch (FIELD_GET(IDR0_TTENDIAN, smmu->idr0)) {
 	case IDR0_TTENDIAN_MIXED:
 		smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
 		break;
@@ -3461,22 +3460,22 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	}

 	/* Boolean feature flags */
-	if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
+	if (IS_ENABLED(CONFIG_PCI_PRI) && smmu->idr0 & IDR0_PRI)
 		smmu->features |= ARM_SMMU_FEAT_PRI;
-	if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
+	if (IS_ENABLED(CONFIG_PCI_ATS) && smmu->idr0 & IDR0_ATS)
 		smmu->features |= ARM_SMMU_FEAT_ATS;
-	if (reg & IDR0_SEV)
+	if (smmu->idr0 & IDR0_SEV)
 		smmu->features |= ARM_SMMU_FEAT_SEV;
-	if (reg & IDR0_MSI) {
+	if (smmu->idr0 & IDR0_MSI) {
 		smmu->features |= ARM_SMMU_FEAT_MSI;
 		if (coherent && !disable_msipolling)
 			smmu->options |= ARM_SMMU_OPT_MSIPOLL;
 	}
-	if (reg & IDR0_HYP) {
+	if (smmu->idr0 & IDR0_HYP) {
 		smmu->features |= ARM_SMMU_FEAT_HYP;
 		if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
 			smmu->features |= ARM_SMMU_FEAT_E2H;
@@ -3486,11 +3485,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	 * The coherency feature as set by FW is used in preference to the ID
 	 * register, but warn on mismatch.
 	 */
-	if (!!(reg & IDR0_COHACC) != coherent)
+	if (!!(smmu->idr0 & IDR0_COHACC) != coherent)
 		dev_warn(smmu->dev, "IDR0.COHACC overridden by FW configuration (%s)\n",
 			 coherent ? "true" : "false");
-	switch (FIELD_GET(IDR0_STALL_MODEL, reg)) {
+	switch (FIELD_GET(IDR0_STALL_MODEL, smmu->idr0)) {
 	case IDR0_STALL_MODEL_FORCE:
 		smmu->features |= ARM_SMMU_FEAT_STALL_FORCE;
 		fallthrough;
@@ -3498,19 +3497,19 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 		smmu->features |= ARM_SMMU_FEAT_STALLS;
 	}

-	if (reg & IDR0_S1P)
+	if (smmu->idr0 & IDR0_S1P)
 		smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
-	if (reg & IDR0_S2P)
+	if (smmu->idr0 & IDR0_S2P)
 		smmu->features
Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP
Adding regressions list so that this can be tracked properly, including the
full report below.

Oleksandr Natalenko writes:

> Hello.
>
> The following upstream commits:
>
> aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
> ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
>
> break ath9k-based Wi-Fi access point for me. The AP emits beacons, but no
> client can connect to it, either from the very beginning, or shortly after
> start. These are the only symptoms I've noticed (i.e., no BUG/WARNING
> messages in `dmesg` etc).
>
> The hardware is:
>
> ```
> $ dmesg | grep -i swiotlb
> [0.426785] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>
> BIOS Information
> 	Vendor: American Megatrends Inc.
> 	Version: P1.50
> 	Release Date: 04/16/2018
>
> Base Board Information
> 	Manufacturer: ASRock
> 	Product Name: J3710-ITX
>
> 02:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
> 	Subsystem: Lite-On Communications Inc Device 6621
> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> 	Latency: 0, Cache Line Size: 64 bytes
> 	Interrupt: pin A routed to IRQ 17
> 	Region 0: Memory at 8140 (64-bit, non-prefetchable) [size=512K]
> 	Expansion ROM at 8148 [disabled] [size=64K]
> 	Capabilities: [40] Power Management version 2
> 		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
> 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> 	Capabilities: [50] MSI: Enable- Count=1/4 Maskable+ 64bit+
> 		Address: Data:
> 		Masking: Pending:
> 	Capabilities: [70] Express (v2) Endpoint, MSI 00
> 		DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
> 			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
> 		DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> 			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
> 			MaxPayload 128 bytes, MaxReadReq 512 bytes
> 		DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> 		LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
> 			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
> 		LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> 			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> 		LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> 			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 		DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR-
> 			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
> 			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> 			 FRS- TPHComp- ExtTPHComp-
> 			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
> 			 AtomicOpsCtl: ReqEn-
> 		LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
> 		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> 			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> 			 Compliance De-emphasis: -6dB
> 		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
> 			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
> 			 Retimer- 2Retimers- CrosslinkRes: unsupported
> 	Capabilities: [100 v1] Advanced Error Reporting
> 		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 		UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> 		CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> 		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> 		AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
> 			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> 		HeaderLog:
> 	Capabilities: [140 v1] Virtual Channel
> 		Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
> 		Arb: Fixed- WRR32- WRR64- WRR128-
> 		Ctrl: ArbSelect=Fixed
[REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP
Hello.

The following upstream commits:

aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE

break ath9k-based Wi-Fi access point for me. The AP emits beacons, but no
client can connect to it, either from the very beginning, or shortly after
start. These are the only symptoms I've noticed (i.e., no BUG/WARNING
messages in `dmesg` etc).

The hardware is:

```
$ dmesg | grep -i swiotlb
[0.426785] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)

BIOS Information
	Vendor: American Megatrends Inc.
	Version: P1.50
	Release Date: 04/16/2018

Base Board Information
	Manufacturer: ASRock
	Product Name: J3710-ITX

02:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
	Subsystem: Lite-On Communications Inc Device 6621
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows
On Tue, 22 Mar 2022 17:27:36 +, Robin Murphy wrote:
>
> Originally, creating the dma_ranges resource list in pre-sorted fashion
> was the simplest and most efficient way to enforce the order required by
> iova_reserve_pci_windows(). However since then at least one PCI host
> driver is now re-sorting the list for its own probe-time processing,
> which doesn't seem entirely unreasonable, so that basic assumption no
> longer holds. Make iommu-dma robust and get the sort order it needs by
> explicitly sorting, which means we can also save the effort at creation
> time and just build the list in whatever natural order the DT had.
>
> Signed-off-by: Robin Murphy
> ---
>
> Looking at this area off the back of the XGene thread[1] made me realise
> that we need to do it anyway, regardless of whether it might also happen
> to restore the previous XGene behaviour or not. Presumably nobody's
> tried to use pcie-cadence-host behind an IOMMU yet...

This definitely restores PCIe functionality on my Mustang (XGene-1).
Hopefully dann can comment on whether this addresses his own issue, as his
firmware is significantly different.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.