Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-01 Thread da...@gibson.dropbear.id.au
On Fri, Oct 01, 2021 at 09:25:05AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote:
> > On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > > > 
> > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > > format information for the target I/O page table.
> > > > > >
> > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > > semantics. For this type the user should specify the addr_width of
> > > > > > the I/O address space and whether the I/O page table is created in
> > > > > > an iommu enforce_snoop format. enforce_snoop must be true at this 
> > > > > > point,
> > > > > > as the false setting requires additional contract with KVM on 
> > > > > > handling
> > > > > > WBINVD emulation, which can be added later.
> > > > > >
> > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > > for what formats can be specified when allocating an IOASID.
> > > > > >
> > > > > > Open:
> > > > > > - Devices on PPC platform currently use a different iommu driver in 
> > > > > > vfio.
> > > > > >   Per previous discussion they can also use vfio type1v2 as long as 
> > > > > > there
> > > > > >   is a way to claim a specific iova range from a system-wide 
> > > > > > address space.
> > > > > >   This requirement doesn't sound PPC specific, as addr_width for pci
> > > > > devices
> > > > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC 
> > > > > > hasn't
> > > > > >   adopted this design yet. We hope to have formal alignment in v1
> > > > > discussion
> > > > > >   and then decide how to incorporate it in v2.
> > > > > 
> > > > > I think the request was to include a start/end IO address hint when
> > > > > creating the ios. When the kernel creates it then it can return the
> > > > 
> > > > is the hint single-range or could be multiple-ranges?
> > > 
> > > David explained it here:
> > > 
> > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> > 
> > Apparently not well enough.  I've attempted again in this thread.
> > 
> > > qemu needs to be able to choose if it gets the 32 bit range or 64
> > > bit range.
> > 
> > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > guest, and therefore needs to request both from the host.
> 
> As I understood your remarks each IOAS can only be one of the formats
> as they have a different PTE layout. So here I meant that qemu needs to
> be able to pick *for each IOAS* which of the two formats it is.

No.  Both windows are in the same IOAS.  A device could do DMA
simultaneously to both windows.  More realistically a 64-bit DMA
capable and a non-64-bit DMA capable device could be in the same group
and be doing DMAs to different windows simultaneously.

> > Or rather, it *might* need to supply both.  It will supply just the
> > 32-bit range by default, but the guest can request the 64-bit range
> > and/or remove and resize the 32-bit range via hypercall interfaces.
> > Vaguely recent Linux guests certainly will request the 64-bit range in
> > addition to the default 32-bit range.
> 
> And this would result in two different IOAS objects

There might be two different IOAS objects for setup, but at some point
they need to be combined into one IOAS to which the device is actually
attached.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Upstream-Patching for iommu (intel)

2021-10-01 Thread Ajay Garg
Hi All.

What is the upstream list, wherein patches for iommu (intel) might be posted?
Is it iommu@lists.linux-foundation.org?


Thanks and Regards,
Ajay


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Logan Gunthorpe




On 2021-10-01 4:46 p.m., Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote:
> 
>>> It would close this issue, however synchronize_rcu() is very slow
>>> (think > 1second) in some cases and thus cannot be inserted here.
>>
>> It shouldn't be *that* slow, at least not the vast majority of the
>> time... it seems a bit unreasonable that a CPU wouldn't schedule for
>> more than a second. 
> 
> I've seen bug reports on exactly this, it is well known. Loaded
> big multi-cpu systems have high delays here, for whatever reason.
> 
>> But these aren't fast paths and synchronize_rcu() already gets
>> called in the unbind path for p2pdma a couple of times. I'm sure it
>> would also be fine to slow down the vma_close() path as well.
> 
> vma_close is done in a loop destroying vma's and if each synchronize
> costs > 1s it can take forever to close a process. We had to kill a
> similar use of synchronize_rcu in RDMA because users were complaining
> of > 40s process exit times.

Ah, fair. This adds a bit of complexity, but we could do a call_rcu() in
vma_close to do the page frees.
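
A rough sketch of what that could look like (purely illustrative, not from
any posted patch; the struct and function names are made up):

struct p2pdma_vma_pages {
	struct rcu_head rcu;
	unsigned long npages;
	struct page *pages[];
};

/* Runs after a grace period, so no gup_fast walker can still be looking
 * at the stale PTEs by the time the pages go back to the allocator. */
static void p2pdma_vma_pages_free_rcu(struct rcu_head *head)
{
	struct p2pdma_vma_pages *p =
		container_of(head, struct p2pdma_vma_pages, rcu);
	unsigned long i;

	for (i = 0; i < p->npages; i++)
		put_page(p->pages[i]);
	kfree(p);
}

/* Called from the vma close path instead of dropping the references
 * inline (and instead of blocking in synchronize_rcu()). */
static void p2pdma_vma_pages_free(struct p2pdma_vma_pages *p)
{
	call_rcu(&p->rcu, p2pdma_vma_pages_free_rcu);
}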

Logan


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread John Hubbard via iommu

On 10/1/21 15:46, Jason Gunthorpe wrote:

> On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote:
> 
> > > It would close this issue, however synchronize_rcu() is very slow
> > > (think > 1second) in some cases and thus cannot be inserted here.
> > 
> > It shouldn't be *that* slow, at least not the vast majority of the
> > time... it seems a bit unreasonable that a CPU wouldn't schedule for
> > more than a second.
> 
> I've seen bug reports on exactly this, it is well known. Loaded
> big multi-cpu systems have high delays here, for whatever reason.



So have I. One reason is that synchronize_rcu() doesn't merely wait
for a context switch on each CPU--it also waits for callbacks (such as
those set up by call_rcu(), if I understand correctly) to run.

These can really add up to something quite substantial. In fact, I don't
think there is an upper limit on the running times, anywhere.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Jason Gunthorpe
On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote:

> > It would close this issue, however synchronize_rcu() is very slow
> > (think > 1second) in some cases and thus cannot be inserted here.
> 
> It shouldn't be *that* slow, at least not the vast majority of the
> time... it seems a bit unreasonable that a CPU wouldn't schedule for
> more than a second. 

I've seen bug reports on exactly this, it is well known. Loaded
big multi-cpu systems have high delays here, for whatever reason.

> But these aren't fast paths and synchronize_rcu() already gets
> called in the unbind path for p2pdma a couple of times. I'm sure it
> would also be fine to slow down the vma_close() path as well.

vma_close is done in a loop destroying vma's and if each synchronize
costs > 1s it can take forever to close a process. We had to kill a
similar use of synchronize_rcu in RDMA because users were complaining
of > 40s process exit times.

The driver unload path is fine to be slow, and is probably done on an
unloaded system where synchronize_rcu is not so bad

Anyway, it is not really something for this series to fix, just
something we should all be aware of and probably ought to get fixed
before we do much more with ZONE_DEVICE pages

Jason


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Logan Gunthorpe




On 2021-10-01 4:14 p.m., Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 02:13:14PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote:
>>>> Before the invalidation, an active flag is cleared to ensure no new
>>>> mappings can be created while the unmap is proceeding.
>>>> unmap_mapping_range() should sequence itself with the TLB flush and
>>>
>>> AFAIK unmap_mapping_range() kicks off the TLB flush and then
>>> returns. It doesn't always wait for the flush to fully finish. Ie some
>>> cases use RCU to lock the page table against GUP fast and so the
>>> put_page() doesn't happen until the call_rcu completes - after a grace
>>> period. The unmap_mapping_range() does not wait for grace periods.
>>
>> Admittedly, the tlb flush code isn't the easiest code to understand.
>> But, yes it seems at least on some arches the pages are freed by
>> call_rcu(). But can't this be fixed easily by adding a synchronize_rcu()
>> call after calling unmap_mapping_range()? Certainly after a
>> synchronize_rcu(), the TLB has been flushed and it is safe to free those
>> pages.
> 
> It would close this issue, however synchronize_rcu() is very slow
> (think > 1second) in some cases and thus cannot be inserted here.

It shouldn't be *that* slow, at least not the vast majority of the
time... it seems a bit unreasonable that a CPU wouldn't schedule for
more than a second. But these aren't fast paths and synchronize_rcu()
already gets called in the unbind path for p2pdma a couple of times. I'm
sure it would also be fine to slow down the vma_close() path as well.

> I'm also not completely sure that rcu is the only case, I don't know
> how every arch handles its gather structure.. I have a feeling the
> general intention was for this to be asynchronous

Yeah, this is not clear to me either.

> My preferences are to either remove devmap from gup_fast, or fix it to
> not use special pages - the latter being obviously better.

Yeah, I rather expect DAX users want the optimization provided by
gup_fast. I don't think P2PDMA users would be happy about being stuck
with slow gup either.

Logan


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Jason Gunthorpe
On Fri, Oct 01, 2021 at 02:13:14PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote:
> >> Before the invalidation, an active flag is cleared to ensure no new
> >> mappings can be created while the unmap is proceeding.
> >> unmap_mapping_range() should sequence itself with the TLB flush and
> > 
> > AFAIK unmap_mapping_range() kicks off the TLB flush and then
> > returns. It doesn't always wait for the flush to fully finish. Ie some
> > cases use RCU to lock the page table against GUP fast and so the
> > put_page() doesn't happen until the call_rcu completes - after a grace
> > period. The unmap_mapping_range() does not wait for grace periods.
> 
> Admittedly, the tlb flush code isn't the easiest code to understand.
> But, yes it seems at least on some arches the pages are freed by
> call_rcu(). But can't this be fixed easily by adding a synchronize_rcu()
> call after calling unmap_mapping_range()? Certainly after a
> synchronize_rcu(), the TLB has been flushed and it is safe to free those
> pages.

It would close this issue, however synchronize_rcu() is very slow
(think > 1second) in some cases and thus cannot be inserted here.

I'm also not completely sure that rcu is the only case, I don't know
how every arch handles its gather structure.. I have a feeling the
general intention was for this to be asynchronous

My preferences are to either remove devmap from gup_fast, or fix it to
not use special pages - the latter being obviously better.

Jason


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Logan Gunthorpe



On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote:
>> Before the invalidation, an active flag is cleared to ensure no new
>> mappings can be created while the unmap is proceeding.
>> unmap_mapping_range() should sequence itself with the TLB flush and
> 
> AFAIK unmap_mapping_range() kicks off the TLB flush and then
> returns. It doesn't always wait for the flush to fully finish. Ie some
> cases use RCU to lock the page table against GUP fast and so the
> put_page() doesn't happen until the call_rcu completes - after a grace
> period. The unmap_mapping_range() does not wait for grace periods.

Admittedly, the tlb flush code isn't the easiest code to understand.
But, yes it seems at least on some arches the pages are freed by
call_rcu(). But can't this be fixed easily by adding a synchronize_rcu()
call after calling unmap_mapping_range()? Certainly after a
synchronize_rcu(), the TLB has been flushed and it is safe to free those
pages.

>> P2PDMA follows this pattern, except pages are not mapped linearly and
>> are returned to the genalloc when their refcount falls to 1. This only
>> happens after a VMA is closed which should imply the PTEs have already
>> been unlinked from the pages. 
> 
> And here is the problem, since the genalloc is being used we now care
> that a page should not continue to be accessed by userspace after it
> has been placed back into the genalloc. I suppose fsdax has the same
> basic issue too.

Ok, similar question. Then if we call synchronize_rcu() in vm_close(),
before the put_page() calls which return the pages to the genalloc,
would that not guarantee the TLBs have been appropriately flushed?


>> Not to say that all this couldn't use a big conceptual cleanup. A
>> similar question exists with the single find_special_page() user
>> (xen/gntdev) and it's definitely not clear what the differences are
>> between the find_special_page() and vmf_insert_mixed() techniques and
>> when one should be used over the other. Or could they both be merged to
>> use the same technique?
> 
> Oh that gntdev stuff is just nonsense. IIRC it is trying to delegate
> control over a PTE entry itself to the hypervisor.
> 
>   /*
>* gntdev takes the address of the PTE in find_grant_ptes() and
>* passes it to the hypervisor in gntdev_map_grant_pages(). The
>* purpose of the notifier is to prevent the hypervisor pointer
>* to the PTE from going stale.
>*
>* Since this vma's mappings can't be touched without the
>* mmap_lock, and we are holding it now, there is no need for
>* the notifier_range locking pattern.
> 
> I vaguely recall it stuffs in a normal page then has the hypervisor
> overwrite the PTE. When it comes time to free the PTE it recovers the
> normal page via the 'find_special_page' hack and frees it. Somehow the
> hypervisor is also using the normal page for something.
> 
> It is all very strange and one shouldn't think about it :|

Found this from an old commit message which seems to be a better
explanation, though I still don't fully understand it:

   In a Xen PV guest, the PTEs contain MFNs so get_user_pages() (for
   example) must do an MFN to PFN (M2P) lookup before it can get the
   page.  For foreign pages (those owned by another guest) the M2P
   lookup returns the PFN as seen by the foreign guest (which would be
   completely the wrong page for the local guest).

   This cannot be fixed up improving the M2P lookup since one MFN may be
   mapped onto two or more pages so getting the right page is impossible
   given just the MFN.

Yes, all very strange.

Logan


Re: [git pull] IOMMU Fixes for Linux v5.15-rc3

2021-10-01 Thread pr-tracker-bot
The pull request you sent on Fri, 1 Oct 2021 18:01:38 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
> tags/iommu-fixes-v5.15-rc3

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/89e503592385fbed872c7ea1fb89931ece3409a5

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Jason Gunthorpe
On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:

> In device-dax, the refcount is only used to prevent the device, and
> therefore the pages, from going away on device unbind. Pages cannot be
> recycled, as you say, as they are mapped linearly within the device. The
> address space invalidation is done only when the device is unbound.

By address space invalidation I mean invalidation of the VMA that is
pointing to those pages.

device-dax may not have an issue with use-after-VMA-invalidation by
its very nature since every PFN always points to the same
thing. fsdax and this p2p stuff are different though.

> Before the invalidation, an active flag is cleared to ensure no new
> mappings can be created while the unmap is proceeding.
> unmap_mapping_range() should sequence itself with the TLB flush and

AFAIK unmap_mapping_range() kicks off the TLB flush and then
returns. It doesn't always wait for the flush to fully finish. Ie some
cases use RCU to lock the page table against GUP fast and so the
put_page() doesn't happen until the call_rcu completes - after a grace
period. The unmap_mapping_range() does not wait for grace periods.

This is why for normal memory the put_page is done after the TLB flush
completes, not when unmap_mapping_range() finishes. This ensures that
before the refcount reaches 0 no concurrent GUP fast can still observe
the old PTEs.
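
As a rough illustration of that ordering (a simplified sketch, not the
actual mm/gup.c code):

static bool gup_fast_one_sketch(pte_t *ptep, struct page **pagep)
{
	pte_t pte = READ_ONCE(*ptep);
	struct page *page = pte_page(pte);

	/* Speculative reference; may race with the unmap/flush path. */
	if (!page_cache_get_speculative(page))
		return false;

	/* Re-check the PTE: if the zap already cleared it, back off. */
	if (pte_val(pte) != pte_val(READ_ONCE(*ptep))) {
		put_page(page);
		return false;
	}

	*pagep = page;
	return true;
}

The walker can only do this safely because the zap path keeps its own
reference until the TLB flush (or RCU grace period) is over, which is the
put_page-after-flush rule described above.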

> GUP-fast using the same mechanism it does for regular pages. As far as I
> can see, by the time unmap_mapping_range() returns, we should be
> confident that there are no pages left in any mapping (seeing no new
> pages could be added since before the call). 

When viewed under the page table locks this is true, but the 'fast'
walkers like gup_fast and hmm_range_fault can continue to be working
on old data in the ptes because they don't take the page table locks.

They interact with unmap_mapping_range() via the IPI/rcu (gup fast) or
mmu notifier sequence count (hmm_range_fault)

> P2PDMA follows this pattern, except pages are not mapped linearly and
> are returned to the genalloc when their refcount falls to 1. This only
> happens after a VMA is closed which should imply the PTEs have already
> been unlinked from the pages. 

And here is the problem, since the genalloc is being used we now care
that a page should not continue to be accessed by userspace after it
has been placed back into the genalloc. I suppose fsdax has the same
basic issue too.

> Not to say that all this couldn't use a big conceptual cleanup. A
> similar question exists with the single find_special_page() user
> (xen/gntdev) and it's definitely not clear what the differences are
> between the find_special_page() and vmf_insert_mixed() techniques and
> when one should be used over the other. Or could they both be merged to
> use the same technique?

Oh that gntdev stuff is just nonsense. IIRC it is trying to delegate
control over a PTE entry itself to the hypervisor.

/*
 * gntdev takes the address of the PTE in find_grant_ptes() and
 * passes it to the hypervisor in gntdev_map_grant_pages(). The
 * purpose of the notifier is to prevent the hypervisor pointer
 * to the PTE from going stale.
 *
 * Since this vma's mappings can't be touched without the
 * mmap_lock, and we are holding it now, there is no need for
 * the notifier_range locking pattern.

I vaguely recall it stuffs in a normal page then has the hypervisor
overwrite the PTE. When it comes time to free the PTE it recovers the
normal page via the 'find_special_page' hack and frees it. Somehow the
hypervisor is also using the normal page for something.

It is all very strange and one shouldn't think about it :|

Jason


Re: [PATCH v2 1/5] [RFC]iommu: Add a IOMMU_DEVONLY protection flag

2021-10-01 Thread Alyssa Rosenzweig
> The IOMMU_DEVONLY flag allows the caller to flag mappings backed by
> device-private buffers. That means other devices or CPUs are not
> expected to access the physical memory region pointed by the mapping,
> and the MMU driver can safely restrict the shareability domain to the
> device itself.
> 
> Will be used by the ARM MMU driver to flag Mali mappings accessed only
> by the GPU as Inner-shareable.
> 
> Signed-off-by: Boris Brezillon 
> ---
>  include/linux/iommu.h | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index d2f3435e7d17..db14781b522f 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -31,6 +31,13 @@
>   * if the IOMMU page table format is equivalent.
>   */
>  #define IOMMU_PRIV   (1 << 5)
> +/*
> + * Mapping is only accessed by the device behind the iommu. That means other
> + * devices or CPUs are not expected to access this physical memory region,
> + * and the MMU driver can safely restrict the shareability domain to the
> + * device itself.
> + */
> +#define IOMMU_DEVONLY(1 << 6)
>  
>  struct iommu_ops;
>  struct iommu_group;

This seems totally reasonable to me, but it is well-known that I'm not
on good terms with the iommu subsystem. Let's wait for Robin to NAK :-P


Re: [PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag

2021-10-01 Thread Alyssa Rosenzweig
> > This seems reasonable to me - it matches the kbase
> > BASE_MEM_COHERENT_SYSTEM (only backwards obviously) and it worked
> > reasonably well for the blob.

Oh, is that what that was for? I remember seeing it set on Midgard for
varyings. Good to go full circle now.

> > But I'm wondering if we need to do anything special to deal with the
> > fact we will now have some non-coherent mappings on an otherwise
> > coherent device.
> > 
> > There are certainly some oddities around how these buffers will be
> > mapped into user space if requested, e.g. panfrost_gem_create_object()
> > sets 'map_wc' based on pfdev->coherent which is arguably wrong for
> > GPUONLY. So there are two things we could consider:
> > 
> > a) Actually prevent user space mapping GPUONLY flagged buffers. Which
> > matches the intention of the name.
> 
> I intended to do that, just forgot to add wrappers around
> drm_gem_shmem_{mmap,vmap}() to forbid CPU-mappings on gpuonly buffers.

This feels like the cleaner solution to me.

> > b) Attempt to provide user space with the tools to safely interact with
> > the buffers (this is the kbase approach). This does have the benefit of
> > allowing *mostly* GPU access. An example here is the tiler heap where
> > the CPU could zero out as necessary but mostly the GPU has ownership and
> > the CPU never reads the contents. GPUONLY/DEVONLY might not be the best
> > name in that case.
> 
> Uh, right, I forgot we had to zero the tiler heap on Midgard (most of
> the time done with a WRITE_VALUE job, but there's an exception on some
> old Midgard GPUs IIRC).

"Attempt" is the key word here :|

We indeed only touch the tiler heap from the CPU on v4, and life's too
short to care about new optimizations for v4. Unless the patch is
trivial, my vote is for a) preventing the mappings and only setting
GPUONLY on the tiler_heap starting with v5.


Re: [PATCH 1/2] iommu: arm-smmu-qcom: Add compatible for qcm2290

2021-10-01 Thread Bjorn Andersson
On Fri 01 Oct 07:00 PDT 2021, Loic Poulain wrote:

> Signed-off-by: Loic Poulain 
> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c 
> b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 55690af..d105186 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -412,6 +412,7 @@ static const struct of_device_id __maybe_unused 
> qcom_smmu_impl_of_match[] = {
>   { .compatible = "qcom,sm8150-smmu-500" },
>   { .compatible = "qcom,sm8250-smmu-500" },
>   { .compatible = "qcom,sm8350-smmu-500" },
> + { .compatible = "qcom,qcm2290-smmu-500" },

Would be nice if we kept the alphabetical sort order on these and in the
binding.

With that, please feel free to add my R-b

Regards,
Bjorn

>   { }
>  };
>  
> -- 
> 2.7.4
> 


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Logan Gunthorpe




On 2021-10-01 7:48 a.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote:
> 
>> Why would DAX want to do this in the first place?? This means the
>> address space zap is much more important that just speeding up
>> destruction, it is essential for correctness since the PTEs are not
>> holding refcounts naturally...
> 
> It is not really for this series to fix, but I think the whole thing
> is probably racy once you start allowing pte_special pages to be
> accessed by GUP.
> 
> If we look at unmapping the PTE relative to GUP fast the important
> sequence is how the TLB flushing doesn't decrement the page refcount
> until after it knows any concurrent GUP fast is completed. This is
> arch specific, eg it could be done async through a call_rcu handler.
> 
> This ensures that pages can't cross back into the free pool and be
> reallocated until we know for certain that nobody is walking the PTEs
> and could potentially take an additional reference on it. The scheme
> cannot rely on the page refcount being 0 because once it goes into the
> free pool it could be immediately reallocated back to a non-zero
> refcount.
> 
> A DAX user that simply does an address space invalidation doesn't
> sequence itself with any of this mechanism. So we can race with a
> thread doing GUP fast and another thread re-cycling the page into
> another use - creating a leakage of the page from one security context
> to another.
> 
> This seems to be made worse for the pgmap stuff due to the wonky
> refcount usage - at least if the refcount had dropped to zero gup fast
> would be blocked for a time, but even that doesn't happen.
> 
> In short, I think using pg special for anything that can be returned
> by gup fast (and maybe even gup!) is racy/wrong. We must have the
> normal refcount mechanism work for correctness of the recycling flow.

I'm not quite following all of this. I'm not entirely sure how fs/dax
works in this regard, but for device-dax (and similarly p2pdma) it
doesn't seem as bad as you say.

In device-dax, the refcount is only used to prevent the device, and
therefore the pages, from going away on device unbind. Pages cannot be
recycled, as you say, as they are mapped linearly within the device. The
address space invalidation is done only when the device is unbound.
Before the invalidation, an active flag is cleared to ensure no new
mappings can be created while the unmap is proceeding.
unmap_mapping_range() should sequence itself with the TLB flush and
GUP-fast using the same mechanism it does for regular pages. As far as I
can see, by the time unmap_mapping_range() returns, we should be
confident that there are no pages left in any mapping (seeing no new
pages could be added since before the call). Then before finishing the
unbind, device-dax decrements the refcount of all pages and then waits
for the refcount of all pages to go to zero. Thus, any pages that
successfully were got with GUP, during or before unmap_mapping_range
should hold a reference and once all those references are returned,
unbind can finish.
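
In pseudo-code, the teardown ordering being described is roughly the
following (hypothetical driver; all names are made up):

static void exdev_unbind(struct exdev *ed)
{
	unsigned long i;

	/* 1. Stop new mappings from being created. */
	WRITE_ONCE(ed->active, false);

	/* 2. Zap every existing userspace mapping of the device pages. */
	unmap_mapping_range(ed->mapping, 0, 0, 1);

	/* 3. Drop the driver's own reference on each page. */
	for (i = 0; i < ed->npages; i++)
		put_page(ed->pages[i]);

	/* 4. Wait for outstanding GUP references to be returned before the
	 *    backing memory is torn down. */
	wait_event(ed->unbind_wq, exdev_all_pages_idle(ed));
}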

P2PDMA follows this pattern, except pages are not mapped linearly and
are returned to the genalloc when their refcount falls to 1. This only
happens after a VMA is closed which should imply the PTEs have already
been unlinked from the pages. And the same situation occurs on unbind
with a flag preventing new mappings from being created before
unmap_mapping_range(), etc.

Not to say that all this couldn't use a big conceptual cleanup. A
similar question exists with the single find_special_page() user
(xen/gntdev) and it's definitely not clear what the differences are
between the find_special_page() and vmf_insert_mixed() techniques and
when one should be used over the other. Or could they both be merged to
use the same technique?

Logan


Re: [PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag

2021-10-01 Thread Boris Brezillon
On Fri, 1 Oct 2021 16:13:42 +0100
Steven Price  wrote:

> On 01/10/2021 15:34, Boris Brezillon wrote:
> > This lets the driver reduce shareability domain of the MMU mapping,
> > which can in turn reduce access time and save power on cache-coherent
> > systems.
> > 
> > Signed-off-by: Boris Brezillon   
> 
> This seems reasonable to me - it matches the kbase
> BASE_MEM_COHERENT_SYSTEM (only backwards obviously) and it worked
> reasonably well for the blob.
> 
> But I'm wondering if we need to do anything special to deal with the
> fact we will now have some non-coherent mappings on an otherwise
> coherent device.
> 
> There are certainly some oddities around how these buffers will be
> mapped into user space if requested, e.g. panfrost_gem_create_object()
> sets 'map_wc' based on pfdev->coherent which is arguably wrong for
> GPUONLY. So there are two things we could consider:
> 
> a) Actually prevent user space mapping GPUONLY flagged buffers. Which
> matches the intention of the name.

I intended to do that, just forgot to add wrappers around
drm_gem_shmem_{mmap,vmap}() to forbid CPU-mappings on gpuonly buffers.

> 
> b) Attempt to provide user space with the tools to safely interact with
> the buffers (this is the kbase approach). This does have the benefit of
> allowing *mostly* GPU access. An example here is the tiler heap where
> the CPU could zero out as necessary but mostly the GPU has ownership and
> the CPU never reads the contents. GPUONLY/DEVONLY might not be the best
> name in that case.

Uh, right, I forgot we had to zero the tiler heap on Midgard (most of
the time done with a WRITE_VALUE job, but there's an exception on some
old Midgard GPUs IIRC).


[git pull] IOMMU Fixes for Linux v5.15-rc3

2021-10-01 Thread Joerg Roedel
Hi Linus,

The following changes since commit 5816b3e6577eaa676ceb00a848f0fd65fe2adc29:

  Linux 5.15-rc3 (2021-09-26 14:08:19 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
tags/iommu-fixes-v5.15-rc3

for you to fetch changes up to f0b636804c7c4c564efbca5981e3c56b5c6fe0c5:

  iommu/dart: Clear sid2group entry when a group is freed (2021-09-28 11:47:24 
+0200)


IOMMU Fixes for Linux v5.15-rc3

Including:

- Two fixes for the new Apple DART driver to fix a kernel panic
  and a stale data usage issue

- Intel VT-d fix for how PCI device ids are printed


Bjorn Helgaas (1):
  iommu/vt-d: Drop "0x" prefix from PCI bus & device addresses

Sven Peter (2):
  iommu/dart: Remove iommu_flush_ops
  iommu/dart: Clear sid2group entry when a group is freed

 drivers/iommu/apple-dart.c | 56 +-
 drivers/iommu/intel/dmar.c |  6 ++---
 2 files changed, 38 insertions(+), 24 deletions(-)

Please pull.

Thanks,

Joerg



Re: [PATCH v2] iommu/arm-smmu-v3: poll cmdq until it has space

2021-10-01 Thread John Garry

On 26/09/2021 19:51, Fernand Sieber wrote:

When a thread sends commands to the SMMU, it needs to allocate some
space to write its commands in a ring buffer.

The allocation algorithm works as follows: until enough free space is
available in the queue, repeat the following outer loop. First, try to
acquire an exclusive lock to read the consumer index from the SMMU
register over MMIO. If that fails, it means that another thread holds
the lock (or multiple threads, in shared mode, for sync commands). The
current code guarantees that when a thread is holding the lock, the
consumer index will be updated from the SMMU register. So then when the
acquisition of the exclusive lock fails, we can safely assume that
another thread will eventually update the consumer index and enter an
inner waiting loop until that happens.



You have written a lot in the commit message. I think that this 
paragraph can be dropped to try to make it more concise and readable.



The problem that this patch fixes is that the waiting loop exits as soon
as any space is available in the queue, so it is possible that it exits
immediately if there's some space available but not enough to write the
thread's commands.


To me, this seems strange. From my experience, I would expect that if 
the queue is full, and then some space comes available after updating 
the cons pointer from HW, then the cons pointer would have advanced
far enough that we have more than enough space for the maximum number of 
commands in a batch, 64.



That means the cmdq allocation queue will fail (outer
loop) and the thread will spin attempting to acquire the exclusive lock
to update the consumer index from the SMMU register.

If a lot of threads are failing to allocate commands,


As above, what is special for your HW such that when the queue becomes 
not full there is still not enough space? The cons pointer is only 
updated at 2x locations:

- in arm_smmu_cmdq_poll_until_not_full()
- the last CPU gathered unlocks the shared lock at the function exit

So it's not updated very regularly.


this can cause
heavy contention on the lock, to the point where the system slows down or
livelocks. The livelock is possible if some other threads are attempting
to execute synchronous commands.


What are synchronous commands in this context? Do you mean we're issuing 
a CMD_SYNC?



These threads need to ensure that they
control updates to the consumer index so that they can correctly observe
when their command is executed; they enforce that by acquiring the lock
in shared mode. If there's too much contention, they never succeed in
acquiring the lock via the read+cmpxchg mechanism, and their progress
stalls.


Why is this? Why can they not get the lock? They spin waiting for it.


But because they already hold allocated space in the command
queue, their stall prevents progress from other threads attempting to
allocate space in the cmdq. This causes a livelock.

This patch makes the waiting loop exit as soon as enough space is
available, rather than as soon as any space is available. This means
that when two threads are competing for the exclusive lock when
allocating space in the queue, one of them will fail to acquire the
lock in exclusive mode and be knocked to the waiting loop and stay there
until there's enough free space rather than exiting it immediately when
any space is available. Threads in the waiting loop do not compete for
the lock, reducing contention enough to enable synchronous threads to
make progress, when applicable.
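
Conceptually, the change to the waiting loop boils down to this
(simplified pseudo-code, not the actual arm-smmu-v3 function):

	do {
		/* llq.val is refreshed by whichever CPU holds the lock */
		llq.val = READ_ONCE(cmdq->q.llq.val);
		if (queue_has_space(&llq, n))	/* was: !queue_full(&llq) */
			break;
		cpu_relax();
	} while (1);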

Note that we cannot afford to have all threads parked in the waiting
loop unless there are synchronous threads executing concurrently,


nit: I'd say CPU when describing this problem, rather than thread


otherwise no thread is observing the SMMU register and updating the
consumer index. Thus if we succeed in acquiring the lock in exclusive
mode, we cannot enter the waiting loop because we could be the last
thread observing the SMMU. Similarly, if the producer index is updated,
we need to exit the waiting loop because it could mean that the latest
thread to observe the SMMU has succeeded to allocate commands and thus
has moved on.

Signed-off-by: Fernand Sieber 
---
Changes in v2
   - Fix inverted condition to break out the loop when queue_has_space
   - Replace obsolete comment reference to llq->state->val by llq->val

  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 41 +++--
  1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a388e318f86e..a0a04cc9c57d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -118,12 +118,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, 
u32 n)
return space >= n;
  }

-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
-   return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
-  Q_WRP(q, q->prod) !=

Re: [PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag

2021-10-01 Thread Steven Price
On 01/10/2021 15:34, Boris Brezillon wrote:
> This lets the driver reduce shareability domain of the MMU mapping,
> which can in turn reduce access time and save power on cache-coherent
> systems.
> 
> Signed-off-by: Boris Brezillon 

This seems reasonable to me - it matches the kbase
BASE_MEM_COHERENT_SYSTEM (only backwards obviously) and it worked
reasonably well for the blob.

But I'm wondering if we need to do anything special to deal with the
fact we will now have some non-coherent mappings on an otherwise
coherent device.

There are certainly some oddities around how these buffers will be
mapped into user space if requested, e.g. panfrost_gem_create_object()
sets 'map_wc' based on pfdev->coherent which is arguably wrong for
GPUONLY. So there are two things we could consider:

a) Actually prevent user space mapping GPUONLY flagged buffers. Which
matches the intention of the name.

b) Attempt to provide user space with the tools to safely interact with
the buffers (this is the kbase approach). This does have the benefit of
allowing *mostly* GPU access. An example here is the tiler heap where
the CPU could zero out as necessary but mostly the GPU has ownership and
the CPU never reads the contents. GPUONLY/DEVONLY might not be the best
name in that case.

Any thoughts?

Thanks,

Steve

> ---
>  drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
>  drivers/gpu/drm/panfrost/panfrost_gem.c | 1 +
>  drivers/gpu/drm/panfrost/panfrost_gem.h | 1 +
>  drivers/gpu/drm/panfrost/panfrost_mmu.c | 3 +++
>  include/uapi/drm/panfrost_drm.h | 1 +
>  5 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
> b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index b29ac942ae2d..b176921b9392 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -77,7 +77,8 @@ static int panfrost_ioctl_get_param(struct drm_device 
> *ddev, void *data, struct
>  
>  #define PANFROST_BO_FLAGS \
>   (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \
> -  PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE)
> +  PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE | \
> +  PANFROST_BO_GPUONLY)
>  
>  static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
>   struct drm_file *file)
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c 
> b/drivers/gpu/drm/panfrost/panfrost_gem.c
> index d6c1bb1445f2..4b1f85c0b98f 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
> @@ -254,6 +254,7 @@ panfrost_gem_create_with_handle(struct drm_file 
> *file_priv,
>   bo->noread = !!(flags & PANFROST_BO_NOREAD);
>   bo->nowrite = !!(flags & PANFROST_BO_NOWRITE);
>   bo->is_heap = !!(flags & PANFROST_BO_HEAP);
> + bo->gpuonly = !!(flags & PANFROST_BO_GPUONLY);
>  
>   /*
>* Allocate an id of idr table where the obj is registered
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h 
> b/drivers/gpu/drm/panfrost/panfrost_gem.h
> index 6246b5fef446..e332d5a4c24f 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
> @@ -40,6 +40,7 @@ struct panfrost_gem_object {
>   bool noread :1;
>   bool nowrite:1;
>   bool is_heap:1;
> + bool gpuonly:1;
>  };
>  
>  struct panfrost_gem_mapping {
> diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
> b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> index 6a5c9d94d6f2..89eee8e80aa5 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> @@ -321,6 +321,9 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
>   if (!bo->noread)
>   prot |= IOMMU_READ;
>  
> + if (bo->gpuonly)
> + prot |= IOMMU_DEVONLY;
> +
>   sgt = drm_gem_shmem_get_pages_sgt(obj);
>   if (WARN_ON(IS_ERR(sgt)))
>   return PTR_ERR(sgt);
> diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
> index a2de81225125..538b58b2d095 100644
> --- a/include/uapi/drm/panfrost_drm.h
> +++ b/include/uapi/drm/panfrost_drm.h
> @@ -88,6 +88,7 @@ struct drm_panfrost_wait_bo {
>  #define PANFROST_BO_HEAP 2
>  #define PANFROST_BO_NOREAD   4
>  #define PANFROST_BO_NOWRITE  8
> +#define PANFROST_BO_GPUONLY  0x10
>  
>  /**
>   * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.
> 



Re: [PATCH kernel] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices

2021-10-01 Thread Michael Ellerman
On Thu, 30 Sep 2021 13:44:54 +1000, Alexey Kardashevskiy wrote:
> According to dma-api.rst, the dma_get_required_mask() helper should return
> "the mask that the platform requires to operate efficiently". Which in
> the case of PPC64 means the bypass mask and not a mask from an IOMMU table
> which is shorter and slower to use due to map/unmap operations (especially
> expensive on "pseries").
> 
> However the existing implementation ignores the possibility of bypassing
> and returns the IOMMU table mask on the pseries platform which makes some
> drivers (mpt3sas is one example) choose 32bit DMA even though bypass is
> supported. The powernv platform sort of handles it by having a bigger
> default window with a mask >=40 but it only works as drivers choose
> 63/64bit if the required mask is >32 which is rather pointless.
> 
> [...]
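
(For illustration, this is the usual pattern a PCI driver follows when
picking its DMA mask from this helper; a simplified sketch, not taken from
mpt3sas:)

	u64 required = dma_get_required_mask(&pdev->dev);
	int rc;

	/* With this fix, pseries reports the 64-bit bypass mask here, so a
	 * driver doing this check no longer falls back to 32-bit DMA. */
	if (required > DMA_BIT_MASK(32))
		rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
	else
		rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));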

Applied to powerpc/fixes.

[1/1] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices
  https://git.kernel.org/powerpc/c/23c216b335d1fbd716076e8263b54a714ea3cf0e

cheers


Re: [PATCH v2 3/5] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags

2021-10-01 Thread Boris Brezillon
On Fri,  1 Oct 2021 16:34:25 +0200
Boris Brezillon  wrote:

> So we can create GPU mappings without R/W permissions. Particularly
> useful to debug corruptions caused by out-of-bound writes.
> 
> Signed-off-by: Boris Brezillon 

Oops, forgot:

Reviewed-by: Alyssa Rosenzweig 
Reviewed-by: Steven Price 

> ---
>  drivers/gpu/drm/panfrost/panfrost_drv.c | 11 ++-
>  drivers/gpu/drm/panfrost/panfrost_gem.c |  2 ++
>  drivers/gpu/drm/panfrost/panfrost_gem.h |  2 ++
>  drivers/gpu/drm/panfrost/panfrost_mmu.c |  8 +++-
>  include/uapi/drm/panfrost_drm.h |  2 ++
>  5 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
> b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index 82ad9a67f251..b29ac942ae2d 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -75,6 +75,10 @@ static int panfrost_ioctl_get_param(struct drm_device 
> *ddev, void *data, struct
>   return 0;
>  }
>  
> +#define PANFROST_BO_FLAGS \
> + (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \
> +  PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE)
> +
>  static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
>   struct drm_file *file)
>  {
> @@ -84,7 +88,7 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, 
> void *data,
>   struct panfrost_gem_mapping *mapping;
>  
>   if (!args->size || args->pad ||
> - (args->flags & ~(PANFROST_BO_NOEXEC | PANFROST_BO_HEAP)))
> + (args->flags & ~PANFROST_BO_FLAGS))
>   return -EINVAL;
>  
>   /* Heaps should never be executable */
> @@ -92,6 +96,11 @@ static int panfrost_ioctl_create_bo(struct drm_device 
> *dev, void *data,
>   !(args->flags & PANFROST_BO_NOEXEC))
>   return -EINVAL;
>  
> + /* Executable implies readable */
> + if ((args->flags & PANFROST_BO_NOREAD) &&
> + !(args->flags & PANFROST_BO_NOEXEC))
> + return -EINVAL;
> +
>   bo = panfrost_gem_create_with_handle(file, dev, args->size, args->flags,
>&args->handle);
>   if (IS_ERR(bo))
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c 
> b/drivers/gpu/drm/panfrost/panfrost_gem.c
> index 23377481f4e3..d6c1bb1445f2 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
> @@ -251,6 +251,8 @@ panfrost_gem_create_with_handle(struct drm_file 
> *file_priv,
>  
>   bo = to_panfrost_bo(&shmem->base);
>   bo->noexec = !!(flags & PANFROST_BO_NOEXEC);
> + bo->noread = !!(flags & PANFROST_BO_NOREAD);
> + bo->nowrite = !!(flags & PANFROST_BO_NOWRITE);
>   bo->is_heap = !!(flags & PANFROST_BO_HEAP);
>  
>   /*
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h 
> b/drivers/gpu/drm/panfrost/panfrost_gem.h
> index 8088d5fd8480..6246b5fef446 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
> @@ -37,6 +37,8 @@ struct panfrost_gem_object {
>   atomic_t gpu_usecount;
>  
>   bool noexec :1;
> + bool noread :1;
> + bool nowrite:1;
>   bool is_heap:1;
>  };
>  
> diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
> b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> index f51d3f791a17..6a5c9d94d6f2 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> @@ -307,7 +307,7 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
>   struct drm_gem_object *obj = &bo->base.base;
>   struct panfrost_device *pfdev = to_panfrost_device(obj->dev);
>   struct sg_table *sgt;
> - int prot = IOMMU_READ | IOMMU_WRITE;
> + int prot = 0;
>  
>   if (WARN_ON(mapping->active))
>   return 0;
> @@ -315,6 +315,12 @@ int panfrost_mmu_map(struct panfrost_gem_mapping 
> *mapping)
>   if (bo->noexec)
>   prot |= IOMMU_NOEXEC;
>  
> + if (!bo->nowrite)
> + prot |= IOMMU_WRITE;
> +
> + if (!bo->noread)
> + prot |= IOMMU_READ;
> +
>   sgt = drm_gem_shmem_get_pages_sgt(obj);
>   if (WARN_ON(IS_ERR(sgt)))
>   return PTR_ERR(sgt);
> diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
> index 061e700dd06c..a2de81225125 100644
> --- a/include/uapi/drm/panfrost_drm.h
> +++ b/include/uapi/drm/panfrost_drm.h
> @@ -86,6 +86,8 @@ struct drm_panfrost_wait_bo {
>  
>  #define PANFROST_BO_NOEXEC   1
>  #define PANFROST_BO_HEAP 2
> +#define PANFROST_BO_NOREAD   4
> +#define PANFROST_BO_NOWRITE  8
>  
>  /**
>   * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.



[PATCH v2 5/5] drm/panfrost: Bump the driver version to 1.3

2021-10-01 Thread Boris Brezillon
Bump the driver version to 1.3 to account for the
PANFROST_BO_NO{READ,WRITE,GPUONLY} flags addition.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index b176921b9392..5afcec5ab4db 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -530,6 +530,7 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
  * - 1.0 - initial interface
  * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO
  * - 1.2 - adds AFBC_FEATURES query
+ * - 1.3 - adds PANFROST_BO_NO{READ,WRITE,GPUONLY} flags
  */
 static const struct drm_driver panfrost_drm_driver = {
.driver_features= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
@@ -542,7 +543,7 @@ static const struct drm_driver panfrost_drm_driver = {
.desc   = "panfrost DRM",
.date   = "20180908",
.major  = 1,
-   .minor  = 2,
+   .minor  = 3,
 
.gem_create_object  = panfrost_gem_create_object,
.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
-- 
2.31.1



[PATCH v2 2/5] [RFC]iommu/io-pgtable-arm: Take the DEVONLY flag into account on ARM_MALI_LPAE

2021-10-01 Thread Boris Brezillon
Restrict the shareability domain when mapping buffers that are
GPU-visible only.

Signed-off-by: Boris Brezillon 
---
Flagged RFC because I'm not sure adding a new flag is the right
way to convey the 'dev-private buffer' information.
---
 drivers/iommu/io-pgtable-arm.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index dd9e47189d0d..6ac3defb9ae1 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -450,16 +450,25 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
arm_lpae_io_pgtable *data,
<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
}
 
-   /*
-* Also Mali has its own notions of shareability wherein its Inner
-* domain covers the cores within the GPU, and its Outer domain is
-* "outside the GPU" (i.e. either the Inner or System domain in CPU
-* terms, depending on coherency).
-*/
-   if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE)
+   if (data->iop.fmt == ARM_MALI_LPAE) {
+   /*
+* Mali has its own notions of shareability wherein its Inner
+* domain covers the cores within the GPU, and its Outer domain
+* is "outside the GPU" (i.e. either the Inner or System domain
+* in CPU terms, depending on coherency).
+* If the mapping is only device-visible, we can use the Inner
+* domain, otherwise we need to stick to Outer domain
+* shareability.
+*/
+   if (prot & IOMMU_DEVONLY)
+   pte |= ARM_LPAE_PTE_SH_IS;
+   else
+   pte |= ARM_LPAE_PTE_SH_OS;
+   } else if (prot & IOMMU_CACHE) {
pte |= ARM_LPAE_PTE_SH_IS;
-   else
+   } else {
pte |= ARM_LPAE_PTE_SH_OS;
+   }
 
if (prot & IOMMU_NOEXEC)
pte |= ARM_LPAE_PTE_XN;
-- 
2.31.1



[PATCH v2 3/5] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags

2021-10-01 Thread Boris Brezillon
So we can create GPU mappings without R/W permissions. Particularly
useful to debug corruptions caused by out-of-bound writes.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 11 ++-
 drivers/gpu/drm/panfrost/panfrost_gem.c |  2 ++
 drivers/gpu/drm/panfrost/panfrost_gem.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  8 +++-
 include/uapi/drm/panfrost_drm.h |  2 ++
 5 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 82ad9a67f251..b29ac942ae2d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -75,6 +75,10 @@ static int panfrost_ioctl_get_param(struct drm_device *ddev, 
void *data, struct
return 0;
 }
 
+#define PANFROST_BO_FLAGS \
+   (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \
+PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE)
+
 static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
struct drm_file *file)
 {
@@ -84,7 +88,7 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, 
void *data,
struct panfrost_gem_mapping *mapping;
 
if (!args->size || args->pad ||
-   (args->flags & ~(PANFROST_BO_NOEXEC | PANFROST_BO_HEAP)))
+   (args->flags & ~PANFROST_BO_FLAGS))
return -EINVAL;
 
/* Heaps should never be executable */
@@ -92,6 +96,11 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, 
void *data,
!(args->flags & PANFROST_BO_NOEXEC))
return -EINVAL;
 
+   /* Executable implies readable */
+   if ((args->flags & PANFROST_BO_NOREAD) &&
+   !(args->flags & PANFROST_BO_NOEXEC))
+   return -EINVAL;
+
bo = panfrost_gem_create_with_handle(file, dev, args->size, args->flags,
 &args->handle);
if (IS_ERR(bo))
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c 
b/drivers/gpu/drm/panfrost/panfrost_gem.c
index 23377481f4e3..d6c1bb1445f2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -251,6 +251,8 @@ panfrost_gem_create_with_handle(struct drm_file *file_priv,
 
bo = to_panfrost_bo(&shmem->base);
bo->noexec = !!(flags & PANFROST_BO_NOEXEC);
+   bo->noread = !!(flags & PANFROST_BO_NOREAD);
+   bo->nowrite = !!(flags & PANFROST_BO_NOWRITE);
bo->is_heap = !!(flags & PANFROST_BO_HEAP);
 
/*
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h 
b/drivers/gpu/drm/panfrost/panfrost_gem.h
index 8088d5fd8480..6246b5fef446 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
@@ -37,6 +37,8 @@ struct panfrost_gem_object {
atomic_t gpu_usecount;
 
bool noexec :1;
+   bool noread :1;
+   bool nowrite:1;
bool is_heap:1;
 };
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index f51d3f791a17..6a5c9d94d6f2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -307,7 +307,7 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
struct drm_gem_object *obj = &bo->base.base;
struct panfrost_device *pfdev = to_panfrost_device(obj->dev);
struct sg_table *sgt;
-   int prot = IOMMU_READ | IOMMU_WRITE;
+   int prot = 0;
 
if (WARN_ON(mapping->active))
return 0;
@@ -315,6 +315,12 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
if (bo->noexec)
prot |= IOMMU_NOEXEC;
 
+   if (!bo->nowrite)
+   prot |= IOMMU_WRITE;
+
+   if (!bo->noread)
+   prot |= IOMMU_READ;
+
sgt = drm_gem_shmem_get_pages_sgt(obj);
if (WARN_ON(IS_ERR(sgt)))
return PTR_ERR(sgt);
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index 061e700dd06c..a2de81225125 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -86,6 +86,8 @@ struct drm_panfrost_wait_bo {
 
 #define PANFROST_BO_NOEXEC 1
 #define PANFROST_BO_HEAP   2
+#define PANFROST_BO_NOREAD 4
+#define PANFROST_BO_NOWRITE8
 
 /**
  * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.
-- 
2.31.1



[PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag

2021-10-01 Thread Boris Brezillon
This lets the driver reduce shareability domain of the MMU mapping,
which can in turn reduce access time and save power on cache-coherent
systems.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
 drivers/gpu/drm/panfrost/panfrost_gem.c | 1 +
 drivers/gpu/drm/panfrost/panfrost_gem.h | 1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 3 +++
 include/uapi/drm/panfrost_drm.h | 1 +
 5 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index b29ac942ae2d..b176921b9392 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -77,7 +77,8 @@ static int panfrost_ioctl_get_param(struct drm_device *ddev, 
void *data, struct
 
 #define PANFROST_BO_FLAGS \
(PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \
-PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE)
+PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE | \
+PANFROST_BO_GPUONLY)
 
 static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
struct drm_file *file)
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c 
b/drivers/gpu/drm/panfrost/panfrost_gem.c
index d6c1bb1445f2..4b1f85c0b98f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -254,6 +254,7 @@ panfrost_gem_create_with_handle(struct drm_file *file_priv,
bo->noread = !!(flags & PANFROST_BO_NOREAD);
bo->nowrite = !!(flags & PANFROST_BO_NOWRITE);
bo->is_heap = !!(flags & PANFROST_BO_HEAP);
+   bo->gpuonly = !!(flags & PANFROST_BO_GPUONLY);
 
/*
 * Allocate an id of idr table where the obj is registered
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
index 6246b5fef446..e332d5a4c24f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
@@ -40,6 +40,7 @@ struct panfrost_gem_object {
bool noread :1;
bool nowrite:1;
bool is_heap:1;
+   bool gpuonly:1;
 };
 
 struct panfrost_gem_mapping {
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 6a5c9d94d6f2..89eee8e80aa5 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -321,6 +321,9 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
if (!bo->noread)
prot |= IOMMU_READ;
 
+   if (bo->gpuonly)
+   prot |= IOMMU_DEVONLY;
+
sgt = drm_gem_shmem_get_pages_sgt(obj);
if (WARN_ON(IS_ERR(sgt)))
return PTR_ERR(sgt);
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index a2de81225125..538b58b2d095 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -88,6 +88,7 @@ struct drm_panfrost_wait_bo {
 #define PANFROST_BO_HEAP   2
 #define PANFROST_BO_NOREAD 4
 #define PANFROST_BO_NOWRITE8
+#define PANFROST_BO_GPUONLY0x10
 
 /**
  * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.
-- 
2.31.1
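
A similarly minimal userspace sketch for the new flag (the scratch-buffer
role and size are only illustrative, and it builds on the same assumptions
as the CREATE_BO example after the previous patch):

    /* Scratch memory only the GPU ever touches: the CPU never mmaps it and
     * no other device imports it, so GPUONLY is safe here. */
    struct drm_panfrost_create_bo req = {
        .size  = 64 * 1024,
        .flags = PANFROST_BO_GPUONLY | PANFROST_BO_NOEXEC,
    };

    if (drmIoctl(fd, DRM_IOCTL_PANFROST_CREATE_BO, &req))
        return -1;

    /* panfrost_mmu_map() adds IOMMU_DEVONLY to the mapping, letting the
     * io-pgtable code relax the shareability of the GPU-side PTEs. */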



[PATCH v2 0/5] drm/panfrost: Add extra GPU-usage flags

2021-10-01 Thread Boris Brezillon
Hello,

This is a follow-up to [1], which added the read/write restrictions on
GPU buffers. Robin and Steven suggested that I add a flag to restrict
the shareability domain on GPU-private buffers, so here it is.

As you can see, the first patch is flagged RFC, since I'm not sure
adding a new IOMMU_ flag is the right solution, but IOMMU_CACHE
doesn't feel like a good fit either. Please let me know if you have
better ideas.

Regards,

Boris

[1]https://patchwork.kernel.org/project/dri-devel/patch/20210930184723.1482426-1-boris.brezil...@collabora.com/

Boris Brezillon (5):
  [RFC]iommu: Add a IOMMU_DEVONLY protection flag
  [RFC]iommu/io-pgtable-arm: Take the DEVONLY flag into account on
ARM_MALI_LPAE
  drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags
  drm/panfrost: Add a PANFROST_BO_GPUONLY flag
  drm/panfrost: Bump the driver version to 1.3

 drivers/gpu/drm/panfrost/panfrost_drv.c | 15 +--
 drivers/gpu/drm/panfrost/panfrost_gem.c |  3 +++
 drivers/gpu/drm/panfrost/panfrost_gem.h |  3 +++
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 11 ++-
 drivers/iommu/io-pgtable-arm.c  | 25 +
 include/linux/iommu.h   |  7 +++
 include/uapi/drm/panfrost_drm.h |  3 +++
 7 files changed, 56 insertions(+), 11 deletions(-)

-- 
2.31.1



[PATCH v2 1/5] [RFC]iommu: Add a IOMMU_DEVONLY protection flag

2021-10-01 Thread Boris Brezillon
The IOMMU_DEVONLY flag allows the caller to flag mappings backed by
device-private buffers. That means other devices or CPUs are not
expected to access the physical memory region pointed to by the
mapping, and the MMU driver can safely restrict the shareability
domain to the device itself.

This will be used by the ARM MMU driver to flag Mali mappings accessed
only by the GPU as Inner-shareable.

Signed-off-by: Boris Brezillon 
---
 include/linux/iommu.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d2f3435e7d17..db14781b522f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -31,6 +31,13 @@
  * if the IOMMU page table format is equivalent.
  */
 #define IOMMU_PRIV (1 << 5)
+/*
+ * Mapping is only accessed by the device behind the iommu. That means other
+ * devices or CPUs are not expected to access this physical memory region,
+ * and the MMU driver can safely restrict the shareability domain to the
+ * device itself.
+ */
+#define IOMMU_DEVONLY  (1 << 6)
 
 struct iommu_ops;
 struct iommu_group;
-- 
2.31.1
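
For an IOMMU API consumer other than panfrost, usage would presumably look
like the sketch below; iommu_map() is the existing interface and only the
IOMMU_DEVONLY bit is new, while the domain/iova/paddr/size variables are
placeholders:

    /* Map a buffer that only this device will ever access. Because no
     * other master or CPU touches it, the page-table code may restrict
     * the shareability domain. */
    int prot = IOMMU_READ | IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_DEVONLY;
    int ret = iommu_map(domain, iova, paddr, size, prot);

    if (ret)
        dev_err(dev, "device-only mapping failed: %d\n", ret);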



[PATCH 1/2] iommu: arm-smmu-qcom: Add compatible for qcm2290

2021-10-01 Thread Loic Poulain
Signed-off-by: Loic Poulain 
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 55690af..d105186 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -412,6 +412,7 @@ static const struct of_device_id __maybe_unused qcom_smmu_impl_of_match[] = {
{ .compatible = "qcom,sm8150-smmu-500" },
{ .compatible = "qcom,sm8250-smmu-500" },
{ .compatible = "qcom,sm8350-smmu-500" },
+   { .compatible = "qcom,qcm2290-smmu-500" },
{ }
 };
 
-- 
2.7.4



Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-01 Thread Barry Song
On Wed, Sep 22, 2021 at 5:14 PM Jacob Pan  wrote:
>
> Hi Joerg/Jason/Christoph et all,
>
> The current in-kernel supervisor PASID support is based on the SVM/SVA
> machinery in sva-lib. Kernel SVA is achieved by extending a special flag
> to indicate the binding of the device and a page table should be performed
> on init_mm instead of the mm of the current process.Page requests and other
> differences between user and kernel SVA are handled as special cases.
>
> This unrestricted binding with the kernel page table is being challenged
> for security and the convention that in-kernel DMA must be compatible with
> DMA APIs.
> (https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/)
> There is also the lack of IOTLB synchronization upon kernel page table 
> updates.
>
> This patchset is trying to address these concerns by having an explicit DMA
> API compatible model while continuing to support in-kernel use of DMA requests
> with PASID. Specifically, the following DMA-IOMMU APIs are introduced:
>
> int iommu_dma_pasid_enable/disable(struct device *dev,
>struct iommu_domain **domain,
>enum iommu_dma_pasid_mode mode);
> int iommu_map/unmap_kva(struct iommu_domain *domain,
> void *cpu_addr,size_t size, int prot);
>
> The following three addressing modes are supported with example API usages
> by device drivers.
>
> 1. Physical address (bypass) mode. Similar to DMA direct where trusted devices
> can DMA pass through IOMMU on a per PASID basis.
> Example:
> pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_BYPASS);
> /* Use the returning PASID and PA for work submission */
>
> 2. IOVA mode. DMA API compatible. Map a supervisor PASID the same way as the
> PCI requester ID (RID)
> Example:
> pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_IOVA);
> /* Use the PASID and DMA API allocated IOVA for work submission */

Hi Jacob,
this might be a stupid question, but what is the performance benefit of this
IOVA mode compared with the current dma_map/unmap_single/sg APIs, which
already go through an enabled IOMMU such as drivers/iommu/arm/arm-smmu-v3?
Do we still need to flush the IOTLB by sending commands to the IOMMU each
time we do a dma_unmap?
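
For reference, the pattern being compared against is roughly the standard
streaming DMA API below (a generic sketch, not code from any particular
driver); in strict IOMMU mode every unmap has to issue IOTLB invalidation
commands before the IOVA can be reused:

    dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

    if (dma_mapping_error(dev, dma))
        return -ENOMEM;

    /* ... hand 'dma' to the device and wait for completion ... */

    /* In strict mode this is where the IOTLB invalidation happens,
     * e.g. via the SMMUv3 command queue. */
    dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);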

>
> 3. KVA mode. New kva map/unmap APIs. Support fast and strict sub-modes
> transparently based on device trustfulness.
> Example:
> pasid = iommu_dma_pasid_enable(dev, &domain, IOMMU_DMA_PASID_KVA);
> iommu_map_kva(domain, &buf, size, prot);
> /* Use the returned PASID and KVA to submit work */
> Where:
> Fast mode: Shared CPU page tables for trusted devices only
> Strict mode: IOMMU domain returned for the untrusted device to
> replicate KVA-PA mapping in IOMMU page tables.

A huge IOMMU bottleneck we have seen before is that dma_unmap requires an
IOTLB flush. For example, in arm_smmu_cmdq_issue_cmdlist() we see serious
contention on acquiring the lock, and delays waiting for IOTLB flush
completion in arm_smmu_cmdq_poll_until_sync(), when multiple threads run.

I assume KVA mode can avoid this IOTLB flush, as the device uses the
kernel's page table and shares the whole kernel address space. But will
users be glad to accept this mode?
It seems users endure the performance decrease of IOVA mapping and
unmapping because it gives better security: DMA operations can only target
the specific DMA buffers that have been mapped through the current
dma_map/unmap path with an IOMMU backend. Some drivers use a bounce buffer
to overcome the performance loss of dma_map/unmap, as copying is faster
than unmapping:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f

BTW, we have been debugging dma_map/unmap performance with this benchmark:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/dma/map_benchmark.c
You might be able to use it for your benchmarking as well :-)

>
> On a per device basis, DMA address and performance modes are enabled by the
> device drivers. Platform information such as trustability, user command line
> input (not included in this set) could also be taken into consideration (not
> implemented in this RFC).
>
> This RFC is intended to communicate the API directions. Little testing is done
> outside IDXD and DMA engine tests.
>
> For PA and IOVA modes, the implementation is straightforward and tested with
> Intel IDXD driver. But several opens remain in KVA fast mode thus not tested:
> 1. Lack of IOTLB synchronization, kernel direct map alias can be updated as a
> result of module loading/eBPF load. Adding kernel mmu notifier?
> 2. The use of the auxiliary domain for KVA map, will aux domain stay in the
> long term? Is there another way to represent sub-device granu isolation?
> 3. Is limiting the KVA sharing to the direct map range reasonable and
> practical for all architectures?
>
>
> Many thanks to Ashok Raj, Kevin 

Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-01 Thread Barry Song
On Sat, Oct 2, 2021 at 1:36 AM Jason Gunthorpe  wrote:
>
> On Sat, Oct 02, 2021 at 01:24:54AM +1300, Barry Song wrote:
>
> > I assume KVA mode can avoid this iotlb flush as the device is using
> > the page table of the kernel and sharing the whole kernel space. But
> > will users be glad to accept this mode?
>
> You can avoid the lock by identity mapping the physical address space
> of the kernel and making map/unmap a NOP.
>
> KVA is just a different way to achieve this identity map with slightly
> different security properties than the normal way, but it doesn't
> reach to the same security level as proper map/unmap.
>
> I'm not sure anyone who cares about DMA security would see value in
> the slight difference between KVA and a normal identity map.

Yes. This is an important question. If users want a high security level,
KVA might not be their choice; if users don't want the security, they are
already using IOMMU passthrough. So when will users choose KVA?

>
> > which have been mapped in the current dma-map/unmap with IOMMU backend.
> > some drivers are using bouncing buffer to overcome the performance loss of
> > dma_map/unmap as copying is faster than unmapping:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f
>
> It is pretty unfortunate that drivers are hard-coding behaviors based
> on assumptions of what the portable API is doing under the covers.

Not really, when there is a tx_copybreak which can be set by ethtool or
similar userspace tools. If users are using IOMMU passthrough, copying
won't happen with the default tx_copybreak. If users are using strict
IOMMU mode, socket buffers are copied into buffers already allocated and
mapped by the driver, so this doesn't require mapping and unmapping
socket buffers frequently.

>
> Jason

Thanks
barry


[PATCH 2/2] dt-bindings: arm-smmu: Add qcm2290 compatible strings

2021-10-01 Thread Loic Poulain
Add compatible strings for qcm2290 iommu to documentation.

Signed-off-by: Loic Poulain 
---
 Documentation/devicetree/bindings/iommu/arm,smmu.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
index 03f2b2d..e66f20d 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
@@ -40,6 +40,7 @@ properties:
   - qcom,sm8150-smmu-500
   - qcom,sm8250-smmu-500
   - qcom,sm8350-smmu-500
+  - qcom,qcm2290-smmu-500
   - const: arm,mmu-500
   - description: Qcom Adreno GPUs implementing "arm,smmu-v2"
 items:
-- 
2.7.4



Re: [PATCH 0/2] arm64: retry dropping HAVE_ARCH_PFN_VALID

2021-10-01 Thread Will Deacon
On Thu, 30 Sep 2021 04:30:37 +0300, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Hi,
> 
> This is a new attempt to drop HAVE_ARCH_PFN_VALID on arm64 and start using
> the generic implementation of pfn_valid().
> 
> [...]

Applied to arm64 (for-next/pfn-valid), thanks!

[1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource
  https://git.kernel.org/arm64/c/a9c38c5d267c
[2/2] arm64/mm: drop HAVE_ARCH_PFN_VALID
  https://git.kernel.org/arm64/c/3de360c3fdb3

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-01 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote:

> Why would DAX want to do this in the first place?? This means the
> address space zap is much more important that just speeding up
> destruction, it is essential for correctness since the PTEs are not
> holding refcounts naturally...

It is not really for this series to fix, but I think the whole thing
is probably racy once you start allowing pte_special pages to be
accessed by GUP.

If we look at unmapping the PTE relative to GUP fast the important
sequence is how the TLB flushing doesn't decrement the page refcount
until after it knows any concurrent GUP fast is completed. This is
arch specific, eg it could be done async through a call_rcu handler.

This ensures that pages can't cross back into the free pool and be
reallocated until we know for certain that nobody is walking the PTEs
and could potentially take an additional reference on it. The scheme
cannot rely on the page refcount being 0 because once it goes into the
free pool it could be immediately reallocated back to a non-zero
refcount.

A DAX user that simply does an address space invalidation doesn't
sequence itself with any of this mechanism. So we can race with a
thread doing GUP fast and another thread re-cycling the page into
another use - creating a leakage of the page from one security context
to another.

This seems to be made worse for the pgmap stuff due to the wonky
refcount usage - at least if the refcount had dropped to zero gup fast
would be blocked for a time, but even that doesn't happen.

In short, I think using pg special for anything that can be returned
by gup fast (and maybe even gup!) is racy/wrong. We must have the
normal refcount mechanism work for correctness of the recycling flow.

I don't know why DAX did this. I think we should be talking about
undoing all of it, not just the wonky refcounting Alistair and Felix
are working on, but also the use of MIXEDMAP and pte special for
struct page backed memory.

Jason


Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-10-01 Thread Gerald Schaefer
On Thu, 30 Sep 2021 15:37:33 +0200
Karsten Graul  wrote:

> On 14/09/2021 17:45, Ioana Ciornei wrote:
> > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> >> +DPAA2, netdev maintainers
> >> Hi,
> >>
> >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> >>> Since overlapping mappings are not supported by the DMA API, we should
> >>> report an error if active_cacheline_insert returns -EEXIST.
> >>
> >> It seems this patch found a victim. I was trying to run iperf3 on a
> >> honeycomb (5.14.0, fedora 35) and the console is blasting this error 
> >> message
> >> at 100% cpu. So, I changed it to a WARN_ONCE() to get the call trace, which
> >> is attached below.
> >>
> > 
> > These frags are allocated by the stack, transformed into a scatterlist
> > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > dpaa2-eth's decision to use two fragments from the same page (that will
> > also end un in the same cacheline) in two different in-flight skbs.
> > 
> > Is this behavior normal?
> > 
> 
> We see the same problem here and it started with 5.15-rc2 in our nightly CI 
> runs.
> The CI has panic_on_warn enabled so we see the panic every day now.

Adding a WARN for a case that can be detected as a false positive seems
not acceptable, exactly for this reason (kernel panic on unaffected
systems).

So I guess it boils down to the question whether the behavior that Ioana
described is legitimate on a system that is DMA coherent. We are
apparently hitting the same scenario, although it could not yet be
reproduced with debug printks for some reason.

If the answer is yes, then please remove at least the WARN, so that
it will not crash systems that behave validly and have panic_on_warn
set. Even a normal printk feels wrong to me in that case; it rather
sounds like you want to fix / better refine the overlap check, if you
want to report anything here.

BTW, there is already a WARN in the add_dma_entry() path, related
to cacheline overlap and -EEXIST:

add_dma_entry() -> active_cacheline_insert() -> -EEXIST ->
active_cacheline_inc_overlap()

That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP".
Not familiar with that code, but it seems that there are now two
warnings for more or less the same, and the new warning is much more
prone to false-positives.

How do these 2 warnings relate, are they both really necessary?
I think the new warning was only introduced because of some old
TODO comment in add_dma_entry(), see commit 2b4bbc6231d78
("dma-debug: report -EEXIST errors in add_dma_entry").

That comment was initially added by Dan a long time ago, and he
added several fix-ups for overlap detection after that, including
the "overlap > ACTIVE_CACHELINE_MAX_OVERLAP" stuff in
active_cacheline_inc_overlap(). So could it be that the TODO
comment was simply not valid any more, and would better be removed
instead of adding new / duplicate warnings that also generate
false positives and kernel crashes?


Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

2021-10-01 Thread Jason Gunthorpe via iommu
On Thu, Sep 30, 2021 at 01:10:29PM +1000, David Gibson wrote:
> On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> > 
> > > > +struct iommufd_device {
> > > > +   unsigned int id;
> > > > +   struct iommufd_ctx *ictx;
> > > > +   struct device *dev; /* always be the physical device */
> > > > +   u64 dev_cookie;
> > > 
> > > Why do you need both an 'id' and a 'dev_cookie'?  Since they're both
> > > unique, couldn't you just use the cookie directly as the index into
> > > the xarray?
> > 
> > ID is the kernel value in the xarray - xarray is much more efficient &
> > safe with small kernel controlled values.
> > 
> > dev_cookie is a user-assigned value that may not be unique. Its
> > purpose is to allow userspace to receive an event and go back to its
> > structure. Most likely userspace will store a pointer here, but it is
> > also possible userspace could not use it.
> > 
> > It is a pretty normal pattern
> 
> Hm, ok.  Could you point me at an example?

For instance user_data vs fd in io_uring

RDMA has many similar examples.

More or less any time you want to allow the kernel to asynchronously
return some information, providing a 64-bit user_data lets userspace
have an easier time dealing with it.
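
A rough sketch of that pattern (illustrative names only, not the actual
iommufd code), assuming an xarray-backed object table:

    #include <linux/xarray.h>

    struct obj {
        u32 id;           /* small kernel-chosen xarray index */
        u64 user_cookie;  /* opaque userspace value, echoed back in events */
    };

    static int obj_register(struct xarray *xa, struct obj *o, u64 cookie)
    {
        o->user_cookie = cookie;
        /* xa_alloc() picks a small free index; lookups stay cheap and the
         * kernel never has to trust a user-supplied value as an index. */
        return xa_alloc(xa, &o->id, o, xa_limit_32b, GFP_KERNEL);
    }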

Jason 




Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-01 Thread Jason Gunthorpe via iommu
On Sat, Oct 02, 2021 at 01:24:54AM +1300, Barry Song wrote:

> I assume KVA mode can avoid this iotlb flush as the device is using
> the page table of the kernel and sharing the whole kernel space. But
> will users be glad to accept this mode?

You can avoid the lock by identity mapping the physical address space
of the kernel and making map/unmap a NOP.

KVA is just a different way to achieve this identity map with slightly
different security properties than the normal way, but it doesn't
reach to the same security level as proper map/unmap.

I'm not sure anyone who cares about DMA security would see value in
the slight difference between KVA and a normal identity map.

> which have been mapped in the current dma-map/unmap with IOMMU backend.
> some drivers are using bouncing buffer to overcome the performance loss of
> dma_map/unmap as copying is faster than unmapping:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f

It is pretty unfortunate that drivers are hard-coding behaviors based
on assumptions of what the portable API is doing under the covers.

Jason


Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-01 Thread Jason Gunthorpe via iommu
On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote:
> On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > > 
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enfore_snoop format. enforce_snoop must be true at this 
> > > > > point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in 
> > > > > vfio.
> > > > >   Per previous discussion they can also use vfio type1v2 as long as 
> > > > > there
> > > > >   is a way to claim a specific iova range from a system-wide address 
> > > > > space.
> > > > >   This requirement doesn't sound PPC specific, as addr_width for pci
> > > > devices
> > > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC 
> > > > > hasn't
> > > > >   adopted this design yet. We hope to have formal alignment in v1
> > > > discussion
> > > > >   and then decide how to incorporate it in v2.
> > > > 
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > > 
> > > is the hint single-range or could be multiple-ranges?
> > 
> > David explained it here:
> > 
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> 
> Apparently not well enough.  I've attempted again in this thread.
> 
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
> 
> No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> guest, and therefore needs to request both from the host.

As I understood your remarks each IOAS can only be one of the formats
as they have a different PTE layout. So here I meant that qemu needs to
be able to pick *for each IOAS* which of the two formats it is.

> Or rather, it *might* need to supply both.  It will supply just the
> 32-bit range by default, but the guest can request the 64-bit range
> and/or remove and resize the 32-bit range via hypercall interfaces.
> Vaguely recent Linux guests certainly will request the 64-bit range in
> addition to the default 32-bit range.

And this would result in two different IOAS objects

Jason


Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-01 Thread Jason Gunthorpe via iommu
On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote:
> On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > > 
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enfore_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > > 
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > > 
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address 
> > > space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci 
> > > devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1 
> > > discussion
> > >   and then decide how to incorporate it in v2.
> > 
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> > actual geometry including any holes via a query.
> 
> So part of the point of specifying start/end addresses is that
> explicitly querying holes shouldn't be necessary: if the requested
> range crosses a hole, it should fail.  If you didn't really need all
> that range, you shouldn't have asked for it.
> 
> Which means these aren't really "hints" but optionally supplied
> constraints.

We have to be very careful here, there are two very different use
cases. When we are talking about the generic API I am mostly
interested to see that applications like DPDK can use this API and be
portable to any IOMMU HW the kernel supports. I view the fact that
there is VFIO PPC specific code in DPDK as a failing of the kernel to
provide a HW abstraction.

This means we cannot define an input that has a magic HW specific
value. DPDK can never provide that portably. Thus all these kinds of
inputs in the generic API need to be hints, if they exist at all.

As 'address space size hint'/'address space start hint' is both
generic, useful, and providable by DPDK I think it is OK. PPC can use
it to pick which of the two page table formats to use for this IOAS if
it wants.

The second use case is when we have a userspace driver for a specific
HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration.
We can look here for things to make general, but I would expect a
fairly high bar. Instead, I would rather see the userspace driver
communicate with the kernel driver in its own private language, so
that the entire functionality of the unique HW can be used.

So, when it comes to providing exact ranges as an input parameter we
have to decide if that is done as some additional general data, or if
it should be part of a IOAS_FORMAT_KERNEL_PPC. In this case I suggest
the guiding factor should be if every single IOMMU implementation can
be updated to support the value.

Jason




Re: [PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC

2021-10-01 Thread Jonathan Cameron
On Thu, 23 Sep 2021 10:26:44 -0700
Ben Widawsky  wrote:

> Add pci_find_dvsec_capability to locate a Designated Vendor-Specific
> Extended Capability with the specified DVSEC ID.
> 
> The Designated Vendor-Specific Extended Capability (DVSEC) allows one or
> more vendor specific capabilities that aren't tied to the vendor ID of
> the PCI component.
> 
> DVSEC is critical for both the Compute Express Link (CXL) driver as well
> as the driver for OpenCAPI coherent accelerator (OCXL).
> 
> Cc: David E. Box 
> Cc: Jonathan Cameron 
> Cc: Bjorn Helgaas 
> Cc: Dan Williams 
> Cc: linux-...@vger.kernel.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Andrew Donnellan 
> Cc: Lu Baolu 
> Reviewed-by: Frederic Barrat 
> Signed-off-by: Ben Widawsky 

Great to see this cleaned up.

Reviewed-by: Jonathan Cameron 

> ---
>  drivers/pci/pci.c   | 32 
>  include/linux/pci.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ce2ab62b64cf..94ac86ff28b0 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -732,6 +732,38 @@ u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap)
>  }
>  EXPORT_SYMBOL_GPL(pci_find_vsec_capability);
>  
> +/**
> + * pci_find_dvsec_capability - Find DVSEC for vendor
> + * @dev: PCI device to query
> + * @vendor: Vendor ID to match for the DVSEC
> + * @dvsec: Designated Vendor-specific capability ID
> + *
> + * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability
> + * offset in config space; otherwise return 0.
> + */
> +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
> +{
> + int pos;
> +
> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
> + if (!pos)
> + return 0;
> +
> + while (pos) {
> + u16 v, id;
> +
> + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &v);
> + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &id);
> + if (vendor == v && dvsec == id)
> + return pos;
> +
> + pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_find_dvsec_capability);
> +
>  /**
>   * pci_find_parent_resource - return resource region of parent bus of given
>   * region
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index cd8aa6fce204..c93ccfa4571b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1130,6 +1130,7 @@ u16 pci_find_ext_capability(struct pci_dev *dev, int cap);
>  u16 pci_find_next_ext_capability(struct pci_dev *dev, u16 pos, int cap);
>  struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
>  u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap);
> +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec);
>  
>  u64 pci_get_dsn(struct pci_dev *dev);
>  
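
As a usage illustration (the vendor ID, DVSEC ID and register offset below
are placeholders, not taken from this series), a caller would look roughly
like:

    /* Locate our DVSEC instance and read a vendor-defined register. */
    u16 pos = pci_find_dvsec_capability(pdev, MY_VENDOR_ID, MY_DVSEC_ID);
    u32 val;

    if (!pos)
        return -ENODEV;

    pci_read_config_dword(pdev, pos + MY_DVSEC_REG_OFFSET, &val);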



Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-10-01 Thread Ioana Ciornei
On Fri, Oct 01, 2021 at 06:19:59AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 14, 2021 at 03:45:06PM +, Ioana Ciornei wrote:
> > [  245.927020] fsl_dpaa2_eth dpni.3: scather-gather idx 0 P=20a732 N=20a7320 D=20a732 L=30 DMA_BIDIRECTIONAL dma map error check not applicable
> > [  245.927048] fsl_dpaa2_eth dpni.3: scather-gather idx 1 P=20a7320030 N=20a7320 D=20a7320030 L=5a8 DMA_BIDIRECTIONAL dma map error check not applicable
> > [  245.927062] DMA-API: cacheline tracking EEXIST, overlapping mappings 
> > aren't supported
> > 
> > The first line is the dump of the dma_debug_entry which is already present
> > in the radix tree and the second one is the entry which just triggered
> > the EEXIST.
> > 
> > As we can see, they are not actually overlapping, at least from my
> > understanding. The first one starts at 0x20a732 with a size 0x30
> > and the second one at 0x20a7320030.
> 
> They overlap the cache lines.  Which means if you use this driver
> on a system that is not dma coherent you will corrupt data.
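
Worked through with 64-byte cachelines (L1_CACHE_SHIFT == 6 on this arm64
system), and assuming the first entry's truncated address is really
0x20a7320000, the two entries do share a cacheline even though their byte
ranges do not overlap:

    /* entry 0: 0x20a7320000 .. 0x20a732002f  (len 0x30)  */
    /* entry 1: 0x20a7320030 .. 0x20a73205d7  (len 0x5a8) */
    /* 0x20a732002f >> 6 == 0x20a7320030 >> 6 == 0x829cc800 */
    /* so active_cacheline_insert() sees the same cacheline twice -> -EEXIST */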

This is a driver of an integrated ethernet controller which is DMA
coherent.

I added a print just to make sure of this:

--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -567,6 +567,7 @@ static void add_dma_entry(struct dma_debug_entry *entry)
pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");
global_disable = true;
} else if (rc == -EEXIST) {
+   pr_err("dev_is_dma_coherent(%s) = %d\n", dev_name(entry->dev), dev_is_dma_coherent(entry->dev));
err_printk(entry->dev, entry,
"cacheline tracking EEXIST, overlapping mappings aren't supported\n");
}


[   85.852218] DMA-API: dev_is_dma_coherent(dpni.3) = 1
[   85.858891] [ cut here ]
[   85.858893] DMA-API: fsl_dpaa2_eth dpni.3: cacheline tracking EEXIST, overlapping mappings aren't supported
[   85.858901] WARNING: CPU: 13 PID: 1046 at kernel/dma/debug.c:571 add_dma_entry+0x330/0x390
[   85.858911] Modules linked in:
[   85.858915] CPU: 13 PID: 1046 Comm: iperf3 Not tainted 5.15.0-rc2-00478-g34286ba6a164-dirty #1275
[   85.858919] Hardware name: NXP Layerscape LX2160ARDB (DT)


Shouldn't this kind of warning be skipped in this case?

Ioana