Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Thu, 2018-08-23 at 07:24 +0200, Christoph Hellwig wrote:
> > Well, iommus can have bypass regions, which we also use for
> > performance, so we do at dma_set_mask() time "swap" the ops around, and
> > in that case, we do want to check the mask against the actual top of
> > memory...
> 
> That is a bit of a powerpc special case (we also had one other arch
> doing that, but it got removed in the great purge, can't remember which
> one right now).  Everyone else has one set of ops, and they just switch
> to the direct mapping inside the iommu ops.

We more or less do that too in some of ours these days because of the
whole coherent_mask vs. mask distinction, where a given device might need
either depending on the type of mapping.

Ben.



Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 23, 2018 at 10:01:45AM +1000, Benjamin Herrenschmidt wrote:
> > The general scheme that architectures should implement is:
> > 
> > ZONE_DMA:   Any memory below a magic threshold that is lower than
> > 32-bit.  Only enabled if actually required (usually
> > either 24-bit for ISA, or some other weird architecture
> > specific value like 32-bit for S/390)
> 
> It should have been ZONE_ISA_DMA :-)

For most of these use cases it should have been indeed, and that
would avoid a lot of confusion where people use GFP_DMA just because
they do DMA.

Anyway, switching powerpc to this scheme would be great, but I don't
think it is required - GFP_KERNEL will silently fall back to ZONE_DMA,
so except for an additional GFP_DMA fallback allocation when the
GFP_KERNEL one fails, the code should just work.
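
[ A minimal sketch of the fallback described here, loosely modeled on
the 4.18-era dma_direct_alloc(); not the exact kernel code: ]

static struct page *alloc_with_dma_fallback(struct device *dev, size_t size,
					    gfp_t gfp, unsigned int order)
{
	struct page *page;

again:
	page = alloc_pages_node(dev_to_node(dev), gfp, order);
	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
		/* GFP_KERNEL handed back memory above the device's reach */
		__free_pages(page, order);
		page = NULL;
		if (IS_ENABLED(CONFIG_ZONE_DMA) && !(gfp & GFP_DMA)) {
			/* the "additional GFP_DMA fallback allocation" */
			gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
			goto again;
		}
	}
	return page;
}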


Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 23, 2018 at 09:59:18AM +1000, Benjamin Herrenschmidt wrote:
> > Yeah, the other platforms that support these devices support ZONE_DMA
> > to reliably handle these devices. But there are two other ways the
> > current code would actually handle these fine despite the dma_direct
> > checks:
> > 
> >  1) if the device only has physical addresses up to 31-bit anyway
> >  2) by trying again to find a lower address.  But this only works
> > for coherent allocations and not streaming maps (unless we have
> > swiotlb with a buffer below 31-bits).
> > 
> > It seems powerpc can have ZONE_DMA, though, and we will cover these
> > devices just fine.  If it didn't have that, the current powerpc
> > code would not work either.
> 
> Not exactly. powerpc has ZONE_DMA covering all of system memory.
> 
> What happens in ppc32 is that we somewhat "know" that none of the
> systems with those stupid 31-bit limited pieces of HW is capable of
> having more than 2GB of memory anyway.
> 
> So we get away with just returning "1".

I think I can come up with a proper way of handling that by checking
the actual amount of physical memory present instead of the hard-coded
32-bit limit.
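
[ A sketch of that check: compare the mask against the DMA address of
the highest page actually present (max_pfn comes from the generic mm
code), instead of a hard-coded 32-bit constant.  Illustrative only: ]

static int dma_direct_supported_sketch(struct device *dev, u64 mask)
{
	u64 max_dma = phys_to_dma(dev, (u64)(max_pfn - 1) << PAGE_SHIFT);

	return mask >= max_dma;
}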

> > If your PCI bridge / PCIe root port doesn't support dma to addresses
> > larger than 32-bit the device capabilities above that don't matter, it
> > just won't work.  We have this case at least for some old VIA x86 chipsets
> > and some relatively modern Xilinx FPGAs with PCIe.
> 
> Hrm... that's the usual confusion between dma_capable() and dma_set_mask().
> 
> It's always been perfectly fine for a driver to do a dma_set_mask(64-bit)
> on a system where the bridge can only do 32-bits ...

No, it hasn't.  That's why we have this pattern of trying a 64-bit
mask first and then setting a 32-bit mask if that fails all over
drivers/.  However with all the work we've done over the last month
we are getting really close to a world where:

 - the driver just does one dma_set_mask for the capabilities and
   stores that in the dma_mask
 - other limitations go elsewhere and will be automatically taken
   into account.

Which is I guess what you always wanted, but which wasn't how things
actually worked before.
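
[ For reference, the driver-side pattern referred to above; pdev is any
PCI device here, and the snippet is a sketch rather than any specific
driver: ]

if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
		return -EIO;	/* no usable DMA mask */
}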

> We shouldn't fail there, we should instead "clamp" the mask to 32-bit,
> see what I mean ? It doesn't matter that the device itself is capable
> of issuing >32 addresses, I agree, but what we need to express is that
> the combination device+bridge doesn't want addresses above 32-bit, so
> it's equivalent to making the device do a set_mask(32-bit).

As said, we'll get there (but with the new separate bus_dma_mask in 4.19),
but this is not how things currently work.

> > Your observation is right, but there always has been the implicit
> > assumption that architectures with more than 4GB of physical address
> > space must either support and iommu or swiotlb and use that.  It's
> > never been document anywhere, but I'm working on integrating all
> > this code to make more sense.
> 
> Well, iommus can have bypass regions, which we also use for
> performance, so we do at dma_set_mask() time "swap" the ops around, and
> in that case, we do want to check the mask against the actual top of
> memory...

That is a bit of a powerpc special case (we also had one other arch
doing that, but it got removed in the great purge, can't remember which
one right now).  Everyone else has one set of ops, and they just switch
to the direct mapping inside the iommu ops.
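
[ A sketch of that "switch inside the ops" pattern; the helpers
device_is_identity_mapped() and iommu_real_map_page() are hypothetical
stand-ins for whatever a given IOMMU driver actually uses: ]

static dma_addr_t my_map_page(struct device *dev, struct page *page,
			      unsigned long offset, size_t size,
			      enum dma_data_direction dir,
			      unsigned long attrs)
{
	if (device_is_identity_mapped(dev))	/* hypothetical check */
		return dma_direct_map_page(dev, page, offset, size,
					   dir, attrs);
	return iommu_real_map_page(dev, page, offset, size, dir, attrs);
}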


Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:58 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:54:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > We need to take the DMA offset and encryption bit into account when selecting
> > > a zone.  Add a helper that takes those into account and use it.
> > 
> > That whole "encryption" stuff seems to be completely specific to the
> > way x86 does memory encryption, or am I mistaken ? It's not clear to me
> > what that does in practice and how it relates to DMA mappings.
> 
> Not even all of x86, but AMD in particular; Intel does it yet another
> way.  But it still is easier to take this into the core with a few
> overrides than duplicating all the code.
> 
> > I'm also not sure about that whole business with ZONE_DMA and
> > ARCH_ZONE_DMA_BITS...
> 
> ZONE_DMA usually (but not always) maps to 24 bits of address space;
> if it doesn't (I was mostly thinking about s390 with its odd 31 bits)
> the architecture can override it if it cares.
> 
> > On ppc64, unless you enable swiotlb (which we only do currently on
> > some embedded platforms), you have all of memory in ZONE_DMA.
> > 
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x-0x001f]
> > [0.00]   DMA32empty
> > [0.00]   Normal   empty
> > [0.00]   Device   empty
> 
> This is really weird.  Why would you wire up ZONE_DMA like this?

We always did :-) It predates my involvement and I think it predates
even Paul's. It's quite silly actually since the first powerpc machines
actually had ISA devices in them, but that's how it's been forever. I
suppose we could change it but that would mean digging out some old
stuff to test.

> The general scheme that architectures should implement is:
> 
> ZONE_DMA: Any memory below a magic threshold that is lower than
>   32-bit.  Only enabled if actually required (usually
>   either 24-bit for ISA, or some other weird architecture
>   specific value like 32-bit for S/390)

It should have been ZONE_ISA_DMA :-)

> ZONE_DMA32:   Memory <= 32-bit if the architecture supports more than
>   32-bits worth of physical address space.  Should generally
>   be enabled on all 64-bit architectures unless you have
>   a very good reason not to.

Yeah so we sort-of enable the config option but only populate the zone
on platforms using swiotlb (freescale stuff). It's a bit messy at the
moment I must admit.

> ZONE_NORMAL:  Everything above 32-bit not falling into HIGHMEM or
>   MOVEABLE.

Cheers,
Ben.




Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:53 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:44:18AM +1000, Benjamin Herrenschmidt wrote:
> > We do have the occasional device with things like 31-bit DMA
> > limitations. We know they happen to work because those systems
> > can't have enough memory to be a problem. This is why our current
> > DMA direct ops in powerpc just unconditionally return true on ppc32.
> > 
> > The test against a full 32-bit mask here will break them I think.
> > 
> > Thing is, I'm not sure I still have access to one of these things
> > to test, I'll have to dig (from memory things like b43 wifi).
> 
> Yeah, the other platforms that support these devices support ZONE_DMA
> to reliably handle these devices. But there are two other ways the
> current code would actually handle these fine despite the dma_direct
> checks:
> 
>  1) if the device only has physical addresses up to 31-bit anyway
>  2) by trying again to find a lower address.  But this only works
> for coherent allocations and not streaming maps (unless we have
> swiotlb with a buffer below 31-bits).
> 
> It seems powerpc can have ZONE_DMA, though, and we will cover these
> devices just fine.  If it didn't have that, the current powerpc
> code would not work either.

Not exactly. powerpc has ZONE_DMA covering all of system memory.

What happens in ppc32 is that we somewhat "know" that none of the
systems with those stupid 31-bit limited pieces of HW is capable of
having more than 2GB of memory anyway.

So we get away with just returning "1".
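
[ The behaviour Ben describes, condensed from the powerpc
dma_nommu_dma_supported() of that era; a sketch, not the full code: ]

static int dma_nommu_dma_supported(struct device *dev, u64 mask)
{
#ifdef CONFIG_PPC64
	/* compare the mask against the (offset) top of memory */
	return mask >= get_dma_offset(dev) + (memblock_end_of_DRAM() - 1);
#else
	return 1;	/* 32-bit: assume no memory a device can't reach */
#endif
}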

> 
> >  - What is this trying to achieve ?
> > 
> > /*
> >  * Various PCI/PCIe bridges have broken support for > 32bit DMA even
> >  * if the device itself might support it.
> >  */
> > if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
> > return 0;
> > 
> > IE, if the device has a 32-bit limit, we fail an attempt at checking
> > if a >32-bit mask works ? That doesn't quite seem to be the right thing
> > to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?
> > 
> > IE, dma_set_mask() is what a driver uses to establish the device capability,
> > so it makes sense to have dma_32bit_limit just reduce that capability, not
> > fail because the device can do more than what the bridge can 
> 
> If your PCI bridge / PCIe root port doesn't support dma to addresses
> larger than 32-bit the device capabilities above that don't matter, it
> just won't work.  We have this case at least for some old VIA x86 chipsets
> and some relatively modern Xilinx FPGAs with PCIe.

Hrm... that's the usual confusion between dma_capable() and dma_set_mask().

It's always been perfectly fine for a driver to do a dma_set_mask(64-bit)
on a system where the bridge can only do 32-bits ...

We shouldn't fail there, we should instead "clamp" the mask to 32-bit,
see what I mean ? It doesn't matter that the device itself is capable
of issuing >32 addresses, I agree, but what we need to express is that
the combination device+bridge doesn't want addresses above 32-bit, so
it's equivalent to making the device do a set_mask(32-bit).

This will succeed if the system can limit the addresses (for example
because memory is never above 32-bit) and will fail if the system
can't.

So that's equivalent to writing:

if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
	mask = phys_to_dma(dev, DMA_BIT_MASK(32));

Effectively meaning "don't give me addresses above 32-bit".

Still, your code doesn't check the mask against the memory size, which
means it will fail for 32-bit masks even on systems that do not have
memory above 4G.

> >  - How is that file supposed to work on 64-bit platforms ? From what I can
> > tell, dma_supported() will unconditionally return true if the mask is
> > 32-bit or larger (apart from the above issue). This doesn't look right,
> > the mask needs to be compared to the max memory address. There are a bunch
> > of devices out there with masks anywhere between 40 and 64 bits, and
> > some of these will not work "out of the box" if the offset top
> > of memory is beyond the mask limit. Or am I missing something ?
> 
> You are not missing anything except for the history of this code.
> 
> Your observation is right, but there always has been the implicit
> assumption that architectures with more than 4GB of physical address
> space must either support an iommu or swiotlb and use that.  It's
> never been documented anywhere, but I'm working on integrating all
> this code to make more sense.

Well, iommus can have bypass regions, which we also use for
performance, so we do at dma_set_mask() time "swap" the ops around, and
in that case, we do want to check the mask against the actual top of
memory...

Cheers,
Ben.




Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:45 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:01:16AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> > > It turns out cxl actually uses it.  So for now skip this patch,
> > > although random code in drivers messing with dma ops will need to
> > > be sorted out sooner or later.
> > 
> > CXL devices are "special", they bypass the classic iommu in favor of
> > allowing the device to operate using the main processor page tables
> > using an MMU context (so basically the device can use userspace
> > addresses directly), akin to ATS.
> > 
> > I think the code currently uses the nommu ops as a way to do a simple
> > kernel mapping for kernel drivers using CXL (not userspace stuff)
> > though.
> 
> Its still a horrible idea to have this in drivers/, we need some
> core API to mediate this behavior.  Also if the device supports
> using virtual addresses dma_nommu_ops seems wrong as it won't do
> the right thing for e.g. vmalloc addresses not mapped into the
> kernel linear mapping (which I guess can't currently happen on
> powerpc, but still..)

You are right it won't do the right thing, but neither will standard
DMA ops, will they ? Drivers know not to try to dma_map vmalloc
addresses without first getting the underlying page, nothing unusual
there.
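
[ The idiom in question, sketched; assumes len does not cross a page
boundary: ]

struct page *pg = vmalloc_to_page(vaddr);
dma_addr_t handle = dma_map_page(dev, pg, offset_in_page(vaddr),
				 len, DMA_TO_DEVICE);
if (dma_mapping_error(dev, handle))
	/* bail out */;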

Yes I agree having this in drivers somewhat sucks though.

Cheers,
Ben.



Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 09:02 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:27:46AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > The requirement to disable local irqs over kmap_atomic is long gone,
> > > so remove those calls.
> > 
> > Really ? I'm trying to verify that and getting lost in a mess of macros
> > from hell in the per-cpu stuff but if you look at our implementation
> > of kmap_atomic_prot(), all it does is a preempt_disable(), and then
> > it uses kmap_atomic_idx_push():
> > 
> > int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
> > 
> > Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
> > ie this is the non-interrupt safe version...
> 
> Looks like the powerpc variant indeed isn't safe.
> 
> I did look a bit more through the code and history, and it seems
> like we removed the need to disable irqs when called from process
> context a while ago, but we still require disabling irqs when called
> from irq context.  Given that this code can also be called from
> irq context we'll have to keep the local_irq_save.

This is the same with x86, no ?

32-bit x86 kmap_atomic_prot is the same as ours...

In fact I wonder why the preempt_disable() is in there, since it needs
to be protected against interrupts ?

Or is it that we never actually call kmap_atomic_* these days from
interrupt, and the atomic versions are just about dealing with
spinlocks ?

Cheers,
Ben.
 



Re: [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems

2018-08-22 Thread Ganapatrao Kulkarni
On Wed, Aug 22, 2018 at 9:08 AM John Garry  wrote:
>
> On 22/08/2018 15:56, Robin Murphy wrote:
> > Hi John,
> >
> > On 22/08/18 14:44, John Garry wrote:
> >> On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:
> >>> Adding numa aware memory allocations used for iommu dma allocation and
> >>> memory allocated for SMMU stream tables, page walk tables and command
> >>> queues.
> >>>
> >>> With this patch, iperf testing on ThunderX2, with 40G NIC card on
> >>> NODE 1 PCI showed the same performance (around 30% improvement) as NODE 0.
> >>>
> >>> Ganapatrao Kulkarni (4):
> >>>   mm: move function alloc_pages_exact_nid out of __meminit
> >>>   numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
> >>> translation tables
> >>>   iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
> >>> command queues
> >>>   iommu/dma, numa: Use NUMA aware memory allocations in
> >>> __iommu_dma_alloc_pages
> >>>
> >>>  drivers/iommu/arm-smmu-v3.c| 57
> >>> +-
> >>>  drivers/iommu/dma-iommu.c  | 17 +++--
> >>>  drivers/iommu/io-pgtable-arm.c |  4 ++-
> >>>  include/linux/gfp.h|  2 +-
> >>>  mm/page_alloc.c|  3 ++-
> >>>  5 files changed, 67 insertions(+), 16 deletions(-)
> >>>
> >>
> >> Hi Ganapatrao,
> >>
> >> Have you any plans for further work on this patchset? I have not seen
> >> anything since this v1 was posted+discussed.
> >
>
> Hi Robin,
>
> Thanks for the info. I thought I remembered 4b12 but couldn't put my
> finger on it.
>
> > Looks like I ended up doing the version of the io-pgtable change that I
> > suggested here, which was merged recently (4b123757eeaa). Patch #3
> > should also be effectively obsolete now since the SWIOTLB/dma-direct
> > rework (21f237e4d085). Apparently I also started reworking patch #4 in
> > my tree at some point but sidelined it - I think that was at least
> > partly due to another thread[1] which made it seem less clear-cut
> > whether this is always the right thing to do.
>
> Right, so #4 seems less straightforward and not directly related to
> IOMMU driver anyway.
>

Thanks, Robin, for pulling up the patch. I couldn't follow up on this
due to other tasks.

> Cheers,
> John
>
> >
> > Robin.
> >
> > [1]
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1693026.html
> >
> > .
> >
>
>
thanks,
Ganapat


Re: [PATCH v5 3/5] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-22 Thread Robin Murphy

On 15/08/18 02:28, Zhen Lei wrote:

To support non-strict mode, we now only tlbi and sync for strict
mode. But for the non-leaf case, always follow strict mode.

Signed-off-by: Zhen Lei 
---
  drivers/iommu/io-pgtable-arm.c | 20 ++--
  drivers/iommu/io-pgtable.h |  3 +++
  2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 010a254..20d3e98 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -538,6 +538,7 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
phys_addr_t blk_paddr;
size_t tablesz = ARM_LPAE_GRANULE(data);
size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+   size_t unmapped = size;
int i, unmap_idx = -1;

if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
@@ -575,11 +576,16 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
tablep = iopte_deref(pte, data);
}

-   if (unmap_idx < 0)


[ side note: the more I see this test the more it looks slightly wrong, 
but that's a separate issue. I'll have to sit down and really remember 
what's going on here... ]



-   return __arm_lpae_unmap(data, iova, size, lvl, tablep);
+   if (unmap_idx < 0) {
+   unmapped = __arm_lpae_unmap(data, iova, size, lvl, tablep);
+   if (!(data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT))
+   return unmapped;
+   }


I don't quite get this change - we should only be recursing back into 
__arm_lpae_unmap() here if nothing's actually been unmapped at this 
point - the block entry is simply replaced by a full next-level table 
and we're going to come back and split another block at that next level, 
or we raced against someone else splitting the same block and that's 
their table instead. Since that's reverting back to a "regular" unmap, I 
don't see where the need to introduce an additional flush to that path 
comes from (relative to the existing behaviour, at least).



io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);


This is the flush which corresponds to whatever page split_blk_unmap() 
actually unmapped itself (and also covers any recursively-split 
intermediate-level entries)...



-   return size;
+   io_pgtable_tlb_sync(&data->iop);


...which does want this sync, but only technically for non-strict mode, 
since it's otherwise covered by the sync in iommu_unmap().


I'm not *against* tightening up the TLB maintenance here in general, but 
if so that should be a separately-reasoned patch, not snuck in with 
other changes.


Robin.


+
+   return unmapped;
  }

  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
@@ -609,7 +615,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
io_pgtable_tlb_sync(iop);
ptep = iopte_deref(pte, data);
__arm_lpae_free_pgtable(data, lvl + 1, ptep);
-   } else {
+   } else if (!(iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT)) {
io_pgtable_tlb_add_flush(iop, iova, size, size, true);
}

@@ -771,7 +777,8 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
u64 reg;
struct arm_lpae_io_pgtable *data;

-   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA))
+   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA |
+   IO_PGTABLE_QUIRK_NON_STRICT))
return NULL;

data = arm_lpae_alloc_pgtable(cfg);
@@ -863,7 +870,8 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
struct arm_lpae_io_pgtable *data;

/* The NS quirk doesn't apply at stage 2 */
-   if (cfg->quirks & ~IO_PGTABLE_QUIRK_NO_DMA)
+   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_NO_DMA |
+   IO_PGTABLE_QUIRK_NON_STRICT))
return NULL;

data = arm_lpae_alloc_pgtable(cfg);
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 2df7909..beb14a3 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -71,12 +71,15 @@ struct io_pgtable_cfg {
 *  be accessed by a fully cache-coherent IOMMU or CPU (e.g. for a
 *  software-emulated IOMMU), such that pagetable updates need not
 *  be treated as explicit DMA data.
+* IO_PGTABLE_QUIRK_NON_STRICT: Defer TLB invalidation and release the
+*  memory first.
 */
#define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
#define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
#define IO_PGTABLE_QUIRK_TLBI_ON_MAPBIT(2)
#define IO_PGTABLE_QUIRK_ARM_MTK_4GBBIT(3)
#define IO_PGTABLE_QUIRK_NO_DMA BIT(4)
+   #define IO_PGTABLE_QUIRK_NON_STRICT BIT(5)
unsigned long  quirks;

Re: [PATCH v5 5/5] iommu/arm-smmu-v3: add bootup option "iommu.non_strict"

2018-08-22 Thread Robin Murphy

On 15/08/18 02:28, Zhen Lei wrote:

Add a bootup option so that the system manager can choose which mode
to use. The default mode is strict.

Signed-off-by: Zhen Lei 
---
  Documentation/admin-guide/kernel-parameters.txt | 13 +
  drivers/iommu/arm-smmu-v3.c | 22 +-
  2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 5cde1ff..cb9d043e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1720,6 +1720,19 @@
nobypass[PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.

+   iommu.non_strict=   [ARM64]
+   Format: { "0" | "1" }
+   0 - strict mode, default.
+   Release IOVAs only after the related TLB entries
+   have been invalidated completely.
+   1 - non-strict mode.
+   Defer TLB invalidation and release the memory first.
+   This is good for scatter-gather performance but lacks
+   full isolation: an untrusted device can access the
+   reused memory because the TLB entries may still be
+   valid. Consider this carefully before choosing this
+   mode. Note that VFIO will always use strict mode.
+
iommu.passthrough=
[ARM64] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 61eb7ec..0eda90e 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -631,6 +631,26 @@ struct arm_smmu_option_prop {
{ 0, NULL},
  };

+static bool smmu_non_strict __read_mostly;
+
+static int __init arm_smmu_setup(char *str)
+{
+   int ret;
+
+   ret = kstrtobool(str, &smmu_non_strict);
+   if (ret)
+   return ret;
+
+   if (smmu_non_strict) {
+   pr_warn("WARNING: iommu non-strict mode is chosen.\n"
+   "It's good for scatter-gather performance but lacks full 
isolation\n");
+   add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+   }
+
+   return 0;
+}
+early_param("iommu.non_strict", arm_smmu_setup);


As I said on v3, the option should be parsed by iommu-dma, since that's 
where it takes effect, and I'm sure SMMUv2 users will be interested in 
trying it out too.


In other words, if iommu_dma_init_domain() does something like:

if (iommu_dma_non_strict && domain->ops->flush_iotlb_all) {
domain->non_strict = true;
cookie->domain = domain;
init_iova_flush_queue(...);
}


+
  static inline void __iomem *arm_smmu_page1_fixup(unsigned long offset,
 struct arm_smmu_device *smmu)
  {
@@ -1622,7 +1642,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;

-   if (domain->type == IOMMU_DOMAIN_DMA) {
+   if ((domain->type == IOMMU_DOMAIN_DMA) && smmu_non_strict) {
domain->non_strict = true;
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
}


...then all the driver should need to do is:

if (domain->non_strict)
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;


Now, that would make it possible to request non-strict mode even with 
drivers which *don't* understand it, but I think that's not actually 
harmful, just means that some TLBIs will still get issued synchronously 
and the flush queue might not do much. If you wanted to avoid even that, 
you could replace domain->non_strict with an iommu_domain_set_attr() 
call, so iommu_dma could tell up-front whether the driver understands 
non-strict mode and it's worth setting the queue up or not.
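
[ A sketch of that alternative; iommu_domain_set_attr() is the real
API, but DOMAIN_ATTR_NON_STRICT is a hypothetical attribute here: ]

int non_strict = 1;

if (iommu_domain_set_attr(domain, DOMAIN_ATTR_NON_STRICT, &non_strict))
	non_strict = 0;	/* driver doesn't understand it: stay strict */
if (non_strict)
	init_iova_flush_queue(...);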


Robin.


Re: [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems

2018-08-22 Thread John Garry

On 22/08/2018 15:56, Robin Murphy wrote:

Hi John,

On 22/08/18 14:44, John Garry wrote:

On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:

Adding numa aware memory allocations used for iommu dma allocation and
memory allocated for SMMU stream tables, page walk tables and command
queues.

With this patch, iperf testing on ThunderX2, with 40G NIC card on
NODE 1 PCI showed the same performance (around 30% improvement) as NODE 0.

Ganapatrao Kulkarni (4):
  mm: move function alloc_pages_exact_nid out of __meminit
  numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
translation tables
  iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
command queues
  iommu/dma, numa: Use NUMA aware memory allocations in
__iommu_dma_alloc_pages

 drivers/iommu/arm-smmu-v3.c| 57
+-
 drivers/iommu/dma-iommu.c  | 17 +++--
 drivers/iommu/io-pgtable-arm.c |  4 ++-
 include/linux/gfp.h|  2 +-
 mm/page_alloc.c|  3 ++-
 5 files changed, 67 insertions(+), 16 deletions(-)



Hi Ganapatrao,

Have you any plans for further work on this patchset? I have not seen
anything since this v1 was posted+discussed.




Hi Robin,

Thanks for the info. I thought I remembered 4b12 but couldn't put my 
finger on it.



Looks like I ended up doing the version of the io-pgtable change that I
suggested here, which was merged recently (4b123757eeaa). Patch #3
should also be effectively obsolete now since the SWIOTLB/dma-direct
rework (21f237e4d085). Apparently I also started reworking patch #4 in
my tree at some point but sidelined it - I think that was at least
partly due to another thread[1] which made it seem less clear-cut
whether this is always the right thing to do.


Right, so #4 seems less straightforward and not directly related to 
IOMMU driver anyway.


Cheers,
John



Robin.

[1]
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1693026.html

.






Re: [PATCH v14 0/4] iommu/arm-smmu: Add runtime pm/sleep support

2018-08-22 Thread Robin Murphy

On 20/08/18 10:31, Tomasz Figa wrote:

Hi Robin,

On Fri, Jul 27, 2018 at 4:02 PM Vivek Gautam
 wrote:


This series provides the support for turning on the arm-smmu's
clocks/power domains using runtime pm. This is done using
device links between smmu and client devices. The device link
framework keeps the two devices in correct order for power-cycling
across runtime PM or across system-wide PM.

With addition of a new device link flag DL_FLAG_AUTOREMOVE_SUPPLIER [8]
(available in linux-next of Rafael's linux-pm tree [9]), the device links
created between arm-smmu and its clients will be automatically purged
when arm-smmu driver unbinds from its device.

As not all implementations support clock/power gating, we are checking
for a valid 'smmu->dev's pm_domain' to conditionally enable the runtime
power management for such smmu implementations that can support it.
Otherwise, the clocks are turned to be always on in .probe until .remove.
With conditional runtime pm now, we avoid touching dev->power.lock
in fastpaths for smmu implementations that don't need to do anything
useful with pm_runtime.
This lets us use the much-argued pm_runtime_get_sync/put_sync()
calls in map/unmap callbacks so that the clients do not have to
worry about handling any of the arm-smmu's power.

This series also adds support for Qcom's arm-smmu-v2 variant that
has different clocks and power requirements.

Previous version of this patch series is @ [2].

Tested this series on msm8996, and sdm845 after pulling in Rafael's linux-pm
linux-next[9] and Joerg's iommu next[10] branches, and related changes for
device tree, etc.

Hi Robin, Will,
I have addressed the comments for v13. If there's still a chance
can you please consider pulling this for v4.19.
Thanks.

[v14]
* Moved arm_smmu_device_reset() from arm_smmu_pm_resume() to
  arm_smmu_runtime_resume() so that the pm_resume callback calls
  only runtime_resume to resume the device.
  This should take care of restoring the state of smmu in systems
  in which smmu lose register state on power-domain collapse.


It's been a while since this series was posted and no more comments
seem to be left anymore. Would you have some time to take a look
again? Thanks.


Other than the binding issue which turned up in the meantime, I *think* 
this is looking OK now in terms of being sufficiently safe for all the 
various awkward retention vs. state-loss combinations. There's almost 
certainly still ways to improve it in future, but what we have now seems 
like a reasonable starting point that isn't impossibly complicated to 
reason about.
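
[ For readers unfamiliar with the mechanism under discussion, a minimal
sketch of the device-link setup the series relies on: the SMMU becomes
a runtime-PM supplier of each client device, and
DL_FLAG_AUTOREMOVE_SUPPLIER drops the link when the SMMU unbinds: ]

struct device_link *link;

link = device_link_add(client_dev, smmu->dev,
		       DL_FLAG_PM_RUNTIME | DL_FLAG_AUTOREMOVE_SUPPLIER);
if (!link)
	dev_err(smmu->dev, "unable to link with %s\n",
		dev_name(client_dev));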


Robin.


Re: [PATCH v14 4/4] iommu/arm-smmu: Add support for qcom,smmu-v2 variant

2018-08-22 Thread Robin Murphy

On 27/07/18 08:02, Vivek Gautam wrote:

qcom,smmu-v2 is an arm,smmu-v2 implementation with specific
clock and power requirements. This smmu core is used with
multiple masters on msm8996, viz. mdss, video, etc.
Add bindings for the same.

Signed-off-by: Vivek Gautam 
Reviewed-by: Rob Herring 
Reviewed-by: Tomasz Figa 
---

Change since v13:
  - No change.

  .../devicetree/bindings/iommu/arm,smmu.txt | 42 ++
  drivers/iommu/arm-smmu.c   | 13 +++
  2 files changed, 55 insertions(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.txt 
b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
index 8a6ffce12af5..7c71a6ed465a 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.txt
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
@@ -17,10 +17,19 @@ conditions.
  "arm,mmu-401"
  "arm,mmu-500"
  "cavium,smmu-v2"
+"qcom,-smmu-v2", "qcom,smmu-v2"
  
depending on the particular implementation and/or the

version of the architecture implemented.
  
+  A number of Qcom SoCs use the qcom,smmu-v2 version of the IP.

+  "qcom,-smmu-v2" represents a soc specific compatible
+  string that should be present along with the "qcom,smmu-v2"
+  to facilitate SoC specific clocks/power connections and to
+  address specific bug fixes.


As demonstrated in the GPU thread, this proves a bit too vague for a 
useful binding. Provided Qcom folks can reach a consensus on what a 
given SoC is actually called, I'd rather just unambiguously list 
whatever sets of fully-defined strings we need.


Robin.


+  An example string would be -
+  "qcom,msm8996-smmu-v2", "qcom,smmu-v2".
+
  - reg   : Base address and size of the SMMU.
  
  - #global-interrupts : The number of global interrupts exposed by the

@@ -71,6 +80,22 @@ conditions.
or using stream matching with #iommu-cells = <2>, and
may be ignored if present in such cases.
  
+- clock-names:List of the names of clocks input to the device. The

+  required list depends on particular implementation and
+  is as follows:
+  - for "qcom,smmu-v2":
+- "bus": clock required for downstream bus access and
+ for the smmu ptw,
+- "iface": clock required to access smmu's registers
+   through the TCU's programming interface.
+  - unspecified for other implementations.
+
+- clocks: Specifiers for all clocks listed in the clock-names property,
+  as per generic clock bindings.
+
+- power-domains:  Specifiers for power domains required to be powered on for
+  the SMMU to operate, as per generic power domain bindings.
+
  ** Deprecated properties:
  
  - mmu-masters (deprecated in favour of the generic "iommus" binding) :

@@ -137,3 +162,20 @@ conditions.
  iommu-map = <0  0 0x400>;
  ...
  };
+
+   /* Qcom's arm,smmu-v2 implementation */
+   smmu4: iommu {
+   compatible = "qcom,msm8996-smmu-v2", "qcom,smmu-v2";
+   reg = <0xd0 0x1>;
+
+   #global-interrupts = <1>;
+   interrupts = ,
+,
+;
+   #iommu-cells = <1>;
+   power-domains = <&mmcc MDSS_GDSC>;
+
+   clocks = <&mmcc SMMU_MDP_AXI_CLK>,
+<&mmcc SMMU_MDP_AHB_CLK>;
+   clock-names = "bus", "iface";
+   };
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index e558abf1ecfc..2b4edba188a5 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -119,6 +119,7 @@ enum arm_smmu_implementation {
GENERIC_SMMU,
ARM_MMU500,
CAVIUM_SMMUV2,
+   QCOM_SMMUV2,
  };
  
  struct arm_smmu_s2cr {

@@ -1971,6 +1972,17 @@ ARM_SMMU_MATCH_DATA(arm_mmu401, ARM_SMMU_V1_64K, GENERIC_SMMU);
  ARM_SMMU_MATCH_DATA(arm_mmu500, ARM_SMMU_V2, ARM_MMU500);
  ARM_SMMU_MATCH_DATA(cavium_smmuv2, ARM_SMMU_V2, CAVIUM_SMMUV2);
  
+static const char * const qcom_smmuv2_clks[] = {
+   "bus", "iface",
+};
+
+static const struct arm_smmu_match_data qcom_smmuv2 = {
+   .version = ARM_SMMU_V2,
+   .model = QCOM_SMMUV2,
+   .clks = qcom_smmuv2_clks,
+   .num_clks = ARRAY_SIZE(qcom_smmuv2_clks),
+};
+
  static const struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v1", .data = _generic_v1 },
{ .compatible = "arm,smmu-v2", .data = _generic_v2 },
@@ -1978,6 +1990,7 @@ static const struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,mmu-401", .data = _mmu401 },
{ 

Re: [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems

2018-08-22 Thread Robin Murphy

Hi John,

On 22/08/18 14:44, John Garry wrote:

On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:

Adding numa aware memory allocations used for iommu dma allocation and
memory allocated for SMMU stream tables, page walk tables and command 
queues.


With this patch, iperf testing on ThunderX2, with 40G NIC card on
NODE 1 PCI showed the same performance (around 30% improvement) as NODE 0.

Ganapatrao Kulkarni (4):
  mm: move function alloc_pages_exact_nid out of __meminit
  numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
    translation tables
  iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
    command queues
  iommu/dma, numa: Use NUMA aware memory allocations in
    __iommu_dma_alloc_pages

 drivers/iommu/arm-smmu-v3.c    | 57 
+-

 drivers/iommu/dma-iommu.c  | 17 +++--
 drivers/iommu/io-pgtable-arm.c |  4 ++-
 include/linux/gfp.h    |  2 +-
 mm/page_alloc.c    |  3 ++-
 5 files changed, 67 insertions(+), 16 deletions(-)



Hi Ganapatrao,

Have you any plans for further work on this patchset? I have not seen 
anything since this v1 was posted+discussed.


Looks like I ended up doing the version of the io-pgtable change that I 
suggested here, which was merged recently (4b123757eeaa). Patch #3 
should also be effectively obsolete now since the SWIOTLB/dma-direct 
rework (21f237e4d085). Apparently I also started reworking patch #4 in 
my tree at some point but sidelined it - I think that was at least 
partly due to another thread[1] which made it seem less clear-cut 
whether this is always the right thing to do.


Robin.

[1] 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1693026.html



Re: [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems

2018-08-22 Thread John Garry

On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:

Adding numa aware memory allocations used for iommu dma allocation and
memory allocated for SMMU stream tables, page walk tables and command queues.

With this patch, iperf testing on ThunderX2, with 40G NIC card on
NODE 1 PCI showed the same performance (around 30% improvement) as NODE 0.

Ganapatrao Kulkarni (4):
  mm: move function alloc_pages_exact_nid out of __meminit
  numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
translation tables
  iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
command queues
  iommu/dma, numa: Use NUMA aware memory allocations in
__iommu_dma_alloc_pages

 drivers/iommu/arm-smmu-v3.c| 57 +-
 drivers/iommu/dma-iommu.c  | 17 +++--
 drivers/iommu/io-pgtable-arm.c |  4 ++-
 include/linux/gfp.h|  2 +-
 mm/page_alloc.c|  3 ++-
 5 files changed, 67 insertions(+), 16 deletions(-)



Hi Ganapatrao,

Have you any plans for further work on this patchset? I have not seen 
anything since this v1 was posted+discussed.
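
[ The core idea of the series, sketched: allocate SMMU tables on the
NUMA node the SMMU device sits on rather than the allocating CPU's
node; dev_to_node() and alloc_pages_node() are the generic APIs: ]

int node = dev_to_node(smmu->dev);
struct page *page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
				     get_order(size));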


Thanks,
John








Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 11:57:53AM +1000, Benjamin Herrenschmidt wrote:
> Note: We will still need to implement some custom variant of this
> for our secure VMs ... 
> 
> Basically we'll need to use the existing bounce bufferring as-is but
> the condition will be different, it won't be whether the address is
> below a certain limit, it will be *always*.

The conditions are in the dma_capable() helper that the architecture
can override (and which powerpc already does override).
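
[ A sketch of what such an override could look like; the last check is
the 4.18-era generic dma_capable(), while is_secure_guest() is a
hypothetical predicate for the secure-VM case: ]

static inline bool dma_capable(struct device *dev, dma_addr_t addr,
			       size_t size)
{
	if (is_secure_guest())		/* hypothetical: always bounce */
		return false;
	if (!dev->dma_mask)
		return false;
	return addr + size - 1 <= *dev->dma_mask;
}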


Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 10:27:46AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > The requirement to disable local irqs over kmap_atomic is long gone,
> > so remove those calls.
> 
> Really ? I'm trying to verify that and getting lost in a mess of macros
> from hell in the per-cpu stuff but if you look at our implementation
> of kmap_atomic_prot(), all it does is a preempt_disable(), and then
> it uses kmap_atomic_idx_push():
> 
>   int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
> 
> Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
> ie this is the non-interrupt safe version...

Looks like the powerpc variant indeed isn't safe.

I did look a bit more through the code and history, and it seems
like we removed the need to disable irqs when called from process
context a while ago, but we still require disabling irqs when called
from irq context.  Given that this code can also be called from
irq context we'll have to keep the local_irq_save.
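
[ The pattern being kept, sketched: because this path can run in irq
context and powerpc's kmap_atomic index accounting is not irq-safe,
the irq disabling stays around the atomic kmap: ]

unsigned long flags;
void *kaddr;

local_irq_save(flags);
kaddr = kmap_atomic(page);
/* ... cache maintenance on kaddr + offset ... */
kunmap_atomic(kaddr);
local_irq_restore(flags);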


Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 09:54:33AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > We need to take the DMA offset and encryption bit into account when selecting
> > a zone.  Add a helper that takes those into account and use it.
> 
> That whole "encryption" stuff seems to be completely specific to the
> way x86 does memory encryption, or am I mistaken ? It's not clear to me
> what that does in practice and how it relates to DMA mappings.

Not even all of x86, but AMD in particular; Intel does it yet another
way.  But it still is easier to take this into the core with a few
overrides than duplicating all the code.

> I'm also not sure about that whole business with ZONE_DMA and
> ARCH_ZONE_DMA_BITS...

ZONE_DMA usually (but not always) maps to 24 bits of address space;
if it doesn't (I was mostly thinking about s390 with its odd 31 bits)
the architecture can override it if it cares.

> On ppc64, unless you enable swiotlb (which we only do currently on
> some embedded platforms), you have all of memory in ZONE_DMA.
> 
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x-0x001f]
> [0.00]   DMA32empty
> [0.00]   Normal   empty
> [0.00]   Device   empty

This is really weird.  Why would you wire up ZONE_DMA like this?
The general scheme that architectures should implement is:

ZONE_DMA:   Any memory below a magic threshold that is lower than
32-bit.  Only enabled if actually required (usually
either 24-bit for ISA, or some other weird architecture
specific value like 32-bit for S/390)
ZONE_DMA32: Memory <= 32-bit if the architecture supports more than
32-bits worth of physical address space.  Should generally
be enabled on all 64-bit architectures unless you have
a very good reason not to.
ZONE_NORMAL:Everything above 32-bit not falling into HIGHMEM or
MOVEABLE.
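
[ How an arch typically wires that scheme up, sketched after the
x86-style paging_init(); MAX_DMA_PFN/MAX_DMA32_PFN are the x86 names,
and the exact call site varies per architecture: ]

unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };

#ifdef CONFIG_ZONE_DMA
max_zone_pfns[ZONE_DMA]    = min(MAX_DMA_PFN, max_low_pfn);
#endif
#ifdef CONFIG_ZONE_DMA32
max_zone_pfns[ZONE_DMA32]  = min(MAX_DMA32_PFN, max_low_pfn);
#endif
max_zone_pfns[ZONE_NORMAL] = max_pfn;
free_area_init_nodes(max_zone_pfns);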


Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 09:44:18AM +1000, Benjamin Herrenschmidt wrote:
> We do have the occasional device with things like 31-bit DMA
> limitations. We know they happen to work because those systems
> can't have enough memory to be a problem. This is why our current
> DMA direct ops in powerpc just unconditionally return true on ppc32.
> 
> The test against a full 32-bit mask here will break them I think.
> 
> Thing is, I'm not sure I still have access to one of these things
> to test, I'll have to dig (from memory things like b43 wifi).

Yeah, the other platforms that support these devices support ZONE_DMA
to reliably handle these devices. But there are two other ways the
current code would actually handle these fine despite the dma_direct
checks:

 1) if the device only has physical addresses up to 31-bit anyway
 2) by trying again to find a lower address.  But this only works
for coherent allocations and not streaming maps (unless we have
swiotlb with a buffer below 31-bits).

It seems powerpc can have ZONE_DMA, though, and we will cover these
devices just fine.  If it didn't have that, the current powerpc
code would not work either.

>  - What is this trying to achieve ?
> 
>   /*
>* Various PCI/PCIe bridges have broken support for > 32bit DMA even
>* if the device itself might support it.
>*/
>   if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
>   return 0;
> 
> IE, if the device has a 32-bit limit, we fail an attempt at checking
> if a >32-bit mask works ? That doesn't quite seem to be the right thing
> to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?
> 
> IE, dma_set_mask() is what a driver uses to establish the device capability,
> so it makes sense to have dma_32bit_limit just reduce that capability, not
> fail because the device can do more than what the bridge can 

If your PCI bridge / PCIe root port doesn't support dma to addresses
larger than 32-bit the device capabilities above that don't matter, it
just won't work.  We have this case at least for some old VIA x86 chipsets
and some relatively modern Xilinx FPGAs with PCIe.

>  - How is that file supposed to work on 64-bit platforms ? From what I can
> tell, dma_supported() will unconditionally return true if the mask is
> 32-bit or larger (apart from the above issue). This doesn't look right,
> the mask needs to be compared to the max memory address. There are a bunch
> of devices out there with masks anywhere between 40 and 64 bits, and
> some of these will not work "out of the box" if the offset top
> of memory is beyond the mask limit. Or am I missing something ?

You are not missing anything except for the history of this code.

Your observation is right, but there always has been the implicit
assumption that architectures with more than 4GB of physical address
space must either support an iommu or swiotlb and use that.  It's
never been documented anywhere, but I'm working on integrating all
this code to make more sense.


Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 10:01:16AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> > It turns out cxl actually uses it.  So for now skip this patch,
> > although random code in drivers messing with dma ops will need to
> > be sorted out sooner or later.
> 
> CXL devices are "special", they bypass the classic iommu in favor of
> allowing the device to operate using the main processor page tables
> using an MMU context (so basically the device can use userspace
> addresses directly), akin to ATS.
> 
> I think the code currently uses the nommu ops as a way to do a simple
> kernel mapping for kernel drivers using CXL (not userspace stuff)
> though.

It's still a horrible idea to have this in drivers/; we need some
core API to mediate this behavior.  Also if the device supports
using virtual addresses dma_nommu_ops seems wrong as it won't do
the right thing for e.g. vmalloc addresses not mapped into the
kernel linear mapping (which I guess can't currently happen on
powerpc, but still..)


Re: [PATCH v3] sparc: use generic dma_noncoherent_ops

2018-08-22 Thread Christoph Hellwig
On Tue, Aug 21, 2018 at 10:57:37PM -0700, David Miller wrote:
> From: Christoph Hellwig 
> Date: Wed, 22 Aug 2018 07:36:13 +0200
> 
> > I fear you have already pushed out your tree, but otherwise it would
> > be better to merge it through the dma-mapping tree.
> 
> I did and asked Linus to pull as well.  Sorry :-/

Oh, if you pushed that for 4.19 that is actually perfect to prepare
for my 4.20 changes.  Never mind the interruption!