Re: [PATCH 19/28] gpu/drm: remove the powerpc hack in drm_legacy_sg_alloc

2020-04-09 Thread Benjamin Herrenschmidt
On Thu, 2020-04-09 at 11:41 +0200, Daniel Vetter wrote:
> Now if these boxes didn't ever have agp then I think we can get away
> with deleting this, since we've already deleted the legacy radeon
> driver. And that one used vmalloc for everything. The new kms one does
> use the dma-api if the gpu isn't connected through agp

Definitely no AGP there.

Cheers
Ben.




Re: [PATCH 19/28] gpu/drm: remove the powerpc hack in drm_legacy_sg_alloc

2020-04-09 Thread Benjamin Herrenschmidt
On Wed, 2020-04-08 at 14:25 +0200, Daniel Vetter wrote:
> On Wed, Apr 08, 2020 at 01:59:17PM +0200, Christoph Hellwig wrote:
> > If this code was broken for non-coherent caches a crude powerpc hack
> > isn't going to help anyone else.  Remove the hack as it is the last
> > user of __vmalloc passing a page protection flag other than PAGE_KERNEL.
> 
> Well Ben added this to make stuff work on ppc, ofc the home grown dma
> layer in drm from back then isn't going to work in other places. I guess
> should have at least an ack from him, in case anyone still cares about
> this on ppc. Adding Ben to cc.

This was due to some drivers (radeon ?) trying to use vmalloc pages for
coherent DMA, which on those 4xx powerpcs means the pages need to be
non-cached.

There were machines using that (440-based, iirc), though I honestly
can't tell whether anybody still uses any of it.

Cheers,
Ben.

> -Daniel
> 
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  drivers/gpu/drm/drm_scatter.c | 11 +--
> >  1 file changed, 1 insertion(+), 10 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_scatter.c b/drivers/gpu/drm/drm_scatter.c
> > index ca520028b2cb..f4e6184d1877 100644
> > --- a/drivers/gpu/drm/drm_scatter.c
> > +++ b/drivers/gpu/drm/drm_scatter.c
> > @@ -43,15 +43,6 @@
> >  
> >  #define DEBUG_SCATTER 0
> >  
> > -static inline void *drm_vmalloc_dma(unsigned long size)
> > -{
> > -#if defined(__powerpc__) && defined(CONFIG_NOT_COHERENT_CACHE)
> > -   return __vmalloc(size, GFP_KERNEL, pgprot_noncached_wc(PAGE_KERNEL));
> > -#else
> > -   return vmalloc_32(size);
> > -#endif
> > -}
> > -
> >  static void drm_sg_cleanup(struct drm_sg_mem * entry)
> >  {
> > struct page *page;
> > @@ -126,7 +117,7 @@ int drm_legacy_sg_alloc(struct drm_device *dev, void 
> > *data,
> > return -ENOMEM;
> > }
> >  
> > -   entry->virtual = drm_vmalloc_dma(pages << PAGE_SHIFT);
> > +   entry->virtual = vmalloc_32(pages << PAGE_SHIFT);
> > if (!entry->virtual) {
> > kfree(entry->busaddr);
> > kfree(entry->pagelist);
> > -- 
> > 2.25.1
> > 
> 
> 



Re: [tech-privileged] [RFC PATCH V1] riscv-privileged: Add broadcast mode to sfence.vma

2019-09-19 Thread Benjamin Herrenschmidt
On Thu, 2019-09-19 at 09:04 -0700, Andrew Waterman wrote:
> This needs to be discussed and debated at length; proposing edits to
> the spec at this stage is putting the cart before the horse!
> 
> We shouldn’t change the definition of the existing SFENCE.VMA
> instruction to accomplish this. It’s also not abundantly clear to me
> that this should be an instruction: TLB shootdown looks more like
> MMIO.

Quite a few points to make here:

 - TLB shootdown as MMIO is a problem when you start trying to do it
directly from guests (which is very desirable). I can elaborate if you
want, but it generally boils down to having a pile of copies of the
MMIO resources to assign to guests and having to constantly change
mappings which is unrealistic.

 - I generally have very serious doubts as to the value of doing
broadcast TLB shootdowns in HW. I went on at length about it during the
Linux Plumbers Conference, but I'll try to summarise here, from my
experience back at IBM working on POWER. In no special order:

   * It doesn't scale well. You have to drain all the target CPUs'
queues of already-translated loads/stores, keep ordering, etc... it
causes a giant fabric traffic jam. Experience has shown that on some
POWER systems it becomes extremely expensive.

   * Some OSes, such as Linux, track which CPUs have seen a given
context. That allows TLB invalidations to be "targeted" in a more
sensible way. Broadcast instructions tend to lose that ability (it's
hard to do in HW, especially in a way that can be virtualized properly).

   * Because those instructions (or at least the context-synchronizing
instruction that follows them and waits for completion of the
invalidations) can take a very long time, for the above reasons and
some of the ones below, you end up with a CPU effectively "stuck" for a
long time, not taking interrupts, including those routed to higher
priority levels (ie. hypervisor etc...). This is problematic. A
completion-polling mechanism is preferable, so that one can still
handle such work while waiting, but that is hard to architect/implement
properly when done as "instructions" since they can happen concurrently
from multiple contexts. It's easier with MMIO, but that has other
issues.

   * It introduces races with anything that does a SW walk of the page
tables. For example, MMIO emulation by a hypervisor cannot be done
race-free if the guest can do its own broadcast invalidations and the
hypervisor has to walk the guest page tables to translate. I can
elaborate if requested.

   * Those invalidations also need to target nest agents that can hold
translations, such as IOMMUs that can operate on user contexts etc...
Such IOMMUs can take a VERY LONG time to process invalidations,
especially if translations have been checked out by devices such as
PCIe devices using ATS. Some GPUs for example can hit a worst case of
hundreds of *milliseconds* to process a TLB invalidation.

 - Now for the proposed scheme. I really don't like introducing a *new*
way of tagging an address space using the PPN. It's a hack. The right
way is to ensure that the existing context tags are big enough to not
require re-use and thus can be treated as global context tags by the
hypervisor and OS. IE. have big enough VMIDs and ASIDs that each
running VM can have a globally stable VMID and each process within a
given VM can have a global (within that VM) ASID, instead of playing
ASID-reallocation tricks at context switch (that's very inefficient
anyway, so it should be fixed for anything that claims performance
and scalability).

Now all of these things can probably have solutions, but experience
doesn't seem to indicate that it's really worthwhile. We are better off
making sure we have a really fast IPI path to perform those via
interrupts, locally on the targeted CPUs, IMHO.
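
For illustration, here is a minimal sketch of that software approach,
roughly the shape a targeted, IPI-based shootdown takes in Linux: the OS
records in mm_cpumask() which CPUs have ever run a given address space
and only interrupts those, each of which performs a purely local
invalidation. local_invalidate() and flush_tlb_range_via_ipi() are
illustrative names, not the actual Linux implementation.

#include <linux/smp.h>
#include <linux/mm.h>

struct flush_info {
        struct mm_struct *mm;
        unsigned long start, end;
};

/* Runs on each targeted CPU, from IPI context: purely local invalidation. */
static void local_invalidate(void *data)
{
        struct flush_info *fi = data;

        /* Arch-specific local TLB invalidation for fi->mm over
         * fi->start..fi->end (e.g. a loop of local sfence.vma). */
}

static void flush_tlb_range_via_ipi(struct mm_struct *mm,
                                    unsigned long start, unsigned long end)
{
        struct flush_info fi = { .mm = mm, .start = start, .end = end };

        /* Only CPUs that have ever run this mm are in mm_cpumask(), so
         * only they get interrupted; everyone else is left alone. */
        on_each_cpu_mask(mm_cpumask(mm), local_invalidate, &fi, 1);
}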

Cheers,
Ben.

> 
> On Thu, Sep 19, 2019 at 5:36 AM Guo Ren  wrote:
> > From: Guo Ren 
> > 
> > The patch is for https://github.com/riscv/riscv-isa-manual
> > 
> > The proposal has been talked in LPC-2019 RISC-V MC ref [1]. Here is
> > the
> > formal patch.
> > 
> > Introduction
> > 
> > 
> > Using the Hardware TLB broadcast invalidation instruction to
> > maintain the
> > system TLB is a good choice and it'll simplify the system software
> > design.
> > The proposal hopes to add a broadcast mode to the sfence.vma in the
> > riscv-privilege specification. To support the sfence.vma broadcast
> > mode,
> > there are two modification introduced below:
> > 
> >  1) Add PGD.PPN (root page table's PPN) as the unique identifier of
> > the
> > address space in addition to asid/vmid. Compared to the
> > dynamically
> > changed asid/vmid, PGD.PPN is fixed throughout the address
> > space life
> > cycle. This feature enables uniform address space
> > identification
> > between different TLB systems (actually, it's difficult to
> > unify the
> > asid/vmid between the CPU system and the IOMMU system, because
> > their
> > mechanisms are different)
> > 
> >  2) Modify the 

Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Benjamin Herrenschmidt
On Mon, 2019-07-15 at 19:03 -0300, Thiago Jung Bauermann wrote:
> > > Indeed. The idea is that QEMU can offer the flag, old guests can
> > > reject
> > > it (or even new guests can reject it, if they decide not to
> > > convert into
> > > secure VMs) and the feature negotiation will succeed with the
> > > flag
> > > unset.
> > 
> > OK. And then what does QEMU do? Assume guest is not encrypted I
> > guess?
> 
> There's nothing different that QEMU needs to do, with or without the
> flag. the perspective of the host, a secure guest and a regular guest
> work the same way with respect to virtio.

This is *precisely* why I was against adding a flag and touching the
protocol negotiation with qemu in the first place, back when I cared
about that stuff...

Guys, this has gone in circles over and over again.

This has nothing to do with qemu. Qemu doesn't need to know about this.
It's entirely guest local. This is why the one-liner in virtio was a
far better and simpler solution.

This is something the guest does to itself (with the participation of an
ultravisor, but that's not something qemu cares about at this stage, at
least not as far as virtio is concerned).

Basically, the guest "hides" its memory from the host using a HW secure
memory facility. As a result, it needs to ensure that all of its DMA
pages are bounced through insecure pages that aren't hidden. That's it,
it's all guest side. Qemu shouldn't have to care about it at all.
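
To make the guest-local nature of this concrete, here is a minimal
sketch of the kind of test being argued for (not the RFC patch itself,
and not the code that was eventually merged): the decision to push
virtio buffers through the DMA API, and therefore through swiotlb
bounce buffers sitting in pages the guest has not hidden, depends only
on guest state, and QEMU never sees it. guest_memory_is_protected() is
a hypothetical stand-in for the platform's "am I a secure VM" check.

#include <linux/types.h>

/* Hypothetical platform hook: true once the guest has hidden its memory
 * behind the HW secure-memory facility (ultravisor). */
bool guest_memory_is_protected(void);

/*
 * Guest-side policy only: if our memory is hidden from the host, every
 * virtio buffer must go through the DMA API so it gets bounced via
 * shared (insecure) pages.  Nothing here is negotiated with QEMU.
 */
static bool vring_should_use_dma_api(void)
{
        if (guest_memory_is_protected())
                return true;

        /* Otherwise keep the historical virtio behaviour. */
        return false;
}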

Cheers,
Ben.




Re: use generic DMA mapping code in powerpc V4

2018-12-11 Thread Benjamin Herrenschmidt
On Tue, 2018-12-11 at 19:17 +0100, Christian Zigotzky wrote:
> X5000 (P5020 board): U-Boot loads the kernel and the dtb file. Then the 
> kernel starts but it doesn't find any hard disks (partitions). That 
> means this is also the bad commit for the P5020 board.

What are the disks hanging off ? A PCIe device of some sort ?

Can you send good & bad dmesg logs ?

Ben.




Re: use generic DMA mapping code in powerpc V4

2018-12-08 Thread Benjamin Herrenschmidt
On Tue, 2018-11-27 at 08:42 +0100, Christoph Hellwig wrote:
> Any comments?  I'd like to at least get the ball moving on the easy
> bits.

So I had to clean up some dust, but it works on the G5 (with and without
iommu) and on 32-bit powermacs at least.

We're doing more tests, hopefully mpe can dig out some PASemi and
NXP/FSL HW as well. I'll try to review & ack the patches over the next
few days too.

Cheers,
Ben.

> On Wed, Nov 14, 2018 at 09:22:40AM +0100, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series switches the powerpc port to use the generic swiotlb and
> > noncoherent dma ops, and to use more generic code for the coherent
> > direct mapping, as well as removing a lot of dead code.
> > 
> > As this series is very large and depends on the dma-mapping tree I've
> > also published a git tree:
> > 
> > git://git.infradead.org/users/hch/misc.git powerpc-dma.4
> > 
> > Gitweb:
> > 
> > 
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.4
> > 
> > Changes since v3:
> >  - rebase on the powerpc fixes tree
> >  - add a new patch to actually make the baseline amigaone config
> >configure without warnings
> >  - only use ZONE_DMA for 64-bit embedded CPUs, on pseries an IOMMU is
> >always present
> >  - fix compile in mem.c for one configuration
> >  - drop the full npu removal for now, will be resent separately
> >  - a few git bisection fixes
> > 
> > The changes since v1 are to big to list and v2 was not posted in public.
> > 
> ---end quoted text---



Re: use generic DMA mapping code in powerpc V4

2018-12-08 Thread Benjamin Herrenschmidt
On Tue, 2018-11-27 at 08:42 +0100, Christoph Hellwig wrote:
> Any comments?  I'd like to at least get the ball moving on the easy
> bits.

I completely missed your posting of V4 ! I was wondering what was
taking you so long :)

I'll give it a spin & send acks over the next 2 or 3 days.

Cheers,
Ben.

> On Wed, Nov 14, 2018 at 09:22:40AM +0100, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series switches the powerpc port to use the generic swiotlb and
> > noncoherent dma ops, and to use more generic code for the coherent
> > direct mapping, as well as removing a lot of dead code.
> > 
> > As this series is very large and depends on the dma-mapping tree I've
> > also published a git tree:
> > 
> > git://git.infradead.org/users/hch/misc.git powerpc-dma.4
> > 
> > Gitweb:
> > 
> > 
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.4
> > 
> > Changes since v3:
> >  - rebase on the powerpc fixes tree
> >  - add a new patch to actually make the baseline amigaone config
> >configure without warnings
> >  - only use ZONE_DMA for 64-bit embedded CPUs, on pseries an IOMMU is
> >always present
> >  - fix compile in mem.c for one configuration
> >  - drop the full npu removal for now, will be resent separately
> >  - a few git bisection fixes
> > 
> > The changes since v1 are to big to list and v2 was not posted in public.
> > 
> ---end quoted text---



Re: [PATCH 16/33] powerpc/powernv: remove dead npu-dma code

2018-10-14 Thread Benjamin Herrenschmidt
On Mon, 2018-10-15 at 12:34 +1100, Alexey Kardashevskiy wrote:
> On 10/10/2018 00:24, Christoph Hellwig wrote:
> > This code has been unused since it was merged and is in the way of
> > cleaning up the DMA code, thus remove it.
> > 
> > This effectively reverts commit 5d2aa710 ("powerpc/powernv: Add support
> > for Nvlink NPUs").
> 
> 
> This code is heavily used by the NVIDIA GPU driver.

Some of it is, yes. And while I don't want to be involved in the
discussion about that specific can of worms, there is code in this file
related to the custom "always error" DMA ops that I suppose we could
remove, which is what is getting in the way of Christoph's cleanups. It's
just meant as debug stuff to catch incorrect attempts at doing DMA
mappings on the wrong "side" of the GPU.

Cheers,
Ben.




Re: [PATCH 01/33] powerpc: use mm zones more sensibly

2018-10-14 Thread Benjamin Herrenschmidt
On Tue, 2018-10-09 at 15:24 +0200, Christoph Hellwig wrote:
>   * Find the least restrictive zone that is entirely below the
> @@ -324,11 +305,14 @@ void __init paging_init(void)
> printk(KERN_DEBUG "Memory hole size: %ldMB\n",
>(long int)((top_of_ram - total_ram) >> 20));
>  
> +#ifdef CONFIG_ZONE_DMA
> +   max_zone_pfns[ZONE_DMA] = min(max_low_pfn, 0x7fffUL >> 
> PAGE_SHIFT);
> +#endif
> +   max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
>  #ifdef CONFIG_HIGHMEM
> -   limit_zone_pfn(ZONE_NORMAL, lowmem_end_addr >> PAGE_SHIFT);
> +   max_zone_pfns[ZONE_HIGHMEM] = max_pfn
   ^
Missing a  ";" here  --|

Sorry ... works with that fix on an old laptop with highmem.

>  #endif
> -   limit_zone_pfn(TOP_ZONE, top_of_ram >> PAGE_SHIFT);
> -   zone_limits_final = true;
> +
> free_area_init_nodes(max_zone_pfns);
>  



Re: [PATCH 5/5] dma-direct: always allow dma mask <= physiscal memory size

2018-10-08 Thread Benjamin Herrenschmidt
On Wed, 2018-10-03 at 16:10 -0700, Alexander Duyck wrote:
> > -* Because 32-bit DMA masks are so common we expect every 
> > architecture
> > -* to be able to satisfy them - either by not supporting more 
> > physical
> > -* memory, or by providing a ZONE_DMA32.  If neither is the case, 
> > the
> > -* architecture needs to use an IOMMU instead of the direct mapping.
> > -*/
> > -   if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
> > +   u64 min_mask;
> > +
> > +   if (IS_ENABLED(CONFIG_ZONE_DMA))
> > +   min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
> > +   else
> > +   min_mask = DMA_BIT_MASK(32);
> > +
> > +   min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
> > +
> > +   if (mask >= phys_to_dma(dev, min_mask))
> >  return 0;
> > -#endif
> >  return 1;
> >   }
> 
> So I believe I have run into the same issue that Guenter reported. On
> an x86_64 system w/ Intel IOMMU. I wasn't able to complete boot and
> all probe attempts for various devices were failing with -EIO errors.
> 
> I believe the last mask check should be "if (mask < phys_to_dma(dev,
> min_mask))" not a ">=" check.

Right, that test is backwards. I needed to change it here too (powermac
with the rest of the powerpc series).

Cheers,
Ben.




Re: [PATCH] dma-direct: document the zone selection logic

2018-10-08 Thread Benjamin Herrenschmidt
On Mon, 2018-10-08 at 09:03 +0200, Christoph Hellwig wrote:
> Ben, does this resolve your issues with the confusing zone selection?

The comment does make things a tad clearer yes :)

Thanks !

Cheers,
Ben.

> On Mon, Oct 01, 2018 at 01:10:16PM -0700, Christoph Hellwig wrote:
> > What we are doing here isn't quite obvious, so add a comment explaining
> > it.
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  kernel/dma/direct.c | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> > index ba6f5956a291..14b966e2349a 100644
> > --- a/kernel/dma/direct.c
> > +++ b/kernel/dma/direct.c
> > @@ -84,7 +84,14 @@ static gfp_t __dma_direct_optimal_gfp_mask(struct device 
> > *dev, u64 dma_mask,
> > else
> > *phys_mask = dma_to_phys(dev, dma_mask);
> >  
> > -   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */
> > +   /*
> > +* Optimistically try the zone that the physicall address mask falls
> > +* into first.  If that returns memory that isn't actually addressable
> > +* we will fallback to the next lower zone and try again.
> > +*
> > +* Note that GFP_DMA32 and GFP_DMA are no ops without the corresponding
> > +* zones.
> > +*/
> > if (*phys_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> > return GFP_DMA;
> > if (*phys_mask <= DMA_BIT_MASK(32))
> > -- 
> > 2.19.0
> > 
> 
> ---end quoted text---



Re: dma mask related fixups (including full bus_dma_mask support) v2

2018-10-01 Thread Benjamin Herrenschmidt
On Mon, 2018-10-01 at 16:32 +0200, Christoph Hellwig wrote:
> FYI, I've pulled this into the dma-mapping tree to make forward
> progress.  All but patch 4 had formal ACKs, and for that one Robin
> was fine even without an explicit ack.  I'll also send a patch to
> better document the zone selection as it confuses even well versed
> people like Ben.

Thanks. I"ll try to dig out some older systems to test.

Cheers,
Ben.



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-27 Thread Benjamin Herrenschmidt
On Thu, 2018-09-27 at 15:49 +0200, Christoph Hellwig wrote:
> On Thu, Sep 27, 2018 at 11:45:15AM +1000, Benjamin Herrenschmidt wrote:
> > I'm not sure this is entirely right.
> > 
> > Let's say the mask is 30 bits. You will return GFP_DMA32, which will
> > fail if you allocate something above 1G (which is legit for
> > ZONE_DMA32).
> 
> And then we will try GFP_DMA further down in the function:
> 
>   if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>   dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
>   !(gfp & GFP_DMA)) {
>   gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
>   goto again;
>   }
> 
> This is and old optimization from x86, because chances are high that
> GFP_DMA32 will give you suitable memory for the infamous 31-bit
> dma mask devices (at least at boot time) and thus we don't have
> to deplete the tiny ZONE_DMA pool.

I see, it's rather confusing :-) Wouldn't it be better to check against
top of 32-bit memory instead here too ?


Cheers,
Ben.



Re: [PATCH 1/5] dma-mapping: make the get_required_mask method available unconditionally

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> This save some duplication for ia64, and makes the interface more
> general.  In the long run we want each dma_map_ops instance to fill this
> out, but this will take a little more prep work.
> 
> Signed-off-by: Christoph Hellwig 

(For powerpc)

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/ia64/include/asm/dma-mapping.h  |  2 --
>  arch/ia64/include/asm/machvec.h  |  7 ---
>  arch/ia64/include/asm/machvec_init.h |  1 -
>  arch/ia64/include/asm/machvec_sn2.h  |  2 --
>  arch/ia64/pci/pci.c  | 26 --
>  arch/ia64/sn/pci/pci_dma.c   |  4 ++--
>  drivers/base/platform.c  | 13 +++--
>  drivers/pci/controller/vmd.c |  4 
>  include/linux/dma-mapping.h  |  2 --
>  9 files changed, 13 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/dma-mapping.h 
> b/arch/ia64/include/asm/dma-mapping.h
> index 76e4d6632d68..522745ae67bb 100644
> --- a/arch/ia64/include/asm/dma-mapping.h
> +++ b/arch/ia64/include/asm/dma-mapping.h
> @@ -10,8 +10,6 @@
>  #include 
>  #include 
>  
> -#define ARCH_HAS_DMA_GET_REQUIRED_MASK
> -
>  extern const struct dma_map_ops *dma_ops;
>  extern struct ia64_machine_vector ia64_mv;
>  extern void set_iommu_machvec(void);
> diff --git a/arch/ia64/include/asm/machvec.h b/arch/ia64/include/asm/machvec.h
> index 267f4f170191..5133739966bc 100644
> --- a/arch/ia64/include/asm/machvec.h
> +++ b/arch/ia64/include/asm/machvec.h
> @@ -44,7 +44,6 @@ typedef void ia64_mv_kernel_launch_event_t(void);
>  
>  /* DMA-mapping interface: */
>  typedef void ia64_mv_dma_init (void);
> -typedef u64 ia64_mv_dma_get_required_mask (struct device *);
>  typedef const struct dma_map_ops *ia64_mv_dma_get_ops(struct device *);
>  
>  /*
> @@ -127,7 +126,6 @@ extern void machvec_tlb_migrate_finish (struct mm_struct 
> *);
>  #  define platform_global_tlb_purge  ia64_mv.global_tlb_purge
>  #  define platform_tlb_migrate_finishia64_mv.tlb_migrate_finish
>  #  define platform_dma_init  ia64_mv.dma_init
> -#  define platform_dma_get_required_mask ia64_mv.dma_get_required_mask
>  #  define platform_dma_get_ops   ia64_mv.dma_get_ops
>  #  define platform_irq_to_vector ia64_mv.irq_to_vector
>  #  define platform_local_vector_to_irq   ia64_mv.local_vector_to_irq
> @@ -171,7 +169,6 @@ struct ia64_machine_vector {
>   ia64_mv_global_tlb_purge_t *global_tlb_purge;
>   ia64_mv_tlb_migrate_finish_t *tlb_migrate_finish;
>   ia64_mv_dma_init *dma_init;
> - ia64_mv_dma_get_required_mask *dma_get_required_mask;
>   ia64_mv_dma_get_ops *dma_get_ops;
>   ia64_mv_irq_to_vector *irq_to_vector;
>   ia64_mv_local_vector_to_irq *local_vector_to_irq;
> @@ -211,7 +208,6 @@ struct ia64_machine_vector {
>   platform_global_tlb_purge,  \
>   platform_tlb_migrate_finish,\
>   platform_dma_init,  \
> - platform_dma_get_required_mask, \
>   platform_dma_get_ops,   \
>   platform_irq_to_vector, \
>   platform_local_vector_to_irq,   \
> @@ -286,9 +282,6 @@ extern const struct dma_map_ops *dma_get_ops(struct 
> device *);
>  #ifndef platform_dma_get_ops
>  # define platform_dma_get_opsdma_get_ops
>  #endif
> -#ifndef platform_dma_get_required_mask
> -# define  platform_dma_get_required_mask ia64_dma_get_required_mask
> -#endif
>  #ifndef platform_irq_to_vector
>  # define platform_irq_to_vector  __ia64_irq_to_vector
>  #endif
> diff --git a/arch/ia64/include/asm/machvec_init.h 
> b/arch/ia64/include/asm/machvec_init.h
> index 2b32fd06b7c6..2aafb69a3787 100644
> --- a/arch/ia64/include/asm/machvec_init.h
> +++ b/arch/ia64/include/asm/machvec_init.h
> @@ -4,7 +4,6 @@
>  
>  extern ia64_mv_send_ipi_t ia64_send_ipi;
>  extern ia64_mv_global_tlb_purge_t ia64_global_tlb_purge;
> -extern ia64_mv_dma_get_required_mask ia64_dma_get_required_mask;
>  extern ia64_mv_irq_to_vector __ia64_irq_to_vector;
>  extern ia64_mv_local_vector_to_irq __ia64_local_vector_to_irq;
>  extern ia64_mv_pci_get_legacy_mem_t ia64_pci_get_legacy_mem;
> diff --git a/arch/ia64/include/asm/machvec_sn2.h 
> b/arch/ia64/include/asm/machvec_sn2.h
> index ece9fa85be88..b5153d300289 100644
> --- a/arch/ia64/include/asm/machvec_sn2.h
> +++ b/arch/ia64/include/asm/machvec_sn2.h
> @@ -55,7 +55,6 @@ extern ia64_mv_readb_t __sn_readb_relaxed;
>  extern ia64_mv_readw_t __sn_readw_relaxed;
>  extern ia64_mv_readl_t __sn_readl_relaxed;
>  extern ia64_mv_readq_t __sn_readq_relaxed;
> -extern ia64_mv_dma_get_requir

Re: [PATCH 5/5] dma-direct: always allow dma mask <= physiscal memory size

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> This way an architecture with less than 4G of RAM can support dma_mask
> smaller than 32-bit without a ZONE_DMA.  Apparently that is a common
> case on powerpc.

Anything that uses a b43 wifi adapter which has a 31-bit limitation
actually :-)
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/dma/direct.c | 28 
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 64466b7ef67b..d1e103c6b107 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -4,6 +4,7 @@
>   *
>   * DMA operations that map physical memory directly without using an IOMMU.
>   */
> +#include  /* for max_pfn */
>  #include 
>  #include 
>  #include 
> @@ -283,21 +284,24 @@ int dma_direct_map_sg(struct device *dev, struct 
> scatterlist *sgl, int nents,
>   return nents;
>  }
>  
> +/*
> + * Because 32-bit DMA masks are so common we expect every architecture to be
> + * able to satisfy them - either by not supporting more physical memory, or 
> by
> + * providing a ZONE_DMA32.  If neither is the case, the architecture needs to
> + * use an IOMMU instead of the direct mapping.
> + */
>  int dma_direct_supported(struct device *dev, u64 mask)
>  {
> -#ifdef CONFIG_ZONE_DMA
> - if (mask < phys_to_dma(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
> - return 0;
> -#else
> - /*
> -  * Because 32-bit DMA masks are so common we expect every architecture
> -  * to be able to satisfy them - either by not supporting more physical
> -  * memory, or by providing a ZONE_DMA32.  If neither is the case, the
> -  * architecture needs to use an IOMMU instead of the direct mapping.
> -  */
> - if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
> + u64 min_mask;
> +
> + if (IS_ENABLED(CONFIG_ZONE_DMA))
> + min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
> + else
> + min_mask = min_t(u64, DMA_BIT_MASK(32),
> +  (max_pfn - 1) << PAGE_SHIFT);
> +
> + if (mask >= phys_to_dma(dev, min_mask))
>   return 0;

nitpick ... to be completely "correct", I would have written

if (IS_ENABLED(CONFIG_ZONE_DMA))
min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
else
min_mask = DMA_BIT_MASK(32);

min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);

In "theory" it's also ok to have a mask < ZONE_DMA_BITS as long as it's
big enough to fit all memory :-)

> -#endif
>   return 1;
>  }
>  



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> +static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
> +   u64 *phys_mask)
> +{
> +   if (force_dma_unencrypted())
> +   *phys_mask = __dma_to_phys(dev, dma_mask);
> +   else
> +   *phys_mask = dma_to_phys(dev, dma_mask);
> +
> +   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: 
> */
> +   if (*phys_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> +   return GFP_DMA;
> +   if (*phys_mask <= DMA_BIT_MASK(32))
> +   return GFP_DMA32;
> +   return 0;
> +}

I'm not sure this is entirely right.

Let's say the mask is 30 bits. You will return GFP_DMA32, which will
fail if you allocate something above 1G (which is legit for
ZONE_DMA32).

I think the logic should be:

if (mask < ZONE_DMA)
fail;
else if (mask < ZONE_DMA32)
use ZONE_DMA or fail if doesn't exist
else if (mask < top_of_ram)
use ZONE_DMA32 or fail if doesn't exist
else
use ZONE_NORMAL

Additionally, we want to fold in the top-of-ram test so that we don't
fail the second case when the mask is 31 bits (smaller than ZONE_DMA32)
but the top of RAM is also small enough.

So the top-of-ram test should take precedence.
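
A sketch of that ordering, with the top-of-RAM test folded in first
(illustrative only, not the code that was merged; optimal_gfp_for_mask()
is an invented name, while DMA_BIT_MASK, ARCH_ZONE_DMA_BITS and max_pfn
are the symbols used in the quoted patch):

#include <linux/gfp.h>
#include <linux/mm.h>           /* max_pfn */

/* Pick a GFP zone for a coherent allocation given the physical mask the
 * device can reach.  The top-of-RAM test comes first so that a small
 * mask on a small-memory system still gets plain ZONE_NORMAL. */
static gfp_t optimal_gfp_for_mask(u64 phys_mask)
{
        u64 top_of_ram = (u64)(max_pfn - 1) << PAGE_SHIFT;

        if (phys_mask >= top_of_ram)
                return 0;               /* ZONE_NORMAL is fine */
        if (phys_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
                return GFP_DMA;         /* or fail if there is no ZONE_DMA */
        if (phys_mask <= DMA_BIT_MASK(32))
                return GFP_DMA32;       /* or fail if there is no ZONE_DMA32 */
        return 0;
}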

Cheers,
Ben.




Re: [PATCH 2/5] dma-direct: add an explicit dma_direct_get_required_mask

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> This is somewhat modelled after the powerpc version, and differs from
> the legacy fallback in use fls64 instead of pointlessly splitting up the
> address into low and high dwords and in that it takes (__)phys_to_dma
> into account.

This looks like it will be usable if/when we switch powerpc to
dma/direct.c

Acked-by: Benjamin Herrenschmidt 
---
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/dma-direct.h |  1 +
>  kernel/dma/direct.c| 21 ++---
>  2 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index 86a59ba5a7f3..b79496d8c75b 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -55,6 +55,7 @@ static inline void dma_mark_clean(void *addr, size_t size)
>  }
>  #endif /* CONFIG_ARCH_HAS_DMA_MARK_CLEAN */
>  
> +u64 dma_direct_get_required_mask(struct device *dev);
>  void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t 
> *dma_handle,
>   gfp_t gfp, unsigned long attrs);
>  void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index c954f0a6dc62..81b73a5bba54 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -53,11 +53,25 @@ check_addr(struct device *dev, dma_addr_t dma_addr, 
> size_t size,
>   return true;
>  }
>  
> +static inline dma_addr_t phys_to_dma_direct(struct device *dev,
> + phys_addr_t phys)
> +{
> + if (force_dma_unencrypted())
> + return __phys_to_dma(dev, phys);
> + return phys_to_dma(dev, phys);
> +}
> +
> +u64 dma_direct_get_required_mask(struct device *dev)
> +{
> + u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
> +
> + return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
> +}
> +
>  static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t 
> size)
>  {
> - dma_addr_t addr = force_dma_unencrypted() ?
> - __phys_to_dma(dev, phys) : phys_to_dma(dev, phys);
> - return addr + size - 1 <= dev->coherent_dma_mask;
> + return phys_to_dma_direct(dev, phys) + size - 1 <=
> + dev->coherent_dma_mask;
>  }
>  
>  void *dma_direct_alloc_pages(struct device *dev, size_t size,
> @@ -296,6 +310,7 @@ const struct dma_map_ops dma_direct_ops = {
>   .unmap_page = dma_direct_unmap_page,
>   .unmap_sg   = dma_direct_unmap_sg,
>  #endif
> + .get_required_mask  = dma_direct_get_required_mask,
>   .dma_supported  = dma_direct_supported,
>   .mapping_error  = dma_direct_mapping_error,
>   .cache_sync = arch_dma_cache_sync,



Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Thu, 2018-08-23 at 07:24 +0200, Christoph Hellwig wrote:
> > Well, iommus can have bypass regions, which we also use for
> > performance, so we do at dma_set_mask() time "swap" the ops around, and
> > in that case, we do want to check the mask against the actual top of
> > memory...
> 
> That is a bit of a powerpc special case (we also had one other arch
> doing that, but it got removed in the great purge, can't rember which
> one right now).  Everyone else has one set of ops, and they just switch
> to the direct mapping inside the iommu ops.

We more or less do that too in some of ours these days, because of the
whole coherent_mask vs. mask issue, where a given device might need
either depending on the type of mapping.

Ben.



Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:58 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:54:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > We need to take the DMA offset and encryption bit into account when 
> > > selecting
> > > a zone.  Add a helper that takes those into account and use it.
> > 
> > That whole "encryption" stuff seems to be completely specific to the
> > way x86 does memory encryption, or am I mistaken ? It's not clear to me
> > what that does in practice and how it relates to DMA mappings.
> 
> Not even all of x86, but AMD in particular, Intel does it yet another
> way.  But it still is easier to take this into the core with a few
> overrides than duplicating all the code.
> 
> > I'm also not sure about that whole business with ZONE_DMA and
> > ARCH_ZONE_DMA_BITS...
> 
> ZONE_DMA usually (but not always) maps to 24-bits of address space,
> if it doesn't (I mostly through about s390 with it's odd 31-bits)
> the architecture can override it if it cares).
> 
> > On ppc64, unless you enable swiotlb (which we only do currently on
> > some embedded platforms), you have all of memory in ZONE_DMA.
> > 
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x-0x001f]
> > [0.00]   DMA32empty
> > [0.00]   Normal   empty
> > [0.00]   Device   empty
> 
> This is really weird.  Why would you wire up ZONE_DMA like this?

We always did :-) It predates my involvement and I think it predates
even Paul's. It's quite silly actually, since the first powerpc machines
actually had ISA devices in them, but that's how it's been forever. I
suppose we could change it, but that would mean digging out some old
stuff to test.

> The general scheme that architectures should implement is:
> 
> ZONE_DMA: Any memory below a magic threshold that is lower than
>   32-bit.  Only enabled if actually required (usually
>   either 24-bit for ISA, or some other weird architecture
>   specific value like 32-bit for S/390)

It should have been ZONE_ISA_DMA :-)

> ZONE_DMA32:   Memory <= 32-bit if the architecture supports more than
>   32-bits worth of physical address space.  Should generally
>   be enabled on all 64-bit architectures unless you have
>   a very good reason not to.

Yeah, so we sort of enable the config option but only populate the zone
on platforms using swiotlb (freescale stuff). It's a bit messy at the
moment, I must admit.

> ZONE_NORMAL:  Everything above 32-bit not falling into HIGHMEM or
>   MOVEABLE.

Cheers,
Ben.




Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:53 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:44:18AM +1000, Benjamin Herrenschmidt wrote:
> > We do have the occasional device with things like 31-bit DMA
> > limitation. We know they happens to work because those systems
> > can't have enough memory to be a problem. This is why our current
> > DMA direct ops in powerpc just unconditionally return true on ppc32.
> > 
> > The test against a full 32-bit mask here will break them I think.
> > 
> > Thing is, I'm not sure I still have access to one of these things
> > to test, I'll have to dig (from memory things like b43 wifi).
> 
> Yeah, the other platforms that support these devices support ZONE_DMA
> to reliably handle these devices. But there is two other ways the
> current code would actually handle these fine despite the dma_direct
> checks:
> 
>  1) if the device only has physical addresses up to 31-bit anyway
>  2) by trying again to find a lower address.  But this only works
> for coherent allocations and not streaming maps (unless we have
> swiotlb with a buffer below 31-bits).
> 
> It seems powerpc can have ZONE_DMA, though and we will cover these
> devices just fine.  If it didn't have that the current powerpc
> code would not work either.

Not exactly. powerpc has ZONE_DMA covering all of system memory.

What happens in ppc32 is that we somewhat "know" that none of the
systems with those stupid 31-bit limited pieces of HW is capable of
having more than 2GB of memory anyway.

So we get away with just returning "1".

> 
> >  - What is this trying to achieve ?
> > 
> > /*
> >  * Various PCI/PCIe bridges have broken support for > 32bit DMA even
> >  * if the device itself might support it.
> >  */
> > if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
> > return 0;
> > 
> > IE, if the device has a 32-bit limit, we fail an attempt at checking
> > if a >32-bit mask works ? That doesn't quite seem to be the right thing
> > to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?
> > 
> > IE, dma_set_mask() is what a driver uses to establish the device capability,
> > so it makes sense tot have dma_32bit_limit just reduce that capability, not
> > fail because the device can do more than what the bridge can 
> 
> If your PCI bridge / PCIe root port doesn't support dma to addresses
> larger than 32-bit the device capabilities above that don't matter, it
> just won't work.  We have this case at least for some old VIA x86 chipsets
> and some relatively modern Xilinx FPGAs with PCIe.

Hrm... that's the usual confusion between dma_capable() and dma_set_mask().

It's always been perfectly fine for a driver to do a dma_set_mask(64-bit)
on a system where the bridge can only do 32-bits ...

We shouldn't fail there, we should instead "clamp" the mask to 32-bit,
see what I mean ? It doesn't matter that the device itself is capable
of issuing >32-bit addresses, I agree, but what we need to express is that
the combination device+bridge doesn't want addresses above 32-bit, so
it's equivalent to making the device do a set_mask(32-bit).

This will succeed if the system can limit the addresses (for example
because memory is never above 32-bit) and will fail if the system
can't.

So that's equivalent of writing

if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
mask = phys_to_dma(dev, DMA_BIT_MASK(32)); 

Effectively meaning "don't give me addresses above 32-bit".

Still, your code doesn't check the mask against the memory size, which
means it will fail for 32-bit masks even on systems that do not have
memory above 4G.
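
A sketch of the clamping being suggested, placed in a dma_set_mask()-style
path (illustrative only; clamped_dma_set_mask() is an invented name, while
dev->dma_32bit_limit, phys_to_dma() and DMA_BIT_MASK() are the symbols
quoted above):

/* Instead of failing dma_supported() outright when the bridge tops out
 * at 32-bit, reduce the effective mask first and then apply the usual
 * "does this cover memory" check. */
static int clamped_dma_set_mask(struct device *dev, u64 mask)
{
        if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
                mask = phys_to_dma(dev, DMA_BIT_MASK(32));

        if (!dma_supported(dev, mask))
                return -EIO;

        *dev->dma_mask = mask;
        return 0;
}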

> >  - How is that file supposed to work on 64-bit platforms ? From what I can
> > tell, dma_supported() will unconditionally return true if the mask is
> > 32-bit or larger (appart from the above issue). This doesn't look right,
> > the mask needs to be compared to the max memory address. There are a bunch
> > of devices out there with masks anywhere bettween 40 and 64 bits, and
> > some of these will not work "out of the box" if the offseted top
> > of memory is beyond the mask limit. Or am I missing something ?
> 
> Your are not missing anything except for the history of this code.
> 
> Your observation is right, but there always has been the implicit
> assumption that architectures with more than 4GB of physical address
> space must either support and iommu or swiotlb and use that.  It's
> never been document anywhere, but I'm working on integrating all
> this code to make more sense.

Well, iommus can have bypass regions, which we also use for
performance, so we do at dma_set_mask() time "swap" the ops around, and
in that case, we do want to check the mask against the actual top of
memory...

Cheers,
Ben.




Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:45 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:01:16AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> > > It turns out cxl actually uses it.  So for now skip this patch,
> > > although random code in drivers messing with dma ops will need to
> > > be sorted out sooner or later.
> > 
> > CXL devices are "special", they bypass the classic iommu in favor of
> > allowing the device to operate using the main processor page tables
> > using an MMU context (so basically the device can use userspace
> > addresses directly), akin to ATS.
> > 
> > I think the code currently uses the nommu ops as a way to do a simple
> > kernel mapping for kernel drivers using CXL (not userspace stuff)
> > though.
> 
> Its still a horrible idea to have this in drivers/, we need some
> core API to mediate this behavior.  Also if the device supports
> using virtual addresses dma_nommu_ops seems wrong as it won't do
> the right thing for e.g. vmalloc addresses not mapped into the
> kernel linear mapping (which I guess can't currently happen on
> powerpc, but still..)

You are right, it won't do the right thing, but neither will the
standard DMA ops, will they ? Drivers know not to try to dma_map vmalloc
addresses without first getting the underlying page, nothing unusual
there.

Yes I agree having this in drivers somewhat sucks though.

Cheers,
Ben.



Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 09:02 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:27:46AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > The requirement to disable local irqs over kmap_atomic is long gone,
> > > so remove those calls.
> > 
> > Really ? I'm trying to verify that and getting lost in a mess of macros
> > from hell in the per-cpu stuff but if you look at our implementation
> > of kmap_atomic_prot(), all it does is a preempt_disable(), and then
> > it uses kmap_atomic_idx_push():
> > 
> > int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
> > 
> > Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
> > ie this is the non-interrupt safe version...
> 
> Looks like the powerpc variant indeed isn't save.
> 
> I did look a bit more through the code and history, and it seems
> like we remove the need to disable irqs when called from process
> context a while ago, but we still require disabling irqs when called
> from irq context.  Given that this code can also be called from
> irq context we'll have to keep the local_irq_save.

This is the same with x86, no ?

32-bit x86 kmap_atomic_prot is the same as ours...

In fact I wonder why the preempt_disable() is in there, since it needs to
be protected against interrupts ?

Or is it that we never actually call kmap_atomic_* these days from
interrupt, and the atomic versions are just about dealing with
spinlocks ?

Cheers,
Ben.
 



Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Thu, 2018-08-09 at 10:54 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > These are identical to the arch specific ones, so remove them.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Acked-by: Benjamin Herrenschmidt 

Note: We will still need to implement some custom variant of this
for our secure VMs ... 

Basically we'll need to use the existing bounce buffering as-is, but
the condition will be different: it won't be whether the address is
below a certain limit, it will be *always*.
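
A sketch of what that different condition could look like in the
swiotlb mapping path (illustrative only; needs_bounce() is an invented
name and force_dma_bounce() is a hypothetical stand-in for whatever hook
the secure-VM support ends up using, while dma_capable() and
phys_to_dma() are existing helpers):

/* Decide whether a streaming mapping has to go through the bounce
 * buffer.  For a secure VM the answer is unconditionally yes, because
 * only the (shared) swiotlb pool is visible to the host. */
static bool needs_bounce(struct device *dev, phys_addr_t paddr, size_t size)
{
        if (force_dma_bounce())         /* secure VM: always bounce */
                return true;

        /* Normal case: bounce only if the device can't address the page. */
        return !dma_capable(dev, phys_to_dma(dev, paddr), size);
}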

Cheers,
Ben.

> > ---
> >  arch/powerpc/include/asm/dma-direct.h |  4 
> >  arch/powerpc/include/asm/swiotlb.h|  2 --
> >  arch/powerpc/kernel/dma-swiotlb.c | 28 ++-
> >  arch/powerpc/sysdev/fsl_pci.c |  2 +-
> >  4 files changed, 7 insertions(+), 29 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/dma-direct.h 
> > b/arch/powerpc/include/asm/dma-direct.h
> > index 0fba19445ae8..657f84ddb20d 100644
> > --- a/arch/powerpc/include/asm/dma-direct.h
> > +++ b/arch/powerpc/include/asm/dma-direct.h
> > @@ -30,4 +30,8 @@ static inline phys_addr_t __dma_to_phys(struct device 
> > *dev, dma_addr_t daddr)
> > return daddr - PCI_DRAM_OFFSET;
> > return daddr - dev->archdata.dma_offset;
> >  }
> > +
> > +u64 swiotlb_powerpc_get_required(struct device *dev);
> > +#define swiotlb_get_required_mask swiotlb_powerpc_get_required
> > +
> >  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> > diff --git a/arch/powerpc/include/asm/swiotlb.h 
> > b/arch/powerpc/include/asm/swiotlb.h
> > index f65ecf57b66c..1d8c1da26ab3 100644
> > --- a/arch/powerpc/include/asm/swiotlb.h
> > +++ b/arch/powerpc/include/asm/swiotlb.h
> > @@ -13,8 +13,6 @@
> >  
> >  #include 
> >  
> > -extern const struct dma_map_ops powerpc_swiotlb_dma_ops;
> > -
> >  extern unsigned int ppc_swiotlb_enable;
> >  int __init swiotlb_setup_bus_notifier(void);
> >  
> > diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> > b/arch/powerpc/kernel/dma-swiotlb.c
> > index 25986fcd1e5e..0c269de61f39 100644
> > --- a/arch/powerpc/kernel/dma-swiotlb.c
> > +++ b/arch/powerpc/kernel/dma-swiotlb.c
> > @@ -24,7 +24,7 @@
> >  
> >  unsigned int ppc_swiotlb_enable;
> >  
> > -static u64 swiotlb_powerpc_get_required(struct device *dev)
> > +u64 swiotlb_powerpc_get_required(struct device *dev)
> >  {
> > u64 end, mask, max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
> >  
> > @@ -38,30 +38,6 @@ static u64 swiotlb_powerpc_get_required(struct device 
> > *dev)
> > return mask;
> >  }
> >  
> > -/*
> > - * At the moment, all platforms that use this code only require
> > - * swiotlb to be used if we're operating on HIGHMEM.  Since
> > - * we don't ever call anything other than map_sg, unmap_sg,
> > - * map_page, and unmap_page on highmem, use normal dma_ops
> > - * for everything else.
> > - */
> > -const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> > -   .alloc = dma_direct_alloc,
> > -   .free = dma_direct_free,
> > -   .mmap = dma_nommu_mmap_coherent,
> > -   .map_sg = swiotlb_map_sg_attrs,
> > -   .unmap_sg = swiotlb_unmap_sg_attrs,
> > -   .dma_supported = swiotlb_dma_supported,
> > -   .map_page = swiotlb_map_page,
> > -   .unmap_page = swiotlb_unmap_page,
> > -   .sync_single_for_cpu = swiotlb_sync_single_for_cpu,
> > -   .sync_single_for_device = swiotlb_sync_single_for_device,
> > -   .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
> > -   .sync_sg_for_device = swiotlb_sync_sg_for_device,
> > -   .mapping_error = swiotlb_dma_mapping_error,
> > -   .get_required_mask = swiotlb_powerpc_get_required,
> > -};
> > -
> >  void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
> >  {
> > struct pci_controller *hose;
> > @@ -88,7 +64,7 @@ static int ppc_swiotlb_bus_notify(struct notifier_block 
> > *nb,
> >  
> > /* May need to bounce if the device can't address all of DRAM */
> > if ((dma_get_mask(dev) + 1) < memblock_end_of_DRAM())
> > -   set_dma_ops(dev, &powerpc_swiotlb_dma_ops);
> > +   set_dma_ops(dev, &swiotlb_dma_ops);
> >  
> > return NOTIFY_DONE;
> >  }
> > diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
> > index 918be816b097..daf44bc0108d 100644
> > --- a/arch/powerpc/sysdev/fsl_pci.c
> > +++ b/arch/powerpc/sysdev/fsl_pci.c
> > @@ -118,7 +118,7 @@ static void setup_swiotlb_ops(struct pci_controller 
> > *hose)
> >  {
> > if (ppc_swiotlb_enable) {
> > hose->controller_ops.dma_dev_setup = pci_dma_dev_setup_swiotlb;
> > -   set_pci_dma_ops(&powerpc_swiotlb_dma_ops);
> > +   set_pci_dma_ops(&swiotlb_dma_ops);
> > }
> >  }
> >  #else



Re: [PATCH 20/20] powerpc/dma: remove dma_nommu_mmap_coherent

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The remaining implementation for coherent caches is functionally
> identical to the default provided in common code.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h |  7 ---
>  arch/powerpc/kernel/dma-iommu.c|  1 -
>  arch/powerpc/kernel/dma.c  | 13 -
>  arch/powerpc/platforms/pseries/vio.c   |  1 -
>  4 files changed, 22 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 879c4efba785..e62e23aa3714 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -18,13 +18,6 @@
>  #include 
>  #include 
>  
> -/* Some dma direct funcs must be visible for use in other dma_ops */
> -extern int dma_nommu_mmap_coherent(struct device *dev,
> - struct vm_area_struct *vma,
> - void *cpu_addr, dma_addr_t handle,
> - size_t size, unsigned long attrs);
> -
> -
>  static inline unsigned long device_to_mask(struct device *dev)
>  {
>   if (dev->dma_mask && *dev->dma_mask)
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index f9fe2080ceb9..bf5234e1f71b 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -114,7 +114,6 @@ int dma_iommu_mapping_error(struct device *dev, 
> dma_addr_t dma_addr)
>  struct dma_map_ops dma_iommu_ops = {
>   .alloc  = dma_iommu_alloc_coherent,
>   .free   = dma_iommu_free_coherent,
> - .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_iommu_map_sg,
>   .unmap_sg   = dma_iommu_unmap_sg,
>   .dma_supported  = dma_iommu_dma_supported,
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 08b12cbd7abf..5b71c9d1b8cc 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -70,18 +70,6 @@ static void dma_nommu_free_coherent(struct device *dev, 
> size_t size,
>   iommu_free_coherent(iommu, size, vaddr, dma_handle);
>  }
>  
> -int dma_nommu_mmap_coherent(struct device *dev, struct vm_area_struct *vma,
> -  void *cpu_addr, dma_addr_t handle, size_t size,
> -  unsigned long attrs)
> -{
> - unsigned long pfn = page_to_pfn(virt_to_page(cpu_addr));
> -
> - return remap_pfn_range(vma, vma->vm_start,
> -pfn + vma->vm_pgoff,
> -vma->vm_end - vma->vm_start,
> -vma->vm_page_prot);
> -}
> -
>  /* note: needs to be called arch_get_required_mask for dma-noncoherent.c */
>  u64 arch_get_required_mask(struct device *dev)
>  {
> @@ -98,7 +86,6 @@ u64 arch_get_required_mask(struct device *dev)
>  const struct dma_map_ops dma_nommu_ops = {
>   .alloc  = dma_nommu_alloc_coherent,
>   .free   = dma_nommu_free_coherent,
> - .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_direct_map_sg,
>   .map_page   = dma_direct_map_page,
>   .get_required_mask  = arch_get_required_mask,
> diff --git a/arch/powerpc/platforms/pseries/vio.c 
> b/arch/powerpc/platforms/pseries/vio.c
> index 49e04ec19238..51d564313bd0 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -618,7 +618,6 @@ static u64 vio_dma_get_required_mask(struct device *dev)
>  static const struct dma_map_ops vio_dma_mapping_ops = {
>   .alloc = vio_dma_iommu_alloc_coherent,
>   .free  = vio_dma_iommu_free_coherent,
> - .mmap  = dma_nommu_mmap_coherent,
>   .map_sg= vio_dma_iommu_map_sg,
>   .unmap_sg  = vio_dma_iommu_unmap_sg,
>   .map_page  = vio_dma_iommu_map_page,



Re: [PATCH 18/20] powerpc/dma-noncoherent: use generic dma_noncoherent_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The generic dma-noncoherent code provides all that is needed by powerpc.
> 
> Note that the cache maintainance in the existing code is a bit odd
> as it implements both the sync_to_device and sync_to_cpu callouts,
> but never flushes caches when unmapping.  This patch keeps both
> directions arounds, which will lead to more flushing than the previous
> implementation.  Someone more familar with the required CPUs should
> eventually take a look and optimize the cache flush handling if needed.

The original code looks bogus indeed.

I think we got away with it because those older CPUs wouldn't speculate
or prefetch aggressively enough (or at all), so the flush on map was
sufficient: the data wouldn't come back into the cache.

But safe is better than sorry, so ... tentative Ack. I do need to try to
dig out one of these things to test, which might take a while.
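
For illustration, a sketch of the two-sided maintenance the generic
dma-noncoherent code ends up doing, versus the old flush-on-map-only
behaviour; clean_range() and inval_range() are hypothetical stand-ins
for the powerpc cache primitives, not the real helpers:

/* Hypothetical stand-ins for the arch cache primitives. */
void clean_range(void *vaddr, size_t size);     /* write back dirty lines */
void inval_range(void *vaddr, size_t size);     /* discard cached lines   */

/* Before the device uses the buffer: make sure it sees current data and
 * that no dirty line can later be written back over DMA'd data. */
static void sync_for_device(void *vaddr, size_t size,
                            enum dma_data_direction dir)
{
        clean_range(vaddr, size);
}

/* After the device wrote to the buffer: drop anything a speculative
 * prefetch may have pulled back in meanwhile.  Old non-speculating
 * cores never hit this, which is why flush-on-map alone used to work. */
static void sync_for_cpu(void *vaddr, size_t size,
                         enum dma_data_direction dir)
{
        if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
                inval_range(vaddr, size);
}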

Acked-by: Benjamin Herrenschmidt 


> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/Kconfig   |  2 +-
>  arch/powerpc/include/asm/dma-mapping.h | 29 -
>  arch/powerpc/kernel/dma.c  | 59 +++---
>  arch/powerpc/kernel/pci-common.c   |  5 ++-
>  arch/powerpc/kernel/setup-common.c |  4 ++
>  arch/powerpc/mm/dma-noncoherent.c  | 52 +--
>  arch/powerpc/platforms/44x/warp.c  |  2 +-
>  arch/powerpc/platforms/Kconfig.cputype |  6 ++-
>  8 files changed, 60 insertions(+), 99 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index bbfa6a8df4da..33c6017ffce6 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -129,7 +129,7 @@ config PPC
>   # Please keep this list sorted alphabetically.
>   #
>   select ARCH_HAS_DEVMEM_IS_ALLOWED
> - select ARCH_HAS_DMA_SET_COHERENT_MASK
> + select ARCH_HAS_DMA_SET_COHERENT_MASK if !NOT_COHERENT_CACHE
>   select ARCH_HAS_ELF_RANDOMIZE
>   select ARCH_HAS_FORTIFY_SOURCE
>   select ARCH_HAS_GCOV_PROFILE_ALL
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index f0bf7ac2686c..879c4efba785 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -19,40 +19,11 @@
>  #include 
>  
>  /* Some dma direct funcs must be visible for use in other dma_ops */
> -extern void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> -  dma_addr_t *dma_handle, gfp_t flag,
> -  unsigned long attrs);
> -extern void __dma_nommu_free_coherent(struct device *dev, size_t size,
> -void *vaddr, dma_addr_t dma_handle,
> -unsigned long attrs);
>  extern int dma_nommu_mmap_coherent(struct device *dev,
>   struct vm_area_struct *vma,
>   void *cpu_addr, dma_addr_t handle,
>   size_t size, unsigned long attrs);
>  
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> -/*
> - * DMA-consistent mapping functions for PowerPCs that don't support
> - * cache snooping.  These allocate/free a region of uncached mapped
> - * memory space for use with DMA devices.  Alternatively, you could
> - * allocate the space "normally" and use the cache management functions
> - * to ensure it is consistent.
> - */
> -struct device;
> -extern void __dma_sync(void *vaddr, size_t size, int direction);
> -extern void __dma_sync_page(struct page *page, unsigned long offset,
> -  size_t size, int direction);
> -extern unsigned long __dma_get_coherent_pfn(unsigned long cpu_addr);
> -
> -#else /* ! CONFIG_NOT_COHERENT_CACHE */
> -/*
> - * Cache coherent cores.
> - */
> -
> -#define __dma_sync(addr, size, rw)   ((void)0)
> -#define __dma_sync_page(pg, off, sz, rw) ((void)0)
> -
> -#endif /* ! CONFIG_NOT_COHERENT_CACHE */
>  
>  static inline unsigned long device_to_mask(struct device *dev)
>  {
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 2b90a403cdac..b2e88075b2ea 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -36,12 +36,7 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>* we can really use the direct ops
>*/
>   if (dma_direct_supported(dev, dev->coherent_dma_mask))
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - return __dma_nommu_alloc_coherent(dev, size, dma_handle,
> -flag, attrs);
> -#else
>   return dma_

Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These are identical to the arch specific ones, so remove them.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-direct.h |  4 
>  arch/powerpc/include/asm/swiotlb.h|  2 --
>  arch/powerpc/kernel/dma-swiotlb.c | 28 ++-
>  arch/powerpc/sysdev/fsl_pci.c |  2 +-
>  4 files changed, 7 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-direct.h 
> b/arch/powerpc/include/asm/dma-direct.h
> index 0fba19445ae8..657f84ddb20d 100644
> --- a/arch/powerpc/include/asm/dma-direct.h
> +++ b/arch/powerpc/include/asm/dma-direct.h
> @@ -30,4 +30,8 @@ static inline phys_addr_t __dma_to_phys(struct device *dev, 
> dma_addr_t daddr)
>   return daddr - PCI_DRAM_OFFSET;
>   return daddr - dev->archdata.dma_offset;
>  }
> +
> +u64 swiotlb_powerpc_get_required(struct device *dev);
> +#define swiotlb_get_required_mask swiotlb_powerpc_get_required
> +
>  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> diff --git a/arch/powerpc/include/asm/swiotlb.h 
> b/arch/powerpc/include/asm/swiotlb.h
> index f65ecf57b66c..1d8c1da26ab3 100644
> --- a/arch/powerpc/include/asm/swiotlb.h
> +++ b/arch/powerpc/include/asm/swiotlb.h
> @@ -13,8 +13,6 @@
>  
>  #include 
>  
> -extern const struct dma_map_ops powerpc_swiotlb_dma_ops;
> -
>  extern unsigned int ppc_swiotlb_enable;
>  int __init swiotlb_setup_bus_notifier(void);
>  
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> b/arch/powerpc/kernel/dma-swiotlb.c
> index 25986fcd1e5e..0c269de61f39 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -24,7 +24,7 @@
>  
>  unsigned int ppc_swiotlb_enable;
>  
> -static u64 swiotlb_powerpc_get_required(struct device *dev)
> +u64 swiotlb_powerpc_get_required(struct device *dev)
>  {
>   u64 end, mask, max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
>  
> @@ -38,30 +38,6 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
>   return mask;
>  }
>  
> -/*
> - * At the moment, all platforms that use this code only require
> - * swiotlb to be used if we're operating on HIGHMEM.  Since
> - * we don't ever call anything other than map_sg, unmap_sg,
> - * map_page, and unmap_page on highmem, use normal dma_ops
> - * for everything else.
> - */
> -const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> - .alloc = dma_direct_alloc,
> - .free = dma_direct_free,
> - .mmap = dma_nommu_mmap_coherent,
> - .map_sg = swiotlb_map_sg_attrs,
> - .unmap_sg = swiotlb_unmap_sg_attrs,
> - .dma_supported = swiotlb_dma_supported,
> - .map_page = swiotlb_map_page,
> - .unmap_page = swiotlb_unmap_page,
> - .sync_single_for_cpu = swiotlb_sync_single_for_cpu,
> - .sync_single_for_device = swiotlb_sync_single_for_device,
> - .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
> - .sync_sg_for_device = swiotlb_sync_sg_for_device,
> - .mapping_error = swiotlb_dma_mapping_error,
> - .get_required_mask = swiotlb_powerpc_get_required,
> -};
> -
>  void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
>  {
>   struct pci_controller *hose;
> @@ -88,7 +64,7 @@ static int ppc_swiotlb_bus_notify(struct notifier_block *nb,
>  
>   /* May need to bounce if the device can't address all of DRAM */
>   if ((dma_get_mask(dev) + 1) < memblock_end_of_DRAM())
> - set_dma_ops(dev, &powerpc_swiotlb_dma_ops);
> + set_dma_ops(dev, &swiotlb_dma_ops);
>  
>   return NOTIFY_DONE;
>  }
> diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
> index 918be816b097..daf44bc0108d 100644
> --- a/arch/powerpc/sysdev/fsl_pci.c
> +++ b/arch/powerpc/sysdev/fsl_pci.c
> @@ -118,7 +118,7 @@ static void setup_swiotlb_ops(struct pci_controller *hose)
>  {
>   if (ppc_swiotlb_enable) {
>   hose->controller_ops.dma_dev_setup = pci_dma_dev_setup_swiotlb;
> - set_pci_dma_ops(&powerpc_swiotlb_dma_ops);
> + set_pci_dma_ops(&swiotlb_dma_ops);
>   }
>  }
>  #else

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 16/20] powerpc/dma: use dma_direct_{alloc,free}

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These do the same functionality as the existing helpers, but do it
> simpler, and also allow the (optional) use of CMA.
> 
> Note that the swiotlb code now calls into the dma_direct code directly,
> given that it doesn't work with noncoherent caches at all, and isn't called
> when we have an iommu either, so the iommu special case in
> dma_nommu_alloc_coherent isn't required for swiotlb.

I am not convinced that this will produce the same results due to
the way the zone picking works.

As for the interaction with swiotlb, we'll need the FSL guys to have
a look. Scott, do you remember what this is about ?

> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/include/asm/pgtable.h |  1 -
>  arch/powerpc/kernel/dma-swiotlb.c  |  4 +-
>  arch/powerpc/kernel/dma.c  | 78 --
>  arch/powerpc/mm/mem.c  | 19 
>  4 files changed, 11 insertions(+), 91 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 14c79a7dc855..123de4958d2e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -38,7 +38,6 @@ extern unsigned long empty_zero_page[];
>  extern pgd_t swapper_pg_dir[];
>  
>  void limit_zone_pfn(enum zone_type zone, unsigned long max_pfn);
> -int dma_pfn_limit_to_zone(u64 pfn_limit);
>  extern void paging_init(void);
>  
>  /*
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> b/arch/powerpc/kernel/dma-swiotlb.c
> index f6e0701c5303..25986fcd1e5e 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -46,8 +46,8 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
>   * for everything else.
>   */
>  const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> - .alloc = __dma_nommu_alloc_coherent,
> - .free = __dma_nommu_free_coherent,
> + .alloc = dma_direct_alloc,
> + .free = dma_direct_free,
>   .mmap = dma_nommu_mmap_coherent,
>   .map_sg = swiotlb_map_sg_attrs,
>   .unmap_sg = swiotlb_unmap_sg_attrs,
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 2cfc45acbb52..2b90a403cdac 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -26,75 +26,6 @@
>   * can set archdata.dma_data to an unsigned long holding the offset. By
>   * default the offset is PCI_DRAM_OFFSET.
>   */
> -
> -static u64 __maybe_unused get_pfn_limit(struct device *dev)
> -{
> - u64 pfn = (dev->coherent_dma_mask >> PAGE_SHIFT) + 1;
> - struct dev_archdata __maybe_unused *sd = &dev->archdata;
> -
> -#ifdef CONFIG_SWIOTLB
> - if (sd->max_direct_dma_addr && dev->dma_ops == &powerpc_swiotlb_dma_ops)
> - pfn = min_t(u64, pfn, sd->max_direct_dma_addr >> PAGE_SHIFT);
> -#endif
> -
> - return pfn;
> -}
> -
> -#ifndef CONFIG_NOT_COHERENT_CACHE
> -void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> -   dma_addr_t *dma_handle, gfp_t flag,
> -   unsigned long attrs)
> -{
> - void *ret;
> - struct page *page;
> - int node = dev_to_node(dev);
> -#ifdef CONFIG_FSL_SOC
> - u64 pfn = get_pfn_limit(dev);
> - int zone;
> -
> - /*
> -  * This code should be OK on other platforms, but we have drivers that
> -  * don't set coherent_dma_mask. As a workaround we just ifdef it. This
> -  * whole routine needs some serious cleanup.
> -  */
> -
> - zone = dma_pfn_limit_to_zone(pfn);
> - if (zone < 0) {
> - dev_err(dev, "%s: No suitable zone for pfn %#llx\n",
> - __func__, pfn);
> - return NULL;
> - }
> -
> - switch (zone) {
> - case ZONE_DMA:
> - flag |= GFP_DMA;
> - break;
> -#ifdef CONFIG_ZONE_DMA32
> - case ZONE_DMA32:
> - flag |= GFP_DMA32;
> - break;
> -#endif
> - };
> -#endif /* CONFIG_FSL_SOC */
> -
> - page = alloc_pages_node(node, flag, get_order(size));
> - if (page == NULL)
> - return NULL;
> - ret = page_address(page);
> - memset(ret, 0, size);
> - *dma_handle = phys_to_dma(dev,__pa(ret));
> -
> - return ret;
> -}
> -
> -void __dma_nommu_free_coherent(struct device *dev, size_t size,
> - void *vaddr, dma_addr_t dma_handle,
> - unsigned long attrs)
> -{
> - free_pages((unsigned long)vaddr, get_order(size));
> -}
> -#endif /* !CONFIG_NOT_COHERENT_CACHE */
> -
>  static void *dma_nommu_alloc_coherent(struct device *dev, size_t size,
>  dma_addr_t *dma_handle, gfp_t flag,
>  unsigned long attrs)
> @@ -105,8 +36,12 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>* we can really use the direct ops
>*/
>   if (dma_direct_supported(dev, dev->coherent_dma_mask))
> 

Re: [PATCH 15/20] powerpc/dma: remove the unused unmap_page and unmap_sg methods

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These methods are optional to start with, no need to implement no-op
> versions.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/kernel/dma.c | 16 
>  1 file changed, 16 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 511a4972560d..2cfc45acbb52 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -178,12 +178,6 @@ static int dma_nommu_map_sg(struct device *dev, struct 
> scatterlist *sgl,
>   return nents;
>  }
>  
> -static void dma_nommu_unmap_sg(struct device *dev, struct scatterlist *sg,
> - int nents, enum dma_data_direction direction,
> - unsigned long attrs)
> -{
> -}
> -
>  static u64 dma_nommu_get_required_mask(struct device *dev)
>  {
>   u64 end, mask;
> @@ -209,14 +203,6 @@ static inline dma_addr_t dma_nommu_map_page(struct 
> device *dev,
>   return phys_to_dma(dev, page_to_phys(page)) + offset;
>  }
>  
> -static inline void dma_nommu_unmap_page(struct device *dev,
> -  dma_addr_t dma_address,
> -  size_t size,
> -  enum dma_data_direction direction,
> -  unsigned long attrs)
> -{
> -}
> -
>  #ifdef CONFIG_NOT_COHERENT_CACHE
>  static inline void dma_nommu_sync_sg(struct device *dev,
>   struct scatterlist *sgl, int nents,
> @@ -242,10 +228,8 @@ const struct dma_map_ops dma_nommu_ops = {
>   .free   = dma_nommu_free_coherent,
>   .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_nommu_map_sg,
> - .unmap_sg   = dma_nommu_unmap_sg,
>   .dma_supported  = dma_direct_supported,
>   .map_page   = dma_nommu_map_page,
> - .unmap_page = dma_nommu_unmap_page,
>   .get_required_mask  = dma_nommu_get_required_mask,
>  #ifdef CONFIG_NOT_COHERENT_CACHE
>   .sync_single_for_cpu= dma_nommu_sync_single,

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 14/20] powerpc/dma: replace dma_nommu_dma_supported with dma_direct_supported

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The ppc32 case of dma_nommu_dma_supported already was a no-op, and the
> 64-bit case came to the same conclusion as dma_direct_supported, so
> replace it with the generic version.

It's not at all equivalent (see my review on your earlier patch) or
am I missing something ?

 - ppc32 always returns 1, but dma_direct_supported() will not for
devices with a <32-bit mask (and yes, ppc32 isn't quite right to do
so, it should check against memory size, but in practice it worked
as the only limited devices we deal with on systems we still support
have a 31-bit limitation)

 - ppc64 needs to check against the end of DRAM, as some devices will
fail the check; dma_direct_supported() doesn't seem to be doing that.

Also as I mentioned, I'm not sure about the business with ZONE_DMA,
and that arbitrary 24-bit limit since our entire memory is in ZONE_DMA
but that's a different can of worms I suppose.
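
For reference, the ppc64 check above amounts to something like the sketch
below (illustrative only, not code from the patch or this thread):

/*
 * Sketch: a mask is only usable for direct DMA if it covers the DMA
 * address of the last byte of RAM, taking any DMA offset into account.
 */
static bool mask_covers_all_ram(struct device *dev, u64 mask)
{
        return mask >= phys_to_dma(dev, memblock_end_of_DRAM() - 1);
}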

> Signed-off-by: Christoph Hellwig 


> ---
>  arch/powerpc/Kconfig  |  1 +
>  arch/powerpc/kernel/dma.c | 28 +++-
>  2 files changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index f9cae7edd735..bbfa6a8df4da 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -158,6 +158,7 @@ config PPC
>   select CLONE_BACKWARDS
>   select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
>   select DYNAMIC_FTRACE   if FUNCTION_TRACER
> + select DMA_DIRECT_OPS
>   select EDAC_ATOMIC_SCRUB
>   select EDAC_SUPPORT
>   select GENERIC_ATOMIC64 if PPC32
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 3487de83bb37..511a4972560d 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -40,28 +40,6 @@ static u64 __maybe_unused get_pfn_limit(struct device *dev)
>   return pfn;
>  }
>  
> -static int dma_nommu_dma_supported(struct device *dev, u64 mask)
> -{
> -#ifdef CONFIG_PPC64
> - u64 limit = phys_to_dma(dev, (memblock_end_of_DRAM() - 1));
> -
> - /* Limit fits in the mask, we are good */
> - if (mask >= limit)
> - return 1;
> -
> -#ifdef CONFIG_FSL_SOC
> - /* Freescale gets another chance via ZONE_DMA/ZONE_DMA32, however
> -  * that will have to be refined if/when they support iommus
> -  */
> - return 1;
> -#endif
> - /* Sorry ... */
> - return 0;
> -#else
> - return 1;
> -#endif
> -}
> -
>  #ifndef CONFIG_NOT_COHERENT_CACHE
>  void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t flag,
> @@ -126,7 +104,7 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>   /* The coherent mask may be smaller than the real mask, check if
>* we can really use the direct ops
>*/
> - if (dma_nommu_dma_supported(dev, dev->coherent_dma_mask))
> + if (dma_direct_supported(dev, dev->coherent_dma_mask))
>   return __dma_nommu_alloc_coherent(dev, size, dma_handle,
>  flag, attrs);
>  
> @@ -148,7 +126,7 @@ static void dma_nommu_free_coherent(struct device *dev, 
> size_t size,
>   struct iommu_table *iommu;
>  
>   /* See comments in dma_nommu_alloc_coherent() */
> - if (dma_nommu_dma_supported(dev, dev->coherent_dma_mask))
> + if (dma_direct_supported(dev, dev->coherent_dma_mask))
>   return __dma_nommu_free_coherent(dev, size, vaddr, dma_handle,
> attrs);
>   /* Maybe we used an iommu ... */
> @@ -265,7 +243,7 @@ const struct dma_map_ops dma_nommu_ops = {
>   .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_nommu_map_sg,
>   .unmap_sg   = dma_nommu_unmap_sg,
> - .dma_supported  = dma_nommu_dma_supported,
> + .dma_supported  = dma_direct_supported,
>   .map_page   = dma_nommu_map_page,
>   .unmap_page = dma_nommu_unmap_page,
>   .get_required_mask  = dma_nommu_get_required_mask,

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 13/20] powerpc/dma: remove get_dma_offset

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Just fold the calculation into __phys_to_dma/__dma_to_phys as those are
> the only places that should know about it.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-direct.h  |  8 ++--
>  arch/powerpc/include/asm/dma-mapping.h | 16 
>  2 files changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-direct.h 
> b/arch/powerpc/include/asm/dma-direct.h
> index 7702875aabb7..0fba19445ae8 100644
> --- a/arch/powerpc/include/asm/dma-direct.h
> +++ b/arch/powerpc/include/asm/dma-direct.h
> @@ -19,11 +19,15 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>  
>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>  {
> - return paddr + get_dma_offset(dev);
> + if (!dev)
> + return paddr + PCI_DRAM_OFFSET;
> + return paddr + dev->archdata.dma_offset;
>  }
>  
>  static inline phys_addr_t __dma_to_phys(struct device *dev, dma_addr_t daddr)
>  {
> - return daddr - get_dma_offset(dev);
> + if (!dev)
> + return daddr - PCI_DRAM_OFFSET;
> + return daddr - dev->archdata.dma_offset;
>  }
>  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index dacd0f93f2b2..f0bf7ac2686c 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -80,22 +80,6 @@ static inline const struct dma_map_ops 
> *get_arch_dma_ops(struct bus_type *bus)
>   return NULL;
>  }
>  
> -/*
> - * get_dma_offset()
> - *
> - * Get the dma offset on configurations where the dma address can be 
> determined
> - * from the physical address by looking at a simple offset.  Direct dma and
> - * swiotlb use this function, but it is typically not used by implementations
> - * with an iommu.
> - */
> -static inline dma_addr_t get_dma_offset(struct device *dev)
> -{
> - if (dev)
> - return dev->archdata.dma_offset;
> -
> - return PCI_DRAM_OFFSET;
> -}
> -
>  static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>  {
>   if (dev)

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 12/20] powerpc/dma: use phys_to_dma instead of get_dma_offset

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Use the standard portable helper instead of the powerpc specific one,
> which is about to go away.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/kernel/dma-swiotlb.c |  5 ++---
>  arch/powerpc/kernel/dma.c | 12 ++--
>  2 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> b/arch/powerpc/kernel/dma-swiotlb.c
> index 88f3963ca30f..f6e0701c5303 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -11,7 +11,7 @@
>   *
>   */
>  
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -31,9 +31,8 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
> end = memblock_end_of_DRAM();
> if (max_direct_dma_addr && end > max_direct_dma_addr)
> end = max_direct_dma_addr;
> -   end += get_dma_offset(dev);
>  
> -   mask = 1ULL << (fls64(end) - 1);
> +   mask = 1ULL << (fls64(phys_to_dma(dev, end)) - 1);
> mask += mask - 1;
>  
> return mask;
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index eceaa92e6986..3487de83bb37 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -6,7 +6,7 @@
>   */
>  
>  #include 
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -43,7 +43,7 @@ static u64 __maybe_unused get_pfn_limit(struct device *dev)
>  static int dma_nommu_dma_supported(struct device *dev, u64 mask)
>  {
>  #ifdef CONFIG_PPC64
> -   u64 limit = get_dma_offset(dev) + (memblock_end_of_DRAM() - 1);
> +   u64 limit = phys_to_dma(dev, (memblock_end_of_DRAM() - 1));
>  
> /* Limit fits in the mask, we are good */
> if (mask >= limit)
> @@ -104,7 +104,7 @@ void *__dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
> return NULL;
> ret = page_address(page);
> memset(ret, 0, size);
> -   *dma_handle = __pa(ret) + get_dma_offset(dev);
> +   *dma_handle = phys_to_dma(dev,__pa(ret));
>  
> return ret;
>  }
> @@ -188,7 +188,7 @@ static int dma_nommu_map_sg(struct device *dev, struct 
> scatterlist *sgl,
> int i;
>  
> for_each_sg(sgl, sg, nents, i) {
> -   sg->dma_address = sg_phys(sg) + get_dma_offset(dev);
> +   sg->dma_address = phys_to_dma(dev, sg_phys(sg));
> sg->dma_length = sg->length;
>  
> if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
> @@ -210,7 +210,7 @@ static u64 dma_nommu_get_required_mask(struct device *dev)
>  {
> u64 end, mask;
>  
> -   end = memblock_end_of_DRAM() + get_dma_offset(dev);
> +   end = phys_to_dma(dev, memblock_end_of_DRAM());
>  
> mask = 1ULL << (fls64(end) - 1);
> mask += mask - 1;
> @@ -228,7 +228,7 @@ static inline dma_addr_t dma_nommu_map_page(struct device 
> *dev,
> if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> __dma_sync_page(page, offset, size, dir);
>  
> -   return page_to_phys(page) + offset + get_dma_offset(dev);
> +   return phys_to_dma(dev, page_to_phys(page)) + offset;
>  }
>  
>  static inline void dma_nommu_unmap_page(struct device *dev,

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 11/20] powerpc/dma: split the two __dma_alloc_coherent implementations

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The implemementation for the CONFIG_NOT_COHERENT_CACHE case doesn't share
> any code with the one for systems with coherent caches.  Split it off
> and merge it with the helpers in dma-noncoherent.c that have no other
> callers.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h |  5 -
>  arch/powerpc/kernel/dma.c  | 14 ++
>  arch/powerpc/mm/dma-noncoherent.c  | 15 +++
>  arch/powerpc/platforms/44x/warp.c  |  2 +-
>  4 files changed, 10 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index f2a4a7142b1e..dacd0f93f2b2 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -39,9 +39,6 @@ extern int dma_nommu_mmap_coherent(struct device *dev,
>   * to ensure it is consistent.
>   */
>  struct device;
> -extern void *__dma_alloc_coherent(struct device *dev, size_t size,
> -   dma_addr_t *handle, gfp_t gfp);
> -extern void __dma_free_coherent(size_t size, void *vaddr);
>  extern void __dma_sync(void *vaddr, size_t size, int direction);
>  extern void __dma_sync_page(struct page *page, unsigned long offset,
>size_t size, int direction);
> @@ -52,8 +49,6 @@ extern unsigned long __dma_get_coherent_pfn(unsigned long 
> cpu_addr);
>   * Cache coherent cores.
>   */
>  
> -#define __dma_alloc_coherent(dev, gfp, size, handle) NULL
> -#define __dma_free_coherent(size, addr)  ((void)0)
>  #define __dma_sync(addr, size, rw)   ((void)0)
>  #define __dma_sync_page(pg, off, sz, rw) ((void)0)
>  
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 3939589aab04..eceaa92e6986 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -62,18 +62,12 @@ static int dma_nommu_dma_supported(struct device *dev, 
> u64 mask)
>  #endif
>  }
>  
> +#ifndef CONFIG_NOT_COHERENT_CACHE
>  void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t flag,
> unsigned long attrs)
>  {
>   void *ret;
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - ret = __dma_alloc_coherent(dev, size, dma_handle, flag);
> - if (ret == NULL)
> - return NULL;
> - *dma_handle += get_dma_offset(dev);
> - return ret;
> -#else
>   struct page *page;
>   int node = dev_to_node(dev);
>  #ifdef CONFIG_FSL_SOC
> @@ -113,19 +107,15 @@ void *__dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>   *dma_handle = __pa(ret) + get_dma_offset(dev);
>  
>   return ret;
> -#endif
>  }
>  
>  void __dma_nommu_free_coherent(struct device *dev, size_t size,
>   void *vaddr, dma_addr_t dma_handle,
>   unsigned long attrs)
>  {
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - __dma_free_coherent(size, vaddr);
> -#else
>   free_pages((unsigned long)vaddr, get_order(size));
> -#endif
>  }
> +#endif /* !CONFIG_NOT_COHERENT_CACHE */
>  
>  static void *dma_nommu_alloc_coherent(struct device *dev, size_t size,
>  dma_addr_t *dma_handle, gfp_t flag,
> diff --git a/arch/powerpc/mm/dma-noncoherent.c 
> b/arch/powerpc/mm/dma-noncoherent.c
> index d1c16456abac..cfc48a253707 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -29,7 +29,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  #include 
> @@ -151,8 +151,8 @@ static struct ppc_vm_region *ppc_vm_region_find(struct 
> ppc_vm_region *head, unsi
>   * Allocate DMA-coherent memory space and return both the kernel remapped
>   * virtual and bus address for that space.
>   */
> -void *
> -__dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, 
> gfp_t gfp)
> +void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> + dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>  {
>   struct page *page;
>   struct ppc_vm_region *c;
> @@ -223,7 +223,7 @@ __dma_alloc_coherent(struct device *dev, size_t size, 
> dma_addr_t *handle, gfp_t
>   /*
>* Set the "dma handle"
>*/
> - *handle = page_to_phys(page);
> + *dma_handle = phys_to_dma(dev, page_to_phys(page));
>  
>   do {
>  

Re: [PATCH 09/20] powerpc/dma: remove the unused ISA_DMA_THRESHOLD export

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 


> ---
>  arch/powerpc/kernel/setup_32.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
> index 74457485574b..3c2d093f74c7 100644
> --- a/arch/powerpc/kernel/setup_32.c
> +++ b/arch/powerpc/kernel/setup_32.c
> @@ -55,7 +55,6 @@ unsigned long ISA_DMA_THRESHOLD;
>  unsigned int DMA_MODE_READ;
>  unsigned int DMA_MODE_WRITE;
>  
> -EXPORT_SYMBOL(ISA_DMA_THRESHOLD);
>  EXPORT_SYMBOL(DMA_MODE_READ);
>  EXPORT_SYMBOL(DMA_MODE_WRITE);

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-08 Thread Benjamin Herrenschmidt
On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> It turns out cxl actually uses it.  So for now skip this patch,
> although random code in drivers messing with dma ops will need to
> be sorted out sooner or later.

CXL devices are "special": they bypass the classic iommu in favor of
allowing the device to operate using the main processor page tables
using an MMU context (so basically the device can use userspace
addresses directly), akin to ATS.

I think the code currently uses the nommu ops as a way to do a simple
kernel mapping for kernel drivers using CXL (not userspace stuff)
though.

Ben.


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 07/20] powerpc/dma: remove the unused ARCH_HAS_DMA_MMAP_COHERENT define

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa394520af6..f2a4a7142b1e 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -112,7 +112,5 @@ extern int dma_set_mask(struct device *dev, u64 dma_mask);
>  
>  extern u64 __dma_get_required_mask(struct device *dev);
>  
> -#define ARCH_HAS_DMA_MMAP_COHERENT
> -
>  #endif /* __KERNEL__ */
>  #endif   /* _ASM_DMA_MAPPING_H */

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> We need to take the DMA offset and encryption bit into account when selecting
> a zone.  Add a helper that takes those into account and use it.

That whole "encryption" stuff seems to be completely specific to the
way x86 does memory encryption, or am I mistaken ? It's not clear to me
what that does in practice and how it relates to DMA mappings.

I'm also not sure about that whole business with ZONE_DMA and
ARCH_ZONE_DMA_BITS...

On ppc64, unless you enable swiotlb (which we only do currently on
some embedded platforms), you have all of memory in ZONE_DMA.

[0.00] Zone ranges:
[0.00]   DMA  [mem 0x-0x001f]
[0.00]   DMA32empty
[0.00]   Normal   empty
[0.00]   Device   empty

I'm not sure how this will work with that dma direct code.

I also see a number of tests against a 64-bit mask rather than the
top of memory...

Ben.

> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/dma/direct.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index d32d4f0d2c0c..c2c1df8827f2 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -58,6 +58,14 @@ static bool dma_coherent_ok(struct device *dev, 
> phys_addr_t phys, size_t size)
>   return addr + size - 1 <= dev->coherent_dma_mask;
>  }
>  
> +static bool dma_coherent_below(struct device *dev, u64 mask)
> +{
> + dma_addr_t addr = force_dma_unencrypted() ?
> + __phys_to_dma(dev, mask) : phys_to_dma(dev, mask);
> +
> + return dev->coherent_dma_mask <= addr;
> +}
> +
>  void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t 
> *dma_handle,
>   gfp_t gfp, unsigned long attrs)
>  {
> @@ -70,9 +78,9 @@ void *dma_direct_alloc(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>   gfp &= ~__GFP_ZERO;
>  
>   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */
> - if (dev->coherent_dma_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> + if (dma_coherent_below(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
>   gfp |= GFP_DMA;
> - if (dev->coherent_dma_mask <= DMA_BIT_MASK(32) && !(gfp & GFP_DMA))
> + if (dma_coherent_below(dev, DMA_BIT_MASK(32) && !(gfp & GFP_DMA)))
>   gfp |= GFP_DMA32;
>  
>  again:
> @@ -92,14 +100,14 @@ void *dma_direct_alloc(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>   page = NULL;
>  
>   if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
> - dev->coherent_dma_mask < DMA_BIT_MASK(64) &&
> + dma_coherent_below(dev, DMA_BIT_MASK(64)) &&
>   !(gfp & (GFP_DMA32 | GFP_DMA))) {
>   gfp |= GFP_DMA32;
>   goto again;
>   }
>  
>   if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> - dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
> + dma_coherent_below(dev, DMA_BIT_MASK(32)) &&
>   !(gfp & GFP_DMA)) {
>   gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
>   goto again;

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> When a device has a DMA offset the dma capable result will change due
> to the difference between the physical and DMA address.  Take that into
> account.

The patch in itself makes sense.

However, there are a number of things in that dma_direct.c file that I
don't quite get:

 - looking more generally at what that function does, I worry about the
switch of ppc32 to this later on:

We do have the occasional device with things like a 31-bit DMA
limitation. We know they happen to work because those systems
can't have enough memory to be a problem. This is why our current
DMA direct ops in powerpc just unconditionally return true on ppc32.

The test against a full 32-bit mask here will break them I think.

Thing is, I'm not sure I still have access to one of these things
to test, I'll have to dig (from memory things like b43 wifi).

Also those platforms don't have an iommu.

 - What is this trying to achieve ?

/*
 * Various PCI/PCIe bridges have broken support for > 32bit DMA even
 * if the device itself might support it.
 */
if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
return 0;

IE, if the device has a 32-bit limit, we fail an attempt at checking
if a >32-bit mask works ? That doesn't quite seem to be the right thing
to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?

IE, dma_set_mask() is what a driver uses to establish the device capability,
so it makes sense to have dma_32bit_limit just reduce that capability, not
fail because the device can do more than what the bridge can 

Sorry if I'm a bit confused here.
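
To make the suggestion concrete, here is a minimal sketch of the "clamp it
in dma_set_mask()" idea (illustrative only; dma_32bit_limit is the field
the quoted patch tests, the rest is a simplified stand-in for the real
code):

static int example_dma_set_mask(struct device *dev, u64 mask)
{
        /* Bridge is 32-bit limited: reduce the capability, don't fail. */
        if (dev->dma_32bit_limit && mask > DMA_BIT_MASK(32))
                mask = DMA_BIT_MASK(32);

        if (!dma_supported(dev, mask))
                return -EIO;

        *dev->dma_mask = mask;
        return 0;
}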

 - How is that file supposed to work on 64-bit platforms ? From what I can
tell, dma_supported() will unconditionally return true if the mask is
32-bit or larger (apart from the above issue). This doesn't look right,
the mask needs to be compared to the max memory address. There are a bunch
of devices out there with masks anywhere between 40 and 64 bits, and
some of these will not work "out of the box" if the offset-adjusted top
of memory is beyond the mask limit. Or am I missing something ?

Cheers,
Ben.

> Signed-off-by: Christoph Hellwig 

Reviewed-by: Benjamin Herrenschmidt 

> ---
>  kernel/dma/direct.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 8be8106270c2..d32d4f0d2c0c 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -167,7 +167,7 @@ int dma_direct_map_sg(struct device *dev, struct 
> scatterlist *sgl, int nents,
>  int dma_direct_supported(struct device *dev, u64 mask)
>  {
>  #ifdef CONFIG_ZONE_DMA
> - if (mask < DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> + if (mask < phys_to_dma(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
>   return 0;
>  #else
>   /*
> @@ -176,14 +176,14 @@ int dma_direct_supported(struct device *dev, u64 mask)
>* memory, or by providing a ZONE_DMA32.  If neither is the case, the
>* architecture needs to use an IOMMU instead of the direct mapping.
>*/
> - if (mask < DMA_BIT_MASK(32))
> + if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
>   return 0;
>  #endif
>   /*
>* Various PCI/PCIe bridges have broken support for > 32bit DMA even
>* if the device itself might support it.
>*/
> - if (dev->dma_32bit_limit && mask > DMA_BIT_MASK(32))
> + if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
>   return 0;
>   return 1;
>  }

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: DMA mappings and crossing boundaries

2018-07-19 Thread Benjamin Herrenschmidt
On Wed, 2018-07-11 at 14:51 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-07-04 at 13:57 +0100, Robin Murphy wrote:
> > 
> > > Could it ? I wouldn't think dma_map_page is allows to cross page
> > > boundaries ... what about single() ? The main worry is people using
> > > these things on kmalloc'ed memory
> > 
> > Oh, absolutely - the underlying operation is just "prepare for DMA 
> > to/from this physically-contiguous region"; the only real difference 
> > between map_page and map_single is for the sake of the usual "might be 
> > highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits 
> > on the size and offset parameters (in fact, if anyone asks I would say 
> > the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few 
> > months back is valid for dma_map_page too, however silly it may seem).
> 
> I think this is a very bad idea though. We should thrive to avoid
> coalescing before the iommu layer and we should avoid creating sglists
> that cross natural alignment boundaries.

Ping ? Jens, Christoph ?

I really really don't like how sg_alloc_table_from_pages() coalesces
the sglist before it gets mapped. This will potentially create huge
constraints and fragmentation in the IOMMU allocator by passing large
chunks to it.

> I had a look at sg_alloc_table_from_pages() and it basically freaks me
> out.
> 
> Back in the old days, we used to have the block layer aggressively
> coalesce requests iirc, but that was changed to let the iommu layer do
> it.
> 
> If you pass to dma_map_sg() something already heavily coalesced, such
> as what sg_alloc_table_from_pages() can do since it seems to have
> absolutely no limits there, you will create a significant fragmentation
> problem in the iommu allocator.
> 
> Instead, we should probably avoid such coalescing at that level and
> instead let the iommu opportunistically coalesce at the point of
> mapping which it does just fine.
> 
> We've been racking our brains here and we can't find a way to implement
> something we want that is both very performance efficient (no global
> locks etc...) and works with boundary crossing mappings.
> 
> We could always fallback to classic small page mappings but that is
> quite suboptimal.
> 
> I also notice that sg_alloc_table_from_pages() doesn't try to prevent
> crossing the 4G boundary. I remember pretty clearly that it was
> explicitely forbidden to create such a crossing.
> 
> > Of course, given that the allocators tend to give out size/order-aligned 
> > chunks, I think you'd have to be pretty tricksy to get two allocations 
> > to line up either side of a large power-of-two boundary *and* go out of 
> > your way to then make a single request spanning both, but it's certainly 
> > not illegal. Realistically, the kind of "scrape together a large buffer 
> > from smaller pieces" code which is liable to hit a boundary-crossing 
> > case by sheer chance is almost certainly going to be taking the 
> > sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather 
> > than implementing its own merging and piecemeal mapping.
> 
> Yes and I  think what sg_alloc_table_from_pages() does is quite wrong.
> 
> > > > Conceptually it looks pretty easy to extend the allocation constraints
> > > > to cope with that - even the pathological worst case would have an
> > > > absolute upper bound of 3 IOMMU entries for any one physical region -
> > > > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > > > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > > > way to make that practical :(
> > > 
> > > No we are talking about 40-ish-bits of address space, so there's a bit
> > > of leeway. Of course no scheme will work if the user app tries to map
> > > more than the GPU can possibly access.
> > > 
> > > But with newer AMD adding a few more bits and nVidia being at 47-bits,
> > > I think we have some margin, it's just that they can't reach our
> > > discontiguous memory with a normal 'bypass' mapping and I'd rather not
> > > teach Linux about every single way our HW can scatter memory accross
> > > nodes, so an "on demand" mechanism is by far the most flexible way to
> > > deal with all configurations.
> > > 
> > > > Maybe the best compromise would be some sort of hybrid scheme which
> > > > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > > > buffer, and invokes software bouncing for the awkward cases.

Re: DMA mappings and crossing boundaries

2018-07-10 Thread Benjamin Herrenschmidt
On Wed, 2018-07-04 at 13:57 +0100, Robin Murphy wrote:
> 
> > Could it ? I wouldn't think dma_map_page is allows to cross page
> > boundaries ... what about single() ? The main worry is people using
> > these things on kmalloc'ed memory
> 
> Oh, absolutely - the underlying operation is just "prepare for DMA 
> to/from this physically-contiguous region"; the only real difference 
> between map_page and map_single is for the sake of the usual "might be 
> highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits 
> on the size and offset parameters (in fact, if anyone asks I would say 
> the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few 
> months back is valid for dma_map_page too, however silly it may seem).

I think this is a very bad idea though. We should strive to avoid
coalescing before the iommu layer and we should avoid creating sglists
that cross natural alignment boundaries.

I had a look at sg_alloc_table_from_pages() and it basically freaks me
out.

Back in the old days, we used to have the block layer aggressively
coalesce requests iirc, but that was changed to let the iommu layer do
it.

If you pass to dma_map_sg() something already heavily coalesced, such
as what sg_alloc_table_from_pages() can do since it seems to have
absolutely no limits there, you will create a significant fragmentation
problem in the iommu allocator.

Instead, we should probably avoid such coalescing at that level and
instead let the iommu opportunistically coalesce at the point of
mapping which it does just fine.

We've been racking our brains here and we can't find a way to implement
something we want that is both very performance efficient (no global
locks etc...) and works with boundary crossing mappings.

We could always fall back to classic small page mappings but that is
quite suboptimal.

I also notice that sg_alloc_table_from_pages() doesn't try to prevent
crossing the 4G boundary. I remember pretty clearly that it was
explicitly forbidden to create such a crossing.
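
For what it's worth, a driver can at least advertise a "segments must not
cross this boundary" constraint through the DMA parameters, as in the
sketch below (assumes dev->dma_parms is allocated, as it is for PCI
devices; whether every sglist-building path such as
sg_alloc_table_from_pages() actually honours the constraint is exactly the
problem being pointed out here):

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

static int example_set_dma_constraints(struct device *dev)
{
        int ret;

        /* No mapped segment may cross a 4G boundary ... */
        ret = dma_set_seg_boundary(dev, DMA_BIT_MASK(32));
        if (ret)
                return ret;

        /* ... and cap the length of each individual segment as well. */
        return dma_set_max_seg_size(dev, SZ_2G);
}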

> Of course, given that the allocators tend to give out size/order-aligned 
> chunks, I think you'd have to be pretty tricksy to get two allocations 
> to line up either side of a large power-of-two boundary *and* go out of 
> your way to then make a single request spanning both, but it's certainly 
> not illegal. Realistically, the kind of "scrape together a large buffer 
> from smaller pieces" code which is liable to hit a boundary-crossing 
> case by sheer chance is almost certainly going to be taking the 
> sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather 
> than implementing its own merging and piecemeal mapping.

Yes and I  think what sg_alloc_table_from_pages() does is quite wrong.

> > > Conceptually it looks pretty easy to extend the allocation constraints
> > > to cope with that - even the pathological worst case would have an
> > > absolute upper bound of 3 IOMMU entries for any one physical region -
> > > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > > way to make that practical :(
> > 
> > No we are talking about 40-ish-bits of address space, so there's a bit
> > of leeway. Of course no scheme will work if the user app tries to map
> > more than the GPU can possibly access.
> > 
> > But with newer AMD adding a few more bits and nVidia being at 47-bits,
> > I think we have some margin, it's just that they can't reach our
> > discontiguous memory with a normal 'bypass' mapping and I'd rather not
> > teach Linux about every single way our HW can scatter memory accross
> > nodes, so an "on demand" mechanism is by far the most flexible way to
> > deal with all configurations.
> > 
> > > Maybe the best compromise would be some sort of hybrid scheme which
> > > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > > buffer, and invokes software bouncing for the awkward cases.
> > 
> > Hrm... not too sure about that. I'm happy to limit that scheme to well
> > known GPU vendor/device IDs, and SW bouncing is pointless in these
> > cases. It would be nice if we could have some kind of guarantee that a
> > single mapping or sglist entry never crossed a specific boundary
> > though... We more/less have that for 4G already (well, we are supposed
> > to at least). Who are the main potential problematic subsystems here ?
> > I'm thinking network skb allocation pools ... and page cache if it
> > tries to coalesce entries before issuing the map request, does it ?
> 
> I don't know of anything definite off-hand, but my hunch is to be most 
> wary of anything wanting to do zero-copy access to large buffers in 
> userspace pages. In particular, sg_alloc_table_from_pages() lacks any 
> kind of boundary enforcement (and most all users don't even use the 
> segment-length-limiting variant either), so I'd say any caller of that 
> currently has a very small, but nonzero, 

Re: DMA mappings and crossing boundaries

2018-07-02 Thread Benjamin Herrenschmidt
On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:

 .../...

Thanks Robin, I was starting to despair anybody would reply ;-)

> > AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
> > HOWTO.txt) as always allocating to the next power-of-2 order, so we
> > should never have the problem unless we allocate a single chunk larger
> > than the IOMMU page size.
> 
> (and even then it's not *that* much of a problem, since it comes down to 
> just finding n > 1 consecutive unused IOMMU entries for exclusive use by 
> that new chunk)

Yes, this case is not my biggest worry.

> > For dma_map_sg() however, if a request that has a single "entry"
> > spawning such a boundary, we need to ensure that the result mapping is
> > 2 contiguous "large" iommu pages as well.
> > 
> > However, that doesn't fit well with us re-using existing mappings since
> > they may already exist and either not be contiguous, or partially exist
> > with no free hole around them.
> > 
> > Now, we *could* possibly construe a way to solve this by detecting this
> > case and just allocating another "pair" (or set if we cross even more
> > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > scheme.
> > 
> > But while doable, this introduce some serious complexity in the
> > implementation, which I would very much like to avoid.
> > 
> > So I was wondering if you guys thought that was ever likely to happen ?
> > Do you see reasonable cases where dma_map_sg() would be called with a
> > list in which a single entry crosses a 256M or 1G boundary ?
> 
> For streaming mappings of buffers cobbled together out of any old CPU 
> pages (e.g. user memory), you may well happen to get two 
> physically-adjacent pages falling either side of an IOMMU boundary, 
> which comprise all or part of a single request - note that whilst it's 
> probably less likely than the scatterlist case, this could technically 
> happen for dma_map_{page, single}() calls too.

Could it ? I wouldn't think dma_map_page is allowed to cross page
boundaries ... what about single() ? The main worry is people using
these things on kmalloc'ed memory

> Conceptually it looks pretty easy to extend the allocation constraints 
> to cope with that - even the pathological worst case would have an 
> absolute upper bound of 3 IOMMU entries for any one physical region - 
> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit 
> DMA addresses having only 4 1GB slots to play with, I can't really see a 
> way to make that practical :(

No we are talking about 40-ish-bits of address space, so there's a bit
of leeway. Of course no scheme will work if the user app tries to map
more than the GPU can possibly access.

But with newer AMD adding a few more bits and nVidia being at 47-bits,
I think we have some margin, it's just that they can't reach our
discontiguous memory with a normal 'bypass' mapping and I'd rather not
teach Linux about every single way our HW can scatter memory across
nodes, so an "on demand" mechanism is by far the most flexible way to
deal with all configurations.

> Maybe the best compromise would be some sort of hybrid scheme which 
> makes sure that one of the IOMMU entries always covers the SWIOTLB 
> buffer, and invokes software bouncing for the awkward cases.

Hrm... not too sure about that. I'm happy to limit that scheme to well
known GPU vendor/device IDs, and SW bouncing is pointless in these
cases. It would be nice if we could have some kind of guarantee that a
single mapping or sglist entry never crossed a specific boundary
though... We more/less have that for 4G already (well, we are supposed
to at least). Who are the main potential problematic subsystems here ?
I'm thinking network skb allocation pools ... and page cache if it
tries to coalesce entries before issuing the map request, does it ?

Ben.

> Robin.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


DMA mappings and crossing boundaries

2018-06-24 Thread Benjamin Herrenschmidt
Hi Folks !

So, to work around issues with devices having too strict limitations in
DMA address bits (GPUs, ugh) on POWER, we've been playing with a
mechanism that does dynamic mapping in the IOMMU but using a very large
IOMMU page size (256M on POWER8 and 1G on POWER9) for performance.

Now, with such a page size, we can't just pop out new entries for every
DMA map; we need to try to re-use entries for mappings in the same
"area".

We've prototyped something using refcounts on the entries. It does
imply some locking which is potentially problematic, and we'll be
looking at options there in the long run, but it works... so far.
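
Very roughly, the idea looks like the sketch below (illustrative only, not
the actual prototype; the slot lookup and the real TCE programming are
left out, and the single global lock is exactly the locking concern
mentioned above):

#include <linux/refcount.h>
#include <linux/spinlock.h>

/* One refcount per huge IOMMU slot: taken for every DMA mapping that
 * lands inside the slot, dropped on unmap, slot torn down at zero. */
struct huge_slot {
        refcount_t      users;
        phys_addr_t     phys;   /* huge-page-aligned region currently mapped */
        bool            valid;
};

static DEFINE_SPINLOCK(huge_slot_lock);

static bool huge_slot_get(struct huge_slot *slot, phys_addr_t base)
{
        bool ok = true;

        spin_lock(&huge_slot_lock);
        if (slot->valid && slot->phys == base) {
                refcount_inc(&slot->users);     /* re-use the existing entry */
        } else if (!slot->valid) {
                slot->phys = base;              /* program the TCE here */
                slot->valid = true;
                refcount_set(&slot->users, 1);
        } else {
                ok = false;                     /* slot maps a different region */
        }
        spin_unlock(&huge_slot_lock);
        return ok;
}

static void huge_slot_put(struct huge_slot *slot)
{
        spin_lock(&huge_slot_lock);
        if (refcount_dec_and_test(&slot->users))
                slot->valid = false;            /* invalidate the TCE here */
        spin_unlock(&huge_slot_lock);
}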

My worry is that it will fail if we ever get a mapping request (or
coherent allocation request) that spans one of those giant page
boundaries. At least with our current implementation.

AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
HOWTO.txt) as always allocating to the next power-of-2 order, so we
should never have the problem unless we allocate a single chunk larger
than the IOMMU page size.

For dma_map_sg() however, if a request has a single "entry"
spanning such a boundary, we need to ensure that the resulting mapping is
2 contiguous "large" IOMMU pages as well.
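
For concreteness, "spanning such a boundary" means a check along these
lines (sketch; huge_size stands for the 256M/1G IOMMU page size and is
assumed to be a power of two):

static inline bool crosses_huge_boundary(phys_addr_t addr, size_t len,
                                         phys_addr_t huge_size)
{
        /* Compare the huge-page "index" of the first and last byte. */
        return (addr & ~(huge_size - 1)) != ((addr + len - 1) & ~(huge_size - 1));
}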

However, that doesn't fit well with us re-using existing mappings since
they may already exist and either not be contiguous, or partially exist
with no free hole around them.

Now, we *could* possibly construe a way to solve this by detecting this
case and just allocating another "pair" (or set if we cross even more
pages) of IOMMU pages elsewhere, thus partially breaking our re-use
scheme.

But while doable, this introduces some serious complexity in the
implementation, which I would very much like to avoid.

So I was wondering if you guys thought that was ever likely to happen ?
Do you see reasonable cases where dma_map_sg() would be called with a
list in which a single entry crosses a 256M or 1G boundary ?

Cheers,
Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-16 Thread Benjamin Herrenschmidt
On Wed, 2017-08-16 at 10:56 -0600, Alex Williamson wrote:
> 
> > WTF  Alex, can you stop once and for all with all that "POWER is
> > not standard" bullshit please ? It's completely wrong.
> 
> As you've stated, the MSI-X vector table on POWER is currently updated
> via a hypercall.  POWER is overall PCI compliant (I assume), but the
> guest does not directly modify the vector table in MMIO space of the
> device.  This is important...

Well no. On qemu the guest doesn't always (but it can save/restore it),
but on PowerVM this is done by the FW running inside the partition
itself. And that firmware just does normal stores to the device table.

IE. The problem here isn't so much who does the actual stores to the
device table but where they get the address and data values from, which
isn't covered by the spec.

The added fact that qemu hijacks the stores not just to "remap" them
but also to do the whole requesting of the interrupt etc... in the host
system is a qemu design choice which also has no relation to the
spec (and arguably isn't a great choice for our systems).

For example, on PowerVM, the HV assigns a pile of MSIs to the guest to
assign to its devices. The FW inside the guest does a default
assignment but that can be changed.

Thus the interrupts are effectively "hooked up" at the HV level at the
point where the PCI bridge is mapped into the guest.

> > This has nothing to do with PCIe standard !
> 
> Yes, it actually does, because if the guest relies on the vector table
> to be virtualized then it doesn't particularly matter whether the
> vfio-pci kernel driver allows that portion of device MMIO space to be
> directly accessed or mapped because QEMU needs for it to be trapped in
> order to provide that virtualization.

And this has nothing to do with the PCIe standard... this has
everything to do with a combination of qemu design choices and
deficient FW interfaces on x86 platforms.

> I'm not knocking POWER, it's a smart thing for virtualization to have
> defined this hypercall which negates the need for vector table
> virtualization and allows efficient mapping of the device.  On other
> platform, it's not necessarily practical given the broad base of legacy
> guests supported where we'd never get agreement to implement this as
> part of the platform spec... if there even was such a thing.  Maybe we
> could provide the hypercall and dynamically enable direct vector table
> mapping (disabling vector table virtualization) only if the hypercall
> is used.

No I think a better approach would be to provide the guest with a pile
of MSIs to use with devices and have FW (such as ACPI) tell the guest
about them.

> > The PCIe standard says strictly *nothing* whatsoever about how an OS
> > obtains the magic address/values to put in the device and how the PCIe
> > host bridge may do appropriate fitering.
> 
> And now we've jumped the tracks...  The only way the platform specific
> address/data values become important is if we allow direct access to
> the vector table AND now we're formulating how the user/guest might
> write to it directly.  Otherwise the virtualization of the vector
> table, or paravirtualization via hypercall provides the translation
> where the host and guest address/data pairs can operate in completely
> different address spaces.

They can regardless if things are done properly :-)

> > There is nothing on POWER that prevents the guest from writing the MSI-
> > X address/data by hand. The problem isn't who writes the values or even
> > how. The problem breaks down into these two things that are NOT covered
> > by any aspect of the PCIe standard:
> 
> You've moved on to a different problem, I think everyone aside from
> POWER is still back at the problem where who writes the vector table
> values is a forefront problem.
>  
> >   1- The OS needs to obtain address/data values for an MSI that will
> > "work" for the device.
> > 
> >   2- The HW+HV needs to prevent collateral damage caused by a device
> > issuing stores to incorrect address or with incorrect data. Now *this*
> > is necessary for *ANY* kind of DMA whether it's an MSI or something
> > else anyway.
> > 
> > Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> > and whatever excluse about "making it harder" doesn't fly a meter when
> > it comes to security. Making it "harder to break accidentally" I also
> > don't buy, people don't just randomly put things in their MSI-X tables
> > "accidentally", that stuff works or doesn't.
> 
> As I said before, I'm not willing to preserve the weak attributes that
> blocking direct vector table access provides over pursuing a more
> performant interface, but I also don't think their value is absolute
> zero either.
> 
> > That leaves us with 1). Now this is purely a platform specific matters,
> > not a spec matter. Once the HW has a way to enforce you can only
> > generate "allowed" MSIs it becomes a matter of having some FW mechanism
> > that can be used to 

Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-15 Thread Benjamin Herrenschmidt
On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> Of course I don't think either of those are worth imposing a
> performance penalty where we don't otherwise need one.  However, if we
> look at a VM scenario where the guest is following the PCI standard for
> programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> intercept those MMIO writes to the vector table and configure the host
> interrupt domain of the device rather than allowing the guest direct
> access.  This is simply part of virtualizing the device to the guest.
> So even if the kernel allows mmap'ing the vector table, the hypervisor
> needs to trap it, so the mmap isn't required or used anyway.  It's only
> when you define a non-PCI standard for your guest to program
> interrupts, as POWER has done, and can therefore trust that the
> hypervisor does not need to trap on the vector table that having that
> mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
> pages too... does ARM have any strategy that would actually make it
> possible to make use of an mmap covering the vector table?  Thanks,

WTF  Alex, can you stop once and for all with all that "POWER is
not standard" bullshit please ? It's completely wrong.

This has nothing to do with PCIe standard !

The PCIe standard says strictly *nothing* whatsoever about how an OS
obtains the magic address/values to put in the device and how the PCIe
host bridge may do appropriate filtering.

There is nothing on POWER that prevents the guest from writing the MSI-
X address/data by hand. The problem isn't who writes the values or even
how. The problem breaks down into these two things that are NOT covered
by any aspect of the PCIe standard:

  1- The OS needs to obtain address/data values for an MSI that will
"work" for the device.

  2- The HW+HV needs to prevent collateral damage caused by a device
issuing stores to incorrect address or with incorrect data. Now *this*
is necessary for *ANY* kind of DMA whether it's an MSI or something
else anyway.

Now, the filtering done by qemu is NOT a reasonable way to handle 2)
and whatever excuse about "making it harder" doesn't fly when it comes
to security. Making it "harder to break accidentally" I also don't buy:
people don't just randomly put things in their MSI-X tables
"accidentally", that stuff either works or it doesn't.

That leaves us with 1). Now this is purely a platform specific matter,
not a spec matter. Once the HW has a way to enforce that you can only
generate "allowed" MSIs, it becomes a matter of having some FW mechanism
that can be used to inform the OS what address/values to use for a
given interrupt.

This is provided on POWER by a combination of device-tree and RTAS. It
could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
but this is in no way a problem of standard compliance, just inferior
firmware interfaces.
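To make 1) a bit more concrete on the OS side, here is a rough sketch of
the shape of that flow. This is purely illustrative: fw_query_msi() is a
stand-in for the real device-tree/RTAS bits, not an actual interface.

#include <linux/pci.h>
#include <linux/msi.h>

static int example_setup_one_msi(struct pci_dev *pdev, int entry,
				 unsigned int irq)
{
	struct msi_msg msg;
	u64 addr;
	u32 data;
	int ret;

	/*
	 * Ask platform firmware for an (address, data) pair that the host
	 * bridge's MSI filtering will accept for this device, instead of
	 * composing one from architectural knowledge.
	 */
	ret = fw_query_msi(pdev, entry, &addr, &data);	/* placeholder */
	if (ret)
		return ret;

	msg.address_hi = upper_32_bits(addr);
	msg.address_lo = lower_32_bits(addr);
	msg.data = data;
	pci_write_msi_msg(irq, &msg);	/* program the device as usual */
	return 0;
}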

So again, for the 234789246th time in years, can we get that 1-bit-of-
information sorted one way or another so we can fix our massive
performance issue instead of adding yet another dozen layers of paint
on that shed ?

Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Benjamin Herrenschmidt
On Mon, 2017-08-14 at 14:12 +0100, Robin Murphy wrote:
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/

There is 0 value in trying to prevent the guest kernel from shooting
itself in the foot. There are so many other ways it can do it that I
fail to see the point of even attempting it here.

In addition, this actually harms performance on some devices. There
are cases where the MSI-X table shares a page with other registrers
that are used during normal device operation. This is especially
problematic on architectures such as powerpc that use 64K pages.

Those devices thus suffer a massive performance loss, for the sake of
something that never happens in practice (especially on pseries where
the MSI configuration is done by paravirt calls, thus by qemu itself).

Cheers,
Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Benjamin Herrenschmidt
On Tue, 2017-08-15 at 09:47 +0800, Jike Song wrote:
> On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > > Taking a step back, though, why does vfio-pci perform this check in the
> > > > first place? If a malicious guest already has control of a device, any
> > > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > > message address/data it could simply do with a DMA write anyway, so the
> > > > security argument doesn't stand up in general (sure, not all PCIe
> > > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > > tenuous security-by-obscurity angle to me).
> > 
> > I tried to make that point for years, thanks for re-iterating it :-)
> > 
> > > Hi Robin,
> > > 
> > > DMA writes will be translated (thereby censored) by DMA Remapping 
> > > hardware,
> > > while MSI/MSI-X will not. Is this different for non-x86?
> > 
> > There is no way your DMA remapping HW can differentiate. The only
> > difference between a DMA write and an MSI is ... the address. So if I
> > can make my device DMA to the MSI address range, I've defeated your
> > security.
> 
> I don't think with IRQ remapping enabled, you can make your device DMA to
> MSI address, without being treated as an IRQ and remapped. If so, the IRQ
> remapping hardware is simply broken :)

You are mixing things here.

Robin's point is that there is no security provided by the obfuscation
of the MSI-X table by qemu, because whatever qemu does to "filter" the
MSI-X target addresses can be worked around by making the device DMA
wherever you want.

None of what you say invalidates that basic fact.

Now, as far as your remapping HW goes, either it filters interrupts or
it doesn't. If it does then yes, it can't be spoofed, and thus you
don't need the filtering of the table in qemu.

If it doesn't, then the guest can spoof any interrupt using DMAs and
whatever qemu does to filter the table is not going to fix it.

Thus the point remains that the only value in qemu filtering the table
is to enable already insecure use cases to work, without actually
making them any more secure.

Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH kernel v4 4/6] iommu: Set PCI_BUS_FLAGS_MSI_REMAP on iommu driver initialization

2017-07-26 Thread Benjamin Herrenschmidt
On Wed, 2017-07-26 at 11:50 +0200, Joerg Roedel wrote:
> > 3. Create IOMMU_DOMAIN_UNMANAGED IOMMU domains for PPC64/powernv IOMMU
> > groups and only define capable() hook to report IOMMU_CAP_INTR_REMAP;
> > others already use these IOMMU domains. VFIO-PCI's mmap() hook could then
> > check the capability via iommu_capable(bus). The problem is, as Robin said:
> > "iommu_capable() is a fundamentally broken and unworkable interface
> > anyway"; however it is still not clear to me why it is unworkable in this
> > particular case of isolation checking.
> 
> That one is wrong, IRQ remapping is not a property of a domain. A domain
> is an abstraction for a device address space. Attaching IRQ information
> there is just wrong.

Except it somewhat is ... an MSI is a store in the device address
space, the way MSIs are handled and/or filtered can be considered a
property of the domain. In our case, it's the exact same piece of HW
that defines domains and which MSIs they are allowed to generate.

Cheers,
Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH kernel v4 4/6] iommu: Set PCI_BUS_FLAGS_MSI_REMAP on iommu driver initialization

2017-07-19 Thread Benjamin Herrenschmidt
On Tue, 2017-07-11 at 11:39 +0100, Robin Murphy wrote:
> I have no idea what the context is here, but this flag looks wrong
> generally. IRQ remapping is a property of the irqchip and has nothing to
> do with PCI, so pretending it's a property of PCI buses looks like a
> massive hack around... something.
> 
> Also, iommu_capable() is a fundamentally broken and unworkable interface
> anyway. The existing IRQ remapping code really wants updating to
> advertise IRQ_DOMAIN_FLAG_MSI_REMAP on the relevant MSI domains so that
> IOMMU_CAP_INTR_REMAP can go away for good. That way, all consumers who
> actually care about whether IRQ remapping is implemented (see e.g. VFIO)
> can use irq_domain_check_msi_remap() or check specific devices in a sane
> and scalable manner.

As you rightfully said, you have no idea what the context is :-)

This is not an interrupt domain.

On powerpc we have a single global unified domain that contains all
our MSIs, IPIs, internal interrupts and what not, because of the way
our interrupts infrastructure works.

So there is no such thing as "a property of the MSI domain".

The way the HW works is that the PCI Host Bridge has the ability
to filter which device can access which MSIs. This is done by the IOMMU
portion of the bridge.

Thus it's a filtering capability, not a remapping capability per-se,
and it's a property of the IOMMU domain.

Since this is also a paravirtualized interface, the "remapping" part
is handled by the HV calls done by the guest to configure the MSIs.

The HW filtering ensures that an evil guest cannot do bad things if
it goes and manually whacks the MSI table.

Now this issue has been discussed, and patches floated around, for
*YEARS* now, and there is still no upstream solution for what is a
completely trivial problem: don't bloody block MSI BAR table access on
pseries platforms. It kills performance on a number of devices due to
our 64K pages.

A 1-liner fix in qemu would have done it YEARS ago. But instead we have
now painted about 1000 bike sheds and are still going without anything
that actually works. Yay.

Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: clean up and modularize arch dma_mapping interface V2

2017-06-24 Thread Benjamin Herrenschmidt
On Sat, 2017-06-24 at 09:18 +0200, Christoph Hellwig wrote:
> On Wed, Jun 21, 2017 at 12:24:28PM -0700, tndave wrote:
> > Thanks for doing this.
> > So archs can still have their own definition for dma_set_mask() if 
> > HAVE_ARCH_DMA_SET_MASK is y?
> > (and similarly for dma_set_coherent_mask() when 
> > CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK is y)
> > Any plan to change these?
> 
> Yes, those should go away, but I'm not entirely sure how yet.  We'll
> need some hook for switching between an IOMMU and a direct mapping
> (I guess that's what you want to do for sparc as well?), and I need
> to find the best way to do that.  Reimplementing all of dma_set_mask
> and dma_set_coherent_mask is something that I want to move away from.

I think we still need to do it. For example we have a bunch of new
"funky" cases.

We already have the case where we mix the direct and iommu mappings,
on some powerpc platforms that translates into an iommu mapping down at
0 for the 32-bit space and a direct mapping high up in the PCI address
space (which crops the top bits and thus hits memory at 0 onwards).

This type of hybrid layout is needed by some adapters, typically
storage, which want to keep the "coherent" mask at 32-bit but support
64-bit for streaming masks.

Another one we are trying to deal with better at the moment is devices
with DMA addressing limitations. Some GPUs typically (but not only)
have limits that go all across the gamut; I've seen 40 bits,
44 bits and 47 bits... And of course those are "high performance"
adapters, so we don't want to limit them to the comparatively small
iommu mapping with extra overhead.

At the moment, we're looking at a dma_set_mask() hook that will, for
these guys, re-configure the iommu mapping to create a "compressed"
linear mapping of system memory (ie, skipping the holes we have between
chip on P9 for example) using the largest possible iommu page size
(256M on P8, 1G on P9).

This is made tricky of course because several devices can potentially
share a DMA domain for various platform specific reasons. And of
course we have no way to figure out what's the "common denominator" of
all those devices before they start doing DMA. A driver can start
before the neighbour is probed and a driver can start doing DMAs using
the standard 32-bit mapping without doing dma_set_mask().

So heuristics ... ugh. Better ideas welcome :-) All that to say that we
are far from being able to get rid of dma_set_mask() custom
implementations (and coherent mask too).
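To illustrate what such a hook boils down to, here is a stripped-down
sketch (loosely modelled on the cell code quoted elsewhere in this
thread; example_direct_ops, example_iommu_ops and example_bypass_base()
are placeholders, not the real powerpc symbols):

static int example_dma_set_mask(struct device *dev, u64 dma_mask)
{
	if (dma_mask == DMA_BIT_MASK(64)) {
		/* Device can address all of memory: use the direct bypass window. */
		set_dma_ops(dev, &example_direct_ops);		/* placeholder ops */
		set_dma_offset(dev, example_bypass_base(dev));	/* placeholder */
	} else {
		/* Limited device: fall back to the 32-bit iommu window. */
		set_dma_ops(dev, &example_iommu_ops);		/* placeholder ops */
	}

	*dev->dma_mask = dma_mask;
	return 0;
}

The interesting (and hard) part discussed above is the middle ground:
masks like 40 or 44 bits, where neither branch fits and the iommu window
itself would need to be rebuilt as a compressed linear mapping.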

I was tempted at some point to retire the 32-bit iommu mapping
completely, just doing that "linear" thing I mentioned above and
swiotlb for the rest, along with introducing ZONE_DMA32 on powerpc
(with the real 64-bit bypass still around for non-limited devices but
that's then just extra speed by bypassing the iommu xlate & cache).

But I worry of the impact on those silly adapters that set the coherent
mask to 32-bits to keep their microcode & descriptor ring down in 32-
bit space. I'm not sure how well ZONE_DMA32 behaves in those cases.

Cheers,
Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 42/44] powerpc/cell: use the dma_supported method for ops switching

2017-06-18 Thread Benjamin Herrenschmidt
On Sun, 2017-06-18 at 00:13 -0700, Christoph Hellwig wrote:
> On Sun, Jun 18, 2017 at 06:50:27AM +1000, Benjamin Herrenschmidt wrote:
> > What is your rationale here ? (I have missed patch 0 it seems).
> 
> Less code duplication, more modular dma_map_ops insteance.
> 
> > dma_supported() was supposed to be pretty much a "const" function
> > simply informing whether a given setup is possible. Having it perform
> > an actual switch of ops seems to be pushing it...
> 
> dma_supported() is already gone from the public DMA API as it doesn't
> make sense to be called separately from set_dma_mask.  It will be
> entirely gone in the next series after this one.

Ah ok, in that case it makes much more sense, we can rename it then.

> > What if a driver wants to test various dma masks and then pick one ?
> > 
> > Where does the API document that if a driver calls dma_supported() it
> > then *must* set the corresponding mask and use that ?
> 
> Where is the API document for _any_ of the dma routines? (A: work in
> progress by me, but I need to clean up the mess of arch hooks before
> it can make any sense)

Heh fair enough.

> > I don't like a function that is a "boolean query" like this one to have
> > such a major side effect.
> > 
> > From an API standpoint, dma_set_mask() is when the mask is established,
> > and thus when the ops switch should happen.
> 
> And that's exactly what happens at the driver API level.  It just turns
> out the dma_capable method is the way better place to actually
> implement it, as the ->set_dma_mask method requires lots of code
> duplication while not offering any actual benefit over ->dma_capable.
> And because of that it's gone after this series.
> 
> In theory we could rename ->dma_capable now, but it would require a
> _lot_ of churn.  Give me another merge window or two and we should
> be down to about 2 handfuls of dma_map_ops instances, at which point
> we could do all this gratuitous renaming a lot more easily :)

Sure, I get it now, as long as it's not publicly exposed to drivers
we are fine.

Cheers,
Ben.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 42/44] powerpc/cell: use the dma_supported method for ops switching

2017-06-17 Thread Benjamin Herrenschmidt
On Fri, 2017-06-16 at 20:10 +0200, Christoph Hellwig wrote:
> Besides removing the last instance of the set_dma_mask method this also
> reduced the code duplication.

What is your rationale here ? (I have missed patch 0 it seems).

dma_supported() was supposed to be pretty much a "const" function
simply informing whether a given setup is possible. Having it perform
an actual switch of ops seems to be pushing it...

What if a driver wants to test various dma masks and then pick one ?

Where does the API document that if a driver calls dma_supported() it
then *must* set the corresponding mask and use that ?

I don't like a function that is a "boolean query" like this one to have
such a major side effect.

From an API standpoint, dma_set_mask() is when the mask is established,
and thus when the ops switch should happen.

Ben.

> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/platforms/cell/iommu.c | 25 +
>  1 file changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/cell/iommu.c 
> b/arch/powerpc/platforms/cell/iommu.c
> index 497bfbdbd967..29d4f96ed33e 100644
> --- a/arch/powerpc/platforms/cell/iommu.c
> +++ b/arch/powerpc/platforms/cell/iommu.c
> @@ -644,20 +644,14 @@ static void dma_fixed_unmap_sg(struct device *dev, 
> struct scatterlist *sg,
>  direction, attrs);
>  }
>  
> -static int dma_fixed_dma_supported(struct device *dev, u64 mask)
> -{
> - return mask == DMA_BIT_MASK(64);
> -}
> -
> -static int dma_set_mask_and_switch(struct device *dev, u64 dma_mask);
> +static int dma_suported_and_switch(struct device *dev, u64 dma_mask);
>  
>  static const struct dma_map_ops dma_iommu_fixed_ops = {
>   .alloc  = dma_fixed_alloc_coherent,
>   .free   = dma_fixed_free_coherent,
>   .map_sg = dma_fixed_map_sg,
>   .unmap_sg   = dma_fixed_unmap_sg,
> - .dma_supported  = dma_fixed_dma_supported,
> - .set_dma_mask   = dma_set_mask_and_switch,
> + .dma_supported  = dma_suported_and_switch,
>   .map_page   = dma_fixed_map_page,
>   .unmap_page = dma_fixed_unmap_page,
>   .mapping_error  = dma_iommu_mapping_error,
> @@ -952,11 +946,8 @@ static u64 cell_iommu_get_fixed_address(struct device 
> *dev)
>   return dev_addr;
>  }
>  
> -static int dma_set_mask_and_switch(struct device *dev, u64 dma_mask)
> +static int dma_suported_and_switch(struct device *dev, u64 dma_mask)
>  {
> - if (!dev->dma_mask || !dma_supported(dev, dma_mask))
> - return -EIO;
> -
>   if (dma_mask == DMA_BIT_MASK(64) &&
>   cell_iommu_get_fixed_address(dev) != OF_BAD_ADDR) {
>   u64 addr = cell_iommu_get_fixed_address(dev) +
> @@ -965,14 +956,16 @@ static int dma_set_mask_and_switch(struct device *dev, 
> u64 dma_mask)
>   dev_dbg(dev, "iommu: fixed addr = %llx\n", addr);
>   set_dma_ops(dev, &dma_iommu_fixed_ops);
>   set_dma_offset(dev, addr);
> - } else {
> + return 1;
> + }
> +
> + if (dma_iommu_dma_supported(dev, dma_mask)) {
>   dev_dbg(dev, "iommu: not 64-bit, using default ops\n");
>   set_dma_ops(dev, get_pci_dma_ops());
>   cell_dma_dev_setup(dev);
> + return 1;
>   }
>  
> - *dev->dma_mask = dma_mask;
> -
>   return 0;
>  }
>  
> @@ -1127,7 +1120,7 @@ static int __init cell_iommu_fixed_mapping_init(void)
>   cell_iommu_setup_window(iommu, np, dbase, dsize, 0);
>   }
>  
> - dma_iommu_ops.set_dma_mask = dma_set_mask_and_switch;
> + dma_iommu_ops.dma_supported = dma_suported_and_switch;
>   set_pci_dma_ops(&dma_iommu_ops);
>  
>   return 0;
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 09/22] PCI: Add pci_peer_traffic_supported()

2015-10-21 Thread Benjamin Herrenschmidt
On Tue, 2015-09-15 at 12:10 -0500, Will Davis wrote:

> +bool pci_peer_traffic_supported(struct pci_dev *dev, struct pci_dev *peer)
> +{
> +	struct pci_host_bridge *dev_host_bridge;
> +	struct pci_host_bridge *peer_host_bridge;
> +
> +	/*
> +	 * Disallow the peer-to-peer traffic if the devices do not share a
> +	 * host bridge. The PCI specifications does not make any guarantees
> +	 * about P2P capabilities between devices under separate domains.
> +	 *
> +	 * PCI Local Bus Specification Revision 3.0, section 3.10:
> +	 *    "Peer-to-peer transactions crossing multiple host bridges
> +	 *     PCI host bridges may, but are not required to, support PCI
> +	 *     peer-to-peer transactions that traverse multiple PCI host
> +	 *     bridges."
> +	 */
> +	dev_host_bridge = pci_find_host_bridge(dev->bus);
> +	peer_host_bridge = pci_find_host_bridge(peer->bus);
> +	if (dev_host_bridge != peer_host_bridge)
> +		return false;

This needs to be platform specific. Some architectures will allow
routing between multiple bridges, some won't.
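Purely as an illustration, that could take the shape of a weak hook the
platform can override (pcibios_p2p_routable() is an invented name here,
not an existing kernel interface):

/* Conservative default: only allow P2P under a single host bridge. */
bool __weak pcibios_p2p_routable(struct pci_host_bridge *a,
				 struct pci_host_bridge *b)
{
	return a == b;
}

An architecture whose fabric can route peer-to-peer traffic across host
bridges would then override this with its own knowledge of the topology.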

> +		/*
> +		 * Access Control Services (ACS) Checks
> +		 *
> +		 * ACS has a capability bit for P2P Request Redirects (RR),
> +		 * but unfortunately it doesn't tell us much about the real
> +		 * capabilities of the hardware.
> +		 *
> +		 * PCI Express Base Specification Revision 3.0, section
> +		 * 6.12.1.1:
> +		 *    "ACS P2P Request Redirect: must be implemented by Root
> +		 *     Ports that support peer-to-peer traffic with other
> +		 *     Root Ports; [80]"
> +		 * but
> +		 *    "[80] Root Port indication of ACS P2P Request Redirect
> +		 *     or ACS P2P Completion Redirect support does not imply
> +		 *     any particular level of peer-to-peer support by the
> +		 *     Root Complex, or that peer-to-peer traffic is
> +		 *     supported at all"
> +		 */
> +		struct pci_dev *rpdev = dev->bus->self;
> +		struct pci_dev *rppeer = peer->bus->self;
> +		struct pci_dev *common_upstream;
> +		int pos;
> +		u16 cap;
> +
> +		while ((rpdev) && (pci_is_pcie(rpdev)) &&
> +		       (pci_pcie_type(rpdev) != PCI_EXP_TYPE_ROOT_PORT))
> +			rpdev = rpdev->bus->self;
> +
> +		while ((rppeer) && (pci_is_pcie(rppeer)) &&
> +		       (pci_pcie_type(rppeer) != PCI_EXP_TYPE_ROOT_PORT))
> +			rppeer = rppeer->bus->self;
> +
> +		common_upstream = pci_find_common_upstream_dev(dev, peer);
> +
> +		/*
> +		 * If ACS is not implemented, we have no idea about P2P
> +		 * support. Optimistically allow this if there is a common
> +		 * upstream device.
> +		 */
> +		pos = pci_find_ext_capability(rpdev, PCI_EXT_CAP_ID_ACS);
> +		if (!pos)
> +			return common_upstream != NULL;

We might need a hook as well here. PLX switch may or may not allow it
depending on some configuration bits.

> +		/*
> +		 * If the devices are under the same root port and have a common
> +		 * upstream device, allow if the root port is further upstream
> +		 * from the common upstream device and the common upstream
> +		 * device has Upstream Forwarding disabled, or if the root port
> +		 * is the common upstream device and ACS is not implemented.
> +		 */
> +		pci_read_config_word(rpdev, pos + PCI_ACS_CAP, &cap);
> +		if ((rpdev == rppeer && common_upstream) &&
> +		    (((common_upstream != rpdev) &&
> +		      !pci_acs_enabled(common_upstream, PCI_ACS_UF)) ||
> +		     ((common_upstream == rpdev) && ((cap & PCI_ACS_RR) == 0))))
> +			return true;
> +
> +		/*
> +		 * If ACS RR is implemented and disabled, allow only if the
> +		 * devices are under the same root port.
> +		 */
> +		if (cap & PCI_ACS_RR && !pci_acs_enabled(rpdev, PCI_ACS_RR))
> +			return rpdev == rppeer;
> +
> +		/*
> +		 * If ACS RR is not implemented, or is implemented and enabled,
> +		 * only allow if there's a translation agent enabled to do the
> +		 * redirect.
> +		 */
> +		return iommu_present(&pci_bus_type);
> +	}
> +
> +	return false;
> +}
> +
>  #define RESOURCE_ALIGNMENT_PARAM_SIZE COMMAND_LINE_SIZE
>  static char resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] 

Re: [PATCH v2 2/7] iommu: IOMMU Groups

2012-06-20 Thread Benjamin Herrenschmidt
On Wed, 2012-05-30 at 14:18 -0600, Alex Williamson wrote:

 IOMMU groups also include a userspace representation in sysfs under
 /sys/kernel/iommu_groups.  When allocated, each group is given a
 dynamically assigned ID (int).  The ID is managed by the core IOMMU group
 code to support multiple heterogeneous iommu drivers, which could
 potentially collide in group naming/numbering.  This also keeps group
 IDs to small, easily managed values.  A directory is created under
 /sys/kernel/iommu_groups for each group.  A further subdirectory named
 devices contains links to each device within the group.  The iommu_group
 file in the device's sysfs directory, which formerly contained a group
 number when read, is now a link to the iommu group.  Example:

So first, I'm generally ok with the patch, I have a few comments mostly
for discussion and possible further improvements, but so far nothing
that can't be done via subsequent patches, so let's start with:

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---

Now:

How easy would it be to add our own files there (in sysfs) ? I'm thinking
mostly for debug/diagnostic purposes it would be handy to show some HW
state related to the group, or should I just add debugfs stuff
elsewhere ?
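For the debugfs variant, a minimal sketch of what I have in mind
(assuming an iommu_group_id()-style accessor and a root dentry owned by
the iommu core; the example_* names are made up):

static int example_group_state_show(struct seq_file *s, void *unused)
{
	struct iommu_group *group = s->private;

	/* Dump whatever platform/HW state is associated with the group. */
	seq_printf(s, "group %d: <platform specific state>\n",
		   iommu_group_id(group));
	return 0;
}

static int example_group_state_open(struct inode *inode, struct file *file)
{
	return single_open(file, example_group_state_show, inode->i_private);
}

static const struct file_operations example_group_state_fops = {
	.open		= example_group_state_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= single_release,
};

static void example_group_add_debugfs(struct iommu_group *group,
				      struct dentry *root)
{
	struct dentry *dir;
	char name[16];

	snprintf(name, sizeof(name), "%d", iommu_group_id(group));
	dir = debugfs_create_dir(name, root);
	debugfs_create_file("state", 0400, dir, group,
			    &example_group_state_fops);
}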

 This patch also extends the IOMMU API to allow attaching groups to
 domains.  This is currently a simple wrapper for iterating through
 devices within a group, but it's expected that the IOMMU API may
 eventually make groups a more integral part of domains.

I assume that by domains you mean iommu domains ? Just to be sure
because we also have PCI domains so it can be confusing :-)

 +/**
 + * iommu_group_alloc - Allocate a new group
 + * @name: Optional name to associate with group, visible in sysfs
 + *
 + * This function is called by an iommu driver to allocate a new iommu
 + * group.  The iommu group represents the minimum granularity of the iommu.
 + * Upon successful return, the caller holds a reference to the supplied
 + * group in order to hold the group until devices are added.  Use
 + * iommu_group_put() to release this extra reference count, allowing the
 + * group to be automatically reclaimed once it has no devices or external
 + * references.
 + */
 +struct iommu_group *iommu_group_alloc(void)
  {
 -	unsigned int groupid;
 +	struct iommu_group *group;
 +	int ret;
 +
 +	group = kzalloc(sizeof(*group), GFP_KERNEL);
 +	if (!group)
 +		return ERR_PTR(-ENOMEM);
 +
 +	group->kobj.kset = iommu_group_kset;
 +	mutex_init(&group->mutex);
 +	INIT_LIST_HEAD(&group->devices);
 +	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
 +
 +	mutex_lock(&iommu_group_mutex);
 +
 +again:
 +	if (unlikely(0 == ida_pre_get(&iommu_group_ida, GFP_KERNEL))) {
 +		kfree(group);
 +		mutex_unlock(&iommu_group_mutex);
 +		return ERR_PTR(-ENOMEM);
 +	}
 +
 +	if (-EAGAIN == ida_get_new(&iommu_group_ida, &group->id))
 +		goto again;
 +
 +	mutex_unlock(&iommu_group_mutex);
 
 -	if (iommu_device_group(dev, &groupid) == 0)
 -		return device_create_file(dev, &dev_attr_iommu_group);
 +	ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
 +				   NULL, "%d", group->id);
 +	if (ret) {
 +		mutex_lock(&iommu_group_mutex);
 +		ida_remove(&iommu_group_ida, group->id);
 +		mutex_unlock(&iommu_group_mutex);
 +		kfree(group);
 +		return ERR_PTR(ret);
 +	}
 +
 +	group->devices_kobj = kobject_create_and_add("devices", &group->kobj);
 +	if (!group->devices_kobj) {
 +		kobject_put(&group->kobj); /* triggers .release & free */
 +		return ERR_PTR(-ENOMEM);
 +	}
 +
 +	/*
 +	 * The devices_kobj holds a reference on the group kobject, so
 +	 * as long as that exists so will the group.  We can therefore
 +	 * use the devices_kobj for reference counting.
 +	 */
 +	kobject_put(&group->kobj);
 +
 +	return group;
 +}
 +EXPORT_SYMBOL_GPL(iommu_group_alloc);
 +
 +/**
 + * iommu_group_get_iommudata - retrieve iommu_data registered for a group
 + * @group: the group
 + *
 + * iommu drivers can store data in the group for use when doing iommu
 + * operations.  This function provides a way to retrieve it.  Caller
 + * should hold a group reference.
 + */
 +void *iommu_group_get_iommudata(struct iommu_group *group)
 +{
 +	return group->iommu_data;
 +}
 +EXPORT_SYMBOL_GPL(iommu_group_get_iommudata);

That probably wants to be a static inline ? No biggie, could be done in
a followup patch if we really care.
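For reference, the inline variant would simply be:

static inline void *iommu_group_get_iommudata(struct iommu_group *group)
{
	/* Same semantics, just resolved at compile time in the header. */
	return group->iommu_data;
}

though that only works if struct iommu_group stops being private to
iommu.c, which is probably the real argument for keeping it an exported
function.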

 +/**
 + * iommu_group_set_iommudata - set iommu_data for a group
 + * @group: the group
 + * @iommu_data: new data
 + * @release: release function for iommu_data
 + *
 + * iommu drivers can store data in the group for use when doing iommu
 + * operations.  This function provides a way to set the data after
 + * the group has been allocated.  Caller should hold a group reference