Re: [PATCH 2/3] powerpc/powernv: DMA operations for discontiguous allocation

2018-07-04 Thread Russell Currey
On Fri, 2018-06-29 at 17:34 +1000, Russell Currey wrote:



> +		/*
> +		 * The TCE isn't being used, so let's try and allocate it.
> +		 * Bits 0 and 1 are read/write, and we use bit 2 as a "lock"
> +		 * bit.  This is to prevent any race where the value is set in
> +		 * the TCE table but the invalidate/mb() hasn't finished yet.
> +		 */
> +		entry = cpu_to_be64((addr - offset) | 7);
> +		ret = cmpxchg(&pe->tces[i], tce, entry);
> +		if (ret != tce) {
> +			/* conflict, start looking again just in case */
> +			i--;
> +			continue;
> +		}
> +		pnv_pci_phb3_tce_invalidate(pe, 0, 0, addr - offset, 1);

This is wrong and won't work outside of PHB3; I will make a generic
handler.

> +		mb();
> +		/* clear the lock bit now that we know it's active */
> +		ret = cmpxchg(&pe->tces[i], entry,
> +			      cpu_to_be64((addr - offset) | 3));
> +		if (ret != entry) {
> +			/* conflict, start looking again just in case */
> +			i--;
> +			continue;
> +		}
> +
> +		return (i << phb->ioda.max_tce_order) | offset;
> +	}
> + /* If we get here, the table must be full, so error out. */
> + return -1ULL;
> +}
> +



Re: [PATCH 2/3] powerpc/powernv: DMA operations for discontiguous allocation

2018-07-03 Thread Russell Currey
On Mon, 2018-07-02 at 18:47 +1000, Alexey Kardashevskiy wrote:
> On Fri, 29 Jun 2018 17:34:36 +1000
> Russell Currey  wrote:
> 
> > DMA pseudo-bypass is a new set of DMA operations that solve some
> > issues for
> > devices that want to address more than 32 bits but can't address
> > the 59
> > bits required to enable direct DMA.
> > 
> > The previous implementation for POWER8/PHB3 worked around this by
> > configuring a bypass from the default 32-bit address space into 64-
> > bit
> > address space.  This approach does not work for POWER9/PHB4 because
> > regions of memory are discontiguous and many devices will be unable
> > to
> > address memory beyond the first node.
> > 
> > Instead, implement a new set of DMA operations that allocate TCEs
> > as DMA
> > mappings are requested so that all memory is addressable even when
> > a
> > one-to-one mapping between real addresses and DMA addresses isn't
> > possible.  These TCEs are the maximum size available on the
> > platform,
> > which is 256M on PHB3 and 1G on PHB4.
> > 
> > Devices can now map any region of memory up to the maximum amount
> > they can
> > address according to the DMA mask set, in chunks of the largest
> > available
> > TCE size.
> > 
> > This implementation replaces the need for the existing PHB3
> > solution and
> > should be compatible with future PHB versions.
> > 
> > It is, however, rather naive.  There is no unmapping, and as a
> > result
> > devices can eventually run out of space if they address their
> > entire
> > DMA mask worth of TCEs.  An implementation with unmap() will come
> > in
> > future (and requires a much more complex implementation), but this
> > is a
> > good start due to the drastic performance improvement.
> 
> 
> Why doesn't dma_iommu_ops work in this case? I keep asking, and yet
> there is no comment in the commit log or mails...

Yes, I should cover this in the commit message.

So the primary reason the IOMMU doesn't work for this case is the TCE
allocation - the IOMMU doesn't refcount mappings and will allocate a new
(on P9, 1GB) TCE for each map, which quickly exhausts the table and fails.

This isn't intended to be a replacement for the IOMMU, it's a
roundabout way of achieving what the direct ops do (like NVIDIA devices
on P8 can do today).


Re: [PATCH 2/3] powerpc/powernv: DMA operations for discontiguous allocation

2018-07-02 Thread Alexey Kardashevskiy
On Fri, 29 Jun 2018 17:34:36 +1000
Russell Currey  wrote:

> DMA pseudo-bypass is a new set of DMA operations that solve some issues for
> devices that want to address more than 32 bits but can't address the 59
> bits required to enable direct DMA.
> 
> The previous implementation for POWER8/PHB3 worked around this by
> configuring a bypass from the default 32-bit address space into 64-bit
> address space.  This approach does not work for POWER9/PHB4 because
> regions of memory are discontiguous and many devices will be unable to
> address memory beyond the first node.
> 
> Instead, implement a new set of DMA operations that allocate TCEs as DMA
> mappings are requested so that all memory is addressable even when a
> one-to-one mapping between real addresses and DMA addresses isn't
> possible.  These TCEs are the maximum size available on the platform,
> which is 256M on PHB3 and 1G on PHB4.
> 
> Devices can now map any region of memory up to the maximum amount they can
> address according to the DMA mask set, in chunks of the largest available
> TCE size.
> 
> This implementation replaces the need for the existing PHB3 solution and
> should be compatible with future PHB versions.
> 
> It is, however, rather naive.  There is no unmapping, and as a result
> devices can eventually run out of space if they address their entire
> DMA mask worth of TCEs.  An implementation with unmap() will come in
> future (and requires a much more complex implementation), but this is a
> good start due to the drastic performance improvement.


Why doesn't dma_iommu_ops work in this case? I keep asking, and yet
there is no comment in the commit log or mails...


> 
> Signed-off-by: Russell Currey 
> ---
>  arch/powerpc/include/asm/dma-mapping.h|   1 +
>  arch/powerpc/platforms/powernv/Makefile   |   2 +-
>  arch/powerpc/platforms/powernv/pci-dma.c  | 243 ++
>  arch/powerpc/platforms/powernv/pci-ioda.c |  82 +++-
>  arch/powerpc/platforms/powernv/pci.h  |   7 +
>  5 files changed, 281 insertions(+), 54 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c

Too generic file name for such a hack.


> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa394520af6..354f435160f3 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -74,6 +74,7 @@ static inline unsigned long device_to_mask(struct device 
> *dev)
>  extern struct dma_map_ops dma_iommu_ops;
>  #endif
>  extern const struct dma_map_ops dma_nommu_ops;
> +extern const struct dma_map_ops dma_pseudo_bypass_ops;
>  
>  static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type 
> *bus)
>  {
> diff --git a/arch/powerpc/platforms/powernv/Makefile 
> b/arch/powerpc/platforms/powernv/Makefile
> index 703a350a7f4e..2467bdab3c13 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o 
> opal-power.o opal-irqchip.o
>  obj-y+= opal-kmsg.o opal-powercap.o opal-psr.o 
> opal-sensor-groups.o
>  
>  obj-$(CONFIG_SMP)+= smp.o subcore.o subcore-asm.o
> -obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o
> +obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o pci-dma.o
>  obj-$(CONFIG_CXL_BASE)   += pci-cxl.o
>  obj-$(CONFIG_EEH)+= eeh-powernv.o
>  obj-$(CONFIG_PPC_SCOM)   += opal-xscom.o
> diff --git a/arch/powerpc/platforms/powernv/pci-dma.c 
> b/arch/powerpc/platforms/powernv/pci-dma.c
> new file mode 100644
> index ..79382627c7be
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/pci-dma.c
> @@ -0,0 +1,243 @@
> +/*
> + * DMA operations supporting pseudo-bypass for PHB3+
> + *
> + * Author: Russell Currey 
> + *
> + * Copyright 2018 IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; either version 2 of the License, or (at your
> + * option) any later version.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "pci.h"
> +
> +/*
> + * This is a naive implementation that directly operates on TCEs, allocating
> + * on demand.  There is no locking or refcounts since no TCEs are ever 
> removed
> + * and unmap does nothing.
> + */
> +static dma_addr_t dma_pseudo_bypass_get_address(struct device *dev,
> + phys_addr_t addr)
> +{
> + struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
> + struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> + struct pnv_phb *phb = hose->private_data;
> + struct pnv_ioda_pe *pe;
> + u64 i, tce, ret, offset;
> + __be64 entry;
> +
> + offset = addr & ((1 << phb->ioda.max_tce_order) - 1);
> +
> + 

Re: [PATCH 2/3] powerpc/powernv: DMA operations for discontiguous allocation

2018-06-29 Thread Benjamin Herrenschmidt
On Fri, 2018-06-29 at 17:34 +1000, Russell Currey wrote:
> DMA pseudo-bypass is a new set of DMA operations that solve some issues for
> devices that want to address more than 32 bits but can't address the 59
> bits required to enable direct DMA.

One thing you may need to add (I didn't see it with a cursory glance
but maybe it's there) is some form of handling of allocations or
mapping requests that span a TCE boundary.

For allocations, since they are page orders, that means you only have
to check if they are bigger than a TCE page.

For mappings, you need to check individual sglist entries.

At this stage, if you hit that all you can do is fail, with maybe a
rate limited printk. But it's better than whatever corruption or
misbehaviour will happen if you don't catch them. I don't expect this
to happen much if at all with 1G pages, as most "sg" mappings are
probably be in unit of pages, but I still want to catch if it does
happen.

> The previous implementation for POWER8/PHB3 worked around this by
> configuring a bypass from the default 32-bit address space into 64-bit
> address space.  This approach does not work for POWER9/PHB4 because
> regions of memory are discontiguous and many devices will be unable to
> address memory beyond the first node.
> 
> Instead, implement a new set of DMA operations that allocate TCEs as DMA
> mappings are requested so that all memory is addressable even when a
> one-to-one mapping between real addresses and DMA addresses isn't
> possible.  These TCEs are the maximum size available on the platform,
> which is 256M on PHB3 and 1G on PHB4.
> 
> Devices can now map any region of memory up to the maximum amount they can
> address according to the DMA mask set, in chunks of the largest available
> TCE size.
> 
> This implementation replaces the need for the existing PHB3 solution and
> should be compatible with future PHB versions.
> 
> It is, however, rather naive.  There is no unmapping, and as a result
> devices can eventually run out of space if they address their entire
> DMA mask worth of TCEs.  An implementation with unmap() will come in
> future (and requires a much more complex implementation), but this is a
> good start due to the drastic performance improvement.
> 
> Signed-off-by: Russell Currey 
> ---
>  arch/powerpc/include/asm/dma-mapping.h|   1 +
>  arch/powerpc/platforms/powernv/Makefile   |   2 +-
>  arch/powerpc/platforms/powernv/pci-dma.c  | 243 ++
>  arch/powerpc/platforms/powernv/pci-ioda.c |  82 +++-
>  arch/powerpc/platforms/powernv/pci.h  |   7 +
>  5 files changed, 281 insertions(+), 54 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa394520af6..354f435160f3 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -74,6 +74,7 @@ static inline unsigned long device_to_mask(struct device 
> *dev)
>  extern struct dma_map_ops dma_iommu_ops;
>  #endif
>  extern const struct dma_map_ops dma_nommu_ops;
> +extern const struct dma_map_ops dma_pseudo_bypass_ops;
>  
>  static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type 
> *bus)
>  {
> diff --git a/arch/powerpc/platforms/powernv/Makefile 
> b/arch/powerpc/platforms/powernv/Makefile
> index 703a350a7f4e..2467bdab3c13 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o 
> opal-power.o opal-irqchip.o
>  obj-y+= opal-kmsg.o opal-powercap.o opal-psr.o 
> opal-sensor-groups.o
>  
>  obj-$(CONFIG_SMP)+= smp.o subcore.o subcore-asm.o
> -obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o
> +obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o pci-dma.o
>  obj-$(CONFIG_CXL_BASE)   += pci-cxl.o
>  obj-$(CONFIG_EEH)+= eeh-powernv.o
>  obj-$(CONFIG_PPC_SCOM)   += opal-xscom.o
> diff --git a/arch/powerpc/platforms/powernv/pci-dma.c 
> b/arch/powerpc/platforms/powernv/pci-dma.c
> new file mode 100644
> index ..79382627c7be
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/pci-dma.c
> @@ -0,0 +1,243 @@
> +/*
> + * DMA operations supporting pseudo-bypass for PHB3+
> + *
> + * Author: Russell Currey 
> + *
> + * Copyright 2018 IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; either version 2 of the License, or (at your
> + * option) any later version.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "pci.h"
> +
> +/*
> + * This is a naive implementation that directly operates on TCEs, allocating
> + * on demand.  There is no locking or refcounts since no TCEs are ever
> + * removed and unmap does nothing.

[PATCH 2/3] powerpc/powernv: DMA operations for discontiguous allocation

2018-06-29 Thread Russell Currey
DMA pseudo-bypass is a new set of DMA operations that solve some issues for
devices that want to address more than 32 bits but can't address the 59
bits required to enable direct DMA.

The previous implementation for POWER8/PHB3 worked around this by
configuring a bypass from the default 32-bit address space into 64-bit
address space.  This approach does not work for POWER9/PHB4 because
regions of memory are discontiguous and many devices will be unable to
address memory beyond the first node.

Instead, implement a new set of DMA operations that allocate TCEs as DMA
mappings are requested so that all memory is addressable even when a
one-to-one mapping between real addresses and DMA addresses isn't
possible.  These TCEs are the maximum size available on the platform,
which is 256M on PHB3 and 1G on PHB4.

Devices can now map any region of memory up to the maximum amount they can
address according to the DMA mask set, in chunks of the largest available
TCE size.

This implementation replaces the need for the existing PHB3 solution and
should be compatible with future PHB versions.

It is, however, rather naive.  There is no unmapping, and as a result
devices can eventually run out of space if they address their entire
DMA mask worth of TCEs.  An implementation with unmap() will come in
future (and requires a much more complex implementation), but this is a
good start due to the drastic performance improvement.

Signed-off-by: Russell Currey 
---
 arch/powerpc/include/asm/dma-mapping.h|   1 +
 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/platforms/powernv/pci-dma.c  | 243 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |  82 +++-
 arch/powerpc/platforms/powernv/pci.h  |   7 +
 5 files changed, 281 insertions(+), 54 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 8fa394520af6..354f435160f3 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -74,6 +74,7 @@ static inline unsigned long device_to_mask(struct device *dev)
 extern struct dma_map_ops dma_iommu_ops;
 #endif
 extern const struct dma_map_ops dma_nommu_ops;
+extern const struct dma_map_ops dma_pseudo_bypass_ops;
 
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
 {
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 703a350a7f4e..2467bdab3c13 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,7 +6,7 @@ obj-y   += opal-msglog.o opal-hmi.o 
opal-power.o opal-irqchip.o
 obj-y  += opal-kmsg.o opal-powercap.o opal-psr.o 
opal-sensor-groups.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
-obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
+obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-dma.o
 obj-$(CONFIG_CXL_BASE) += pci-cxl.o
 obj-$(CONFIG_EEH)  += eeh-powernv.o
 obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
diff --git a/arch/powerpc/platforms/powernv/pci-dma.c 
b/arch/powerpc/platforms/powernv/pci-dma.c
new file mode 100644
index ..79382627c7be
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pci-dma.c
@@ -0,0 +1,243 @@
+/*
+ * DMA operations supporting pseudo-bypass for PHB3+
+ *
+ * Author: Russell Currey 
+ *
+ * Copyright 2018 IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include "pci.h"
+
+/*
+ * This is a naive implementation that directly operates on TCEs, allocating
+ * on demand.  There is no locking or refcounts since no TCEs are ever removed
+ * and unmap does nothing.
+ */
+static dma_addr_t dma_pseudo_bypass_get_address(struct device *dev,
+   phys_addr_t addr)
+{
+   struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   struct pnv_ioda_pe *pe;
+   u64 i, tce, ret, offset;
+   __be64 entry;
+
+   offset = addr & ((1 << phb->ioda.max_tce_order) - 1);
+
> +   pe = &phb->ioda.pe_array[pci_get_pdn(pdev)->pe_number];
+
+   /* look through the tracking table for a free entry */
+   for (i = 0; i < pe->tce_count; i++) {
+   /* skip between 2GB and 4GB */
> +   if ((i << phb->ioda.max_tce_order) >= 0x80000000 &&
> +       (i << phb->ioda.max_tce_order) < 0x100000000)
+   continue;
+
+   tce = be64_to_cpu(pe->tces[i]);
+
+   /* if