Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size

2015-11-06 Thread Nishanth Aravamudan
On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig 
> 
> ... but I doubt we'll ever bother updating it.  Most architectures
> with larger page sizes also have iommus and would need different settings
> for different iommus vs direct mapping for very little gain.  There's a
> reason why we never bothered for RDMA either.

FWIW, whose tree should this go through? The bug only appears on Power,
afaik, but the patch is now just an NVMe change.

Thanks,
Nish


[PATCH 1/1 v4] drivers/nvme: default to 4k device page size

2015-11-05 Thread Nishanth Aravamudan
On 03.11.2015 [13:46:25 +], Keith Busch wrote:
> On Tue, Nov 03, 2015 at 05:18:24AM -0800, Christoph Hellwig wrote:
> > On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > > index ccc0c1f93daa..a9a5285bdb39 100644
> > > --- a/drivers/block/nvme-core.c
> > > +++ b/drivers/block/nvme-core.c
> > > @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
> > >   u32 aqa;
> > >   u64 cap = readq(&dev->bar->cap);
> > >   struct nvme_queue *nvmeq;
> > > - unsigned page_shift = PAGE_SHIFT;
> > > + /*
> > > +  * default to a 4K page size, with the intention to update this
> > > +  * path in the future to accommodate architectures with differing
> > > +  * kernel and IO page sizes.
> > > +  */
> > > + unsigned page_shift = 12;
> > >   unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
> > >   unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
> > 
> > Looks good as a start.  Note that all the MPSMIN/MAX checking could
> > be removed as NVMe devices must support 4k pages.
> 
> MAX can go, and while it's probably the case that all devices support 4k,
> it's not a spec requirement, so we should keep the dev_page_min check.

Ok, here's an updated patch.

We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA
alignment matches the kernel's page alignment. On Power, the IOMMU page
size, as mentioned above, can be 4K, while the device can have a page
size of 8K and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not of 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures, so in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

With the functionally equivalent v3 of this patch, our hardware test
exerciser survives when using 32-bit DMA; without the patch, the kernel
will BUG within a few minutes.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

v2 -> v3:
  In the interest of fixing the functional problem in the short-term,
  just force the device page size to 4K and work on adding the new API
  in the next merge window.

v3 -> v4:
  Rebase onto 4.3, including the new code locations.
  Based upon feedback from Keith Busch and Christoph Hellwig, remove the
  device max check, as the spec requires MPSMAX >= 4K.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e878590e71b6..00ca45bb0bc0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1701,9 +1701,13 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
u32 aqa;
u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   /*
+* default to a 4K page size, with the intention to update this
+* path in the future to accommodate architectures with differing
+* kernel and IO page sizes.
+*/
+   unsigned page_shift = 12;
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
-   unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 
if (page_shift < dev_page_min) {
dev_err(dev->dev,
@@ -1712,13 +1716,6 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
"Minimum device page size (%u) too large for "
"host (%u)\n", 1 << dev_page_min,
1 << page_shift);
return -ENODEV;
}
-   if (page_shift > dev_page_max) {
-   dev_info(dev->dev,
-   "Device maximum page size (%u) smaller than "
-   "host (%u); enabling work-around\n",
-   1 << dev_page_max, 1 << page_shift);
-   page_shift = dev_page_max;
-   }
 
dev->subsystem = readl(&dev->bar->vs) >= NVME_VS(1, 1) ?
NVME_CAP_NSSRC(cap) : 0;
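
For reference, the MPSMIN/MPSMAX values above come straight out of the
controller's CAP register. A sketch of the derivation (field positions
as in the kernel's NVME_CAP_MPSMIN/NVME_CAP_MPSMAX macros; the helper
functions and the u64 typedef are just for a standalone illustration):

/* CAP.MPSMIN is bits 51:48 and CAP.MPSMAX is bits 55:52; the device
 * page size is 2^(12 + field), so a field value of 0 means 4K. */
typedef unsigned long long u64;	/* stand-in for the kernel type */

#define NVME_CAP_MPSMIN(cap)	(((cap) >> 48) & 0xf)
#define NVME_CAP_MPSMAX(cap)	(((cap) >> 52) & 0xf)

static unsigned int dev_page_min_shift(u64 cap)
{
	return NVME_CAP_MPSMIN(cap) + 12;	/* smallest shift the device accepts */
}

static unsigned int dev_page_max_shift(u64 cap)
{
	return NVME_CAP_MPSMAX(cap) + 12;	/* largest shift the device accepts */
}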


Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size

2015-11-05 Thread Nishanth Aravamudan
On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig 
> 
> ... but I doubt we'll ever bother updating it.  Most architectures
> with larger page sizes also have iommus and would need different settings
> for different iommus vs direct mapping for very little gain.  There's a
> reason why we never bothered for RDMA either.

Fair enough :) Thanks for all your reviews and comments.

-Nish


[PATCH 1/1 v3] drivers/nvme: default to 4k device page size

2015-10-30 Thread Nishanth Aravamudan
On 29.10.2015 [17:20:43 +], Busch, Keith wrote:
> On Thu, Oct 29, 2015 at 08:57:01AM -0700, Nishanth Aravamudan wrote:
> > On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote:
> > > We had a quick chat about this issue and I think we simply should
> > > default to an NVMe controller page size of 4k everywhere as that's the
> > > safe default.  This is also what we do for RDMA memory registrations and
> > > it works fine there for SRP and iSER.
> > 
> > So, would that imply changing just the NVMe driver code rather than
> > adding the dma_page_shift API at all? What about
> > architectures that can support the larger page sizes? There is an
> > implied performance impact, at least, of shifting the IO size down.
> 
> It is the safe option, but you're right that it might have a
> measurable performance impact (can you run an experiment?). Maybe we
> should just change the driver to always use MPSMIN for the moment in
> the interest of time, and you can flush out the new API before the
> next merge window.

Given that it's 4K just about everywhere by default (and sort of
implicitly expected to be, I guess), I think I'd prefer we default to
4K. That should mitigate the performance impact (I'll ask our IO team to
do some runs, but since this impacts functionality on some hardware, I
don't think it's too relevant for now). Unless there are NVMe devices
with an MPSMAX < 4K?

Something like the following?



We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA
alignment matches the kernel's page alignment. On Power, the IOMMU page
size, as mentioned above, can be 4K, while the device can have a page
size of 8K and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not of 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures, so in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

v2 -> v3:
  In the interest of fixing the functional problem in the short-term,
  just force the device page size to 4K and work on adding the new API
  in the next merge window.

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ccc0c1f93daa..a9a5285bdb39 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
u32 aqa;
u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   /*
+* default to a 4K page size, with the intention to update this
+* path in the future to accommodate architectures with differing
+* kernel and IO page sizes.
+*/
+   unsigned page_shift = 12;
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 




-Nish


Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size

2015-10-30 Thread Nishanth Aravamudan
On 30.10.2015 [21:48:48 +], Keith Busch wrote:
> On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > Given that it's 4K just about everywhere by default (and sort of
> > implicitly expected to be, I guess), I think I'd prefer we default to
> > 4K. That should mitigate the performance impact (I'll ask our IO team to
> > do some runs, but since this impacts functionality on some hardware, I
> > don't think it's too relevant for now). Unless there are NVMe devices
> > with an MPSMAX < 4K?
> 
> Right, I assumed MPSMIN was always 4k for the same reason you mentioned,
> but you can hard code it like you've done in your patch.
> 
> The spec defines MPSMAX such that it's impossible to find a device
> with MPSMAX < 4k.

Great, thanks!

-Nish


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-30 Thread Nishanth Aravamudan
On 29.10.2015 [18:49:55 -0700], David Miller wrote:
> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> Date: Thu, 29 Oct 2015 08:57:01 -0700
> 
> > So, would that imply changing just the NVMe driver code rather than
> > adding the dma_page_shift API at all? What about
> > architectures that can support the larger page sizes? There is an
> > implied performance impact, at least, of shifting the IO size down.
> > 
> > Sorry for the continuing questions -- I got lots of conflicting feedback
> > on the last series and want to make sure v4 is more acceptable.
> 
> In the long term I would be very happy to see us having a real interface
> for this stuff, just my opinion...

Yep, I think I'll try and balance the two -- fix NVMe for now with a 4K
page size as suggested by Christoph, and then work on the more complete
API for the next merge window.

Thanks,
Nish


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-29 Thread Nishanth Aravamudan
On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote:
> On Wed, Oct 28, 2015 at 01:59:23PM +, Busch, Keith wrote:
> > The "new" interface for all the other architectures is the same as the
> > old one we've been using for the last 5 years.
> > 
> > I welcome x86 maintainer feedback to confirm virtual and DMA addresses
> > have the same offset at 4k alignment, but I have to insist we don't
> > break my currently working hardware to force their attention.
> 
> We had a quick chat about this issue and I think we simply should
> default to an NVMe controller page size of 4k everywhere as that's the
> safe default.  This is also what we do for RDMA memory registrations and
> it works fine there for SRP and iSER.

So, would that imply changing just the NVMe driver code rather than
adding the dma_page_shift API at all? What about
architectures that can support the larger page sizes? There is an
implied performance impact, at least, of shifting the IO size down.

Sorry for the continuing questions -- I got lots of conflicting feedback
on the last series and want to make sure v4 is more acceptable.

Thanks,
Nish


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> Date: Fri, 23 Oct 2015 13:54:20 -0700
> 
> > 1) add a generic dma_get_page_shift implementation that just returns
> > PAGE_SHIFT
> 
> I won't object to this patch series, but if I had implemented this I
> would have required the architectures to implement this explicitly,
> one-by-one.  I think it is less error prone and more likely to end
> up with all the architectures setting this correctly.

Well, looks like I should spin up a v4 anyways for the powerpc changes.
So, to make sure I understand your point, should I make the generic
dma_get_page_shift a compile-error kind of thing? It will only fail on
architectures that actually build the NVME driver (as the only caller).
But I'm not sure how exactly to achieve that, if you could give a bit
more detail I'd appreciate it!

Thanks,
Nish


Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [16:56:10 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote:
> >When DDW (Dynamic DMA Windows) are present for a device, we have stored
> >the TCE (Translation Control Entry) size in a special device tree
> >property. Check if we have enabled DDW for the device and return the TCE
> >size from that property if present. If the property isn't present,
> >fall back to looking the value up in struct iommu_table. If we don't find
> >an iommu_table, fall back to the kernel's page size.
> >
> >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >---
> >  arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 36 insertions(+)
> >
> >diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> >index 0946b98..1bf6471 100644
> >--- a/arch/powerpc/platforms/pseries/iommu.c
> >+++ b/arch/powerpc/platforms/pseries/iommu.c
> >@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
> > return dma_iommu_ops.get_required_mask(dev);
> >  }
> >
> >+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
> >+{
> >+struct iommu_table *tbl;
> >+
> >+if (!disable_ddw && dev_is_pci(dev)) {
> >+struct pci_dev *pdev = to_pci_dev(dev);
> >+struct device_node *dn;
> >+
> >+dn = pci_device_to_OF_node(pdev);
> >+
> >+/* search upwards for ibm,dma-window */
> >+for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
> >+dn = dn->parent)
> >+if (of_get_property(dn, "ibm,dma-window", NULL))
> >+break;
> >+/*
> >+ * if there is a DDW configuration, the TCE shift is stored in
> >+ * the property
> >+ */
> >+if (dn && PCI_DN(dn)) {
> >+const struct dynamic_dma_window_prop *direct64 =
> >+of_get_property(dn, DIRECT64_PROPNAME, NULL);
> 
> 
> This DIRECT64_PROPNAME property is only present under pHyp, QEMU/KVM
> does not set it as 64bit windows are dynamic there so something like
> find_existing_ddw() needs to be used here.

DIRECT64_PROPNAME is a Linux thing, not a pHyp or QEMU/KVM thing -- it's
created by the Linux DDW logic and left in the device-tree when we
successfully configure DDW.

You're right, though, that logically find_existing_ddw() would be better
to use here. I'll spin up a new version.

-Nish


Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >On Power, the kernel's page size can differ from the IOMMU's page size,
> >so we need to override the generic implementation, which always returns
> >the kernel's page size. Look up the IOMMU's page size from struct
> >iommu_table, if available. Fall back to the kernel's page size,
> >otherwise.
> >
> >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >---
> >  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >  arch/powerpc/kernel/dma.c  | 9 +
> >  2 files changed, 12 insertions(+)
> >
> >diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> >index 7f522c0..c5638f4 100644
> >--- a/arch/powerpc/include/asm/dma-mapping.h
> >+++ b/arch/powerpc/include/asm/dma-mapping.h
> >@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
> >  #define HAVE_ARCH_DMA_SET_MASK 1
> >  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >
> >+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >+extern unsigned long dma_get_page_shift(struct device *dev);
> >+
> >  #include <asm-generic/dma-mapping-common.h>
> >
> >  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >--- a/arch/powerpc/kernel/dma.c
> >+++ b/arch/powerpc/kernel/dma.c
> >@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >  }
> >  EXPORT_SYMBOL(dma_set_mask);
> >
> >+unsigned long dma_get_page_shift(struct device *dev)
> >+{
> >+struct iommu_table *tbl = get_iommu_table_base(dev);
> >+if (tbl)
> >+return tbl->it_page_shift;
> 
> 
> All PCI devices have this initialized on POWER (at least, our, IBM's
> POWER) so 4K will always be returned here while in the case of
> (get_dma_ops(dev) == &dma_direct_ops) it could actually return
> PAGE_SHIFT. Is 4K still the preferred value to return here?

Right, so the logic of my series, goes like this:

a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
PAGE_SHIFT everywhere, including Power.

b) After 2/7, the Power code will return either the IOMMU table's shift
value, if set, or PAGE_SHIFT (I guess this would be the case if
get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
than what we have now, except we can return the accurate IOMMU value if
available.

c) After 3/7, the platform can override the generic Power
get_dma_page_shift().

d) After 4/7, pseries will return the DDW value, if available, then
fall back to the IOMMU table's value. I think in the case of
get_dma_ops(dev) == &dma_direct_ops, the only way that can happen is if we
are using DDW, right?
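
Condensed into one place, the resolution order in a) through d) looks
roughly like this (a sketch stitched together from patches 2/7 and 3/7,
not a literal quote of either):

/* Sketch of the final page-shift resolution order: platform hook
 * (e.g. the DDW-aware pseries one) first, then the device's
 * iommu_table, then the kernel's PAGE_SHIFT as a last resort. */
unsigned long dma_get_page_shift(struct device *dev)
{
	struct iommu_table *tbl = get_iommu_table_base(dev);

	if (ppc_md.dma_get_page_shift)
		return ppc_md.dma_get_page_shift(dev);
	if (tbl)
		return tbl->it_page_shift;
	return PAGE_SHIFT;
}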

-Nish


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [09:57:48 +1100], Julian Calaby wrote:
> Hi Nishanth,
> 
> On Wed, Oct 28, 2015 at 9:20 AM, Nishanth Aravamudan
> <n...@linux.vnet.ibm.com> wrote:
> > On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> >> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >> Date: Fri, 23 Oct 2015 13:54:20 -0700
> >>
> >> > 1) add a generic dma_get_page_shift implementation that just returns
> >> > PAGE_SHIFT
> >>
> >> I won't object to this patch series, but if I had implemented this I
> >> would have required the architectures to implement this explicitly,
> >> one-by-one.  I think it is less error prone and more likely to end
> >> up with all the architectures setting this correctly.
> >
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> He's suggesting that you _don't_ put a generic implementation in
> /include/linux/dma-mapping.h and instead add it to _every_
> architecture.

Ah, I see! Well, I don't know much about the DMA internals of most
architectures -- and my approach kept things functionally the same
everywhere (using PAGE_SHIFT) except:

a) Power, where I know it doesn't work as-is
and
b) sparc, where the code implied that a different value than PAGE_SHIFT
should be used.

Thanks,
Nish


Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [12:00:20 +1100], Alexey Kardashevskiy wrote:
> On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote:
> >On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> >>On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >>>On Power, the kernel's page size can differ from the IOMMU's page size,
> >>>so we need to override the generic implementation, which always returns
> >>>the kernel's page size. Look up the IOMMU's page size from struct
> >>>iommu_table, if available. Fall back to the kernel's page size,
> >>>otherwise.
> >>>
> >>>Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >>>---
> >>>  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >>>  arch/powerpc/kernel/dma.c  | 9 +
> >>>  2 files changed, 12 insertions(+)
> >>>
> >>>diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> >>>index 7f522c0..c5638f4 100644
> >>>--- a/arch/powerpc/include/asm/dma-mapping.h
> >>>+++ b/arch/powerpc/include/asm/dma-mapping.h
> >>>@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
> >>>  #define HAVE_ARCH_DMA_SET_MASK 1
> >>>  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >>>
> >>>+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >>>+extern unsigned long dma_get_page_shift(struct device *dev);
> >>>+
> >>>  #include <asm-generic/dma-mapping-common.h>
> >>>
> >>>  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >>>diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >>>--- a/arch/powerpc/kernel/dma.c
> >>>+++ b/arch/powerpc/kernel/dma.c
> >>>@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >>>  }
> >>>  EXPORT_SYMBOL(dma_set_mask);
> >>>
> >>>+unsigned long dma_get_page_shift(struct device *dev)
> >>>+{
> >>>+  struct iommu_table *tbl = get_iommu_table_base(dev);
> >>>+  if (tbl)
> >>>+  return tbl->it_page_shift;
> >>
> >>
> >>All PCI devices have this initialized on POWER (at least, our, IBM's
> >>POWER) so 4K will always be returned here while in the case of
> >>(get_dma_ops(dev) == &dma_direct_ops) it could actually return
> >>PAGE_SHIFT. Is 4K still the preferred value to return here?
> >
> >Right, so the logic of my series, goes like this:
> >
> >a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
> >PAGE_SHIFT everywhere, including Power.
> >
> >b) After 2/7, the Power code will return either the IOMMU table's shift
> >value, if set, or PAGE_SHIFT (I guess this would be the case if
> >get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
> >than what we have now, except we can return the accurate IOMMU value if
> >available.
> 
> If it is not available, then something went wrong and BUG_ON(!tbl ||
> !tbl->it_page_shift) makes more sense here than pretending that this
> function can ever return PAGE_SHIFT. imho.

That's a good point, thanks!

> >c) After 3/7, the platform can override the generic Power
> >get_dma_page_shift().
> >
> >d) After 4/7, pseries will return the DDW value, if available, then
> >fall back to the IOMMU table's value. I think in the case of
> >get_dma_ops(dev) == &dma_direct_ops, the only way that can happen is if we
> >are using DDW, right?
> 
> This is for pseries guests; for the powernv host it is a "bypass"
> mode which does 64-bit direct DMA mapping and there is no additional
> window for that (i.e. DIRECT64_PROPNAME, etc).

You're right! I should update the code to handle both cases.

In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?

Seems like this would be a different platform implementation I'd put in
for 'powernv', is that right?

My apologies for missing that, and thank you for the review!

-Nish


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [17:53:22 -0700], David Miller wrote:
> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> Date: Tue, 27 Oct 2015 15:20:10 -0700
> 
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> Yes, I am basically suggesting to simply not provide a default at all.

For my own edification -- what is the way that gets resolved? I guess I
mean it seems like linux-next would cease to compile because of my new
series. Would my patches just get kicked out of -next for introducing
that (or even via the 0-day notifications), or should I put something
into the commit message indicating it is an API introduction?

Sorry for the tentativeness, I have not introduced a cross-architecture
API like this before.

Thanks,
Nish


Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [11:20:05 +0900], Benjamin Herrenschmidt wrote:
> On Tue, 2015-10-27 at 18:54 -0700, Nishanth Aravamudan wrote:
> > 
> > In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?
> 
> None :-) The TCEs are completely bypassed. You get a N:M linear mapping
> of all memory starting at 1<<59 PCI side.

Err, duh, sorry! Ok, so in that case, DMA page shift is PAGE_SHIFT,
then?
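
For context, the bypass window Ben describes is just a fixed linear
offset. A sketch, with the 1<<59 base taken from his reply and the
helper name made up here:

/* powernv TCE-bypass addressing sketch: no TCEs and no IOMMU page
 * granularity; the PCI-side DMA address is a constant offset from
 * the physical address. */
static inline u64 bypass_dma_addr(u64 phys_addr)
{
	return phys_addr + (1ULL << 59);
}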

-Nish


[PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-23 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA
alignment matches the kernel's page alignment. On Power, the IOMMU page
size, as mentioned above, can be 4K, while the device can have a page
size of 8K and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not of 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

This series consists of seven patches:

1) add a generic dma_get_page_shift implementation that just returns
PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's
page shift if available
3) allow further specific overriding on power with machdep platform
overrides
4) use the machdep override on pseries, as the DDW code puts the TCE
shift in a special property and there is no IOMMU table available
5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
include/asm
6) override the generic implementation on sparc to use IOMMU_PAGE_SHIFT
7) leverage the new API in the NVMe driver

With these patches, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

 arch/powerpc/include/asm/dma-mapping.h |  3 +++
 arch/powerpc/include/asm/machdep.h     |  3 ++-
 arch/powerpc/kernel/dma.c              | 11 +++++++++++
 arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++++
 arch/sparc/include/asm/dma-mapping.h   |  8 ++++++++
 arch/sparc/include/asm/iommu_common.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/sparc/kernel/iommu.c              |  2 +-
 arch/sparc/kernel/iommu_common.h       | 51 ---------------------------------------------------
 arch/sparc/kernel/pci_psycho.c         |  2 +-
 arch/sparc/kernel/pci_sabre.c          |  2 +-
 arch/sparc/kernel/pci_schizo.c         |  2 +-
 arch/sparc/kernel/pci_sun4v.c          |  2 +-
 arch/sparc/kernel/psycho_common.c      |  2 +-
 arch/sparc/kernel/sbus.c               |  3 +--
 drivers/block/nvme-core.c              |  3 ++-
 include/linux/dma-mapping.h            |  7 +++++++
 16 files changed, 127 insertions(+), 61 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an
  arch-specific hack, expose the DMA page shift via a generic DMA API and
  override it on Power as needed.
v2 -> v3:
  Based upon feedback from Christoph Hellwig, put the generic
  implementation in include/linux/dma-mapping.h, since not all archs use
  include/asm-generic/dma-mapping-common.h.
  Add sparc implementation, as that arch seems to have a different IOMMU
  page size.


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-23 Thread Nishanth Aravamudan
[Sorry, subject should have been 0/7!]

On 23.10.2015 [13:54:20 -0700], Nishanth Aravamudan wrote:
> We received a bug report recently when DDW (64-bit direct DMA on Power)
> is not enabled for NVMe devices. In that case, we fall back to 32-bit
> DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
> Entries).
> 
> The NVMe device driver, though, assumes that the DMA alignment for the
> PRP entries will match the device's page size, and that the DMA
> alignment matches the kernel's page alignment. On Power, the IOMMU page
> size, as mentioned above, can be 4K, while the device can have a page
> size of 8K and the kernel a page size of 64K. This eventually trips the
> BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
> of 4K but not of 8K (e.g., 0xF000).
> 
> In this particular case, and generally, we want to use the IOMMU's page
> size for the default device page size, rather than the kernel's page
> size.
> 
> This series consists of seven patches:
> 
> 1) add a generic dma_get_page_shift implementation that just returns
> PAGE_SHIFT
> 2) override the generic implementation on Power to use the IOMMU table's
> page shift if available
> 3) allow further specific overriding on power with machdep platform
> overrides
> 4) use the machdep override on pseries, as the DDW code puts the TCE
> shift in a special property and there is no IOMMU table available
> 5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
> include/asm
> 6) override the generic implementation on sparc to use IOMMU_PAGE_SHIFT
> 7) leverage the new API in the NVMe driver
> 
> With these patches, an NVMe device survives our internal hardware
> exerciser; the kernel BUGs within a few seconds without the patch.
> 
>  arch/powerpc/include/asm/dma-mapping.h |  3 +++
>  arch/powerpc/include/asm/machdep.h     |  3 ++-
>  arch/powerpc/kernel/dma.c              | 11 +++++++++++
>  arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++++
>  arch/sparc/include/asm/dma-mapping.h   |  8 ++++++++
>  arch/sparc/include/asm/iommu_common.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  arch/sparc/kernel/iommu.c              |  2 +-
>  arch/sparc/kernel/iommu_common.h       | 51 ---------------------------------------------------
>  arch/sparc/kernel/pci_psycho.c         |  2 +-
>  arch/sparc/kernel/pci_sabre.c          |  2 +-
>  arch/sparc/kernel/pci_schizo.c         |  2 +-
>  arch/sparc/kernel/pci_sun4v.c          |  2 +-
>  arch/sparc/kernel/psycho_common.c      |  2 +-
>  arch/sparc/kernel/sbus.c               |  3 +--
>  drivers/block/nvme-core.c              |  3 ++-
>  include/linux/dma-mapping.h            |  7 +++++++
>  16 files changed, 127 insertions(+), 61 deletions(-)
> 
> v1 -> v2:
>   Based upon feedback from Christoph Hellwig, rather than using an
>   arch-specific hack, expose the DMA page shift via a generic DMA API and
>   override it on Power as needed.
> v2 -> v3:
>   Based upon feedback from Christoph Hellwig, put the generic
>   implementation in include/linux/dma-mapping.h, since not all archs use
>   include/asm-generic/dma-mapping-common.h.
>   Add sparc implementation, as that arch seems to have a different IOMMU
>   page size.


[PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size

2015-10-23 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA
alignment matches the kernel's page alignment. On Power, the IOMMU page
size, as mentioned above, can be 4K, while the device can have a page
size of 8K and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not of 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size.

With this patch, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

 drivers/block/nvme-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 6f04771..5a79106 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/dma-mapping.h>
 #include 
 #include 
 #include 
@@ -1711,7 +1712,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
u32 aqa;
u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   unsigned page_shift = dma_get_page_shift(dev->dev);
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 
-- 
1.9.1


[PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Look up the IOMMU's page size from struct
iommu_table, if available. Fall back to the kernel's page size,
otherwise.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/dma-mapping.h | 3 +++
 arch/powerpc/kernel/dma.c  | 9 +
 2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
 #define HAVE_ARCH_DMA_SET_MASK 1
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
 
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
#include <asm-generic/dma-mapping-common.h>
 
 extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 }
 EXPORT_SYMBOL(dma_set_mask);
 
+unsigned long dma_get_page_shift(struct device *dev)
+{
+   struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+   return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
 u64 __dma_get_required_mask(struct device *dev)
 {
struct dma_map_ops *dma_ops = get_dma_ops(dev);
-- 
1.9.1


Re: [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h

2015-10-23 Thread Nishanth Aravamudan
[Apologies for the subject line, should just have the [RFC PATCH 5/7]]

On 23.10.2015 [14:00:08 -0700], Nishanth Aravamudan wrote:
> In order to cleanly expose the desired IOMMU page shift via the new
> dma_get_page_shift API, we need to have the sparc constants available in
> a more typical location. There should be no functional impact to this
> move, but it is untested.
> 
> Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> ---
> >  arch/sparc/include/asm/iommu_common.h | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  arch/sparc/kernel/iommu.c |  2 +-
> >  arch/sparc/kernel/iommu_common.h  | 51 ---------------------------------------------------
>  arch/sparc/kernel/pci_psycho.c|  2 +-
>  arch/sparc/kernel/pci_sabre.c |  2 +-
>  arch/sparc/kernel/pci_schizo.c|  2 +-
>  arch/sparc/kernel/pci_sun4v.c |  2 +-
>  arch/sparc/kernel/psycho_common.c |  2 +-
>  arch/sparc/kernel/sbus.c  |  3 +--
>  9 files changed, 58 insertions(+), 59 deletions(-)
>  create mode 100644 arch/sparc/include/asm/iommu_common.h
>  delete mode 100644 arch/sparc/kernel/iommu_common.h
> 
> > diff --git a/arch/sparc/include/asm/iommu_common.h b/arch/sparc/include/asm/iommu_common.h
> new file mode 100644
> index 000..b40cec2
> --- /dev/null
> +++ b/arch/sparc/include/asm/iommu_common.h
> @@ -0,0 +1,51 @@
> +/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> + *
> + * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
> + */
> +
> +#ifndef _IOMMU_COMMON_H
> +#define _IOMMU_COMMON_H
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +/*
> + * These give mapping size of each iommu pte/tlb.
> + */
> > +#define IO_PAGE_SHIFT  13
> +#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
> +#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
> +#define IO_PAGE_ALIGN(addr)  ALIGN(addr, IO_PAGE_SIZE)
> +
> +#define IO_TSB_ENTRIES   (128*1024)
> +#define IO_TSB_SIZE  (IO_TSB_ENTRIES * 8)
> +
> +/*
> + * This is the hardwired shift in the iotlb tag/data parts.
> + */
> +#define IOMMU_PAGE_SHIFT 13
> +
> > +#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG))))
> +
> +static inline int is_span_boundary(unsigned long entry,
> +unsigned long shift,
> +unsigned long boundary_size,
> +struct scatterlist *outs,
> +struct scatterlist *sg)
> +{
> + unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
> + int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
> +  IO_PAGE_SIZE);
> +
> + return iommu_is_span_boundary(entry, nr, shift, boundary_size);
> +}
> +
> +#endif /* _IOMMU_COMMON_H */
> diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
> index 5320689..f9b6077 100644
> --- a/arch/sparc/kernel/iommu.c
> +++ b/arch/sparc/kernel/iommu.c
> @@ -20,8 +20,8 @@
>  #endif
>  
> >  #include <asm/iommu.h>
> > +#include <asm/iommu_common.h>
>  
> -#include "iommu_common.h"
>  #include "kernel.h"
>  
>  #define STC_CTXMATCH_ADDR(STC, CTX)  \
> > diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
> deleted file mode 100644
> index b40cec2..000
> --- a/arch/sparc/kernel/iommu_common.h
> +++ /dev/null
> @@ -1,51 +0,0 @@
> -/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> - *
> - * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
> - */
> -
> -#ifndef _IOMMU_COMMON_H
> -#define _IOMMU_COMMON_H
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -#include 
> -
> -/*
> - * These give mapping size of each iommu pte/tlb.
> - */
> > -#define IO_PAGE_SHIFT  13
> -#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
> -#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
> -#define IO_PAGE_ALIGN(addr)  ALIGN(addr, IO_PAGE_SIZE)
> -
> -#define IO_TSB_ENTRIES   (128*1024)
> -#define IO_TSB_SIZE  (IO_TSB_ENTRIES * 8)
> -
> -/*
> - * This is the hardwired shift in the iotlb tag/data parts.
> - */
> -#define IOMMU_PAGE_SHIFT 13
> -
> > -#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG))))
> -
> -static inline int is_span_boundary(unsigned long entry,
> -unsigned long sh

[PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
The IOMMU page size is not always stored in struct iommu_table on Power.
Specifically if a device is configured for DDW (Dynamic DMA Windows aka.
64-bit direct DMA), the used TCE (Translation Control Entry) size is
stored in a special device property created at run-time by the DDW
configuration code. DDW is a pseries-specific feature, so allow
platforms to override the implementation of dma_get_page_shift if
desired.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h | 3 ++-
 arch/powerpc/kernel/dma.c  | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 3f191f5..9dd03cf 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -83,9 +83,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */
 
-   /* Platform set_dma_mask and dma_get_required_mask overrides */
+   /* Platform overrides */
int (*dma_set_mask)(struct device *dev, u64 dma_mask);
u64 (*dma_get_required_mask)(struct device *dev);
+   unsigned long   (*dma_get_page_shift)(struct device *dev);
 
int (*probe)(void);
void(*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (ppc_md.dma_get_page_shift)
+   return ppc_md.dma_get_page_shift(dev);
if (tbl)
return tbl->it_page_shift;
return PAGE_SHIFT;
-- 
1.9.1


[PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API

2015-10-23 Thread Nishanth Aravamudan
Drivers like NVMe need to be able to determine the page size used for
DMA transfers. Add a new API that defaults to return PAGE_SHIFT on all
architectures.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.
v2 -> v3:
  Based upon feedback from Christoph Hellwig that not all architectures
  have moved to dma-mapping-common.h, so move the #ifdef and default to
  linux/dma-mapping.h.
---
 include/linux/dma-mapping.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ac07ff0..7eaba8d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -95,6 +95,13 @@ static inline u64 dma_get_mask(struct device *dev)
return DMA_BIT_MASK(32);
 }
 
+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return PAGE_SHIFT;
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 #else
-- 
1.9.1
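
A hypothetical caller-side sketch of the API added above (the helper
name is made up here; patch 7/7 does the equivalent inline in the NVMe
driver):

/* Size driver DMA structures by the DMA page size rather than the
 * kernel's PAGE_SIZE. */
static unsigned long driver_dma_page_size(struct device *dev)
{
	return 1UL << dma_get_page_shift(dev);
}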


[PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present,
fall back to looking the value up in struct iommu_table. If we don't find
an iommu_table, fall back to the kernel's page size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
return dma_iommu_ops.get_required_mask(dev);
 }
 
+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+   struct iommu_table *tbl;
+
+   if (!disable_ddw && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct device_node *dn;
+
+   dn = pci_device_to_OF_node(pdev);
+
+   /* search upwards for ibm,dma-window */
+   for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+   dn = dn->parent)
+   if (of_get_property(dn, "ibm,dma-window", NULL))
+   break;
+   /*
+* if there is a DDW configuration, the TCE shift is stored in
+* the property
+*/
+   if (dn && PCI_DN(dn)) {
+   const struct dynamic_dma_window_prop *direct64 =
+   of_get_property(dn, DIRECT64_PROPNAME, NULL);
+   if (direct64)
+   return be32_to_cpu(direct64->tce_shift);
+   }
+   }
+
+   tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+
+   return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries  NULL
 #define pci_dma_dev_setup_pSeries  NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 #define pci_dma_dev_setup_pSeriesLPNULL
 #define dma_set_mask_pSeriesLP NULL
 #define dma_get_required_mask_pSeriesLPNULL
+#define dma_get_page_shift_pSeriesLP   NULL
 #endif /* !CONFIG_PCI */
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
 pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+   ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
} else {
 pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
 pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;
-- 
1.9.1


[RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
On sparc, the kernel's page size differs from the IOMMU's page size, so
override the generic implementation, which always returns the kernel's
page size, and return IOMMU_PAGE_SHIFT instead.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
I know very little about sparc, so please correct me if this patch is
invalid.

 arch/sparc/include/asm/dma-mapping.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/sparc/include/asm/dma-mapping.h b/arch/sparc/include/asm/dma-mapping.h
index a21da59..76ba470 100644
--- a/arch/sparc/include/asm/dma-mapping.h
+++ b/arch/sparc/include/asm/dma-mapping.h
@@ -52,6 +52,14 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
return -EINVAL;
 }
 
+/* for IOMMU_PAGE_SHIFT */
+#include <asm/iommu_common.h>
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return IOMMU_PAGE_SHIFT;
+}
+
 #include <asm-generic/dma-mapping-common.h>
 
 #endif
-- 
1.9.1


[PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h

2015-10-23 Thread Nishanth Aravamudan
In order to cleanly expose the desired IOMMU page shift via the new
dma_get_page_shift API, we need to have the sparc constants available in
a more typical location. There should be no functional impact to this
move, but it is untested.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/sparc/include/asm/iommu_common.h | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/sparc/kernel/iommu.c |  2 +-
 arch/sparc/kernel/iommu_common.h  | 51 ---------------------------------------------------
 arch/sparc/kernel/pci_psycho.c|  2 +-
 arch/sparc/kernel/pci_sabre.c |  2 +-
 arch/sparc/kernel/pci_schizo.c|  2 +-
 arch/sparc/kernel/pci_sun4v.c |  2 +-
 arch/sparc/kernel/psycho_common.c |  2 +-
 arch/sparc/kernel/sbus.c  |  3 +--
 9 files changed, 58 insertions(+), 59 deletions(-)
 create mode 100644 arch/sparc/include/asm/iommu_common.h
 delete mode 100644 arch/sparc/kernel/iommu_common.h

diff --git a/arch/sparc/include/asm/iommu_common.h b/arch/sparc/include/asm/iommu_common.h
new file mode 100644
index 000..b40cec2
--- /dev/null
+++ b/arch/sparc/include/asm/iommu_common.h
@@ -0,0 +1,51 @@
+/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
+ *
+ * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
+ */
+
+#ifndef _IOMMU_COMMON_H
+#define _IOMMU_COMMON_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/*
+ * These give mapping size of each iommu pte/tlb.
+ */
+#define IO_PAGE_SHIFT  13
+#define IO_PAGE_SIZE   (1UL << IO_PAGE_SHIFT)
+#define IO_PAGE_MASK   (~(IO_PAGE_SIZE-1))
+#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE)
+
+#define IO_TSB_ENTRIES (128*1024)
+#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8)
+
+/*
+ * This is the hardwired shift in the iotlb tag/data parts.
+ */
+#define IOMMU_PAGE_SHIFT   13
+
+#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG))))
+
+static inline int is_span_boundary(unsigned long entry,
+  unsigned long shift,
+  unsigned long boundary_size,
+  struct scatterlist *outs,
+  struct scatterlist *sg)
+{
+   unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
+   int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
+IO_PAGE_SIZE);
+
+   return iommu_is_span_boundary(entry, nr, shift, boundary_size);
+}
+
+#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index 5320689..f9b6077 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -20,8 +20,8 @@
 #endif
 
 #include <asm/iommu.h>
+#include <asm/iommu_common.h>
 
-#include "iommu_common.h"
 #include "kernel.h"
 
 #define STC_CTXMATCH_ADDR(STC, CTX)\
diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
deleted file mode 100644
index b40cec2..000
--- a/arch/sparc/kernel/iommu_common.h
+++ /dev/null
@@ -1,51 +0,0 @@
-/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
- *
- * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
- */
-
-#ifndef _IOMMU_COMMON_H
-#define _IOMMU_COMMON_H
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-/*
- * These give mapping size of each iommu pte/tlb.
- */
-#define IO_PAGE_SHIFT  13
-#define IO_PAGE_SIZE   (1UL << IO_PAGE_SHIFT)
-#define IO_PAGE_MASK   (~(IO_PAGE_SIZE-1))
-#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE)
-
-#define IO_TSB_ENTRIES (128*1024)
-#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8)
-
-/*
- * This is the hardwired shift in the iotlb tag/data parts.
- */
-#define IOMMU_PAGE_SHIFT   13
-
-#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG))))
-
-static inline int is_span_boundary(unsigned long entry,
-  unsigned long shift,
-  unsigned long boundary_size,
-  struct scatterlist *outs,
-  struct scatterlist *sg)
-{
-   unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
-   int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
-IO_PAGE_SIZE);
-
-   return iommu_is_span_boundary(entry, nr, shift, boundary_size);
-}
-
-#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/pci_psycho.c b/arch/sparc/kernel/pci_psycho.c
index 7dce27b..332fd6f 100644
--- a/arch/sparc/kernel/pci_psycho.c
+++ b/arch/sparc/kernel/pci_psycho.c
@@ -15,13 +15,13 @@
 #include 
 
 #include 
+#include <asm/iommu_common.h>
 #include 
 #include 
 #include 
 #include 
 
 #include "pci_impl.h"
-#include "i

Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-19 Thread Nishanth Aravamudan
On 15.10.2015 [15:52:19 -0700], Nishanth Aravamudan wrote:
> On 14.10.2015 [08:42:51 -0700], Christoph Hellwig wrote:
> > Hi Nishanth,
> > 
> > sorry for the late reply.
> > 
> > > > On Power, since it's technically variable, we'd need a function. So are
> > > > you suggesting define'ing it to a function just on Power and leaving it
> > > > a constant elsewhere?
> > > > 
> > > > I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw.
> > > 
> > > Sorry, I should have been more specific -- I'm ready to spin out a v3,
> > > with a sparc-specific function.
> > > 
> > > Are you ok with leaving it a function for now (the only caller is in
> > > NVMe obviously).
> > 
> > 
> > I guess we do indeed need a function then.  I'll take a look at your
> > patch, but as long you found a way to avoid adding too much boilerplate
> > code it should be fine.
> 
> Ok, so I've got the moved function (include/linux/dma-mapping.h
> instead of dma-mapping-common.h) ready to go, which should only
> involve changing the first patch in the series. But I'm really
> mystified by what to do for sparc, which defines IOMMU_PAGE_SHIFT and
> IO_PAGE_SHIFT in arch/sparc/kernel/iommu_common.h.
> 
> 1) Which constant reflects the value we mean for this function on
> sparc?  I assume it should be IOMMU_PAGE_SHIFT, but they are the same
> value and I want to make sure I get the semantics right.
> 
> 2) Where would I put sparc's definition of dma_get_page_shift()?
> Should it be in a asm/dma-mapping.h? Should we move some of the
> constants from arch/sparc/kernel/iommu_common.h to
> arch/sparc/include/asm/iommu_common.h and then #include that in
> asm/dma-mapping.h?
> 
> Dave M., any opinions/insights? Essentially, this helper function
> assists the NVMe driver in determining what page size it should use to
> satisfy both the device and IOMMU's requirements. Maybe I
> misunderstand the constants on sparc and PAGE_SHIFT is fine there too?

Anyone with sparc knowledge have an opinion/input?

-Nish


Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-15 Thread Nishanth Aravamudan
On 14.10.2015 [08:42:51 -0700], Christoph Hellwig wrote:
> Hi Nishanth,
> 
> sorry for the late reply.
> 
> > > On Power, since it's technically variable, we'd need a function. So are
> > > you suggesting define'ing it to a function just on Power and leaving it
> > > a constant elsewhere?
> > > 
> > > I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw.
> > 
> > Sorry, I should have been more specific -- I'm ready to spin out a v3,
> > with a sparc-specific function.
> > 
> > Are you ok with leaving it a function for now (the only caller is in
> > NVMe obviously).
> 
> 
> I guess we do indeed need a function then.  I'll take a look at your
> patch, but as long you found a way to avoid adding too much boilerplate
> code it should be fine.

Ok, so I've got the moved function (include/linux/dma-mapping.h instead
of dma-mapping-common.h) ready to go, which should only involve changing
the first patch in the series. But I'm really mystified by what to do
for sparc, which defines IOMMU_PAGE_SHIFT and IO_PAGE_SHIFT in
arch/sparc/kernel/iommu_common.h.

1) Which constant reflects the value we mean for this function on sparc?
I assume it should be IOMMU_PAGE_SHIFT, but they are the same value and
I want to make sure I get the semantics right.

2) Where would I put sparc's definition of dma_get_page_shift()? Should
it go in asm/dma-mapping.h? Should we move some of the constants from
arch/sparc/kernel/iommu_common.h to
arch/sparc/include/asm/iommu_common.h and then #include that in
asm/dma-mapping.h?

Dave M., any opinions/insights? Essentially, this helper function
assists the NVMe driver in determining what page size it should use to
satisfy both the device and IOMMU's requirements. Maybe I misunderstand
the constants on sparc and PAGE_SHIFT is fine there too?
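
For concreteness, the sort of thing I have in mind for sparc -- purely a
sketch, assuming IOMMU_PAGE_SHIFT (and not IO_PAGE_SHIFT) is the
semantically right constant and can be made visible to asm/dma-mapping.h:

  /* hypothetical: arch/sparc/include/asm/dma-mapping.h */
  #define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
  static inline unsigned long dma_get_page_shift(struct device *dev)
  {
  	/* sparc's IOMMU page size is a build-time constant */
  	return IOMMU_PAGE_SHIFT;
  }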

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-14 Thread Nishanth Aravamudan
Hi Christoph,

On 12.10.2015 [14:06:51 -0700], Nishanth Aravamudan wrote:
> On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote:
> > Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define
> > with an #ifndef in common code?
> 
> On Power, since it's technically variable, we'd need a function. So are
> you suggesting define'ing it to a function just on Power and leaving it
> a constant elsewhere?
> 
> > I noticed that sparc has an IOMMU_PAGE_SHIFT already, fwiw.

Sorry, I should have been more specific -- I'm ready to spin out a v3,
with a sparc-specific function.

Are you ok with leaving it a function for now (the only caller is in
NVMe obviously)?

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-12 Thread Nishanth Aravamudan
On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote:
> Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define
> with an #ifndef in common code?

I suppose we could do that -- I wasn't sure if the macro would be
palatable.
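
Something along these lines is what I understood you to mean -- just a
sketch, ignoring for now that sparc already has a private
IOMMU_PAGE_SHIFT definition that would clash:

  /* hypothetical: in common code */
  #ifndef IOMMU_PAGE_SHIFT
  /* architectures with a differing IOMMU page size override this */
  #define IOMMU_PAGE_SHIFT PAGE_SHIFT
  #endif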

> Also not all architectures use dma-mapping-common.h yet, so you either
> need to update all of those as well, or just add the #ifndef directly
> to linux/dma-mapping.h.

Ok, I will take a look at that and spin up a v3.

Thanks for the feedback!

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/2] powerpc/iommu: expose IOMMU page shift

2015-10-12 Thread Nishanth Aravamudan
On 06.10.2015 [14:19:43 +1100], David Gibson wrote:
> On Fri, Oct 02, 2015 at 10:18:00AM -0700, Nishanth Aravamudan wrote:
> > We will leverage this macro in the NVMe driver, which needs to know the
> > configured IOMMU page shift to properly configure its device's page
> > size.
> > 
> > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> > 
> > ---
> > Given this is available, it seems reasonable to expose -- and it doesn't
> > really make sense to make the driver do a log2 call on the existing
> > IOMMU_PAGE_SIZE() value.
> > 
> > diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> > index ca18cff..6fdf857 100644
> > --- a/arch/powerpc/include/asm/iommu.h
> > +++ b/arch/powerpc/include/asm/iommu.h
> > @@ -36,6 +36,7 @@
> >  #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
> >  #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
> >  
> > +#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
> >  #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
> 
> Seems like it would be a touch safer to alter IOMMU_PAGE_SIZE so it
> uses the new IOMMU_PAGE_SHIFT macro.

Yes absolutely! Sorry, I initially didn't add the first macro, so didn't
think that through. Will update.
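
i.e., something like (untested):

  #define IOMMU_PAGE_SHIFT(tblptr) ((tblptr)->it_page_shift)
  #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << IOMMU_PAGE_SHIFT(tblptr))
  #define IOMMU_PAGE_MASK(tblptr) (~((1 << IOMMU_PAGE_SHIFT(tblptr)) - 1))
  #define IOMMU_PAGE_ALIGN(addr, tblptr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE(tblptr))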

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-12 Thread Nishanth Aravamudan
On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote:
> Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define
> with an #ifndef in common code?

On Power, since it's technically variable, we'd need a function. So are
you suggesting define'ing it to a function just on Power and leaving it
a constant elsewhere?

I noticed that sparc has an IOMMU_PAGE_SHIFT already, fwiw.

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/2] powerpc/iommu: expose IOMMU page shift

2015-10-12 Thread Nishanth Aravamudan
On 12.10.2015 [09:03:52 -0700], Nishanth Aravamudan wrote:
> On 06.10.2015 [14:19:43 +1100], David Gibson wrote:
> > On Fri, Oct 02, 2015 at 10:18:00AM -0700, Nishanth Aravamudan wrote:
> > > We will leverage this macro in the NVMe driver, which needs to know the
> > > configured IOMMU page shift to properly configure its device's page
> > > size.
> > > 
> > > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> > > 
> > > ---
> > > Given this is available, it seems reasonable to expose -- and it doesn't
> > > really make sense to make the driver do a log2 call on the existing
> > > IOMMU_PAGE_SIZE() value.
> > > 
> > > diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> > > index ca18cff..6fdf857 100644
> > > --- a/arch/powerpc/include/asm/iommu.h
> > > +++ b/arch/powerpc/include/asm/iommu.h
> > > @@ -36,6 +36,7 @@
> > >  #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
> > >  #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
> > >  
> > > +#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
> > >  #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
> > 
> > Seems like it would be a touch safer to alter IOMMU_PAGE_SIZE so it
> > uses the new IOMMU_PAGE_SHIFT macro.
> 
> Yes absolutely! Sorry, I initially didn't add the first macro, so didn't
> think that through. Will update.

Err, replied too quickly -- just got back from vacation -- this is
from an old version of the patchset; we no longer introduce this macro.

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power

2015-10-02 Thread Nishanth Aravamudan
On 02.10.2015 [10:25:44 -0700], Christoph Hellwig wrote:
> Hi Nishanth,
> 
> please expose this value through the generic DMA API instead of adding
> architecture specific hacks to drivers.

Ok, I'm happy to do that instead -- what I struggled with is that I
don't have enough knowledge of the various architectures to provide the
right default implementation. It should be sufficient for the default to
return PAGE_SHIFT, and on Power just override that to return the IOMMU
table's page size? Since the only user will be the NVMe driver
currently, that should be fine?

Sorry for the less-than-ideal patch!

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device can have a page size of 8K
and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

This series consists of two patches, one of which exposes the IOMMU's
page shift on Power (currently only the page size is exposed, and it
seems unnecessary to ilog2 that value in the driver). The second patch
leverages this value on Power in the NVMe driver.

With these patches, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device can have a page size of 8K
and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

With this patch, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 7920c27..969a95e 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NVME_MINORS	(1U << MINORBITS)
 #define NVME_Q_DEPTH   1024
@@ -1680,6 +1681,11 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
unsigned page_shift = PAGE_SHIFT;
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
+#ifdef CONFIG_PPC64
+   struct iommu_table *tbl = get_iommu_table_base(dev->dev);
+   if (tbl)
+   page_shift = IOMMU_PAGE_SHIFT(tbl);
+#endif
 
if (page_shift < dev_page_min) {
dev_err(dev->dev,

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/2] powerpc/iommu: expose IOMMU page shift

2015-10-02 Thread Nishanth Aravamudan
We will leverage this macro in the NVMe driver, which needs to know the
configured IOMMU page shift to properly configure its device's page
size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
Given this is available, it seems reasonable to expose -- and it doesn't
really make sense to make the driver do a log2 call on the existing
IOMMU_PAGE_SIZE() value.

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index ca18cff..6fdf857 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -36,6 +36,7 @@
 #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
 #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
 
+#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
 #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
 #define IOMMU_PAGE_MASK(tblptr) (~((1 << (tblptr)->it_page_shift) - 1))
 #define IOMMU_PAGE_ALIGN(addr, tblptr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE(tblptr))

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 3/5 v2] powerpc/dma: implement per-platform dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
The IOMMU page size is not always stored in struct iommu on Power.
Specifically if a device is configured for DDW (Dynamic DMA Windows aka.
64-bit direct DMA), the used TCE (Translation Control Entry) size is
stored in a special device property created at run-time by the DDW
configuration code. DDW is a pseries-specific feature, so allow
platforms to override the implementation of dma_get_page_shift if
desired.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index cab6753..5c372e3 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,9 +78,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */
 
-   /* Platform set_dma_mask and dma_get_required_mask overrides */
+   /* Platform overrides */
int (*dma_set_mask)(struct device *dev, u64 dma_mask);
u64 (*dma_get_required_mask)(struct device *dev);
+   unsigned long   (*dma_get_page_shift)(struct device *dev);
 
int (*probe)(void);
void(*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (ppc_md.dma_get_page_shift)
+   return ppc_md.dma_get_page_shift(dev);
if (tbl)
return tbl->it_page_shift;
return PAGE_SHIFT;

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).
 
The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device can have a page size of 8K
and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).
 
In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.
 
This series consists of five patches:

1) add a generic dma_get_page_shift implementation that just returns
PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's
page shift if available
3) allow further specific overriding on Power with machdep platform
overrides
4) use the machdep override on pseries, as the DDW code puts the TCE
shift in a special property and there is no IOMMU table available
5) leverage the new API in the NVMe driver
 
With these patches, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

 arch/powerpc/include/asm/dma-mapping.h   |  3 +++
 arch/powerpc/include/asm/machdep.h   |  3 ++-
 arch/powerpc/kernel/dma.c| 11 +++
 arch/powerpc/platforms/pseries/iommu.c   | 36 

 drivers/block/nvme-core.c|  3 ++-
 include/asm-generic/dma-mapping-common.h |  7 +++
 6 files changed, 61 insertions(+), 2 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an
  arch-specific hack, expose the DMA page shift via a generic DMA API and
  override it on Power as needed.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 4/5 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present,
fall back to looking up the value in struct iommu_table. If we don't find
an iommu_table, fall back to the kernel's page size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
return dma_iommu_ops.get_required_mask(dev);
 }
 
+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+   struct iommu_table *tbl;
+
+   if (!disable_ddw && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct device_node *dn;
+
+   dn = pci_device_to_OF_node(pdev);
+
+   /* search upwards for ibm,dma-window */
+   for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+   dn = dn->parent)
+   if (of_get_property(dn, "ibm,dma-window", NULL))
+   break;
+   /*
+* if there is a DDW configuration, the TCE shift is stored in
+* the property
+*/
+   if (dn && PCI_DN(dn)) {
+   const struct dynamic_dma_window_prop *direct64 =
+   of_get_property(dn, DIRECT64_PROPNAME, NULL);
+   if (direct64)
+   return be32_to_cpu(direct64->tce_shift);
+   }
+   }
+
+   tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+
+   return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries  NULL
 #define pci_dma_dev_setup_pSeries  NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 #define pci_dma_dev_setup_pSeriesLPNULL
 #define dma_set_mask_pSeriesLP NULL
 #define dma_get_required_mask_pSeriesLPNULL
+#define dma_get_page_shift_pSeriesLP   NULL
 #endif /* !CONFIG_PCI */
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
	pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+   ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
} else {
	pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
	pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/5 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Lookup the IOMMU's page size from struct
iommu_table, if available. Fallback to the kernel's page size,
otherwise.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
 #define HAVE_ARCH_DMA_SET_MASK 1
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
 
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
 #include 
 
 extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 }
 EXPORT_SYMBOL(dma_set_mask);
 
+unsigned long dma_get_page_shift(struct device *dev)
+{
+   struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+   return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
 u64 __dma_get_required_mask(struct device *dev)
 {
struct dma_map_ops *dma_ops = get_dma_ops(dev);

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-02 Thread Nishanth Aravamudan
Drivers like NVMe need to be able to determine the page size used for
DMA transfers. Add a new API that defaults to return PAGE_SHIFT on all
architectures.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index b1bc954..86e4e97 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -355,4 +355,11 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
 }
 #endif
 
+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return PAGE_SHIFT;
+}
+#endif
+
 #endif

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 5/5 v2] drivers/nvme: default to the IOMMU page size

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device can have a page size of 8K
and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size.

With this patch, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b97fc3f..c561137 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1713,7 +1714,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
u32 aqa;
	u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   unsigned page_shift = dma_get_page_shift(dev->dev);
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
On 03.10.2015 [07:35:09 +1000], Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-02 at 14:04 -0700, Nishanth Aravamudan wrote:
> > Right, I did start with your advice and tried that approach, but it
> > turned out I was wrong about the actual issue at the time. The problem
> > for NVMe isn't actually the starting address alignment (which it can
> > handle not being aligned to the device's page size). It doesn't handle
> > ((addr + len) % dev_page_size != 0). That is, it's really a length
> > alignment issue.
> > 
> > It seems incredibly device-specific to have an API into the DMA code
> > to request an end alignment -- no other device seems to have this
> > issue/design. If you think that's better, I can fiddle with that
> > instead.
> > 
> > Sorry, I should have called this out better as an alternative
> > consideration.
> 
> Nah it's fine. Ok. Also adding the alignment requirement to the API
> would have been a much more complex patch since it would have had to
> be implemented for all archs.
> 
> I think your current solution is fine.

Great, thanks. Also, while it's possible an alignment API would be more
performant, we're already not using DDW on Power in this case, so
performance is not a primary concern. We simply want to be
functional/correct in this configuration.

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
On 03.10.2015 [06:51:06 +1000], Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-02 at 13:09 -0700, Nishanth Aravamudan wrote:
> 
> > 1) add a generic dma_get_page_shift implementation that just returns
> > PAGE_SHIFT
> 
> So you chose to return the granularity of the iommu to the driver
> rather than providing a way for the driver to request a specific
> alignment for DMA mappings. Any specific reason ?

Right, I did start with your advice and tried that approach, but it
turned out I was wrong about the actual issue at the time. The problem
for NVMe isn't actually the starting address alignment (which it can
handle not being aligned to the device's page size). It doesn't handle
((addr + len) % dev_page_size != 0). That is, it's really a length
alignment issue.
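
To make that concrete with the page sizes from our bug report -- a sketch
of the failing condition, not the literal check in nvme_setup_prps():

  /* kernel: 64K pages; IOMMU: 4K TCEs; device: 8K pages */
  unsigned dev_page_size = 8192;	/* 1 << 13 */
  unsigned dma_len = 0xF000;		/* 60K: a multiple of 4K, not of 8K */

  /* each PRP chunk must be an exact multiple of the device page size */
  BUG_ON(dma_len % dev_page_size);	/* 0xF000 % 0x2000 == 0x1000 -> BUG */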

It seems incredibly device-specific to have an API into the DMA code
to request an end alignment -- no other device seems to have this
issue/design. If you think that's better, I can fiddle with that
instead.

Sorry, I should have called this out better as an alternative
consideration.

-Nish


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support

2015-09-28 Thread Nishanth Aravamudan
On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:
> On 9/27/15, Raghavendra K T  wrote:
> > Problem description:
> > Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> > numbered (possibly) as 0,1,16,17. At a lower level, the chipid
> > obtained from the device tree is mapped naturally (directly) to the nid.
> 
> Interesting thing to play with, I'll try to test it on my POWER7 box,
> but it doesn't have the OPAL layer :(

Note that it's also interesting to try it under PowerVM, with odd NUMA
topologies and report any issues found :)


-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid

2015-09-28 Thread Nishanth Aravamudan
On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote:
> There is no change in the functionality
> 
> Signed-off-by: Raghavendra K T 
> ---
>  arch/powerpc/mm/numa.c | 42 +-
>  1 file changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d5e6eee..f84ed2f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
>   }
>  }
> 
> -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
> +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>   * info is found.
>   */
> -static int associativity_to_nid(const __be32 *associativity)
> +static int associativity_to_chipid(const __be32 *associativity)

This is confusing to me. This function is also used by the DLPAR code
under PowerVM to indicate what node the CPU is on -- not a chip (which I
don't believe is exposed at all under PowerVM).

>  {
> - int nid = -1;
> + int chipid = -1;
> 
>   if (min_common_depth == -1)
>   goto out;
> 
>   if (of_read_number(associativity, 1) >= min_common_depth)
> - nid = of_read_number(&associativity[min_common_depth], 1);
> + chipid = of_read_number(&associativity[min_common_depth], 1);

Doesn't this, in the PAPR documentation, specifically refer to the NODE
level domain?

>   /* POWER4 LPAR uses 0xffff as invalid node */
> - if (nid == 0xffff || nid >= MAX_NUMNODES)
> - nid = -1;
> + if (chipid == 0xffff || chipid >= MAX_NUMNODES)
> + chipid = -1;
> 
> - if (nid > 0 &&
> + if (chipid > 0 &&
>   of_read_number(associativity, 1) >= distance_ref_points_depth) {
>   /*
>* Skip the length field and send start of associativity array
>*/
> - initialize_distance_lookup_table(nid, associativity + 1);
> + initialize_distance_lookup_table(chipid, associativity + 1);
>   }
> 
>  out:
> - return nid;
> + return chipid;
>  }
> 
> -/* Returns the nid associated with the given device tree node,
> +/* Returns the chipid associated with the given device tree node,
>   * or -1 if not found.
>   */
> -static int of_node_to_nid_single(struct device_node *device)
> +static int of_node_to_chipid_single(struct device_node *device)
>  {
> - int nid = -1;
> + int chipid = -1;
>   const __be32 *tmp;
> 
>   tmp = of_get_associativity(device);
>   if (tmp)
> - nid = associativity_to_nid(tmp);
> - return nid;
> + chipid = associativity_to_chipid(tmp);
> + return chipid;
>  }
> 
>  /* Walk the device tree upwards, looking for an associativity id */
> @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device)
> 
>   of_node_get(device);
>   while (device) {
> - nid = of_node_to_nid_single(device);
> + nid = of_node_to_chipid_single(device);
>   if (nid != -1)
>   break;
> 
> @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
>  }
> 
>  /*
> - * This is like of_node_to_nid_single() for memory represented in the
> + * This is like of_node_to_chipid_single() for memory represented in the
>   * ibm,dynamic-reconfiguration-memory node.
>   */
>  static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
> @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu)
>   goto out;
>   }
> 
> - nid = of_node_to_nid_single(cpu);
> + nid = of_node_to_chipid_single(cpu);
> 
>  out_present:
>   if (nid < 0 || !node_online(nid))
> @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void)
> 
>   cpu = of_get_cpu_node(i, NULL);
>   BUG_ON(!cpu);
> - nid = of_node_to_nid_single(cpu);
> + nid = of_node_to_chipid_single(cpu);
>   of_node_put(cpu);
> 
>   /*
> @@ -796,7 +796,7 @@ new_range:
>* have associativity properties.  If none, then
>* everything goes to default_nid.
>*/
> - nid = of_node_to_nid_single(memory);
> + nid = of_node_to_chipid_single(memory);
>   if (nid < 0)
>   nid = default_nid;
> 
> @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
>   if ((scn_addr < start) || (scn_addr >= (start + size)))
>   continue;
> 
> - nid = of_node_to_nid_single(memory);
> + nid = of_node_to_chipid_single(memory);
>   break;
>   }
> 
> @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)
> 
>   /* Use associativity from first thread for all siblings */
>   vphn_get_associativity(cpu, associativity);
> - 

Re: [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping

2015-09-28 Thread Nishanth Aravamudan
On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote:
> Create arrays that map serial nids and sparse chipids.
> 
> Note: My original idea had only two arrays of chipid to nid map. Final
> code is inspired by driver/acpi/numa.c that maps a proximity node with
> a logical node by Takayoshi Kochi , and thus
> uses an additional chipid_map nodemask. The mask helps find the first
> unused nid easily by locating the first unset bit in the mask.
> 
> No change in functionality.
> 
> Signed-off-by: Raghavendra K T 
> ---
>  arch/powerpc/mm/numa.c | 48 +++-
>  1 file changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index dd2073b..f015cad 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -63,6 +63,11 @@ static int form1_affinity;
>  static int distance_ref_points_depth;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static nodemask_t chipid_map = NODE_MASK_NONE;
> +static int chipid_to_nid_map[MAX_NUMNODES]
> + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };

Hrm, conceptually there are *more* chips than nodes, right? So what
guarantees we won't see > MAX_NUMNODES chips?

> +static int nid_to_chipid_map[MAX_NUMNODES]
> + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
> 
>  /*
>   * Allocate node_to_cpumask_map based on number of available nodes
> @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>   return 0;
>  }
> 
> +int chipid_to_nid(int chipid)
> +{
> + if (chipid < 0)
> + return NUMA_NO_NODE;

Do you really want to support these cases? Or should they be
bugs/warnings indicating that you got an unexpected input? Or at least
WARN_ON_ONCE?

> + return chipid_to_nid_map[chipid];
> +}
> +
> +int nid_to_chipid(int nid)
> +{
> + if (nid < 0)
> + return NUMA_NO_NODE;
> + return nid_to_chipid_map[nid];
> +}
> +
> +static void __map_chipid_to_nid(int chipid, int nid)
> +{
> + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
> +  || nid < chipid_to_nid_map[chipid])
> + chipid_to_nid_map[chipid] = nid;
> + if (nid_to_chipid_map[nid] == NUMA_NO_NODE
> + || chipid < nid_to_chipid_map[nid])
> + nid_to_chipid_map[nid] = chipid;
> +}

chip <-> node mapping is a static (physical) concept, right? Should we
emit some debugging if for some reason we get a runtime call to remap
an already mapped chip to a new node?
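
e.g., something like this on top of your patch (sketch only):

  static void __map_chipid_to_nid(int chipid, int nid)
  {
  	/* chip <-> node is a static, physical property; flag any remap */
  	WARN_ONCE(chipid_to_nid_map[chipid] != NUMA_NO_NODE &&
  		  chipid_to_nid_map[chipid] != nid,
  		  "chipid %d already mapped to nid %d, now asked for nid %d\n",
  		  chipid, chipid_to_nid_map[chipid], nid);
  	if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
  	    || nid < chipid_to_nid_map[chipid])
  		chipid_to_nid_map[chipid] = nid;
  	if (nid_to_chipid_map[nid] == NUMA_NO_NODE
  	    || chipid < nid_to_chipid_map[nid])
  		nid_to_chipid_map[nid] = chipid;
  }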

> +
> +int map_chipid_to_nid(int chipid)
> +{
> + int nid;
> +
> + if (chipid < 0 || chipid >= MAX_NUMNODES)
> + return NUMA_NO_NODE;
> +
> + nid = chipid_to_nid_map[chipid];
> + if (nid == NUMA_NO_NODE) {
> + if (nodes_weight(chipid_map) >= MAX_NUMNODES)
> + return NUMA_NO_NODE;

If you create a KVM guest with a bogus topology, doesn't this just start
losing NUMA information for very high-noded guests?

> + nid = first_unset_node(chipid_map);
> + __map_chipid_to_nid(chipid, nid);
> + node_set(nid, chipid_map);
> + }
> + return nid;
> +}
> +
>  int numa_cpu_lookup(int cpu)
>  {
>   return numa_cpu_lookup_table[cpu];
> @@ -264,7 +311,6 @@ out:
>   return chipid;
>  }
> 
> -

stray change?

>   /* Return the nid from associativity */
>  static int associativity_to_nid(const __be32 *associativity)
>  {
> -- 
> 1.7.11.7
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support

2015-09-28 Thread Nishanth Aravamudan
On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:
> Problem description:
> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> numbered (possibly) as 0,1,16,17. At a lower level, the chipid
> obtained from the device tree is mapped naturally (directly) to the nid.

chipid is an OPAL concept, I believe, and not documented in PAPR... How
does this work under PowerVM?

> Potential side effect of that is:
> 
> 1) There are several places in the kernel that assume serial node numbering,
> and memory allocations assume that all the nodes from 0-(highest nid)
> exist, in turn ending up allocating memory for nodes that do not exist.
> 
> 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
> sparse nid of the host system to contiguous nids of guest (numa affinity,
> placement) could be a challenge.
> 
> Possible Solutions:
> 1) Handle the memory allocations in the kernel case by case: though in some
> cases this is easy to achieve, other cases may be intrusive/non-trivial.
> In the end it does not handle side effect (2) above.
> 
> 2) Map the sparse chipid got from device tree to a serial nid at kernel
> level (The idea proposed in this series).
> Pro: It is more natural to handle this at the kernel level than at a lower (OPAL) layer.
> Con: The chipid in the device tree is no longer the same as the nid in the kernel.

Is there any debugging/logging? Looks like not -- so how does a sysadmin
map from firmware-provided values to the Linux values? That's going to
make debugging of large systems (PowerVM or otherwise) less than
pleasant, it seems? Possibly you could put something in sysfs?
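
Even a read-only dump of the mapping would help; a sketch, using
nid_to_chipid() from your patch 4 and the usual single_open() wiring:

  static int chipid_map_show(struct seq_file *m, void *v)
  {
  	int nid;

  	/* one line per possible node: the Linux nid and its firmware chipid */
  	for_each_node(nid)
  		seq_printf(m, "nid %d <- chipid %d\n", nid, nid_to_chipid(nid));
  	return 0;
  }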

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid

2015-09-28 Thread Nishanth Aravamudan
On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
> Once we have made the distinction between nid and chipid,
> create a 1:1 mapping between them. This makes compacting the
> nids easy later.
> 
> No functionality change.
> 
> Signed-off-by: Raghavendra K T 
> ---
>  arch/powerpc/mm/numa.c | 36 +---
>  1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f84ed2f..dd2073b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -264,6 +264,17 @@ out:
>   return chipid;
>  }
> 
> +
> + /* Return the nid from associativity */
> +static int associativity_to_nid(const __be32 *associativity)
> +{
> + int nid;
> +
> + nid = associativity_to_chipid(associativity);
> + return nid;
> +}

This is ultimately confusing. You are assigning the semantic return
value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the
variable naming be consistent?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid

2015-09-28 Thread Nishanth Aravamudan
On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
> Once we have made the distinction between nid and chipid,
> create a 1:1 mapping between them. This makes compacting the
> nids easy later.


Didn't the previous patch just do the opposite of...

> @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device)
> 
>   of_node_get(device);
>   while (device) {
> - nid = of_node_to_chipid_single(device);
> + nid = of_node_to_nid_single(device);
>   if (nid != -1)
>   break;
> 
> @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
>  }
> 
>  /*
> - * This is like of_node_to_chipid_single() for memory represented in the
> + * This is like of_node_to_nid_single() for memory represented in the
>   * ibm,dynamic-reconfiguration-memory node.
>   */
>  static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
> @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu)
>   goto out;
>   }
> 
> - nid = of_node_to_chipid_single(cpu);
> + nid = of_node_to_nid_single(cpu);
> 
>  out_present:
>   if (nid < 0 || !node_online(nid))
> @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void)
> 
>   cpu = of_get_cpu_node(i, NULL);
>   BUG_ON(!cpu);
> - nid = of_node_to_chipid_single(cpu);
> + nid = of_node_to_nid_single(cpu);
>   of_node_put(cpu);
> 
>   /*
> @@ -796,7 +818,7 @@ new_range:
>* have associativity properties.  If none, then
>* everything goes to default_nid.
>*/
> - nid = of_node_to_chipid_single(memory);
> + nid = of_node_to_nid_single(memory);
>   if (nid < 0)
>   nid = default_nid;
> 
> @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
>   if ((scn_addr < start) || (scn_addr >= (start + size)))
>   continue;
> 
> - nid = of_node_to_chipid_single(memory);
> + nid = of_node_to_nid_single(memory);
>   break;
>   }
> 
> @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void)
> 
>   /* Use associativity from first thread for all siblings */
>   vphn_get_associativity(cpu, associativity);
> - new_nid = associativity_to_chipid(associativity);
> + new_nid = associativity_to_nid(associativity);
>   if (new_nid < 0 || !node_online(new_nid))
>   new_nid = first_online_node;
> 

Why is this churn useful?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC] powerpc/hugetlb: Add warning message when gpage allocation request fails

2015-09-14 Thread Nishanth Aravamudan
On 14.09.2015 [18:59:25 +0530], Aneesh Kumar K.V wrote:
> Anshuman Khandual  writes:
> 
> > When a 16GB huge page is requested on POWER platform through kernel command
> > line interface, it silently fails because of the lack of any gigantic pages
> > on the system which the platform should have communicated through 16GB 
> > memory
> > blocks in the device tree during boot time. For example
> >
> > [0.480940] HugeTLB registered 16 GB page size, pre-allocated 0 pages
> > [0.480945] HugeTLB registered 16 MB page size, pre-allocated 16 pages
> >
> > This adds a warning message during alloc_bootmem_huge_page request both on
> > book3e and book3s powerpc platforms. After this change
> >
> > [0.00] Gigantic HugeTLB page not available
> > [0.473417] HugeTLB registered 16 GB page size, pre-allocated 0 pages
> > [0.473423] HugeTLB registered 16 MB page size, pre-allocated 16 pages
> 
> 
> That info is already part of the second line, isn't it? i.e.,
> 
> [0.473417] HugeTLB registered 16 GB page size, pre-allocated 0 pages
> 
> pre-allocated 0 pages indicate we didn't allocate anything. So why do we
> need to add more details fo kernel output ?

Agreed, the '0 pages' message indicates we failed to pre-allocate any
pages. The 'pre-allocate' messages are specifically about the kernel
command-line requests for hugepages.

Not sure I understand the motivation for this?

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=

2015-09-07 Thread Nishanth Aravamudan
On 07.09.2015 [19:19:09 +1000], Michael Ellerman wrote:
> On Fri, 2015-09-04 at 11:22 -0700, Nishanth Aravamudan wrote:
> > The 32-bit TCE table initialization relies on the DMA window having a
> > size equal to a power of 2 (and checks for it explicitly). But
> > crashkernel= has no constraint that requires a power-of-2 be specified.
> > This causes the kdump kernel to fail to boot as none of the PCI devices
> > (including the disk controller) are successfully initialized.
> > 
> > After this change, the PCI devices successfully set up the 32-bit TCE
> > table and kdump succeeds.
> > 
> > Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
> > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> > Cc: sta...@vger.kernel.org # 4.2
> > 
> > Michael, I kept this as a follow-on patch to my previous one. If you'd
> > rather I made a v3 of that patch with the two fixes combined, I can
> > resend.
> 
> No that's fine. I guess they could have been a single fix, but it's
> not a big deal.

Ok, thanks for understanding.

> > Also, I fixed up the context on my end to be u64, but not sure
> > if that will match your tree (next doesn't have my prior patch applied
> > yet, that I can see).
> 
> next isn't open yet, because we're still in the merge window, ie. rc1
> hasn't come out yet.

Ah got it. I did mean your next branch, not linux-next itself, fwiw.

> This is a fix so it'll go to my fixes branch. Whether I send that to
> Linus before or after rc1 depends on how urgent the fixes people send
> me are. Sounds like you'd like these two to go in asap?

If at all possible, yes please.

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=

2015-09-04 Thread Nishanth Aravamudan
The 32-bit TCE table initialization relies on the DMA window having a
size equal to a power of 2 (and checks for it explicitly). But
crashkernel= has no constraint that requires a power-of-2 be specified.
This causes the kdump kernel to fail to boot as none of the PCI devices
(including the disk controller) are successfully initialized.

After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.

Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
Cc: sta...@vger.kernel.org # 4.2

---

Michael, I did this as a follow-on patch to my previous one. If you'd
rather I made a v3 of that patch with the two fixes combined, I can
resend.

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f1c74c28e564..73914f4bd1ab 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2084,6 +2084,12 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 */
const u64 window_size =
min((u64)pe->table_group.tce32_size, memory_hotplug_max());
+   /*
+* crashkernel= specifies the kdump kernel's maximum memory at
+* some offset and there is no guarantee the result is a power
+* of 2, which will cause errors later.
+*/
+   window_size = __rounddown_pow_of_two(window_size);
 
	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
IOMMU_PAGE_SHIFT_4K,

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2] powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=

2015-09-04 Thread Nishanth Aravamudan
The 32-bit TCE table initialization relies on the DMA window having a
size equal to a power of 2 (and checks for it explicitly). But
crashkernel= has no constraint that requires a power-of-2 be specified.
This causes the kdump kernel to fail to boot as none of the PCI devices
(including the disk controller) are successfully initialized.

After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.

Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
Cc: sta...@vger.kernel.org # 4.2

---

Michael, I kept this as a follow-on patch to my previous one. If you'd
rather I made a v3 of that patch with the two fixes combined, I can
resend. Also, I fixed up the context on my end to be u64, but not sure
if that will match your tree (next doesn't have my prior patch applied
yet, that I can see).

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f1c74c28e564..d5e635f2c3aa 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2078,12 +2078,18 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
struct iommu_table *tbl = NULL;
long rc;
/*
+* crashkernel= specifies the kdump kernel's maximum memory at
+* some offset and there is no guarantee the result is a power
+* of 2, which will cause errors later.
+*/
+   const u64 max_memory = __rounddown_pow_of_two(memory_hotplug_max());
+   /*
 * In memory constrained environments, e.g. kdump kernel, the
 * DMA window can be larger than available memory, which will
 * cause errors later.
 */
const u64 window_size =
-   min((u64)pe->table_group.tce32_size, memory_hotplug_max());
+   min((u64)pe->table_group.tce32_size, max_memory);
 
	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
					IOMMU_PAGE_SHIFT_4K,

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=

2015-09-04 Thread Nishanth Aravamudan
On 04.09.2015 [20:01:22 +0200], Jan Stancek wrote:
> On Fri, Sep 04, 2015 at 09:59:38AM -0700, Nishanth Aravamudan wrote:
> > The 32-bit TCE table initialization relies on the DMA window having a
> > size equal to a power of 2 (and checks for it explicitly). But
> > crashkernel= has no constraint that requires a power-of-2 be specified.
> > This causes the kdump kernel to fail to boot as none of the PCI devices
> > (including the disk controller) are successfully initialized.
> > 
> > After this change, the PCI devices successfully set up the 32-bit TCE
> > table and kdump succeeds.
> > 
> > Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
> > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> > Cc: sta...@vger.kernel.org # 4.2
> > ---
> > 
> > Michael, I did this as a follow-on patch to my previous one. If you'd
> > rather I made a v3 of that patch with the two fixes combined, I can
> > resend.
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index f1c74c28e564..73914f4bd1ab 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -2084,6 +2084,12 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
> >  */
> > const u64 window_size =
> > min((u64)pe->table_group.tce32_size, memory_hotplug_max());
> > +   /*
> > +* crashkernel= specifies the kdump kernel's maximum memory at
> > +* some offset and there is no guaranteed the result is a power
> > +* of 2, which will cause errors later.
> > +*/
> > +   window_size = __rounddown_pow_of_two(window_size);
> 
> Hi,
> 
> Just wondering if this won't hide potential alignment issues of
> "table_group.tce32_size", that now trigger EINVAL down the road
> and if it wouldn't be safer to round only "memory_hotplug_max()"?
> 
>   const __u64 hotplug_max_p2 = __rounddown_pow_of_two(memory_hotplug_max());
>   const __u64 window_size =
> min((u64)pe->table_group.tce32_size, hotplug_max_p2);

Fair point, v2 on its way.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel

2015-09-03 Thread Nishanth Aravamudan
On 03.09.2015 [19:58:53 +1000], Michael Ellerman wrote:
> On Wed, 2015-09-02 at 08:39 -0700, Nishanth Aravamudan wrote:
> > On 02.09.2015 [19:00:31 +1000], Alexey Kardashevskiy wrote:
> > > On 09/02/2015 11:11 AM, Nishanth Aravamudan wrote:
> > > >diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > > >index 85cbc96eff6c..0d7967e31169 100644
> > > >--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > > >+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > > >@@ -2077,10 +2077,17 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
> > > >  {
> > > > struct iommu_table *tbl = NULL;
> > > > long rc;
> > > >+/*
> > > >+ * In memory constrained environments, e.g. kdump kernel, the
> > > >+ * DMA window can be larger than available memory, which will
> > > >+ * cause errors later.
> > > >+ */
> > > >+__u64 window_size =
> > > 
> > > I asked for "const __u64" ;)
> > 
> > I knew I'd forget something!
> 
> Nish! In future please send a reply with the above comment, and then
> the v2 as a separate mail, otherwise I have to manually edit out your
> comment when applying.

Sorry! Will amend my practices in the future.



> > Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
> 
> That went into 4.2, so should this have a:
> 
> Cc: sta...@vger.kernel.org # 4.2

Grr, another mess-up, you're right, it should. Can you amend on your
end? Sorry, I was in too much of a hurry to get the fixlet out.

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2] powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel

2015-09-02 Thread Nishanth Aravamudan
On 02.09.2015 [19:00:31 +1000], Alexey Kardashevskiy wrote:
> On 09/02/2015 11:11 AM, Nishanth Aravamudan wrote:
> >When attempting to kdump with the 4.2 kernel, we see for each PCI
> >device:
> >
> >  pci 0003:01 : [PE# 000] Assign DMA32 space
> >  pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..8000
> >  pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22
> >  PCI: Domain 0004 has 8 available 32-bit DMA segments
> >  PCI: 4 PE# for a total weight of 70
> >  pci 0004:01 : [PE# 002] Assign DMA32 space
> >  pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..8000
> >  pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22
> >  pci 0004:0d : [PE# 005] Assign DMA32 space
> >  pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..8000
> >  pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22
> >  pci 0004:0e : [PE# 006] Assign DMA32 space
> >  pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..8000
> >  pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22
> >  pci 0004:10 : [PE# 008] Assign DMA32 space
> >  pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..8000
> >  pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22
> >
> >and eventually the kdump kernel fails to boot as none of the PCI devices
> >(including the disk controller) are successfully initialized.
> >
> >The EINVAL response is because the DMA window (the 2GB base window) is
> >larger than the kdump kernel's reserved memory (crashkernel=, in this
> >case specified to be 1024M). The check in question,
> >
> >  if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
> >
> >is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust
> >the caller to pass in a smaller window size if our maximum memory value
> >is smaller than the DMA window.
> >
> >After this change, the PCI devices successfully set up the 32-bit TCE
> >table and kdump succeeds.
> >
> >The problem was seen on a Firestone machine originally.
> >
> >Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
> >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >
> >diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >index 85cbc96eff6c..0d7967e31169 100644
> >--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >@@ -2077,10 +2077,17 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
> >  {
> > struct iommu_table *tbl = NULL;
> > long rc;
> >+/*
> >+ * In memory constrained environments, e.g. kdump kernel, the
> >+ * DMA window can be larger than available memory, which will
> >+ * cause errors later.
> >+ */
> >+__u64 window_size =
> 
> I asked for "const __u64" ;)

I knew I'd forget something!

When attempting to kdump with the 4.2 kernel, we see for each PCI
device:

 pci 0003:01 : [PE# 000] Assign DMA32 space
 pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..8000
 pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22
 PCI: Domain 0004 has 8 available 32-bit DMA segments
 PCI: 4 PE# for a total weight of 70
 pci 0004:01 : [PE# 002] Assign DMA32 space
 pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..8000
 pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22
 pci 0004:0d : [PE# 005] Assign DMA32 space
 pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..8000
 pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22
 pci 0004:0e : [PE# 006] Assign DMA32 space
 pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..8000
 pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22
 pci 0004:10 : [PE# 008] Assign DMA32 space
 pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..8000
 pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22

and eventually the kdump kernel fails to boot as none of the PCI devices
(including the disk controller) are successfully initialized.

The EINVAL response is because the DMA window (the 2GB base window) is
larger than the kdump kernel's reserved memory (crashkernel=, in this
case specified to be 1024M). The check in question,

 if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))

is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust
the caller to p

[PATCH] powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel

2015-09-01 Thread Nishanth Aravamudan
When attempting to kdump with the 4.2 kernel, we see for each PCI
device:

 pci 0003:01 : [PE# 000] Assign DMA32 space
 pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..8000
 pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22
 PCI: Domain 0004 has 8 available 32-bit DMA segments
 PCI: 4 PE# for a total weight of 70
 pci 0004:01 : [PE# 002] Assign DMA32 space
 pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..8000
 pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22
 pci 0004:0d : [PE# 005] Assign DMA32 space
 pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..8000
 pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22
 pci 0004:0e : [PE# 006] Assign DMA32 space
 pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..8000
 pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22
 pci 0004:10 : [PE# 008] Assign DMA32 space
 pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..8000
 pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22

and eventually the kdump kernel fails to boot as none of the PCI devices
(including the disk controller) are successfully initialized.

The EINVAL response is because the DMA window (the 2GB base window) is
larger than the kdump kernel's reserved memory (crashkernel=, in this
case specified to be 1024M). The check in question,

 if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))

is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust
the caller to pass in a smaller window size if our maximum memory value
is smaller than the DMA window.

After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.

The problem was seen on a Firestone machine originally.

Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE 
pages")
Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 85cbc96eff6c..0d7967e31169 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2077,10 +2077,17 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
 {
struct iommu_table *tbl = NULL;
long rc;
+   /*
+* In memory constrained environments, e.g. kdump kernel, the
+* DMA window can be larger than available memory, which will
+* cause errors later.
+*/
+   __u64 window_size =
+   min((u64)pe->table_group.tce32_size, memory_hotplug_max());
 
rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
IOMMU_PAGE_SHIFT_4K,
-   pe->table_group.tce32_size,
+   window_size,
POWERNV_IOMMU_DEFAULT_LEVELS, &tbl);
if (rc) {
pe_err(pe, "Failed to create 32-bit TCE table, err %ld",


Re: [PATCH] openvswitch: make for_each_node loops work with sparse numa systems

2015-07-21 Thread Nishanth Aravamudan
On 21.07.2015 [10:32:34 -0500], Chris J Arges wrote:
 Some architectures like POWER can have a NUMA node_possible_map that
 contains sparse entries. This causes memory corruption with openvswitch
 since it allocates flow_cache with a multiple of num_possible_nodes() and

Couldn't this also be fixed by just allocating with a multiple of
nr_node_ids (which seems to have been the original intent all along)?
You could then make your stats array be sparse or not.
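
To illustrate (an untested sketch, reusing the allocation and init
calls from the patch): with a sparse node_possible_map of 0x30003,
num_possible_nodes() is 4 but the highest possible node is 17, so
nr_node_ids is 18, and only an nr_node_ids-sized array can safely be
indexed directly by node number:

	/* sketch: size stats[] so stats[node] is valid for any possible
	 * node, even when node_possible_map is sparse */
	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
				       + (nr_node_ids
					  * sizeof(struct flow_stats *)),
				       0, 0, NULL);

	for_each_node(node)	/* visits 0, 1, 16, 17 in this example */
		RCU_INIT_POINTER(flow->stats[node], NULL);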

 assumes the node variable returned by for_each_node will index into
 flow->stats[node].
 
 For example, if node_possible_map is 0x30003, this patch will map node to
 node_cnt as follows:
 0,1,16,17 = 0,1,2,3
 
 The crash was noticed after 3af229f2 was applied as it changed the
 node_possible_map to match node_online_map on boot.
 Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861

My concern with this version of the fix is that you're relying,
implicitly, on the order of for_each_node's iteration corresponding to
the entries in stats 1:1. But what about node hotplug? It seems better
to have the enumeration of the stats array match the topology
accurately, or to maintain some sort of internal map in the OVS code
between the NUMA node and the entry in the stats array. A sketch of the
latter follows below.
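
For the internal map, something like this is what I have in mind
(untested sketch; the names here are made up):

	/* sketch: dense stats index for each possible NUMA node,
	 * built once at init time */
	static int ovs_node_to_idx[MAX_NUMNODES] __read_mostly;

	static void ovs_build_node_to_idx(void)
	{
		int node, idx = 0;

		for_each_node(node)
			ovs_node_to_idx[node] = idx++;
	}

	/* lookups would then become, e.g.:
	 * stats = rcu_dereference_ovsl(flow->stats[ovs_node_to_idx[node]]);
	 */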

I'm willing to be convinced otherwise, though :)

-Nish

 Signed-off-by: Chris J Arges chris.j.ar...@canonical.com
 ---
  net/openvswitch/flow.c   | 10 ++
  net/openvswitch/flow_table.c | 18 +++---
  2 files changed, 17 insertions(+), 11 deletions(-)
 
 diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
 index bc7b0ab..425d45d 100644
 --- a/net/openvswitch/flow.c
 +++ b/net/openvswitch/flow.c
 @@ -134,14 +134,14 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
   struct ovs_flow_stats *ovs_stats,
   unsigned long *used, __be16 *tcp_flags)
  {
 - int node;
 + int node, node_cnt = 0;
 
   *used = 0;
   *tcp_flags = 0;
   memset(ovs_stats, 0, sizeof(*ovs_stats));
 
   for_each_node(node) {
  - struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
  + struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node_cnt]);
 
   if (stats) {
   /* Local CPU may write on non-local stats, so we must
 @@ -155,16 +155,17 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
   ovs_stats->n_bytes += stats->byte_count;
   spin_unlock_bh(&stats->lock);
   }
 + node_cnt++;
   }
  }
 
  /* Called with ovs_mutex. */
  void ovs_flow_stats_clear(struct sw_flow *flow)
  {
 - int node;
 + int node, node_cnt = 0;
 
   for_each_node(node) {
  - struct flow_stats *stats = ovsl_dereference(flow->stats[node]);
  + struct flow_stats *stats = ovsl_dereference(flow->stats[node_cnt]);
 
   if (stats) {
   spin_lock_bh(&stats->lock);
  @@ -174,6 +175,7 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
   stats->tcp_flags = 0;
   spin_unlock_bh(&stats->lock);
   }
 + node_cnt++;
   }
  }
 
 diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
 index 4613df8..5d10c54 100644
 --- a/net/openvswitch/flow_table.c
 +++ b/net/openvswitch/flow_table.c
 @@ -77,7 +77,7 @@ struct sw_flow *ovs_flow_alloc(void)
  {
   struct sw_flow *flow;
   struct flow_stats *stats;
 - int node;
 + int node, node_cnt = 0;
 
   flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
   if (!flow)
 @@ -99,9 +99,11 @@ struct sw_flow *ovs_flow_alloc(void)
 
   RCU_INIT_POINTER(flow->stats[0], stats);
 
 - for_each_node(node)
 + for_each_node(node) {
   if (node != 0)
  - RCU_INIT_POINTER(flow->stats[node], NULL);
  + RCU_INIT_POINTER(flow->stats[node_cnt], NULL);
 + node_cnt++;
 + }
 
   return flow;
  err:
 @@ -139,15 +141,17 @@ static struct flex_array *alloc_buckets(unsigned int 
 n_buckets)
 
  static void flow_free(struct sw_flow *flow)
  {
 - int node;
 + int node, node_cnt = 0;
 
   if (ovs_identifier_is_key(flow->id))
   kfree(flow->id.unmasked_key);
   kfree((struct sw_flow_actions __force *)flow->sf_acts);
  - for_each_node(node)
  - if (flow->stats[node])
  + for_each_node(node) {
  + if (flow->stats[node_cnt])
    kmem_cache_free(flow_stats_cache,
  - (struct flow_stats __force *)flow->stats[node]);
  + (struct flow_stats __force *)flow->stats[node_cnt]);
 + node_cnt++;
 + }
   kmem_cache_free(flow_cache, flow);
  }
 
 -- 
 1.9.1
 


Re: [PATCH] openvswitch: make for_each_node loops work with sparse numa systems

2015-07-21 Thread Nishanth Aravamudan
On 21.07.2015 [11:30:58 -0500], Chris J Arges wrote:
 On Tue, Jul 21, 2015 at 09:24:18AM -0700, Nishanth Aravamudan wrote:
  On 21.07.2015 [10:32:34 -0500], Chris J Arges wrote:
   Some architectures like POWER can have a NUMA node_possible_map that
   contains sparse entries. This causes memory corruption with openvswitch
   since it allocates flow_cache with a multiple of num_possible_nodes() and
  
  Couldn't this also be fixed by just allocating with a multiple of
  nr_node_ids (which seems to have been the original intent all along)?
  You could then make your stats array be sparse or not.
  
 
 Yea originally this is what I did, but I thought it would be wasting memory.
 
   assumes the node variable returned by for_each_node will index into
   flow-stats[node].
   
   For example, if node_possible_map is 0x30003, this patch will map node to
   node_cnt as follows:
   0,1,16,17 = 0,1,2,3
   
   The crash was noticed after 3af229f2 was applied as it changed the
   node_possible_map to match node_online_map on boot.
   Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861
  
  My concern with this version of the fix is that you're relying,
  implicitly, on the order of for_each_node's iteration corresponding to the
  entries in stats 1:1. But what about node hotplug? It seems better to
  have the enumeration of the stats array match the topology accurately,
  or to maintain some sort of internal map in the OVS code between
  the NUMA node and the entry in the stats array?
  
  I'm willing to be convinced otherwise, though :)
  
  -Nish
 
 
 Nish,
 
 The method I described should work for hotplug since it's using possible map
 which AFAIK is static rather than the online map. 

Oh you're right, I'm sorry!


Re: [PATCH v2] openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes

2015-07-21 Thread Nishanth Aravamudan
On 21.07.2015 [12:36:33 -0500], Chris J Arges wrote:
 Some architectures like POWER can have a NUMA node_possible_map that
 contains sparse entries. This causes memory corruption with openvswitch
 since it allocates flow_cache with a multiple of num_possible_nodes() and
 assumes the node variable returned by for_each_node will index into
 flow->stats[node].
 
 Use nr_node_ids to allocate a maximal sparse array instead of
 num_possible_nodes().
 
 The crash was noticed after 3af229f2 was applied as it changed the
 node_possible_map to match node_online_map on boot.
 Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861
 
 Signed-off-by: Chris J Arges chris.j.ar...@canonical.com
Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 ---
  net/openvswitch/flow_table.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
 index 4613df8..6552394 100644
 --- a/net/openvswitch/flow_table.c
 +++ b/net/openvswitch/flow_table.c
 @@ -752,7 +752,7 @@ int ovs_flow_init(void)
   BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
 
  flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
 -+ (num_possible_nodes()
 ++ (nr_node_ids
 * sizeof(struct flow_stats *)),
  0, 0, NULL);
   if (flow_cache == NULL)
 -- 
 1.9.1
 


Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-15 Thread Nishanth Aravamudan
On 15.07.2015 [16:35:16 -0400], Tejun Heo wrote:
 Hello,
 
 On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote:
  we currently emit at boot:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
  
  After this commit, we correctly emit:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
 
 JFYI, the numbers in the brackets aren't NUMA node numbers but percpu
 allocation group numbers and they're not split according to nodes but
 percpu allocation units.  In both cases, there are two units each
 serving 0-3 and 4-7.  In the above case, because it wasn't being fed
 the correct NUMA information, both got assigned to the same group.  In
 the latter, they got assigned to different ones but even then if the
 group numbers match NUMA node numbers, that's just a coincidence.

Ok, thank you for clarifying! From a correctness perspective, even if
the numbers don't match NUMA nodes, should we expect the grouping to be
split along NUMA topology?
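
(For reference, if I'm reading setup_64.c right, the grouping on power
comes from the distance callback we pass to pcpu_embed_first_chunk();
with the RFC patch applied it is simply:

	static int pcpu_cpu_distance(unsigned int from, unsigned int to)
	{
		if (early_cpu_to_node(from) == early_cpu_to_node(to))
			return LOCAL_DISTANCE;
		else
			return REMOTE_DISTANCE;
	}

so with correct early NUMA information, CPUs on different nodes should
at least end up in different allocation groups.)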

-Nish


Re: [PATCH 5/6] [RFC] crypto/testmgr: add null test for 842 algorithm

2015-07-13 Thread Nishanth Aravamudan
On 13.07.2015 [17:05:36 -0700], Nishanth Aravamudan wrote:
 On 04.07.2015 [15:24:53 +0800], Herbert Xu wrote:
  On Thu, Jul 02, 2015 at 03:41:19PM -0700, Nishanth Aravamudan wrote:
   Currently, when the nx-842-pseries driver loads, the following message
   is emitted:
   
   alg: No test for 842 (842-nx)
   
   It seems like the simplest way to fix this message (other than adding a
   proper test) is to just insert the null test into the list in the
   testmgr.
   
   Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
  
  Please add some real test vectors instead.

Apologies, hit send too fast. I'll work with Dan on this when he gets
back from vacation.

-Nish


Re: [PATCH 5/6] [RFC] crypto/testmgr: add null test for 842 algorithm

2015-07-13 Thread Nishanth Aravamudan
On 04.07.2015 [15:24:53 +0800], Herbert Xu wrote:
 On Thu, Jul 02, 2015 at 03:41:19PM -0700, Nishanth Aravamudan wrote:
  Currently, when the nx-842-pseries driver loads, the following message
  is emitted:
  
  alg: No test for 842 (842-nx)
  
  It seems like the simplest way to fix this message (other than adding a
  proper test) is to just insert the null test into the list in the
  testmgr.
  
  Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 
 Please add some real test vectors instead.


Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-10 Thread Nishanth Aravamudan
On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote:
 On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
  On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
   Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
   have an ordering issue during boot with early calls to cpu_to_node().
  
  now that .. implies we changed something and broke this. What commit was
  it that changed the behaviour?
 
 Well, that's something I'm trying to still unearth. In the commits
 before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
 powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID), the dmesg reports:
 
 pcpu-alloc: [0] 0 1 2 3 4 5 6 7

Ok, I did a bisection, and it seems like prior to commit
1a4d76076cda69b0abf15463a8cebc172406da25 (percpu: implement
asynchronous chunk population), we emitted the above, e.g.:

pcpu-alloc: [0] 0 1 2 3 4 5 6 7

And after that commit, we emitted:

pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7

I'm not exactly sure why that changed, but I'm still
reading/understanding the commit. Tejun might be able to explain.

Tejun, for reference, I noticed on Power systems since the
above-mentioned commit, pcpu-alloc is not reflecting the topology of the
system correctly -- that is, the pcpu areas are all on node 0
unconditionally (based up on pcpu-alloc's output). Prior to that, there
was just one group, it seems like, which completely ignored the NUMA
topology.

Is this just an ordering thing that changed with the introduction of the
async code?

Thanks,
Nish


Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-10 Thread Nishanth Aravamudan
On 08.07.2015 [18:22:09 -0700], David Rientjes wrote:
 On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:
 
  Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
  have an ordering issue during boot with early calls to cpu_to_node().
  The value returned by those calls now depends on the per-cpu area being
  set up, but that is not guaranteed to be the case during boot. Instead,
  we need to add an early_cpu_to_node() which doesn't use the per-CPU area
  and call that from certain spots that are known to invoke cpu_to_node()
  before the per-CPU areas are configured.
  
  On an example 2-node NUMA system with the following topology:
  
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3
  node 0 size: 2029 MB
  node 0 free: 1753 MB
  node 1 cpus: 4 5 6 7
  node 1 size: 2045 MB
  node 1 free: 1945 MB
  node distances:
  node   0   1 
0:  10  40 
1:  40  10 
  
  we currently emit at boot:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
  
  After this commit, we correctly emit:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
  
  Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
  
  diff --git a/arch/powerpc/include/asm/topology.h 
  b/arch/powerpc/include/asm/topology.h
  index 5f1048e..f2c4c89 100644
  --- a/arch/powerpc/include/asm/topology.h
  +++ b/arch/powerpc/include/asm/topology.h
  @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
   extern int __node_distance(int, int);
   #define node_distance(a, b) __node_distance(a, b)
   
  +extern int early_cpu_to_node(int);
  +
   extern void __init dump_numa_cpu_topology(void);
   
   extern int sysfs_add_device_to_node(struct device *dev, int nid);
  diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
  index c69671c..23a2cf3 100644
  --- a/arch/powerpc/kernel/setup_64.c
  +++ b/arch/powerpc/kernel/setup_64.c
  @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
   
   static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t 
  align)
   {
  -   return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
  -   __pa(MAX_DMA_ADDRESS));
  +   return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
  +   align, __pa(MAX_DMA_ADDRESS));
   }
   
   static void __init pcpu_fc_free(void *ptr, size_t size)
  @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
   
   static int pcpu_cpu_distance(unsigned int from, unsigned int to)
   {
  -   if (cpu_to_node(from) == cpu_to_node(to))
  +   if (early_cpu_to_node(from) == early_cpu_to_node(to))
  return LOCAL_DISTANCE;
  else
  return REMOTE_DISTANCE;
  diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
  index 5e80621..9ffabf4 100644
  --- a/arch/powerpc/mm/numa.c
  +++ b/arch/powerpc/mm/numa.c
  @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
  cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
   }
   
  +int early_cpu_to_node(int cpu)
  +{
  +   return numa_cpu_lookup_table[cpu];
  +}
  +
   #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
   static void unmap_cpu_from_node(unsigned long cpu)
   {
  
  
 
 early_cpu_to_node() looks like it's begging to be __init since we 
 shouldn't have a need to reference to numa_cpu_lookup_table after boot and 
 that appears like it can be done if pcpu_cpu_distance() is made __init in 
 this patch and smp_prepare_boot_cpu() is made __init in the next patch.  
 So I think this is fine, but those functions and things like 
 reset_numa_cpu_lookup_table() should be in init.text.

Yep, that makes total sense!

 After the percpu areas on initialized and cpu_to_node() is correct, it 
 would be really nice to be able to make numa_cpu_lookup_table[] be 
 __initdata since it shouldn't be necessary anymore.  That probably has cpu 
 callbacks that need to be modified to no longer look at 
 numa_cpu_lookup_table[] or pass the value in, but it would make it much 
 cleaner.  Then nobody will have to worry about figuring out whether 
 early_cpu_to_node() or cpu_to_node() is the right one to call.

When I worked on the original pcpu patches for power, I wanted to do
this, but got myself confused and never came back to it. Thank you for
suggesting it and I'll work on it soon.
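
Something like the following, I believe, is what you're suggesting
(untested sketch):

	int __init early_cpu_to_node(int cpu)
	{
		return numa_cpu_lookup_table[cpu];
	}

	static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
	{
		if (early_cpu_to_node(from) == early_cpu_to_node(to))
			return LOCAL_DISTANCE;
		else
			return REMOTE_DISTANCE;
	}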

-Nish


Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-08 Thread Nishanth Aravamudan
On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
 On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
  Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
  have an ordering issue during boot with early calls to cpu_to_node().
 
 now that .. implies we changed something and broke this. What commit was
 it that changed the behaviour?

Well, that's something I'm trying to still unearth. In the commits
before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID), the dmesg reports:

pcpu-alloc: [0] 0 1 2 3 4 5 6 7

At least prior to 8c272261194d, this might have been due to the old
powerpc-specific cpu_to_node():

static inline int cpu_to_node(int cpu)
{
	int nid;

	nid = numa_cpu_lookup_table[cpu];

	/*
	 * During early boot, the numa-cpu lookup table might not have
	 * been setup for all CPUs yet. In such cases, default to node 0.
	 */
	return (nid < 0) ? 0 : nid;
}

which might imply that no one cares or that simply no one noticed.

  The value returned by those calls now depends on the per-cpu area being
  set up, but that is not guaranteed to be the case during boot. Instead,
  we need to add an early_cpu_to_node() which doesn't use the per-CPU area
  and call that from certain spots that are known to invoke cpu_to_node()
  before the per-CPU areas are configured.
  
  On an example 2-node NUMA system with the following topology:
  
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3
  node 0 size: 2029 MB
  node 0 free: 1753 MB
  node 1 cpus: 4 5 6 7
  node 1 size: 2045 MB
  node 1 free: 1945 MB
  node distances:
  node   0   1 
0:  10  40 
1:  40  10 
  
  we currently emit at boot:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
  
  After this commit, we correctly emit:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
 
 
 So it looks fairly sane, and I guess it's a bug fix.
 
 But I'm a bit reluctant to put it in straight away without some time in next.

I'm fine with that -- it could use some more extensive testing,
admittedly (I only have been able to verify the pcpu areas are being
correctly allocated on the right node so far).

I still need to test with hotplug and things like that. Hence the RFC.

 It looks like the symptom is that the per-cpu areas are all allocated on node
 0, is that all that goes wrong?

Yes, that's the symptom. I cc'd a few folks to see if they could help
indicate the performance implications of such a setup -- sorry, I should
have been more explicit about that.

Thanks,
Nish


Re: [PATCH 6/6] nx-842-platform: if NX842 platform drivers are not modules, don't try to load them

2015-07-06 Thread Nishanth Aravamudan
On 06.07.2015 [16:13:07 +0800], Herbert Xu wrote:
 On Thu, Jul 02, 2015 at 03:42:26PM -0700, Nishanth Aravamudan wrote:
  Based off the CONFIG_SPU_FS_MODULE code, only attempt to load platform
  modules if the nx-842 pseries/powernv drivers are built as modules.
  
  Otherwise, if CONFIG_DEV_NX_COMPRESS=y,
  CONFIG_DEV_NX_COMPRESS_PSERIES=y, CONFIG_DEV_NX_POWERNV=y, the following
  message is emitted at boot:
  
  nx_compress: no nx842 driver found.
  
  even though the drivers successfully load.
  
  This is because in the =y case, the module_init() calls get converted to
  initcalls and the nx842_init() runs before the platform driver
  nx842_pseries_init() or nx842_powernv_init() functions, which are what
  normally set the static platform driver.
  
  Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
  Cc: Dan Streetman ddstr...@us.ibm.com
  Cc: Herbert Xu herb...@gondor.apana.org.au
  Cc: David S. Miller da...@davemloft.net
  Cc: linux-cry...@vger.kernel.org
  Cc: linuxppc-dev@lists.ozlabs.org
 
 Ugh, I think this whole thing is redundant.  The whole point of
 the crypto API is to allow the coexistence of multiple underlying
 implementations.

Sure, that makes sense -- sorry, I was picking this up while Dan was on
vacation. Will provide a better v2.

 Please get rid of nx-842-platform.c completely and move the crypto
 registration into the individual platform drivers.  That is, powernv
 and pseries should each register their own crypto driver.  They can of
 course share a common set of crypto code which can live in its own
 module.  There should be no need for mucking with module reference
 counts at all.

Will do, thanks!

-Nish


[PATCH v2] crypto/nx-842-{powerpc,pseries}: reduce chattiness of platform drivers

2015-07-06 Thread Nishanth Aravamudan
On 03.07.2015 [11:30:32 +1000], Michael Ellerman wrote:
 On Thu, 2015-07-02 at 15:40 -0700, Nishanth Aravamudan wrote:
  While we never would successfully load on the wrong machine type, there
  is extra output by default regardless of machine type.
  
  For instance, on a PowerVM LPAR, we see the following:
  
  nx_compress_powernv: loading
  nx_compress_powernv: no coprocessors found
  
  even though those coprocessors could never be found. Similar pseries
  messages are printed on powernv.
 
 I know I've been converting init calls to machine_initcalls() to avoid
 these sort of issues in platform code. But for a driver it should be
 trivial for it to only probe when the hardware is found.
 
 By which I mean I think we shouldn't need these.

Ok.

  diff --git a/drivers/crypto/nx/nx-842-powernv.c 
  b/drivers/crypto/nx/nx-842-powernv.c
  index 33b3b0abf4ae..6b5e5143c95b 100644
  --- a/drivers/crypto/nx/nx-842-powernv.c
  +++ b/drivers/crypto/nx/nx-842-powernv.c
  @@ -594,6 +594,9 @@ static __init int nx842_powernv_init(void)
  BUILD_BUG_ON(DDE_BUFFER_ALIGN % DDE_BUFFER_SIZE_MULT);
  BUILD_BUG_ON(DDE_BUFFER_SIZE_MULT % DDE_BUFFER_LAST_MULT);
   
  +   if (!machine_is(powernv))
  +   return -ENODEV;
  +
   pr_info("loading\n");
 
 This is just too chatty, drop it.
 
   for_each_compatible_node(dn, NULL, "ibm,power-nx")
 
 It shouldn't be printing anything unless it finds some devices in this loop.
 
 And you should drop the print in here:
 
   if (!nx842_ct) {
   pr_err("no coprocessors found\n");
   return -ENODEV;
   }
 
 And that should mean no output unless hardware is found I think?

Yep, will adjust.

  @@ -625,6 +628,9 @@ static void __exit nx842_powernv_exit(void)
   {
  struct nx842_coproc *coproc, *n;
   
  +   if (!machine_is(powernv))
  +   return;
 
 You shouldn't need to touch the exit paths if the drivers were never
 loaded?

Duh, yep.

  diff --git a/drivers/crypto/nx/nx-842-pseries.c 
  b/drivers/crypto/nx/nx-842-pseries.c
  index b84b0ceeb46e..75a7bfdc160e 100644
  --- a/drivers/crypto/nx/nx-842-pseries.c
  +++ b/drivers/crypto/nx/nx-842-pseries.c
  @@ -1091,6 +1091,9 @@ static int __init nx842_pseries_init(void)
  struct nx842_devdata *new_devdata;
  int ret;
   
  +   if (!machine_is(pseries))
  +   return -ENODEV;
  +
   pr_info("Registering IBM Power 842 compression driver\n");
 
 Again this is too chatty, just remove it.

Will do.

   if (!of_find_compatible_node(NULL, NULL, "ibm,compression"))
   return -ENODEV;
 
 That should do the trick shouldn't it?

Yep, I think so.

Thanks Michael!


While we never would successfully load on the wrong machine type, there
is extra output by default regardless of machine type.

For instance, on a PowerVM LPAR, we see the following:

nx_compress_powernv: loading
nx_compress_powernv: no coprocessors found

even though those coprocessors could never be found.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
Cc: Dan Streetman ddstr...@us.ibm.com
Cc: Herbert Xu herb...@gondor.apana.org.au
Cc: David S. Miller da...@davemloft.net
Cc: linux-cry...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org

---
v2:
  Rather than not loading, just reduce the verbosity

 drivers/crypto/nx/nx-842-powernv.c | 10 +-
 drivers/crypto/nx/nx-842-pseries.c |  3 ---
 2 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c 
b/drivers/crypto/nx/nx-842-powernv.c
index 33b3b0abf4ae..7e474562058d 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -594,15 +594,11 @@ static __init int nx842_powernv_init(void)
BUILD_BUG_ON(DDE_BUFFER_ALIGN % DDE_BUFFER_SIZE_MULT);
BUILD_BUG_ON(DDE_BUFFER_SIZE_MULT % DDE_BUFFER_LAST_MULT);
 
-   pr_info("loading\n");
-
	for_each_compatible_node(dn, NULL, "ibm,power-nx")
nx842_powernv_probe(dn);
 
-   if (!nx842_ct) {
-   pr_err("no coprocessors found\n");
+   if (!nx842_ct)
return -ENODEV;
-   }
 
	if (!nx842_platform_driver_set(&nx842_powernv_driver)) {
struct nx842_coproc *coproc, *n;
@@ -615,8 +611,6 @@ static __init int nx842_powernv_init(void)
return -EEXIST;
}
 
-   pr_info("loaded\n");
-
return 0;
 }
 module_init(nx842_powernv_init);
@@ -631,7 +625,5 @@ static void __exit nx842_powernv_exit(void)
	list_del(&coproc->list);
kfree(coproc);
}
-
-   pr_info("unloaded\n");
 }
 module_exit(nx842_powernv_exit);
diff --git a/drivers/crypto/nx/nx-842-pseries.c 
b/drivers/crypto/nx/nx-842-pseries.c
index b84b0ceeb46e..d44524da6589 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -1091,8 +1091,6 @@ static int __init nx842_pseries_init(void)
struct nx842_devdata *new_devdata;
int ret;
 
-   pr_info("Registering IBM Power 842 compression

[PATCH 5/6] [RFC] crypto/testmgr: add null test for 842 algorithm

2015-07-02 Thread Nishanth Aravamudan
Currently, when the nx-842-pseries driver loads, the following message
is emitted:

alg: No test for 842 (842-nx)

It seems like the simplest way to fix this message (other than adding a
proper test) is to just insert the null test into the list in the
testmgr.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

---
 crypto/testmgr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index d0a42bd3aae9..ff0f76e0d0b4 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1982,6 +1982,9 @@ static int alg_test_null(const struct alg_test_desc *desc,
 /* Please keep this list sorted by algorithm name. */
 static const struct alg_test_desc alg_test_descs[] = {
{
+   .alg = "842",
+   .test = alg_test_null,
+   }, {
	.alg = "__cbc-cast5-avx",
.test = alg_test_null,
}, {
-- 
2.1.4
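
For reference, a real test would instead hook up compression and
decompression vectors via alg_test_comp. A hypothetical sketch -- the
vector bytes below are made up purely for illustration, and
t842_decomp_tv_template would be defined analogously:

	/* testmgr.h sketch: placeholder bytes, not real 842 output */
	static struct comp_testvec t842_comp_tv_template[] = {
		{
			.inlen	= 11,
			.outlen	= 8,
			.input	= "hello hello",
			.output	= "\x00\x01\x02\x03\x04\x05\x06\x07",
		},
	};

	/* testmgr.c sketch: entry in alg_test_descs[] */
	{
		.alg = "842",
		.test = alg_test_comp,
		.suite = {
			.comp = {
				.comp = {
					.vecs = t842_comp_tv_template,
					.count = 1
				},
				.decomp = {
					.vecs = t842_decomp_tv_template,
					.count = 1
				}
			}
		}
	},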


[PATCH 6/6] nx-842-platform: if NX842 platform drivers are not modules, don't try to load them

2015-07-02 Thread Nishanth Aravamudan
Based off the CONFIG_SPU_FS_MODULE code, only attempt to load platform
modules if the nx-842 pseries/powernv drivers are built as modules.

Otherwise, if CONFIG_DEV_NX_COMPRESS=y,
CONFIG_DEV_NX_COMPRESS_PSERIES=y, CONFIG_DEV_NX_POWERNV=y, the following
message is emitted at boot:

nx_compress: no nx842 driver found.

even though the drivers successfully load.

This is because in the =y case, the module_init() calls get converted to
initcalls and the nx842_init() runs before the platform driver
nx842_pseries_init() or nx842_powernv_init() functions, which are what
normally set the static platform driver.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
Cc: Dan Streetman ddstr...@us.ibm.com
Cc: Herbert Xu herb...@gondor.apana.org.au
Cc: David S. Miller da...@davemloft.net
Cc: linux-cry...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 drivers/crypto/nx/nx-842-platform.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/nx/nx-842-platform.c 
b/drivers/crypto/nx/nx-842-platform.c
index 664f13dd06ed..5363c72593b4 100644
--- a/drivers/crypto/nx/nx-842-platform.c
+++ b/drivers/crypto/nx/nx-842-platform.c
@@ -53,6 +53,7 @@ void nx842_platform_driver_unset(struct nx842_driver *_driver)
 }
 EXPORT_SYMBOL_GPL(nx842_platform_driver_unset);
 
+#if defined(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES_MODULE) || 
defined(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV_MODULE)
 bool nx842_platform_driver_get(void)
 {
bool ret = false;
@@ -66,7 +67,6 @@ bool nx842_platform_driver_get(void)
 
return ret;
 }
-EXPORT_SYMBOL_GPL(nx842_platform_driver_get);
 
 void nx842_platform_driver_put(void)
 {
@@ -77,6 +77,17 @@ void nx842_platform_driver_put(void)
 
	spin_unlock(&driver_lock);
 }
+#else
+bool nx842_platform_driver_get(void)
+{
+   return true;
+}
+
+void nx842_platform_driver_put(void)
+{
+}
+#endif
+EXPORT_SYMBOL_GPL(nx842_platform_driver_get);
 EXPORT_SYMBOL_GPL(nx842_platform_driver_put);
 
MODULE_LICENSE("GPL");
-- 
2.1.4


[RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-02 Thread Nishanth Aravamudan
Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
have an ordering issue during boot with early calls to cpu_to_node().
The value returned by those calls now depends on the per-cpu area being
set up, but that is not guaranteed to be the case during boot. Instead,
we need to add an early_cpu_to_node() which doesn't use the per-CPU area
and call that from certain spots that are known to invoke cpu_to_node()
before the per-CPU areas are configured.

On an example 2-node NUMA system with the following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 2029 MB
node 0 free: 1753 MB
node 1 cpus: 4 5 6 7
node 1 size: 2045 MB
node 1 free: 1945 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

we currently emit at boot:

[0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 

After this commit, we correctly emit:

[0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 5f1048e..f2c4c89 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
+extern int early_cpu_to_node(int);
+
 extern void __init dump_numa_cpu_topology(void);
 
 extern int sysfs_add_device_to_node(struct device *dev, int nid);
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index c69671c..23a2cf3 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
 
 static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
 {
-   return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
-   __pa(MAX_DMA_ADDRESS));
+   return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
+   align, __pa(MAX_DMA_ADDRESS));
 }
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
@@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
-   if (cpu_to_node(from) == cpu_to_node(to))
+   if (early_cpu_to_node(from) == early_cpu_to_node(to))
return LOCAL_DISTANCE;
else
return REMOTE_DISTANCE;
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5e80621..9ffabf4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
 }
 
+int early_cpu_to_node(int cpu)
+{
+   return numa_cpu_lookup_table[cpu];
+}
+
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
 static void unmap_cpu_from_node(unsigned long cpu)
 {


[RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table

2015-07-02 Thread Nishanth Aravamudan
A simple move to a wrapper function to numa_cpu_lookup_table, now that
power has the early_cpu_to_node() API.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec9ec20..7bf333b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 * numa_node_id() works after this.
 */
if (cpu_present(cpu)) {
-   set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+   set_cpu_numa_node(cpu, early_cpu_to_node(cpu));
set_cpu_numa_mem(cpu,
-   local_memory_node(numa_cpu_lookup_table[cpu]));
+   local_memory_node(early_cpu_to_node(cpu)));
}
}
 
@@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void)
 #ifdef CONFIG_PPC64
paca[boot_cpuid].__current = current;
 #endif
-   set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
+   set_numa_node(early_cpu_to_node(boot_cpuid));
current_set[boot_cpuid] = task_thread_info(current);
 }
 


[PATCH 0/6] drivers/nx-842: reduce verbosity of logging

2015-07-02 Thread Nishanth Aravamudan
Currently, on a LPAR with the nx-842 device disabled, the following
messages are emitted:

nx_compress: no nx842 driver found. [1]
Registering IBM Power 842 compression driver
nx_compress_pseries ibm,compression-v1: nx842_OF_upd_status: status 'disabled' 
is not 'okay'
nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sync_size new:4096 
old:0 [2]
nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sync_sg new:510 old:0
nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sg_len new:4080 old:0
nx_compress_powernv: loading [3]
nx_compress_powernv: no coprocessors found
alg: No test for 842 (842-nx) [4]

[1] is the result of an ordering issue when the CONFIG_ options are set =y. [2]
is the result of nx842_OF_upd_status() not returning the correct error code.
[3] is the result of attempting to load the PowerNV platform driver on a
non-PowerNV platform. [4] is the result of there simply not being any test for
842 in the crypto test manager.

After the changes in the series, the same system as above emits:

Registering IBM Power 842 compression driver
nx_compress_pseries ibm,compression-v1: nx842_OF_upd: device disabled

which seems to me, at least, to be far clearer.

Dan, I think there is still a issue in the code. If CONFIG_DEV_NX_COMPRESS=y
and CONFIG_DEV_NX_COMPRESS_PSERIES=m/CONFIG_DEV_NX_COMPRESS_POWERNV=m, it seems
like the request_module() is not working properly and we simply get a
"nx_compress: no nx842 driver found." at boot (even if I ensure the platform
drivers are in the initrd). If I make CONFIG_DEV_NX_COMPRESS=m, though, the
module(s) load successfully. Does it make sense/is it possible to have these
three symbols always be the same (either all =y or all =m)?
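
(One untested Kconfig idea for tying them together -- have the platform
symbols depend on and default to the parent symbol, so all three track
the same tristate; the prompt text here is a guess:

	config CRYPTO_DEV_NX_COMPRESS_PSERIES
		tristate "Compression acceleration support on pSeries platforms"
		depends on CRYPTO_DEV_NX_COMPRESS && PPC_PSERIES
		default CRYPTO_DEV_NX_COMPRESS

though whether forcing them to match is actually desirable is a
separate question.)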

[1/6] crypto/nx-842-pseries: nx842_OF_upd_status should return ENODEV if device 
is not 'okay'
 drivers/crypto/nx/nx-842-pseries.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
[2/6] nx-842-pseries: rename nx842_{init,exit} to nx842_pseries_{init,exit}
 drivers/crypto/nx/nx-842-pseries.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
[3/6] nx-842-pseries: do not emit extra output if status is disabled
 drivers/crypto/nx/nx-842-pseries.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
[4/6] crypto/nx-842-{powerpc,pseries}: only load on the appropriate machine type
 drivers/crypto/nx/nx-842-powernv.c | 6 ++
 drivers/crypto/nx/nx-842-pseries.c | 6 ++
 drivers/crypto/nx/nx-842.h | 1 +
 3 files changed, 13 insertions(+)
[5/6] crypto/testmgr: add null test for 842 algorithm
 crypto/testmgr.c | 3 +++
 1 file changed, 3 insertions(+)
[6/6] nx-842-platform: if NX842 platform drivers are not modules, don't try to 
load them
 drivers/crypto/nx/nx-842-platform.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)


[PATCH 2/6] nx-842-pseries: rename nx842_{init,exit} to nx842_pseries_{init,exit}

2015-07-02 Thread Nishanth Aravamudan
While there is no technical problem with nx-842.c and nx-842-pseries.c
using the same names for their init/exit functions, it is a bit
confusing with initcall_debug. Rename the pseries-specific functions
appropriately.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

---
 drivers/crypto/nx/nx-842-pseries.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-pseries.c 
b/drivers/crypto/nx/nx-842-pseries.c
index 819c23c546e3..e17f4d2e96e0 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -1080,7 +1080,7 @@ static struct vio_driver nx842_vio_driver = {
.id_table = nx842_vio_driver_ids,
 };
 
-static int __init nx842_init(void)
+static int __init nx842_pseries_init(void)
 {
struct nx842_devdata *new_devdata;
int ret;
@@ -1116,9 +1116,9 @@ static int __init nx842_init(void)
return 0;
 }
 
-module_init(nx842_init);
+module_init(nx842_pseries_init);
 
-static void __exit nx842_exit(void)
+static void __exit nx842_pseries_exit(void)
 {
struct nx842_devdata *old_devdata;
unsigned long flags;
@@ -1137,5 +1137,5 @@ static void __exit nx842_exit(void)
vio_unregister_driver(nx842_vio_driver);
 }
 
-module_exit(nx842_exit);
+module_exit(nx842_pseries_exit);
 
-- 
2.1.4


[PATCH 1/6] crypto/nx-842-pseries: nx842_OF_upd_status should return ENODEV if device is not 'okay'

2015-07-02 Thread Nishanth Aravamudan
The current documentation mentions explicitly that EINVAL should be
returned if the device is not available, but nx842_OF_upd_status()
always returns 0. However, nx842_probe() specifically checks for
non-ENODEV returns from nx842_of_upd() (which in turn calls
nx842_OF_upd_status()) and emits an extra error in that case. It seems
like the proper return code of a disabled device is ENODEV.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

---
 drivers/crypto/nx/nx-842-pseries.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/nx/nx-842-pseries.c 
b/drivers/crypto/nx/nx-842-pseries.c
index 3040a6091bf2..819c23c546e3 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -556,7 +556,7 @@ static int nx842_OF_set_defaults(struct nx842_devdata 
*devdata)
  *
  * Returns:
  *  0 - Device is available
- *  -EINVAL - Device is not available
+ *  -ENODEV - Device is not available
  */
 static int nx842_OF_upd_status(struct nx842_devdata *devdata,
struct property *prop) {
@@ -569,6 +569,7 @@ static int nx842_OF_upd_status(struct nx842_devdata 
*devdata,
dev_info(devdata->dev, "%s: status '%s' is not 'okay'\n",
__func__, status);
devdata->status = UNAVAILABLE;
+   ret = -ENODEV;
}
 
return ret;
-- 
2.1.4


[PATCH 3/6] nx-842-pseries: do not emit extra output if status is disabled

2015-07-02 Thread Nishanth Aravamudan
If the device-tree indicates the nx-842 device's status is 'disabled',
we emit two messages:

nx_compress_pseries ibm,compression-v1: nx842_OF_upd_status: status 'disabled' 
is not 'okay'.
nx_compress_pseries ibm,compression-v1: nx842_OF_upd: device disabled

Given that 'disabled' is a valid state, and we are going to emit that
the device is disabled, only print out a non-'okay' status if it is not
'disabled'.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

---
 drivers/crypto/nx/nx-842-pseries.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/nx/nx-842-pseries.c 
b/drivers/crypto/nx/nx-842-pseries.c
index e17f4d2e96e0..b84b0ceeb46e 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -566,8 +566,14 @@ static int nx842_OF_upd_status(struct nx842_devdata 
*devdata,
if (!strncmp(status, "okay", (size_t)prop->length)) {
devdata->status = AVAILABLE;
} else {
-   dev_info(devdata->dev, "%s: status '%s' is not 'okay'\n",
+   /*
+* Caller will log that the device is disabled, so only
+* output if there is an unexpected status.
+*/
+   if (strncmp(status, "disabled", (size_t)prop->length)) {
+   dev_info(devdata->dev, "%s: status '%s' is not 'okay'\n",
__func__, status);
+   }
devdata->status = UNAVAILABLE;
ret = -ENODEV;
}
-- 
2.1.4


[PATCH 4/6] crypto/nx-842-{powerpc,pseries}: only load on the appropriate machine type

2015-07-02 Thread Nishanth Aravamudan
While we never would successfully load on the wrong machine type, there
is extra output by default regardless of machine type.

For instance, on a PowerVM LPAR, we see the following:

nx_compress_powernv: loading
nx_compress_powernv: no coprocessors found

even though those coprocessors could never be found. Similar pseries
messages are printed on powernv.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

---
 drivers/crypto/nx/nx-842-powernv.c | 6 ++
 drivers/crypto/nx/nx-842-pseries.c | 6 ++
 drivers/crypto/nx/nx-842.h | 1 +
 3 files changed, 13 insertions(+)

diff --git a/drivers/crypto/nx/nx-842-powernv.c 
b/drivers/crypto/nx/nx-842-powernv.c
index 33b3b0abf4ae..6b5e5143c95b 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -594,6 +594,9 @@ static __init int nx842_powernv_init(void)
BUILD_BUG_ON(DDE_BUFFER_ALIGN % DDE_BUFFER_SIZE_MULT);
BUILD_BUG_ON(DDE_BUFFER_SIZE_MULT % DDE_BUFFER_LAST_MULT);
 
+   if (!machine_is(powernv))
+   return -ENODEV;
+
	pr_info("loading\n");
 
	for_each_compatible_node(dn, NULL, "ibm,power-nx")
@@ -625,6 +628,9 @@ static void __exit nx842_powernv_exit(void)
 {
struct nx842_coproc *coproc, *n;
 
+   if (!machine_is(powernv))
+   return;
+
	nx842_platform_driver_unset(&nx842_powernv_driver);
 
	list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) {
diff --git a/drivers/crypto/nx/nx-842-pseries.c 
b/drivers/crypto/nx/nx-842-pseries.c
index b84b0ceeb46e..75a7bfdc160e 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -1091,6 +1091,9 @@ static int __init nx842_pseries_init(void)
struct nx842_devdata *new_devdata;
int ret;
 
+   if (!machine_is(pseries))
+   return -ENODEV;
+
	pr_info("Registering IBM Power 842 compression driver\n");
 
	if (!of_find_compatible_node(NULL, NULL, "ibm,compression"))
@@ -1129,6 +1132,9 @@ static void __exit nx842_pseries_exit(void)
struct nx842_devdata *old_devdata;
unsigned long flags;
 
+   if (!machine_is(pseries))
+   return;
+
	pr_info("Exiting IBM Power 842 compression driver\n");
	nx842_platform_driver_unset(&nx842_pseries_driver);
	spin_lock_irqsave(&devdata_mutex, flags);
diff --git a/drivers/crypto/nx/nx-842.h b/drivers/crypto/nx/nx-842.h
index ac0ea79d0f8b..ea89c661e476 100644
--- a/drivers/crypto/nx/nx-842.h
+++ b/drivers/crypto/nx/nx-842.h
@@ -10,6 +10,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/ratelimit.h>
+#include <asm/machdep.h>
 
 /* Restrictions on Data Descriptor List (DDL) and Entry (DDE) buffers
  *
-- 
2.1.4


Re: powerpc,numa: Memory hotplug to memory-less nodes ?

2015-06-25 Thread Nishanth Aravamudan
On 24.06.2015 [07:13:36 -0500], Nathan Fontenot wrote:
 On 06/23/2015 11:01 PM, Bharata B Rao wrote:
  So will it be correct to say that memory hotplug to memory-less node
  isn't supported by PowerPC kernel ? Should I enforce the same in QEMU
  for PowerKVM ?
 
 
 I'm not sure if that is correct. It appears that we initialize all online
 nodes, even those without spanned_pages, at boot time. This occurs
 in setup_node_data() called from initmem_init().
 
 Looking at this I would think that we could add memory to any online node
 even if it does not have any spanned_pages. I think an interesting test
 we be to check for the node being online instead of checking to see if
 it has any memory.

I see no *technical* reason we should't be able to hotplug to an
initially memoryless node. I'm not sure it happens in practice under
PowerVM (where we have far less control of the topology anyways). One
aspect of this that I have on my todo list is seeing what SLUB does when
a node goes from memoryless to populated -- as during boot memoryless
nodes get a 'useless' per node structure (early_kmem_cache_node_alloc).

I can look at this a bit under KVM maybe later this week myself to see
what happens in a guest.

-Nish


Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

2015-05-08 Thread Nishanth Aravamudan
On 08.05.2015 [15:47:26 -0700], Andrew Morton wrote:
 On Wed, 06 May 2015 11:28:12 +0200 Vlastimil Babka vba...@suse.cz wrote:
 
  On 05/06/2015 12:09 AM, Nishanth Aravamudan wrote:
   On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote:
   What I find somewhat worrying though is that we could potentially
   break the pfmemalloc_watermark_ok() test in situations where
   zone_reclaimable_pages(zone) == 0 is a transient situation (and not
   a permanently allocated hugepage). In that case, the throttling is
   supposed to help system recover, and we might be breaking that
   ability with this patch, no?
  
   Well, if it's transient, we'll skip it this time through, and once there
   are reclaimable pages, we should notice it again.
  
   I'm not familiar enough with this logic, so I'll read through the code
   again soon to see if your concern is valid, as best I can.
  
   In reviewing the code, I think that transiently unreclaimable zones will
   lead to some higher direct reclaim rates and possible contention, but
   shouldn't cause any major harm. The likelihood of that situation, as
   well, in a non-reserved memory setup like the one I described, seems
   exceedingly low.
  
  OK, I guess when a reasonably configured system has nothing to reclaim, 
  it's already busted and throttling won't change much.
  
  Consider the patch Acked-by: Vlastimil Babka vba...@suse.cz
 
 OK, thanks, I'll move this patch into the queue for 4.2-rc1.

Thank you!

 Or is it important enough to merge into 4.1?

I think 4.2 is sufficient, but I wonder now if I should have included a
stable tag? The issue has been around for a while and there's a
relatively easily workaround (use the per-node sysfs files to manually
round-robin around the exhausted node) in older kernels, so I had
decided against it before.

Thanks,
Nish


Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

2015-05-05 Thread Nishanth Aravamudan
On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote:
 On 03.04.2015 [09:57:35 +0200], Vlastimil Babka wrote:
  On 03/31/2015 11:48 AM, Michal Hocko wrote:
  On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
  On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
  On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
  @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t 
  *pgdat)
  
    for (i = 0; i <= ZONE_NORMAL; i++) {
   zone = pgdat-node_zones[i];
  -   if (!populated_zone(zone))
  +   if (!populated_zone(zone) || !zone_reclaimable(zone))
   continue;
  
   pfmemalloc_reserve += min_wmark_pages(zone);
  
  Do you really want zone_reclaimable()?  Or do you want something more
  direct like zone_reclaimable_pages(zone) == 0?
  
  Yeah, I guess in my testing this worked out to be the same, since
  zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
  always be false. Thanks!
  
  Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc
  reserves if node has no ZONE_NORMAL) from Mel.
  
  We have a system with the following topology:
  
  # numactl -H
  available: 3 nodes (0,2-3)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
  23 24 25 26 27 28 29 30 31
  node 0 size: 28273 MB
  node 0 free: 27323 MB
  node 2 cpus:
  node 2 size: 16384 MB
  node 2 free: 0 MB
  node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
  node 3 size: 30533 MB
  node 3 free: 13273 MB
  node distances:
  node   0   2   3
 0:  10  20  20
 2:  20  10  20
 3:  20  20  10
  
  Node 2 has no free memory, because:
  # cat 
  /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
  1
  
  This leads to the following zoneinfo:
  
  Node 2, zone  DMA
 pages free 0
   min  1840
   low  2300
   high 2760
   scanned  0
   spanned  262144
   present  262144
   managed  262144
  ...
 all_unreclaimable: 1
  
  Blee, this is a weird configuration.
  
  If one then attempts to allocate some normal 16M hugepages via
  
   echo 37 > /proc/sys/vm/nr_hugepages
  
  The echo never returns and kswapd2 consumes CPU cycles.
  
  This is because throttle_direct_reclaim ends up calling
  wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
  pfmemalloc_watermark_ok() in turn checks all zones on the node if there
  are any reserves, and if so, then indicates the watermarks are ok, by
  seeing if there are sufficient free pages.
  
  675becce15 added a condition already for memoryless nodes. In this case,
  though, the node has memory, it is just all consumed (and not
  reclaimable). Effectively, though, the result is the same on this call
  to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
  condition.
  
  With this change, the afore-mentioned 16M hugepage allocation attempt
  succeeds and correctly round-robins between Nodes 1 and 3.
  
  I am just wondering whether this is the right/complete fix. Don't we
  need a similar treatment at more places?
  I would expect kswapd would be looping endlessly because the zone
  wouldn't be balanced obviously. But I would be wrong... because
  pgdat_balanced is doing this:
 /*
  * A special case here:
  *
  * balance_pgdat() skips over all_unreclaimable after
  * DEF_PRIORITY. Effectively, it considers them balanced so
  * they must be considered balanced here as well!
  */
 if (!zone_reclaimable(zone)) {
  balanced_pages += zone->managed_pages;
 continue;
 }
  
  and zone_reclaimable is false for you as you didn't have any
  zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
  would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
  zones right?) and so the kswapd would be woken up easily. So it looks
  like a mess.
  
  Yeah, looks like a much cleaner/complete solution would be to remove
  such zones from zonelists. But that means covering all situations
  when these hugepages are allocated/removed and the approach then
  looks similar to memory hotplug.
  Also I'm not sure if the ability to actually allocate the reserved
  hugepage would be impossible due to not being reachable by a
  zonelist...
  
  There are possibly other places which rely on populated_zone or
  for_each_populated_zone without checking reclaimability. Are those
  working as expected?
  
  Yeah. At least the wakeup_kswapd case should be fixed IMHO. No point
  in waking it up just to let it immediately go to sleep again.
  
  That being said. I am not objecting to this patch. I am just trying to
  wrap my head around possible issues from such a weird configuration and
  all the consequences.
  
  Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

2015-04-10 Thread Nishanth Aravamudan
On 10.04.2015 [14:37:19 +0300], Konstantin Khlebnikov wrote:
 On 10.04.2015 01:58, Nishanth Aravamudan wrote:
 On 09.04.2015 [07:27:28 +0300], Konstantin Khlebnikov wrote:
 On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan
 n...@linux.vnet.ibm.com wrote:
 On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
 On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
 Node 0 might be offline as well as any other numa node,
 in this case kernel cannot handle memory allocation and crashes.
 
 Isn't the bug that numa_node_id() returned an offline node? That
 shouldn't happen.
 
 Offline node 0 came from static-inline copy of that function from of.h
 I've patched weak function for keeping consistency.
 
 Got it, that's not necessarily clear in the original commit message.
 
 Sorry.
 
 
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 ...
 #ifndef numa_node_id
 /* Returns the number of the current Node. */
 static inline int numa_node_id(void)
 {
  return raw_cpu_read(numa_node);
 }
 #endif
 ...
 #else   /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */
 
 /* Returns the number of the current Node. */
 #ifndef numa_node_id
 static inline int numa_node_id(void)
 {
  return cpu_to_node(raw_smp_processor_id());
 }
 #endif
 ...
 
 So that's either the per-cpu numa_node value, right? Or the result of
 cpu_to_node on the current processor.
 
 Example:
 
 [0.027133] [ cut here ]
 [0.027938] kernel BUG at include/linux/gfp.h:322!
 
 This is
 
  VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
 
 in
 
 alloc_pages_exact_node().
 
 And based on the trace below, that's
 
 __slab_alloc - alloc
 
  alloc_pages_exact_node
   <- alloc_slab_page
   <- allocate_slab
   <- new_slab
   <- new_slab_objects
   <- __slab_alloc?
 
 which is just passing the node value down, right? Which I think was
 from:
 
   domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
			 GFP_KERNEL, of_node_to_nid(of_node));
 
 ?
 
 
 What platform is this on, looks to be x86? qemu emulation of a
 pathological topology? What was the topology?
 
 qemu x86_64, 2 cpu, 2 numa nodes, all memory in second.
 
 Ok, this worked before? That is, this is a regression?
 
 Seems like it worked before 3.17, where the
 bug was exposed by commit 44767bfaaed782d6d635ecbb13f3980041e6f33e
 (x86, irq: Enhance mp_register_ioapic() to support irqdomain) --
 this is the first usage of *irq_domain_add*() in x86.

Ok.

   I've slightly patched it to allow that setup (qemu hardcodes 1Mb
 of memory connected to node 0). And I've found an unrelated bug --
 if a numa node has less than 4Mb of ram, then the kernel crashes even
 earlier, because the numa code ignores that node
 but the buddy allocator still tries to use its pages.
 
 So this isn't an actually supported topology by qemu?
 
 Qemu easily creates memoryless numa nodes, but node 0 has a hardcoded
 1Mb of ram. This seems like a legacy prop for DOS-era software.

Well, the problem is that x86 doesn't support memoryless nodes.

git grep MEMORYLESS_NODES
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES

 Note that there is a ton of code that seems to assume node 0 is online.
 I started working on removing this assumption myself and it just led
 down a rathole (on power, we always have node 0 online, even if it is
 memoryless and cpuless, as a result).
 
 I am guessing this is just happening early in boot before the per-cpu
 areas are setup? That's why (I think) x86 has the early_cpu_to_node()
 function...
 
 Or do you not have CONFIG_OF set? So isn't the only change necessary to
 the include file, and it should just return first_online_node rather
 than 0?
 
 Ah and there's more of those node 0 assumptions :)
 
 That was x86, where there is no CONFIG_OF at all.
 
 I don't know what's wrong with that machine, but ACPI reports the
 cpus and memory of node 0 as connected to node 1, and everything
 seemed to work fine until the latest upgrade -- it seems the buggy
 static-inline of_node_to_nid was introduced in 3.13, but x86 ioapic
 has used it during early allocations only since 3.17. The machine
 owner tells me that 3.15 worked fine.
 
 So, this was a qemu emulation of this actual physical machine without a
 node 0?
 
 Yep. Also I have a crash from a real machine, but that stacktrace is
 messy because CONFIG_DEBUG_VM wasn't enabled and the kernel crashed
 inside the buddy allocator when it tried to touch an unallocated numa
 node structure.
 
 
 As I mentioned, there are lots of node 0 assumptions throughout the kernel.
 You might run into more issues at runtime.
 
 I think it's possible to trigger a kernel crash for any memoryless numa
 node (not just for 0) if some device (like the ioapic in my case) points to
 it in its acpi tables. At runtime, numa affinity configured by the user is
 usually validated by the kernel, while numbers from firmware might
 be used without

Re: Topology updates and NUMA-level sched domains

2015-04-10 Thread Nishanth Aravamudan
On 10.04.2015 [11:08:10 +0200], Peter Zijlstra wrote:
 On Fri, Apr 10, 2015 at 10:31:53AM +0200, Peter Zijlstra wrote:
  Please, step back, look at what you're doing and ask yourself, will any
  sane person want to use this? Can they use this?
  
  If so, start by describing the desired user semantics of this work.
   Don't start by cobbling kernel bits together until it stops crashing.
 
 Also, please talk to your s390 folks. They've long since realized that
 this doesn't work. They've just added big honking caches (their BOOK
 stuff) and pretend the system is UMA.

Interesting, I will take a look.

-Nish


Re: Topology updates and NUMA-level sched domains

2015-04-10 Thread Nishanth Aravamudan
On 10.04.2015 [10:31:53 +0200], Peter Zijlstra wrote:
 On Thu, Apr 09, 2015 at 03:29:56PM -0700, Nishanth Aravamudan wrote:
   No, that's very much not the same. Even if it were dealing with hotplug
   it would still assume the cpu to return to the same node.
  
  The analogy may have been poor; a better one is: it's the same as
  hotunplugging a CPU from one node and hotplugging a physically identical
  CPU on a different node.
 
 Then it'll not be the same cpu from the OS's pov. The outgoing cpus and
 the incoming cpus will have different cpu numbers.

Right, it's an analogy. I understand it's not the exact same. I was
trying to have a civil discussion about how to solve this problem
without you calling me a wanker.

 Furthermore at boot we will have observed the empty socket and reserved
 cpu number and arranged per-cpu resources for them.

Ok, I see what you're referring to now:

static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
{
	return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
				    __pa(MAX_DMA_ADDRESS));
}

So we'll be referring to bootmem in the pcpu path for the node we were
on at boot-time.

Actually, this is already horribly broken on power.

[0.00] pcpu-alloc: [0] 000 001 002 003 [0] 004 005 006 007 
[0.00] pcpu-alloc: [0] 008 009 010 011 [0] 012 013 014 015 
[0.00] pcpu-alloc: [0] 016 017 018 019 [0] 020 021 022 023 
[0.00] pcpu-alloc: [0] 024 025 026 027 [0] 028 029 030 031 
[0.00] pcpu-alloc: [0] 032 033 034 035 [0] 036 037 038 039 
[0.00] pcpu-alloc: [0] 040 041 042 043 [0] 044 045 046 047 
[0.00] pcpu-alloc: [0] 048 049 050 051 [0] 052 053 054 055 
[0.00] pcpu-alloc: [0] 056 057 058 059 [0] 060 061 062 063 
[0.00] pcpu-alloc: [0] 064 065 066 067 [0] 068 069 070 071 
[0.00] pcpu-alloc: [0] 072 073 074 075 [0] 076 077 078 079 
[0.00] pcpu-alloc: [0] 080 081 082 083 [0] 084 085 086 087 
[0.00] pcpu-alloc: [0] 088 089 090 091 [0] 092 093 094 095 
[0.00] pcpu-alloc: [0] 096 097 098 099 [0] 100 101 102 103 
[0.00] pcpu-alloc: [0] 104 105 106 107 [0] 108 109 110 111 
[0.00] pcpu-alloc: [0] 112 113 114 115 [0] 116 117 118 119 
[0.00] pcpu-alloc: [0] 120 121 122 123 [0] 124 125 126 127 
[0.00] pcpu-alloc: [0] 128 129 130 131 [0] 132 133 134 135 
[0.00] pcpu-alloc: [0] 136 137 138 139 [0] 140 141 142 143 
[0.00] pcpu-alloc: [0] 144 145 146 147 [0] 148 149 150 151 
[0.00] pcpu-alloc: [0] 152 153 154 155 [0] 156 157 158 159 

even though the topology is:
available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32
node 1 cpus: 40 48 56 64 72
node 16 cpus: 80 88 96 104 112
node 17 cpus: 120 128 136 144 152

The comment for pcpu_build_alloc_info() seems wrong:

The returned configuration is guaranteed
 * to have CPUs on different nodes on different groups and >=75% usage
 * of allocated virtual address space.

Or, we're returning node 0 for everything at this point. I'll debug it
further, but one more question:

If we have CONFIG_USE_PERCPU_NUMA_NODE_ID and are using cpu_to_node()
for setting up the per-cpu areas itself, isn't that a problem?
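
For reference, under CONFIG_USE_PERCPU_NUMA_NODE_ID the generic
cpu_to_node() is just a per-cpu read -- roughly, from
include/linux/topology.h -- which is what makes calling it while the
per-cpu areas are still being created look circular:

#ifndef cpu_to_node
static inline int cpu_to_node(int cpu)
{
	/* reads a per-cpu variable, so the per-cpu areas must already exist */
	return per_cpu(numa_node, cpu);
}
#endif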

   People very much assume that when they set up their node affinities they
   will remain the same for the life time of their program. People set
   separate cpu affinity with sched_setaffinity() and memory affinity with
   mbind() and assume the cpu-node maps are invariant.
  
  That's a bad assumption to make if you're virtualized, I would think
  (including on KVM). Unless you're also binding your vcpu threads to
  physical cpus.
  
  But the point is valid, that userspace does tend to think rather
  statically about the world.
 
 I've no idea how KVM numa is working, if at all. I would not be
 surprised if it indeed hard binds vcpus to nodes. Not doing that allows
 the vcpus to randomly migrate between nodes which will completely
 destroy the whole point of exposing numa details to the guest.

Well, you *can* bind vcpus to nodes. But you don't have to.

 I suppose some of the auto-numa work helps here. not sure at all.

Yes, it does, I think.

I'll look into the per-cpu memory case.
   
   Look into everything that does cpu_to_node() based allocations, because
   they all assume that that is stable.
  
   They allocate memory at init time to be node local, but they you go an
   mess that up.
  
  So, the case that you're considering is:
  
  CPU X on Node Y at boot-time, gets memory from Node Y.
  
  CPU X moves to Node Z at run-time, is still using memory from Node Y.
 
 Right, at which point numa doesn't make sense anymore. If you randomly
 scramble your cpu-node map what's the point of exposing numa to the
 guest?
 
 The whole point of NUMA is that userspace can be aware of the layout and
 use local memory where possible.
 
 Nobody will want to consider dynamic NUMA information; its utterly
 insane; do you

Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

2015-04-08 Thread Nishanth Aravamudan
On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
 On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
 Node 0 might be offline as well as any other numa node,
 in this case kernel cannot handle memory allocation and crashes.

Isn't the bug that numa_node_id() returned an offline node? That
shouldn't happen.

#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
...
#ifndef numa_node_id
/* Returns the number of the current Node. */
static inline int numa_node_id(void)
{
return raw_cpu_read(numa_node);
}
#endif
...
#else   /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */

/* Returns the number of the current Node. */
#ifndef numa_node_id
static inline int numa_node_id(void)
{
return cpu_to_node(raw_smp_processor_id());
}
#endif
...

So that's either the per-cpu numa_node value, right? Or the result of
cpu_to_node on the current processor.

 Example:
 
 [0.027133] [ cut here ]
 [0.027938] kernel BUG at include/linux/gfp.h:322!

This is 

VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

in

alloc_pages_exact_node().

And based on the trace below, that's

__slab_alloc -> alloc

alloc_pages_exact_node
 <- alloc_slab_page
 <- allocate_slab
 <- new_slab
 <- new_slab_objects
 <- __slab_alloc?

which is just passing the node value down, right? Which I think was
from:

domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
  GFP_KERNEL, of_node_to_nid(of_node));

?


What platform is this on, looks to be x86? qemu emulation of a
pathological topology? What was the topology?

Note that there is a ton of code that seems to assume node 0 is online.
I started working on removing this assumption myself and it just led
down a rathole (on power, we always have node 0 online, even if it is
memoryless and cpuless, as a result).

I am guessing this is just happening early in boot before the per-cpu
areas are setup? That's why (I think) x86 has the early_cpu_to_node()
function...

Or do you not have CONFIG_OF set? So isn't the only change necessary to
the include file, and it should just return first_online_node rather
than 0?

Ah and there's more of those node 0 assumptions :)

#define first_online_node   0
#define first_memory_node   0

if MAX_NUMNODES == 1...
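
For reference, a minimal sketch of the direction the subject line of
this thread points at -- the CONFIG_OF=n fallback of_node_to_nid() in
include/linux/of.h would stop hardcoding a node that may be offline:

static inline int of_node_to_nid(struct device_node *device)
{
	return NUMA_NO_NODE;	/* rather than a hardcoded, possibly offline, 0 */
}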

-Nish


Re: Topology updates and NUMA-level sched domains

2015-04-07 Thread Nishanth Aravamudan
On 07.04.2015 [12:21:47 +0200], Peter Zijlstra wrote:
 On Mon, Apr 06, 2015 at 02:45:58PM -0700, Nishanth Aravamudan wrote:
  Hi Peter,
  
  As you are very aware, I think, power has some odd NUMA topologies (and
changes to those topologies) at run-time. In particular, we can see
  a topology at boot:
  
  Node 0: all Cpus
  Node 7: no cpus
  
  Then we get a notification from the hypervisor that a core (or two) have
  moved from node 0 to node 7. This results in the:
 
  or a re-init API (which won't try to reallocate various bits), because
  the topology could be completely different now (e.g.,
  sched_domains_numa_distance will also be inaccurate now).  Really, a
  topology update on power (not sure on s390x, but those are the only two
  archs that return a positive value from arch_update_cpu_topology() right
  now, afaics) is a lot like a hotplug event and we need to re-initialize
  any dependent structures.
  
  I'm just sending out feelers, as we can limp by with the above warning,
  it seems, but is less than ideal. Any help or insight you could provide
  would be greatly appreciated!
 
 So I think (and ISTR having stated this before) that dynamic cpu-node
 maps are absolutely insane.

Sorry if I wasn't involved at the time. I agree that it's a bit of a
mess!

 There is a ton of stuff that assumes the cpu-node relation is a boot
 time fixed one. Userspace being one of them. Per-cpu memory another.

Well, userspace already deals with CPU hotplug, right? And the topology
updates are, in a lot of ways, just like you've hotplugged a CPU from
one node and re-hotplugged it into another node.

I'll look into the per-cpu memory case.

For what it's worth, our test teams are stressing the kernel with these
topology updates and hopefully we'll be able to resolve any issues that
result.

 You simply cannot do this without causing massive borkage.
 
 So please come up with a coherent plan to deal with the entire problem
 of dynamic cpu to memory relation and I might consider the scheduler
 impact. But we're not going to hack around and maybe make it not crash
 in a few corner cases while the entire thing is shite.

Well, it doesn't crash now. In fact, it stays up reasonably well and
seems to dtrt (from the kernel perspective) other than the sched domain
messages.

I will look into per-cpu memory, and also another case I have been
thinking about where if a process is bound to a CPU/node combination via
numactl and then the topology changes, what exactly will happen. In
theory, via these topology updates, a node could go from memoryless ->
not and vice versa, which seems like it might not be well supported (but
again, should not be much different from hotplugging all the memory out
from a node).

And, in fact, I think topologically speaking, I think I should be able
to repeat the same sched domain warnings if I start off with a 2-node
system with all CPUs on one node, and then hotplug a CPU onto the second
node, right? That has nothing to do with power, that I can tell. I'll
see if I can demonstrate it via a KVM guest.

Thanks for your quick response!

-Nish


Topology updates and NUMA-level sched domains

2015-04-06 Thread Nishanth Aravamudan
Hi Peter,

As you are very aware, I think, power has some odd NUMA topologies (and
changes to those topologies) at run-time. In particular, we can see
a topology at boot:

Node 0: all Cpus
Node 7: no cpus

Then we get a notification from the hypervisor that a core (or two) have
moved from node 0 to node 7. This results in the:

[   64.496687] BUG: arch topology borken
[   64.496689]  the CPU domain not a subset of the NUMA domain

messages for each moved CPU. I think this is because when we first came
up, we degrade (elide altogether?) the NUMA domain for node 7 as it has
no CPUs:

[0.305823] CPU0 attaching sched-domain:
[0.305831]  domain 0: span 0-7 level SIBLING
[0.305834]   groups: 0 (cpu_power = 146) 1 (cpu_power = 146) 2
(cpu_power = 146) 3 (cpu_power = 146) 4 (cpu_power = 146) 5 (cpu_power =
146) 6 (cpu_power = 146) 7 (cpu_power = 146)
[0.305854]   domain 1: span 0-79 level CPU
[0.305856]groups: 0-7 (cpu_power = 1168) 8-15 (cpu_power = 1168)
16-23 (cpu_power = 1168) 24-31 (cpu_power = 1168) 32-39 (cpu_power =
1168) 40-47 (cpu_power = 1168) 48-55 (cpu_power = 1168) 56-63 (cpu_power
= 1168) 64-71 (cpu_power = 1168) 72-79 (cpu_power = 1168)

For those cpus that moved, we get after the update:

[   64.505819] CPU8 attaching sched-domain:
[   64.505821]  domain 0: span 8-15 level SIBLING
[   64.505823]   groups: 8 (cpu_power = 147) 9 (cpu_power = 147) 10
(cpu_power = 147) 11 (cpu_power = 146) 12 (cpu_power = 147) 13
(cpu_power = 147) 14 (cpu_power = 146) 15 (cpu_power = 147)
[   64.505842]   domain 1: span 8-23,72-79 level CPU
[   64.505845]groups: 8-15 (cpu_power = 1174) 16-23 (cpu_power =
1175) 72-79 (cpu_power = 1176)

while the non-modified CPUs report, correctly:

[   64.497186] CPU0 attaching sched-domain:
[   64.497189]  domain 0: span 0-7 level SIBLING
[   64.497192]   groups: 0 (cpu_power = 147) 1 (cpu_power = 147) 2
(cpu_power = 146) 3 (cpu_power = 147) 4 (cpu_power = 147) 5 (cpu_power =
147) 6 (cpu_power = 147) 7 (cpu_power = 146)
[   64.497213]   domain 1: span 0-7,24-71 level CPU
[   64.497215]groups: 0-7 (cpu_power = 1174) 24-31 (cpu_power =
1173) 32-39 (cpu_power = 1176) 40-47 (cpu_power = 1175) 48-55 (cpu_power
= 1176) 56-63 (cpu_power = 1175) 64-71 (cpu_power = 1174)
[   64.497234]domain 2: span 0-79 level NUMA
[   64.497236] groups: 0-7,24-71 (cpu_power = 8223) 8-23,72-79
(cpu_power = 3525)

It seems like we might need something like this (HORRIBLE HACK, I know,
just to get discussion):

@@ -6958,6 +6960,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 
 	/* Let architecture update cpu core mappings. */
 	new_topology = arch_update_cpu_topology();
+	/* Update NUMA topology lists */
+	if (new_topology) {
+		sched_init_numa();
+	}
 
 	n = doms_new ? ndoms_new : 0;

or a re-init API (which won't try to reallocate various bits), because
the topology could be completely different now (e.g.,
sched_domains_numa_distance will also be inaccurate now).  Really, a
topology update on power (not sure on s390x, but those are the only two
archs that return a positive value from arch_update_cpu_topology() right
now, afaics) is a lot like a hotplug event and we need to re-initialize
any dependent structures.

I'm just sending out feelers, as we can limp by with the above warning,
it seems, but is less than ideal. Any help or insight you could provide
would be greatly appreciated!

-Nish


Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

2015-04-03 Thread Nishanth Aravamudan
On 03.04.2015 [20:24:45 +0200], Michal Hocko wrote:
 On Fri 03-04-15 10:43:57, Nishanth Aravamudan wrote:
  On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
 [...]
   I would expect kswapd would be looping endlessly because the zone
   wouldn't be balanced obviously. But I would be wrong... because
   pgdat_balanced is doing this:
 /*
  * A special case here:
  *
  * balance_pgdat() skips over all_unreclaimable after
  * DEF_PRIORITY. Effectively, it considers them balanced so
  * they must be considered balanced here as well!
  */
 if (!zone_reclaimable(zone)) {
  balanced_pages += zone->managed_pages;
 continue;
 }
   
   and zone_reclaimable is false for you as you didn't have any
   zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
   would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
   zones right?) and so the kswapd would be woken up easily. So it looks
   like a mess.
  
  My understanding, and I could easily be wrong, is that kswapd2 (node 2
  is the exhausted one) spins endlessly, because the reclaim logic sees
  that we are reclaiming from somewhere but the allocation request for
  node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
  never complete, so we just continue to reclaim.
 
 __GFP_THISNODE would be waking up kswapd2 again and again, that is true.

Right, one idea I had for this was ensuring that we perform reclaim with
some knowledge of __GFP_THISNODE -- that is, it needs to be
somewhat targeted in order to actually help satisfy the current
allocation. But it got pretty hairy fast and I didn't want to break the
world :)
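
A purely hypothetical sketch of that idea -- the helper and the
per-node bookkeeping below do not exist in the kernel, they are made up
here just to illustrate what "targeted" would mean:

/*
 * Hypothetical: under __GFP_THISNODE, only reclaim progress on the
 * target node should count; progress elsewhere can never satisfy the
 * allocation, so counting it just makes us loop.
 */
static bool reclaim_progress_useful(unsigned long *reclaimed_per_node,
				    int target_nid, gfp_t gfp_mask)
{
	int nid;

	if (gfp_mask & __GFP_THISNODE)
		return reclaimed_per_node[target_nid] != 0;

	for_each_online_node(nid)
		if (reclaimed_per_node[nid])
			return true;
	return false;
}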

 I am just wondering whether we will have any __GFP_THISNODE allocations
 for a node without CPUs (numa_node_id() shouldn't return such a node
 AFAICS). Maybe if somebody is bound to Node2 explicitly but I would
 consider this as a misconfiguration.

Right, I'd need to check what happens if in our setup you taskset to
node2 and tried to force memory to be local -- I think you'd either be
killed immediately, or the kernel will just disagree with your binding
since it's invalid (e.g., that will happen if you try to bind to a
memoryless node, I think).

Keep in mind that although in my config node2 had no CPUs, that's not a
hard-and-fast requirement. I do believe in a previous iteration of this
bug, the exhausted node had no free memory but did have cpus assigned to
it.

-Nish


Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

2015-04-03 Thread Nishanth Aravamudan
On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
 On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
  On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
   On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t 
*pgdat)
 
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
-   if (!populated_zone(zone))
+   if (!populated_zone(zone) || !zone_reclaimable(zone))
continue;
 
pfmemalloc_reserve += min_wmark_pages(zone);
   
   Do you really want zone_reclaimable()?  Or do you want something more
   direct like zone_reclaimable_pages(zone) == 0?
  
  Yeah, I guess in my testing this worked out to be the same, since
  zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
  always be false. Thanks!
  
  Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc
  reserves if node has no ZONE_NORMAL) from Mel.
  
  We have a system with the following topology:
  
  # numactl -H
  available: 3 nodes (0,2-3)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
  23 24 25 26 27 28 29 30 31
  node 0 size: 28273 MB
  node 0 free: 27323 MB
  node 2 cpus:
  node 2 size: 16384 MB
  node 2 free: 0 MB
  node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
  node 3 size: 30533 MB
  node 3 free: 13273 MB
  node distances:
  node   0   2   3
0:  10  20  20
2:  20  10  20
3:  20  20  10
  
  Node 2 has no free memory, because:
  # cat 
  /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
  1
  
  This leads to the following zoneinfo:
  
  Node 2, zone  DMA
pages free 0
  min  1840
  low  2300
  high 2760
  scanned  0
  spanned  262144
  present  262144
  managed  262144
  ...
all_unreclaimable: 1
 
 Blee, this is a weird configuration.

Yep, super gross. It's relatively rare in the field, thankfully. But 16G
pages definitely make it pretty likely to hit (as in, I've seen it once
before :)
 
  If one then attempts to allocate some normal 16M hugepages via
  
echo 37 > /proc/sys/vm/nr_hugepages
  
  The echo never returns and kswapd2 consumes CPU cycles.
  
  This is because throttle_direct_reclaim ends up calling
  wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
  pfmemalloc_watermark_ok() in turn checks all zones on the node to see if
  there are any reserves, and if so, indicates the watermarks are ok, by
  seeing if there are sufficient free pages.
  
  675becce15 added a condition already for memoryless nodes. In this case,
  though, the node has memory, it is just all consumed (and not
  reclaimable). Effectively, though, the result is the same on this call
  to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
  condition.
  
  With this change, the afore-mentioned 16M hugepage allocation attempt
  succeeds and correctly round-robins between Nodes 1 and 3.
 
 I am just wondering whether this is the right/complete fix. Don't we
 need a similar treatment at more places?

Almost certainly needs an audit. Exhausted nodes are tough to reproduce
easily (fully exhausted, that is), for me.

 I would expect kswapd would be looping endlessly because the zone
 wouldn't be balanced obviously. But I would be wrong... because
 pgdat_balanced is doing this:
   /*
* A special case here:
*
* balance_pgdat() skips over all_unreclaimable after
* DEF_PRIORITY. Effectively, it considers them balanced so
* they must be considered balanced here as well!
*/
   if (!zone_reclaimable(zone)) {
    balanced_pages += zone->managed_pages;
   continue;
   }
 
 and zone_reclaimable is false for you as you didn't have any
 zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
 would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
 zones right?) and so the kswapd would be woken up easily. So it looks
 like a mess.

My understanding, and I could easily be wrong, is that kswapd2 (node 2
is the exhausted one) spins endlessly, because the reclaim logic sees
that we are reclaiming from somewhere but the allocation request for
node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
never complete, so we just continue to reclaim.

 There are possibly other places which rely on populated_zone or
 for_each_populated_zone without checking reclaimability. Are those
 working as expected?

Not yet verified, admittedly.

 That being said. I am not objecting to this patch. I am just trying to
 wrap my head around possible issues from such a weird configuration and
 all the consequences.

Yeah, there are almost certainly more. Luckily, our test

Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

2015-04-03 Thread Nishanth Aravamudan
On 03.04.2015 [09:57:35 +0200], Vlastimil Babka wrote:
 On 03/31/2015 11:48 AM, Michal Hocko wrote:
 On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
 On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
 On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
 @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t 
 *pgdat)
 
  for (i = 0; i <= ZONE_NORMAL; i++) {
  zone = &pgdat->node_zones[i];
 -   if (!populated_zone(zone))
 +   if (!populated_zone(zone) || !zone_reclaimable(zone))
  continue;
 
  pfmemalloc_reserve += min_wmark_pages(zone);
 
 Do you really want zone_reclaimable()?  Or do you want something more
 direct like zone_reclaimable_pages(zone) == 0?
 
 Yeah, I guess in my testing this worked out to be the same, since
 zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
 always be false. Thanks!
 
 Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc
 reserves if node has no ZONE_NORMAL) from Mel.
 
 We have a system with the following topology:
 
 # numactl -H
 available: 3 nodes (0,2-3)
 node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29 30 31
 node 0 size: 28273 MB
 node 0 free: 27323 MB
 node 2 cpus:
 node 2 size: 16384 MB
 node 2 free: 0 MB
 node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 node 3 size: 30533 MB
 node 3 free: 13273 MB
 node distances:
 node   0   2   3
0:  10  20  20
2:  20  10  20
3:  20  20  10
 
 Node 2 has no free memory, because:
 # cat 
 /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
 1
 
 This leads to the following zoneinfo:
 
 Node 2, zone  DMA
pages free 0
  min  1840
  low  2300
  high 2760
  scanned  0
  spanned  262144
  present  262144
  managed  262144
 ...
all_unreclaimable: 1
 
 Blee, this is a weird configuration.
 
 If one then attempts to allocate some normal 16M hugepages via
 
 echo 37 > /proc/sys/vm/nr_hugepages
 
 The echo never returns and kswapd2 consumes CPU cycles.
 
 This is because throttle_direct_reclaim ends up calling
 wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
 pfmemalloc_watermark_ok() in turn checks all zones on the node to see if
 there are any reserves, and if so, indicates the watermarks are ok, by
 seeing if there are sufficient free pages.
 
 675becce15 added a condition already for memoryless nodes. In this case,
 though, the node has memory, it is just all consumed (and not
 reclaimable). Effectively, though, the result is the same on this call
 to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
 condition.
 
 With this change, the afore-mentioned 16M hugepage allocation attempt
 succeeds and correctly round-robins between Nodes 1 and 3.
 
 I am just wondering whether this is the right/complete fix. Don't we
 need a similar treatment at more places?
 I would expect kswapd would be looping endlessly because the zone
 wouldn't be balanced obviously. But I would be wrong... because
 pgdat_balanced is doing this:
  /*
   * A special case here:
   *
   * balance_pgdat() skips over all_unreclaimable after
   * DEF_PRIORITY. Effectively, it considers them balanced so
   * they must be considered balanced here as well!
   */
  if (!zone_reclaimable(zone)) {
  balanced_pages += zone->managed_pages;
  continue;
  }
 
 and zone_reclaimable is false for you as you didn't have any
 zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
 would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
 zones right?) and so the kswapd would be woken up easily. So it looks
 like a mess.
 
 Yeah, looks like a much cleaner/complete solution would be to remove
 such zones from zonelists. But that means covering all situations
 when these hugepages are allocated/removed and the approach then
 looks similar to memory hotplug.
 Also I'm not sure if the ability to actually allocate the reserved
 hugepage would be impossible due to not being reachable by a
 zonelist...
 
 There are possibly other places which rely on populated_zone or
 for_each_populated_zone without checking reclaimability. Are those
 working as expected?
 
 Yeah. At least the wakeup_kswapd case should be fixed IMHO. No point
 in waking it up just to let it immediately go to sleep again.
 
 That being said. I am not objecting to this patch. I am just trying to
 wrap my head around possible issues from such a weird configuration and
 all the consequences.
 
 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 
 The patch as is doesn't seem to be harmful.
 
 Reviewed-by: Michal Hocko mho...@suse.cz
 
 ---
 v1 -> v2:
    Check against zone_reclaimable_pages, rather than zone_reclaimable,
    based upon feedback from Dave Hansen.

[PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones

2015-03-27 Thread Nishanth Aravamudan
Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc
reserves if node has no ZONE_NORMAL) from Mel.

We have a system with the following topology:

(0) root @ br30p03: /root
# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3 
  0:  10  20  20 
  2:  20  10  20 
  3:  20  20  10 

Node 2 has no free memory, because:

# cat 
/sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 
1

This leads to the following zoneinfo:

Node 2, zone  DMA
  pages free 0
min  1840
low  2300
high 2760
scanned  0
spanned  262144
present  262144
managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages:

echo 37 > /proc/sys/vm/nr_hugepages

The echo never returns and kswapd2 consumes CPU cycles.

This is because throttle_direct_reclaim ends up calling
wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
pfmemalloc_watermark_ok() in turn checks all zones on the node to see
if there are any reserves, and if so, indicates the watermarks
are ok, by seeing if there are sufficient free pages.

675becce15 added a condition already for memoryless nodes. In this case,
though, the node has memory, it is just all consumed (and not
reclaimable). Effectively, though, the result is the same on this
call to pfmemalloc_watermark_ok() and thus seems like a reasonable
additional condition.

With this change, the afore-mentioned 16M hugepage allocation succeeds
and correctly round-robins between Nodes 1 and 3.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcd90c8..033c2b7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
-   if (!populated_zone(zone))
+   if (!populated_zone(zone) || !zone_reclaimable(zone))
continue;
 
pfmemalloc_reserve += min_wmark_pages(zone);


Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones

2015-03-27 Thread Nishanth Aravamudan
[ Sorry, typo'd anton's address ]

On 27.03.2015 [12:28:50 -0700], Nishanth Aravamudan wrote:
 Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc
 reserves if node has no ZONE_NORMAL) from Mel.
 
 We have a system with the following topology:
 
 (0) root @ br30p03: /root
 # numactl -H
 available: 3 nodes (0,2-3)
 node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29 30 31
 node 0 size: 28273 MB
 node 0 free: 27323 MB
 node 2 cpus:
 node 2 size: 16384 MB
 node 2 free: 0 MB
 node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 node 3 size: 30533 MB
 node 3 free: 13273 MB
 node distances:
 node   0   2   3 
   0:  10  20  20 
   2:  20  10  20 
   3:  20  20  10 
 
 Node 2 has no free memory, because:
 
 # cat 
 /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 
 1
 
 This leads to the following zoneinfo:
 
 Node 2, zone  DMA
   pages free 0
 min  1840
 low  2300
 high 2760
 scanned  0
 spanned  262144
 present  262144
 managed  262144
 ...
   all_unreclaimable: 1
 
 If one then attempts to allocate some normal 16M hugepages:
 
 echo 37 > /proc/sys/vm/nr_hugepages
 
 The echo never returns and kswapd2 consumes CPU cycles.
 
 This is because throttle_direct_reclaim ends up calling
 wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
 pfmemalloc_watermark_ok() in turn checks all zones on the node to see
 if there are any reserves, and if so, indicates the watermarks
 are ok, by seeing if there are sufficient free pages.
 
 675becce15 added a condition already for memoryless nodes. In this case,
 though, the node has memory, it is just all consumed (and not
 reclaimable). Effectively, though, the result is the same on this
 call to pfmemalloc_watermark_ok() and thus seems like a reasonable
 additional condition.
 
 With this change, the afore-mentioned 16M hugepage allocation succeeds
 and correctly round-robins between Nodes 1 and 3.
 
 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index dcd90c8..033c2b7 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
  
 for (i = 0; i <= ZONE_NORMAL; i++) {
 zone = &pgdat->node_zones[i];
 -   if (!populated_zone(zone))
 +   if (!populated_zone(zone) || !zone_reclaimable(zone))
 continue;
  
 pfmemalloc_reserve += min_wmark_pages(zone);
 


[PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

2015-03-27 Thread Nishanth Aravamudan
On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
 On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
  @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
   
  for (i = 0; i <= ZONE_NORMAL; i++) {
  zone = &pgdat->node_zones[i];
  -   if (!populated_zone(zone))
  +   if (!populated_zone(zone) || !zone_reclaimable(zone))
  continue;
   
  pfmemalloc_reserve += min_wmark_pages(zone);
 
 Do you really want zone_reclaimable()?  Or do you want something more
 direct like zone_reclaimable_pages(zone) == 0?

Yeah, I guess in my testing this worked out to be the same, since
zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
always be false. Thanks!

Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc
reserves if node has no ZONE_NORMAL) from Mel.

We have a system with the following topology:

# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3
  0:  10  20  20
  2:  20  10  20
  3:  20  20  10

Node 2 has no free memory, because:
# cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
1

This leads to the following zoneinfo:

Node 2, zone  DMA
  pages free 0
min  1840
low  2300
high 2760
scanned  0
spanned  262144
present  262144
managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages via

echo 37 > /proc/sys/vm/nr_hugepages

The echo never returns and kswapd2 consumes CPU cycles.

This is because throttle_direct_reclaim ends up calling
wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
pfmemalloc_watermark_ok() in turn checks all zones on the node to see if
there are any reserves, and if so, indicates the watermarks are ok, by
seeing if there are sufficient free pages.
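
For context, a condensed sketch of the check being described,
paraphrasing mm/vmscan.c of this era (the real function also wakes
kswapd while throttling); the !pfmemalloc_reserve test is the
memoryless-node condition 675becce15 added, discussed below:

static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
{
	struct zone *zone;
	unsigned long pfmemalloc_reserve = 0;
	unsigned long free_pages = 0;
	int i;

	for (i = 0; i <= ZONE_NORMAL; i++) {
		zone = &pgdat->node_zones[i];
		if (!populated_zone(zone))
			continue;

		pfmemalloc_reserve += min_wmark_pages(zone);
		free_pages += zone_page_state(zone, NR_FREE_PAGES);
	}

	/* If there are no reserves (unexpected config) the node is balanced */
	if (!pfmemalloc_reserve)
		return true;

	return free_pages > pfmemalloc_reserve / 2;
}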

675becce15 added a condition already for memoryless nodes. In this case,
though, the node has memory, it is just all consumed (and not
reclaimable). Effectively, though, the result is the same on this call
to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
condition.

With this change, the afore-mentioned 16M hugepage allocation attempt
succeeds and correctly round-robins between Nodes 1 and 3.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

---
v1 -> v2:
  Check against zone_reclaimable_pages, rather than zone_reclaimable,
  based upon feedback from Dave Hansen.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c627fa4c991f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
-   if (!populated_zone(zone))
+   if (!populated_zone(zone) ||
+   zone_reclaimable_pages(zone) == 0)
continue;
 
pfmemalloc_reserve += min_wmark_pages(zone);


Re: new decimal conversion - seeking testers

2015-03-12 Thread Nishanth Aravamudan
On 13.03.2015 [00:09:19 +0100], Rasmus Villemoes wrote:
 Hi,
 
 I've proposed a new implementation of decimal conversion for
 lib/vsprintf.c; see
 http://thread.gmane.org/gmane.linux.kernel/1892035/focus=1905478.
 Benchmarking so far shows 25-50% (depending on distribution of input
 numbers) improvement on x86_64 and 10-30% on various 32 bit platforms.
 
 Since the new code plays a little endianness game I would really
 appreciate it if someone here would run the test and verification code
 on ppc.
 
 The code is on github, https://github.com/Villemoes/dec, and it should
 be as simple as
 
git clone https://github.com/Villemoes/dec.git
cd dec
make
./test   # benchmark
./verify # correctness
 
 though I can't blame you if you want to inspect the code before
 compiling and running something some stranger asks you to download :-)
 See also the README file.
 
 If 'make' doesn't work out-of-the-box, I'd also like to hear from you.

On a ppc64le box:

./test 
Distribution  Function nsecs/conv   Conv/1 sec  
uniform([10, 2^64-1]) linux_put_dec   56.04 17785895
uniform([10, 2^64-1]) rv_put_dec  31.97 31190888
  +/-   -42.94%  +75.37%
3 + neg_binom(0.05)   linux_put_dec   29.55 32986465
3 + neg_binom(0.05)   rv_put_dec  24.61 39416630
  +/-   -16.71%  +19.49%
3 + neg_binom(0.10)   linux_put_dec   22.16 43993836
3 + neg_binom(0.10)   rv_put_dec  18.76 50767222
  +/-   -15.34%  +15.40%
3 + neg_binom(0.15)   linux_put_dec   18.97 51272565
3 + neg_binom(0.15)   rv_put_dec  16.18 58328176
  +/-   -14.70%  +13.76%
3 + neg_binom(0.20)   linux_put_dec   16.79 57792783
3 + neg_binom(0.20)   rv_put_dec  14.03 66418077
  +/-   -16.45%  +14.92%
3 + neg_binom(0.50)   linux_put_dec   10.81 89762669
3 + neg_binom(0.50)   rv_put_dec   9.40104336963
  +/-   -13.08%  +16.24%

./verify 
Using 16 threads
Checking [10, 100] and [18446744063709551615,
18446744073709551615]
Thread 9: low range ok
Thread 13: low range ok
Thread 10: low range ok
Thread 15: low range ok
Thread 7: low range ok
Thread 5: low range ok
Thread 14: low range ok
Thread 0: low range ok
Thread 3: low range ok
Thread 1: low range ok
Thread 2: low range ok
Thread 4: low range ok
Thread 8: low range ok
Thread 12: low range ok
Thread 6: low range ok
Thread 11: low range ok
Thread 9: high range ok
Thread 10: high range ok
Thread 7: high range ok
Thread 13: high range ok
Thread 0: high range ok
Thread 14: high range ok
Thread 15: high range ok
Thread 5: high range ok
Thread 1: high range ok
Thread 8: high range ok
Thread 11: high range ok
Thread 6: high range ok
Thread 2: high range ok
Thread 12: high range ok
Thread 3: high range ok
Thread 4: high range ok
Thread 9: mid range ok
Thread 0: mid range ok
Thread 14: mid range ok
Thread 7: mid range ok
Thread 10: mid range ok
Thread 8: mid range ok
Thread 2: mid range ok
Thread 11: mid range ok
Thread 13: mid range ok
Thread 1: mid range ok
Thread 6: mid range ok
Thread 15: mid range ok
Thread 5: mid range ok
Thread 4: mid range ok
Thread 3: mid range ok
Thread 12: mid range ok
Distribution of lengths checked:
1   5
2   101
3   900
4   9000
5   9
6   90
7   900
8   9000
9   9
10  90
11  81
12  848
13  8384
14  83808
15  838192
16  8381904
17  83819040
18  838190304
19  8381903184
20  17866643425

On a ppc64 box:

./test 
Distribution  Function nsecs/conv   Conv/1 sec  
uniform([10, 2^64-1]) linux_put_dec   48.97 20478528
uniform([10, 2^64-1]) rv_put_dec  32.14 31915074
  +/-   -34.37%  +55.85%
3 + neg_binom(0.05)   linux_put_dec   25.60 38732189
3 + neg_binom(0.05)   rv_put_dec  20.18 48828445
  +/-   -21.18%  +26.07%
3 + neg_binom(0.10)   linux_put_dec   18.70 52648284
3 + neg_binom(0.10)   rv_put_dec  15.60 62915143
  +/-   -16.56%  +19.50%
3 + neg_binom(0.15)   linux_put_dec   15.27 64695363
3 + neg_binom(0.15)   rv_put_dec  13.34 73279482
  +/-   -12.62%  +13.27%
3 + neg_binom(0.20)   linux_put_dec   13.25 74240237

[PATCH v3] powerpc/numa: set node_possible_map to only node_online_map during boot

2015-03-10 Thread Nishanth Aravamudan
On 10.03.2015 [10:55:05 +1100], Michael Ellerman wrote:
 On Thu, 2015-03-05 at 21:27 -0800, Nishanth Aravamudan wrote:
  diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
  index 0257a7d659ef..0c1716cd271f 100644
  --- a/arch/powerpc/mm/numa.c
  +++ b/arch/powerpc/mm/numa.c
  @@ -958,6 +958,13 @@ void __init initmem_init(void)
   
  memblock_dump_all();
   
  +   /*
  +* Reduce the possible NUMA nodes to the online NUMA nodes,
  +* since we do not support node hotplug. This ensures that  we
  +* lower the maximum NUMA node ID to what is actually present.
  +*/
  +   node_possible_map = node_online_map;
 
 That looks nice, but is it generating what we want?
 
 ie. is the content of node_online_map being *copied* into node_possible_map.
 
 Or are we changing node_possible_map to point at node_online_map?

I think it ends up being the latter, which is probably fine in practice
(I think node_online_map is static on power after boot), but perhaps it
would be better to do:

nodes_and(node_possible_map, node_possible_map, node_online_map);

?

e.g.:


powerpc/numa: reset node_possible_map to only node_online_map

Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
value of nr_node_ids in setup_nr_node_ids and the iteration of
for_each_node.
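
(For reference, for_each_node really does walk the possible map --
roughly, from include/linux/nodemask.h:

#define for_each_node_state(__node, __state) \
	for_each_node_mask((__node), node_states[__state])
#define for_each_node(node)	for_each_node_state(node, N_POSSIBLE)

so its cost, like nr_node_ids, is driven entirely by node_possible_map.)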

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

mem_cgroup_css_alloc should also be fixed to only iterate over
memory-populated nodes and handle hotplug, but that is a separate
change.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
To: Michael Ellerman m...@ellerman.id.au
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Tejun Heo t...@kernel.org
Cc: David Rientjes rient...@google.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Anton Blanchard an...@samba.org
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com

---
v1 -> v2:
  Rather than clear node_possible_map and set it nid-by-nid, just
  directly assign node_online_map to it, as suggested by Michael
  Ellerman and Tejun Heo.

v2 -> v3:
  Rather than direct assignment (which is just repointing the pointer),
  modify node_possible_map in-place.

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..1a118b08fad2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)
 
memblock_dump_all();
 
+   /*
+* Reduce the possible NUMA nodes to the online NUMA nodes,
+* since we do not support node hotplug. This ensures that  we
+* lower the maximum NUMA node ID to what is actually present.
+*/
+   nodes_and(node_possible_map, node_possible_map, node_online_map);
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;
 


Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread Nishanth Aravamudan
Hi David,

On 05.03.2015 [13:16:35 -0800], David Rientjes wrote:
 On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
 
  diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
  index 0257a7d659ef..24de29b3651b 100644
  --- a/arch/powerpc/mm/numa.c
  +++ b/arch/powerpc/mm/numa.c
  @@ -958,9 +958,17 @@ void __init initmem_init(void)
   
  memblock_dump_all();
   
  +   /*
  +* zero out the possible nodes after we parse the device-tree,
  +* so that we lower the maximum NUMA node ID to what is actually
  +* present.
  +*/
  +   nodes_clear(node_possible_map);
  +
  for_each_online_node(nid) {
  unsigned long start_pfn, end_pfn;
   
  +   node_set(nid, node_possible_map);
  get_pfn_range_for_nid(nid, start_pfn, end_pfn);
  setup_node_data(nid, start_pfn, end_pfn);
  sparse_memory_present_with_active_regions(nid);
 
 This seems a bit strange, node_possible_map is supposed to be a superset 
 of node_online_map and this loop is iterating over node_online_map to set 
 nodes in node_possible_map.

So if we compare to x86:

arch/x86/mm/numa.c::numa_init():

nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
nodes_clear(node_online_map);
...
numa_register_memblks(...);

arch/x86/mm/numa.c::numa_register_memblks():

node_possible_map = numa_nodes_parsed;

Basically, it looks like x86 NUMA init clears out possible map and
online map, probably for a similar reason to what I gave in the
changelog that by default, the possible map seems to be based off
MAX_NUMNODES, rather than nr_node_ids or anything dynamic.

My patch was an attempt to emulate the same thing on powerpc. You are
right that there is a window in which the node_possible_map and
node_online_map are out of sync with my patch. It seems like it
shouldn't matter given how early in boot we are, but perhaps the
following would have been clearer:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..1a118b08fad2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)
 
memblock_dump_all();
 
+   /*
+* Reduce the possible NUMA nodes to the online NUMA nodes,
+* since we do not support node hotplug. This ensures that  we
+* lower the maximum NUMA node ID to what is actually present.
+*/
+   nodes_and(node_possible_map, node_possible_map, node_online_map);
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;
 


Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread Nishanth Aravamudan
On 05.03.2015 [13:58:27 -0800], David Rientjes wrote:
 On Fri, 6 Mar 2015, Michael Ellerman wrote:
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..24de29b3651b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,9 +958,17 @@ void __init initmem_init(void)
 
memblock_dump_all();
 
+   /*
+* zero out the possible nodes after we parse the device-tree,
+* so that we lower the maximum NUMA node ID to what is actually
+* present.
+*/
+   nodes_clear(node_possible_map);
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;
 
+   node_set(nid, node_possible_map);
get_pfn_range_for_nid(nid, start_pfn, end_pfn);
setup_node_data(nid, start_pfn, end_pfn);
sparse_memory_present_with_active_regions(nid);
   
   This seems a bit strange, node_possible_map is supposed to be a superset 
   of node_online_map and this loop is iterating over node_online_map to set 
   nodes in node_possible_map.
   
  Yeah. Though at this point in boot I don't think it matters that the
  two maps are out-of-sync temporarily.
  
  But it would simpler to just set the possible map to be the online
  map. That would also maintain the invariant that the possible map is
  always a superset of the online map.
  
  Or did I miss a detail there (sleep deprived parent mode).
  
 
 I think reset_numa_cpu_lookup_table() which iterates over the possible
 map, and thus only a subset of nodes now, may be concerning.


I think you are confusing the CPU online map and the NUMA node online
map. reset_numa_cpu_lookup_table is a cpu->node mapping, only called at
boot-time, and iterates over the CPU online map, which is unaltered by
my patch.

 I'm not sure why this is being proposed as a powerpc patch and not a
 patch for mem_cgroup_css_alloc().

I think mem_cgroup_css_alloc() is just an example of a larger issue. I
should have made that clearer in my changelog. Even if we change
mem_cgroup_css_alloc(), I think we want to fix the node_possible_map on
powerpc to be accurate at run-time, just like x86 does.

 In other words, why do we have to allocate for all possible nodes?  We
 should only be allocating for online nodes in N_MEMORY with mem
 hotplug disabled initially and then have a mem hotplug callback
 implemented to alloc_mem_cgroup_per_zone_info() for nodes that
 transition from memoryless -> memory.  The extra bonus is that
 alloc_mem_cgroup_per_zone_info() need never allocate remote memory and
 the TODO in that function can be removed.

This is a good idea, and seems like it can be a follow-on parallel patch
to the one I provided (which does need an updated changelog now).

Thanks,
Nish


Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread Nishanth Aravamudan
On 05.03.2015 [17:13:08 -0500], Tejun Heo wrote:
 On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
  While looking at this, I noticed that nr_node_ids is actually a
  misnomer, it seems. It's not the number, but the maximum_node_id, as
  with sparse NUMA nodes, you might only have two NUMA nodes possible, but
  to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
  changed?
 
 It's the same for nr_cpu_ids.  It's counting the number of valid IDs
 during that boot instance.  In the above case, whether the nodes are
 sparse or not, there exist 17 node ids - 0 to 16.  Maybe numa_max_id
 had been a better name (but would that equal the highest number or
 +1?) but nr_node_ids != nr_nodes so I don't think it's a misnomer
 either.  Doesn't really matter at this point.  Maybe add comments on
 top of both?

Yes, I will consider that. To me, I guess it's more a matter of:

a) How does nr_node_ids relate to the number of possible NUMA node IDs
at runtime?

They are identical.

b) How does nr_node_ids relate to the number of NUMA node IDs in use?

There is no relation.

c) How does nr_node_ids relate to the maximum NUMA node ID in use?

It is one larger than that value.
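
Concretely, a) is by construction; a sketch of setup_nr_node_ids() as
it looks in mm/page_alloc.c of this era:

static void __init setup_nr_node_ids(void)
{
	unsigned int node;
	unsigned int highest = 0;

	/* nr_node_ids is one past the highest possible node id */
	for_each_node_mask(node, node_possible_map)
		highest = node;
	nr_node_ids = highest + 1;
}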

However, for a), at least, we don't care about that on power, really. We
don't have node hotplug, so the possible map is the online map in
practice, for a given system.

Iteration seems to generally not be a problem (since we have sparse
iterators anyways) and we shouldn't be allocating for non-present nodes.

But we run into excessive allocations (I'm looking into a few others
Dipankar has found now) with array allocations based off nr_node_ids or
MAX_NUMNODES when the NUMA topology is sparse.

-Nish


Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread Nishanth Aravamudan
On 06.03.2015 [08:48:52 +1100], Michael Ellerman wrote:
 On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
  On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
  
   diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
   index 0257a7d659ef..24de29b3651b 100644
   --- a/arch/powerpc/mm/numa.c
   +++ b/arch/powerpc/mm/numa.c
   @@ -958,9 +958,17 @@ void __init initmem_init(void)

 memblock_dump_all();

   + /*
   +  * zero out the possible nodes after we parse the device-tree,
   +  * so that we lower the maximum NUMA node ID to what is actually
   +  * present.
   +  */
   + nodes_clear(node_possible_map);
   +
 for_each_online_node(nid) {
 unsigned long start_pfn, end_pfn;

   + node_set(nid, node_possible_map);
 get_pfn_range_for_nid(nid, start_pfn, end_pfn);
 setup_node_data(nid, start_pfn, end_pfn);
 sparse_memory_present_with_active_regions(nid);
  
  This seems a bit strange, node_possible_map is supposed to be a superset 
  of node_online_map and this loop is iterating over node_online_map to set 
  nodes in node_possible_map.
  
 Yeah. Though at this point in boot I don't think it matters that the two maps
 are out-of-sync temporarily.
 
 But it would simpler to just set the possible map to be the online
 map. That would also maintain the invariant that the possible map is
 always a superset of the online map.

Yes, we could do that (see my reply to David just now). I didn't
consider just setting the map directly, that would be clearer. I didn't
want to post my nodes_and() version, because the cost of nodes_and
seemed higher than nodes_clear && node_set appropriately.

-Nish


Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread Nishanth Aravamudan
On 05.03.2015 [17:08:04 -0500], Tejun Heo wrote:
 Hello,
 
 On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
  I'm not sure why this is being proposed as a powerpc patch and not a patch 
  for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
  for all possible nodes?  We should only be allocating for online nodes in 
  N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
  callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
  transition from memoryless -> memory.  The extra bonus is that 
  alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
  TODO in that function can be removed.
 
 For cpus, the general direction is allocating for all possible cpus.
 For iterations, we alternate between using all possibles and onlines
 depending on the use case; however, the general idea is that the
 possibles and onlines aren't gonna be very different.  NR_CPUS and
 MAX_NUMNODES gotta accommodate the worst possible case the kernel may
 run on but the possible masks should be set to the actually possible
 subset during boot so that the kernel doesn't end up allocating for and
 iterating over things which can't ever exist.

Makes sense to me.

 It can be argued that we should always stick to the online masks for
 allocation and iteration; however, that usually requires more
 complexity and the only cases where this mattered have been when the
 boot code got it wrong and failed to set the possible masks correctly,
 which also seems to be the case here.  I don't see any reason to
 deviate here.

So, do you agree with the general direction of my change? :)

Thanks,
Nish


[RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread Nishanth Aravamudan
Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
iteration of for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

One could alternatively nodes_and() the possible map with the online
map, but I think the cost of anding the two will always be higher than
zeroing and setting a few bits in practice.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

---
While looking at this, I noticed that nr_node_ids is actually a
misnomer, it seems. It's not the number, but the maximum_node_id, as
with sparse NUMA nodes, you might only have two NUMA nodes possible, but
to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
changed?

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..24de29b3651b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,9 +958,17 @@ void __init initmem_init(void)
 
memblock_dump_all();
 
+   /*
+* zero out the possible nodes after we parse the device-tree,
+* so that we lower the maximum NUMA node ID to what is actually
+* present.
+*/
+   nodes_clear(node_possible_map);
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;
 
+   node_set(nid, node_possible_map);
get_pfn_range_for_nid(nid, start_pfn, end_pfn);
setup_node_data(nid, start_pfn, end_pfn);
sparse_memory_present_with_active_regions(nid);

