Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote: > Looks fine, > > Reviewed-by: Christoph Hellwig > > ... but I doubt we'll ever bother updating it. Most architectures > with larger page sizes also have iommus and would need different settings > for different iommus vs direct mapping for very little gain. There's a > reason why we never bothered for RDMA either. FWIW, whose tree should this go through? The bug only appears on Power, afaik, but the patch is now just an NVMe change. Thanks, Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/1 v4] drivers/nvme: default to 4k device page size
On 03.11.2015 [13:46:25 +], Keith Busch wrote: > On Tue, Nov 03, 2015 at 05:18:24AM -0800, Christoph Hellwig wrote: > > On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote: > > > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c > > > index ccc0c1f93daa..a9a5285bdb39 100644 > > > --- a/drivers/block/nvme-core.c > > > +++ b/drivers/block/nvme-core.c > > > @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct > > > nvme_dev *dev) > > > u32 aqa; > > > u64 cap = readq(&dev->bar->cap); > > > struct nvme_queue *nvmeq; > > > - unsigned page_shift = PAGE_SHIFT; > > > + /* > > > + * default to a 4K page size, with the intention to update this > > > + * path in the future to accommodate architectures with differing > > > + * kernel and IO page sizes. > > > + */ > > > + unsigned page_shift = 12; > > > unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12; > > > unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12; > > > > Looks good as a start. Note that all the MPSMIN/MAX checking could > > be removed as NVMe devices must support 4k pages. > > MAX can go, and while it's probably the case that all devices support 4k, > it's not a spec requirement, so we should keep the dev_page_min check. Ok, here's an updated patch. We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). 
In this particular case of page sizes, we clearly want to use the IOMMU's page size in the driver. And generally, the NVMe driver in this function should be using the IOMMU's page size for the default device page size, rather than the kernel's page size. There is not currently an API to obtain the IOMMU's page size across all architectures, so in the interest of a stop-gap fix to this functional issue, default the NVMe device page size to 4K, with the intent of adding such an API and implementation across all architectures in the next merge window. With the functionally equivalent v3 of this patch, our hardware test exerciser survives when using 32-bit DMA; without the patch, the kernel will BUG within a few minutes.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
v1 -> v2: Based upon feedback from Christoph Hellwig, implement the IOMMU page size lookup as a generic DMA API, rather than an architecture-specific hack.
v2 -> v3: In the interest of fixing the functional problem in the short-term, just force the device page size to 4K and work on adding the new API in the next merge window.
v3 -> v4: Rebase to 4.3, including the new code locations. Based upon feedback from Keith Busch and Christoph Hellwig, remove the device max check, as the spec requires MPSMAX >= 4K.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e878590e71b6..00ca45bb0bc0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1701,9 +1701,13 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	u32 aqa;
 	u64 cap = readq(&dev->bar->cap);
 	struct nvme_queue *nvmeq;
-	unsigned page_shift = PAGE_SHIFT;
+	/*
+	 * default to a 4K page size, with the intention to update this
+	 * path in the future to accommodate architectures with differing
+	 * kernel and IO page sizes.
+	 */
+	unsigned page_shift = 12;
 	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
-	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;

 	if (page_shift < dev_page_min) {
 		dev_err(dev->dev,
@@ -1712,13 +1716,6 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 			1 << page_shift);
 		return -ENODEV;
 	}
-	if (page_shift > dev_page_max) {
-		dev_info(dev->dev,
-			 "Device maximum page size (%u) smaller than "
-			 "host (%u); enabling work-around\n",
-			 1 << dev_page_max, 1 << page_shift);
-		page_shift = dev_page_max;
-	}
 	dev->subsystem = readl(&dev->bar->vs) >= NVME_VS(1, 1) ?
 			NVME_CAP_NSSRC(cap) : 0;
Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote: > Looks fine, > > Reviewed-by: Christoph Hellwig > > ... but I doubt we'll ever bother updating it. Most architectures > with larger page sizes also have iommus and would need different settings > for different iommus vs direct mapping for very little gain. There's a > reason why we never bothered for RDMA either. Fair enough :) Thanks for all your reviews and comments. -Nish
[PATCH 1/1 v3] drivers/nvme: default to 4k device page size
On 29.10.2015 [17:20:43 +], Busch, Keith wrote: > On Thu, Oct 29, 2015 at 08:57:01AM -0700, Nishanth Aravamudan wrote: > > On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote: > > > We had a quick chat about this issue and I think we simply should > > > default to a NVMe controller page size of 4k everywhere as that's the > > > safe default. This is also what we do for RDMA Memory registrations and > > > it works fine there for SRP and iSER. > > > > So, would that imply changing just the NVMe driver code rather than > > adding the dma_page_shift API at all? What about > > architectures that can support the larger page sizes? There is an > > implied performance impact, at least, of shifting the IO size down. > > It is the safe option, but you're right that it might have a > measurable performance impact (can you run an experiment?). Maybe we > should just change the driver to always use MPSMIN for the moment in > the interest of time, and you can flush out the new API before the > next merge window. Given that it's 4K just about everywhere by default (and sort of implicitly expected to be, I guess), I think I'd prefer we default to 4K. That should mitigate the performance impact (I'll ask our IO team to do some runs, but since this impacts functionality on some hardware, I don't think it's too relevant for now). Unless there are NVMe devices with a MPSMAX < 4K? Something like the following? We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. 
This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). In this particular case of page sizes, we clearly want to use the IOMMU's page size in the driver. And generally, the NVMe driver in this function should be using the IOMMU's page size for the default device page size, rather than the kernel's page size. There is not currently an API to obtain the IOMMU's page size across all architectures, so in the interest of a stop-gap fix to this functional issue, default the NVMe device page size to 4K, with the intent of adding such an API and implementation across all architectures in the next merge window. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- v1 -> v2: Based upon feedback from Christoph Hellwig, implement the IOMMU page size lookup as a generic DMA API, rather than an architecture-specific hack. v2 -> v3: In the interest of fixing the functional problem in the short-term, just force the device page size to 4K and work on adding the new API in the next merge window. diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c index ccc0c1f93daa..a9a5285bdb39 100644 --- a/drivers/block/nvme-core.c +++ b/drivers/block/nvme-core.c @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev) u32 aqa; u64 cap = readq(&dev->bar->cap); struct nvme_queue *nvmeq; - unsigned page_shift = PAGE_SHIFT; + /* +* default to a 4K page size, with the intention to update this +* path in the future to accommodate architectures with differing +* kernel and IO page sizes. +*/ + unsigned page_shift = 12; unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12; unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12; -Nish
Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size
On 30.10.2015 [21:48:48 +], Keith Busch wrote: > On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote: > > Given that it's 4K just about everywhere by default (and sort of > > implicitly expected to be, I guess), I think I'd prefer we default to > > 4K. That should mitigate the performance impact (I'll ask our IO team to > > do some runs, but since this impacts functionality on some hardware, I > > don't think it's too relevant for now). Unless there are NVMe devices > > with a MPSMAX < 4K? > > Right, I assumed MPSMIN was always 4k for the same reason you mentioned, > but you can hard code it like you've done in your patch. > > The spec defines MPSMAX such that it's impossible to find a device > with MPSMAX < 4k. Great, thanks! -Nish
Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
On 29.10.2015 [18:49:55 -0700], David Miller wrote: > From: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > Date: Thu, 29 Oct 2015 08:57:01 -0700 > > > So, would that imply changing just the NVMe driver code rather than > > adding the dma_page_shift API at all? What about > > architectures that can support the larger page sizes? There is an > > implied performance impact, at least, of shifting the IO size down. > > > > Sorry for the continuing questions -- I got lots of conflicting feedback > > on the last series and want to make sure v4 is more acceptable. > > In the long term I would be very happy to see us having a real interface > for this stuff, just my opinion... Yep, I think I'll try and balance the two -- fix NVMe for now with a 4K page size as suggested by Christoph, and then work on the more complete API for the next merge window. Thanks, Nish
Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote: > On Wed, Oct 28, 2015 at 01:59:23PM +, Busch, Keith wrote: > > The "new" interface for all the other architectures is the same as the > > old one we've been using for the last 5 years. > > > > I welcome x86 maintainer feedback to confirm virtual and DMA addresses > > have the same offset at 4k alignment, but I have to insist we don't > > break my currently working hardware to force their attention. > > We had a quick chat about this issue and I think we simply should > default to a NVMe controller page size of 4k everywhere as that's the > safe default. This is also what we do for RDMA Memory registrations and > it works fine there for SRP and iSER. So, would that imply changing just the NVMe driver code rather than adding the dma_page_shift API at all? What about architectures that can support the larger page sizes? There is an implied performance impact, at least, of shifting the IO size down. Sorry for the continuing questions -- I got lots of conflicting feedback on the last series and want to make sure v4 is more acceptable. Thanks, Nish
Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
On 26.10.2015 [18:27:46 -0700], David Miller wrote: > From: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > Date: Fri, 23 Oct 2015 13:54:20 -0700 > > > 1) add a generic dma_get_page_shift implementation that just returns > > PAGE_SHIFT > > I won't object to this patch series, but if I had implemented this I > would have required the architectures to implement this explicitly, > one-by-one. I think it is less error prone and more likely to end > up with all the architectures setting this correctly. Well, looks like I should spin up a v4 anyways for the powerpc changes. So, to make sure I understand your point, should I make the generic dma_get_page_shift a compile-error kind of thing? It will only fail on architectures that actually build the NVME driver (as the only caller). But I'm not sure how exactly to achieve that, if you could give a bit more detail I'd appreciate it! Thanks, Nish
Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift
On 27.10.2015 [16:56:10 +1100], Alexey Kardashevskiy wrote: > On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote: > >When DDW (Dynamic DMA Windows) are present for a device, we have stored > >the TCE (Translation Control Entry) size in a special device tree > >property. Check if we have enabled DDW for the device and return the TCE > >size from that property if present. If the property isn't present, > >fallback to looking the value up in struct iommu_table. If we don't find > >an iommu_table, fallback to the kernel's page size. > > > >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > >--- > > arch/powerpc/platforms/pseries/iommu.c | 36 ++ > > 1 file changed, 36 insertions(+) > > > >diff --git a/arch/powerpc/platforms/pseries/iommu.c > >b/arch/powerpc/platforms/pseries/iommu.c > >index 0946b98..1bf6471 100644 > >--- a/arch/powerpc/platforms/pseries/iommu.c > >+++ b/arch/powerpc/platforms/pseries/iommu.c > >@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct > >device *dev) > > return dma_iommu_ops.get_required_mask(dev); > > } > > > >+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev) > >+{ > >+struct iommu_table *tbl; > >+ > >+if (!disable_ddw && dev_is_pci(dev)) { > >+struct pci_dev *pdev = to_pci_dev(dev); > >+struct device_node *dn; > >+ > >+dn = pci_device_to_OF_node(pdev); > >+ > >+/* search upwards for ibm,dma-window */ > >+for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group; > >+dn = dn->parent) > >+if (of_get_property(dn, "ibm,dma-window", NULL)) > >+break; > >+/* > >+ * if there is a DDW configuration, the TCE shift is stored in > >+ * the property > >+ */ > >+if (dn && PCI_DN(dn)) { > >+const struct dynamic_dma_window_prop *direct64 = > >+of_get_property(dn, DIRECT64_PROPNAME, NULL); > > > This DIRECT64_PROPNAME property is only present under pHyp, QEMU/KVM > does not set it as 64bit windows are dynamic there so something like > find_existing_ddw() needs to be used here. 
DIRECT64_PROPNAME is a Linux thing, not a pHyp or QEMU/KVM thing -- it's created by the Linux DDW logic and left in the device-tree when we successfully configure DDW. You're right, though, that logically find_existing_ddw() would be better to use here. I'll spin up a new version. -Nish
Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote: > On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote: > >On Power, the kernel's page size can differ from the IOMMU's page size, > >so we need to override the generic implementation, which always returns > >the kernel's page size. Lookup the IOMMU's page size from struct > >iommu_table, if available. Fallback to the kernel's page size, > >otherwise. > > > >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > >--- > > arch/powerpc/include/asm/dma-mapping.h | 3 +++ > > arch/powerpc/kernel/dma.c | 9 + > > 2 files changed, 12 insertions(+) > > > >diff --git a/arch/powerpc/include/asm/dma-mapping.h > >b/arch/powerpc/include/asm/dma-mapping.h > >index 7f522c0..c5638f4 100644 > >--- a/arch/powerpc/include/asm/dma-mapping.h > >+++ b/arch/powerpc/include/asm/dma-mapping.h > >@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, > >dma_addr_t off) > > #define HAVE_ARCH_DMA_SET_MASK 1 > > extern int dma_set_mask(struct device *dev, u64 dma_mask); > > > >+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1 > >+extern unsigned long dma_get_page_shift(struct device *dev); > >+ > > #include > > > > extern int __dma_set_mask(struct device *dev, u64 dma_mask); > >diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c > >index 59503ed..e805af2 100644 > >--- a/arch/powerpc/kernel/dma.c > >+++ b/arch/powerpc/kernel/dma.c > >@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask) > > } > > EXPORT_SYMBOL(dma_set_mask); > > > >+unsigned long dma_get_page_shift(struct device *dev) > >+{ > >+struct iommu_table *tbl = get_iommu_table_base(dev); > >+if (tbl) > >+return tbl->it_page_shift; > > > All PCI devices have this initialized on POWER (at least, our, IBM's > POWER) so 4K will always be returned here while in the case of > (get_dma_ops(dev) == &dma_direct_ops) it could actually return > PAGE_SHIFT. Is 4K still preferred value to return here? 
Right, so the logic of my series goes like this: a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is PAGE_SHIFT everywhere, including Power. b) After 2/7, the Power code will return either the IOMMU table's shift value, if set, or PAGE_SHIFT (I guess this would be the case if get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different than we have now, except we can return the accurate IOMMU value if available. c) After 3/7, the platform can override the generic Power get_dma_page_shift(). d) After 4/7, pseries will return the DDW value, if available, then fallback to the IOMMU table's value. I think in the case of get_dma_ops(dev) == &dma_direct_ops, the only way that can happen is if we are using DDW, right? -Nish
Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
On 28.10.2015 [09:57:48 +1100], Julian Calaby wrote: > Hi Nishanth, > > On Wed, Oct 28, 2015 at 9:20 AM, Nishanth Aravamudan > <n...@linux.vnet.ibm.com> wrote: > > On 26.10.2015 [18:27:46 -0700], David Miller wrote: > >> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > >> Date: Fri, 23 Oct 2015 13:54:20 -0700 > >> > >> > 1) add a generic dma_get_page_shift implementation that just returns > >> > PAGE_SHIFT > >> > >> I won't object to this patch series, but if I had implemented this I > >> would have required the architectures to implement this explicitly, > >> one-by-one. I think it is less error prone and more likely to end > >> up with all the architectures setting this correctly. > > > > Well, looks like I should spin up a v4 anyways for the powerpc changes. > > So, to make sure I understand your point, should I make the generic > > dma_get_page_shift a compile-error kind of thing? It will only fail on > > architectures that actually build the NVME driver (as the only caller). > > But I'm not sure how exactly to achieve that, if you could give a bit > > more detail I'd appreciate it! > > He's suggesting that you _don't_ put a generic implementation in > /include/linux/dma-mapping.h and instead add it to _every_ > architecture. Ah, I see! Well, I don't know much about the DMA internals of most architectures -- and my approach kept things functionally the same everywhere (using PAGE_SHIFT) except: a) Power, where I know it doesn't work as-is and b) sparc, where the code implied that a different value than PAGE_SHIFT should be used. Thanks, Nish
Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
On 28.10.2015 [12:00:20 +1100], Alexey Kardashevskiy wrote: > On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote: > >On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote: > >>On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote: > >>>On Power, the kernel's page size can differ from the IOMMU's page size, > >>>so we need to override the generic implementation, which always returns > >>>the kernel's page size. Lookup the IOMMU's page size from struct > >>>iommu_table, if available. Fallback to the kernel's page size, > >>>otherwise. > >>> > >>>Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > >>>--- > >>> arch/powerpc/include/asm/dma-mapping.h | 3 +++ > >>> arch/powerpc/kernel/dma.c | 9 + > >>> 2 files changed, 12 insertions(+) > >>> > >>>diff --git a/arch/powerpc/include/asm/dma-mapping.h > >>>b/arch/powerpc/include/asm/dma-mapping.h > >>>index 7f522c0..c5638f4 100644 > >>>--- a/arch/powerpc/include/asm/dma-mapping.h > >>>+++ b/arch/powerpc/include/asm/dma-mapping.h > >>>@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, > >>>dma_addr_t off) > >>> #define HAVE_ARCH_DMA_SET_MASK 1 > >>> extern int dma_set_mask(struct device *dev, u64 dma_mask); > >>> > >>>+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1 > >>>+extern unsigned long dma_get_page_shift(struct device *dev); > >>>+ > >>> #include > >>> > >>> extern int __dma_set_mask(struct device *dev, u64 dma_mask); > >>>diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c > >>>index 59503ed..e805af2 100644 > >>>--- a/arch/powerpc/kernel/dma.c > >>>+++ b/arch/powerpc/kernel/dma.c > >>>@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask) > >>> } > >>> EXPORT_SYMBOL(dma_set_mask); > >>> > >>>+unsigned long dma_get_page_shift(struct device *dev) > >>>+{ > >>>+ struct iommu_table *tbl = get_iommu_table_base(dev); > >>>+ if (tbl) > >>>+ return tbl->it_page_shift; > >> > >> > >>All PCI devices have this initialized on POWER (at least, our, IBM's > >>POWER) so 4K 
will always be returned here while in the case of > >>(get_dma_ops(dev) == &dma_direct_ops) it could actually return > >>PAGE_SHIFT. Is 4K still preferred value to return here? > > > >Right, so the logic of my series goes like this: > > > >a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is > >PAGE_SHIFT everywhere, including Power. > > > >b) After 2/7, the Power code will return either the IOMMU table's shift > >value, if set, or PAGE_SHIFT (I guess this would be the case if > >get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different > >than we have now, except we can return the accurate IOMMU value if > >available. > > If it is not available, then something went wrong and BUG_ON(!tbl || > !tbl->it_page_shift) make more sense here than pretending that this > function can ever return PAGE_SHIFT. imho. That's a good point, thanks! > >c) After 3/7, the platform can override the generic Power > >get_dma_page_shift(). > > > >d) After 4/7, pseries will return the DDW value, if available, then > >fallback to the IOMMU table's value. I think in the case of > >get_dma_ops(dev) == &dma_direct_ops, the only way that can happen is if we > >are using DDW, right? > > This is for pseries guests; for the powernv host it is a "bypass" > mode which does 64bit direct DMA mapping and there is no additional > window for that (i.e. DIRECT64_PROPNAME, etc). You're right! I should update the code to handle both cases. In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K? Seems like this would be a different platform implementation I'd put in for 'powernv', is that right? My apologies for missing that, and thank you for the review! -Nish
Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
On 27.10.2015 [17:53:22 -0700], David Miller wrote: > From: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > Date: Tue, 27 Oct 2015 15:20:10 -0700 > > > Well, looks like I should spin up a v4 anyways for the powerpc changes. > > So, to make sure I understand your point, should I make the generic > > dma_get_page_shift a compile-error kind of thing? It will only fail on > > architectures that actually build the NVME driver (as the only caller). > > But I'm not sure how exactly to achieve that, if you could give a bit > > more detail I'd appreciate it! > > Yes, I am basically suggesting to simply not provide a default at all. For my own edification -- what is the way that gets resolved? I guess I mean it seems like linux-next would cease to compile because of my new series. Would my patches just get kicked out of -next for introducing that (or even via the 0-day notifications), or should I put something into the commit message indicating it is an API introduction? Sorry for the tentativeness, I have not introduced a cross-architecture API like this before. Thanks, Nish
Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
On 28.10.2015 [11:20:05 +0900], Benjamin Herrenschmidt wrote: > On Tue, 2015-10-27 at 18:54 -0700, Nishanth Aravamudan wrote: > > > > In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K? > > None :-) The TCEs are completely bypassed. You get a N:M linear mapping > of all memory starting at 1<<59 PCI side. Err, duh, sorry! Ok, so in that case, DMA page shift is PAGE_SHIFT, then? -Nish
[PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). In this particular case, and generally, we want to use the IOMMU's page size for the default device page size, rather than the kernel's page size. This series consists of seven patches: 1) add a generic dma_get_page_shift implementation that just returns PAGE_SHIFT 2) override the generic implementation on Power to use the IOMMU table's page shift if available 3) allow further specific overriding on power with machdep platform overrides 4) use the machdep override on pseries, as the DDW code puts the TCE shift in a special property and there is no IOMMU table available 5) move some sparc code around to make IOMMU_PAGE_SHIFT available in include/asm 6) override the generic implementation on sparc to use IOMMU_PAGE_SHIFT 7) leverage the new API in the NVMe driver With these patches, an NVMe device survives our internal hardware exerciser; the kernel BUGs within a few seconds without the patch. 
arch/powerpc/include/asm/dma-mapping.h | 3 +++ arch/powerpc/include/asm/machdep.h | 3 ++- arch/powerpc/kernel/dma.c | 11 +++ arch/powerpc/platforms/pseries/iommu.c | 36 arch/sparc/include/asm/dma-mapping.h | 8 arch/sparc/include/asm/iommu_common.h | 51 +++ arch/sparc/kernel/iommu.c | 2 +- arch/sparc/kernel/iommu_common.h | 51 --- arch/sparc/kernel/pci_psycho.c | 2 +- arch/sparc/kernel/pci_sabre.c | 2 +- arch/sparc/kernel/pci_schizo.c | 2 +- arch/sparc/kernel/pci_sun4v.c | 2 +- arch/sparc/kernel/psycho_common.c | 2 +- arch/sparc/kernel/sbus.c | 3 +-- drivers/block/nvme-core.c | 3 ++- include/linux/dma-mapping.h| 7 +++ 16 files changed, 127 insertions(+), 61 deletions(-) v1 -> v2: Based upon feedback from Christoph Hellwig, rather than using an arch-specific hack, expose the DMA page shift via a generic DMA API and override it on Power as needed. v2 -> v3: Based upon feedback from Christoph Hellwig, put the generic implementation in include/linux/dma-mapping.h, since not all archs use include/asm-generic/dma-mapping-common.h. Add sparc implementation, as that arch seems to have a different IOMMU page size.
Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
[Sorry, subject should have been 0/7!] On 23.10.2015 [13:54:20 -0700], Nishanth Aravamudan wrote: > We received a bug report recently when DDW (64-bit direct DMA on Power) > is not enabled for NVMe devices. In that case, we fall back to 32-bit > DMA via the IOMMU, which is always done via 4K TCEs (Translation Control > Entries). > > The NVMe device driver, though, assumes that the DMA alignment for the > PRP entries will match the device's page size, and that the DMA alignment > matches the kernel's page alignment. On Power, the IOMMU page size, > as mentioned above, can be 4K, while the device can have a page size of > 8K, while the kernel has a page size of 64K. This eventually trips the > BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple > of 4K but not 8K (e.g., 0xF000). > > In this particular case, and generally, we want to use the IOMMU's page > size for the default device page size, rather than the kernel's page > size. > > This series consists of seven patches: > > 1) add a generic dma_get_page_shift implementation that just returns > PAGE_SHIFT > 2) override the generic implementation on Power to use the IOMMU table's > page shift if available > 3) allow further specific overriding on power with machdep platform > overrides > 4) use the machdep override on pseries, as the DDW code puts the TCE > shift in a special property and there is no IOMMU table available > 5) move some sparc code around to make IOMMU_PAGE_SHIFT available in > include/asm > 6) override the generic implementation on sparc to use IOMMU_PAGE_SHIFT > 7) leverage the new API in the NVMe driver > > With these patches, an NVMe device survives our internal hardware > exerciser; the kernel BUGs within a few seconds without the patch. 
> > arch/powerpc/include/asm/dma-mapping.h | 3 +++ > arch/powerpc/include/asm/machdep.h | 3 ++- > arch/powerpc/kernel/dma.c | 11 +++ > arch/powerpc/platforms/pseries/iommu.c | 36 > > arch/sparc/include/asm/dma-mapping.h | 8 > arch/sparc/include/asm/iommu_common.h | 51 > +++ > arch/sparc/kernel/iommu.c | 2 +- > arch/sparc/kernel/iommu_common.h | 51 > --- > arch/sparc/kernel/pci_psycho.c | 2 +- > arch/sparc/kernel/pci_sabre.c | 2 +- > arch/sparc/kernel/pci_schizo.c | 2 +- > arch/sparc/kernel/pci_sun4v.c | 2 +- > arch/sparc/kernel/psycho_common.c | 2 +- > arch/sparc/kernel/sbus.c | 3 +-- > drivers/block/nvme-core.c | 3 ++- > include/linux/dma-mapping.h| 7 +++ > 16 files changed, 127 insertions(+), 61 deletions(-) > > v1 -> v2: > Based upon feedback from Christoph Hellwig, rather than using an > arch-specific hack, expose the DMA page shift via a generic DMA API and > override it on Power as needed. > v2 -> v3: > Based upon feedback from Christoph Hellwig, put the generic > implementation in include/linux/dma-mapping.h, since not all archs use > include/asm-generic/dma-mapping-common.h. > Add sparc implementation, as that arch seems to have a different IOMMU > page size.
[PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size
We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). In this particular case of page sizes, we clearly want to use the IOMMU's page size in the driver. And generally, the NVMe driver in this function should be using the IOMMU's page size for the default device page size, rather than the kernel's page size. With this patch, an NVMe device survives our internal hardware exerciser; the kernel BUGs within a few seconds without the patch. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- v1 -> v2: Based upon feedback from Christoph Hellwig, implement the IOMMU page size lookup as a generic DMA API, rather than an architecture-specific hack. 
drivers/block/nvme-core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c index 6f04771..5a79106 100644 --- a/drivers/block/nvme-core.c +++ b/drivers/block/nvme-core.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -1711,7 +1712,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev) u32 aqa; u64 cap = readq(>bar->cap); struct nvme_queue *nvmeq; - unsigned page_shift = PAGE_SHIFT; + unsigned page_shift = dma_get_page_shift(dev->dev); unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12; unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12; -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
On Power, the kernel's page size can differ from the IOMMU's page size, so we need to override the generic implementation, which always returns the kernel's page size. Look up the IOMMU's page size from struct iommu_table, if available. Fall back to the kernel's page size otherwise. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- arch/powerpc/include/asm/dma-mapping.h | 3 +++ arch/powerpc/kernel/dma.c | 9 + 2 files changed, 12 insertions(+) diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 7f522c0..c5638f4 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off) #define HAVE_ARCH_DMA_SET_MASK 1 extern int dma_set_mask(struct device *dev, u64 dma_mask); +#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1 +extern unsigned long dma_get_page_shift(struct device *dev); + #include extern int __dma_set_mask(struct device *dev, u64 dma_mask); diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c index 59503ed..e805af2 100644 --- a/arch/powerpc/kernel/dma.c +++ b/arch/powerpc/kernel/dma.c @@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask) } EXPORT_SYMBOL(dma_set_mask); +unsigned long dma_get_page_shift(struct device *dev) +{ + struct iommu_table *tbl = get_iommu_table_base(dev); + if (tbl) + return tbl->it_page_shift; + return PAGE_SHIFT; +} +EXPORT_SYMBOL(dma_get_page_shift); + u64 __dma_get_required_mask(struct device *dev) { struct dma_map_ops *dma_ops = get_dma_ops(dev); -- 1.9.1 
Re: [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h
[Apologies for the subject line, should just have the [RFC PATCH 5/7]] On 23.10.2015 [14:00:08 -0700], Nishanth Aravamudan wrote: > In order to cleanly expose the desired IOMMU page shift via the new > dma_get_page_shift API, we need to have the sparc constants available in > a more typical location. There should be no functional impact to this > move, but it is untested. > > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > --- > arch/sparc/include/asm/iommu_common.h | 51 > +++ > arch/sparc/kernel/iommu.c | 2 +- > arch/sparc/kernel/iommu_common.h | 51 > --- > arch/sparc/kernel/pci_psycho.c| 2 +- > arch/sparc/kernel/pci_sabre.c | 2 +- > arch/sparc/kernel/pci_schizo.c| 2 +- > arch/sparc/kernel/pci_sun4v.c | 2 +- > arch/sparc/kernel/psycho_common.c | 2 +- > arch/sparc/kernel/sbus.c | 3 +-- > 9 files changed, 58 insertions(+), 59 deletions(-) > create mode 100644 arch/sparc/include/asm/iommu_common.h > delete mode 100644 arch/sparc/kernel/iommu_common.h > > diff --git a/arch/sparc/include/asm/iommu_common.h > b/arch/sparc/include/asm/iommu_common.h > new file mode 100644 > index 000..b40cec2 > --- /dev/null > +++ b/arch/sparc/include/asm/iommu_common.h > @@ -0,0 +1,51 @@ > +/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations. > + * > + * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net) > + */ > + > +#ifndef _IOMMU_COMMON_H > +#define _IOMMU_COMMON_H > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > + > +/* > + * These give mapping size of each iommu pte/tlb. > + */ > +#define IO_PAGE_SHIFT13 > +#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT) > +#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1)) > +#define IO_PAGE_ALIGN(addr) ALIGN(addr, IO_PAGE_SIZE) > + > +#define IO_TSB_ENTRIES (128*1024) > +#define IO_TSB_SIZE (IO_TSB_ENTRIES * 8) > + > +/* > + * This is the hardwired shift in the iotlb tag/data parts. 
> + */ > +#define IOMMU_PAGE_SHIFT 13 > + > +#define SG_ENT_PHYS_ADDRESS(SG) (__pa(sg_virt((SG > + > +static inline int is_span_boundary(unsigned long entry, > +unsigned long shift, > +unsigned long boundary_size, > +struct scatterlist *outs, > +struct scatterlist *sg) > +{ > + unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs); > + int nr = iommu_num_pages(paddr, outs->dma_length + sg->length, > + IO_PAGE_SIZE); > + > + return iommu_is_span_boundary(entry, nr, shift, boundary_size); > +} > + > +#endif /* _IOMMU_COMMON_H */ > diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c > index 5320689..f9b6077 100644 > --- a/arch/sparc/kernel/iommu.c > +++ b/arch/sparc/kernel/iommu.c > @@ -20,8 +20,8 @@ > #endif > > #include > +#include > > -#include "iommu_common.h" > #include "kernel.h" > > #define STC_CTXMATCH_ADDR(STC, CTX) \ > diff --git a/arch/sparc/kernel/iommu_common.h > b/arch/sparc/kernel/iommu_common.h > deleted file mode 100644 > index b40cec2..000 > --- a/arch/sparc/kernel/iommu_common.h > +++ /dev/null > @@ -1,51 +0,0 @@ > -/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations. > - * > - * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net) > - */ > - > -#ifndef _IOMMU_COMMON_H > -#define _IOMMU_COMMON_H > - > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > - > -#include > - > -/* > - * These give mapping size of each iommu pte/tlb. > - */ > -#define IO_PAGE_SHIFT13 > -#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT) > -#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1)) > -#define IO_PAGE_ALIGN(addr) ALIGN(addr, IO_PAGE_SIZE) > - > -#define IO_TSB_ENTRIES (128*1024) > -#define IO_TSB_SIZE (IO_TSB_ENTRIES * 8) > - > -/* > - * This is the hardwired shift in the iotlb tag/data parts. > - */ > -#define IOMMU_PAGE_SHIFT 13 > - > -#define SG_ENT_PHYS_ADDRESS(SG) (__pa(sg_virt((SG > - > -static inline int is_span_boundary(unsigned long entry, > -unsigned long sh
[PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift
The IOMMU page size is not always stored in struct iommu_table on Power. Specifically, if a device is configured for DDW (Dynamic DMA Windows, a.k.a. 64-bit direct DMA), the used TCE (Translation Control Entry) size is stored in a special device property created at run-time by the DDW configuration code. DDW is a pseries-specific feature, so allow platforms to override the implementation of dma_get_page_shift if desired. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- arch/powerpc/include/asm/machdep.h | 3 ++- arch/powerpc/kernel/dma.c | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 3f191f5..9dd03cf 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -83,9 +83,10 @@ struct machdep_calls { #endif #endif /* CONFIG_PPC64 */ - /* Platform set_dma_mask and dma_get_required_mask overrides */ + /* Platform overrides */ int (*dma_set_mask)(struct device *dev, u64 dma_mask); u64 (*dma_get_required_mask)(struct device *dev); + unsigned long (*dma_get_page_shift)(struct device *dev); int (*probe)(void); void(*setup_arch)(void); /* Optional, may be NULL */ diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c index e805af2..c363896 100644 --- a/arch/powerpc/kernel/dma.c +++ b/arch/powerpc/kernel/dma.c @@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask); unsigned long dma_get_page_shift(struct device *dev) { struct iommu_table *tbl = get_iommu_table_base(dev); + if (ppc_md.dma_get_page_shift) + return ppc_md.dma_get_page_shift(dev); if (tbl) return tbl->it_page_shift; return PAGE_SHIFT; } -- 1.9.1 
[PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API
Drivers like NVMe need to be able to determine the page size used for DMA transfers. Add a new API that defaults to return PAGE_SHIFT on all architectures. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- v1 -> v2: Based upon feedback from Christoph Hellwig, implement the IOMMU page size lookup as a generic DMA API, rather than an architecture-specific hack. v2 -> v3: Based upon feedback from Christoph Hellwig that not all architectures have moved to dma-mapping-common.h, so move the #ifdef and default to linux/dma-mapping.h. --- include/linux/dma-mapping.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index ac07ff0..7eaba8d 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -95,6 +95,13 @@ static inline u64 dma_get_mask(struct device *dev) return DMA_BIT_MASK(32); } +#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT +static inline unsigned long dma_get_page_shift(struct device *dev) +{ + return PAGE_SHIFT; +} +#endif + #ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK int dma_set_coherent_mask(struct device *dev, u64 mask); #else -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift
When DDW (Dynamic DMA Windows) are present for a device, we have stored the TCE (Translation Control Entry) size in a special device tree property. Check if we have enabled DDW for the device and return the TCE size from that property if present. If the property isn't present, fall back to looking the value up in struct iommu_table. If we don't find an iommu_table, fall back to the kernel's page size. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- arch/powerpc/platforms/pseries/iommu.c | 36 ++ 1 file changed, 36 insertions(+) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 0946b98..1bf6471 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev) return dma_iommu_ops.get_required_mask(dev); } +static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev) +{ + struct iommu_table *tbl; + + if (!disable_ddw && dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + struct device_node *dn; + + dn = pci_device_to_OF_node(pdev); + + /* search upwards for ibm,dma-window */ + for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group; + dn = dn->parent) + if (of_get_property(dn, "ibm,dma-window", NULL)) + break; + /* +* if there is a DDW configuration, the TCE shift is stored in +* the property +*/ + if (dn && PCI_DN(dn)) { + const struct dynamic_dma_window_prop *direct64 = + of_get_property(dn, DIRECT64_PROPNAME, NULL); + if (direct64) + return be32_to_cpu(direct64->tce_shift); + } + } + + tbl = get_iommu_table_base(dev); + if (tbl) + return tbl->it_page_shift; + + return PAGE_SHIFT; +} + #else /* CONFIG_PCI */ #define pci_dma_bus_setup_pSeries NULL #define pci_dma_dev_setup_pSeries NULL @@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev) #define pci_dma_dev_setup_pSeriesLPNULL #define dma_set_mask_pSeriesLP NULL #define 
dma_get_required_mask_pSeriesLPNULL +#define dma_get_page_shift_pSeriesLP NULL #endif /* !CONFIG_PCI */ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, @@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void) pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP; ppc_md.dma_set_mask = dma_set_mask_pSeriesLP; ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP; + ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP; } else { pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries; pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries; -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift
On sparc, the kernel's page size differs from the IOMMU's page size, so override the generic implementation, which always returns the kernel's page size, and return IOMMU_PAGE_SHIFT instead. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- I know very little about sparc, so please correct me if this patch is invalid. arch/sparc/include/asm/dma-mapping.h | 8 1 file changed, 8 insertions(+) diff --git a/arch/sparc/include/asm/dma-mapping.h b/arch/sparc/include/asm/dma-mapping.h index a21da59..76ba470 100644 --- a/arch/sparc/include/asm/dma-mapping.h +++ b/arch/sparc/include/asm/dma-mapping.h @@ -52,6 +52,14 @@ static inline int dma_set_mask(struct device *dev, u64 mask) return -EINVAL; } +/* for IOMMU_PAGE_SHIFT */ +#include +#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1 +static inline unsigned long dma_get_page_shift(struct device *dev) +{ + return IOMMU_PAGE_SHIFT; +} + #include #endif -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h
In order to cleanly expose the desired IOMMU page shift via the new dma_get_page_shift API, we need to have the sparc constants available in a more typical location. There should be no functional impact to this move, but it is untested. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> --- arch/sparc/include/asm/iommu_common.h | 51 +++ arch/sparc/kernel/iommu.c | 2 +- arch/sparc/kernel/iommu_common.h | 51 --- arch/sparc/kernel/pci_psycho.c| 2 +- arch/sparc/kernel/pci_sabre.c | 2 +- arch/sparc/kernel/pci_schizo.c| 2 +- arch/sparc/kernel/pci_sun4v.c | 2 +- arch/sparc/kernel/psycho_common.c | 2 +- arch/sparc/kernel/sbus.c | 3 +-- 9 files changed, 58 insertions(+), 59 deletions(-) create mode 100644 arch/sparc/include/asm/iommu_common.h delete mode 100644 arch/sparc/kernel/iommu_common.h diff --git a/arch/sparc/include/asm/iommu_common.h b/arch/sparc/include/asm/iommu_common.h new file mode 100644 index 000..b40cec2 --- /dev/null +++ b/arch/sparc/include/asm/iommu_common.h @@ -0,0 +1,51 @@ +/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations. + * + * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net) + */ + +#ifndef _IOMMU_COMMON_H +#define _IOMMU_COMMON_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +/* + * These give mapping size of each iommu pte/tlb. + */ +#define IO_PAGE_SHIFT 13 +#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT) +#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1)) +#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE) + +#define IO_TSB_ENTRIES (128*1024) +#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8) + +/* + * This is the hardwired shift in the iotlb tag/data parts. 
+ */ +#define IOMMU_PAGE_SHIFT 13 + +#define SG_ENT_PHYS_ADDRESS(SG)(__pa(sg_virt((SG + +static inline int is_span_boundary(unsigned long entry, + unsigned long shift, + unsigned long boundary_size, + struct scatterlist *outs, + struct scatterlist *sg) +{ + unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs); + int nr = iommu_num_pages(paddr, outs->dma_length + sg->length, +IO_PAGE_SIZE); + + return iommu_is_span_boundary(entry, nr, shift, boundary_size); +} + +#endif /* _IOMMU_COMMON_H */ diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c index 5320689..f9b6077 100644 --- a/arch/sparc/kernel/iommu.c +++ b/arch/sparc/kernel/iommu.c @@ -20,8 +20,8 @@ #endif #include +#include -#include "iommu_common.h" #include "kernel.h" #define STC_CTXMATCH_ADDR(STC, CTX)\ diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h deleted file mode 100644 index b40cec2..000 --- a/arch/sparc/kernel/iommu_common.h +++ /dev/null @@ -1,51 +0,0 @@ -/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations. - * - * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net) - */ - -#ifndef _IOMMU_COMMON_H -#define _IOMMU_COMMON_H - -#include -#include -#include -#include -#include -#include -#include -#include - -#include - -/* - * These give mapping size of each iommu pte/tlb. - */ -#define IO_PAGE_SHIFT 13 -#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT) -#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1)) -#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE) - -#define IO_TSB_ENTRIES (128*1024) -#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8) - -/* - * This is the hardwired shift in the iotlb tag/data parts. 
- */ -#define IOMMU_PAGE_SHIFT 13 - -#define SG_ENT_PHYS_ADDRESS(SG)(__pa(sg_virt((SG - -static inline int is_span_boundary(unsigned long entry, - unsigned long shift, - unsigned long boundary_size, - struct scatterlist *outs, - struct scatterlist *sg) -{ - unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs); - int nr = iommu_num_pages(paddr, outs->dma_length + sg->length, -IO_PAGE_SIZE); - - return iommu_is_span_boundary(entry, nr, shift, boundary_size); -} - -#endif /* _IOMMU_COMMON_H */ diff --git a/arch/sparc/kernel/pci_psycho.c b/arch/sparc/kernel/pci_psycho.c index 7dce27b..332fd6f 100644 --- a/arch/sparc/kernel/pci_psycho.c +++ b/arch/sparc/kernel/pci_psycho.c @@ -15,13 +15,13 @@ #include #include +#include #include #include #include #include #include "pci_impl.h" -#include "i
Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API
On 15.10.2015 [15:52:19 -0700], Nishanth Aravamudan wrote: > On 14.10.2015 [08:42:51 -0700], Christoph Hellwig wrote: > > Hi Nishanth, > > > > sorry for the late reply. > > > > > > On Power, since it's technically variable, we'd need a function. So are > > > > you suggesting define'ing it to a function just on Power and leaving it > > > > a constant elsewhere? > > > > > > > > I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw. > > > > > > Sorry, I should have been more specific -- I'm ready to spin out a v3, > > > with a sparc-specific function. > > > > > > Are you ok with leaving it a function for now (the only caller is in > > > NVMe obviously). > > > > > > I guess we do indeed need a function then. I'll take a look at your > > patch, but as long you found a way to avoid adding too much boilerplate > > code it should be fine. > > Ok, so I've got the moved function (include/linux/dma-mapping.h > instead of dma-mapping-common.h) ready to go, which should only > involve changing the first patch in the series. But I'm really > mystified by what to do for sparc, which defines IOMMU_PAGE_SHIFT and > IO_PAGE_SHIFT in arch/sparc/kernel/iommu_common.h. > > 1) Which constant reflects the value we mean for this function on > sparc? I assume it should be IOMMU_PAGE_SHIFT, but they are the same > value and I want to make sure I get the semantics right. > > 2) Where would I put sparc's definition of dma_get_page_shift()? > Should it be in a asm/dma-mapping.h? Should we move some of the > constants from arch/sparc/kernel/iommu_common.h to > arch/sparc/include/asm/iommu_common.h and then #include that in > asm/dma-mapping.h? > > Dave M., any opinions/insights? Essentially, this helper function > assists the NVMe driver in determining what page size it should use to > satisfy both the device and IOMMU's requirements. Maybe I > misunderstand the constants on sparc and PAGE_SHIFT is fine there too? Anyone with sparc knowledge have an opinion/input? 
-Nish
Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API
On 14.10.2015 [08:42:51 -0700], Christoph Hellwig wrote: > Hi Nishanth, > > sorry for the late reply. > > > > On Power, since it's technically variable, we'd need a function. So are > > > you suggesting define'ing it to a function just on Power and leaving it > > > a constant elsewhere? > > > > > > I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw. > > > > Sorry, I should have been more specific -- I'm ready to spin out a v3, > > with a sparc-specific function. > > > > Are you ok with leaving it a function for now (the only caller is in > > NVMe obviously). > > > I guess we do indeed need a function then. I'll take a look at your > patch, but as long you found a way to avoid adding too much boilerplate > code it should be fine. Ok, so I've got the moved function (include/linux/dma-mapping.h instead of dma-mapping-common.h) ready to go, which should only involve changing the first patch in the series. But I'm really mystified by what to do for sparc, which defines IOMMU_PAGE_SHIFT and IO_PAGE_SHIFT in arch/sparc/kernel/iommu_common.h. 1) Which constant reflects the value we mean for this function on sparc? I assume it should be IOMMU_PAGE_SHIFT, but they are the same value and I want to make sure I get the semantics right. 2) Where would I put sparc's definition of dma_get_page_shift()? Should it be in a asm/dma-mapping.h? Should we move some of the constants from arch/sparc/kernel/iommu_common.h to arch/sparc/include/asm/iommu_common.h and then #include that in asm/dma-mapping.h? Dave M., any opinions/insights? Essentially, this helper function assists the NVMe driver in determining what page size it should use to satisfy both the device and IOMMU's requirements. Maybe I misunderstand the constants on sparc and PAGE_SHIFT is fine there too? -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API
Hi Christoph, On 12.10.2015 [14:06:51 -0700], Nishanth Aravamudan wrote: > On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote: > > Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define > > with an #ifndef in common code? > > On Power, since it's technically variable, we'd need a function. So are > you suggesting define'ing it to a function just on Power and leaving it > a constant elsewhere? > > I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw. Sorry, I should have been more specific -- I'm ready to spin out a v3, with a sparc-specific function. Are you ok with leaving it a function for now (the only caller is in NVMe obviously). -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API
On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote: > Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define > with an #ifndef in common code? I suppose we could do that -- I wasn't sure if the macro would be palatable. > Also not all architectures use dma-mapping-common.h yet, so you either > need to update all of those as well, or just add the #ifndef directly > to linux/dma-mapping.h. Ok, I will take a look at that and spin up a v3. Thanks for the feedback! -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] powerpc/iommu: expose IOMMU page shift
On 06.10.2015 [14:19:43 +1100], David Gibson wrote: > On Fri, Oct 02, 2015 at 10:18:00AM -0700, Nishanth Aravamudan wrote: > > We will leverage this macro in the NVMe driver, which needs to know the > > configured IOMMU page shift to properly configure its device's page > > size. > > > > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > > > > --- > > Given this is available, it seems reasonable to expose -- and it doesn't > > really make sense to make the driver do a log2 call on the existing > > IOMMU_PAGE_SIZE() value. > > > > diff --git a/arch/powerpc/include/asm/iommu.h > > b/arch/powerpc/include/asm/iommu.h > > index ca18cff..6fdf857 100644 > > --- a/arch/powerpc/include/asm/iommu.h > > +++ b/arch/powerpc/include/asm/iommu.h > > @@ -36,6 +36,7 @@ > > #define IOMMU_PAGE_MASK_4K (~((1 << IOMMU_PAGE_SHIFT_4K) - 1)) > > #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K) > > > > +#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift > > #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift) > > Seems like it would be a touch safer to alter IOMMU_PAGE_SIZE so it > uses the new IOMMU_PAGE_SHIFT macro. Yes absolutely! Sorry, I initially didn't add the first macro, so didn't think that through. Will update. -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API
On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote: > Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define > with an #ifndef in common code? On Power, since it's technically variable, we'd need a function. So are you suggesting define'ing it to a function just on Power and leaving it a constant elsewhere? I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw. -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] powerpc/iommu: expose IOMMU page shift
On 12.10.2015 [09:03:52 -0700], Nishanth Aravamudan wrote: > On 06.10.2015 [14:19:43 +1100], David Gibson wrote: > > On Fri, Oct 02, 2015 at 10:18:00AM -0700, Nishanth Aravamudan wrote: > > > We will leverage this macro in the NVMe driver, which needs to know the > > > configured IOMMU page shift to properly configure its device's page > > > size. > > > > > > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > > > > > > --- > > > Given this is available, it seems reasonable to expose -- and it doesn't > > > really make sense to make the driver do a log2 call on the existing > > > IOMMU_PAGE_SIZE() value. > > > > > > diff --git a/arch/powerpc/include/asm/iommu.h > > > b/arch/powerpc/include/asm/iommu.h > > > index ca18cff..6fdf857 100644 > > > --- a/arch/powerpc/include/asm/iommu.h > > > +++ b/arch/powerpc/include/asm/iommu.h > > > @@ -36,6 +36,7 @@ > > > #define IOMMU_PAGE_MASK_4K (~((1 << IOMMU_PAGE_SHIFT_4K) - 1)) > > > #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K) > > > > > > +#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift > > > #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift) > > > > Seems like it would be a touch safer to alter IOMMU_PAGE_SIZE so it > > uses the new IOMMU_PAGE_SHIFT macro. > > Yes absolutely! Sorry, I initially didn't add the first macro, so didn't > think that through. Will update. Err, replied too quickly -- just got back from vacation -- this is from an old version of the patchset, we no longer introduce this macro. -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power
On 02.10.2015 [10:25:44 -0700], Christoph Hellwig wrote: > Hi Nishanth, > > please expose this value through the generic DMA API instead of adding > architecture specific hacks to drivers. Ok, I'm happy to do that instead -- what I struggled with is that I don't have enough knowledge of the various architectures to provide the right default implementation. It should be sufficient for the default to return PAGE_SHIFT, and on Power just override that to return the IOMMU table's page size? Since the only user will be the NVMe driver currently, that should be fine? Sorry for the less-than-ideal patch! -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/2] Fix NVMe driver support on Power with 32-bit DMA
We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). In this particular case, and generally, we want to use the IOMMU's page size for the default device page size, rather than the kernel's page size. This series consists of two patches, one of which exposes the IOMMU's page shift on Power (currently only the page size is exposed, and it seems unnecessary to ilog2 that value in the driver). The second patch leverages this value on Power in the NVMe driver. With these patches, an NVMe device survives our internal hardware exerciser; the kernel BUGs within a few seconds without the patch. 
[PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power
We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). In this particular case, and generally, we want to use the IOMMU's page size for the default device page size, rather than the kernel's page size. With this patch, an NVMe device survives our internal hardware exerciser; the kernel BUGs within a few seconds without the patch. Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c index 7920c27..969a95e 100644 --- a/drivers/block/nvme-core.c +++ b/drivers/block/nvme-core.c @@ -42,6 +42,7 @@ #include #include #include +#include #define NVME_MINORS(1U << MINORBITS) #define NVME_Q_DEPTH 1024 @@ -1680,6 +1681,11 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev) unsigned page_shift = PAGE_SHIFT; unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12; unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12; +#ifdef CONFIG_PPC64 + struct iommu_table *tbl = get_iommu_table_base(dev->dev); + if (tbl) + page_shift = IOMMU_PAGE_SHIFT(tbl); +#endif if (page_shift < dev_page_min) { dev_err(dev->dev, 
[PATCH 1/2] powerpc/iommu: expose IOMMU page shift
We will leverage this macro in the NVMe driver, which needs to know the configured IOMMU page shift to properly configure its device's page size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
Given this is available, it seems reasonable to expose -- and it doesn't really make sense to make the driver do a log2 call on the existing IOMMU_PAGE_SIZE() value.

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index ca18cff..6fdf857 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -36,6 +36,7 @@
 #define IOMMU_PAGE_MASK_4K	(~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
 #define IOMMU_PAGE_ALIGN_4K(addr)	_ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)

+#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
 #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
 #define IOMMU_PAGE_MASK(tblptr) (~((1 << (tblptr)->it_page_shift) - 1))
 #define IOMMU_PAGE_ALIGN(addr, tblptr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE(tblptr))
[PATCH 3/5 v2] powerpc/dma: implement per-platform dma_get_page_shift
The IOMMU page size is not always stored in struct iommu_table on Power. Specifically, if a device is configured for DDW (Dynamic DMA Windows, aka 64-bit direct DMA), the TCE (Translation Control Entry) size in use is stored in a special device property created at run-time by the DDW configuration code. DDW is a pseries-specific feature, so allow platforms to override the implementation of dma_get_page_shift if desired.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index cab6753..5c372e3 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,9 +78,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */

-	/* Platform set_dma_mask and dma_get_required_mask overrides */
+	/* Platform overrides */
 	int (*dma_set_mask)(struct device *dev, u64 dma_mask);
 	u64 (*dma_get_required_mask)(struct device *dev);
+	unsigned long (*dma_get_page_shift)(struct device *dev);

 	int (*probe)(void);
 	void (*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
 	struct iommu_table *tbl = get_iommu_table_base(dev);
+	if (ppc_md.dma_get_page_shift)
+		return ppc_md.dma_get_page_shift(dev);
 	if (tbl)
 		return tbl->it_page_shift;
 	return PAGE_SHIFT;
[PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA
We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries).

The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K and the kernel a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page size for the default device page size, rather than the kernel's page size.

This series consists of five patches:

1) add a generic dma_get_page_shift implementation that just returns PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's page shift, if available
3) allow further specific overriding on Power with machdep platform overrides
4) use the machdep override on pseries, as the DDW code puts the TCE shift in a special property and there is no IOMMU table available
5) leverage the new API in the NVMe driver

With these patches, an NVMe device survives our internal hardware exerciser; the kernel BUGs within a few seconds without the patch.

 arch/powerpc/include/asm/dma-mapping.h   |  3 +++
 arch/powerpc/include/asm/machdep.h       |  3 ++-
 arch/powerpc/kernel/dma.c                | 11 +++
 arch/powerpc/platforms/pseries/iommu.c   | 36 
 drivers/block/nvme-core.c                |  3 ++-
 include/asm-generic/dma-mapping-common.h |  7 +++
 6 files changed, 61 insertions(+), 2 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an arch-specific hack, expose the DMA page shift via a generic DMA API and override it on Power as needed.
[PATCH 4/5 v2] pseries/iommu: implement DDW-aware dma_get_page_shift
When DDW (Dynamic DMA Windows) is present for a device, we have stored the TCE (Translation Control Entry) size in a special device tree property. Check if we have enabled DDW for the device and return the TCE size from that property, if present. If the property isn't present, fall back to looking up the value in struct iommu_table. If we don't find an iommu_table, fall back to the kernel's page size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 	return dma_iommu_ops.get_required_mask(dev);
 }

+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+	struct iommu_table *tbl;
+
+	if (!disable_ddw && dev_is_pci(dev)) {
+		struct pci_dev *pdev = to_pci_dev(dev);
+		struct device_node *dn;
+
+		dn = pci_device_to_OF_node(pdev);
+
+		/* search upwards for ibm,dma-window */
+		for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+				dn = dn->parent)
+			if (of_get_property(dn, "ibm,dma-window", NULL))
+				break;
+		/*
+		 * if there is a DDW configuration, the TCE shift is stored in
+		 * the property
+		 */
+		if (dn && PCI_DN(dn)) {
+			const struct dynamic_dma_window_prop *direct64 =
+				of_get_property(dn, DIRECT64_PROPNAME, NULL);
+			if (direct64)
+				return be32_to_cpu(direct64->tce_shift);
+		}
+	}
+
+	tbl = get_iommu_table_base(dev);
+	if (tbl)
+		return tbl->it_page_shift;
+
+	return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries	NULL
 #define pci_dma_dev_setup_pSeries	NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 #define pci_dma_dev_setup_pSeriesLP	NULL
 #define dma_set_mask_pSeriesLP		NULL
 #define dma_get_required_mask_pSeriesLP	NULL
+#define dma_get_page_shift_pSeriesLP	NULL
 #endif /* !CONFIG_PCI */

 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
 		pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
 		ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
 		ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+		ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
 	} else {
 		pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
 		pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;
[PATCH 2/5 v2] powerpc/dma-mapping: override dma_get_page_shift
On Power, the kernel's page size can differ from the IOMMU's page size, so we need to override the generic implementation, which always returns the kernel's page size. Look up the IOMMU's page size from struct iommu_table, if available; fall back to the kernel's page size otherwise.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
 #define HAVE_ARCH_DMA_SET_MASK 1
 extern int dma_set_mask(struct device *dev, u64 dma_mask);

+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
 #include

 extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 }
 EXPORT_SYMBOL(dma_set_mask);

+unsigned long dma_get_page_shift(struct device *dev)
+{
+	struct iommu_table *tbl = get_iommu_table_base(dev);
+	if (tbl)
+		return tbl->it_page_shift;
+	return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
 u64 __dma_get_required_mask(struct device *dev)
 {
 	struct dma_map_ops *dma_ops = get_dma_ops(dev);
[PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API
Drivers like NVMe need to be able to determine the page size used for DMA transfers. Add a new API that defaults to returning PAGE_SHIFT on all architectures.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index b1bc954..86e4e97 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -355,4 +355,11 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
 }
 #endif

+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+	return PAGE_SHIFT;
+}
+#endif
+
 #endif
[PATCH 5/5 v2] drivers/nvme: default to the IOMMU page size
We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries).

The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA alignment matches the kernel's page alignment. On Power, the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K and the kernel a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the IOMMU's page size in the driver. And generally, the NVMe driver in this function should be using the IOMMU's page size for the default device page size, rather than the kernel's page size.

With this patch, an NVMe device survives our internal hardware exerciser; the kernel BUGs within a few seconds without the patch.

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page size lookup as a generic DMA API, rather than an architecture-specific hack.

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b97fc3f..c561137 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1713,7 +1714,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	u32 aqa;
 	u64 cap = readq(&dev->bar->cap);
 	struct nvme_queue *nvmeq;
-	unsigned page_shift = PAGE_SHIFT;
+	unsigned page_shift = dma_get_page_shift(dev->dev);
 	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
 	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
Re: [PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA
On 03.10.2015 [07:35:09 +1000], Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-02 at 14:04 -0700, Nishanth Aravamudan wrote:
> > Right, I did start with your advice and tried that approach, but it
> > turned out I was wrong about the actual issue at the time. The problem
> > for NVMe isn't actually the starting address alignment (which it can
> > handle not being aligned to the device's page size). It doesn't handle
> > ((addr + len) % dev_page_size != 0). That is, it's really a length
> > alignment issue.
> >
> > It seems incredibly device-specific to have an API into the DMA code
> > to request an end alignment -- no other device seems to have this
> > issue/design. If you think that's better, I can fiddle with that
> > instead.
> >
> > Sorry, I should have called this out better as an alternative
> > consideration.
>
> Nah, it's fine. Ok. Also, adding the alignment requirement to the API
> would have been a much more complex patch, since it would have had to
> be implemented for all archs.
>
> I think your current solution is fine.

Great, thanks. Also, while it's possible an alignment API would be more performant... we're already not using DDW on Power in this case, so performance is not a primary concern. We want to simply be functional/correct in this configuration.

-Nish
Re: [PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA
On 03.10.2015 [06:51:06 +1000], Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-02 at 13:09 -0700, Nishanth Aravamudan wrote:
>
> > 1) add a generic dma_get_page_shift implementation that just returns
> > PAGE_SHIFT
>
> So you chose to return the granularity of the iommu to the driver
> rather than providing a way for the driver to request a specific
> alignment for DMA mappings. Any specific reason?

Right, I did start with your advice and tried that approach, but it turned out I was wrong about the actual issue at the time. The problem for NVMe isn't actually the starting address alignment (which it can handle not being aligned to the device's page size). It doesn't handle ((addr + len) % dev_page_size != 0). That is, it's really a length alignment issue.

It seems incredibly device-specific to have an API into the DMA code to request an end alignment -- no other device seems to have this issue/design. If you think that's better, I can fiddle with that instead.

Sorry, I should have called this out better as an alternative consideration.

-Nish
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:
> On 9/27/15, Raghavendra K T wrote:
> > Problem description:
> > Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> > numbered (possibly) as 0,1,16,17. At a lower level, the chipid
> > got from the device tree is naturally mapped (directly) to nid.
>
> Interesting thing to play with, I'll try to test it on my POWER7 box,
> but it doesn't have the OPAL layer :(

Note that it's also interesting to try it under PowerVM, with odd NUMA topologies, and report any issues found :)

-Nish
Re: [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote:
> There is no change in the functionality.
>
> Signed-off-by: Raghavendra K T
> ---
>  arch/powerpc/mm/numa.c | 42 +-
>  1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d5e6eee..f84ed2f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
>  	}
>  }
>
> -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
> +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>   * info is found.
>   */
> -static int associativity_to_nid(const __be32 *associativity)
> +static int associativity_to_chipid(const __be32 *associativity)

This is confusing to me. This function is also used by the DLPAR code under PowerVM to indicate what node the CPU is on -- not a chip (which I don't believe is exposed at all under PowerVM).

>  {
> -	int nid = -1;
> +	int chipid = -1;
>
>  	if (min_common_depth == -1)
>  		goto out;
>
>  	if (of_read_number(associativity, 1) >= min_common_depth)
> -		nid = of_read_number(&associativity[min_common_depth], 1);
> +		chipid = of_read_number(&associativity[min_common_depth], 1);

Doesn't this, in the PAPR documentation, specifically refer to the NODE level domain?

>  	/* POWER4 LPAR uses 0x as invalid node */
> -	if (nid == 0x || nid >= MAX_NUMNODES)
> -		nid = -1;
> +	if (chipid == 0x || chipid >= MAX_NUMNODES)
> +		chipid = -1;
>
> -	if (nid > 0 &&
> +	if (chipid > 0 &&
>  	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
>  		/*
>  		 * Skip the length field and send start of associativity array
>  		 */
> -		initialize_distance_lookup_table(nid, associativity + 1);
> +		initialize_distance_lookup_table(chipid, associativity + 1);
>  	}
>
>  out:
> -	return nid;
> +	return chipid;
>  }
>
> -/* Returns the nid associated with the given device tree node,
> +/* Returns the chipid associated with the given device tree node,
>   * or -1 if not found.
>   */
> -static int of_node_to_nid_single(struct device_node *device)
> +static int of_node_to_chipid_single(struct device_node *device)
>  {
> -	int nid = -1;
> +	int chipid = -1;
>  	const __be32 *tmp;
>
>  	tmp = of_get_associativity(device);
>  	if (tmp)
> -		nid = associativity_to_nid(tmp);
> -	return nid;
> +		chipid = associativity_to_chipid(tmp);
> +	return chipid;
>  }
>
>  /* Walk the device tree upwards, looking for an associativity id */
> @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device)
>
>  	of_node_get(device);
>  	while (device) {
> -		nid = of_node_to_nid_single(device);
> +		nid = of_node_to_chipid_single(device);
>  		if (nid != -1)
>  			break;
>
> @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
>  }
>
>  /*
> - * This is like of_node_to_nid_single() for memory represented in the
> + * This is like of_node_to_chipid_single() for memory represented in the
>   * ibm,dynamic-reconfiguration-memory node.
>   */
>  static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
> @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu)
>  		goto out;
>  	}
>
> -	nid = of_node_to_nid_single(cpu);
> +	nid = of_node_to_chipid_single(cpu);
>
>  out_present:
>  	if (nid < 0 || !node_online(nid))
> @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void)
>
>  		cpu = of_get_cpu_node(i, NULL);
>  		BUG_ON(!cpu);
> -		nid = of_node_to_nid_single(cpu);
> +		nid = of_node_to_chipid_single(cpu);
>  		of_node_put(cpu);
>
>  		/*
> @@ -796,7 +796,7 @@ new_range:
>  		 * have associativity properties. If none, then
>  		 * everything goes to default_nid.
>  		 */
> -		nid = of_node_to_nid_single(memory);
> +		nid = of_node_to_chipid_single(memory);
>  		if (nid < 0)
>  			nid = default_nid;
>
> @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
>  		if ((scn_addr < start) || (scn_addr >= (start + size)))
>  			continue;
>
> -		nid = of_node_to_nid_single(memory);
> +		nid = of_node_to_chipid_single(memory);
>  		break;
>  	}
>
> @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)
>
>  		/* Use associativity from first thread for all siblings */
>  		vphn_get_associativity(cpu, associativity);
> -
Re: [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote:
> Create arrays that map serial nids and sparse chipids.
>
> Note: My original idea had only two arrays of chipid to nid map. The final
> code is inspired by driver/acpi/numa.c, which maps a proximity node to a
> logical node, by Takayoshi Kochi, and thus uses an additional chipid_map
> nodemask. The mask helps in finding the first unused nid easily, by
> knowing the first unset bit in the mask.
>
> No change in functionality.
>
> Signed-off-by: Raghavendra K T
> ---
>  arch/powerpc/mm/numa.c | 48 +++-
>  1 file changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index dd2073b..f015cad 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -63,6 +63,11 @@ static int form1_affinity;
>  static int distance_ref_points_depth;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static nodemask_t chipid_map = NODE_MASK_NONE;
> +static int chipid_to_nid_map[MAX_NUMNODES]
> +			= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };

Hrm, conceptually there are *more* chips than nodes, right? So what guarantees we won't see > MAX_NUMNODES chips?

> +static int nid_to_chipid_map[MAX_NUMNODES]
> +			= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>
>  /*
>   * Allocate node_to_cpumask_map based on number of available nodes
> @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>  	return 0;
>  }
>
> +int chipid_to_nid(int chipid)
> +{
> +	if (chipid < 0)
> +		return NUMA_NO_NODE;

Do you really want to support these cases? Or should they be bugs/warnings indicating that you got an unexpected input? Or at least WARN_ON_ONCE?

> +	return chipid_to_nid_map[chipid];
> +}
> +
> +int nid_to_chipid(int nid)
> +{
> +	if (nid < 0)
> +		return NUMA_NO_NODE;
> +	return nid_to_chipid_map[nid];
> +}
> +
> +static void __map_chipid_to_nid(int chipid, int nid)
> +{
> +	if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
> +	    || nid < chipid_to_nid_map[chipid])
> +		chipid_to_nid_map[chipid] = nid;
> +	if (nid_to_chipid_map[nid] == NUMA_NO_NODE
> +	    || chipid < nid_to_chipid_map[nid])
> +		nid_to_chipid_map[nid] = chipid;
> +}

chip <-> node mapping is a static (physical) concept, right? Should we emit some debugging if for some reason we get a runtime call to remap an already mapped chip to a new node?

> +
> +int map_chipid_to_nid(int chipid)
> +{
> +	int nid;
> +
> +	if (chipid < 0 || chipid >= MAX_NUMNODES)
> +		return NUMA_NO_NODE;
> +
> +	nid = chipid_to_nid_map[chipid];
> +	if (nid == NUMA_NO_NODE) {
> +		if (nodes_weight(chipid_map) >= MAX_NUMNODES)
> +			return NUMA_NO_NODE;

If you create a KVM guest with a bogus topology, doesn't this just start losing NUMA information for very high-noded guests?

> +		nid = first_unset_node(chipid_map);
> +		__map_chipid_to_nid(chipid, nid);
> +		node_set(nid, chipid_map);
> +	}
> +	return nid;
> +}
> +
>  int numa_cpu_lookup(int cpu)
>  {
>  	return numa_cpu_lookup_table[cpu];
> @@ -264,7 +311,6 @@ out:
>  	return chipid;
>  }
>
> -

stray change?

>  /* Return the nid from associativity */
>  static int associativity_to_nid(const __be32 *associativity)
>  {
> --
> 1.7.11.7
>
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:
> Problem description:
> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> numbered (possibly) as 0,1,16,17. At a lower level, the chipid
> got from the device tree is naturally mapped (directly) to nid.

chipid is an OPAL concept, I believe, and not documented in PAPR... How does this work under PowerVM?

> Potential side effects of that are:
>
> 1) There are several places in the kernel that assume serial node
> numbering, and memory allocations assume that all the nodes from
> 0 to (highest nid) exist, in turn ending up allocating memory for
> nodes that do not exist.
>
> 2) For virtualization use cases (such as qemu, libvirt, openstack),
> mapping the sparse nids of the host system to contiguous nids of the
> guest (numa affinity, placement) could be a challenge.
>
> Possible solutions:
>
> 1) Handle the memory allocations in the kernel case by case: though in
> some cases it is easy to achieve, some cases may be intrusive/not
> trivial. In the end it does not handle side effect (2) above.
>
> 2) Map the sparse chipid got from the device tree to a serial nid at
> the kernel level (the idea proposed in this series).
> Pro: it is more natural to handle at the kernel level than at a lower
> (OPAL) layer.
> Con: the chipid in the device tree is no longer the same as the nid in
> the kernel.

Is there any debugging/logging? Looks like not -- so how does a sysadmin map from firmware-provided values to the Linux values? That's going to make debugging of large systems (PowerVM or otherwise) less than pleasant, it seems? Possibly you could put something in sysfs?

-Nish
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
> Once we have made the distinction between nid and chipid,
> create a 1:1 mapping between them. This makes compacting the
> nids easy later.
>
> No functionality change.
>
> Signed-off-by: Raghavendra K T
> ---
>  arch/powerpc/mm/numa.c | 36 +---
>  1 file changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f84ed2f..dd2073b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -264,6 +264,17 @@ out:
>  	return chipid;
>  }
>
> +
> + /* Return the nid from associativity */
> +static int associativity_to_nid(const __be32 *associativity)
> +{
> +	int nid;
> +
> +	nid = associativity_to_chipid(associativity);
> +	return nid;
> +}

This is ultimately confusing. You are assigning the semantic return value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the variable naming be consistent?
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
> Once we have made the distinction between nid and chipid,
> create a 1:1 mapping between them. This makes compacting the
> nids easy later.

Didn't the previous patch just do the opposite of...

> @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device)
>
>  	of_node_get(device);
>  	while (device) {
> -		nid = of_node_to_chipid_single(device);
> +		nid = of_node_to_nid_single(device);
>  		if (nid != -1)
>  			break;
>
> @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
>  }
>
>  /*
> - * This is like of_node_to_chipid_single() for memory represented in the
> + * This is like of_node_to_nid_single() for memory represented in the
>   * ibm,dynamic-reconfiguration-memory node.
>   */
>  static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
> @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu)
>  		goto out;
>  	}
>
> -	nid = of_node_to_chipid_single(cpu);
> +	nid = of_node_to_nid_single(cpu);
>
>  out_present:
>  	if (nid < 0 || !node_online(nid))
> @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void)
>
>  		cpu = of_get_cpu_node(i, NULL);
>  		BUG_ON(!cpu);
> -		nid = of_node_to_chipid_single(cpu);
> +		nid = of_node_to_nid_single(cpu);
>  		of_node_put(cpu);
>
>  		/*
> @@ -796,7 +818,7 @@ new_range:
>  		 * have associativity properties. If none, then
>  		 * everything goes to default_nid.
>  		 */
> -		nid = of_node_to_chipid_single(memory);
> +		nid = of_node_to_nid_single(memory);
>  		if (nid < 0)
>  			nid = default_nid;
>
> @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
>  		if ((scn_addr < start) || (scn_addr >= (start + size)))
>  			continue;
>
> -		nid = of_node_to_chipid_single(memory);
> +		nid = of_node_to_nid_single(memory);
>  		break;
>  	}
>
> @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void)
>
>  		/* Use associativity from first thread for all siblings */
>  		vphn_get_associativity(cpu, associativity);
> -		new_nid = associativity_to_chipid(associativity);
> +		new_nid = associativity_to_nid(associativity);
>  		if (new_nid < 0 || !node_online(new_nid))
>  			new_nid = first_online_node;
>

Why is this churn useful?
Re: [RFC] powerpc/hugetlb: Add warning message when gpage allocation request fails
On 14.09.2015 [18:59:25 +0530], Aneesh Kumar K.V wrote:
> Anshuman Khandual writes:
>
> > When a 16GB huge page is requested on the POWER platform through the
> > kernel command line interface, it silently fails because of the lack of
> > any gigantic pages on the system, which the platform should have
> > communicated through 16GB memory blocks in the device tree during boot
> > time. For example
> >
> > [0.480940] HugeTLB registered 16 GB page size, pre-allocated 0 pages
> > [0.480945] HugeTLB registered 16 MB page size, pre-allocated 16 pages
> >
> > This adds a warning message during alloc_bootmem_huge_page request both
> > on book3e and book3s powerpc platforms. After this change
> >
> > [0.00] Gigantic HugeTLB page not available
> > [0.473417] HugeTLB registered 16 GB page size, pre-allocated 0 pages
> > [0.473423] HugeTLB registered 16 MB page size, pre-allocated 16 pages
>
> That info is already part of the second line, isn't it? i.e.
>
> [0.473417] HugeTLB registered 16 GB page size, pre-allocated 0 pages
>
> "pre-allocated 0 pages" indicates we didn't allocate anything. So why do
> we need to add more details to the kernel output?

Agreed, the '0 pages' message indicates we failed to pre-allocate any pages. The 'pre-allocate' messages are specifically about the kernel command-line requests for hugepages. Not sure I understand the motivation for this?

-Nish
Re: [PATCH v2] powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=
On 07.09.2015 [19:19:09 +1000], Michael Ellerman wrote:
> On Fri, 2015-09-04 at 11:22 -0700, Nishanth Aravamudan wrote:
> > The 32-bit TCE table initialization relies on the DMA window having a
> > size equal to a power of 2 (and checks for it explicitly). But
> > crashkernel= has no constraint that requires a power-of-2 be specified.
> > This causes the kdump kernel to fail to boot as none of the PCI devices
> > (including the disk controller) are successfully initialized.
> >
> > After this change, the PCI devices successfully set up the 32-bit TCE
> > table and kdump succeeds.
> >
> > Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
> > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> > Cc: sta...@vger.kernel.org # 4.2
> >
> > Michael, I kept this as a follow-on patch to my previous one. If you'd
> > rather I made a v3 of that patch with the two fixes combined, I can
> > resend.
>
> No that's fine. I guess they could have been a single fix, but it's
> not a big deal.

Ok, thanks for understanding.

> > Also, I fixed up the context on my end to be u64, but not sure
> > if that will match your tree (next doesn't have my prior patch applied
> > yet, that I can see).
>
> next isn't open yet, because we're still in the merge window, ie. rc1
> hasn't come out yet.

Ah, got it. I did mean your next branch, not linux-next itself, fwiw.

> This is a fix so it'll go to my fixes branch. Whether I send that to
> Linus before or after rc1 depends on how urgent the fixes people send
> me are. Sounds like you'd like these two to go in asap?

If at all possible, yes please.

-Nish
[PATCH] powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=
The 32-bit TCE table initialization relies on the DMA window having a size equal to a power of 2 (and checks for it explicitly). But crashkernel= has no constraint that requires a power-of-2 be specified. This causes the kdump kernel to fail to boot, as none of the PCI devices (including the disk controller) are successfully initialized.

After this change, the PCI devices successfully set up the 32-bit TCE table and kdump succeeds.

Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
Cc: sta...@vger.kernel.org # 4.2
---
Michael, I did this as a follow-on patch to my previous one. If you'd rather I made a v3 of that patch with the two fixes combined, I can resend.

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f1c74c28e564..73914f4bd1ab 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2084,6 +2084,12 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	 */
 	const u64 window_size = min((u64)pe->table_group.tce32_size,
 			memory_hotplug_max());
+	/*
+	 * crashkernel= specifies the kdump kernel's maximum memory at
+	 * some offset and there is no guarantee the result is a power
+	 * of 2, which will cause errors later.
+	 */
+	window_size = __rounddown_pow_of_two(window_size);

 	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
 			IOMMU_PAGE_SHIFT_4K,
[PATCH v2] powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=
The 32-bit TCE table initialization relies on the DMA window having a size equal to a power of 2 (and checks for it explicitly). But crashkernel= has no constraint that requires a power-of-2 be specified. This causes the kdump kernel to fail to boot as none of the PCI devices (including the disk controller) are successfully initialized. After this change, the PCI devices successfully set up the 32-bit TCE table and kdump succeeds. Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages") Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> Cc: sta...@vger.kernel.org # 4.2 --- Michael, I kept this as a follow-on patch to my previous one. If you'd rather I made a v3 of that patch with the two fixes combined, I can resend. Also, I fixed up the context on my end to be u64, but not sure if that will match your tree (next doesn't have my prior patch applied yet, that I can see). diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index f1c74c28e564..d5e635f2c3aa 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2078,12 +2078,18 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) struct iommu_table *tbl = NULL; long rc; /* +* crashkernel= specifies the kdump kernel's maximum memory at +* some offset and there is no guarantee that the result is a power +* of 2, which will cause errors later. +*/ + const u64 max_memory = __rounddown_pow_of_two(memory_hotplug_max()); + /* * In memory constrained environments, e.g. kdump kernel, the * DMA window can be larger than available memory, which will * cause errors later. 
*/ const u64 window_size = - min((u64)pe->table_group.tce32_size, memory_hotplug_max()); + min((u64)pe->table_group.tce32_size, max_memory); rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, IOMMU_PAGE_SHIFT_4K,
Re: powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=
On 04.09.2015 [20:01:22 +0200], Jan Stancek wrote: > On Fri, Sep 04, 2015 at 09:59:38AM -0700, Nishanth Aravamudan wrote: > > The 32-bit TCE table initialization relies on the DMA window having a > > size equal to a power of 2 (and checks for it explicitly). But > > crashkernel= has no constraint that requires a power-of-2 be specified. > > This causes the kdump kernel to fail to boot as none of the PCI devices > > (including the disk controller) are successfully initialized. > > > > After this change, the PCI devices successfully set up the 32-bit TCE > > table and kdump succeeds. > > > > Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate > > TCE pages") > > Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > > Cc: sta...@vger.kernel.org # 4.2 > > --- > > > > Michael, I did this as a follow-on patch to my previous one. If you'd > > rather I made a v3 of that patch with the two fixes combined, I can > > resend. > > > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > > b/arch/powerpc/platforms/powernv/pci-ioda.c > > index f1c74c28e564..73914f4bd1ab 100644 > > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > > @@ -2084,6 +2084,12 @@ static long > > pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) > > */ > > const u64 window_size = > > min((u64)pe->table_group.tce32_size, memory_hotplug_max()); > > + /* > > +* crashkernel= specifies the kdump kernel's maximum memory at > > +* some offset and there is no guaranteed the result is a power > > +* of 2, which will cause errors later. > > +*/ > > + window_size = __rounddown_pow_of_two(window_size); > > Hi, > > Just wondering if this won't hide potential alignment issues of > "table_group.tce32_size", that now trigger EINVAL down the road > and if it wouldn't be safer to round only "memory_hotplug_max()? 
> > const __u64 hotplug_max_p2 = __rounddown_pow_of_two(memory_hotplug_max()); > const __u64 window_size = > min((u64)pe->table_group.tce32_size, hotplug_max_p2); Fair point, v2 on its way.
Re: [PATCH v2] powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel
On 03.09.2015 [19:58:53 +1000], Michael Ellerman wrote: > On Wed, 2015-09-02 at 08:39 -0700, Nishanth Aravamudan wrote: > > On 02.09.2015 [19:00:31 +1000], Alexey Kardashevskiy wrote: > > > On 09/02/2015 11:11 AM, Nishanth Aravamudan wrote: > > > >diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > > > >b/arch/powerpc/platforms/powernv/pci-ioda.c > > > >index 85cbc96eff6c..0d7967e31169 100644 > > > >--- a/arch/powerpc/platforms/powernv/pci-ioda.c > > > >+++ b/arch/powerpc/platforms/powernv/pci-ioda.c > > > >@@ -2077,10 +2077,17 @@ static long > > > >pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) > > > > { > > > > struct iommu_table *tbl = NULL; > > > > long rc; > > > >+/* > > > >+ * In memory constrained environments, e.g. kdump kernel, the > > > >+ * DMA window can be larger than available memory, which will > > > >+ * cause errors later. > > > >+ */ > > > >+__u64 window_size = > > > > > > I asked for "const __u64" ;) > > > > I knew I'd forget something! > > Nish! In future please send a reply with the above comment, and then > the v2 as a separate mail, otherwise I have to manually edit out your > comment when applying. Sorry! Will amend my practices in the future. > > Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate > > TCE pages") > > That went into 4.2, so should this have a: > > Cc: sta...@vger.kernel.org # 4.2 Grr, another mess-up, you're right, it should. Can you amend on your end? Sorry, I was in too much of a hurry to get the fixlet out. -Nish
[PATCH v2] powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel
On 02.09.2015 [19:00:31 +1000], Alexey Kardashevskiy wrote: > On 09/02/2015 11:11 AM, Nishanth Aravamudan wrote: > >When attempting to kdump with the 4.2 kernel, we see for each PCI > >device: > > > > pci 0003:01 : [PE# 000] Assign DMA32 space > > pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..8000 > > pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22 > > PCI: Domain 0004 has 8 available 32-bit DMA segments > > PCI: 4 PE# for a total weight of 70 > > pci 0004:01 : [PE# 002] Assign DMA32 space > > pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..8000 > > pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22 > > pci 0004:0d : [PE# 005] Assign DMA32 space > > pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..8000 > > pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22 > > pci 0004:0e : [PE# 006] Assign DMA32 space > > pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..8000 > > pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22 > > pci 0004:10 : [PE# 008] Assign DMA32 space > > pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..8000 > > pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22 > > > >and eventually the kdump kernel fails to boot as none of the PCI devices > >(including the disk controller) are successfully initialized. > > > >The EINVAL response is because the DMA window (the 2GB base window) is > >larger than the kdump kernel's reserved memory (crashkernel=, in this > >case specified to be 1024M). The check in question, > > > > if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size)) > > > >is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust > >the caller to pass in a smaller window size if our maximum memory value > >is smaller than the DMA window. > > > >After this change, the PCI devices successfully set up the 32-bit TCE > >table and kdump succeeds. 
> > > >The problem was seen on a Firestone machine originally. > > > >Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate > >TCE pages") > >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> > > > >diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > >b/arch/powerpc/platforms/powernv/pci-ioda.c > >index 85cbc96eff6c..0d7967e31169 100644 > >--- a/arch/powerpc/platforms/powernv/pci-ioda.c > >+++ b/arch/powerpc/platforms/powernv/pci-ioda.c > >@@ -2077,10 +2077,17 @@ static long > >pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) > > { > > struct iommu_table *tbl = NULL; > > long rc; > >+/* > >+ * In memory constrained environments, e.g. kdump kernel, the > >+ * DMA window can be larger than available memory, which will > >+ * cause errors later. > >+ */ > >+__u64 window_size = > > I asked for "const __u64" ;) I knew I'd forget something! When attempting to kdump with the 4.2 kernel, we see for each PCI device: pci 0003:01 : [PE# 000] Assign DMA32 space pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..8000 pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22 PCI: Domain 0004 has 8 available 32-bit DMA segments PCI: 4 PE# for a total weight of 70 pci 0004:01 : [PE# 002] Assign DMA32 space pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..8000 pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22 pci 0004:0d : [PE# 005] Assign DMA32 space pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..8000 pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22 pci 0004:0e : [PE# 006] Assign DMA32 space pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..8000 pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22 pci 0004:10 : [PE# 008] Assign DMA32 space pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..8000 pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22 and eventually the kdump kernel fails to boot as none of the PCI devices 
(including the disk controller) are successfully initialized. The EINVAL response is because the DMA window (the 2GB base window) is larger than the kdump kernel's reserved memory (crashkernel=, in this case specified to be 1024M). The check in question, if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size)) is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust the caller to pass in a smaller window size if our maximum memory value is smaller than the DMA window.
[PATCH] powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel
When attempting to kdump with the 4.2 kernel, we see for each PCI device: pci 0003:01 : [PE# 000] Assign DMA32 space pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..8000 pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22 PCI: Domain 0004 has 8 available 32-bit DMA segments PCI: 4 PE# for a total weight of 70 pci 0004:01 : [PE# 002] Assign DMA32 space pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..8000 pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22 pci 0004:0d : [PE# 005] Assign DMA32 space pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..8000 pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22 pci 0004:0e : [PE# 006] Assign DMA32 space pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..8000 pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22 pci 0004:10 : [PE# 008] Assign DMA32 space pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..8000 pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22 and eventually the kdump kernel fails to boot as none of the PCI devices (including the disk controller) are successfully initialized. The EINVAL response is because the DMA window (the 2GB base window) is larger than the kdump kernel's reserved memory (crashkernel=, in this case specified to be 1024M). The check in question, if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size)) is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust the caller to pass in a smaller window size if our maximum memory value is smaller than the DMA window. After this change, the PCI devices successfully set up the 32-bit TCE table and kdump succeeds. The problem was seen on a Firestone machine originally. 
Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages") Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 85cbc96eff6c..0d7967e31169 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2077,10 +2077,17 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) { struct iommu_table *tbl = NULL; long rc; + /* +* In memory constrained environments, e.g. kdump kernel, the +* DMA window can be larger than available memory, which will +* cause errors later. +*/ + __u64 window_size = + min((u64)pe->table_group.tce32_size, memory_hotplug_max()); rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, IOMMU_PAGE_SHIFT_4K, - pe->table_group.tce32_size, + window_size, POWERNV_IOMMU_DEFAULT_LEVELS, &tbl); if (rc) { pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
Re: [PATCH] openvswitch: make for_each_node loops work with sparse numa systems
On 21.07.2015 [10:32:34 -0500], Chris J Arges wrote: Some architectures like POWER can have a NUMA node_possible_map that contains sparse entries. This causes memory corruption with openvswitch since it allocates flow_cache with a multiple of num_possible_nodes() and Couldn't this also be fixed by just allocating with a multiple of nr_node_ids (which seems to have been the original intent all along)? You could then make your stats array be sparse or not. assumes the node variable returned by for_each_node will index into flow->stats[node]. For example, if node_possible_map is 0x30003, this patch will map node to node_cnt as follows: 0,1,16,17 => 0,1,2,3 The crash was noticed after 3af229f2 was applied as it changed the node_possible_map to match node_online_map on boot. Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861 My concern with this version of the fix is that you're relying on, implicitly, the order of for_each_node's iteration corresponding to the entries in stats 1:1. But what about node hotplug? It seems better to have the enumeration of the stats array match the topology accurately, or to maintain some sort of internal map in the OVS code between the NUMA node and the entry in the stats array? 
I'm willing to be convinced otherwise, though :) -Nish Signed-off-by: Chris J Arges chris.j.ar...@canonical.com --- net/openvswitch/flow.c | 10 ++ net/openvswitch/flow_table.c | 18 +++--- 2 files changed, 17 insertions(+), 11 deletions(-) diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index bc7b0ab..425d45d 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -134,14 +134,14 @@ void ovs_flow_stats_get(const struct sw_flow *flow, struct ovs_flow_stats *ovs_stats, unsigned long *used, __be16 *tcp_flags) { - int node; + int node, node_cnt = 0; *used = 0; *tcp_flags = 0; memset(ovs_stats, 0, sizeof(*ovs_stats)); for_each_node(node) { - struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]); + struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node_cnt]); if (stats) { /* Local CPU may write on non-local stats, so we must @@ -155,16 +155,17 @@ void ovs_flow_stats_get(const struct sw_flow *flow, ovs_stats->n_bytes += stats->byte_count; spin_unlock_bh(stats->lock); } + node_cnt++; } } /* Called with ovs_mutex. 
*/ void ovs_flow_stats_clear(struct sw_flow *flow) { - int node; + int node, node_cnt = 0; for_each_node(node) { - struct flow_stats *stats = ovsl_dereference(flow->stats[node]); + struct flow_stats *stats = ovsl_dereference(flow->stats[node_cnt]); if (stats) { spin_lock_bh(stats->lock); @@ -174,6 +175,7 @@ void ovs_flow_stats_clear(struct sw_flow *flow) stats->tcp_flags = 0; spin_unlock_bh(stats->lock); } + node_cnt++; } } diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c index 4613df8..5d10c54 100644 --- a/net/openvswitch/flow_table.c +++ b/net/openvswitch/flow_table.c @@ -77,7 +77,7 @@ struct sw_flow *ovs_flow_alloc(void) { struct sw_flow *flow; struct flow_stats *stats; - int node; + int node, node_cnt = 0; flow = kmem_cache_alloc(flow_cache, GFP_KERNEL); if (!flow) @@ -99,9 +99,11 @@ struct sw_flow *ovs_flow_alloc(void) RCU_INIT_POINTER(flow->stats[0], stats); - for_each_node(node) + for_each_node(node) { if (node != 0) - RCU_INIT_POINTER(flow->stats[node], NULL); + RCU_INIT_POINTER(flow->stats[node_cnt], NULL); + node_cnt++; + } return flow; err: @@ -139,15 +141,17 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets) static void flow_free(struct sw_flow *flow) { - int node; + int node, node_cnt = 0; if (ovs_identifier_is_key(flow->id)) kfree(flow->id.unmasked_key); kfree((struct sw_flow_actions __force *)flow->sf_acts); - for_each_node(node) - if (flow->stats[node]) + for_each_node(node) { + if (flow->stats[node_cnt]) kmem_cache_free(flow_stats_cache, - (struct flow_stats __force *)flow->stats[node]); + (struct flow_stats __force *)flow->stats[node_cnt]); + node_cnt++; + } kmem_cache_free(flow_cache, flow); } -- 1.9.1
Re: [PATCH] openvswitch: make for_each_node loops work with sparse numa systems
On 21.07.2015 [11:30:58 -0500], Chris J Arges wrote: On Tue, Jul 21, 2015 at 09:24:18AM -0700, Nishanth Aravamudan wrote: On 21.07.2015 [10:32:34 -0500], Chris J Arges wrote: Some architectures like POWER can have a NUMA node_possible_map that contains sparse entries. This causes memory corruption with openvswitch since it allocates flow_cache with a multiple of num_possible_nodes() and Couldn't this also be fixed by just allocating with a multiple of nr_node_ids (which seems to have been the original intent all along)? You could then make your stats array be sparse or not. Yea originally this is what I did, but I thought it would be wasting memory. assumes the node variable returned by for_each_node will index into flow->stats[node]. For example, if node_possible_map is 0x30003, this patch will map node to node_cnt as follows: 0,1,16,17 => 0,1,2,3 The crash was noticed after 3af229f2 was applied as it changed the node_possible_map to match node_online_map on boot. Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861 My concern with this version of the fix is that you're relying on, implicitly, the order of for_each_node's iteration corresponding to the entries in stats 1:1. But what about node hotplug? It seems better to have the enumeration of the stats array match the topology accurately, or to maintain some sort of internal map in the OVS code between the NUMA node and the entry in the stats array? I'm willing to be convinced otherwise, though :) -Nish Nish, The method I described should work for hotplug since it's using possible map which AFAIK is static rather than the online map. Oh you're right, I'm sorry!
Re: [PATCH v2] openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
On 21.07.2015 [12:36:33 -0500], Chris J Arges wrote: Some architectures like POWER can have a NUMA node_possible_map that contains sparse entries. This causes memory corruption with openvswitch since it allocates flow_cache with a multiple of num_possible_nodes() and assumes the node variable returned by for_each_node will index into flow->stats[node]. Use nr_node_ids to allocate a maximal sparse array instead of num_possible_nodes(). The crash was noticed after 3af229f2 was applied as it changed the node_possible_map to match node_online_map on boot. Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861 Signed-off-by: Chris J Arges chris.j.ar...@canonical.com Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com --- net/openvswitch/flow_table.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c index 4613df8..6552394 100644 --- a/net/openvswitch/flow_table.c +++ b/net/openvswitch/flow_table.c @@ -752,7 +752,7 @@ int ovs_flow_init(void) BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long)); flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow) -+ (num_possible_nodes() ++ (nr_node_ids * sizeof(struct flow_stats *)), 0, 0, NULL); if (flow_cache == NULL) -- 1.9.1
Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
On 15.07.2015 [16:35:16 -0400], Tejun Heo wrote: Hello, On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote: we currently emit at boot: [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 After this commit, we correctly emit: [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 JFYI, the numbers in the brackets aren't NUMA node numbers but percpu allocation group numbers and they're not split according to nodes but percpu allocation units. In both cases, there are two units each serving 0-3 and 4-7. In the above case, because it wasn't being fed the correct NUMA information, both got assigned to the same group. In the latter, they got assigned to different ones but even then if the group numbers match NUMA node numbers, that's just a coincidence. Ok, thank you for clarifying! From a correctness perspective, even if the numbers don't match NUMA nodes, should we expect the grouping to be split along NUMA topology? -Nish
Re: [PATCH 5/6] [RFC] crypto/testmgr: add null test for 842 algorithm
On 13.07.2015 [17:05:36 -0700], Nishanth Aravamudan wrote: On 04.07.2015 [15:24:53 +0800], Herbert Xu wrote: On Thu, Jul 02, 2015 at 03:41:19PM -0700, Nishanth Aravamudan wrote: Currently, when the nx-842-pseries driver loads, the following message is emitted: alg: No test for 842 (842-nx) It seems like the simplest way to fix this message (other than adding a proper test) is to just insert the null test into the list in the testmgr. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Please add some real test vectors instead. Apologies, hit send too fast. I'll work with Dan on this when he gets back from vacation. -Nish
Re: [PATCH 5/6] [RFC] crypto/testmgr: add null test for 842 algorithm
On 04.07.2015 [15:24:53 +0800], Herbert Xu wrote: On Thu, Jul 02, 2015 at 03:41:19PM -0700, Nishanth Aravamudan wrote: Currently, when the nx-842-pseries driver loads, the following message is emitted: alg: No test for 842 (842-nx) It seems like the simplest way to fix this message (other than adding a proper test) is to just insert the null test into the list in the testmgr. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Please add some real test vectors instead.
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote: On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote: On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote: Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we have an ordering issue during boot with early calls to cpu_to_node(). now that .. implies we changed something and broke this. What commit was it that changed the behaviour? Well, that's something I'm trying to still unearth. In the commits before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID), the dmesg reports: pcpu-alloc: [0] 0 1 2 3 4 5 6 7 Ok, I did a bisection, and it seems like prior to commit 1a4d76076cda69b0abf15463a8cebc172406da25 (percpu: implement asynchronous chunk population), we emitted the above, e.g.: pcpu-alloc: [0] 0 1 2 3 4 5 6 7 And after that commit, we emitted: pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 I'm not exactly sure why that changed, but I'm still reading/understanding the commit. Tejun might be able to explain. Tejun, for reference, I noticed on Power systems since the above-mentioned commit, pcpu-alloc is not reflecting the topology of the system correctly -- that is, the pcpu areas are all on node 0 unconditionally (based upon pcpu-alloc's output). Prior to that, there was just one group, it seems like, which completely ignored the NUMA topology. Is this just an ordering thing that changed with the introduction of the async code? Thanks, Nish
Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
On 08.07.2015 [18:22:09 -0700], David Rientjes wrote: On Thu, 2 Jul 2015, Nishanth Aravamudan wrote: Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we have an ordering issue during boot with early calls to cpu_to_node(). The value returned by those calls now depend on the per-cpu area being setup, but that is not guaranteed to be the case during boot. Instead, we need to add an early_cpu_to_node() which doesn't use the per-CPU area and call that from certain spots that are known to invoke cpu_to_node() before the per-CPU areas are not configured. On an example 2-node NUMA system with the following topology: available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 2029 MB node 0 free: 1753 MB node 1 cpus: 4 5 6 7 node 1 size: 2045 MB node 1 free: 1945 MB node distances: node 0 1 0: 10 40 1: 40 10 we currently emit at boot: [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 After this commit, we correctly emit: [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 5f1048e..f2c4c89 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus) extern int __node_distance(int, int); #define node_distance(a, b) __node_distance(a, b) +extern int early_cpu_to_node(int); + extern void __init dump_numa_cpu_topology(void); extern int sysfs_add_device_to_node(struct device *dev, int nid); diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index c69671c..23a2cf3 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p) static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align) { - return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align, - __pa(MAX_DMA_ADDRESS)); + return 
__alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size, + align, __pa(MAX_DMA_ADDRESS)); } static void __init pcpu_fc_free(void *ptr, size_t size) @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size) static int pcpu_cpu_distance(unsigned int from, unsigned int to) { - if (cpu_to_node(from) == cpu_to_node(to)) + if (early_cpu_to_node(from) == early_cpu_to_node(to)) return LOCAL_DISTANCE; else return REMOTE_DISTANCE; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 5e80621..9ffabf4 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node) cpumask_set_cpu(cpu, node_to_cpumask_map[node]); } +int early_cpu_to_node(int cpu) +{ + return numa_cpu_lookup_table[cpu]; +} + #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { early_cpu_to_node() looks like it's begging to be __init since we shouldn't have a need to reference to numa_cpu_lookup_table after boot and that appears like it can be done if pcpu_cpu_distance() is made __init in this patch and smp_prepare_boot_cpu() is made __init in the next patch. So I think this is fine, but those functions and things like reset_numa_cpu_lookup_table() should be in init.text. Yep, that makes total sense! After the percpu areas on initialized and cpu_to_node() is correct, it would be really nice to be able to make numa_cpu_lookup_table[] be __initdata since it shouldn't be necessary anymore. That probably has cpu callbacks that need to be modified to no longer look at numa_cpu_lookup_table[] or pass the value in, but it would make it much cleaner. Then nobody will have to worry about figuring out whether early_cpu_to_node() or cpu_to_node() is the right one to call. When I worked on the original pcpu patches for power, I wanted to do this, but got myself confused and never came back to it. Thank you for suggesting it and I'll work on it soon. 
-Nish
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote: On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote: Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we have an ordering issue during boot with early calls to cpu_to_node(). now that .. implies we changed something and broke this. What commit was it that changed the behaviour? Well, that's something I'm trying to still unearth. In the commits before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID), the dmesg reports: pcpu-alloc: [0] 0 1 2 3 4 5 6 7 At least prior to 8c272261194d, this might have been due to the old powerpc-specific cpu_to_node(): static inline int cpu_to_node(int cpu) { int nid; nid = numa_cpu_lookup_table[cpu]; /* * During early boot, the numa-cpu lookup table might not have been * setup for all CPUs yet. In such cases, default to node 0. */ return (nid < 0) ? 0 : nid; } which might imply that no one cares or that simply no one noticed. The value returned by those calls now depend on the per-cpu area being setup, but that is not guaranteed to be the case during boot. Instead, we need to add an early_cpu_to_node() which doesn't use the per-CPU area and call that from certain spots that are known to invoke cpu_to_node() before the per-CPU areas are not configured. On an example 2-node NUMA system with the following topology: available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 2029 MB node 0 free: 1753 MB node 1 cpus: 4 5 6 7 node 1 size: 2045 MB node 1 free: 1945 MB node distances: node 0 1 0: 10 40 1: 40 10 we currently emit at boot: [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 After this commit, we correctly emit: [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 So it looks fairly sane, and I guess it's a bug fix. But I'm a bit reluctant to put it in straight away without some time in next. 
I'm fine with that -- it could use some more extensive testing, admittedly (I only have been able to verify the pcpu areas are being correctly allocated on the right node so far). I still need to test with hotplug and things like that. Hence the RFC. It looks like the symptom is that the per-cpu areas are all allocated on node 0, is that all that goes wrong? Yes, that's the symptom. I cc'd a few folks to see if they could help indicate the performance implications of such a setup -- sorry, I should have been more explicit about that. Thanks, Nish
Re: [PATCH 6/6] nx-842-platform: if NX842 platform drivers are not modules, don't try to load them
On 06.07.2015 [16:13:07 +0800], Herbert Xu wrote: On Thu, Jul 02, 2015 at 03:42:26PM -0700, Nishanth Aravamudan wrote: Based off the CONFIG_SPU_FS_MODULE code, only attempt to load platform modules if the nx-842 pseries/powernv drivers are built as modules. Otherwise, if CONFIG_DEV_NX_COMPRESS=y, CONFIG_DEV_NX_COMPRESS_PSERIES=y, CONFIG_DEV_NX_POWERNV=y, the following message is emitted at boot: nx_compress: no nx842 driver found. even though the drivers successfully loads. This is because in the =y case, the module_init() calls get converted to initcalls and the nx842_init() runs before the platform driver nx842_pseries_init() or nx842_powernv_init() functions, which are what normally set the static platform driver. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Cc: Dan Streetman ddstr...@us.ibm.com Cc: Herbert Xu herb...@gondor.apana.org.au Cc: David S. Miller da...@davemloft.net Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Ugh, I think this whole thing is redundant. The whole point of the crypto API is to allow the coexistence of multiple underlying implementations. Sure, that makes sense -- sorry, I was picking this up while Dan was on vacation. Will provide a better v2. Please get rid of nx-842-platform.c completely and move the crypto registration into the individual platform drivers. That is, powernv and pseries should each register their own crypto driver. They can of course share a common set of crypto code which can live in its own module. There should be no need for mucking with module reference counts at all. Will do, thanks! -Nish
[PATCH v2] crypto/nx-842-{powerpc,pseries}: reduce chattiness of platform drivers
On 03.07.2015 [11:30:32 +1000], Michael Ellerman wrote: On Thu, 2015-07-02 at 15:40 -0700, Nishanth Aravamudan wrote: While we never would successfully load on the wrong machine type, there is extra output by default regardless of machine type. For instance, on a PowerVM LPAR, we see the following: nx_compress_powernv: loading nx_compress_powernv: no coprocessors found even though those coprocessors could never be found. Similar pseries messages are printed on powernv. I know I've been converting init calls to machine_initcalls() to avoid these sort of issues in platform code. But for a driver it should be trivial for it to only probe when the hardware is found. By which I mean I think we shouldn't need these. Ok. diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c index 33b3b0abf4ae..6b5e5143c95b 100644 --- a/drivers/crypto/nx/nx-842-powernv.c +++ b/drivers/crypto/nx/nx-842-powernv.c @@ -594,6 +594,9 @@ static __init int nx842_powernv_init(void) BUILD_BUG_ON(DDE_BUFFER_ALIGN % DDE_BUFFER_SIZE_MULT); BUILD_BUG_ON(DDE_BUFFER_SIZE_MULT % DDE_BUFFER_LAST_MULT); + if (!machine_is(powernv)) + return -ENODEV; + pr_info("loading\n"); This is just too chatty, drop it. for_each_compatible_node(dn, NULL, "ibm,power-nx") It shouldn't be printing anything unless it finds some devices in this loop. And you should drop the print in here: if (!nx842_ct) { pr_err("no coprocessors found\n"); return -ENODEV; } And that should mean no output unless hardware is found I think? Yep, will adjust. @@ -625,6 +628,9 @@ static void __exit nx842_powernv_exit(void) { struct nx842_coproc *coproc, *n; + if (!machine_is(powernv)) + return; You shouldn't need to touch the exit paths if the drivers were never loaded? Duh, yep. 
diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c index b84b0ceeb46e..75a7bfdc160e 100644 --- a/drivers/crypto/nx/nx-842-pseries.c +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -1091,6 +1091,9 @@ static int __init nx842_pseries_init(void) struct nx842_devdata *new_devdata; int ret; + if (!machine_is(pseries)) + return -ENODEV; + pr_info("Registering IBM Power 842 compression driver\n"); Again this is too chatty, just remove it. Will do. if (!of_find_compatible_node(NULL, NULL, "ibm,compression")) return -ENODEV; That should do the trick shouldn't it? Yep, I think so. Thanks Michael! While we never would successfully load on the wrong machine type, there is extra output by default regardless of machine type. For instance, on a PowerVM LPAR, we see the following: nx_compress_powernv: loading nx_compress_powernv: no coprocessors found even though those coprocessors could never be found. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Cc: Dan Streetman ddstr...@us.ibm.com Cc: Herbert Xu herb...@gondor.apana.org.au Cc: David S.
Miller da...@davemloft.net Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org --- v2: Rather than not loading, just reduce the verbosity drivers/crypto/nx/nx-842-powernv.c | 10 +- drivers/crypto/nx/nx-842-pseries.c | 3 --- 2 files changed, 1 insertion(+), 12 deletions(-) diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c index 33b3b0abf4ae..7e474562058d 100644 --- a/drivers/crypto/nx/nx-842-powernv.c +++ b/drivers/crypto/nx/nx-842-powernv.c @@ -594,15 +594,11 @@ static __init int nx842_powernv_init(void) BUILD_BUG_ON(DDE_BUFFER_ALIGN % DDE_BUFFER_SIZE_MULT); BUILD_BUG_ON(DDE_BUFFER_SIZE_MULT % DDE_BUFFER_LAST_MULT); - pr_info("loading\n"); - for_each_compatible_node(dn, NULL, "ibm,power-nx") nx842_powernv_probe(dn); - if (!nx842_ct) { - pr_err("no coprocessors found\n"); + if (!nx842_ct) return -ENODEV; - } if (!nx842_platform_driver_set(&nx842_powernv_driver)) { struct nx842_coproc *coproc, *n; @@ -615,8 +611,6 @@ static __init int nx842_powernv_init(void) return -EEXIST; } - pr_info("loaded\n"); - return 0; } module_init(nx842_powernv_init); @@ -631,7 +625,5 @@ static void __exit nx842_powernv_exit(void) list_del(&coproc->list); kfree(coproc); } - - pr_info("unloaded\n"); } module_exit(nx842_powernv_exit); diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c index b84b0ceeb46e..d44524da6589 100644 --- a/drivers/crypto/nx/nx-842-pseries.c +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -1091,8 +1091,6 @@ static int __init nx842_pseries_init(void) struct nx842_devdata *new_devdata; int ret; - pr_info("Registering IBM Power 842 compression
[PATCH 5/6] [RFC] crypto/testmgr: add null test for 842 algorithm
Currently, when the nx-842-pseries driver loads, the following message is emitted: alg: No test for 842 (842-nx) It seems like the simplest way to fix this message (other than adding a proper test) is to just insert the null test into the list in the testmgr. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com --- crypto/testmgr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/crypto/testmgr.c b/crypto/testmgr.c index d0a42bd3aae9..ff0f76e0d0b4 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -1982,6 +1982,9 @@ static int alg_test_null(const struct alg_test_desc *desc, /* Please keep this list sorted by algorithm name. */ static const struct alg_test_desc alg_test_descs[] = { { + .alg = "842", + .test = alg_test_null, + }, { .alg = "__cbc-cast5-avx", .test = alg_test_null, }, { -- 2.1.4
[PATCH 6/6] nx-842-platform: if NX842 platform drivers are not modules, don't try to load them
Based off the CONFIG_SPU_FS_MODULE code, only attempt to load platform modules if the nx-842 pseries/powernv drivers are built as modules. Otherwise, if CONFIG_DEV_NX_COMPRESS=y, CONFIG_DEV_NX_COMPRESS_PSERIES=y, CONFIG_DEV_NX_POWERNV=y, the following message is emitted at boot: "nx_compress: no nx842 driver found." even though the drivers successfully load. This is because in the =y case, the module_init() calls get converted to initcalls and nx842_init() runs before the platform driver nx842_pseries_init() or nx842_powernv_init() functions, which are what normally set the static platform driver. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Cc: Dan Streetman ddstr...@us.ibm.com Cc: Herbert Xu herb...@gondor.apana.org.au Cc: David S. Miller da...@davemloft.net Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org --- drivers/crypto/nx/nx-842-platform.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/crypto/nx/nx-842-platform.c b/drivers/crypto/nx/nx-842-platform.c index 664f13dd06ed..5363c72593b4 100644 --- a/drivers/crypto/nx/nx-842-platform.c +++ b/drivers/crypto/nx/nx-842-platform.c @@ -53,6 +53,7 @@ void nx842_platform_driver_unset(struct nx842_driver *_driver) } EXPORT_SYMBOL_GPL(nx842_platform_driver_unset); +#if defined(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES_MODULE) || defined(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV_MODULE) bool nx842_platform_driver_get(void) { bool ret = false; @@ -66,7 +67,6 @@ bool nx842_platform_driver_get(void) return ret; } -EXPORT_SYMBOL_GPL(nx842_platform_driver_get); void nx842_platform_driver_put(void) { @@ -77,6 +77,17 @@ void nx842_platform_driver_put(void) spin_unlock(&driver_lock); } +#else +bool nx842_platform_driver_get(void) +{ + return true; +} + +void nx842_platform_driver_put(void) +{ +} +#endif +EXPORT_SYMBOL_GPL(nx842_platform_driver_get); EXPORT_SYMBOL_GPL(nx842_platform_driver_put); MODULE_LICENSE("GPL"); -- 2.1.4
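For what it's worth, the compile-time split the patch introduces is a common kernel pattern: when the platform drivers can never arrive later as modules, the get/put pair degenerates into no-ops chosen by the preprocessor. A minimal userspace sketch of that shape (the names and the `DRIVERS_ARE_MODULES` switch are illustrative, not the kernel's symbols):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the static platform driver pointer that the
 * pseries/powernv module_init() paths normally fill in. */
static const char *platform_driver;

static void platform_driver_set(const char *drv) { platform_driver = drv; }

#ifdef DRIVERS_ARE_MODULES
/* Module case: the driver may not be loaded yet, so "get" can fail
 * (the real code also takes a module reference here). */
static bool platform_driver_get(void) { return platform_driver != NULL; }
#else
/* Built-in case: initcall ordering means the platform init may simply
 * not have run yet, so always succeed instead of failing spuriously. */
static bool platform_driver_get(void) { return true; }
#endif
```

In the built-in build the stub avoids exactly the failure mode described above: `nx842_init()` running before the platform initcalls no longer observes a "missing" driver.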
[RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we have an ordering issue during boot with early calls to cpu_to_node(). The value returned by those calls now depends on the per-cpu area being set up, but that is not guaranteed to be the case during boot. Instead, we need to add an early_cpu_to_node() which doesn't use the per-CPU area and call that from certain spots that are known to invoke cpu_to_node() before the per-CPU areas are configured. On an example 2-node NUMA system with the following topology: available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 2029 MB node 0 free: 1753 MB node 1 cpus: 4 5 6 7 node 1 size: 2045 MB node 1 free: 1945 MB node distances: node 0 1 0: 10 40 1: 40 10 we currently emit at boot: [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 After this commit, we correctly emit: [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 5f1048e..f2c4c89 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus) extern int __node_distance(int, int); #define node_distance(a, b) __node_distance(a, b) +extern int early_cpu_to_node(int); + extern void __init dump_numa_cpu_topology(void); extern int sysfs_add_device_to_node(struct device *dev, int nid); diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index c69671c..23a2cf3 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p) static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align) { - return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align, - __pa(MAX_DMA_ADDRESS)); + return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size, + align, __pa(MAX_DMA_ADDRESS)); } static void __init pcpu_fc_free(void *ptr, size_t size) @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size) static int pcpu_cpu_distance(unsigned int from, unsigned int to) { - if (cpu_to_node(from) == cpu_to_node(to)) + if (early_cpu_to_node(from) == early_cpu_to_node(to)) return LOCAL_DISTANCE; else return REMOTE_DISTANCE; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 5e80621..9ffabf4 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node) cpumask_set_cpu(cpu, node_to_cpumask_map[node]); } +int early_cpu_to_node(int cpu) +{ + return numa_cpu_lookup_table[cpu]; +} + #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) {
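The ordering bug the patch fixes can be modeled in a few lines: cpu_to_node() reads a per-CPU variable that is zero-filled until the per-CPU areas are set up, while early_cpu_to_node() consults the firmware-derived lookup table that is valid from very early boot. A toy userspace model (plain arrays standing in for the real per-CPU machinery; the 8-CPU/2-node layout mirrors the example topology above):

```c
#include <assert.h>

#define NR_CPUS 8

/* Firmware-derived mapping, valid from very early boot. */
static int numa_cpu_lookup_table[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Per-CPU copy; zero-initialized until per-CPU setup runs,
 * so every CPU appears to be on node 0 before then. */
static int percpu_numa_node[NR_CPUS];

static int early_cpu_to_node(int cpu) { return numa_cpu_lookup_table[cpu]; }
static int cpu_to_node(int cpu)       { return percpu_numa_node[cpu]; }

/* Models the point during boot where the per-CPU areas are populated. */
static void setup_per_cpu_areas(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		percpu_numa_node[cpu] = numa_cpu_lookup_table[cpu];
}
```

Before setup_per_cpu_areas() runs, cpu_to_node(5) wrongly reports node 0, which is exactly why pcpu_fc_alloc() lands every chunk on node 0; early_cpu_to_node(5) is already correct.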
[RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table
A simple move to a wrapper function around numa_cpu_lookup_table, now that power has the early_cpu_to_node() API. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..7bf333b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, early_cpu_to_node(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(early_cpu_to_node(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(early_cpu_to_node(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); }
[PATCH 0/6] drivers/nx-842: reduce verbosity of logging
Currently, on an LPAR with the nx-842 device disabled, the following messages are emitted: nx_compress: no nx842 driver found. [1] Registering IBM Power 842 compression driver nx_compress_pseries ibm,compression-v1: nx842_OF_upd_status: status 'disabled' is not 'okay' nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sync_size new:4096 old:0 [2] nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sync_sg new:510 old:0 nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sg_len new:4080 old:0 nx_compress_powernv: loading [3] nx_compress_powernv: no coprocessors found alg: No test for 842 (842-nx) [4] [1] is the result of an ordering issue when the CONFIG_ options are set =y. [2] is the result of nx842_OF_upd_status() not returning the correct error code. [3] is the result of attempting to load the PowerNV platform driver on a non-PowerNV platform. [4] is the result of there simply not being any test for 842 in the crypto test manager. After the changes in the series, the same system as above emits: Registering IBM Power 842 compression driver nx_compress_pseries ibm,compression-v1: nx842_OF_upd: device disabled which seems to me, at least, to be far clearer. Dan, I think there is still an issue in the code. If CONFIG_DEV_NX_COMPRESS=y and CONFIG_DEV_NX_COMPRESS_PSERIES=m/CONFIG_DEV_NX_COMPRESS_POWERNV=m, it seems like the request_module() is not working properly and we simply get "nx_compress: no nx842 driver found." at boot (even if I ensure the platform drivers are in the initrd). If I make CONFIG_DEV_NX_COMPRESS=m, though, the module(s) load successfully. Does it make sense/is it possible to have these three symbols always be the same (either all =y or all =m)? 
[1/6] crypto/nx-842-pseries: nx842_OF_upd_status should return ENODEV if device is not 'okay' drivers/crypto/nx/nx-842-pseries.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) [2/6] nx-842-pseries: rename nx842_{init,exit} to nx842_pseries_{init,exit} drivers/crypto/nx/nx-842-pseries.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) [3/6] nx-842-pseries: do not emit extra output if status is disabled drivers/crypto/nx/nx-842-pseries.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) [4/6] crypto/nx-842-{powerpc,pseries}: only load on the appropriate machine type drivers/crypto/nx/nx-842-powernv.c | 6 ++ drivers/crypto/nx/nx-842-pseries.c | 6 ++ drivers/crypto/nx/nx-842.h | 1 + 3 files changed, 13 insertions(+) [5/6] crypto/testmgr: add null test for 842 algorithm crypto/testmgr.c | 3 +++ 1 file changed, 3 insertions(+) [6/6] nx-842-platform: if NX842 platform drivers are not modules, don't try to load them drivers/crypto/nx/nx-842-platform.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-)
[PATCH 2/6] nx-842-pseries: rename nx842_{init,exit} to nx842_pseries_{init,exit}
While there is no technical reason that both nx-842.c and nx-842-pseries.c can have the same name for the init/exit functions, it is a bit confusing with initcall_debug. Rename the pseries-specific functions appropriately. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com --- drivers/crypto/nx/nx-842-pseries.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c index 819c23c546e3..e17f4d2e96e0 100644 --- a/drivers/crypto/nx/nx-842-pseries.c +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -1080,7 +1080,7 @@ static struct vio_driver nx842_vio_driver = { .id_table = nx842_vio_driver_ids, }; -static int __init nx842_init(void) +static int __init nx842_pseries_init(void) { struct nx842_devdata *new_devdata; int ret; @@ -1116,9 +1116,9 @@ static int __init nx842_init(void) return 0; } -module_init(nx842_init); +module_init(nx842_pseries_init); -static void __exit nx842_exit(void) +static void __exit nx842_pseries_exit(void) { struct nx842_devdata *old_devdata; unsigned long flags; @@ -1137,5 +1137,5 @@ static void __exit nx842_exit(void) vio_unregister_driver(&nx842_vio_driver); } -module_exit(nx842_exit); +module_exit(nx842_pseries_exit); -- 2.1.4
[PATCH 1/6] crypto/nx-842-pseries: nx842_OF_upd_status should return ENODEV if device is not 'okay'
The current documentation mentions explicitly that EINVAL should be returned if the device is not available, but nx842_OF_upd_status() always returns 0. However, nx842_probe() specifically checks for non-ENODEV returns from nx842_of_upd() (which in turn calls nx842_OF_upd_status()) and emits an extra error in that case. It seems like the proper return code of a disabled device is ENODEV. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com --- drivers/crypto/nx/nx-842-pseries.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c index 3040a6091bf2..819c23c546e3 100644 --- a/drivers/crypto/nx/nx-842-pseries.c +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -556,7 +556,7 @@ static int nx842_OF_set_defaults(struct nx842_devdata *devdata) * * Returns: * 0 - Device is available - * -EINVAL - Device is not available + * -ENODEV - Device is not available */ static int nx842_OF_upd_status(struct nx842_devdata *devdata, struct property *prop) { @@ -569,6 +569,7 @@ static int nx842_OF_upd_status(struct nx842_devdata *devdata, dev_info(devdata->dev, "%s: status '%s' is not 'okay'\n", __func__, status); devdata->status = UNAVAILABLE; + ret = -ENODEV; } return ret; -- 2.1.4
[PATCH 3/6] nx-842-pseries: do not emit extra output if status is disabled
If the device-tree indicates the nx-842 device's status is 'disabled', we emit two messages: nx_compress_pseries ibm,compression-v1: nx842_OF_upd_status: status 'disabled' is not 'okay'. nx_compress_pseries ibm,compression-v1: nx842_OF_upd: device disabled Given that 'disabled' is a valid state, and we are going to emit that the device is disabled, only print out a non-'okay' status if it is not 'disabled'. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com --- drivers/crypto/nx/nx-842-pseries.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c index e17f4d2e96e0..b84b0ceeb46e 100644 --- a/drivers/crypto/nx/nx-842-pseries.c +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -566,8 +566,14 @@ static int nx842_OF_upd_status(struct nx842_devdata *devdata, if (!strncmp(status, "okay", (size_t)prop->length)) { devdata->status = AVAILABLE; } else { - dev_info(devdata->dev, "%s: status '%s' is not 'okay'\n", + /* +* Caller will log that the device is disabled, so only +* output if there is an unexpected status. +*/ + if (strncmp(status, "disabled", (size_t)prop->length)) { + dev_info(devdata->dev, "%s: status '%s' is not 'okay'\n", __func__, status); + } devdata->status = UNAVAILABLE; ret = -ENODEV; } -- 2.1.4
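Taken together with patch 1/6, the resulting control flow of nx842_OF_upd_status() is small enough to sketch in full: 'okay' means available, 'disabled' is an expected unavailable state that stays quiet, and anything else is unavailable with a warning. A simplified userspace model (strcmp in place of the property-length-bounded strncmp, stderr in place of dev_info, and a bare enum in place of the devdata structure):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

enum dev_status { AVAILABLE, UNAVAILABLE };

/* Mirrors the patched logic: return 0 only for "okay"; stay quiet for
 * the expected "disabled" state; warn about anything else. */
static int upd_status(const char *status, enum dev_status *out)
{
	if (strcmp(status, "okay") == 0) {
		*out = AVAILABLE;
		return 0;
	}
	if (strcmp(status, "disabled") != 0)
		fprintf(stderr, "status '%s' is not 'okay'\n", status);
	*out = UNAVAILABLE;
	return -ENODEV;
}
```

The -ENODEV return is what lets the caller, nx842_probe(), distinguish "device absent/disabled" from a genuine configuration error without emitting an extra message.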
[PATCH 4/6] crypto/nx-842-{powerpc,pseries}: only load on the appropriate machine type
While we never would successfully load on the wrong machine type, there is extra output by default regardless of machine type. For instance, on a PowerVM LPAR, we see the following: nx_compress_powernv: loading nx_compress_powernv: no coprocessors found even though those coprocessors could never be found. Similar pseries messages are printed on powernv. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com --- drivers/crypto/nx/nx-842-powernv.c | 6 ++ drivers/crypto/nx/nx-842-pseries.c | 6 ++ drivers/crypto/nx/nx-842.h | 1 + 3 files changed, 13 insertions(+) diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c index 33b3b0abf4ae..6b5e5143c95b 100644 --- a/drivers/crypto/nx/nx-842-powernv.c +++ b/drivers/crypto/nx/nx-842-powernv.c @@ -594,6 +594,9 @@ static __init int nx842_powernv_init(void) BUILD_BUG_ON(DDE_BUFFER_ALIGN % DDE_BUFFER_SIZE_MULT); BUILD_BUG_ON(DDE_BUFFER_SIZE_MULT % DDE_BUFFER_LAST_MULT); + if (!machine_is(powernv)) + return -ENODEV; + pr_info("loading\n"); for_each_compatible_node(dn, NULL, "ibm,power-nx") @@ -625,6 +628,9 @@ static void __exit nx842_powernv_exit(void) { struct nx842_coproc *coproc, *n; + if (!machine_is(powernv)) + return; + nx842_platform_driver_unset(&nx842_powernv_driver); list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) { diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c index b84b0ceeb46e..75a7bfdc160e 100644 --- a/drivers/crypto/nx/nx-842-pseries.c +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -1091,6 +1091,9 @@ static int __init nx842_pseries_init(void) struct nx842_devdata *new_devdata; int ret; + if (!machine_is(pseries)) + return -ENODEV; + pr_info("Registering IBM Power 842 compression driver\n"); if (!of_find_compatible_node(NULL, NULL, "ibm,compression")) @@ -1129,6 +1132,9 @@ static void __exit nx842_pseries_exit(void) struct nx842_devdata *old_devdata; unsigned long flags; + if (!machine_is(pseries)) + return; + pr_info("Exiting IBM Power 842
compression driver\n"); nx842_platform_driver_unset(&nx842_pseries_driver); spin_lock_irqsave(&devdata_mutex, flags); diff --git a/drivers/crypto/nx/nx-842.h b/drivers/crypto/nx/nx-842.h index ac0ea79d0f8b..ea89c661e476 100644 --- a/drivers/crypto/nx/nx-842.h +++ b/drivers/crypto/nx/nx-842.h @@ -10,6 +10,7 @@ #include <linux/io.h> #include <linux/mm.h> #include <linux/ratelimit.h> +#include <asm/machdep.h> /* Restrictions on Data Descriptor List (DDL) and Entry (DDE) buffers * -- 2.1.4
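The guard pattern used in both initcalls is worth spelling out: a built-in driver's module_init() becomes a plain initcall that runs on every machine, so each platform driver bails out with -ENODEV before printing or probing anything when it finds itself on a foreign machine type. A minimal model of that shape (the enum and `current_machine` variable are illustrative; the kernel's machine_is() is a macro over the ppc_md machine descriptor):

```c
#include <assert.h>
#include <errno.h>

enum machine_type { MACHINE_PSERIES, MACHINE_POWERNV };

/* Stand-in for the platform detected at boot. */
static enum machine_type current_machine = MACHINE_PSERIES;

static int machine_is(enum machine_type m) { return current_machine == m; }

/* Each platform initcall bails out early on a foreign machine type,
 * before emitting any output or touching any hardware. */
static int nx842_powernv_init_model(void)
{
	if (!machine_is(MACHINE_POWERNV))
		return -ENODEV;
	return 0; /* ...probe coprocessors, register the driver... */
}

static int nx842_pseries_init_model(void)
{
	if (!machine_is(MACHINE_PSERIES))
		return -ENODEV;
	return 0; /* ...find the VIO device, register the driver... */
}
```

On the PowerVM LPAR from the changelog, the powernv initcall now returns -ENODEV immediately, so the "loading" / "no coprocessors found" chatter never happens.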
Re: powerpc,numa: Memory hotplug to memory-less nodes ?
On 24.06.2015 [07:13:36 -0500], Nathan Fontenot wrote: On 06/23/2015 11:01 PM, Bharata B Rao wrote: So will it be correct to say that memory hotplug to memory-less node isn't supported by PowerPC kernel ? Should I enforce the same in QEMU for PowerKVM ? I'm not sure if that is correct. It appears that we initialize all online nodes, even those without spanned_pages, at boot time. This occurs in setup_node_data() called from initmem_init(). Looking at this I would think that we could add memory to any online node even if it does not have any spanned_pages. I think an interesting test would be to check for the node being online instead of checking to see if it has any memory. I see no *technical* reason we shouldn't be able to hotplug to an initially memoryless node. I'm not sure it happens in practice under PowerVM (where we have far less control of the topology anyways). One aspect of this that I have on my todo list is seeing what SLUB does when a node goes from memoryless to populated -- as during boot memoryless nodes get a 'useless' per node structure (early_kmem_cache_node_alloc). I can look at this a bit under KVM maybe later this week myself to see what happens in a guest. -Nish
Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 08.05.2015 [15:47:26 -0700], Andrew Morton wrote: On Wed, 06 May 2015 11:28:12 +0200 Vlastimil Babka vba...@suse.cz wrote: On 05/06/2015 12:09 AM, Nishanth Aravamudan wrote: On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote: What I find somewhat worrying though is that we could potentially break the pfmemalloc_watermark_ok() test in situations where zone_reclaimable_pages(zone) == 0 is a transient situation (and not a permanently allocated hugepage). In that case, the throttling is supposed to help the system recover, and we might be breaking that ability with this patch, no? Well, if it's transient, we'll skip it this time through, and once there are reclaimable pages, we should notice it again. I'm not familiar enough with this logic, so I'll read through the code again soon to see if your concern is valid, as best I can. In reviewing the code, I think that transiently unreclaimable zones will lead to some higher direct reclaim rates and possible contention, but shouldn't cause any major harm. The likelihood of that situation, as well, in a non-reserved memory setup like the one I described, seems exceedingly low. OK, I guess when a reasonably configured system has nothing to reclaim, it's already busted and throttling won't change much. Consider the patch Acked-by: Vlastimil Babka vba...@suse.cz OK, thanks, I'll move this patch into the queue for 4.2-rc1. Thank you! Or is it important enough to merge into 4.1? I think 4.2 is sufficient, but I wonder now if I should have included a stable tag? The issue has been around for a while and there's a relatively easy workaround (use the per-node sysfs files to manually round-robin around the exhausted node) in older kernels, so I had decided against it before. Thanks, Nish
Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote: On 03.04.2015 [09:57:35 +0200], Vlastimil Babka wrote: On 03/31/2015 11:48 AM, Michal Hocko wrote: On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote: On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote: On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote: @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i <= ZONE_NORMAL; i++) { zone = &pgdat->node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); Do you really want zone_reclaimable()? Or do you want something more direct like zone_reclaimable_pages(zone) == 0? Yeah, I guess in my testing this worked out to be the same, since zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will always be false. Thanks! Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 Blee, this is a weird configuration. If one then attempts to allocate some normal 16M hugepages via echo 37 > /proc/sys/vm/nr_hugepages The echo never returns and kswapd2 consumes CPU cycles. 
This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node if there are any reserves, and if so, then indicates the watermarks are ok, by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not reclaimable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation attempt succeeds and correctly round-robins between Nodes 1 and 3. I am just wondering whether this is the right/complete fix. Don't we need a similar treatment at more places? I would expect kswapd would be looping endlessly because the zone wouldn't be balanced obviously. But I would be wrong... because pgdat_balanced is doing this: /* * A special case here: * * balance_pgdat() skips over all_unreclaimable after * DEF_PRIORITY. Effectively, it considers them balanced so * they must be considered balanced here as well! */ if (!zone_reclaimable(zone)) { balanced_pages += zone->managed_pages; continue; } and zone_reclaimable is false for you as you didn't have any zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those zones right?) and so the kswapd would be woken up easily. So it looks like a mess. Yeah, looks like a much cleaner/complete solution would be to remove such zones from zonelists. But that means covering all situations when these hugepages are allocated/removed and the approach then looks similar to memory hotplug. Also I'm not sure if the ability to actually allocate the reserved hugepage would be impossible due to not being reachable by a zonelist... 
There are possibly other places which rely on populated_zone or for_each_populated_zone without checking reclaimability. Are those working as expected? Yeah. At least the wakeup_kswapd case should be fixed IMHO. No point in waking it up just to let it immediately go to sleep again. That being said. I am not objecting to this patch. I am just trying to wrap my head around possible issues from such a weird configuration and all the consequences. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
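For reference, the behavior under discussion condenses into a small model of the patched pfmemalloc_watermark_ok(): zones that are unpopulated or have nothing reclaimable contribute neither reserves nor free pages, so a node whose memory is entirely pinned in hugepages no longer throttles direct reclaim forever. This is a simplified sketch under stated assumptions (a flat zone array and the free > reserve/2 test standing in for the real pgdat walk), not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

struct zone_model {
	bool populated;
	unsigned long free_pages;
	unsigned long min_wmark;
	unsigned long reclaimable_pages;
};

static bool zone_reclaimable(const struct zone_model *z)
{
	return z->reclaimable_pages > 0;
}

/* Model of the patched check: skip zones with nothing reclaimable,
 * and (as in 675becce15) do not throttle at all if no zone
 * contributed any reserves. */
static bool pfmemalloc_watermark_ok(const struct zone_model *zones, int n)
{
	unsigned long reserve = 0, free = 0;

	for (int i = 0; i < n; i++) {
		if (!zones[i].populated || !zone_reclaimable(&zones[i]))
			continue;
		reserve += zones[i].min_wmark;
		free += zones[i].free_pages;
	}
	if (!reserve)
		return true; /* nothing to measure against: don't throttle */
	return free > reserve / 2;
}
```

With the Node 2 numbers from the changelog (free 0, min 1840, nothing reclaimable), the zone is skipped, the reserve total is zero, and the caller is not throttled; a zone that still has reclaimable pages but is below its watermark continues to throttle as before.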
Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()
On 10.04.2015 [14:37:19 +0300], Konstantin Khlebnikov wrote: On 10.04.2015 01:58, Tanisha Aravamudan wrote: On 09.04.2015 [07:27:28 +0300], Konstantin Khlebnikov wrote: On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan n...@linux.vnet.ibm.com wrote: On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote: On 08.04.2015 19:59, Konstantin Khlebnikov wrote: Node 0 might be offline as well as any other numa node, in this case kernel cannot handle memory allocation and crashes. Isn't the bug that numa_node_id() returned an offline node? That shouldn't happen. Offline node 0 came from static-inline copy of that function from of.h I've patched weak function for keeping consistency. Got it, that's not necessarily clear in the original commit message. Sorry. #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID ... #ifndef numa_node_id /* Returns the number of the current Node. */ static inline int numa_node_id(void) { return raw_cpu_read(numa_node); } #endif ... #else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */ /* Returns the number of the current Node. */ #ifndef numa_node_id static inline int numa_node_id(void) { return cpu_to_node(raw_smp_processor_id()); } #endif ... So that's either the per-cpu numa_node value, right? Or the result of cpu_to_node on the current processor. Example: [0.027133] [ cut here ] [0.027938] kernel BUG at include/linux/gfp.h:322! This is VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); in alloc_pages_exact_node(). And based on the trace below, that's __slab_alloc - alloc alloc_pages_exact_node - alloc_slab_page - allocate_slab - new_slab - new_slab_objects __slab_alloc? which is just passing the node value down, right? Which I think was from: domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size), GFP_KERNEL, of_node_to_nid(of_node)); ? What platform is this on, looks to be x86? qemu emulation of a pathological topology? What was the topology? qemu x86_64, 2 cpu, 2 numa nodes, all memory in second. Ok, this worked before? 
That is, this is a regression? Seems like that worked before 3.17 where bug was exposed by commit 44767bfaaed782d6d635ecbb13f3980041e6f33e (x86, irq: Enhance mp_register_ioapic() to support irqdomain) this is the first usage of *irq_domain_add*() in x86. Ok. I've slightly patched it to allow that setup (in qemu hardcoded 1Mb of memory connected to node 0) And I've found unrelated bug -- if a numa node has less than 4Mb of ram then kernel crashes even earlier because numa code ignores that node but buddy allocator still tries to use that pages. So this isn't an actually supported topology by qemu? Qemu easily created memoryless numa nodes but node 0 has hardcoded 1Mb of ram. This seems like legacy prop for DOS era software. Well, the problem is that x86 doesn't support memoryless nodes. git grep MEMORYLESS_NODES arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES Note that there is a ton of code that seems to assume node 0 is online. I started working on removing this assumption myself and it just led down a rathole (on power, we always have node 0 online, even if it is memoryless and cpuless, as a result). I am guessing this is just happening early in boot before the per-cpu areas are setup? That's why (I think) x86 has the early_cpu_to_node() function... Or do you not have CONFIG_OF set? So isn't the only change necessary to the include file, and it should just return first_online_node rather than 0? Ah and there's more of those node 0 assumptions :) That was x86 where is no CONFIG_OF at all. I don't know what's wrong with that machine but ACPI reports that cpus and memory from node 0 as connected to node 1 and everything seems worked fine until latest upgrade -- seems like buggy static-inline of_node_to_nid was introduced in 3.13 but x86 ioapic uses it during early allocations only since 3.17. Machine owner tells that 3.15 worked fine. So, this was a qemu emulation of this actual physical machine without a node 0? Yep. 
Also I have a crash from a real machine, but that stacktrace is messy because CONFIG_DEBUG_VM wasn't enabled and the kernel crashed inside the buddy allocator when it tried to touch an unallocated numa node structure. As I mentioned, there are lots of node 0 assumptions throughout the kernel. You might run into more issues at runtime. I think it's possible to trigger a kernel crash for any memoryless numa node (not just for 0) if some device (like the ioapic in my case) points to it in its acpi tables. At runtime, numa affinity configured by the user is usually validated by the kernel, while numbers from firmware might be used without
Re: Topology updates and NUMA-level sched domains
On 10.04.2015 [11:08:10 +0200], Peter Zijlstra wrote: On Fri, Apr 10, 2015 at 10:31:53AM +0200, Peter Zijlstra wrote: Please, step back, look at what you're doing and ask yourself, will any sane person want to use this? Can they use this? If so, start by describing the desired user semantics of this work. Don't start by cobbling kernel bits together until it stops crashing. Also, please talk to your s390 folks. They've long since realized that this doesn't work. They've just added big honking caches (their BOOK stuff) and pretend the system is UMA. Interesting, I will take a look. -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Topology updates and NUMA-level sched domains
On 10.04.2015 [10:31:53 +0200], Peter Zijlstra wrote: On Thu, Apr 09, 2015 at 03:29:56PM -0700, Nishanth Aravamudan wrote: No, that's very much not the same. Even if it were dealing with hotplug it would still assume the cpu to return to the same node. The analogy may have been poor; a better one is: it's the same as hotunplugging a CPU from one node and hotplugging a physically identical CPU on a different node. Then it'll not be the same cpu from the OS's pov. The outgoing cpus and the incoming cpus will have different cpu numbers. Right, it's an analogy. I understand it's not the exact same. I was trying to have a civil discussion about how to solve this problem without you calling me a wanker. Furthermore at boot we will have observed the empty socket and reserved cpu number and arranged per-cpu resources for them. Ok, I see what you're referring to now: static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align) { return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align, __pa(MAX_DMA_ADDRESS)); } So we'll be referring to bootmem in the pcpu path for the node we were on at boot-time. Actually, this is already horribly broken on power. 
[0.00] pcpu-alloc: [0] 000 001 002 003 [0] 004 005 006 007 [0.00] pcpu-alloc: [0] 008 009 010 011 [0] 012 013 014 015 [0.00] pcpu-alloc: [0] 016 017 018 019 [0] 020 021 022 023 [0.00] pcpu-alloc: [0] 024 025 026 027 [0] 028 029 030 031 [0.00] pcpu-alloc: [0] 032 033 034 035 [0] 036 037 038 039 [0.00] pcpu-alloc: [0] 040 041 042 043 [0] 044 045 046 047 [0.00] pcpu-alloc: [0] 048 049 050 051 [0] 052 053 054 055 [0.00] pcpu-alloc: [0] 056 057 058 059 [0] 060 061 062 063 [0.00] pcpu-alloc: [0] 064 065 066 067 [0] 068 069 070 071 [0.00] pcpu-alloc: [0] 072 073 074 075 [0] 076 077 078 079 [0.00] pcpu-alloc: [0] 080 081 082 083 [0] 084 085 086 087 [0.00] pcpu-alloc: [0] 088 089 090 091 [0] 092 093 094 095 [0.00] pcpu-alloc: [0] 096 097 098 099 [0] 100 101 102 103 [0.00] pcpu-alloc: [0] 104 105 106 107 [0] 108 109 110 111 [0.00] pcpu-alloc: [0] 112 113 114 115 [0] 116 117 118 119 [0.00] pcpu-alloc: [0] 120 121 122 123 [0] 124 125 126 127 [0.00] pcpu-alloc: [0] 128 129 130 131 [0] 132 133 134 135 [0.00] pcpu-alloc: [0] 136 137 138 139 [0] 140 141 142 143 [0.00] pcpu-alloc: [0] 144 145 146 147 [0] 148 149 150 151 [0.00] pcpu-alloc: [0] 152 153 154 155 [0] 156 157 158 159 even though the topology is: available: 4 nodes (0-1,16-17) node 0 cpus: 0 8 16 24 32 node 1 cpus: 40 48 56 64 72 node 16 cpus: 80 88 96 104 112 node 17 cpus: 120 128 136 144 152 The comment for pcpu_build_alloc_info() seems wrong: The returned configuration is guaranteed * to have CPUs on different nodes on different groups and >=75% usage * of allocated virtual address space. Or, we're returning node 0 for everything at this point. I'll debug it further, but one more question: If we have CONFIG_USE_PERCPU_NUMA_NODE_ID and are using cpu_to_node() for setting up the per-cpu areas itself, isn't that a problem? People very much assume that when they set up their node affinities they will remain the same for the lifetime of their program. 
People set separate cpu affinity with sched_setaffinity() and memory affinity with mbind() and assume the cpu-node maps are invariant. That's a bad assumption to make if you're virtualized, I would think (including on KVM). Unless you're also binding your vcpu threads to physical cpus. But the point is valid, that userspace does tend to think rather statically about the world. I've no idea how KVM numa is working, if at all. I would not be surprised if it indeed hard binds vcpus to nodes. Not doing that allows the vcpus to randomly migrate between nodes, which will completely destroy the whole point of exposing numa details to the guest. Well, you *can* bind vcpus to nodes. But you don't have to. I suppose some of the auto-numa work helps here. Not sure at all. Yes, it does, I think. I'll look into the per-cpu memory case. Look into everything that does cpu_to_node() based allocations, because they all assume that that is stable. They allocate memory at init time to be node local, but then you go and mess that up. So, the case that you're considering is: CPU X on Node Y at boot-time, gets memory from Node Y. CPU X moves to Node Z at run-time, is still using memory from Node Y. Right, at which point numa doesn't make sense anymore. If you randomly scramble your cpu-node map, what's the point of exposing numa to the guest? The whole point of NUMA is that userspace can be aware of the layout and use local memory where possible. Nobody will want to consider dynamic NUMA information; it's utterly insane; do you
Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()
On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote: On 08.04.2015 19:59, Konstantin Khlebnikov wrote: Node 0 might be offline as well as any other numa node, in which case the kernel cannot handle memory allocation and crashes. Isn't the bug that numa_node_id() returned an offline node? That shouldn't happen. #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID ... #ifndef numa_node_id /* Returns the number of the current Node. */ static inline int numa_node_id(void) { return raw_cpu_read(numa_node); } #endif ... #else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */ /* Returns the number of the current Node. */ #ifndef numa_node_id static inline int numa_node_id(void) { return cpu_to_node(raw_smp_processor_id()); } #endif ... So that's either the per-cpu numa_node value, right? Or the result of cpu_to_node on the current processor. Example: [0.027133] [ cut here ] [0.027938] kernel BUG at include/linux/gfp.h:322! This is VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); in alloc_pages_exact_node(). And based on the trace below, that's alloc_pages_exact_node <- alloc_slab_page <- allocate_slab <- new_slab <- new_slab_objects <- __slab_alloc? which is just passing the node value down, right? Which I think was from: domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size), GFP_KERNEL, of_node_to_nid(of_node)); ? What platform is this on, looks to be x86? qemu emulation of a pathological topology? What was the topology? Note that there is a ton of code that seems to assume node 0 is online. I started working on removing this assumption myself and it just led down a rathole (on power, we always have node 0 online, even if it is memoryless and cpuless, as a result). I am guessing this is just happening early in boot before the per-cpu areas are set up? That's why (I think) x86 has the early_cpu_to_node() function... Or do you not have CONFIG_OF set? 
So isn't the only change necessary to the include file, and it should just return first_online_node rather than 0? Ah and there's more of those node 0 assumptions :) #define first_online_node 0 #define first_memory_node 0 if MAX_NUMNODES == 1... -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Topology updates and NUMA-level sched domains
On 07.04.2015 [12:21:47 +0200], Peter Zijlstra wrote: On Mon, Apr 06, 2015 at 02:45:58PM -0700, Nishanth Aravamudan wrote: Hi Peter, As you are very aware, I think, power has some odd NUMA topologies (and changes to those topologies) at run-time. In particular, we can see a topology at boot: Node 0: all CPUs Node 7: no cpus Then we get a notification from the hypervisor that a core (or two) have moved from node 0 to node 7. This results in the: or a re-init API (which won't try to reallocate various bits), because the topology could be completely different now (e.g., sched_domains_numa_distance will also be inaccurate now). Really, a topology update on power (not sure on s390x, but those are the only two archs that return a positive value from arch_update_cpu_topology() right now, afaics) is a lot like a hotplug event and we need to re-initialize any dependent structures. I'm just sending out feelers, as we can limp by with the above warning, it seems, but it is less than ideal. Any help or insight you could provide would be greatly appreciated! So I think (and ISTR having stated this before) that dynamic cpu-node maps are absolutely insane. Sorry if I wasn't involved at the time. I agree that it's a bit of a mess! There is a ton of stuff that assumes the cpu-node relation is a boot time fixed one. Userspace being one of them. Per-cpu memory another. Well, userspace already deals with CPU hotplug, right? And the topology updates are, in a lot of ways, just like you've hotplugged a CPU from one node and re-hotplugged it into another node. I'll look into the per-cpu memory case. For what it's worth, our test teams are stressing the kernel with these topology updates and hopefully we'll be able to resolve any issues that result. You simply cannot do this without causing massive borkage. So please come up with a coherent plan to deal with the entire problem of dynamic cpu to memory relation and I might consider the scheduler impact. 
But we're not going to hack around and maybe make it not crash in a few corner cases while the entire thing is shite. Well, it doesn't crash now. In fact, it stays up reasonably well and seems to dtrt (from the kernel perspective) other than the sched domain messages. I will look into per-cpu memory, and also another case I have been thinking about: if a process is bound to a CPU/node combination via numactl and then the topology changes, what exactly will happen? In theory, via these topology updates, a node could go from memoryless -> not, and vice versa, which seems like it might not be well supported (but again, should not be much different from hotplugging all the memory out from a node). And, in fact, topologically speaking, I think I should be able to repeat the same sched domain warnings if I start off with a 2-node system with all CPUs on one node, and then hotplug a CPU onto the second node, right? That has nothing to do with power, that I can tell. I'll see if I can demonstrate it via a KVM guest. Thanks for your quick response! -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Topology updates and NUMA-level sched domains
Hi Peter, As you are very aware, I think, power has some odd NUMA topologies (and changes to those topologies) at run-time. In particular, we can see a topology at boot: Node 0: all CPUs Node 7: no cpus Then we get a notification from the hypervisor that a core (or two) have moved from node 0 to node 7. This results in the: [ 64.496687] BUG: arch topology borken [ 64.496689] the CPU domain not a subset of the NUMA domain messages for each moved CPU. I think this is because when we first came up, we degrade (elide altogether?) the NUMA domain for node 7 as it has no CPUs: [0.305823] CPU0 attaching sched-domain: [0.305831] domain 0: span 0-7 level SIBLING [0.305834] groups: 0 (cpu_power = 146) 1 (cpu_power = 146) 2 (cpu_power = 146) 3 (cpu_power = 146) 4 (cpu_power = 146) 5 (cpu_power = 146) 6 (cpu_power = 146) 7 (cpu_power = 146) [0.305854] domain 1: span 0-79 level CPU [0.305856] groups: 0-7 (cpu_power = 1168) 8-15 (cpu_power = 1168) 16-23 (cpu_power = 1168) 24-31 (cpu_power = 1168) 32-39 (cpu_power = 1168) 40-47 (cpu_power = 1168) 48-55 (cpu_power = 1168) 56-63 (cpu_power = 1168) 64-71 (cpu_power = 1168) 72-79 (cpu_power = 1168) For those cpus that moved, we get after the update: [ 64.505819] CPU8 attaching sched-domain: [ 64.505821] domain 0: span 8-15 level SIBLING [ 64.505823] groups: 8 (cpu_power = 147) 9 (cpu_power = 147) 10 (cpu_power = 147) 11 (cpu_power = 146) 12 (cpu_power = 147) 13 (cpu_power = 147) 14 (cpu_power = 146) 15 (cpu_power = 147) [ 64.505842] domain 1: span 8-23,72-79 level CPU [ 64.505845] groups: 8-15 (cpu_power = 1174) 16-23 (cpu_power = 1175) 72-79 (cpu_power = 1176) while the non-modified CPUs report, correctly: [ 64.497186] CPU0 attaching sched-domain: [ 64.497189] domain 0: span 0-7 level SIBLING [ 64.497192] groups: 0 (cpu_power = 147) 1 (cpu_power = 147) 2 (cpu_power = 146) 3 (cpu_power = 147) 4 (cpu_power = 147) 5 (cpu_power = 147) 6 (cpu_power = 147) 7 (cpu_power = 146) [ 64.497213] domain 1: span 0-7,24-71 level CPU [ 64.497215] groups: 0-7 (cpu_power = 1174) 24-31 (cpu_power = 1173) 32-39 (cpu_power = 1176) 40-47 (cpu_power = 1175) 48-55 (cpu_power = 1176) 56-63 (cpu_power = 1175) 64-71 (cpu_power = 1174) [ 64.497234] domain 2: span 0-79 level NUMA [ 64.497236] groups: 0-7,24-71 (cpu_power = 8223) 8-23,72-79 (cpu_power = 3525) It seems like we might need something like this (HORRIBLE HACK, I know, just to get discussion): @@ -6958,6 +6960,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], /* Let architecture update cpu core mappings. */ new_topology = arch_update_cpu_topology(); + /* Update NUMA topology lists */ + if (new_topology) { + sched_init_numa(); + } n = doms_new ? ndoms_new : 0; or a re-init API (which won't try to reallocate various bits), because the topology could be completely different now (e.g., sched_domains_numa_distance will also be inaccurate now). Really, a topology update on power (not sure on s390x, but those are the only two archs that return a positive value from arch_update_cpu_topology() right now, afaics) is a lot like a hotplug event and we need to re-initialize any dependent structures. I'm just sending out feelers, as we can limp by with the above warning, it seems, but it is less than ideal. Any help or insight you could provide would be greatly appreciated! -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 03.04.2015 [20:24:45 +0200], Michal Hocko wrote: On Fri 03-04-15 10:43:57, Nishanth Aravamudan wrote: On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote: [...] I would expect kswapd would be looping endlessly because the zone wouldn't be balanced obviously. But I would be wrong... because pgdat_balanced is doing this: /* * A special case here: * * balance_pgdat() skips over all_unreclaimable after * DEF_PRIORITY. Effectively, it considers them balanced so * they must be considered balanced here as well! */ if (!zone_reclaimable(zone)) { balanced_pages += zone->managed_pages; continue; } and zone_reclaimable is false for you as you didn't have any zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those zones, right?) and so the kswapd would be woken up easily. So it looks like a mess. My understanding, and I could easily be wrong, is that kswapd2 (node 2 is the exhausted one) spins endlessly, because the reclaim logic sees that we are reclaiming from somewhere but the allocation request for node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will never complete, so we just continue to reclaim. __GFP_THISNODE would be waking up kswapd2 again and again, that is true. Right, one idea I had for this was ensuring that we perform reclaim with somehow some knowledge of __GFP_THISNODE -- that is, it needs to be somewhat targeted in order to actually help satisfy the current allocation. But it got pretty hairy fast and I didn't want to break the world :) I am just wondering whether we will have any __GFP_THISNODE allocations for a node without CPUs (numa_node_id() shouldn't return such a node AFAICS). Maybe if somebody is bound to Node2 explicitly, but I would consider this as a misconfiguration. 
Right, I'd need to check what happens if in our setup you taskset to node2 and tried to force memory to be local -- I think you'd either be killed immediately, or the kernel will just disagree with your binding since it's invalid (e.g., that will happen if you try to bind to a memoryless node, I think). Keep in mind that although in my config node2 had no CPUs, that's not a hard fast requirement. I do believe in a previous iteration of this bug, the exhausted node had no free memory but did have cpus assigned to it. -Nish ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote: On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote: On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote: On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote: @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i <= ZONE_NORMAL; i++) { zone = &pgdat->node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); Do you really want zone_reclaimable()? Or do you want something more direct like zone_reclaimable_pages(zone) == 0? Yeah, I guess in my testing this worked out to be the same, since zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will always be false. Thanks! Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 Blee, this is a weird configuration. Yep, super gross. It's relatively rare in the field, thankfully. But 16G pages definitely make it pretty likely to hit (as in, I've seen it once before :) If one then attempts to allocate some normal 16M hugepages via echo 37 > /proc/sys/vm/nr_hugepages The echo never returns and kswapd2 consumes CPU cycles. 
This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node for reserves, and if so, indicates the watermarks are ok by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not reclaimable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation attempt succeeds and correctly round-robins between Nodes 0 and 3. I am just wondering whether this is the right/complete fix. Don't we need a similar treatment at more places? Almost certainly needs an audit. Exhausted nodes are tough to reproduce easily (fully exhausted, that is), for me. I would expect kswapd would be looping endlessly because the zone wouldn't be balanced obviously. But I would be wrong... because pgdat_balanced is doing this: /* * A special case here: * * balance_pgdat() skips over all_unreclaimable after * DEF_PRIORITY. Effectively, it considers them balanced so * they must be considered balanced here as well! */ if (!zone_reclaimable(zone)) { balanced_pages += zone->managed_pages; continue; } and zone_reclaimable is false for you as you didn't have any zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those zones, right?) and so the kswapd would be woken up easily. So it looks like a mess. My understanding, and I could easily be wrong, is that kswapd2 (node 2 is the exhausted one) spins endlessly, because the reclaim logic sees that we are reclaiming from somewhere but the allocation request for node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will never complete, so we just continue to reclaim. 
There are possibly other places which rely on populated_zone or for_each_populated_zone without checking reclaimability. Are those working as expected? Not yet verified, admittedly. That being said. I am not objecting to this patch. I am just trying to wrap my head around possible issues from such a weird configuration and all the consequences. Yeah, there are almost certainly more. Luckily, our test
Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 03.04.2015 [09:57:35 +0200], Vlastimil Babka wrote: On 03/31/2015 11:48 AM, Michal Hocko wrote: On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote: On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote: On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote: @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i <= ZONE_NORMAL; i++) { zone = &pgdat->node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); Do you really want zone_reclaimable()? Or do you want something more direct like zone_reclaimable_pages(zone) == 0? Yeah, I guess in my testing this worked out to be the same, since zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will always be false. Thanks! Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 Blee, this is a weird configuration. If one then attempts to allocate some normal 16M hugepages via echo 37 > /proc/sys/vm/nr_hugepages The echo never returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). 
pfmemalloc_watermark_ok() in turn checks all zones on the node for reserves, and if so, indicates the watermarks are ok by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not reclaimable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation attempt succeeds and correctly round-robins between Nodes 0 and 3. I am just wondering whether this is the right/complete fix. Don't we need a similar treatment at more places? I would expect kswapd would be looping endlessly because the zone wouldn't be balanced obviously. But I would be wrong... because pgdat_balanced is doing this: /* * A special case here: * * balance_pgdat() skips over all_unreclaimable after * DEF_PRIORITY. Effectively, it considers them balanced so * they must be considered balanced here as well! */ if (!zone_reclaimable(zone)) { balanced_pages += zone->managed_pages; continue; } and zone_reclaimable is false for you as you didn't have any zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those zones, right?) and so the kswapd would be woken up easily. So it looks like a mess. Yeah, looks like a much cleaner/complete solution would be to remove such zones from zonelists. But that means covering all situations when these hugepages are allocated/removed, and the approach then looks similar to memory hotplug. Also I'm not sure whether it would become impossible to actually allocate the reserved hugepage due to it not being reachable via a zonelist... There are possibly other places which rely on populated_zone or for_each_populated_zone without checking reclaimability. Are those working as expected? Yeah. 
At least the wakeup_kswapd case should be fixed IMHO. No point in waking it up just to let it immediately go to sleep again. That being said, I am not objecting to this patch. I am just trying to wrap my head around possible issues from such a weird configuration and all the consequences. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com The patch as is doesn't seem to be harmful. Reviewed-by: Michal Hocko mho...@suse.cz --- v1 -> v2: Check against zone_reclaimable_pages, rather
[PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: (0) root @ br30p03: /root # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 If one then attempts to allocate some normal 16M hugepages: echo 37 > /proc/sys/vm/nr_hugepages The echo never returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node and sees if there are any reserves, and if so, indicates the watermarks are ok by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not reclaimable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation succeeds and correctly round-robins between Nodes 0 and 3. 
Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com diff --git a/mm/vmscan.c b/mm/vmscan.c index dcd90c8..033c2b7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i <= ZONE_NORMAL; i++) { zone = &pgdat->node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
[ Sorry, typo'd anton's address ] On 27.03.2015 [12:28:50 -0700], Nishanth Aravamudan wrote: Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: (0) root @ br30p03: /root # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 If one then attempts to allocate some normal 16M hugepages: echo 37 > /proc/sys/vm/nr_hugepages The echo never returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node and sees if there are any reserves, and if so, indicates the watermarks are ok by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not reclaimable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation succeeds and correctly round-robins between Nodes 0 and 3. 
Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com diff --git a/mm/vmscan.c b/mm/vmscan.c index dcd90c8..033c2b7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i <= ZONE_NORMAL; i++) { zone = &pgdat->node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote: On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote: @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i <= ZONE_NORMAL; i++) { zone = &pgdat->node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); Do you really want zone_reclaimable()? Or do you want something more direct like zone_reclaimable_pages(zone) == 0? Yeah, I guess in my testing this worked out to be the same, since zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will always be false. Thanks! Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 If one then attempts to allocate some normal 16M hugepages via echo 37 > /proc/sys/vm/nr_hugepages The echo never returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node for reserves, and if so, indicates the watermarks are ok by seeing if there are sufficient free pages. 
675becce15 added a condition already for memoryless nodes. In this
case, though, the node has memory, it is just all consumed (and not
reclaimable). Effectively, though, the result is the same on this call
to pfmemalloc_watermark_ok(), so it seems like a reasonable additional
condition.

With this change, the aforementioned 16M hugepage allocation attempt
succeeds and correctly round-robins between nodes 0 and 3, the only
nodes with free memory.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
v1 -> v2:
  Check against zone_reclaimable_pages, rather than zone_reclaimable,
  based upon feedback from Dave Hansen.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c627fa4c991f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone))
+		if (!populated_zone(zone) ||
+		    zone_reclaimable_pages(zone) == 0)
 			continue;
 		pfmemalloc_reserve += min_wmark_pages(zone);

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: new decimal conversion - seeking testers
On 13.03.2015 [00:09:19 +0100], Rasmus Villemoes wrote:
> Hi,
>
> I've proposed a new implementation of decimal conversion for
> lib/vsprintf.c; see
> http://thread.gmane.org/gmane.linux.kernel/1892035/focus=1905478.
> Benchmarking so far shows 25-50% (depending on distribution of input
> numbers) improvement on x86_64 and 10-30% on various 32 bit platforms.
> Since the new code plays a little endianness game I would really
> appreciate it if someone here would run the test and verification code
> on ppc. The code is on github, https://github.com/Villemoes/dec, and
> it should be as simple as
>
> git clone https://github.com/Villemoes/dec.git
> cd dec
> make
> ./test    # benchmark
> ./verify  # correctness
>
> though I can't blame you if you want to inspect the code before
> compiling and running something some stranger asks you to download :-)
> See also the README file. If 'make' doesn't work out-of-the-box, I'd
> also like to hear from you.

On a ppc64le box:

./test
Distribution           Function       nsecs/conv  Conv/1 sec
uniform([10, 2^64-1])  linux_put_dec  56.04       17785895
uniform([10, 2^64-1])  rv_put_dec     31.97       31190888
+/-                                   -42.94%     +75.37%
3 + neg_binom(0.05)    linux_put_dec  29.55       32986465
3 + neg_binom(0.05)    rv_put_dec     24.61       39416630
+/-                                   -16.71%     +19.49%
3 + neg_binom(0.10)    linux_put_dec  22.16       43993836
3 + neg_binom(0.10)    rv_put_dec     18.76       50767222
+/-                                   -15.34%     +15.40%
3 + neg_binom(0.15)    linux_put_dec  18.97       51272565
3 + neg_binom(0.15)    rv_put_dec     16.18       58328176
+/-                                   -14.70%     +13.76%
3 + neg_binom(0.20)    linux_put_dec  16.79       57792783
3 + neg_binom(0.20)    rv_put_dec     14.03       66418077
+/-                                   -16.45%     +14.92%
3 + neg_binom(0.50)    linux_put_dec  10.81       89762669
3 + neg_binom(0.50)    rv_put_dec     9.40        104336963
+/-                                   -13.08%     +16.24%

./verify
Using 16 threads
Checking [10, 100] and [18446744063709551615, 18446744073709551615]
Thread 9: low range ok
Thread 13: low range ok
Thread 10: low range ok
Thread 15: low range ok
Thread 7: low range ok
Thread 5: low range ok
Thread 14: low range ok
Thread 0: low range ok
Thread 3: low range ok
Thread 1: low range ok
Thread 2: low range ok
Thread 4: low range ok
Thread 8: low range ok
Thread 12: low range ok
Thread 6: low range ok
Thread 11: low range ok
Thread 9: high range ok
Thread 10: high range ok
Thread 7: high range ok
Thread 13: high range ok
Thread 0: high range ok
Thread 14: high range ok
Thread 15: high range ok
Thread 5: high range ok
Thread 1: high range ok
Thread 8: high range ok
Thread 11: high range ok
Thread 6: high range ok
Thread 2: high range ok
Thread 12: high range ok
Thread 3: high range ok
Thread 4: high range ok
Thread 9: mid range ok
Thread 0: mid range ok
Thread 14: mid range ok
Thread 7: mid range ok
Thread 10: mid range ok
Thread 8: mid range ok
Thread 2: mid range ok
Thread 11: mid range ok
Thread 13: mid range ok
Thread 1: mid range ok
Thread 6: mid range ok
Thread 15: mid range ok
Thread 5: mid range ok
Thread 4: mid range ok
Thread 3: mid range ok
Thread 12: mid range ok

Distribution of lengths checked:
 1 5
 2 101
 3 900
 4 9000
 5 9
 6 90
 7 900
 8 9000
 9 9
10 90
11 81
12 848
13 8384
14 83808
15 838192
16 8381904
17 83819040
18 838190304
19 8381903184
20 17866643425

On a ppc64 box:

./test
Distribution           Function       nsecs/conv  Conv/1 sec
uniform([10, 2^64-1])  linux_put_dec  48.97       20478528
uniform([10, 2^64-1])  rv_put_dec     32.14       31915074
+/-                                   -34.37%     +55.85%
3 + neg_binom(0.05)    linux_put_dec  25.60       38732189
3 + neg_binom(0.05)    rv_put_dec     20.18       48828445
+/-                                   -21.18%     +26.07%
3 + neg_binom(0.10)    linux_put_dec  18.70       52648284
3 + neg_binom(0.10)    rv_put_dec     15.60       62915143
+/-                                   -16.56%     +19.50%
3 + neg_binom(0.15)    linux_put_dec  15.27       64695363
3 + neg_binom(0.15)    rv_put_dec     13.34       73279482
+/-                                   -12.62%     +13.27%
3 + neg_binom(0.20)    linux_put_dec  13.25       74240237
[PATCH v3] powerpc/numa: set node_possible_map to only node_online_map during boot
On 10.03.2015 [10:55:05 +1100], Michael Ellerman wrote:
> On Thu, 2015-03-05 at 21:27 -0800, Nishanth Aravamudan wrote:
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..0c1716cd271f 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,6 +958,13 @@ void __init initmem_init(void)
> >
> >  	memblock_dump_all();
> >
> > +	/*
> > +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> > +	 * since we do not support node hotplug. This ensures that we
> > +	 * lower the maximum NUMA node ID to what is actually present.
> > +	 */
> > +	node_possible_map = node_online_map;
>
> That looks nice, but is it generating what we want? ie. is the content
> of node_online_map being *copied* into node_possible_map. Or are we
> changing node_possible_map to point at node_online_map?
>
> I think it ends up being the latter, which is probably fine in practice
> (I think node_online_map is static on power after boot), but perhaps it
> would be better to do:
>
> 	nodes_and(node_possible_map, node_possible_map, node_online_map);
>
> ?

e.g.:

powerpc/numa: reset node_possible_map to only node_online_map

Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc() ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
value of nr_node_ids in setup_nr_node_ids and the iteration of
for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.
mem_cgroup_css_alloc should also be fixed to only iterate over
memory-populated nodes and handle hotplug, but that is a separate
change.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
To: Michael Ellerman <m...@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Tejun Heo <t...@kernel.org>
Cc: David Rientjes <rient...@google.com>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Anton Blanchard <an...@samba.org>
Cc: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>

---
v1 -> v2:
  Rather than clear node_possible_map and set it nid-by-nid, just
  directly assign node_online_map to it, as suggested by Michael
  Ellerman and Tejun Heo.
v2 -> v3:
  Rather than direct assignment (which is just repointing the pointer),
  modify node_possible_map in-place.

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..1a118b08fad2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)

 	memblock_dump_all();

+	/*
+	 * Reduce the possible NUMA nodes to the online NUMA nodes,
+	 * since we do not support node hotplug. This ensures that we
+	 * lower the maximum NUMA node ID to what is actually present.
+	 */
+	nodes_and(node_possible_map, node_possible_map, node_online_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
Hi David,

On 05.03.2015 [13:16:35 -0800], David Rientjes wrote:
> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..24de29b3651b 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> >
> >  	memblock_dump_all();
> >
> > +	/*
> > +	 * zero out the possible nodes after we parse the device-tree,
> > +	 * so that we lower the maximum NUMA node ID to what is actually
> > +	 * present.
> > +	 */
> > +	nodes_clear(node_possible_map);
> > +
> >  	for_each_online_node(nid) {
> >  		unsigned long start_pfn, end_pfn;
> >
> > +		node_set(nid, node_possible_map);
> >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >  		setup_node_data(nid, start_pfn, end_pfn);
> >  		sparse_memory_present_with_active_regions(nid);
>
> This seems a bit strange, node_possible_map is supposed to be a
> superset of node_online_map and this loop is iterating over
> node_online_map to set nodes in node_possible_map.

So if we compare to x86:

arch/x86/mm/numa.c::numa_init():

	nodes_clear(numa_nodes_parsed);
	nodes_clear(node_possible_map);
	nodes_clear(node_online_map);
	...
	numa_register_memblks(...);

arch/x86/mm/numa.c::numa_register_memblks():

	node_possible_map = numa_nodes_parsed;

Basically, it looks like x86 NUMA init clears out the possible map and
online map, probably for a similar reason to the one I gave in the
changelog: by default, the possible map seems to be based off
MAX_NUMNODES, rather than nr_node_ids or anything dynamic.

My patch was an attempt to emulate the same thing on powerpc. You are
right that there is a window in which node_possible_map and
node_online_map are out of sync with my patch.
It seems like it shouldn't matter given how early in boot we are, but
perhaps the following would have been clearer:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..1a118b08fad2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)

 	memblock_dump_all();

+	/*
+	 * Reduce the possible NUMA nodes to the online NUMA nodes,
+	 * since we do not support node hotplug. This ensures that we
+	 * lower the maximum NUMA node ID to what is actually present.
+	 */
+	nodes_and(node_possible_map, node_possible_map, node_online_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
On 05.03.2015 [13:58:27 -0800], David Rientjes wrote:
> On Fri, 6 Mar 2015, Michael Ellerman wrote:
> > > This seems a bit strange, node_possible_map is supposed to be a
> > > superset of node_online_map and this loop is iterating over
> > > node_online_map to set nodes in node_possible_map.
> >
> > Yeah. Though at this point in boot I don't think it matters that the
> > two maps are out-of-sync temporarily.
> >
> > But it would be simpler to just set the possible map to be the
> > online map. That would also maintain the invariant that the possible
> > map is always a superset of the online map. Or did I miss a detail
> > there (sleep deprived parent mode).
>
> I think reset_numa_cpu_lookup_table() which iterates over the possible
> map, and thus only a subset of nodes now, may be concerning.

I think you are confusing the CPU online map and the NUMA node online
map. reset_numa_cpu_lookup_table() is a cpu-node mapping, only called
at boot-time, and iterates over the CPU online map, which is unaltered
by my patch.

> I'm not sure why this is being proposed as a powerpc patch and not a
> patch for mem_cgroup_css_alloc().

I think mem_cgroup_css_alloc() is just an example of a larger issue. I
should have made that clearer in my changelog. Even if we change
mem_cgroup_css_alloc(), I think we want to fix the node_possible_map on
powerpc to be accurate at run-time, just like x86 does.
> In other words, why do we have to allocate for all possible nodes? We
> should only be allocating for online nodes in N_MEMORY with mem
> hotplug disabled initially and then have a mem hotplug callback
> implemented to alloc_mem_cgroup_per_zone_info() for nodes that
> transition from memoryless -> memory. The extra bonus is that
> alloc_mem_cgroup_per_zone_info() need never allocate remote memory and
> the TODO in that function can be removed.

This is a good idea, and it seems like it can be a follow-on, parallel
patch to the one I provided (which does need an updated changelog now).

Thanks,
Nish
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
On 05.03.2015 [17:13:08 -0500], Tejun Heo wrote:
> On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
> > While looking at this, I noticed that nr_node_ids is actually a
> > misnomer, it seems. It's not the number, but the maximum_node_id, as
> > with sparse NUMA nodes, you might only have two NUMA nodes possible,
> > but to make certain loops work, nr_node_ids will be, e.g., 17.
> > Should it be changed?
>
> It's the same for nr_cpu_ids. It's counting the number of valid IDs
> during that boot instance. In the above case, whether the nodes are
> sparse or not, there exist 17 node ids - 0 to 16. Maybe numa_max_id
> had been a better name (but would that equal the highest number or +1?)
> but nr_node_ids != nr_nodes so I don't think it's a misnomer either.
> Doesn't really matter at this point. Maybe add comments on top of both?

Yes, I will consider that. To me, I guess it's more a matter of:

a) How does nr_node_ids relate to the number of possible NUMA node IDs
   at runtime? They are identical.

b) How does nr_node_ids relate to the number of NUMA node IDs in use?
   There is no relation.

c) How does nr_node_ids relate to the maximum NUMA node ID in use? It
   is one larger than that value.

However, for a), at least, we don't care about that on power, really.
We don't have node hotplug, so the possible map is the online map in
practice, for a given system. Iteration seems to generally not be a
problem (since we have sparse iterators anyways) and we shouldn't be
allocating for non-present nodes. But we run into excessive allocations
(I'm looking into a few others Dipankar has found now) with array
allocations based off nr_node_ids or MAX_NUMNODES when the NUMA
topology is sparse.

-Nish
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
On 06.03.2015 [08:48:52 +1100], Michael Ellerman wrote:
> On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
> > On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> > > [RFC patch snipped]
> >
> > This seems a bit strange, node_possible_map is supposed to be a
> > superset of node_online_map and this loop is iterating over
> > node_online_map to set nodes in node_possible_map.
>
> Yeah. Though at this point in boot I don't think it matters that the
> two maps are out-of-sync temporarily.
>
> But it would be simpler to just set the possible map to be the online
> map. That would also maintain the invariant that the possible map is
> always a superset of the online map. Or did I miss a detail there
> (sleep deprived parent mode).

Yes, we could do that (see my reply to David just now). I didn't
consider just setting the map directly, that would be clearer. I didn't
want to post my nodes_and() version, because the cost of nodes_and()
seemed higher than nodes_clear() + node_set() appropriately.

-Nish
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
On 05.03.2015 [17:08:04 -0500], Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
> > I'm not sure why this is being proposed as a powerpc patch and not a
> > patch for mem_cgroup_css_alloc(). In other words, why do we have to
> > allocate for all possible nodes? We should only be allocating for
> > online nodes in N_MEMORY with mem hotplug disabled initially and
> > then have a mem hotplug callback implemented to
> > alloc_mem_cgroup_per_zone_info() for nodes that transition from
> > memoryless -> memory. The extra bonus is that
> > alloc_mem_cgroup_per_zone_info() need never allocate remote memory
> > and the TODO in that function can be removed.
>
> For cpus, the general direction is allocating for all possible cpus.
> For iterations, we alternate between using all possibles and onlines
> depending on the use case; however, the general idea is that the
> possibles and onlines aren't gonna be very different. NR_CPUS and
> MAX_NUMNODES gotta accommodate the worst possible case the kernel may
> run on, but the possible masks should be set to the actually possible
> subset during boot so that the kernel doesn't end up allocating for
> and iterating over things which can't ever exist.

Makes sense to me.

> It can be argued that we should always stick to the online masks for
> allocation and iteration; however, that usually requires more
> complexity and the only cases where this mattered have been when the
> boot code got it wrong and failed to set the possible masks correctly,
> which also seems to be the case here. I don't see any reason to
> deviate here.

So, do you agree with the general direction of my change? :)

Thanks,
Nish
[RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc() ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
iteration of for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

One could alternatively do
nodes_and(node_possible_map, node_possible_map, node_online_map), but I
think the cost of anding the two maps will always be higher than
clearing the map and setting a few bits in practice.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
While looking at this, I noticed that nr_node_ids is actually a
misnomer, it seems. It's not the number, but the maximum_node_id, as
with sparse NUMA nodes, you might only have two NUMA nodes possible, but
to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
changed?

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..24de29b3651b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,9 +958,17 @@ void __init initmem_init(void)

 	memblock_dump_all();

+	/*
+	 * zero out the possible nodes after we parse the device-tree,
+	 * so that we lower the maximum NUMA node ID to what is actually
+	 * present.
+	 */
+	nodes_clear(node_possible_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;

+		node_set(nid, node_possible_map);
 		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 		setup_node_data(nid, start_pfn, end_pfn);
 		sparse_memory_present_with_active_regions(nid);