Re: 1M hugepage size being registered on Linux
Michal Suchánekwrites: > On Tue, 20 Jun 2017 10:47:47 -0300 > victora wrote: > >> Hi Alistair/Jeremy, >> >> I am working on a bug related to 1M hugepage size being registered on >> Linux (Power 8 Baremetal - Garrison). >> >> I was checking dmesg and it seems that 1M page size is coming from >> firmware to Linux. >> >> [0.00] base_shift=20: shift=20, sllp=0x0130, >> avpnm=0x, tlbiel=0, penc=2 >> [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0 >> pages >> >> Should Linux support this page size? As afar as I know, this was an >> unsupported page size in the past isn't it? If this should be >> supported now, is there any specific reason for that? > > Hello, > > a525108cf1cc powerpc/mm/hugetlb: Filter out hugepage size not supported > by page table layout > > Should reject unsupported huge page sizes, right? Yes, see my reply. cheers
Re: [PATCH V3] cxl: Export library to support IBM XSL
Frederic Barratwrites: >> diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c >> new file mode 100644 >> index 000..4f4c5ca >> --- /dev/null >> +++ b/drivers/misc/cxl/cxllib.c >> @@ -0,0 +1,246 @@ >> +/* >> + * Copyright 2017 IBM Corp. >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public License >> + * as published by the Free Software Foundation; either version >> + * 2 of the License, or (at your option) any later version. >> + */ >> + >> +#include >> +#include >> +#include >> +#include > > Maybe somebody can comment on this, but I believe the usual order is the > 'linux' headers first, then the 'asm'. Though I don't know if there's a > valid reason behind it, or just tradition... AFAIK it's mainly tradition. In general if there's a linux/foo.h and an asm/foo.h you're usually supposed to include the linux version, and then that will include asm/foo.h appropriately. In some cases you might have an asm header that relies on something defined in the linux version, meaning including the asm one first doesn't build. Outside of arch it's slightly dubious to be including asm/ headers directly, because you require all arches to have that header. None of which really answers your question :) cheers
Re: [PATCH kernel 0/3 REPOST] vfio-pci: Add support for mmapping MSI-X table
On 23/06/17 07:11, Alex Williamson wrote: > On Thu, 15 Jun 2017 15:48:42 +1000 > Alexey Kardashevskiywrote: > >> Here is a patchset which Yongji was working on before >> leaving IBM LTC. Since we still want to have this functionality >> in the kernel (DPDK is the first user), here is a rebase >> on the current upstream. >> >> >> Current vfio-pci implementation disallows to mmap the page >> containing MSI-X table in case that users can write directly >> to MSI-X table and generate an incorrect MSIs. >> >> However, this will cause some performance issue when there >> are some critical device registers in the same page as the >> MSI-X table. We have to handle the mmio access to these >> registers in QEMU emulation rather than in guest. >> >> To solve this issue, this series allows to expose MSI-X table >> to userspace when hardware enables the capability of interrupt >> remapping which can ensure that a given PCI device can only >> shoot the MSIs assigned for it. And we introduce a new bus_flags >> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side >> for different archs. >> >> The patch 3 are based on the proposed patchset[1]. >> >> Changelog >> v3: >> - rebased on the current upstream > > There's something not forthcoming here, the last version I see from > Yongji is this one: > > https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017245.html > > Which was a 6-patch series where patches 2-4 tried to apply > PCI_BUS_FLAGS_MSI_REMAP for cases that supported other platforms. That > doesn't exist here, so it's not simply a rebase. Patch 1/ seems to > equate this new flag to the IOMMU capability IOMMU_CAP_INTR_REMAP, but > nothing is done here to match them together. That patch also mentions > the work Eric has done for similar features on ARM, but again those > patches are dropped. It seems like an incomplete feature now. Thanks, Thanks! I suspected this is not the latest but could not find anything better than we use internally for tests, and I could not reach Yongji for comments whether this was the latest update. As I am reading the patches, I notice that the "msi remap" term is used all over the place. While this remapping capability may be the case for x86/arm (and therefore the IOMMU_CAP_INTR_REMAP flag makes sense), powernv does not do remapping but provides hardware isolation. When we are allowing MSIX BAR mapping to the userspace - the isolation is what we really care about. Will it make sense to rename PCI_BUS_FLAGS_MSI_REMAP to PCI_BUS_FLAGS_MSI_ISOLATED ? Another thing - the patchset enables PCI_BUS_FLAGS_MSI_REMAP when IOMMU just advertises IOMMU_CAP_INTR_REMAP, not necessarily uses it, should the patchset actually look at something like irq_remapping_enabled in drivers/iommu/amd_iommu.c instead? > > Alex > >> v2: >> - Make the commit log more clear >> - Replace pci_bus_check_msi_remapping() with pci_bus_msi_isolated() >> so that we could clearly know what the function does >> - Set PCI_BUS_FLAGS_MSI_REMAP in pci_create_root_bus() instead >> of iommu_bus_notifier() >> - Reserve VFIO_REGION_INFO_FLAG_CAPS when we allow to mmap MSI-X >> table so that we can know whether we allow to mmap MSI-X table >> in QEMU >> >> [1] >> https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg1138820.html >> >> >> This is based on sha1 >> 63f700aab4c1 Linus Torvalds "Merge tag 'xtensa-20170612' of >> git://github.com/jcmvbkbc/linux-xtensa". >> >> Please comment. Thanks. >> >> >> >> Yongji Xie (3): >> PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag >> pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge >> vfio-pci: Allow to expose MSI-X table to userspace if interrupt >> remapping is enabled >> >> include/linux/pci.h | 1 + >> arch/powerpc/platforms/powernv/pci-ioda.c | 8 >> drivers/vfio/pci/vfio_pci.c | 18 +++--- >> drivers/vfio/pci/vfio_pci_rdwr.c | 3 ++- >> 4 files changed, 26 insertions(+), 4 deletions(-) >> > -- Alexey
Re: linux-next: build warning after merge of the tip tree
Hi Michael, On Fri, 23 Jun 2017 14:10:25 +1000 Michael Ellermanwrote: > > Fixed in my next today by: > > d4cfb11387ee ("powerpc: Convert VDSO update function to use new > update_vsyscall interface") > > But you must have pulled before I pushed that, so the warning will go > away tomorrow. Excellent, thanks. -- Cheers, Stephen Rothwell
Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd
Thomas Gleixnerwrites: > On Thu, 22 Jun 2017, Thiago Jung Bauermann wrote: >> Michael Bringmann provided this information: >> >> It's not hard to backport both this patch and commit fe5595c07400 >> >> ("stop_machine: Provide stop_machine_cpuslocked()") from branch >> >> smp/hotplug in tip.git for stable. >> > >> > Yeah but it's not really my business backporting that unfortunately. >> >> Sorry, I wasn't clear. I was offering to provide backported patches for >> the relevant stable branches. >> >> Though that will only be necessary if we also backport the topology >> fixes as well. > > So shall I pick up the fix and route it through tip smp/hotplug where the > cpu hotplug core changes reside on which that patch depends on? Yes please. Here's an ack if you like: Acked-by: Michael Ellerman cheers
Re: linux-next: build warning after merge of the tip tree
Stephen Rothwellwrites: > Hi all, > > [Forgot to cc John] > > On Fri, 23 Jun 2017 13:58:34 +1000 Stephen Rothwell > wrote: >> >> Hi all, >> >> After merging the tip tree, today's linux-next build (powerpc >> ppc64_defconfig) produced this warning: >> >> kernel/time/timekeeping.c:519:2: warning: #warning Please contact your >> maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. >> [-Wcpp] >> #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD >> compatibity will disappear soon. >> ^ >> kernel/time/timekeeping.c:519:2: warning: #warning Please contact your >> maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. >> [-Wcpp] >> #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD >> compatibity will disappear soon. >> ^ >> >> Introduced by commit >> >> 369adf04d80a ("time: Add warning about imminent deprecation of >> CONFIG_GENERIC_TIME_VSYSCALL_OLD") Fixed in my next today by: d4cfb11387ee ("powerpc: Convert VDSO update function to use new update_vsyscall interface") But you must have pulled before I pushed that, so the warning will go away tomorrow. cheers
Re: linux-next: build warning after merge of the tip tree
Hi all, [Forgot to cc John] On Fri, 23 Jun 2017 13:58:34 +1000 Stephen Rothwellwrote: > > Hi all, > > After merging the tip tree, today's linux-next build (powerpc > ppc64_defconfig) produced this warning: > > kernel/time/timekeeping.c:519:2: warning: #warning Please contact your > maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. > [-Wcpp] > #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD > compatibity will disappear soon. > ^ > kernel/time/timekeeping.c:519:2: warning: #warning Please contact your > maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. > [-Wcpp] > #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD > compatibity will disappear soon. > ^ > > Introduced by commit > > 369adf04d80a ("time: Add warning about imminent deprecation of > CONFIG_GENERIC_TIME_VSYSCALL_OLD") -- Cheers, Stephen Rothwell
linux-next: build warning after merge of the tip tree
Hi all, After merging the tip tree, today's linux-next build (powerpc ppc64_defconfig) produced this warning: kernel/time/timekeeping.c:519:2: warning: #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. [-Wcpp] #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. ^ kernel/time/timekeeping.c:519:2: warning: #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. [-Wcpp] #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. ^ Introduced by commit 369adf04d80a ("time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD") -- Cheers, Stephen Rothwell
Re: [PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9
Yes please... Would be helpful On 22 Jun. 2017 7:50 pm, "Stewart Smith"wrote: > Michael Neuling writes: > > On POWER9 the ERAT may be incorrect on wakeup from some stop states > > that lose state. This causes random segvs and illegal instructions > > when these stop states are enabled. > > > > This patch invalidates the ERAT on wakeup on POWER9 to prevent this > > from causing a problem. > > > > Signed-off-by: Michael Neuling > > CC: stable? > > -- > Stewart Smith > OPAL Architect, IBM. > >
Re: [PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9
On Fri, 2017-06-23 at 10:50 +1000, Stewart Smith wrote: > Michael Neulingwrites: > > On POWER9 the ERAT may be incorrect on wakeup from some stop states > > that lose state. This causes random segvs and illegal instructions > > when these stop states are enabled. > > > > This patch invalidates the ERAT on wakeup on POWER9 to prevent this > > from causing a problem. > > > > Signed-off-by: Michael Neuling > > CC: stable? Yes please! Mikey
Re: [PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9
Michael Neulingwrites: > On POWER9 the ERAT may be incorrect on wakeup from some stop states > that lose state. This causes random segvs and illegal instructions > when these stop states are enabled. > > This patch invalidates the ERAT on wakeup on POWER9 to prevent this > from causing a problem. > > Signed-off-by: Michael Neuling CC: stable? -- Stewart Smith OPAL Architect, IBM.
Re: new dma-mapping tree, was Re: clean up and modularize arch dma_mapping interface V2
Hi all, On Wed, 21 Jun 2017 15:32:39 +0200 Marek Szyprowskiwrote: > > On 2017-06-20 15:16, Christoph Hellwig wrote: > > On Tue, Jun 20, 2017 at 11:04:00PM +1000, Stephen Rothwell wrote: > >> git://git.linaro.org/people/mszyprowski/linux-dma-mapping.git#dma-mapping-next > >> > >> Contacts: Marek Szyprowski and Kyungmin Park (cc'd) > >> > >> I have called your tree dma-mapping-hch for now. The other tree has > >> not been updated since 4.9-rc1 and I am not sure how general it is. > >> Marek, Kyungmin, any comments? > > I'd be happy to join efforts - co-maintainers and reviers are always > > welcome. > > I did some dma-mapping unification works in the past and my tree in > linux-next > was a side effect of that. I think that for now it can be dropped in > favor of > Christoph's tree. I can also do some review and help in maintainers work if > needed, although I was recently busy with other stuff. OK, so I have dropped the dma-mapping tree and renamed dma-mapping-hch to dma-mapping. -- Cheers, Stephen Rothwell
Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next, v3, 3/3] udp: try to avoid 2 cache miss on dequeue)
On Thu, Jun 22, 2017, at 22:57, Paolo Abeni wrote: > > Can you please check if the following patch fixes the issue? Only > compiled tested here. > > Thanks!!! > --- > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index 067a607..80d89fe 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -1446,16 +1446,19 @@ static struct sk_buff > *__first_packet_length(struct sock *sk, > { > struct sk_buff *skb; > > - while ((skb = skb_peek(rcvq)) != NULL && > - udp_lib_checksum_complete(skb)) { > - __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, > - IS_UDPLITE(sk)); > - __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, > - IS_UDPLITE(sk)); > - atomic_inc(>sk_drops); > - __skb_unlink(skb, rcvq); > - *total += skb->truesize; > - kfree_skb(skb); > + while ((skb = skb_peek(rcvq)) != NULL) { > + if (udp_lib_checksum_complete(skb)) { > + __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, > + IS_UDPLITE(sk)); > + __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, > + IS_UDPLITE(sk)); > + atomic_inc(>sk_drops); > + __skb_unlink(skb, rcvq); > + *total += skb->truesize; > + kfree_skb(skb); > + } else { > + udp_set_dev_scratch(skb); It needs a "break;" here. > + } > } > return skb; > } Bye, Hannes
Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
On 6/22/2017 3:03 PM, Arnd Bergmann wrote: Drivers that want a non-atomic variant should include either include/linux/io-64-nonatomic-hi-lo.h or include/linux/io-64-nonatomic-lo-hi.h depending on what they need. Drivers that require 64-bit I/O should probably just depend on CONFIG_64BIT and maybe use readq/writeq. Ok, I will work something like that up. We'll still need a patch similar to patch 2 (less the non-atomic versions) seeing even CONFIG_GENERIC_IOMAP arches don't actually have a working ioread64/iowrite64 implementation. Thanks, Logan
Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next,v3,3/3] udp: try to avoid 2 cache miss on dequeue)
On Thu, 2017-06-22 at 23:06 +1000, Michael Ellerman wrote: > Paolo wrote: > > when udp_recvmsg() is executed, on x86_64 and other archs, most skb > > fields are on cold cachelines. > > If the skb are linear and the kernel don't need to compute the udp > > csum, only a handful of skb fields are required by udp_recvmsg(). > > Since we already use skb->dev_scratch to cache hot data, and > > there are 32 bits unused on 64 bit archs, use such field to cache > > as much data as we can, and try to prefetch on dequeue the relevant > > fields that are left out. > > > > This can save up to 2 cache miss per packet. > > > > v1 -> v2: > > - changed udp_dev_scratch fields types to u{32,16} variant, > > replaced bitfiled with bool > > > > Signed-off-by: Paolo Abeni> > Acked-by: Eric Dumazet > > --- > > net/ipv4/udp.c | 114 > > +++-- > > 1 file changed, 103 insertions(+), 11 deletions(-) > > This appears to break wget on one of my machines. > > Networking in general is working, I'm able to SSH in, but then I can't > do a wget. Can you please check if the following patch fixes the issue? Only compiled tested here. Thanks!!! --- diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 067a607..80d89fe 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1446,16 +1446,19 @@ static struct sk_buff *__first_packet_length(struct sock *sk, { struct sk_buff *skb; - while ((skb = skb_peek(rcvq)) != NULL && - udp_lib_checksum_complete(skb)) { - __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, - IS_UDPLITE(sk)); - __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, - IS_UDPLITE(sk)); - atomic_inc(>sk_drops); - __skb_unlink(skb, rcvq); - *total += skb->truesize; - kfree_skb(skb); + while ((skb = skb_peek(rcvq)) != NULL) { + if (udp_lib_checksum_complete(skb)) { + __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, + IS_UDPLITE(sk)); + __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, + IS_UDPLITE(sk)); + atomic_inc(>sk_drops); + __skb_unlink(skb, rcvq); + *total += skb->truesize; + kfree_skb(skb); + } else { + udp_set_dev_scratch(skb); + } } return skb; }
Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available
On 6/22/2017 2:36 PM, Alan Cox wrote: I think that makes sense for the platforms with that problem. I'm not sure there are many that can't do it for mmio at least. 486SX can't do it and I guess some ARM32 but I think almost everyone else can including most 32bit x86. What's more of a problem is a lot of platforms can do 64bit MMIO via ioread/write64 but not 64bit port I/O, and it's not clear how you represent that via an ioread/write API that abstracts it away. In Patch 2, we call bad_io_access for anyone trying to do 64bit accesses on port I/O. Logan
Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next,v3,3/3] udp: try to avoid 2 cache miss on dequeue)
On Thu, 2017-06-22 at 18:43 +0200, Paolo Abeni wrote: > On Thu, 2017-06-22 at 23:06 +1000, Michael Ellerman wrote: > > Paolo wrote: > > > when udp_recvmsg() is executed, on x86_64 and other archs, most skb > > > fields are on cold cachelines. > > > If the skb are linear and the kernel don't need to compute the udp > > > csum, only a handful of skb fields are required by udp_recvmsg(). > > > Since we already use skb->dev_scratch to cache hot data, and > > > there are 32 bits unused on 64 bit archs, use such field to cache > > > as much data as we can, and try to prefetch on dequeue the relevant > > > fields that are left out. > > > > > > This can save up to 2 cache miss per packet. > > > > > > v1 -> v2: > > > - changed udp_dev_scratch fields types to u{32,16} variant, > > > replaced bitfiled with bool > > > > > > Signed-off-by: Paolo Abeni> > > Acked-by: Eric Dumazet > > > --- > > > net/ipv4/udp.c | 114 > > > +++-- > > > 1 file changed, 103 insertions(+), 11 deletions(-) > > > > This appears to break wget on one of my machines. > > > > Networking in general is working, I'm able to SSH in, but then I can't > > do a wget. > > > > eg: > > > > $ wget google.com > > --2017-06-22 22:45:39-- http://google.com/ > > Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in > > name resolution. > > wget: unable to resolve host address ‘proxy.pmdw.com’ > > > > $ host proxy.pmdw.com > > proxy.pmdw.com is an alias for raven.pmdw.com. > > raven.pmdw.com has address 10.1.2.3 > > > > $ wget google.com > > --2017-06-22 22:52:08-- http://google.com/ > > Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in > > name resolution. > > wget: unable to resolve host address ‘proxy.pmdw.com’ > > > > Maybe host is using TCP but the man page says it doesn't? > > > > > > Everything is OK if I boot back to the previous commit 0a463c78d25b > > ("udp: avoid a cache miss on dequeue"): > > > > $ wget google.com > > --2017-06-22 23:00:01-- http://google.com/ > > Resolving proxy.pmdw.com (proxy.pmdw.com)... 10.1.2.3 > > Connecting to proxy.pmdw.com (proxy.pmdw.com)|10.1.2.3|:3128... connected. > > Proxy request sent, awaiting response... 302 Found > > Location: http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE > > [following] > > --2017-06-22 23:00:01-- > > http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE > > Reusing existing connection to proxy.pmdw.com:3128. > > Proxy request sent, awaiting response... 200 OK > > Length: unspecified [text/html] > > Saving to: ‘index.html’ > > > > index.html [ <=> > > ] 11.37K --.-KB/sin 0.001s > > > > 2017-06-22 23:00:01 (22.0 MB/s) - ‘index.html’ saved [11640] > > > > $ uname -a > > Linux 4.12.0-rc4-gcc6-00988-g0a463c7 #88 SMP Thu Jun 22 22:55:12 AEST 2017 > > ppc64 GNU/Linux > > > > > > Haven't had time to debug any further. Any ideas? > > Thank you for this report. > > Can you please specify features of the relevant NIC ? (ethtool -k > ) > > I'll try to replicate the issue as soon I'll get hands on suitable HW, I had my hands on power7, but I can't trivially reproduce the issue so I'm going to bug you for more info. Can you please specify the host CPU, the NIC in use (ethtool -i ), the compiler version used to build the kernel and possibly provide a tcpdump of the DNS packets received/sent while running wget and while running the host command? Do you have the relevant kernel running on others PPC hosts? Thank you, Paolo
Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available
On 6/22/2017 2:14 PM, Alan Cox wrote: If a platform doesn't support 64bit I/O operations from the CPU then you either need to use some kind of platform/architecture specific interface if present or accept you don't have one. Yes, I understand that. The thing is that every user that's currently using it right now is patching in their own version that splits it on non-64bit systems. It's not safe to split it. Possibly for some use cases you could add an ioread64_maysplit() I'm open to doing something like that. What btw is the actual ARM compiler warning ? Is the compiler also trying to tell you it's a bad idea ? It's just the compiler noting that you are mixing volatile and non-volatile pointers. Strangely some io{read|write}XX use volatile but most do not. But it's nothing crazy. Logan
Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
On 6/22/2017 2:08 PM, Alan Cox wrote: But this does not do the same thing as an ioread64 with regards to atomicity or side effects on the device. The same is true of the other hacks. You either have a real 64bit single read/write from MMIO space or you don't. You can't fake it. Yes, I know. But is it not better than having every driver that wants to use these functions fake it themselves? Logan
Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
On 6/22/2017 11:29 AM, Stephen Bates wrote: +#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p)) Logan, thanks for taking this cleanup on. I think this should be iowrite64 not iowrite32? Yup, good catch. Thanks. I'll fix it in a v2 of this series. Logan
Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
> +#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p)) Logan, thanks for taking this cleanup on. I think this should be iowrite64 not iowrite32? Stephen
[PATCH 1/7] drm/tilcdc: don't use volatile with iowrite64
This is a prep patch for adding a universal iowrite64. The patch is to prevent compiler warnings when we add iowrite64 that would occur because there is an unnecessary volatile in this driver. Signed-off-by: Logan GunthorpeCc: Jyri Sarha Cc: Tomi Valkeinen Cc: David Airlie --- drivers/gpu/drm/tilcdc/tilcdc_regs.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/tilcdc/tilcdc_regs.h b/drivers/gpu/drm/tilcdc/tilcdc_regs.h index 9d528c0a67a4..e9ce725698a9 100644 --- a/drivers/gpu/drm/tilcdc/tilcdc_regs.h +++ b/drivers/gpu/drm/tilcdc/tilcdc_regs.h @@ -131,14 +131,14 @@ static inline void tilcdc_write(struct drm_device *dev, u32 reg, u32 data) static inline void tilcdc_write64(struct drm_device *dev, u32 reg, u64 data) { struct tilcdc_drm_private *priv = dev->dev_private; - volatile void __iomem *addr = priv->mmio + reg; + void __iomem *addr = priv->mmio + reg; #ifdef iowrite64 iowrite64(data, addr); #else __iowmb(); /* This compiles to strd (=64-bit write) on ARM7 */ - *(volatile u64 __force *)addr = __cpu_to_le64(data); + *(u64 __force *)addr = __cpu_to_le64(data); #endif } -- 2.11.0
[PATCH 7/7] crypto: caam: cleanup CONFIG_64BIT ifdefs when using io{read|write}64
Now that ioread64 and iowrite64 are always available we don't need the ugly ifdefs to change their implementation when they are not. Signed-off-by: Logan GunthorpeCc: "Horia Geantă" Cc: Dan Douglass Cc: Herbert Xu Cc: "David S. Miller" --- drivers/crypto/caam/regs.h | 29 - 1 file changed, 29 deletions(-) diff --git a/drivers/crypto/caam/regs.h b/drivers/crypto/caam/regs.h index 84d2f838a063..26fc19dd0c39 100644 --- a/drivers/crypto/caam/regs.h +++ b/drivers/crypto/caam/regs.h @@ -134,7 +134,6 @@ static inline void clrsetbits_32(void __iomem *reg, u32 clear, u32 set) *base + 0x : least-significant 32 bits *base + 0x0004 : most-significant 32 bits */ -#ifdef CONFIG_64BIT static inline void wr_reg64(void __iomem *reg, u64 data) { if (caam_little_end) @@ -151,34 +150,6 @@ static inline u64 rd_reg64(void __iomem *reg) return ioread64be(reg); } -#else /* CONFIG_64BIT */ -static inline void wr_reg64(void __iomem *reg, u64 data) -{ -#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX - if (caam_little_end) { - wr_reg32((u32 __iomem *)(reg) + 1, data >> 32); - wr_reg32((u32 __iomem *)(reg), data); - } else -#endif - { - wr_reg32((u32 __iomem *)(reg), data >> 32); - wr_reg32((u32 __iomem *)(reg) + 1, data); - } -} - -static inline u64 rd_reg64(void __iomem *reg) -{ -#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX - if (caam_little_end) - return ((u64)rd_reg32((u32 __iomem *)(reg) + 1) << 32 | - (u64)rd_reg32((u32 __iomem *)(reg))); - else -#endif - return ((u64)rd_reg32((u32 __iomem *)(reg)) << 32 | - (u64)rd_reg32((u32 __iomem *)(reg) + 1)); -} -#endif /* CONFIG_64BIT */ - #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT #ifdef CONFIG_SOC_IMX7D #define cpu_to_caam_dma(value) \ -- 2.11.0
[PATCH 6/7] drm/tilcdc: clean up ifdef hacks around iowrite64
Now that we can expect iowrite64 to always exist the hack is no longer necessary so we just call iowrite64 directly. Signed-off-by: Logan GunthorpeCc: Jyri Sarha Cc: Tomi Valkeinen Cc: David Airlie --- drivers/gpu/drm/tilcdc/tilcdc_regs.h | 6 -- 1 file changed, 6 deletions(-) diff --git a/drivers/gpu/drm/tilcdc/tilcdc_regs.h b/drivers/gpu/drm/tilcdc/tilcdc_regs.h index e9ce725698a9..0b901405f30a 100644 --- a/drivers/gpu/drm/tilcdc/tilcdc_regs.h +++ b/drivers/gpu/drm/tilcdc/tilcdc_regs.h @@ -133,13 +133,7 @@ static inline void tilcdc_write64(struct drm_device *dev, u32 reg, u64 data) struct tilcdc_drm_private *priv = dev->dev_private; void __iomem *addr = priv->mmio + reg; -#ifdef iowrite64 iowrite64(data, addr); -#else - __iowmb(); - /* This compiles to strd (=64-bit write) on ARM7 */ - *(u64 __force *)addr = __cpu_to_le64(data); -#endif } static inline u32 tilcdc_read(struct drm_device *dev, u32 reg) -- 2.11.0
[PATCH 2/7] iomap: implement ioread64 and iowrite64
Currently, ioread64 and iowrite64 are not impleminted in the generic iomap implementation. The prototypes are defined if CONFIG_64BIT is set but there is no actual implementation. Seeing the functions are not universally available, they are unusable for driver developers. This leads to ugly hacks such as those at the top of drivers/ntb/hw/intel/ntb_hw_intel.c This patch adds generic implementations for these functions. We add the obvious version if readq/writeq are implemented and fall back to using two io32 calls in cases that don't provide direct 64bit accesses. Thus making the functions universally available to configurations with CONFIG_GENERIC_IOMAP=y. For any pio accesses, the 64bit operations remain unsupported and simply call bad_io_access in cases readq would be called. Signed-off-by: Logan GunthorpeCc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Arnd Bergmann Cc: Suresh Warrier Cc: Nicholas Piggin --- arch/powerpc/include/asm/io.h | 2 ++ include/asm-generic/iomap.h | 4 --- lib/iomap.c | 62 +++ 3 files changed, 64 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 422f99cf9924..11a83667d2c3 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -788,8 +788,10 @@ extern void __iounmap_at(void *ea, unsigned long size); #define mmio_read16be(addr)readw_be(addr) #define mmio_read32be(addr)readl_be(addr) +#define mmio_read64be(addr)readq_be(addr) #define mmio_write16be(val, addr) writew_be(val, addr) #define mmio_write32be(val, addr) writel_be(val, addr) +#define mmio_write64be(val, addr) writeq_be(val, addr) #define mmio_insb(addr, dst, count)readsb(addr, dst, count) #define mmio_insw(addr, dst, count)readsw(addr, dst, count) #define mmio_insl(addr, dst, count)readsl(addr, dst, count) diff --git a/include/asm-generic/iomap.h b/include/asm-generic/iomap.h index 650fede33c25..43ec4ea9f6f9 100644 --- a/include/asm-generic/iomap.h +++ b/include/asm-generic/iomap.h @@ -30,20 +30,16 @@ extern unsigned int ioread16(void __iomem *); extern unsigned int ioread16be(void __iomem *); extern unsigned int ioread32(void __iomem *); extern unsigned int ioread32be(void __iomem *); -#ifdef CONFIG_64BIT extern u64 ioread64(void __iomem *); extern u64 ioread64be(void __iomem *); -#endif extern void iowrite8(u8, void __iomem *); extern void iowrite16(u16, void __iomem *); extern void iowrite16be(u16, void __iomem *); extern void iowrite32(u32, void __iomem *); extern void iowrite32be(u32, void __iomem *); -#ifdef CONFIG_64BIT extern void iowrite64(u64, void __iomem *); extern void iowrite64be(u64, void __iomem *); -#endif /* * "string" versions of the above. Note that they diff --git a/lib/iomap.c b/lib/iomap.c index fc3dcb4b238e..e38e036cb52f 100644 --- a/lib/iomap.c +++ b/lib/iomap.c @@ -66,6 +66,7 @@ static void bad_io_access(unsigned long port, const char *access) #ifndef mmio_read16be #define mmio_read16be(addr) be16_to_cpu(__raw_readw(addr)) #define mmio_read32be(addr) be32_to_cpu(__raw_readl(addr)) +#define mmio_read64be(addr) be64_to_cpu(__raw_readq(addr)) #endif unsigned int ioread8(void __iomem *addr) @@ -93,11 +94,45 @@ unsigned int ioread32be(void __iomem *addr) IO_COND(addr, return pio_read32be(port), return mmio_read32be(addr)); return 0x; } + +#ifdef readq +u64 ioread64(void __iomem *addr) +{ + IO_COND(addr, bad_io_access(port, "ioread64"), return readq(addr)); + return 0xLL; +} +u64 ioread64be(void __iomem *addr) +{ + IO_COND(addr, bad_io_access(port, "ioread64be"), + return mmio_read64be(addr)); + return 0xLL; +} +#else +u64 ioread64(void __iomem *addr) +{ + u64 low, high; + + low = ioread32(addr); + high = ioread32(addr + sizeof(u32)); + return low | (high << 32); +} +u64 ioread64be(void __iomem *addr) +{ + u64 low, high; + + low = ioread32be(addr + sizeof(u32)); + high = ioread32be(addr); + return low | (high << 32); +} +#endif + EXPORT_SYMBOL(ioread8); EXPORT_SYMBOL(ioread16); EXPORT_SYMBOL(ioread16be); EXPORT_SYMBOL(ioread32); EXPORT_SYMBOL(ioread32be); +EXPORT_SYMBOL(ioread64); +EXPORT_SYMBOL(ioread64be); #ifndef pio_write16be #define pio_write16be(val,port) outw(swab16(val),port) @@ -107,6 +142,7 @@ EXPORT_SYMBOL(ioread32be); #ifndef mmio_write16be #define mmio_write16be(val,port) __raw_writew(be16_to_cpu(val),port) #define mmio_write32be(val,port) __raw_writel(be32_to_cpu(val),port) +#define mmio_write64be(val,port) __raw_writeq(be64_to_cpu(val),port) #endif void iowrite8(u8 val, void __iomem *addr) @@ -129,11
[PATCH 0/7] cleanup issues with io{read|write}64
Hi, Presently, the 64bit IO functions are not very usable in drivers because they are not universally available in all architectures. This leads to a bunch of hacks in the kernel to work around this. (See the last 3 patches in this series.) As part of my switchtec_ntb submission which added another one of these warts, Greg asked me to look into fixing it[1]. So this patchset attempts to solve this issue by filling in the missing implementations in iomap.c and io.h. After that, the alpha architecture is the only one I found that also needed a fix for this. Finally, this patchset removes the hacks that have accumulated in the kernel, thus far, for working around this. This set is based off of v4.12-rc6. Thanks, Logan [1] https://marc.info/?l=linux-kernel=149774601910663=2 Logan Gunthorpe (7): drm/tilcdc: don't use volatile with iowrite64 iomap: implement ioread64 and iowrite64 asm-generic/io.h: make ioread64 and iowrite64 universally available alpha: provide ioread64 and iowrite64 implementations ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks drm/tilcdc: clean up ifdef hacks around iowrite64 crypto: caam: cleanup CONFIG_64BIT ifdefs when using io{read|write}64 arch/alpha/include/asm/io.h | 2 ++ arch/alpha/kernel/io.c | 18 +++ arch/powerpc/include/asm/io.h| 2 ++ drivers/crypto/caam/regs.h | 29 - drivers/gpu/drm/tilcdc/tilcdc_regs.h | 8 + drivers/ntb/hw/intel/ntb_hw_intel.c | 30 - include/asm-generic/io.h | 54 --- include/asm-generic/iomap.h | 4 --- lib/iomap.c | 62 9 files changed, 127 insertions(+), 82 deletions(-) -- 2.11.0
[PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
Alpha implements its own io operation and doesn't use the common library. Thus to make ioread64 and iowrite64 globally available we need to add implementations for alpha. For this, we simply use calls that chain two 32-bit operations. (mostly because I don't really understand the alpha architecture.) Signed-off-by: Logan GunthorpeCc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner --- arch/alpha/include/asm/io.h | 2 ++ arch/alpha/kernel/io.c | 18 ++ 2 files changed, 20 insertions(+) diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h index ff4049155c84..15588092c062 100644 --- a/arch/alpha/include/asm/io.h +++ b/arch/alpha/include/asm/io.h @@ -493,8 +493,10 @@ extern inline void writeq(u64 b, volatile void __iomem *addr) #define ioread16be(p) be16_to_cpu(ioread16(p)) #define ioread32be(p) be32_to_cpu(ioread32(p)) +#define ioread64be(p) be64_to_cpu(ioread64(p)) #define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p)) #define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p)) +#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p)) #define inb_p inb #define inw_p inw diff --git a/arch/alpha/kernel/io.c b/arch/alpha/kernel/io.c index 19c5875ab398..8c28026f7849 100644 --- a/arch/alpha/kernel/io.c +++ b/arch/alpha/kernel/io.c @@ -59,6 +59,24 @@ EXPORT_SYMBOL(iowrite8); EXPORT_SYMBOL(iowrite16); EXPORT_SYMBOL(iowrite32); +u64 ioread64(void __iomem *addr) +{ + u64 low, high; + + low = ioread32(addr); + high = ioread32(addr + sizeof(u32)); + return low | (high << 32); +} + +void iowrite64(u64 val, void __iomem *addr) +{ + iowrite32(val, addr); + iowrite32(val >> 32, addr + sizeof(u32)); +} + +EXPORT_SYMBOL(ioread64); +EXPORT_SYMBOL(iowrite64); + u8 inb(unsigned long port) { return ioread8(ioport_map(port, 1)); -- 2.11.0
[PATCH 5/7] ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks
Now that ioread64 and iowrite64 are available generically we can remove the hack at the top of ntb_hw_intel.c that patches them in when they are not available. Signed-off-by: Logan GunthorpeCc: Jon Mason Cc: Dave Jiang Cc: Allen Hubbe --- drivers/ntb/hw/intel/ntb_hw_intel.c | 30 -- 1 file changed, 30 deletions(-) diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c b/drivers/ntb/hw/intel/ntb_hw_intel.c index c00238491673..56221d540c2b 100644 --- a/drivers/ntb/hw/intel/ntb_hw_intel.c +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c @@ -153,35 +153,6 @@ MODULE_PARM_DESC(xeon_b2b_dsd_bar5_addr32, static inline enum ntb_topo xeon_ppd_topo(struct intel_ntb_dev *ndev, u8 ppd); static int xeon_init_isr(struct intel_ntb_dev *ndev); -#ifndef ioread64 -#ifdef readq -#define ioread64 readq -#else -#define ioread64 _ioread64 -static inline u64 _ioread64(void __iomem *mmio) -{ - u64 low, high; - - low = ioread32(mmio); - high = ioread32(mmio + sizeof(u32)); - return low | (high << 32); -} -#endif -#endif - -#ifndef iowrite64 -#ifdef writeq -#define iowrite64 writeq -#else -#define iowrite64 _iowrite64 -static inline void _iowrite64(u64 val, void __iomem *mmio) -{ - iowrite32(val, mmio); - iowrite32(val >> 32, mmio + sizeof(u32)); -} -#endif -#endif - static inline int pdev_is_atom(struct pci_dev *pdev) { switch (pdev->device) { @@ -3008,4 +2979,3 @@ static void __exit intel_ntb_pci_driver_exit(void) debugfs_remove_recursive(debugfs_dir); } module_exit(intel_ntb_pci_driver_exit); - -- 2.11.0
[PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available
Currently, ioread64 and iowrite64 are only available io CONFIG_64BIT=y and CONFIG_GENERIC_IOMAP=n. Thus, seeing the functions are not universally available, it makes them unusable for driver developers. This leads to ugly hacks such as those at the top of drivers/ntb/hw/intel/ntb_hw_intel.c This patch adds fallback implementations for when CONFIG_64BIT and CONFIG_GENERIC_IOMAP are not set. These functions use two io32 based calls to complete the operation. Note, we do not use the volatile keyword in these functions like the others in the same file. It is necessary to avoid a compiler warning on arm. Signed-off-by: Logan GunthorpeCc: Arnd Bergmann --- include/asm-generic/io.h | 54 +--- 1 file changed, 42 insertions(+), 12 deletions(-) diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h index 7ef015eb3403..817edaa3da78 100644 --- a/include/asm-generic/io.h +++ b/include/asm-generic/io.h @@ -585,15 +585,24 @@ static inline u32 ioread32(const volatile void __iomem *addr) } #endif -#ifdef CONFIG_64BIT #ifndef ioread64 #define ioread64 ioread64 -static inline u64 ioread64(const volatile void __iomem *addr) +#ifdef readq +static inline u64 ioread64(const void __iomem *addr) { return readq(addr); } +#else +static inline u64 ioread64(const void __iomem *addr) +{ + u64 low, high; + + low = ioread32(addr); + high = ioread32(addr + sizeof(u32)); + return low | (high << 32); +} +#endif #endif -#endif /* CONFIG_64BIT */ #ifndef iowrite8 #define iowrite8 iowrite8 @@ -619,15 +628,21 @@ static inline void iowrite32(u32 value, volatile void __iomem *addr) } #endif -#ifdef CONFIG_64BIT #ifndef iowrite64 #define iowrite64 iowrite64 -static inline void iowrite64(u64 value, volatile void __iomem *addr) +#ifdef writeq +static inline void iowrite64(u64 value, void __iomem *addr) { writeq(value, addr); } +#else +static inline void iowrite64(u64 value, void __iomem *addr) +{ + iowrite32(value, addr); + iowrite32(value >> 32, addr + sizeof(u32)); +} +#endif #endif -#endif /* CONFIG_64BIT */ #ifndef ioread16be #define ioread16be ioread16be @@ -645,15 +660,24 @@ static inline u32 ioread32be(const volatile void __iomem *addr) } #endif -#ifdef CONFIG_64BIT #ifndef ioread64be #define ioread64be ioread64be -static inline u64 ioread64be(const volatile void __iomem *addr) +#ifdef readq +static inline u64 ioread64be(const void __iomem *addr) { return swab64(readq(addr)); } +#else +static inline u64 ioread64be(const void __iomem *addr) +{ + u64 low, high; + + low = ioread32be(addr + sizeof(u32)); + high = ioread32be(addr); + return low | (high << 32); +} +#endif #endif -#endif /* CONFIG_64BIT */ #ifndef iowrite16be #define iowrite16be iowrite16be @@ -671,15 +695,21 @@ static inline void iowrite32be(u32 value, volatile void __iomem *addr) } #endif -#ifdef CONFIG_64BIT #ifndef iowrite64be #define iowrite64be iowrite64be -static inline void iowrite64be(u64 value, volatile void __iomem *addr) +#ifdef writeq +static inline void iowrite64be(u64 value, void __iomem *addr) { writeq(swab64(value), addr); } +#else +static inline void iowrite64be(u64 value, void __iomem *addr) +{ + iowrite32be(value >> 32, addr); + iowrite32be(value, addr + sizeof(u32)); +} +#endif #endif -#endif /* CONFIG_64BIT */ #ifndef ioread8_rep #define ioread8_rep ioread8_rep -- 2.11.0
Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next,v3,3/3] udp: try to avoid 2 cache miss on dequeue)
On Thu, 2017-06-22 at 23:06 +1000, Michael Ellerman wrote: > Paolo wrote: > > when udp_recvmsg() is executed, on x86_64 and other archs, most skb > > fields are on cold cachelines. > > If the skb are linear and the kernel don't need to compute the udp > > csum, only a handful of skb fields are required by udp_recvmsg(). > > Since we already use skb->dev_scratch to cache hot data, and > > there are 32 bits unused on 64 bit archs, use such field to cache > > as much data as we can, and try to prefetch on dequeue the relevant > > fields that are left out. > > > > This can save up to 2 cache miss per packet. > > > > v1 -> v2: > > - changed udp_dev_scratch fields types to u{32,16} variant, > > replaced bitfiled with bool > > > > Signed-off-by: Paolo Abeni> > Acked-by: Eric Dumazet > > --- > > net/ipv4/udp.c | 114 > > +++-- > > 1 file changed, 103 insertions(+), 11 deletions(-) > > This appears to break wget on one of my machines. > > Networking in general is working, I'm able to SSH in, but then I can't > do a wget. > > eg: > > $ wget google.com > --2017-06-22 22:45:39-- http://google.com/ > Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in > name resolution. > wget: unable to resolve host address ‘proxy.pmdw.com’ > > $ host proxy.pmdw.com > proxy.pmdw.com is an alias for raven.pmdw.com. > raven.pmdw.com has address 10.1.2.3 > > $ wget google.com > --2017-06-22 22:52:08-- http://google.com/ > Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in > name resolution. > wget: unable to resolve host address ‘proxy.pmdw.com’ > > Maybe host is using TCP but the man page says it doesn't? > > > Everything is OK if I boot back to the previous commit 0a463c78d25b > ("udp: avoid a cache miss on dequeue"): > > $ wget google.com > --2017-06-22 23:00:01-- http://google.com/ > Resolving proxy.pmdw.com (proxy.pmdw.com)... 10.1.2.3 > Connecting to proxy.pmdw.com (proxy.pmdw.com)|10.1.2.3|:3128... connected. > Proxy request sent, awaiting response... 302 Found > Location: http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE > [following] > --2017-06-22 23:00:01-- > http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE > Reusing existing connection to proxy.pmdw.com:3128. > Proxy request sent, awaiting response... 200 OK > Length: unspecified [text/html] > Saving to: ‘index.html’ > > index.html [ <=> ] > 11.37K --.-KB/sin 0.001s > > 2017-06-22 23:00:01 (22.0 MB/s) - ‘index.html’ saved [11640] > > $ uname -a > Linux 4.12.0-rc4-gcc6-00988-g0a463c7 #88 SMP Thu Jun 22 22:55:12 AEST 2017 > ppc64 GNU/Linux > > > Haven't had time to debug any further. Any ideas? Thank you for this report. Can you please specify features of the relevant NIC ? (ethtool -k ) I'll try to replicate the issue as soon I'll get hands on suitable HW, meanwhile can you please try to trace the system behavior with perf? Something like: perf probe -a __udp_enqueue_schedule_skb perf probe -a udp_recvmsg perf probe -a udpv6_recvmsg perf record -e probe:__udp_enqueue_schedule_skb -e probe:udp_recvmsg -e probe:udpv6_recvmsg -ag wget google.com perf report --stdio Thanks, Paolo
Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd
On Thu, 22 Jun 2017, Thiago Jung Bauermann wrote: > Michael Bringmann provided this information: > >> It's not hard to backport both this patch and commit fe5595c07400 > >> ("stop_machine: Provide stop_machine_cpuslocked()") from branch > >> smp/hotplug in tip.git for stable. > > > > Yeah but it's not really my business backporting that unfortunately. > > Sorry, I wasn't clear. I was offering to provide backported patches for > the relevant stable branches. > > Though that will only be necessary if we also backport the topology > fixes as well. So shall I pick up the fix and route it through tip smp/hotplug where the cpu hotplug core changes reside on which that patch depends on? Thanks, tglx
Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
On Sat, Jun 17, 2017 at 08:54:44AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2017-06-16 at 12:15 -0700, Ram Pai wrote: > > gp_regs size is not changed, nor is the layout. A unused field in > > the gp_regs is used to fill in the AMR contents. Old binaries will not > > be knowing about this unused field, and hence should not break. > > > > New binaries can leverage this already existing but newly defined > > field; to read the contents of AMR. > > > > Is it still a concern? > > Calls to sys_swapcontext with a made-up context will end up with a crap > AMR if done by code who didn't know about that register. Turns out x86 does not have this problem, because x86 does not implement sys_swapcontext. However; unlike x86, powerpc lets signal handler program the AMR(x86 PKRU equivalent), which can persist even after the signal handler returns to the kernel through sys_sigreturn. So I am inclined to deviate from the x86 protection-key semantics. On x86 the persistent way for the signal handler to change the key register(PKRU) is through a field in the siginfo structure. And on powerpc the persistent way for the signal handler to change the key register(AMR) will be to directly program the AMR register. This should resolve your concern on powerpc, since there is no way a crap AMR value will change the real AMR register, because the powerpc kernel will not be letting it happen. Acceptable? RP
Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
On 06/22/2017 09:48 AM, Logan Gunthorpe wrote: Alpha implements its own io operation and doesn't use the common library. Thus to make ioread64 and iowrite64 globally available we need to add implementations for alpha. For this, we simply use calls that chain two 32-bit operations. (mostly because I don't really understand the alpha architecture.) It's not difficult to provide this interface[*]. I believe the only reason I didn't do so from the beginning is that it wasn't used. r~ * At least for systems other than Jensen, which cannot generate 64-bit I/O. On the other hand, Jensen doesn't have PCI (EISA only), and so won't have any devices that care.
Re: [PATCH kernel 0/3 REPOST] vfio-pci: Add support for mmapping MSI-X table
On Thu, 15 Jun 2017 15:48:42 +1000 Alexey Kardashevskiywrote: > Here is a patchset which Yongji was working on before > leaving IBM LTC. Since we still want to have this functionality > in the kernel (DPDK is the first user), here is a rebase > on the current upstream. > > > Current vfio-pci implementation disallows to mmap the page > containing MSI-X table in case that users can write directly > to MSI-X table and generate an incorrect MSIs. > > However, this will cause some performance issue when there > are some critical device registers in the same page as the > MSI-X table. We have to handle the mmio access to these > registers in QEMU emulation rather than in guest. > > To solve this issue, this series allows to expose MSI-X table > to userspace when hardware enables the capability of interrupt > remapping which can ensure that a given PCI device can only > shoot the MSIs assigned for it. And we introduce a new bus_flags > PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side > for different archs. > > The patch 3 are based on the proposed patchset[1]. > > Changelog > v3: > - rebased on the current upstream There's something not forthcoming here, the last version I see from Yongji is this one: https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017245.html Which was a 6-patch series where patches 2-4 tried to apply PCI_BUS_FLAGS_MSI_REMAP for cases that supported other platforms. That doesn't exist here, so it's not simply a rebase. Patch 1/ seems to equate this new flag to the IOMMU capability IOMMU_CAP_INTR_REMAP, but nothing is done here to match them together. That patch also mentions the work Eric has done for similar features on ARM, but again those patches are dropped. It seems like an incomplete feature now. Thanks, Alex > v2: > - Make the commit log more clear > - Replace pci_bus_check_msi_remapping() with pci_bus_msi_isolated() > so that we could clearly know what the function does > - Set PCI_BUS_FLAGS_MSI_REMAP in pci_create_root_bus() instead > of iommu_bus_notifier() > - Reserve VFIO_REGION_INFO_FLAG_CAPS when we allow to mmap MSI-X > table so that we can know whether we allow to mmap MSI-X table > in QEMU > > [1] > https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg1138820.html > > > This is based on sha1 > 63f700aab4c1 Linus Torvalds "Merge tag 'xtensa-20170612' of > git://github.com/jcmvbkbc/linux-xtensa". > > Please comment. Thanks. > > > > Yongji Xie (3): > PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag > pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge > vfio-pci: Allow to expose MSI-X table to userspace if interrupt > remapping is enabled > > include/linux/pci.h | 1 + > arch/powerpc/platforms/powernv/pci-ioda.c | 8 > drivers/vfio/pci/vfio_pci.c | 18 +++--- > drivers/vfio/pci/vfio_pci_rdwr.c | 3 ++- > 4 files changed, 26 insertions(+), 4 deletions(-) >
Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
On Thu, Jun 22, 2017 at 10:09 PM, Logan Gunthorpewrote: > On 6/22/2017 2:08 PM, Alan Cox wrote: >> >> But this does not do the same thing as an ioread64 with regards to >> atomicity or side effects on the device. The same is true of the other >> hacks. You either have a real 64bit single read/write from MMIO space or >> you don't. You can't fake it. > > > Yes, I know. But is it not better than having every driver that wants to use > these functions fake it themselves? Drivers that want a non-atomic variant should include either include/linux/io-64-nonatomic-hi-lo.h or include/linux/io-64-nonatomic-lo-hi.h depending on what they need. Drivers that require 64-bit I/O should probably just depend on CONFIG_64BIT and maybe use readq/writeq. I see that there are exactly two drivers calling ioread64/iowrite64: drivers/crypto/caam/ is architecture specific and drivers/ntb/hw/intel/ntb_hw_intel.c already has a workaround that should make it build on alpha. Arnd
Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
On Thu, 22 Jun 2017 10:48:14 -0600 Logan Gunthorpewrote: > Alpha implements its own io operation and doesn't use the > common library. Thus to make ioread64 and iowrite64 globally > available we need to add implementations for alpha. > > For this, we simply use calls that chain two 32-bit operations. > (mostly because I don't really understand the alpha architecture.) But this does not do the same thing as an ioread64 with regards to atomicity or side effects on the device. The same is true of the other hacks. You either have a real 64bit single read/write from MMIO space or you don't. You can't fake it. Alan
Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available
On Thu, 22 Jun 2017 14:24:58 -0600 Logan Gunthorpewrote: > On 6/22/2017 2:14 PM, Alan Cox wrote: > > If a platform doesn't support 64bit I/O operations from the CPU then you > > either need to use some kind of platform/architecture specific interface > > if present or accept you don't have one. > > Yes, I understand that. > > The thing is that every user that's currently using it right now is > patching in their own version that splits it on non-64bit systems. > > > It's not safe to split it. Possibly for some use cases you could add an > > ioread64_maysplit() > > I'm open to doing something like that. I think that makes sense for the platforms with that problem. I'm not sure there are many that can't do it for mmio at least. 486SX can't do it and I guess some ARM32 but I think almost everyone else can including most 32bit x86. What's more of a problem is a lot of platforms can do 64bit MMIO via ioread/write64 but not 64bit port I/O, and it's not clear how you represent that via an ioread/write API that abstracts it away. Alan
[PATCH] powerpc/kernel: Avoid redundancies on giveup_all
Currently giveup_all() calls __giveup_fpu(), __giveup_altivec(), and __giveup_vsx(). But __giveup_vsx() also calls __giveup_fpu() and __giveup_altivec() again, in a redudant manner. Other than giving up FP and Altivec, __giveup_vsx() also disables MSR_VSX on MSR, but this is already done by __giveup_{fpu,altivec}(). As VSX can not be enabled alone on MSR (without FP and/or VEC enabled), this is also a redundancy. VSX is never enabled alone (without FP and VEC) because every time VSX is enabled, as through load_up_vsx() and restore_math(), FP is also enabled together. This change improves giveup_all() in average just 3%, but since giveup_all() is called very frequently, around 8x per CPU per second on an idle machine, this change might show some noticeable improvement. Signed-off-by: Breno LeitaoSigned-off-by: Gustavo Romero --- arch/powerpc/kernel/process.c | 4 1 file changed, 4 deletions(-) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 2ad725ef4368..5d6af58270e6 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -494,10 +494,6 @@ void giveup_all(struct task_struct *tsk) if (usermsr & MSR_VEC) __giveup_altivec(tsk); #endif -#ifdef CONFIG_VSX - if (usermsr & MSR_VSX) - __giveup_vsx(tsk); -#endif #ifdef CONFIG_SPE if (usermsr & MSR_SPE) __giveup_spe(tsk); -- 2.11.0
Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available
On Thu, 22 Jun 2017 10:48:13 -0600 Logan Gunthorpewrote: > Currently, ioread64 and iowrite64 are only available io CONFIG_64BIT=y > and CONFIG_GENERIC_IOMAP=n. Thus, seeing the functions are not > universally available, it makes them unusable for driver developers. > This leads to ugly hacks such as those at the top of > > drivers/ntb/hw/intel/ntb_hw_intel.c > > This patch adds fallback implementations for when CONFIG_64BIT and > CONFIG_GENERIC_IOMAP are not set. These functions use two io32 based > calls to complete the operation. > > Note, we do not use the volatile keyword in these functions like the > others in the same file. It is necessary to avoid a compiler warning > on arm. This is a really really bad idea as per the Alpha comment. ioread64 and iowrite64 generate a single 64bit bus transaction. There is hardware where mmio operations have side effects so simply using a pair of 32bit operations blindly does not work (consider something as trivial as reading a 64bit performance counter or incrementing pointer). If a platform doesn't support 64bit I/O operations from the CPU then you either need to use some kind of platform/architecture specific interface if present or accept you don't have one. It's not safe to split it. Possibly for some use cases you could add an ioread64_maysplit() but you cannot blindly break ioread64/write64() and expect it to magically allow you to use drivers that depend upon it. What btw is the actual ARM compiler warning ? Is the compiler also trying to tell you it's a bad idea ? Alan
Re: [PATCH 09/17] cxlflash: Create character device to provide host management interface
> On Jun 21, 2017, at 9:15 PM, Uma Krishnanwrote: > > The cxlflash driver currently lacks host management interface. Future > devices supported by cxlflash will provide a variety of host-wide > management functions. Examples include LUN provisioning, hardware debug > support, and firmware download. > > In order to provide a way to manage the device, a character device will > be created during probe of each adapter. This device will support a set of > ioctls defined in the SISLite specification from which administrators can > manage the adapter. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 08/17] cxlflash: Add scsi command abort handler
> On Jun 21, 2017, at 9:15 PM, Uma Krishnanwrote: > > To date, CXL flash devices do not support a single command abort operation. > Instead, the SISLite specification provides a context reset operation to > cleanup all pending commands for a given context. > > When a context reset is successful, it is guaranteed that the AFU has > aborted all currently pending I/O. This sequence is less invasive than a > device or host reset and can be executed to support scsi command abort > requests. Add eh_abort_handler callback support to process command timeouts > and abort requests. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 07/17] cxlflash: Flush pending commands in cleanup path
> On Jun 21, 2017, at 9:14 PM, Uma Krishnanwrote: > > When the AFU is reset in an error path, pending scsi commands can be > silently dropped without completion or a formal abort. This puts the onus > on the cxlflash driver to notify mid-layer and indicating that the command > can be retried. > > Once the card has been quiesced, the hardware send queue lock is acquired > to prevent any data movement while the pending commands are processed. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 06/17] cxlflash: Track pending scsi commands in each hardware queue
> On Jun 21, 2017, at 9:14 PM, Uma Krishnanwrote: > > Currently, there is no book keeping of the pending scsi commands in the > cxlflash driver. This lack of tracking in-flight requests is too > restrictive and requires a heavy-hammer reset each time an adapter error is > encountered. Additionally, it does not allow for commands to be properly > retried. > > In order to avoid this problem and to better handle error path command > cleanup, introduce a linked list for each hardware queue that tracks > pending commands. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 05/17] cxlflash: Handle AFU sync failures
> On Jun 21, 2017, at 9:14 PM, Uma Krishnanwrote: > > AFU sync operations are not currently evaluated for failure. This is > acceptable for paths where there is not a dependency on the AFU being > consistent with the host. Examples include link reset events and LUN > cleanup operations. On paths where there is a dependency, such as a LUN > open, a sync failure should be acted upon. > > In the event of AFU sync failures, either log or cleanup as appropriate for > operations that are dependent on a successful sync completion. > > Update documentation to reflect behavior in the event of an AFU sync > failure. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 04/17] cxlflash: Schedule asynchronous reset of the host
> On Jun 21, 2017, at 9:14 PM, Uma Krishnanwrote: > > A context reset failure indicates the AFU is in a bad state. At present, > when such a situation occurs, no further action is taken. This leaves the > adapter in an unusable state with no recoverable actions. > > To avoid this situation, context reset failures will be escalated to a host > reset operation. This will be done asynchronously to allow the acting > thread to return to the user with a failure. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 03/17] cxlflash: Reset hardware queue context via specified register
> On Jun 21, 2017, at 9:14 PM, Uma Krishnanwrote: > > Per the SISLite specification, context_reset() writes 0x1 to the LSB of the > reset register. When the AFU processes this reset request, it is expected > to clear the bit after reset is complete. The current implementation simply > checks that the entire value read back is not 1, instead of masking off the > LSB and evaluating it for a change to 0. Should the AFU manipulate other > bits during the reset (reading back a value of 0xF for example), successful > completion will be prematurely indicated given the existing logic. > > Additionally, in the event that the context reset operation fails, there > does not currently exist a way to provide feedback to the initiator of the > reset. This poses a problem for the rare case that a context reset fails as > the caller will proceed on the assumption that all is well. > > To remedy these issues, refactor the context reset routine to only mask off > the LSB when evaluating for success and return status to the caller. Also > update the context reset handler parameters to pass a hardware queue > reference instead of a single command to better reflect that the entire > queue associated with the context is impacted by the reset. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 02/17] cxlflash: Update cxlflash_afu_sync() to return errno
> On Jun 21, 2017, at 9:13 PM, Uma Krishnanwrote: > > The cxlflash_afu_sync() routine returns a negative one to indicate any kind > of failure. This makes it impossible to establish why the error occurred. > > Update the return codes to clearly indicate the failure cause to the > caller. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH 01/17] cxlflash: Combine the send queue locks
> On Jun 21, 2017, at 9:13 PM, Uma Krishnanwrote: > > Currently there are separate spin locks for the two supported I/O queueing > models. This makes it difficult to serialize with paths outside the enqueue > path. > > As a design simplification and to support serialization with enqueue > operations, move to only a single lock that is used for enqueueing > regardless of the queueing model. > > Signed-off-by: Uma Krishnan Acked-by: Matthew R. Ochs
Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd
Michael Ellermanwrites: > Thiago Jung Bauermann writes: > >> Michael Ellerman writes: >>> Thiago Jung Bauermann writes: >>> Calling arch_update_cpu_topology from a CPU hotplug state machine callback hits a deadlock because the function tries to get a read lock on cpu_hotplug_lock while the state machine still holds a write lock on it. Since all callers of arch_update_cpu_topology except rtasd already hold cpu_hotplug_lock, this patch changes the function to use stop_machine_cpuslocked and creates a separate function for rtasd which still tries to obtain the lock. Michael Bringmann investigated the bug and provided a detailed analysis of the deadlock on this previous RFC for an alternate solution: https://patchwork.ozlabs.org/patch/771293/ >>> >>> Do we know when this broke? Or has it never worked? >> >> It's been broken since at least v4.4, I think. I don't know about >> earlier versions. > > OK. > > Just to be clear, this is happening on a 4.12-rcX system with no other > patches? > > The code in arch_update_cpu_topology() has used stop_machine() since > 30c05350c39d ("powerpc/pseries: Use stop machine to update cpu maps") > which went into v3.10, about 4 years ago. > > Prior to that it used get/put_online_cpus(), since 9eff1a38407c > ("powerpc/pseries: Poll VPA for topology changes and update NUMA maps"), > which was 2.6.38 in 2010. > > I wouldn't rule out the possibility it's been broken for 7 years, but I > wonder if something else has changed to cause it to break. > > We really need to work it out before we backport anything. Michael Bringmann provided this information: We need at least one patch to show the issue in the latest 4.12 codebase: [PATCH V6 2/2] powerpc/numa: Update CPU topology when VPHN enabled https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-June/159341.html Reason: Prior to this patch we were not exercising the PowerPC VPHN hcall nor the associated path through the kernel/sched code that encounters the problem. All CPUs, whether present at boot or hot-added, are added to node 0 without this patch. This representation of the topology is incorrect in many/most cases. cpu_hotplug_begin() + get_online_cpus() have potentially been broken since the implementation of multithreading for _cpu_up() and _cpu_down(). Reason: 'cpu_hotplug.active_writer' check in get_online_cpus() is dependent upon the nested routines that call get_online_cpus() to execute in the same thread as the one that invokes 'cpu_hotplug_begin'. PowerPC's version of arch_update_cpu_topology() has used stop_machine() or get_online_cpus() for years, since at least 2011. Practically speaking, until the recent patch, in the 4.12 codebase PowerPC CPUs are being added only to nodes that were online at boot, and the topology did not change enough to trigger the paths through 'stop_machine'. I can't say for certain about the earlier code bases where 'arch_update_cpu_topology' used 'get_online_cpus'/'put_online_cpus' directly. >>> Should it go to stable? (can't in its current form AFAICS) >> >> It's not hard to backport both this patch and commit fe5595c07400 >> ("stop_machine: Provide stop_machine_cpuslocked()") from branch >> smp/hotplug in tip.git for stable. > > Yeah but it's not really my business backporting that unfortunately. Sorry, I wasn't clear. I was offering to provide backported patches for the relevant stable branches. Though that will only be necessary if we also backport the topology fixes as well. -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [PATCH 1/2] trace/kprobes: Sanitize derived event names
On 2017/06/22 06:29PM, Masami Hiramatsu wrote: > On Thu, 22 Jun 2017 00:20:27 +0530 > "Naveen N. Rao"wrote: > > > When we derive event names, convert some expected symbols (such as ':' > > used to specify module:name and '.' present in some symbols) into > > underscores so that the event name is not rejected. > > Oops, ok, this is my mistake. > > Acked-by: Masami Hiramatsu > > This must be marked as bugfix for stable trees. > > Could you also add a testcase for this (module name) bug? > > MODNAME=`lsmod | head -n 2 | tail -n 1 | cut -f 1 -d " "` > FUNCNAME=`grep -m 1 "\\[$MODNAME\\]" /proc/kallsyms | xargs | cut -f 3 -d " "` > > May gives you a target name :) Sure. Here is a test. Thanks for the review, Naveen - [PATCH] selftests/ftrace: Add a test to probe module functions Add a kprobes test to ensure that we are able to add a probe on a module function using 'p :' format, without having to specify a probe name. Suggested-by: Masami Hiramatsu Signed-off-by: Naveen N. Rao --- .../testing/selftests/ftrace/test.d/kprobe/probe_module.tc | 14 ++ 1 file changed, 14 insertions(+) create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc b/tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc new file mode 100644 index ..ea7657041ba6 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc @@ -0,0 +1,14 @@ +#!/bin/sh +# description: Kprobe dynamic event - probing module + +[ -f kprobe_events ] || exit_unsupported # this is configurable + +echo 0 > events/enable +echo > kprobe_events +export MOD=`lsmod | head -n 2 | tail -n 1 | cut -f1 -d" "` +export FUNC=`grep -m 1 ".* t .*\\[$MOD\\]" /proc/kallsyms | xargs | cut -f3 -d" "` +[ "x" != "x$MOD" -a "y" != "y$FUNC" ] || exit_untested +echo p $MOD:$FUNC > kprobe_events +grep $MOD kprobe_events +echo > kprobe_events +clear_trace -- 2.13.1
Re: [next-20170609] WARNING: CPU: 3 PID: 71167 at lib/idr.c:157 idr_replace
On Thu, Jun 22, 2017 at 12:31:35AM +0100, Chris Wilson wrote: > Quoting Tejun Heo (2017-06-13 14:58:49) > > Cc'ing David Airlie. > > > > This is from drm driver calling in idr_replace() w/ a negative id. > > Probably a silly bug in error handling path? > > No, this is the validation of an invalid userspace handle. The drm ABI > for handles is supposed to be a full u32 range with 0 reserved for an > invalid handle (constrained by the idr_alloc ofc). The WARN was > introduced by 0a835c4f090a. Matthew, did idr lost some of its range while being reimplemented using radix tree? Thanks. -- tejun
Re: [RFC v3 01/23] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
On Thu, Jun 22, 2017 at 07:21:03PM +1000, Balbir Singh wrote: > On Wed, 2017-06-21 at 18:39 -0700, Ram Pai wrote: > > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6, > > in the 4K backed HPTE pages. These bits continue to be used > > for 64K backed HPTE pages in this patch, but will be freed > > up in the next patch. The bit numbers are big-endian as > > defined in the ISA3.0 > > > > The patch does the following change to the 64K PTE format > > > > Why can't we stuff the bits in the VMA and retrieve it from there? > Basically always get a minor fault in hash and for keys handle > the fault in do_page_fault() and handle the keys from the VMA? I think you raise a valid point. We dont necessarily have to program the pte. the hpte can be programmed directly from the key in the vma. Just that the code becomes a little ugly to do so, since the _hash_page_*() functions do not have access to the vma. However we are also trying to maintain consistency between hpte and rpte implementation. The keys have to be programmed into the rpte. The patch is working towards enabling the consistency, so that the same code can work on both, hpte for now and rpte in the future. Maybe I can just do what you propose. However this patch by itself has value, because it frees up four valuable pte bits, irrespective of whether we use it for memory keys. Let me see what others have to say. Aneesh: thoughts? > > > H_PAGE_BUSY moves from bit 3 to bit 9 > > H_PAGE_F_SECOND which occupied bit 4 moves to the second part > > of the pte. > > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the > > second part of the pte. > > > > the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot > > is initialized to 0xF indicating an invalid slot. If a HPTE > > gets cached in a 0xF slot(i.e 7th slot of secondary), it is > > released immediately. In other words, even though 0xF is a > > valid slot we discard and consider it as an invalid > > slot;i.e HPTE(). This gives us an opportunity to not > > depend on a bit in the primary PTE in order to determine the > > validity of a slot. > > This is not clear, could you please rephrase? What is the bit in the > primary key we rely on? (H_PAGE_F_SECOND|H_PAGE_F_GIX) bits, which is big-endian bits 3 4 5 and 6. They are currently used to track the validitiy of the 4k-hptes backing the 64k-pte. Each bit tracks four 4k-hptes, for a total of sixteen 4k-hptes. > > > > > When we release aHPTE in the 0xF slot we also release a > > legitimate primary slot andunmapthat entry. This is to > > ensure that we do get a legimate non-0xF slot the next time we > > retry for a slot. > > > > Though treating 0xF slot as invalid reduces the number of available > > slots and may have an effect on the performance, the probabilty > > of hitting a 0xF is extermely low. > > > > Compared to the current scheme, the above described scheme reduces > > the number of false hash table updates significantly and has the > > added advantage of releasing four valuable PTE bits for other > > purpose. > > > > This idea was jointly developed by Paul Mackerras, Aneesh, Michael > > Ellermen and myself. > > > > It would be helpful if you had a text diagram explaining the PTE bits > before and after. ok. will add it in the next version. > > > 4K PTE format remain unchanged currently. > > > > The code seems to be doing a lot more than the changelog suggests. A few > functions are completely removed, common code between 64K and 4K has been > split under #ifndef. It would be good to call all of these out. ok. will do. > > > Signed-off-by: Ram Pai> > > > Conflicts: > > arch/powerpc/include/asm/book3s/64/hash.h > > --- > > arch/powerpc/include/asm/book3s/64/hash-4k.h | 7 +++ > > arch/powerpc/include/asm/book3s/64/hash-64k.h | 17 --- > > arch/powerpc/include/asm/book3s/64/hash.h | 12 +++-- > > arch/powerpc/mm/hash64_64k.c | 70 > > +++ > > arch/powerpc/mm/hash_utils_64.c | 4 +- > > 5 files changed, 66 insertions(+), 44 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > index b4b5e6b..9c2c8f1 100644 > > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > @@ -16,6 +16,13 @@ > > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE) > > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE) > > > > +#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG > > */ > > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) > > +#define H_PAGE_F_GIX_SHIFT 56 > > + > > +#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are > > busy */ > > +#define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated
[PATCH 1/4] arm: Reduce ELF_ET_DYN_BASE
Now that explicitly executed loaders are loaded in the mmap region, position PIE binaries lower in the address space to avoid possible collisions with mmap or stack regions. Signed-off-by: Kees Cook--- arch/arm/include/asm/elf.h | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/arm/include/asm/elf.h b/arch/arm/include/asm/elf.h index d2315ffd8f12..f13ae153fb24 100644 --- a/arch/arm/include/asm/elf.h +++ b/arch/arm/include/asm/elf.h @@ -112,12 +112,8 @@ int dump_task_regs(struct task_struct *t, elf_gregset_t *elfregs); #define CORE_DUMP_USE_REGSET #define ELF_EXEC_PAGESIZE 4096 -/* This is the location that an ET_DYN program is loaded if exec'ed. Typical - use of this is to invoke "./ld.so someprog" to test out a new version of - the loader. We need to make sure that it is out of the way of the program - that it will "exec", and that there is sufficient room for the brk. */ - -#define ELF_ET_DYN_BASE(TASK_SIZE / 3 * 2) +/* This is the base location for PIE (ET_DYN with INTERP) loads. */ +#define ELF_ET_DYN_BASE0x40UL /* When the program starts, a1 contains a pointer to a function to be registered with atexit, as per the SVR4 ABI. A value of 0 means we -- 2.7.4
[PATCH 2/4] arm64: Reduce ELF_ET_DYN_BASE
Now that explicitly executed loaders are loaded in the mmap region, position PIE binaries lower in the address space to avoid possible collisions with mmap or stack regions. For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. Signed-off-by: Kees Cook--- arch/arm64/include/asm/elf.h | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h index 5d1700425efe..f742af8f7c42 100644 --- a/arch/arm64/include/asm/elf.h +++ b/arch/arm64/include/asm/elf.h @@ -113,12 +113,13 @@ #define ELF_EXEC_PAGESIZE PAGE_SIZE /* - * This is the location that an ET_DYN program is loaded if exec'ed. Typical - * use of this is to invoke "./ld.so someprog" to test out a new version of - * the loader. We need to make sure that it is out of the way of the program - * that it will "exec", and that there is sufficient room for the brk. + * This is the base location for PIE (ET_DYN with INTERP) loads. On + * 64-bit, this is raised to 4GB to leave the entire 32-bit address + * space open for things that want to use the area for 32-bit pointers. */ -#define ELF_ET_DYN_BASE(2 * TASK_SIZE_64 / 3) +#define ELF_ET_DYN_BASE(test_thread_flag(TIF_32BIT) ? \ + 0x00040UL : \ + 0x1UL) #ifndef __ASSEMBLY__ @@ -173,8 +174,6 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm, #ifdef CONFIG_COMPAT -#define COMPAT_ELF_ET_DYN_BASE (2 * TASK_SIZE_32 / 3) - /* AArch32 registers. */ #define COMPAT_ELF_NGREG 18 typedef unsigned int compat_elf_greg_t; -- 2.7.4
[PATCH 4/4] s390: Reduce ELF_ET_DYN_BASE
Now that explicitly executed loaders are loaded in the mmap region, position PIE binaries lower in the address space to avoid possible collisions with mmap or stack regions. For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. Signed-off-by: Kees Cook--- arch/s390/include/asm/elf.h | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h index e8f623041769..7c58d599f91b 100644 --- a/arch/s390/include/asm/elf.h +++ b/arch/s390/include/asm/elf.h @@ -161,14 +161,13 @@ extern unsigned int vdso_enabled; #define CORE_DUMP_USE_REGSET #define ELF_EXEC_PAGESIZE 4096 -/* This is the location that an ET_DYN program is loaded if exec'ed. Typical - use of this is to invoke "./ld.so someprog" to test out a new version of - the loader. We need to make sure that it is out of the way of the program - that it will "exec", and that there is sufficient room for the brk. 64-bit - tasks are aligned to 4GB. */ -#define ELF_ET_DYN_BASE (is_compat_task() ? \ - (STACK_TOP / 3 * 2) : \ - (STACK_TOP / 3 * 2) & ~((1UL << 32) - 1)) +/* + * This is the base location for PIE (ET_DYN with INTERP) loads. On + * 64-bit, this is raised to 4GB to leave the entire 32-bit address + * space open for things that want to use the area for 32-bit pointers. + */ +#define ELF_ET_DYN_BASE(is_compat_task() ? 0x00040UL : \ + 0x1UL) /* This yields a mask that user programs can use to figure out what instruction set this CPU supports. */ -- 2.7.4
[PATCH 3/4] powerpc: Reduce ELF_ET_DYN_BASE
Now that explicitly executed loaders are loaded in the mmap region, position PIE binaries lower in the address space to avoid possible collisions with mmap or stack regions. For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. Signed-off-by: Kees Cook--- arch/powerpc/include/asm/elf.h | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h index 09bde6e34f5d..548d9a411a0d 100644 --- a/arch/powerpc/include/asm/elf.h +++ b/arch/powerpc/include/asm/elf.h @@ -23,12 +23,13 @@ #define CORE_DUMP_USE_REGSET #define ELF_EXEC_PAGESIZE PAGE_SIZE -/* This is the location that an ET_DYN program is loaded if exec'ed. Typical - use of this is to invoke "./ld.so someprog" to test out a new version of - the loader. We need to make sure that it is out of the way of the program - that it will "exec", and that there is sufficient room for the brk. */ - -#define ELF_ET_DYN_BASE0x2000 +/* + * This is the base location for PIE (ET_DYN with INTERP) loads. On + * 64-bit, this is raised to 4GB to leave the entire 32-bit address + * space open for things that want to use the area for 32-bit pointers. + */ +#define ELF_ET_DYN_BASE(is_32bit_task() ? 0x00040UL : \ + 0x1UL) #define ELF_CORE_EFLAGS (is_elf2_task() ? 2 : 0) -- 2.7.4
[PATCH 0/4] Reduce ELF_ET_DYN_BASE
This is a follow-up to "binfmt_elf: Use ELF_ET_DYN_BASE only for PIE"[1], which allow ELF_ET_DYN_BASE to be reduced from high in the address space. That patch only changed x86, and this series changes arm, arm64, powerpc, and s390. Since these depend on the mentioned patch (which I'm hoping akpm will pick up), I'm hoping to get arch-maintainer Acks for these (so they can be carried all together). Thanks! -Kees [1] https://patchwork.kernel.org/patch/9802387/
[PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9
On POWER9 the ERAT may be incorrect on wakeup from some stop states that lose state. This causes random segvs and illegal instructions when these stop states are enabled. This patch invalidates the ERAT on wakeup on POWER9 to prevent this from causing a problem. Signed-off-by: Michael Neuling--- arch/powerpc/kernel/idle_book3s.S | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 1ea14b96f1..ace2ad50c8 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -793,6 +793,9 @@ fastsleep_workaround_at_exit: */ .global pnv_wakeup_loss pnv_wakeup_loss: +BEGIN_FTR_SECTION + PPC_INVALIDATE_ERAT +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) ld r1,PACAR1(r13) BEGIN_FTR_SECTION CHECK_HMI_INTERRUPT -- 2.11.0
RE: [PATCH 5/7] ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks
> -Original Message- > From: linux-...@googlegroups.com [mailto:linux-...@googlegroups.com] On > Behalf Of Logan Gunthorpe > Sent: Thursday, June 22, 2017 9:48 AM > To: linux-ker...@vger.kernel.org; linux-a...@vger.kernel.org; > linux-...@googlegroups.com; linux-al...@vger.kernel.org; linuxppc- > d...@lists.ozlabs.org; linux-cry...@vger.kernel.org; > dri-de...@lists.freedesktop.org > Cc: Arnd Bergmann; Greg Kroah-Hartman > ; Stephen Bates ; > Logan Gunthorpe ; Jon Mason ; Jiang, > Dave ; Allen Hubbe > > Subject: [PATCH 5/7] ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks > > Now that ioread64 and iowrite64 are available generically we can > remove the hack at the top of ntb_hw_intel.c that patches them in > when they are not available. > > Signed-off-by: Logan Gunthorpe > Cc: Jon Mason > Cc: Dave Jiang > Cc: Allen Hubbe Thanks for doing this Logan. Acked-by: Dave Jiang > --- > drivers/ntb/hw/intel/ntb_hw_intel.c | 30 -- > 1 file changed, 30 deletions(-) > > diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c > b/drivers/ntb/hw/intel/ntb_hw_intel.c > index c00238491673..56221d540c2b 100644 > --- a/drivers/ntb/hw/intel/ntb_hw_intel.c > +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c > @@ -153,35 +153,6 @@ MODULE_PARM_DESC(xeon_b2b_dsd_bar5_addr32, > static inline enum ntb_topo xeon_ppd_topo(struct intel_ntb_dev *ndev, u8 > ppd); > static int xeon_init_isr(struct intel_ntb_dev *ndev); > > -#ifndef ioread64 > -#ifdef readq > -#define ioread64 readq > -#else > -#define ioread64 _ioread64 > -static inline u64 _ioread64(void __iomem *mmio) > -{ > - u64 low, high; > - > - low = ioread32(mmio); > - high = ioread32(mmio + sizeof(u32)); > - return low | (high << 32); > -} > -#endif > -#endif > - > -#ifndef iowrite64 > -#ifdef writeq > -#define iowrite64 writeq > -#else > -#define iowrite64 _iowrite64 > -static inline void _iowrite64(u64 val, void __iomem *mmio) > -{ > - iowrite32(val, mmio); > - iowrite32(val >> 32, mmio + sizeof(u32)); > -} > -#endif > -#endif > - > static inline int pdev_is_atom(struct pci_dev *pdev) > { > switch (pdev->device) { > @@ -3008,4 +2979,3 @@ static void __exit intel_ntb_pci_driver_exit(void) > debugfs_remove_recursive(debugfs_dir); > } > module_exit(intel_ntb_pci_driver_exit); > - > -- > 2.11.0 > > -- > You received this message because you are subscribed to the Google Groups > "linux-ntb" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to linux-ntb+unsubscr...@googlegroups.com. > To post to this group, send email to linux-...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/linux-ntb/20170622164817.25515-6- > logang%40deltatee.com. > For more options, visit https://groups.google.com/d/optout.
Re: [PATCH 2/2] selftests/ftrace: Update multiple kprobes test for powerpc
On 2017/06/22 06:07PM, Masami Hiramatsu wrote: > On Thu, 22 Jun 2017 00:20:28 +0530 > "Naveen N. Rao"wrote: > > > KPROBES_ON_FTRACE is only available on powerpc64le. Update comment to > > clarify this. > > > > Also, we should use an offset of 8 to ensure that the probe does not > > fall on ftrace location. The current offset of 4 will fall before the > > function local entry point and won't fire, while an offset of 12 or 16 > > will fall on ftrace location. Offset 8 is currently guaranteed to not be > > the ftrace location. > > OK, these part seems good to me. > > > > > Finally, do not filter out symbols with a dot. Powerpc Elfv1 uses dot > > prefix for all functions and this prevents us from testing some of those > > symbols. Furthermore, with the patch to derive event names properly in > > the presence of ':' and '.', such names are accepted by kprobe_events > > and constitutes a good test for those symbols. > > Hmm, the reason why I added such filter was to avoid symbols including > gcc-generated suffixes like as .constprop or .isra etc. I see. I do wonder -- is there a problem if we try probing those symbols? On my local x86 vm, I don't see an issue probing it especially with the previous patch to enable probing with symbols having a '.' or ':'. Furthermore, since this is for testing kprobe_events, I feel it is good to try probing those symbols too to catch any weird errors we may hit. Thanks for the review! - Naveen > So if the Powerpc Elfv1 use dot prefix, that is OK, in that case, > could you update the filter as "^.*\\..*" ? > > Thank you, > > > > > Signed-off-by: Naveen N. Rao > > --- > > tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc | 8 > > > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git > > a/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc > > b/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc > > index f4d1ff785d67..d209c071b2c0 100644 > > --- a/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc > > +++ b/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc > > @@ -2,16 +2,16 @@ > > # description: Register/unregister many kprobe events > > > > # ftrace fentry skip size depends on the machine architecture. > > -# Currently HAVE_KPROBES_ON_FTRACE defined on x86 and powerpc > > +# Currently HAVE_KPROBES_ON_FTRACE defined on x86 and powerpc64le > > case `uname -m` in > >x86_64|i[3456]86) OFFS=5;; > > - ppc*) OFFS=4;; > > + ppc64le) OFFS=8;; > >*) OFFS=0;; > > esac > > > > echo "Setup up to 256 kprobes" > > -grep t /proc/kallsyms | cut -f3 -d" " | grep -v .*\\..* | \ > > -head -n 256 | while read i; do echo p ${i}+${OFFS} ; done > kprobe_events > > ||: > > +grep t /proc/kallsyms | cut -f3 -d" " | head -n 256 | \ > > +while read i; do echo p ${i}+${OFFS} ; done > kprobe_events ||: > > > > echo 1 > events/kprobes/enable > > echo 0 > events/kprobes/enable > > -- > > 2.13.1 > > > > > -- > Masami Hiramatsu >
Re: [PATCH v3 6/6] powerpc/64s: Blacklist rtas entry/exit from kprobes
On 2017/06/22 01:48PM, Nicholas Piggin wrote: > On Thu, 22 Jun 2017 00:08:42 +0530 > "Naveen N. Rao"wrote: > > > We can't take traps with relocation off, so blacklist enter_rtas() and > > rtas_return_loc(). However, instead of blacklisting all of enter_rtas(), > > introduce a new symbol __enter_rtas from where on we can't take a trap > > and blacklist that. > > > > Signed-off-by: Naveen N. Rao > > --- > > arch/powerpc/kernel/entry_64.S | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > > index d376f07153d7..49c35450f399 100644 > > --- a/arch/powerpc/kernel/entry_64.S > > +++ b/arch/powerpc/kernel/entry_64.S > > @@ -1076,6 +1076,8 @@ _GLOBAL(enter_rtas) > > rldicr r9,r9,MSR_SF_LG,(63-MSR_SF_LG) > > ori r9,r9,MSR_IR|MSR_DR|MSR_FE0|MSR_FE1|MSR_FP|MSR_RI|MSR_LE > > andcr6,r0,r9 > > + > > +__enter_rtas: > > sync/* disable interrupts so SRR0/1 */ > > mtmsrd r0 /* don't get trashed */ > > Along the lines of the system call patch... For consistency, could we > put the __enter_rtas right after mtmsrd? And I wonder if we shoul Sure. > come up with a common prefix or postfix naming convention for these > such labels used to control probing? We could, but I am not sure it will help much. On the other hand, such symbols may make backtraces pretty distracting. I'm just using '__' as a prefix to make it less distracting, though it isn't all that great either. I'm clearly hopeless with names o_O The other option is to just blacklist entire functions, but we will then lose the ability to probe in many places where we may have wanted to. > > How do opal calls avoid tracing? It doesn't yet. I'm still going through the initial few symbols and identifying what needs blacklisting. Opal is further down. Thanks, Naveen
Re: [PATCH V4] cxl: Export library to support IBM XSL
Le 22/06/2017 à 15:07, Christophe Lombard a écrit : This patch exports a in-kernel 'library' API which can be called by other drivers to help interacting with an IBM XSL on a POWER9 system. The XSL (Translation Service Layer) is a stripped down version of the PSL (Power Service Layer) used in some cards such as the Mellanox CX5. Like the PSL, it implements the CAIA architecture, but has a number of differences, mostly in it's implementation dependent registers. The XSL also uses a special DMA cxl mode, which uses a slightly different init sequence for the CAPP and PHB. Signed-off-by: Andrew DonnellanSigned-off-by: Christophe Lombard --- Looks ok to me, thanks! [mpe: just to make sure you notice: there's a prereq patch listed under the changelog] Acked-by: Frederic Barrat Changelog[v4] - Rebase to latest upstream. - Add new Signed-off - Update cxl_handle_mm_fault() and cxl_handle_page_fault() Changelog[v3] - Rebase to latest upstream. - cxl_handle_mm_fault() is now exported. Remove kernel context parameter - Update comments Changelog[v2] - Rebase to latest upstream. - Return -EFAULT in case of NULL pointer in cxllib_handle_fault(). - Reverse parameters when copro_handle_mm_fault() is called. This applies on top of this patch: http://patchwork.ozlabs.org/patch/775322/ --- arch/powerpc/include/asm/opal-api.h | 1 + drivers/misc/cxl/Kconfig| 5 + drivers/misc/cxl/Makefile | 2 +- drivers/misc/cxl/cxl.h | 6 + drivers/misc/cxl/cxllib.c | 246 drivers/misc/cxl/fault.c| 29 +++-- drivers/misc/cxl/native.c | 16 ++- drivers/misc/cxl/pci.c | 41 -- include/misc/cxllib.h | 133 +++ 9 files changed, 449 insertions(+), 30 deletions(-) create mode 100644 drivers/misc/cxl/cxllib.c create mode 100644 include/misc/cxllib.h diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index cb3e624..3e0be78 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -877,6 +877,7 @@ enum { OPAL_PHB_CAPI_MODE_SNOOP_OFF= 2, OPAL_PHB_CAPI_MODE_SNOOP_ON = 3, OPAL_PHB_CAPI_MODE_DMA = 4, + OPAL_PHB_CAPI_MODE_DMA_TVT1 = 5, }; /* OPAL I2C request */ diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig index b75cf83..93397cb 100644 --- a/drivers/misc/cxl/Kconfig +++ b/drivers/misc/cxl/Kconfig @@ -11,11 +11,16 @@ config CXL_AFU_DRIVER_OPS bool default n +config CXL_LIB + bool + default n + config CXL tristate "Support for IBM Coherent Accelerators (CXL)" depends on PPC_POWERNV && PCI_MSI && EEH select CXL_BASE select CXL_AFU_DRIVER_OPS + select CXL_LIB default m help Select this option to enable driver support for IBM Coherent diff --git a/drivers/misc/cxl/Makefile b/drivers/misc/cxl/Makefile index c14fd6b..0b5fd74 100644 --- a/drivers/misc/cxl/Makefile +++ b/drivers/misc/cxl/Makefile @@ -3,7 +3,7 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror cxl-y += main.o file.o irq.o fault.o native.o cxl-y += context.o sysfs.o pci.o trace.o -cxl-y += vphb.o phb.o api.o +cxl-y += vphb.o phb.o api.o cxllib.o cxl-$(CONFIG_PPC_PSERIES) += flash.o guest.o of.o hcalls.o cxl-$(CONFIG_DEBUG_FS)+= debugfs.o obj-$(CONFIG_CXL) += cxl.o diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index a03f8e7..b1afecc 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -1010,6 +1010,7 @@ static inline void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct den void cxl_handle_fault(struct work_struct *work); void cxl_prefault(struct cxl_context *ctx, u64 wed); +int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar); struct cxl *get_cxl_adapter(int num); int cxl_alloc_sst(struct cxl_context *ctx); @@ -1061,6 +1062,11 @@ int cxl_afu_slbia(struct cxl_afu *afu); int cxl_data_cache_flush(struct cxl *adapter); int cxl_afu_disable(struct cxl_afu *afu); int cxl_psl_purge(struct cxl_afu *afu); +int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid, + u32 *phb_index, u64 *capp_unit_id); +int cxl_slot_is_switched(struct pci_dev *dev); +int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg); +u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9); void cxl_native_irq_dump_regs_psl9(struct cxl_context *ctx); void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx); diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c new file mode 100644 index 000..5dba23c --- /dev/null +++
Re: [PATCH v2] of: update ePAPR references to point to Devicetree Specification
On Thu, Jun 22, 2017 at 09:15:39AM -0700, frowand.l...@gmail.com wrote: > From: Frank Rowand> > The Devicetree Specification has superseded the ePAPR as the > base specification for bindings. Update files in Documentation > to reference the new document. > > First reference to ePAPR in Documentation/devicetree/bindings/arm/cci.txt > is generic, remove it. > > Some files are not updated because there is no hypervisor chapter > in the Devicetree Specification: >Documentation/devicetree/bindings/powerpc/fsl/msi-pic.txt >Documenation/virtual/kvm/api.txt >Documenation/virtual/kvm/ppc-pv.txt > > Signed-off-by: Frank Rowand > --- > > changes from v1: >- remove boilerplat ePAPR reference from cci.txt > > Documentation/devicetree/bindings/arm/cci.txt | 15 > --- > Documentation/devicetree/bindings/arm/cpus.txt| 13 +++-- > Documentation/devicetree/bindings/arm/idle-states.txt | 4 ++-- > Documentation/devicetree/bindings/arm/l2c2x0.txt | 4 ++-- > Documentation/devicetree/bindings/arm/topology.txt| 4 ++-- > Documentation/devicetree/bindings/bus/simple-pm-bus.txt | 2 +- > Documentation/devicetree/bindings/chosen.txt | 3 ++- > Documentation/devicetree/bindings/common-properties.txt | 2 +- > Documentation/devicetree/bindings/crypto/fsl-sec4.txt | 4 ++-- > Documentation/devicetree/bindings/crypto/fsl-sec6.txt | 4 ++-- > .../devicetree/bindings/interrupt-controller/open-pic.txt | 5 ++--- > Documentation/devicetree/bindings/net/ethernet.txt| 9 ++--- > Documentation/devicetree/bindings/powerpc/fsl/cpus.txt| 6 +++--- > Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt | 2 +- > .../devicetree/bindings/powerpc/fsl/srio-rmu.txt | 4 ++-- > Documentation/devicetree/bindings/powerpc/fsl/srio.txt| 3 ++- > Documentation/devicetree/booting-without-of.txt | 2 +- > Documentation/devicetree/usage-model.txt | 2 +- > Documentation/xtensa/mmu.txt | 6 +++--- > 19 files changed, 46 insertions(+), 48 deletions(-) Applied.
Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.
On Thu, Jun 22, 2017 at 02:37:27PM +0530, Anshuman Khandual wrote: > On 06/17/2017 09:22 AM, Ram Pai wrote: > > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > > in the 4K backed hpte pages. These bits continue to be used > > for 64K backed hpte pages in this patch, but will be freed > > up in the next patch. > > > > The patch does the following change to the 64K PTE format > > > > H_PAGE_BUSY moves from bit 3 to bit 9 > > H_PAGE_F_SECOND which occupied bit 4 moves to the second part > > of the pte. > > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the > > second part of the pte. > > > > the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot > > is initialized to 0xF indicating an invalid slot. If a hpte > > gets cached in a 0xF slot(i.e 7th slot of secondary), it is > > released immediately. In other words, even though 0xF is a > > valid slot we discard and consider it as an invalid > > slot;i.e hpte_soft_invalid(). This gives us an opportunity to not > > depend on a bit in the primary PTE in order to determine the > > validity of a slot. > > > > When we release ahpte in the 0xF slot we also release a > > legitimate primary slot andunmapthat entry. This is to > > ensure that we do get a legimate non-0xF slot the next time we > > retry for a slot. > > > > Though treating 0xF slot as invalid reduces the number of available > > slots and may have an effect on the performance, the probabilty > > of hitting a 0xF is extermely low. > > > > Compared to the current scheme, the above described scheme reduces > > the number of false hash table updates significantly and has the > > added advantage of releasing four valuable PTE bits for other > > purpose. > > > > This idea was jointly developed by Paul Mackerras, Aneesh, Michael > > Ellermen and myself. > > > > 4K PTE format remain unchanged currently. > > Scanned through the PTE format again for hash 64K and 4K. It seems > to me that there might be 5 free bits already present on the PTE > format. I might have seriously mistaken something here :) Please > correct me if that is not the case. _RPAGE_RPN* I think is applicable > only for hash page table format and will not be available for radix > later. > > +#define _PAGE_FREE_1 0x0040UL /* Not used */ > +#define _RPAGE_SW0 0x2000UL /* Not used */ > +#define _RPAGE_SW1 0x0800UL /* Not used */ > +#define _RPAGE_RPN42 0x0040UL /* Not used */ > +#define _RPAGE_RPN41 0x0020UL /* Not used */ > The bits are chosen to future proof for radix implementation. _RPAGE_SW* will eat into what is available for software in the future, and these key-bits will certainly be something that the radix hardware will read, in the future. The _RPAGE_RPN* bits cannot be relied on for radix. But finally the bits that we chose (H_PAGE_F_SECOND|H_PAGE_F_GIX) had the best potential for giving us the highest number of free bits with relatively less effort. RP
[PATCH v2] of: update ePAPR references to point to Devicetree Specification
From: Frank RowandThe Devicetree Specification has superseded the ePAPR as the base specification for bindings. Update files in Documentation to reference the new document. First reference to ePAPR in Documentation/devicetree/bindings/arm/cci.txt is generic, remove it. Some files are not updated because there is no hypervisor chapter in the Devicetree Specification: Documentation/devicetree/bindings/powerpc/fsl/msi-pic.txt Documenation/virtual/kvm/api.txt Documenation/virtual/kvm/ppc-pv.txt Signed-off-by: Frank Rowand --- changes from v1: - remove boilerplat ePAPR reference from cci.txt Documentation/devicetree/bindings/arm/cci.txt | 15 --- Documentation/devicetree/bindings/arm/cpus.txt| 13 +++-- Documentation/devicetree/bindings/arm/idle-states.txt | 4 ++-- Documentation/devicetree/bindings/arm/l2c2x0.txt | 4 ++-- Documentation/devicetree/bindings/arm/topology.txt| 4 ++-- Documentation/devicetree/bindings/bus/simple-pm-bus.txt | 2 +- Documentation/devicetree/bindings/chosen.txt | 3 ++- Documentation/devicetree/bindings/common-properties.txt | 2 +- Documentation/devicetree/bindings/crypto/fsl-sec4.txt | 4 ++-- Documentation/devicetree/bindings/crypto/fsl-sec6.txt | 4 ++-- .../devicetree/bindings/interrupt-controller/open-pic.txt | 5 ++--- Documentation/devicetree/bindings/net/ethernet.txt| 9 ++--- Documentation/devicetree/bindings/powerpc/fsl/cpus.txt| 6 +++--- Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt | 2 +- .../devicetree/bindings/powerpc/fsl/srio-rmu.txt | 4 ++-- Documentation/devicetree/bindings/powerpc/fsl/srio.txt| 3 ++- Documentation/devicetree/booting-without-of.txt | 2 +- Documentation/devicetree/usage-model.txt | 2 +- Documentation/xtensa/mmu.txt | 6 +++--- 19 files changed, 46 insertions(+), 48 deletions(-) diff --git a/Documentation/devicetree/bindings/arm/cci.txt b/Documentation/devicetree/bindings/arm/cci.txt index 0f2153e8fa7e..9600761f2d5b 100644 --- a/Documentation/devicetree/bindings/arm/cci.txt +++ b/Documentation/devicetree/bindings/arm/cci.txt @@ -11,13 +11,6 @@ clusters, through memory mapped interface, with a global control register space and multiple sets of interface control registers, one per slave interface. -Bindings for the CCI node follow the ePAPR standard, available from: - -www.power.org/documentation/epapr-version-1-1/ - -with the addition of the bindings described in this document which are -specific to ARM. - * CCI interconnect node Description: Describes a CCI cache coherent Interconnect component @@ -50,10 +43,10 @@ specific to ARM. as a tuple of cells, containing child address, parent address and the size of the region in the child address space. - Definition: A standard property. Follow rules in the ePAPR for - hierarchical bus addressing. CCI interfaces - addresses refer to the parent node addressing - scheme to declare their register bases. + Definition: A standard property. Follow rules in the Devicetree + Specification for hierarchical bus addressing. CCI + interfaces addresses refer to the parent node + addressing scheme to declare their register bases. CCI interconnect node can define the following child nodes: diff --git a/Documentation/devicetree/bindings/arm/cpus.txt b/Documentation/devicetree/bindings/arm/cpus.txt index 1030f5f50207..283c520a2224 100644 --- a/Documentation/devicetree/bindings/arm/cpus.txt +++ b/Documentation/devicetree/bindings/arm/cpus.txt @@ -6,9 +6,9 @@ The device tree allows to describe the layout of CPUs in a system through the "cpus" node, which in turn contains a number of subnodes (ie "cpu") defining properties for every cpu. -Bindings for CPU nodes follow the ePAPR v1.1 standard, available from: +Bindings for CPU nodes follow the Devicetree Specification, available from: -https://www.power.org/documentation/epapr-version-1-1/ +https://www.devicetree.org/specifications/ with updates for 32-bit and 64-bit ARM systems provided in this document. @@ -16,8 +16,8 @@ with updates for 32-bit and 64-bit ARM systems provided in this document. Convention used in this document -This document follows the conventions described in the ePAPR v1.1, with -the addition: +This document follows the conventions described in the Devicetree +Specification, with the addition: - square brackets define bitfields, eg reg[7:0] value of the bitfield in the reg property contained in bits 7 down to 0 @@ -26,8 +26,9 @@ the
Re: [PATCH] Documentation: remove overlay-notes reference to non-existent file
On Tue, Jun 20, 2017 at 03:40:53PM -0700, frowand.l...@gmail.com wrote: > From: Frank Rowand> > File dt-object-internal.txt does not exist. Remove a reference to it > and fix up tags for references to other files. > > Reported-by: afaer...@suse.de > Signed-off-by: Frank Rowand > --- > Documentation/devicetree/overlay-notes.txt | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) Applied.
Re: [PATCH v6 1/4] of: remove *phandle properties from expanded device tree
On Wed, Jun 21, 2017 at 02:57:35PM +1000, Michael Ellerman wrote: > Hi Frank, > > frowand.l...@gmail.com writes: > > From: Frank Rowand> > > > Remove "phandle", "linux,phandle", and "ibm,phandle" properties from > > the internal device tree. The phandle will still be in the struct > > device_node phandle field and will still be displayed as if it is > > a property in /proc/device_tree. > > > > This is to resolve the issue found by Stephen Boyd [1] when he changed > > the type of struct property.value from void * to const void *. As > > a result of the type change, the overlay code had compile errors > > where the resolver updates phandle values. > > > > [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html > > > > - Add sysfs infrastructure to report np->phandle, as if it was a property. > > - Do not create "phandle" "ibm,phandle", and "linux,phandle" properties > > in the expanded device tree. > > - Remove phandle properties in of_attach_node(), for nodes dynamically > > attached to the live tree. Add the phandle sysfs entry for these nodes. > > - When creating an overlay changeset, duplicate the node phandle in > > __of_node_dup(). > > - Remove no longer needed checks to exclude "phandle" and "linux,phandle" > > properties in several locations. > > - A side effect of these changes is that the obsolete "linux,phandle" and > > "ibm,phandle" properties will no longer appear in /proc/device-tree (they > > will appear as "phandle"). > > Sorry but I don't think that can work for us. > > Our DLPAR (ie. CPU/memory/device hotplug) stuff on PowerVM uses > "ibm,phandle", and it's not the same thing as "phandle" / > "linux,phandle". > > I don't know the code well myself, but the spec (PAPR) says: > > Note: If the “ibm,phandle” property exists, there are two “phandle” > namespaces which must be kept separate. One is that actually used by > the OF client interface, the other is properties in the device tree > making reference to device tree nodes. These requirements are written > to maintain backward compatibility with older FW versions predating > these requirements; if the “ibm,phandle” property is not present, the > OS may assume that any device tree properties which refer to this node > will have a phandle value matching that returned by client interface > services. > > I have systems here that still use "ibm,phandle". I also see at least > some of the userspace code that looks for "ibm,phandle", and nothing > else. > > The note above actually implies that the current Linux code is wrong, > when it uses "ibm,phandle" as the value of np->phandle. > > So sorry that's a big mess, but we can't just rip out those properties. > > I think the minimal change would be to treat "ibm,phandle" like a normal > property, I think that would allow our tools to keep working? > > > The other thing that worries me is that by renaming (effectively) > "linux,phandle" to "phandle", we lose the ability to accurately > regenerate the device tree from /proc/device-tree. In theory it > shouldn't matter, but I worry that in practice something will break. The only scenario I can come up with is booting a kernel with this change and kexec'ing to a kernel older than 2010 (when ePAPR phandle was added). Some how I doubt that would work without this change. > What if we just kept a single bit flag somewhere indicating if the name of > the phandle property we found was "phandle" or "linux,phandle", and > create the sysfs phandle using that name? I'd like to move towards dropping 'linux,phandle' including changing dtc to stop generating both properties by default. Perhaps we should just be more explicit that we are doing that. Stop exposing it first and then change how phandles are stored/managed a cycle later. Then we can easily revert it if needed. Rob
Re: [PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes
On 2017/06/22 11:14PM, Nicholas Piggin wrote: > On Thu, 22 Jun 2017 21:14:21 +1000 > Michael Ellermanwrote: > > > Nicholas Piggin writes: > > > > > On Thu, 22 Jun 2017 00:08:40 +0530 > > > "Naveen N. Rao" wrote: > > > > > >> It is actually safe to probe system_call() in entry_64.S, but only till > > >> we unset MSR_RI. To allow this, add a new symbol system_call_exit() > > >> after the mtmsrd and blacklist that. Though the mtmsrd instruction > > >> itself is now whitelisted, we won't be allowed to probe on it as we > > >> don't allow probing on rfi and mtmsr instructions (checked for in > > >> arch_prepare_kprobe()). > > > > > > Can you add a little comment to say probes aren't allowed, and it's > > > located after the mtmsr in order to avoid contaminating traces? Hmm.. it's actually after the mtmsrd since we can probe until that point. In v2, I converted .Lsyscall_exit() into a different label and blacklisted all of it. But Michael prefers to only blacklist what we must. It is helpful if we ever want to probe after returning from a syscall. That said, please see further below (*) > > > > > > Also I wonder if a slightly different name would be more instructive? > > > I don't normally care, but the system_call_common code isn't trivial > > > to follow. system_call_exit might give the impression that it is the > > > entire exit path (which would pair with system_call for entry). > > > > It is the entire path in the happy case isn't it? I'm not sure I know > > what you mean. > > Oh, yes you're right, I thought it was moved down further. > > > If you're tracing etc. then you'll be in syscall_exit_work, isn't > > that sufficient to differentiate the two? Not sure how much this matters, but the new symbol (system_call_exit) is actually a few instructions from after the syscall return (.Lsyscall_exit). And I have converted syscall_exit_work to a private symbol as well. In general, from kprobes standpoint, there are places there where we can probe safely but doing so may require new symbols to be introduced, which could make traces look weird, as well as distribute samples among different symbols. I'm not sure how best to proceed. In this current series, I pretty much blacklist all of system_call_common() through to syscall_exit() except for the small part in between in system_call(). All other symbols have been made private, so nothing else apart from system_call_common(), system_call() and system_call_exit() show up in traces. We could probably retain syscall_dotrace(), syscall_enosys() and parts of syscall_exit_work() after RI is restored, which will allow those to be probed, but end up showing these symbols in traces. What would be preferable? - (*) The other suggestion Michael had was to just put a nop after the bctrl and to again make .Lsyscall_exit public and blacklist that (though he wasn't all for it). I suppose it does solve this issue in a nice way - the call traces when in a system call show the proper symbol and we do have a 'nop' instruction on which we can probe to catch returns from system calls. Is that something we can re-consider? Something like this (along with converting other .Lsyscall_exit references): --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -175,8 +175,9 @@ system_call:/* label this so stack traces look sane */ ldx r12,r11,r0 /* Fetch system call handler [ptr] */ mtctr r12 bctrl /* Call handler */ + nop -.Lsyscall_exit: +system_call_exit: std r3,RESULT(r1) CURRENT_THREAD_INFO(r12, r1) Thanks, Naveen
Re: [PATCH] of: update ePAPR references to point to Devicetree Specification
On Mon, Jun 19, 2017 at 03:23:41AM -0700, Frank Rowand wrote: > On 06/18/17 07:05, Rob Herring wrote: > > On Tue, Jun 13, 2017 at 07:49:04PM -0700, frowand.l...@gmail.com wrote: > >> From: Frank Rowand> >> > >> The Devicetree Specification has superseded the ePAPR as the > >> base specification for bindings. Update files in Documentation > >> to reference the new document. > >> > >> Some files are not updated because there is no hypervisor chapter > >> in the Devicetree Specification: > >>Documentation/devicetree/bindings/powerpc/fsl/msi-pic.txt > >>Documenation/virtual/kvm/api.txt > >>Documenation/virtual/kvm/ppc-pv.txt > >> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt > >> b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt > >> index f8cd2397aa04..d63ab1dec16d 100644 > >> --- a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt > >> +++ b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt > >> @@ -3,10 +3,10 @@ Power Architecture CPU Binding > >> Copyright 2013 Freescale Semiconductor Inc. > >> > >> Power Architecture CPUs in Freescale SOCs are represented in device trees > >> as > >> -per the definition in ePAPR. > >> +per the definition in the Devicetree Specification. > > > > Are we sure we didn't remove any PPC specifics that apply here? > > I don't see any. > > Table 3.7.1 "General Properties of CPU nodes" was slightly > re-ordered, but the same properties are listed in both documents. > > I don't think that the boot requirements removal impacts this > file. > > Am I missing something? No, I just wanted to make sure. So I guess there's just that one minor thing to drop. Rob
RE: DPAA: Software Portal selection for network interface
Hi Sebastian, > -Original Message- > From: Sebastian Huber [mailto:sebastian.hu...@embedded-brains.de] > Sent: Thursday, June 22, 2017 3:42 PM > To: linuxppc-dev@lists.ozlabs.org > Cc: Madalin-cristian Bucur> Subject: DPAA: Software Portal selection for network interface > > Hello, > > as far as I understand the software portal selection for a network > interface is done in > > static int dpaa_eth_probe(struct platform_device *pdev) > { > [...] > channel = dpaa_get_channel(); > if (channel < 0) { > dev_err(dev, "dpaa_get_channel() failed\n"); > err = channel; > goto get_channel_failed; > } > > priv->channel = (u16)channel; > > /* Start a thread that will walk the CPUs with affine portals > * and add this pool channel to each's dequeue mask. > */ > dpaa_eth_add_channel(priv->channel); > > with > > static int dpaa_get_channel(void) > { > spin_lock(_pool_channel_init); > if (!rx_pool_channel) { > u32 pool; > int ret; > > ret = qman_alloc_pool(); > > if (!ret) > rx_pool_channel = pool; > } > spin_unlock(_pool_channel_init); > if (!rx_pool_channel) > return -ENOMEM; > return rx_pool_channel; > } > > which always returns the same pool channel (e.g. 0x401) if successful. That's correct. > This means all the QMan portal_isr() are distributed round-robin to all > affine portals. Is there some way to configure the software portal for a > specific network interface, e.g. use processors 0, 1, 2, 3 for one > interface,and 4, 5, 6, 7 for another? We've stripped all configurability and advanced features from the initial DPAA submission. We don't have options to configure this. On the other hand, the ingress queues are held active in the portal, resulting in one CPU doing dequeues while there are frames available. This is done to avoid frame reordering, improving termination performance (and forwarded TCP performance). The downside is that we're not saturating all CPUs with traffic. We're currently working on adding Rx hashing, to be able to maintain intra-flow frame order while spreading processing on all CPUs. Meanwhile, would RPS (Receive Packet Steering) be a solution for you? > -- > Sebastian Huber, embedded brains GmbH > > Address : Dornierstr. 4, D-82178 Puchheim, Germany > Phone : +49 89 189 47 41-16 > Fax : +49 89 189 47 41-09 > E-Mail : sebastian.hu...@embedded-brains.de > PGP : Public key available on request. > > Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG. Regards, Madalin
Re: [PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
On 2017/06/22 11:08PM, Nicholas Piggin wrote: > On Thu, 22 Jun 2017 21:07:46 +1000 > Michael Ellermanwrote: > > > Nicholas Piggin writes: > > > > > On Thu, 22 Jun 2017 00:08:39 +0530 > > > "Naveen N. Rao" wrote: > > > > > >> Convert some of the symbols into private symbols and blacklist > > >> system_call_common() and system_call() from kprobes. We can't take a > > >> trap at parts of these functions as either MSR_RI is unset or the kernel > > >> stack pointer is not yet setup. > > >> > > >> Reviewed-by: Masami Hiramatsu > > >> Signed-off-by: Naveen N. Rao > > > > > > I don't have a problem with this bunch of system call labels > > > going private. They've never added much for me in profiles. > > > > > > Reviewed-by: Nicholas Piggin Thanks for the review. > > > > > > Semi-related question, why is system_call: where it is? > > > > Ancient history. > > > > We used to have: > > > > bne syscall_dotrace > > syscall_dotrace_cont: > > cmpldi 0,r0,NR_syscalls > > bge-syscall_enosys > > > > system_call:/* label this so stack traces look sane > > */ > > > > > > So it was there to hide syscall_dotrace_cont from back traces. > > > > But we made syscall_dotrace_cont local in 2012 and then removed it > > entirely in 2015. > > > > > Should we move it up to right after the mtmsrd / wrteei instruction? > > > (obviously for another patch). It's pretty common to get PMU > > > interrupts coming in right after mtmsr and this makes profiles split > > > the syscall into two which is annoying. > > > > Move it wherever makes sense and gives good back traces. > > I'd be in favour of moving it to right after the interurpt enable. > I suppose you'd want a separate patch for that though. But we could > put it in this series since we're changing a lot of labels. Sure, I'll add a patch that does that. - Naveen
Re: [PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor
On 2017/06/22 11:06PM, Nicholas Piggin wrote: > On Thu, 22 Jun 2017 20:59:49 +1000 > Michael Ellermanwrote: > > > Nicholas Piggin writes: > > > > > On Thu, 22 Jun 2017 00:08:37 +0530 > > > "Naveen N. Rao" wrote: > > > > > >> Currently, we assume that the function pointer we receive in > > >> ppc_function_entry() points to a function descriptor. However, this is > > >> not always the case. In particular, assembly symbols without the right > > >> annotation do not have an associated function descriptor. Some of these > > >> symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL(). > > >> When such addresses are subsequently processed through > > >> arch_deref_entry_point() in populate_kprobe_blacklist(), we see the > > >> below errors during bootup: > > >> [0.663963] Failed to find blacklist at 7d9b02a648029b6c > > >> [0.663970] Failed to find blacklist at a14d03d0394a0001 > > >> [0.663972] Failed to find blacklist at 7d5302a6f94d0388 > > >> [0.663973] Failed to find blacklist at 48027d11e8610178 > > >> [0.663974] Failed to find blacklist at f8010070f8410080 > > >> [0.663976] Failed to find blacklist at 386100704801f89d > > >> [0.663977] Failed to find blacklist at 7d5302a6f94d00b0 > > >> > > >> Fix this by checking if the address in the function descriptor is > > >> actually a valid kernel address. In the case of assembly symbols, this > > >> will almost always fail as this ends up being powerpc instructions. In > > >> that case, return pointer to the address we received, rather than the > > >> dereferenced value. > > >> > > >> Signed-off-by: Naveen N. Rao > > >> --- > > >> arch/powerpc/include/asm/code-patching.h | 10 +- > > >> 1 file changed, 9 insertions(+), 1 deletion(-) > > >> > > >> diff --git a/arch/powerpc/include/asm/code-patching.h > > >> b/arch/powerpc/include/asm/code-patching.h > > >> index abef812de7f8..ec54050be585 100644 > > >> --- a/arch/powerpc/include/asm/code-patching.h > > >> +++ b/arch/powerpc/include/asm/code-patching.h > > >> @@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void > > >> *func) > > >> * On PPC64 ABIv1 the function pointer actually points to the > > >> * function's descriptor. The first entry in the descriptor is > > >> the > > >> * address of the function text. > > >> + * > > >> + * However, we may have received a pointer to an assembly symbol > > >> + * that may not be a function descriptor. Validate that the > > >> entry > > >> + * points to a valid kernel address and if not, return the > > >> pointer > > >> + * we received as is. > > >> */ > > >> -return ((func_descr_t *)func)->entry; > > >> +if (kernel_text_address(((func_descr_t *)func)->entry)) > > >> +return ((func_descr_t *)func)->entry; > > >> +else > > >> +return (unsigned long)func; > > > > > > What if "func" is a text section label (bare asm function)? > > > Won't func->entry load the random instruction located there > > > and compare it with a kernel address? > > > > Yes, that's the problem. Yes, we were currently returning those instructions as the function entry address. > > > > > I don't know too much about the v1 ABI, but should we check for > > > func belonging in the .opd section first and base the check on > > > that? Alternatively I if "func" is in the kernel text address, > > > we can recognize it's not in the .opd section... right? > > > > That sounds like a more robust solution. But I suspect it won't work for > > modules. > > kernel_text_address() seems to check for module text as well, so it > might work I think? Yes, I think that's a very nice idea! I'll check and confirm that it does what it's supposed to. Thanks for the review, - Naveen
Re: [PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes
On Thu, 22 Jun 2017 21:14:21 +1000 Michael Ellermanwrote: > Nicholas Piggin writes: > > > On Thu, 22 Jun 2017 00:08:40 +0530 > > "Naveen N. Rao" wrote: > > > >> It is actually safe to probe system_call() in entry_64.S, but only till > >> we unset MSR_RI. To allow this, add a new symbol system_call_exit() > >> after the mtmsrd and blacklist that. Though the mtmsrd instruction > >> itself is now whitelisted, we won't be allowed to probe on it as we > >> don't allow probing on rfi and mtmsr instructions (checked for in > >> arch_prepare_kprobe()). > > > > Can you add a little comment to say probes aren't allowed, and it's > > located after the mtmsr in order to avoid contaminating traces? > > > > Also I wonder if a slightly different name would be more instructive? > > I don't normally care, but the system_call_common code isn't trivial > > to follow. system_call_exit might give the impression that it is the > > entire exit path (which would pair with system_call for entry). > > It is the entire path in the happy case isn't it? I'm not sure I know > what you mean. Oh, yes you're right, I thought it was moved down further. Thanks, Nick
Re: [1/3] powerpc/64s: Use BRANCH_TO_COMMON() for slb_miss_realmode
On Tue, 2017-06-20 at 12:34:55 UTC, Michael Ellerman wrote: > All the callers of slb_miss_realmode currently open code the #ifndef > CONFIG_RELOCATABLE check and the branch via CTR in the RELOCATABLE case. > We have a macro to do this, BRANCH_TO_COMMON(), so use it. > > Signed-off-by: Michael Ellerman> Reviewed-by: Nicholas Piggin Series applied to powerpc next. https://git.kernel.org/powerpc/c/b102063b47d59752e113c558842227 cheers
Re: [1/9] powerpc/64s: slb_allocate_realmode() preserve r3
On Sun, 2017-05-21 at 13:15:42 UTC, Nicholas Piggin wrote: > One fewer registers clobbered by this function means the SLB miss > handler can save one fewer. > > Signed-off-by: Nicholas PigginSeries applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/d59afffdf04c66c09085160706297e cheers
DPAA: Software Portal selection for network interface
Hello, as far as I understand the software portal selection for a network interface is done in static int dpaa_eth_probe(struct platform_device *pdev) { [...] channel = dpaa_get_channel(); if (channel < 0) { dev_err(dev, "dpaa_get_channel() failed\n"); err = channel; goto get_channel_failed; } priv->channel = (u16)channel; /* Start a thread that will walk the CPUs with affine portals * and add this pool channel to each's dequeue mask. */ dpaa_eth_add_channel(priv->channel); with static int dpaa_get_channel(void) { spin_lock(_pool_channel_init); if (!rx_pool_channel) { u32 pool; int ret; ret = qman_alloc_pool(); if (!ret) rx_pool_channel = pool; } spin_unlock(_pool_channel_init); if (!rx_pool_channel) return -ENOMEM; return rx_pool_channel; } which always returns the same pool channel (e.g. 0x401) if successful. This means all the QMan portal_isr() are distributed round-robin to all affine portals. Is there some way to configure the software portal for a specific network interface, e.g. use processors 0, 1, 2, 3 for one interface,and 4, 5, 6, 7 for another? -- Sebastian Huber, embedded brains GmbH Address : Dornierstr. 4, D-82178 Puchheim, Germany Phone : +49 89 189 47 41-16 Fax : +49 89 189 47 41-09 E-Mail : sebastian.hu...@embedded-brains.de PGP : Public key available on request. Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG.
Re: [PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
On Thu, 22 Jun 2017 21:07:46 +1000 Michael Ellermanwrote: > Nicholas Piggin writes: > > > On Thu, 22 Jun 2017 00:08:39 +0530 > > "Naveen N. Rao" wrote: > > > >> Convert some of the symbols into private symbols and blacklist > >> system_call_common() and system_call() from kprobes. We can't take a > >> trap at parts of these functions as either MSR_RI is unset or the kernel > >> stack pointer is not yet setup. > >> > >> Reviewed-by: Masami Hiramatsu > >> Signed-off-by: Naveen N. Rao > > > > I don't have a problem with this bunch of system call labels > > going private. They've never added much for me in profiles. > > > > Reviewed-by: Nicholas Piggin > > > > Semi-related question, why is system_call: where it is? > > Ancient history. > > We used to have: > > bne syscall_dotrace > syscall_dotrace_cont: > cmpldi 0,r0,NR_syscalls > bge-syscall_enosys > > system_call: /* label this so stack traces look sane */ > > > So it was there to hide syscall_dotrace_cont from back traces. > > But we made syscall_dotrace_cont local in 2012 and then removed it > entirely in 2015. > > > Should we move it up to right after the mtmsrd / wrteei instruction? > > (obviously for another patch). It's pretty common to get PMU > > interrupts coming in right after mtmsr and this makes profiles split > > the syscall into two which is annoying. > > Move it wherever makes sense and gives good back traces. I'd be in favour of moving it to right after the interurpt enable. I suppose you'd want a separate patch for that though. But we could put it in this series since we're changing a lot of labels. Thanks, Nick
[PATCH V4] cxl: Export library to support IBM XSL
This patch exports a in-kernel 'library' API which can be called by other drivers to help interacting with an IBM XSL on a POWER9 system. The XSL (Translation Service Layer) is a stripped down version of the PSL (Power Service Layer) used in some cards such as the Mellanox CX5. Like the PSL, it implements the CAIA architecture, but has a number of differences, mostly in it's implementation dependent registers. The XSL also uses a special DMA cxl mode, which uses a slightly different init sequence for the CAPP and PHB. Signed-off-by: Andrew DonnellanSigned-off-by: Christophe Lombard --- Changelog[v4] - Rebase to latest upstream. - Add new Signed-off - Update cxl_handle_mm_fault() and cxl_handle_page_fault() Changelog[v3] - Rebase to latest upstream. - cxl_handle_mm_fault() is now exported. Remove kernel context parameter - Update comments Changelog[v2] - Rebase to latest upstream. - Return -EFAULT in case of NULL pointer in cxllib_handle_fault(). - Reverse parameters when copro_handle_mm_fault() is called. This applies on top of this patch: http://patchwork.ozlabs.org/patch/775322/ --- arch/powerpc/include/asm/opal-api.h | 1 + drivers/misc/cxl/Kconfig| 5 + drivers/misc/cxl/Makefile | 2 +- drivers/misc/cxl/cxl.h | 6 + drivers/misc/cxl/cxllib.c | 246 drivers/misc/cxl/fault.c| 29 +++-- drivers/misc/cxl/native.c | 16 ++- drivers/misc/cxl/pci.c | 41 -- include/misc/cxllib.h | 133 +++ 9 files changed, 449 insertions(+), 30 deletions(-) create mode 100644 drivers/misc/cxl/cxllib.c create mode 100644 include/misc/cxllib.h diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index cb3e624..3e0be78 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -877,6 +877,7 @@ enum { OPAL_PHB_CAPI_MODE_SNOOP_OFF= 2, OPAL_PHB_CAPI_MODE_SNOOP_ON = 3, OPAL_PHB_CAPI_MODE_DMA = 4, + OPAL_PHB_CAPI_MODE_DMA_TVT1 = 5, }; /* OPAL I2C request */ diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig index b75cf83..93397cb 100644 --- a/drivers/misc/cxl/Kconfig +++ b/drivers/misc/cxl/Kconfig @@ -11,11 +11,16 @@ config CXL_AFU_DRIVER_OPS bool default n +config CXL_LIB + bool + default n + config CXL tristate "Support for IBM Coherent Accelerators (CXL)" depends on PPC_POWERNV && PCI_MSI && EEH select CXL_BASE select CXL_AFU_DRIVER_OPS + select CXL_LIB default m help Select this option to enable driver support for IBM Coherent diff --git a/drivers/misc/cxl/Makefile b/drivers/misc/cxl/Makefile index c14fd6b..0b5fd74 100644 --- a/drivers/misc/cxl/Makefile +++ b/drivers/misc/cxl/Makefile @@ -3,7 +3,7 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror cxl-y += main.o file.o irq.o fault.o native.o cxl-y += context.o sysfs.o pci.o trace.o -cxl-y += vphb.o phb.o api.o +cxl-y += vphb.o phb.o api.o cxllib.o cxl-$(CONFIG_PPC_PSERIES) += flash.o guest.o of.o hcalls.o cxl-$(CONFIG_DEBUG_FS) += debugfs.o obj-$(CONFIG_CXL) += cxl.o diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index a03f8e7..b1afecc 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -1010,6 +1010,7 @@ static inline void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct den void cxl_handle_fault(struct work_struct *work); void cxl_prefault(struct cxl_context *ctx, u64 wed); +int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar); struct cxl *get_cxl_adapter(int num); int cxl_alloc_sst(struct cxl_context *ctx); @@ -1061,6 +1062,11 @@ int cxl_afu_slbia(struct cxl_afu *afu); int cxl_data_cache_flush(struct cxl *adapter); int cxl_afu_disable(struct cxl_afu *afu); int cxl_psl_purge(struct cxl_afu *afu); +int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid, + u32 *phb_index, u64 *capp_unit_id); +int cxl_slot_is_switched(struct pci_dev *dev); +int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg); +u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9); void cxl_native_irq_dump_regs_psl9(struct cxl_context *ctx); void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx); diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c new file mode 100644 index 000..5dba23c --- /dev/null +++ b/drivers/misc/cxl/cxllib.c @@ -0,0 +1,246 @@ +/* + * Copyright 2017 IBM Corp. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version +
Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd
Thiago Jung Bauermannwrites: > Michael Ellerman writes: >> Thiago Jung Bauermann writes: >> >>> Calling arch_update_cpu_topology from a CPU hotplug state machine callback >>> hits a deadlock because the function tries to get a read lock on >>> cpu_hotplug_lock while the state machine still holds a write lock on it. >>> >>> Since all callers of arch_update_cpu_topology except rtasd already hold >>> cpu_hotplug_lock, this patch changes the function to use >>> stop_machine_cpuslocked and creates a separate function for rtasd which >>> still tries to obtain the lock. >>> >>> Michael Bringmann investigated the bug and provided a detailed analysis >>> of the deadlock on this previous RFC for an alternate solution: >>> >>> https://patchwork.ozlabs.org/patch/771293/ >> >> Do we know when this broke? Or has it never worked? > > It's been broken since at least v4.4, I think. I don't know about > earlier versions. OK. Just to be clear, this is happening on a 4.12-rcX system with no other patches? The code in arch_update_cpu_topology() has used stop_machine() since 30c05350c39d ("powerpc/pseries: Use stop machine to update cpu maps") which went into v3.10, about 4 years ago. Prior to that it used get/put_online_cpus(), since 9eff1a38407c ("powerpc/pseries: Poll VPA for topology changes and update NUMA maps"), which was 2.6.38 in 2010. I wouldn't rule out the possibility it's been broken for 7 years, but I wonder if something else has changed to cause it to break. We really need to work it out before we backport anything. >> Should it go to stable? (can't in its current form AFAICS) > > It's not hard to backport both this patch and commit fe5595c07400 > ("stop_machine: Provide stop_machine_cpuslocked()") from branch > smp/hotplug in tip.git for stable. Yeah but it's not really my business backporting that unfortunately. > Since rtasd only started calling arch_update_cpu_topology since v4.11, > for earlier versions this patch can be simplified to making that > function call stop_machine_cpuslocked unconditionally instead of > defining a separate function. OK. cheers
DNS (?) not working on G5 (64-bit powerpc) (was [net-next, v3, 3/3] udp: try to avoid 2 cache miss on dequeue)
Paolo wrote: > when udp_recvmsg() is executed, on x86_64 and other archs, most skb > fields are on cold cachelines. > If the skb are linear and the kernel don't need to compute the udp > csum, only a handful of skb fields are required by udp_recvmsg(). > Since we already use skb->dev_scratch to cache hot data, and > there are 32 bits unused on 64 bit archs, use such field to cache > as much data as we can, and try to prefetch on dequeue the relevant > fields that are left out. > > This can save up to 2 cache miss per packet. > > v1 -> v2: > - changed udp_dev_scratch fields types to u{32,16} variant, > replaced bitfiled with bool > > Signed-off-by: Paolo Abeni> Acked-by: Eric Dumazet > --- > net/ipv4/udp.c | 114 > +++-- > 1 file changed, 103 insertions(+), 11 deletions(-) This appears to break wget on one of my machines. Networking in general is working, I'm able to SSH in, but then I can't do a wget. eg: $ wget google.com --2017-06-22 22:45:39-- http://google.com/ Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in name resolution. wget: unable to resolve host address ‘proxy.pmdw.com’ $ host proxy.pmdw.com proxy.pmdw.com is an alias for raven.pmdw.com. raven.pmdw.com has address 10.1.2.3 $ wget google.com --2017-06-22 22:52:08-- http://google.com/ Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in name resolution. wget: unable to resolve host address ‘proxy.pmdw.com’ Maybe host is using TCP but the man page says it doesn't? Everything is OK if I boot back to the previous commit 0a463c78d25b ("udp: avoid a cache miss on dequeue"): $ wget google.com --2017-06-22 23:00:01-- http://google.com/ Resolving proxy.pmdw.com (proxy.pmdw.com)... 10.1.2.3 Connecting to proxy.pmdw.com (proxy.pmdw.com)|10.1.2.3|:3128... connected. Proxy request sent, awaiting response... 302 Found Location: http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE [following] --2017-06-22 23:00:01-- http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE Reusing existing connection to proxy.pmdw.com:3128. Proxy request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: ‘index.html’ index.html [ <=> ] 11.37K --.-KB/sin 0.001s 2017-06-22 23:00:01 (22.0 MB/s) - ‘index.html’ saved [11640] $ uname -a Linux 4.12.0-rc4-gcc6-00988-g0a463c7 #88 SMP Thu Jun 22 22:55:12 AEST 2017 ppc64 GNU/Linux Haven't had time to debug any further. Any ideas? cheers
Re: [PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor
On Thu, 22 Jun 2017 20:59:49 +1000 Michael Ellermanwrote: > Nicholas Piggin writes: > > > On Thu, 22 Jun 2017 00:08:37 +0530 > > "Naveen N. Rao" wrote: > > > >> Currently, we assume that the function pointer we receive in > >> ppc_function_entry() points to a function descriptor. However, this is > >> not always the case. In particular, assembly symbols without the right > >> annotation do not have an associated function descriptor. Some of these > >> symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL(). > >> When such addresses are subsequently processed through > >> arch_deref_entry_point() in populate_kprobe_blacklist(), we see the > >> below errors during bootup: > >> [0.663963] Failed to find blacklist at 7d9b02a648029b6c > >> [0.663970] Failed to find blacklist at a14d03d0394a0001 > >> [0.663972] Failed to find blacklist at 7d5302a6f94d0388 > >> [0.663973] Failed to find blacklist at 48027d11e8610178 > >> [0.663974] Failed to find blacklist at f8010070f8410080 > >> [0.663976] Failed to find blacklist at 386100704801f89d > >> [0.663977] Failed to find blacklist at 7d5302a6f94d00b0 > >> > >> Fix this by checking if the address in the function descriptor is > >> actually a valid kernel address. In the case of assembly symbols, this > >> will almost always fail as this ends up being powerpc instructions. In > >> that case, return pointer to the address we received, rather than the > >> dereferenced value. > >> > >> Signed-off-by: Naveen N. Rao > >> --- > >> arch/powerpc/include/asm/code-patching.h | 10 +- > >> 1 file changed, 9 insertions(+), 1 deletion(-) > >> > >> diff --git a/arch/powerpc/include/asm/code-patching.h > >> b/arch/powerpc/include/asm/code-patching.h > >> index abef812de7f8..ec54050be585 100644 > >> --- a/arch/powerpc/include/asm/code-patching.h > >> +++ b/arch/powerpc/include/asm/code-patching.h > >> @@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void > >> *func) > >> * On PPC64 ABIv1 the function pointer actually points to the > >> * function's descriptor. The first entry in the descriptor is the > >> * address of the function text. > >> + * > >> + * However, we may have received a pointer to an assembly symbol > >> + * that may not be a function descriptor. Validate that the entry > >> + * points to a valid kernel address and if not, return the pointer > >> + * we received as is. > >> */ > >> - return ((func_descr_t *)func)->entry; > >> + if (kernel_text_address(((func_descr_t *)func)->entry)) > >> + return ((func_descr_t *)func)->entry; > >> + else > >> + return (unsigned long)func; > > > > What if "func" is a text section label (bare asm function)? > > Won't func->entry load the random instruction located there > > and compare it with a kernel address? > > Yes, that's the problem. > > > I don't know too much about the v1 ABI, but should we check for > > func belonging in the .opd section first and base the check on > > that? Alternatively I if "func" is in the kernel text address, > > we can recognize it's not in the .opd section... right? > > That sounds like a more robust solution. But I suspect it won't work for > modules. kernel_text_address() seems to check for module text as well, so it might work I think?
Re: [PATCH] dmaengine: fsldma: set BWC, DAHTS and SAHTS values correctly
On Mon, Jun 19, 2017 at 04:40:04PM +0200, Thomas Breitung wrote: > The bits of BWC, DAHTS and SAHTS in the DMA mode register must be cleared > before a new value can be or-ed in. Applied, thanks -- ~Vinod
Re: [PATCH V3] cxl: Export library to support IBM XSL
Salut Christophe, Since there's a respin, 2 details below. diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c new file mode 100644 index 000..4f4c5ca --- /dev/null +++ b/drivers/misc/cxl/cxllib.c @@ -0,0 +1,246 @@ +/* + * Copyright 2017 IBM Corp. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include +#include +#include +#include Maybe somebody can comment on this, but I believe the usual order is the 'linux' headers first, then the 'asm'. Though I don't know if there's a valid reason behind it, or just tradition... diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c index c79e39b..077370f 100644 --- a/drivers/misc/cxl/fault.c +++ b/drivers/misc/cxl/fault.c +static int cxl_handle_page_fault(struct cxl_context *ctx, +struct mm_struct *mm, +u64 dsisr, u64 dar) +{ + trace_cxl_pte_miss(ctx, dsisr, dar); + + if (cxl_handle_mm_fault(mm, dsisr, dar)) { + cxl_ack_ae(ctx); + } else { + pr_devel("Page fault successfully handled for pe: %i!\n", ctx->pe); + cxl_ops->ack_irq(ctx, CXL_PSL_TFC_An_R, 0); + } + + return IRQ_HANDLED; } Why return IRQ_HANDLED? it's not called from an interrupt handler but from the bottom half (actually, I think it's coming from cxl_handle_segment_miss(), but it's wrong there as well). Fred
Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd
On Wed, 21 Jun 2017, Thiago Jung Bauermann wrote: > Michael Ellermanwrites: > >> Notes: > >> This patch applies on tip/smp/hotplug, it should probably be carried > >> there. > > > > stop_machine_cpuslocked() doesn't exist in mainline so I think it has to > > be carried there right? > > Yes. I said "probably" because I don't know if you want to wait > until that branch is merged so that you can carry this patch in your > tree. I'll pick it up for 4.12 and you folks can figure out the backporting once that hits Linus tree. Thanks, tglx
Re: [RFC PATCH 1/2] powerpc/xive: guest exploitation of the XIVE interrupt controller
On Thu, 2017-06-22 at 11:29 +0200, Cédric Le Goater wrote: > This is the framework for using XIVE in a PowerVM guest. The support > is very similar to the native one in a much simpler form. Looks really good. Minor nits & comments... > Instead of OPAL calls, a set of Hypervisors call are used to configure > the interrupt sources and the event/notification queues of the guest: > >H_INT_GET_SOURCE_INFO >H_INT_SET_SOURCE_CONFIG >H_INT_GET_SOURCE_CONFIG >H_INT_GET_QUEUE_INFO >H_INT_SET_QUEUE_CONFIG >H_INT_GET_QUEUE_CONFIG >H_INT_RESET There are the base ones. > Calls that still need to be addressed : > >H_INT_SET_OS_REPORTING_LINE >H_INT_GET_OS_REPORTING_LINE Ah so those have to do with that magic cache line you can register with the HW so that when you get an interrupt, you can do an MMIO store very early on in the interrupt entry path to the XIVE, which will asynchronously write the NSR etc... to that cache line which you can then poke at later one. I don't know if it's worth exploiting in Linux, but we should support it in qemu/kvm. >H_INT_ESB This is a h-call that performs the basic ESB operations. Some interrupts can have a flag telling the OS to do the operations using that hcall rather than directly. This can be used to workaround HW issues with some interrupts sources if needed. >H_INT_SYNC This will be needed for queue accounting in some cases, such as CPU hotplug I think etc... For example if you mask an interrupt in the ESB, a sync will ensure that any previous occurrence of this interrupt has reached its target queue (and thus is visible in memory). > As for XICS, the XIVE interface for the guest is described in the > device tree under the interrupt controller node. A couple of new > properties are specific to XIVE : > > - "reg" > >contains the base address and size of the thread interrupt >managnement areas (TIMA) for the user level for the OS level. Only >the OS level is taken into account. > > - "ibm,xive-eq-sizes" > >the size of the event queues. > > - "ibm,xive-lisn-ranges" > >the interrupt numbers ranges assigned to the guest. These are >allocated using a simple bitmap. > > This is work in progress. It was only tested with a QEMU XIVE model > for pseries. > > Signed-off-by: Cédric Le Goater> --- > arch/powerpc/include/asm/hvcall.h | 13 +- > arch/powerpc/include/asm/xive.h| 1 + > arch/powerpc/platforms/pseries/Kconfig | 1 + > arch/powerpc/platforms/pseries/setup.c | 8 +- > arch/powerpc/platforms/pseries/smp.c | 18 +- > arch/powerpc/sysdev/xive/Kconfig | 5 + > arch/powerpc/sysdev/xive/Makefile | 1 + > arch/powerpc/sysdev/xive/xive-hv.c | 523 > + > 8 files changed, 566 insertions(+), 4 deletions(-) > create mode 100644 arch/powerpc/sysdev/xive/xive-hv.c > > diff --git a/arch/powerpc/include/asm/hvcall.h > b/arch/powerpc/include/asm/hvcall.h > index d73755fafbb0..3c019e9f451a 100644 > --- a/arch/powerpc/include/asm/hvcall.h > +++ b/arch/powerpc/include/asm/hvcall.h > @@ -280,7 +280,18 @@ > #define H_RESIZE_HPT_COMMIT 0x370 > #define H_REGISTER_PROC_TBL 0x37C > #define H_SIGNAL_SYS_RESET 0x380 > -#define MAX_HCALL_OPCODE H_SIGNAL_SYS_RESET > +#define H_INT_GET_SOURCE_INFO 0x3A8 > +#define H_INT_SET_SOURCE_CONFIG 0x3AC > +#define H_INT_GET_SOURCE_CONFIG 0x3B0 > +#define H_INT_GET_QUEUE_INFO0x3B4 > +#define H_INT_SET_QUEUE_CONFIG 0x3B8 > +#define H_INT_GET_QUEUE_CONFIG 0x3BC > +#define H_INT_SET_OS_REPORTING_LINE 0x3C0 > +#define H_INT_GET_OS_REPORTING_LINE 0x3C4 > +#define H_INT_ESB 0x3C8 > +#define H_INT_SYNC 0x3CC > +#define H_INT_RESET 0x3D0 > +#define MAX_HCALL_OPCODE H_INT_RESET > > /* H_VIOCTL functions */ > #define H_GET_VIOA_DUMP_SIZE 0x01 > diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h > index c23ff4389ca2..c947952ed934 100644 > --- a/arch/powerpc/include/asm/xive.h > +++ b/arch/powerpc/include/asm/xive.h > @@ -110,6 +110,7 @@ extern bool __xive_enabled; > > static inline bool xive_enabled(void) { return __xive_enabled; } > > +extern bool xive_hv_init(void); > extern bool xive_native_init(void); > extern void xive_smp_probe(void); > extern int xive_smp_prepare_cpu(unsigned int cpu); > diff --git a/arch/powerpc/platforms/pseries/Kconfig > b/arch/powerpc/platforms/pseries/Kconfig > index 913c54e23eea..4eec0283f043 100644 > --- a/arch/powerpc/platforms/pseries/Kconfig > +++ b/arch/powerpc/platforms/pseries/Kconfig > @@ -7,6 +7,7 @@ config PPC_PSERIES > select PCI > select PCI_MSI > select PPC_XICS > + select PPC_XIVE_HV > select PPC_ICP_NATIVE > select PPC_ICP_HV > select PPC_ICS_RTAS > diff --git a/arch/powerpc/platforms/pseries/setup.c > b/arch/powerpc/platforms/pseries/setup.c > index b5d86426e97b..53bad49b84be 100644 > ---
Re: [PATCH V3] cxl: Export library to support IBM XSL
Le 22/06/2017 à 13:20, Michael Ellerman a écrit : Christophe Lombardwrites: This patch exports a in-kernel 'library' API which can be called by other drivers to help interacting with an IBM XSL on a POWER9 system. The XSL (Translation Service Layer) is a stripped down version of the PSL (Power Service Layer) used in some cards such as the Mellanox CX5. Like the PSL, it implements the CAIA architecture, but has a number of differences, mostly in it's implementation dependent registers. The XSL also uses a special DMA cxl mode, which uses a slightly different init sequence for the CAPP and PHB. Signed-off-by: Christophe Lombard Andrew sent you a SOB for parts he wrote, so you should include that. Waiting for an ack from Fred before I merge this. cheers You are right (Sorry Andrew !). I will push a new patch Thanks
Re: [PATCH V3] cxl: Export library to support IBM XSL
Christophe Lombardwrites: > This patch exports a in-kernel 'library' API which can be called by > other drivers to help interacting with an IBM XSL on a POWER9 system. > > The XSL (Translation Service Layer) is a stripped down version of the > PSL (Power Service Layer) used in some cards such as the Mellanox CX5. > Like the PSL, it implements the CAIA architecture, but has a number > of differences, mostly in it's implementation dependent registers. > > The XSL also uses a special DMA cxl mode, which uses a slightly > different init sequence for the CAPP and PHB. > > Signed-off-by: Christophe Lombard Andrew sent you a SOB for parts he wrote, so you should include that. Waiting for an ack from Fred before I merge this. cheers
Re: 1M hugepage size being registered on Linux
Benjamin Herrenschmidtwrites: > On Thu, 2017-06-22 at 13:59 +1000, Michael Ellerman wrote: >> It's unsupported in Linux because it doesn't match the page table >> geometry. >> >> We merged a patch from Aneesh to filter it out in 4.12-rc1: >> >> a525108cf1cc ("powerpc/mm/hugetlb: Filter out hugepage size not supported >> by page table layout") >> >> I guess we should probably send that patch to stable et. al. > > It's also not supported in HW afaik, why is it there ? Because hostboot said so AIUI. cheers
Re: [PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes
Nicholas Pigginwrites: > On Thu, 22 Jun 2017 00:08:40 +0530 > "Naveen N. Rao" wrote: > >> It is actually safe to probe system_call() in entry_64.S, but only till >> we unset MSR_RI. To allow this, add a new symbol system_call_exit() >> after the mtmsrd and blacklist that. Though the mtmsrd instruction >> itself is now whitelisted, we won't be allowed to probe on it as we >> don't allow probing on rfi and mtmsr instructions (checked for in >> arch_prepare_kprobe()). > > Can you add a little comment to say probes aren't allowed, and it's > located after the mtmsr in order to avoid contaminating traces? > > Also I wonder if a slightly different name would be more instructive? > I don't normally care, but the system_call_common code isn't trivial > to follow. system_call_exit might give the impression that it is the > entire exit path (which would pair with system_call for entry). It is the entire path in the happy case isn't it? I'm not sure I know what you mean. > Perhaps system_call_exit_notrace? No that sucks too :( A bit :D If you're tracing etc. then you'll be in syscall_exit_work, isn't that sufficient to differentiate the two? cheers
Re: [PATCH v3 5/6] powerpc/64s: Blacklist functions invoked on a trap
Nicholas Pigginwrites: > On Thu, 22 Jun 2017 00:08:41 +0530 > "Naveen N. Rao" wrote: > >> Blacklist all functions involved while handling a trap. We: >> - convert some of the symbols into private symbols, >> - remove the duplicate 'restore' symbol, and >> - blacklist most functions involved while handling a trap. > > I'm not sure removing "restore" makes it better. Yeah it bloats the patch needlessly, we can do a rename patch later. cheers
Re: [PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
Nicholas Pigginwrites: > On Thu, 22 Jun 2017 00:08:39 +0530 > "Naveen N. Rao" wrote: > >> Convert some of the symbols into private symbols and blacklist >> system_call_common() and system_call() from kprobes. We can't take a >> trap at parts of these functions as either MSR_RI is unset or the kernel >> stack pointer is not yet setup. >> >> Reviewed-by: Masami Hiramatsu >> Signed-off-by: Naveen N. Rao > > I don't have a problem with this bunch of system call labels > going private. They've never added much for me in profiles. > > Reviewed-by: Nicholas Piggin > > Semi-related question, why is system_call: where it is? Ancient history. We used to have: bne syscall_dotrace syscall_dotrace_cont: cmpldi 0,r0,NR_syscalls bge-syscall_enosys system_call:/* label this so stack traces look sane */ So it was there to hide syscall_dotrace_cont from back traces. But we made syscall_dotrace_cont local in 2012 and then removed it entirely in 2015. > Should we move it up to right after the mtmsrd / wrteei instruction? > (obviously for another patch). It's pretty common to get PMU > interrupts coming in right after mtmsr and this makes profiles split > the syscall into two which is annoying. Move it wherever makes sense and gives good back traces. cheers
Re: 1M hugepage size being registered on Linux
On Tue, 20 Jun 2017 10:47:47 -0300 victorawrote: > Hi Alistair/Jeremy, > > I am working on a bug related to 1M hugepage size being registered on > Linux (Power 8 Baremetal - Garrison). > > I was checking dmesg and it seems that 1M page size is coming from > firmware to Linux. > > [0.00] base_shift=20: shift=20, sllp=0x0130, > avpnm=0x, tlbiel=0, penc=2 > [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0 > pages > > Should Linux support this page size? As afar as I know, this was an > unsupported page size in the past isn't it? If this should be > supported now, is there any specific reason for that? Hello, a525108cf1cc powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout Should reject unsupported huge page sizes, right? Thanks Michal
Re: [PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor
Nicholas Pigginwrites: > On Thu, 22 Jun 2017 00:08:37 +0530 > "Naveen N. Rao" wrote: > >> Currently, we assume that the function pointer we receive in >> ppc_function_entry() points to a function descriptor. However, this is >> not always the case. In particular, assembly symbols without the right >> annotation do not have an associated function descriptor. Some of these >> symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL(). >> When such addresses are subsequently processed through >> arch_deref_entry_point() in populate_kprobe_blacklist(), we see the >> below errors during bootup: >> [0.663963] Failed to find blacklist at 7d9b02a648029b6c >> [0.663970] Failed to find blacklist at a14d03d0394a0001 >> [0.663972] Failed to find blacklist at 7d5302a6f94d0388 >> [0.663973] Failed to find blacklist at 48027d11e8610178 >> [0.663974] Failed to find blacklist at f8010070f8410080 >> [0.663976] Failed to find blacklist at 386100704801f89d >> [0.663977] Failed to find blacklist at 7d5302a6f94d00b0 >> >> Fix this by checking if the address in the function descriptor is >> actually a valid kernel address. In the case of assembly symbols, this >> will almost always fail as this ends up being powerpc instructions. In >> that case, return pointer to the address we received, rather than the >> dereferenced value. >> >> Signed-off-by: Naveen N. Rao >> --- >> arch/powerpc/include/asm/code-patching.h | 10 +- >> 1 file changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/include/asm/code-patching.h >> b/arch/powerpc/include/asm/code-patching.h >> index abef812de7f8..ec54050be585 100644 >> --- a/arch/powerpc/include/asm/code-patching.h >> +++ b/arch/powerpc/include/asm/code-patching.h >> @@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void *func) >> * On PPC64 ABIv1 the function pointer actually points to the >> * function's descriptor. The first entry in the descriptor is the >> * address of the function text. >> + * >> + * However, we may have received a pointer to an assembly symbol >> + * that may not be a function descriptor. Validate that the entry >> + * points to a valid kernel address and if not, return the pointer >> + * we received as is. >> */ >> -return ((func_descr_t *)func)->entry; >> +if (kernel_text_address(((func_descr_t *)func)->entry)) >> +return ((func_descr_t *)func)->entry; >> +else >> +return (unsigned long)func; > > What if "func" is a text section label (bare asm function)? > Won't func->entry load the random instruction located there > and compare it with a kernel address? Yes, that's the problem. > I don't know too much about the v1 ABI, but should we check for > func belonging in the .opd section first and base the check on > that? Alternatively I if "func" is in the kernel text address, > we can recognize it's not in the .opd section... right? That sounds like a more robust solution. But I suspect it won't work for modules. Another option might be to canonicalise the blacklist so that it always points to the text address, not sure how easy that would be. cheers
Re: [next-20170619][a28e46f] WARNING: CPU: 5 PID: 1 at arch/powerpc/lib/feature-fixups.c:214 .check_features+0xa0/0xbc
Thanks for testing this configuration Abdul. Abdul Haleemwrites: > Hi Michael, > > linux-next booted with warnings on a Power LPAR with CMO feature > enabled. > > Test : boot with CMO enabled > Kernel: 4.12.0-rc5-next-20170619 > gcc : 4.8.5 > Machine : Power 8 Power VM LPAR (Big Endian) > > Steps to recreate: > > 1. Enable Shared Memory/CPU Mode for LPAR in HMC (lpar profile) > 2. Build the kernel for the attached config > 3. Boot the kernel, boot messages shows below warning > > Firmware features changed after feature patching! > [ cut here ] > WARNING: CPU: 5 PID: 1 at > arch/powerpc/lib/feature-fixups.c:214 .check_features+0xa0/0xbc > Modules linked in: > CPU: 5 PID: 1 Comm: swapper/5 Not tainted 4.12.0-rc5-next-20170619-autotest #1 > task: c003bbc2 task.stack: c003bbc8 > NIP: c0ce599c LR: c0ce5998 CTR: 006338e4 > REGS: c003bbc83910 TRAP: 0700 Not tainted > (4.12.0-rc5-next-20170619-autotest) > MSR: 8282b032 > CR: 2222 XER: 0002 > CFAR: c024368c SOFTE: 1 > GPR00: c0ce5998 c003bbc83b90 c132e800 0031 > GPR04: 80380fe53ec0 > GPR08: c11999f8 c11999f8 3ff0 > GPR12: 2824 ce743480 c000d380 > GPR16: > GPR20: > GPR24: c12aad08 c0cc0b88 00e8 > GPR28: c0d3ea38 0007 c1375d20 > NIP [c0ce599c] .check_features+0xa0/0xbc > LR [c0ce5998] .check_features+0x9c/0xbc > Call Trace: > [c003bbc83b90] [c0ce5998] .check_features+0x9c/0xbc > (unreliable) > [c003bbc83c10] [c000cb1c] .do_one_initcall+0x5c/0x1c0 > [c003bbc83ce0] [c0cd47d0] .kernel_init_freeable+0x270/0x350 > [c003bbc83db0] [c000d39c] .kernel_init+0x1c/0x140 > [c003bbc83e30] [c000b3e4] .ret_from_kernel_thread+0x58/0x74 > Instruction dump: > 0fe0 6000 3d02ffa1 e92296e0 e94851d0 e929 7faa4800 41fe0018 > 3c62ff89 3863e6f0 4b55dcb9 6000 <0fe0> 38210080 3860 > e8010010 > random: 0x60003d22001e get_random_bytes called with crng_init=0 > ---[ end trace c0f890377edaeb14 ]--- > Running MSI bitmap self-tests ... > > Possible bad commit looks to be: That's not the bad commit ... > commit a28e46f109c9637b2539b9995078d5df4f7f6c09 > Author: Michael Ellerman > Date: Tue Jul 26 22:29:18 2016 +1000 > > powerpc/kernel: Check features don't change after patching > > Early in boot we binary patch some sections of code based on the CPU and > MMU feature bits. But it is a one-time patching, there is no facility > for repatching the code later if the set of features change. > > It is a major bug if the set of features changes after we've done the > code patching - so add a check for it. ... it's the commit that adds the warning that something is dodgy. In this case it's a false positive, FW_FEATURE_CMO is not used in assembly, so it's OK (but a bit dicey) that we change it after do_feature_fixups(). But we should still fix it so that we enable/disable FW_FEATURE_CMO before patching. cheers
[linux-next][489e45][PowerPC] build broke on bare-metal with gcc 4.8.5
Hi, next-20170621 fails to build on PowerPC bare-metal with gcc 4.8.5 Test : build Machine : Power 8 bare-metal (BMC) kernel : 4.12.0-rc6 gcc : 4.8.5 config : Hab-NV-config $ make -S -j In file included from drivers/net/ethernet/qlogic/qede/qede.h:43:0, from drivers/net/ethernet/qlogic/qede/qede_main.c:63: ./include/linux/qed/qede_rdma.h:85:1: error: expected identifier or ‘(’ before ‘{’ token { ^ ./include/linux/qed/qede_rdma.h:84:19: warning: ‘qede_rdma_dev_add’ used but never defined [enabled by default] static inline int qede_rdma_dev_add(struct qede_dev *dev); ^ make[5]: *** [drivers/net/ethernet/qlogic/qede/qede_main.o] Error 1 make[5]: *** Waiting for unfinished jobs Possible bad commit looks to be: commit 489e45ae42f000a5e045ac203ad5d6f08824cd56 Author: Sudarsana Reddy KalluruDate: Wed Jun 8 06:22:12 2016 -0400 qede: Add dcbnl support. This patch adds the interfaces for ieee/cee dcbnl callbacks and registers them with the kernel. Signed-off-by: Sudarsana Reddy Kalluru Signed-off-by: Yuval Mintz Signed-off-by: David S. Miller diff --git a/drivers/net/ethernet/qlogic/qede/qede_dcbnl.c b/drivers/net/ethernet/qlogic/qede/qede_dcbnl.c new file mode 100644 index 000..03e8c02 --- /dev/null +++ b/drivers/net/ethernet/qlogic/qede/qede_dcbnl.c -- Regard's Abdul Haleem IBM Linux Technology Centre # # Automatically generated file; DO NOT EDIT. # Linux/powerpc 4.9.0-rc8 Kernel Configuration # CONFIG_PPC64=y # # Processor support # CONFIG_PPC_BOOK3S_64=y # CONFIG_PPC_BOOK3E_64 is not set # CONFIG_POWER7_CPU is not set CONFIG_POWER8_CPU=y CONFIG_PPC_BOOK3S=y CONFIG_PPC_FPU=y CONFIG_ALTIVEC=y CONFIG_VSX=y # CONFIG_PPC_ICSWX is not set CONFIG_PPC_STD_MMU=y CONFIG_PPC_STD_MMU_64=y CONFIG_PPC_RADIX_MMU=y CONFIG_PPC_MM_SLICES=y CONFIG_PPC_HAVE_PMU_SUPPORT=y CONFIG_PPC_PERF_CTRS=y CONFIG_SMP=y CONFIG_NR_CPUS=2048 CONFIG_PPC_DOORBELL=y # CONFIG_CPU_BIG_ENDIAN is not set CONFIG_CPU_LITTLE_ENDIAN=y CONFIG_PPC64_BOOT_WRAPPER=y CONFIG_64BIT=y CONFIG_ARCH_PHYS_ADDR_T_64BIT=y CONFIG_ARCH_DMA_ADDR_T_64BIT=y CONFIG_MMU=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NR_IRQS=512 CONFIG_STACKTRACE_SUPPORT=y CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_ARCH_HAS_ILOG2_U32=y CONFIG_ARCH_HAS_ILOG2_U64=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK=y CONFIG_PPC=y CONFIG_GENERIC_CSUM=y CONFIG_EARLY_PRINTK=y CONFIG_PANIC_TIMEOUT=180 CONFIG_COMPAT=y CONFIG_SYSVIPC_COMPAT=y CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_PPC_UDBG_16550=y # CONFIG_GENERIC_TBSYNC is not set CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y CONFIG_EPAPR_BOOT=y # CONFIG_DEFAULT_UIMAGE is not set CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y # CONFIG_PPC_DCR_NATIVE is not set # CONFIG_PPC_DCR_MMIO is not set # CONFIG_PPC_OF_PLATFORM_PCI is not set CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_PPC_EMULATE_SSTEP=y CONFIG_ZONE_DMA32=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_XZ is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y # CONFIG_USELIB is not set CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_SHOW_LEVEL=y CONFIG_HARDIRQS_SW_RESEND=y CONFIG_IRQ_DOMAIN=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_IRQ_DOMAIN_DEBUG=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_GENERIC_TIME_VSYSCALL_OLD=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_ARCH_HAS_TICK_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_VIRT_CPU_ACCOUNTING=y # CONFIG_TICK_CPU_ACCOUNTING is not set CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_BSD_PROCESS_ACCT is not set CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y # CONFIG_TREE_RCU_TRACE is not set # CONFIG_RCU_EXPEDITE_BOOT
[RFC v2 3/3] cxl: Add memory barrier to guarantee TLBI scope
With the hash memory model, all TLBIs become global when the cxl driver is active, i.e. as soon as one context is open. It is theoretically possible to send a TLBI with the wrong scope as there's currently no memory barrier between when the driver is marked as in use, and attaching a context to the device, therefore we are exposed to re-ordering. It is highly unlikely as the use count for the driver is incremented on open() and the attachment to the device happens on a different system call (ioctl) Signed-off-by: Frederic Barrat--- include/misc/cxl-base.h | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/include/misc/cxl-base.h b/include/misc/cxl-base.h index b2ebc91fe09a..dcb6d38ab3ad 100644 --- a/include/misc/cxl-base.h +++ b/include/misc/cxl-base.h @@ -25,12 +25,28 @@ extern atomic_t cxl_use_count; static inline bool cxl_ctx_in_use(void) { - return (atomic_read(_use_count) != 0); + /* +* This is called when sending an TLBI, to know whether it +* should be global or local. +* +* We need to make sure the PTE update is happening before +* reading the context global flag. Otherwise, reading the +* flag may be re-ordered and happen first, and we could end +* up in a situation where the old PTE is seen by the device, +* but the TLBI is not global. +*/ + smp_mb(); + return (atomic_read(_use_count) != 0); } static inline void cxl_ctx_get(void) { atomic_inc(_use_count); + /* + * Barrier guarantees that the device will receive all TLBIs + * from that point on + */ + smp_wmb(); } static inline void cxl_ctx_put(void) -- 2.11.0
[RFC v2 2/3] cxl: Mark context requiring global TLBIs
The PSL needs to see all TLBIs pertinent to the memory contexts used on the cxl adapter. For the hash memory model, it was done by making all TLBIs global as soon as the cxl driver is in use. For radix, we need something similar, but we can refine and only convert to global the invalidations for contexts actually used by the device. So mark the contexts being attached to the cxl adapter as requiring global TLBIs. Signed-off-by: Frederic Barrat--- drivers/misc/cxl/api.c | 12 ++-- drivers/misc/cxl/file.c | 12 ++-- 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c index 1a138c83f877..2cebe214195e 100644 --- a/drivers/misc/cxl/api.c +++ b/drivers/misc/cxl/api.c @@ -332,8 +332,17 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed, cxl_context_mm_count_get(ctx); /* decrement the use count */ - if (ctx->mm) + if (ctx->mm) { mmput(ctx->mm); +#ifdef CONFIG_PPC_BOOK3S_64 + mm_context_set_global_tlbi(>mm->context); + /* +* Barrier guarantees that the device will +* receive all TLBIs from that point on +*/ + smp_wmb(); +#endif + } } cxl_ctx_get(); @@ -347,7 +356,6 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed, cxl_context_mm_count_put(ctx); goto out; } - ctx->status = STARTED; out: mutex_unlock(>status_mutex); diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c index 0761271d68c5..3ba6a60e0a6d 100644 --- a/drivers/misc/cxl/file.c +++ b/drivers/misc/cxl/file.c @@ -222,8 +222,17 @@ static long afu_ioctl_start_work(struct cxl_context *ctx, cxl_context_mm_count_get(ctx); /* decrement the use count */ - if (ctx->mm) + if (ctx->mm) { mmput(ctx->mm); +#ifdef CONFIG_PPC_BOOK3S_64 + mm_context_set_global_tlbi(>mm->context); + /* +* Barrier guarantees that the device will receive all +* TLBIs from that point on +*/ + smp_wmb(); +#endif + } trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr); @@ -236,7 +245,6 @@ static long afu_ioctl_start_work(struct cxl_context *ctx, cxl_context_mm_count_put(ctx); goto out; } - ctx->status = STARTED; rc = 0; out: -- 2.11.0
[RFC v2 1/3] powerpc/mm: Add marker for contexts requiring global TLB invalidations
Introduce a new 'flags' attribute per context and define its first bit to be a marker requiring all TLBIs for that context to be broadcasted globally. Once that marker is set on a context, it cannot be removed. Such a marker is useful for memory contexts used by devices behind the NPU and CAPP/PSL. The NPU and the PSL keep their own translation cache so they need to see all the TLBIs for those contexts. Signed-off-by: Frederic Barrat--- arch/powerpc/include/asm/book3s/64/mmu.h | 18 ++ arch/powerpc/include/asm/tlb.h | 23 +-- arch/powerpc/mm/mmu_context_book3s64.c | 1 + 3 files changed, 40 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index 77529a3e3811..cd83f8eb6a3f 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -78,8 +78,12 @@ struct spinlock; /* Maximum possible number of NPUs in a system. */ #define NV_MAX_NPUS 8 +/* Bits definition for the context flags */ +#define MM_GLOBAL_TLBIE0 /* TLBI must be global */ + typedef struct { mm_context_id_t id; + unsigned long flags; u16 user_psize; /* page size index */ /* NPU NMMU context */ @@ -164,5 +168,19 @@ extern void radix_init_pseries(void); static inline void radix_init_pseries(void) { }; #endif +/* + * Mark the memory context as requiring global TLBIs, when used by + * GPUs or CAPI accelerators managing their own TLB or ERAT. +*/ +static inline void mm_context_set_global_tlbi(mm_context_t *ctx) +{ + set_bit(MM_GLOBAL_TLBIE, >flags); +} + +static inline bool mm_context_get_global_tlbi(mm_context_t *ctx) +{ + return test_bit(MM_GLOBAL_TLBIE, >flags); +} + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_BOOK3S_64_MMU_H_ */ diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index 609557569f65..87d4ddcbf7f8 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -71,8 +71,27 @@ static inline int mm_is_core_local(struct mm_struct *mm) static inline int mm_is_thread_local(struct mm_struct *mm) { - return cpumask_equal(mm_cpumask(mm), - cpumask_of(smp_processor_id())); + int rc; + + rc = cpumask_equal(mm_cpumask(mm), + cpumask_of(smp_processor_id())); +#ifdef CONFIG_PPC_BOOK3S_64 + if (rc) { + /* +* Check if context requires global TLBI. +* +* We need to make sure the PTE update is happening +* before reading the context global flag. Otherwise, +* reading the flag may be re-ordered and happen +* first, and we could end up in a situation where the +* old PTE was seen by a device, but the TLBI is not +* global. +*/ + smp_mb(); + rc = !mm_context_get_global_tlbi(>context); + } +#endif + return rc; } #else diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c index a3edf813d455..c32a3f729d81 100644 --- a/arch/powerpc/mm/mmu_context_book3s64.c +++ b/arch/powerpc/mm/mmu_context_book3s64.c @@ -156,6 +156,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) return index; mm->context.id = index; + mm->context.flags = 0; #ifdef CONFIG_PPC_ICSWX mm->context.cop_lockp = kmalloc(sizeof(spinlock_t), GFP_KERNEL); if (!mm->context.cop_lockp) { -- 2.11.0
[RFC v2 0/3] powerpc/mm: Mark memory contexts requiring global TLBIs
capi2 and opencapi require the TLB invalidations being sent for addresses used on the cxl adapter or opencapi device to be global, as there's a translation cache in the PSL (for capi2) or NPU (for opencapi). The CAPP (for PSL) and NPU snoop the power bus. This is not new: for the hash memory model, as soon as the cxl driver is active, all local TLBIs become global. We need a similar mechanism for the radix memory model. This patch tries to improve things a bit by flagging the contexts requiring global TLBIs, therefore limiting the "upgrade" and not affecting contexts not used by the card. A longer-term goal is to modify the current implementation for hash to follow the same direction, i.e. identify contexts needing global TLBIs, but that will be for later. It would be required to support hash for opencapi. Changelog: >From previous comments: - rename MM_CONTEXT_GLOBAL_TLBI -> MM_GLOBAL_TLBIE - add memory barriers to make sure the device doesn't miss any TLBI - also add barrier for the hash implemention to fix the same issue Frederic Barrat (3): powerpc/mm: Add marker for contexts requiring global TLB invalidations cxl: Mark context requiring global TLBIs cxl: Add memory barrier to guarantee TLBI scope arch/powerpc/include/asm/book3s/64/mmu.h | 18 ++ arch/powerpc/include/asm/tlb.h | 23 +-- arch/powerpc/mm/mmu_context_book3s64.c | 1 + drivers/misc/cxl/api.c | 12 ++-- drivers/misc/cxl/file.c | 12 ++-- include/misc/cxl-base.h | 18 +- 6 files changed, 77 insertions(+), 7 deletions(-) -- 2.11.0
[RFC PATCH 0/2] guest exploitation of the XIVE interrupt controller
Hello, I am currently working on a XIVE model for the pseries QEMU machine and I started hacking Linux to have something to test with. Here's a first draft of XIVE support for the guest. This is work in progress, only tested with the QEMU XIVE model I am working on. Feedback most welcome. Thanks, C. Cédric Le Goater (2): powerpc/xive: guest exploitation of the XIVE interrupt controller powerpc/xive: add XIVE exploitation mode to CAS arch/powerpc/include/asm/hvcall.h | 13 +- arch/powerpc/include/asm/xive.h| 1 + arch/powerpc/kernel/prom_init.c| 15 +- arch/powerpc/platforms/pseries/Kconfig | 1 + arch/powerpc/platforms/pseries/setup.c | 8 +- arch/powerpc/platforms/pseries/smp.c | 18 +- arch/powerpc/sysdev/xive/Kconfig | 5 + arch/powerpc/sysdev/xive/Makefile | 1 + arch/powerpc/sysdev/xive/xive-hv.c | 523 + 9 files changed, 580 insertions(+), 5 deletions(-) create mode 100644 arch/powerpc/sysdev/xive/xive-hv.c -- 2.7.5