date:20170622

Re: 1M hugepage size being registered on Linux

2017-06-22 Thread Michael Ellerman

Michal Suchánek  writes:

> On Tue, 20 Jun 2017 10:47:47 -0300
> victora  wrote:
>
>> Hi Alistair/Jeremy,
>> 
>> I am working on a bug related to 1M hugepage size being registered on 
>> Linux (Power 8 Baremetal - Garrison).
>> 
>> I was checking dmesg and it seems that 1M page size is coming from 
>> firmware to Linux.
>> 
>> [0.00] base_shift=20: shift=20, sllp=0x0130,
>> avpnm=0x, tlbiel=0, penc=2
>> [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0
>> pages
>> 
>> Should Linux support this page size? As afar as I know, this was an 
>> unsupported page size in the past isn't it? If this should be
>> supported now, is there any specific reason for that?
>
> Hello,
>
> a525108cf1cc powerpc/mm/hugetlb: Filter out hugepage size not supported
> by page table layout
>
> Should reject unsupported huge page sizes, right?

Yes, see my reply.

cheers

Re: [PATCH V3] cxl: Export library to support IBM XSL

2017-06-22 Thread Michael Ellerman

Frederic Barrat  writes:
>> diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
>> new file mode 100644
>> index 000..4f4c5ca
>> --- /dev/null
>> +++ b/drivers/misc/cxl/cxllib.c
>> @@ -0,0 +1,246 @@
>> +/*
>> + * Copyright 2017 IBM Corp.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version
>> + * 2 of the License, or (at your option) any later version.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>
> Maybe somebody can comment on this, but I believe the usual order is the 
> 'linux' headers first, then the 'asm'. Though I don't know if there's a 
> valid reason behind it, or just tradition...

AFAIK it's mainly tradition.

In general if there's a linux/foo.h and an asm/foo.h you're usually
supposed to include the linux version, and then that will include
asm/foo.h appropriately.

In some cases you might have an asm header that relies on something
defined in the linux version, meaning including the asm one first
doesn't build.

Outside of arch it's slightly dubious to be including asm/ headers
directly, because you require all arches to have that header.

None of which really answers your question :)

cheers

Re: [PATCH kernel 0/3 REPOST] vfio-pci: Add support for mmapping MSI-X table

2017-06-22 Thread Alexey Kardashevskiy

On 23/06/17 07:11, Alex Williamson wrote:
> On Thu, 15 Jun 2017 15:48:42 +1000
> Alexey Kardashevskiy  wrote:
> 
>> Here is a patchset which Yongji was working on before
>> leaving IBM LTC. Since we still want to have this functionality
>> in the kernel (DPDK is the first user), here is a rebase
>> on the current upstream.
>>
>>
>> Current vfio-pci implementation disallows to mmap the page
>> containing MSI-X table in case that users can write directly
>> to MSI-X table and generate an incorrect MSIs.
>>
>> However, this will cause some performance issue when there
>> are some critical device registers in the same page as the
>> MSI-X table. We have to handle the mmio access to these
>> registers in QEMU emulation rather than in guest.
>>
>> To solve this issue, this series allows to expose MSI-X table
>> to userspace when hardware enables the capability of interrupt
>> remapping which can ensure that a given PCI device can only
>> shoot the MSIs assigned for it. And we introduce a new bus_flags
>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
>> for different archs.
>>
>> The patch 3 are based on the proposed patchset[1].
>>
>> Changelog
>> v3:
>> - rebased on the current upstream
> 
> There's something not forthcoming here, the last version I see from
> Yongji is this one:
> 
> https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017245.html
> 
> Which was a 6-patch series where patches 2-4 tried to apply
> PCI_BUS_FLAGS_MSI_REMAP for cases that supported other platforms.  That
> doesn't exist here, so it's not simply a rebase.  Patch 1/ seems to
> equate this new flag to the IOMMU capability IOMMU_CAP_INTR_REMAP, but
> nothing is done here to match them together.  That patch also mentions
> the work Eric has done for similar features on ARM, but again those
> patches are dropped.  It seems like an incomplete feature now.  Thanks,


Thanks! I suspected this is not the latest but could not find anything
better than we use internally for tests, and I could not reach Yongji for
comments whether this was the latest update.

As I am reading the patches, I notice that the "msi remap" term is used all
over the place. While this remapping capability may be the case for x86/arm
(and therefore the IOMMU_CAP_INTR_REMAP flag makes sense), powernv does not
do remapping but provides hardware isolation. When we are allowing MSIX BAR
mapping to the userspace - the isolation is what we really care about. Will
it make sense to rename PCI_BUS_FLAGS_MSI_REMAP to
PCI_BUS_FLAGS_MSI_ISOLATED ?

Another thing - the patchset enables PCI_BUS_FLAGS_MSI_REMAP when IOMMU
just advertises IOMMU_CAP_INTR_REMAP, not necessarily uses it, should the
patchset actually look at something like irq_remapping_enabled in
drivers/iommu/amd_iommu.c instead?



> 
> Alex
> 
>> v2:
>> - Make the commit log more clear
>> - Replace pci_bus_check_msi_remapping() with pci_bus_msi_isolated()
>>   so that we could clearly know what the function does
>> - Set PCI_BUS_FLAGS_MSI_REMAP in pci_create_root_bus() instead
>>   of iommu_bus_notifier()
>> - Reserve VFIO_REGION_INFO_FLAG_CAPS when we allow to mmap MSI-X
>>   table so that we can know whether we allow to mmap MSI-X table
>>   in QEMU
>>
>> [1] 
>> https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg1138820.html
>>
>>
>> This is based on sha1
>> 63f700aab4c1 Linus Torvalds "Merge tag 'xtensa-20170612' of 
>> git://github.com/jcmvbkbc/linux-xtensa".
>>
>> Please comment. Thanks.
>>
>>
>>
>> Yongji Xie (3):
>>   PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
>>   pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
>>   vfio-pci: Allow to expose MSI-X table to userspace if interrupt
>> remapping is enabled
>>
>>  include/linux/pci.h   |  1 +
>>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 
>>  drivers/vfio/pci/vfio_pci.c   | 18 +++---
>>  drivers/vfio/pci/vfio_pci_rdwr.c  |  3 ++-
>>  4 files changed, 26 insertions(+), 4 deletions(-)
>>
> 


-- 
Alexey

Re: linux-next: build warning after merge of the tip tree

2017-06-22 Thread Stephen Rothwell

Hi Michael,

On Fri, 23 Jun 2017 14:10:25 +1000 Michael Ellerman  wrote:
>
> Fixed in my next today by:
> 
>   d4cfb11387ee ("powerpc: Convert VDSO update function to use new 
> update_vsyscall interface")
> 
> But you must have pulled before I pushed that, so the warning will go
> away tomorrow.

Excellent, thanks.

-- 
Cheers,
Stephen Rothwell

Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd

2017-06-22 Thread Michael Ellerman

Thomas Gleixner  writes:

> On Thu, 22 Jun 2017, Thiago Jung Bauermann wrote:
>> Michael Bringmann provided this information:
>> >> It's not hard to backport both this patch and commit fe5595c07400
>> >> ("stop_machine: Provide stop_machine_cpuslocked()") from branch
>> >> smp/hotplug in tip.git for stable.
>> >
>> > Yeah but it's not really my business backporting that unfortunately.
>> 
>> Sorry, I wasn't clear. I was offering to provide backported patches for
>> the relevant stable branches.
>> 
>> Though that will only be necessary if we also backport the topology
>> fixes as well.
>
> So shall I pick up the fix and route it through tip smp/hotplug where the
> cpu hotplug core changes reside on which that patch depends on?

Yes please. Here's an ack if you like:

Acked-by: Michael Ellerman 

cheers

Re: linux-next: build warning after merge of the tip tree

2017-06-22 Thread Michael Ellerman

Stephen Rothwell  writes:

> Hi all,
>
> [Forgot to cc John]
>
> On Fri, 23 Jun 2017 13:58:34 +1000 Stephen Rothwell  
> wrote:
>>
>> Hi all,
>> 
>> After merging the tip tree, today's linux-next build (powerpc
>> ppc64_defconfig) produced this warning:
>> 
>> kernel/time/timekeeping.c:519:2: warning: #warning Please contact your 
>> maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. 
>> [-Wcpp]
>>  #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD 
>> compatibity will disappear soon.
>>   ^
>> kernel/time/timekeeping.c:519:2: warning: #warning Please contact your 
>> maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. 
>> [-Wcpp]
>>  #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD 
>> compatibity will disappear soon.
>>   ^
>> 
>> Introduced by commit
>> 
>>   369adf04d80a ("time: Add warning about imminent deprecation of 
>> CONFIG_GENERIC_TIME_VSYSCALL_OLD")

Fixed in my next today by:

  d4cfb11387ee ("powerpc: Convert VDSO update function to use new 
update_vsyscall interface")

But you must have pulled before I pushed that, so the warning will go
away tomorrow.

cheers

Re: linux-next: build warning after merge of the tip tree

2017-06-22 Thread Stephen Rothwell

Hi all,

[Forgot to cc John]

On Fri, 23 Jun 2017 13:58:34 +1000 Stephen Rothwell  
wrote:
>
> Hi all,
> 
> After merging the tip tree, today's linux-next build (powerpc
> ppc64_defconfig) produced this warning:
> 
> kernel/time/timekeeping.c:519:2: warning: #warning Please contact your 
> maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. 
> [-Wcpp]
>  #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD 
> compatibity will disappear soon.
>   ^
> kernel/time/timekeeping.c:519:2: warning: #warning Please contact your 
> maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. 
> [-Wcpp]
>  #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD 
> compatibity will disappear soon.
>   ^
> 
> Introduced by commit
> 
>   369adf04d80a ("time: Add warning about imminent deprecation of 
> CONFIG_GENERIC_TIME_VSYSCALL_OLD")

-- 
Cheers,
Stephen Rothwell

linux-next: build warning after merge of the tip tree

2017-06-22 Thread Stephen Rothwell

Hi all,

After merging the tip tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:

kernel/time/timekeeping.c:519:2: warning: #warning Please contact your 
maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. 
[-Wcpp]
 #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD 
compatibity will disappear soon.
  ^
kernel/time/timekeeping.c:519:2: warning: #warning Please contact your 
maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. 
[-Wcpp]
 #warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD 
compatibity will disappear soon.
  ^

Introduced by commit

  369adf04d80a ("time: Add warning about imminent deprecation of 
CONFIG_GENERIC_TIME_VSYSCALL_OLD")

-- 
Cheers,
Stephen Rothwell

Re: [PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9

2017-06-22 Thread Michael Neuling

Yes please... Would be helpful


On 22 Jun. 2017 7:50 pm, "Stewart Smith"  wrote:

> Michael Neuling  writes:
> > On POWER9 the ERAT may be incorrect on wakeup from some stop states
> > that lose state. This causes random segvs and illegal instructions
> > when these stop states are enabled.
> >
> > This patch invalidates the ERAT on wakeup on POWER9 to prevent this
> > from causing a problem.
> >
> > Signed-off-by: Michael Neuling 
>
> CC: stable?
>
> --
> Stewart Smith
> OPAL Architect, IBM.
>
>

Re: [PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9

2017-06-22 Thread Michael Neuling

On Fri, 2017-06-23 at 10:50 +1000, Stewart Smith wrote:
> Michael Neuling  writes:
> > On POWER9 the ERAT may be incorrect on wakeup from some stop states
> > that lose state. This causes random segvs and illegal instructions
> > when these stop states are enabled.
> > 
> > This patch invalidates the ERAT on wakeup on POWER9 to prevent this
> > from causing a problem.
> > 
> > Signed-off-by: Michael Neuling 
> 
> CC: stable?

Yes please!

Mikey

Re: [PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9

2017-06-22 Thread Stewart Smith

Michael Neuling  writes:
> On POWER9 the ERAT may be incorrect on wakeup from some stop states
> that lose state. This causes random segvs and illegal instructions
> when these stop states are enabled.
>
> This patch invalidates the ERAT on wakeup on POWER9 to prevent this
> from causing a problem.
>
> Signed-off-by: Michael Neuling 

CC: stable?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: new dma-mapping tree, was Re: clean up and modularize arch dma_mapping interface V2

2017-06-22 Thread Stephen Rothwell

Hi all,

On Wed, 21 Jun 2017 15:32:39 +0200 Marek Szyprowski  
wrote:
>
> On 2017-06-20 15:16, Christoph Hellwig wrote:
> > On Tue, Jun 20, 2017 at 11:04:00PM +1000, Stephen Rothwell wrote:  
> >> git://git.linaro.org/people/mszyprowski/linux-dma-mapping.git#dma-mapping-next
> >>
> >> Contacts: Marek Szyprowski and Kyungmin Park (cc'd)
> >>
> >> I have called your tree dma-mapping-hch for now.  The other tree has
> >> not been updated since 4.9-rc1 and I am not sure how general it is.
> >> Marek, Kyungmin, any comments?  
> > I'd be happy to join efforts - co-maintainers and reviers are always
> > welcome.  
> 
> I did some dma-mapping unification works in the past and my tree in 
> linux-next
> was a side effect of that. I think that for now it can be dropped in 
> favor of
> Christoph's tree. I can also do some review and help in maintainers work if
> needed, although I was recently busy with other stuff.

OK, so I have dropped the dma-mapping tree and renamed dma-mapping-hch
to dma-mapping.

-- 
Cheers,
Stephen Rothwell

Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next, v3, 3/3] udp: try to avoid 2 cache miss on dequeue)

2017-06-22 Thread Hannes Frederic Sowa

On Thu, Jun 22, 2017, at 22:57, Paolo Abeni wrote:
> 
> Can you please check if the following patch fixes the issue? Only
> compiled tested here.
> 
> Thanks!!!
> ---
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 067a607..80d89fe 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1446,16 +1446,19 @@ static struct sk_buff
> *__first_packet_length(struct sock *sk,
>  {
>   struct sk_buff *skb;
>  
> -   while ((skb = skb_peek(rcvq)) != NULL &&
> -      udp_lib_checksum_complete(skb)) {
> -   __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
> -   IS_UDPLITE(sk));
> -   __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
> -   IS_UDPLITE(sk));
> -   atomic_inc(>sk_drops);
> -   __skb_unlink(skb, rcvq);
> -   *total += skb->truesize;
> -   kfree_skb(skb);
> +   while ((skb = skb_peek(rcvq)) != NULL) {
> +   if (udp_lib_checksum_complete(skb)) {
> +   __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
> +   IS_UDPLITE(sk));
> +   __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
> +   IS_UDPLITE(sk));
> +   atomic_inc(>sk_drops);
> +   __skb_unlink(skb, rcvq);
> +   *total += skb->truesize;
> +   kfree_skb(skb);
> +   } else {
> +   udp_set_dev_scratch(skb);

It needs a "break;" here.

> +   }
>   }
>   return skb;
>  }

Bye,
Hannes

Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe


On 6/22/2017 3:03 PM, Arnd Bergmann wrote:

Drivers that want a non-atomic variant should include either
include/linux/io-64-nonatomic-hi-lo.h or include/linux/io-64-nonatomic-lo-hi.h
depending on what they need. Drivers that require 64-bit I/O should
probably just depend on CONFIG_64BIT and maybe use readq/writeq.


Ok, I will work something like that up.

We'll still need a patch similar to patch 2 (less the non-atomic 
versions) seeing even CONFIG_GENERIC_IOMAP arches don't actually have a 
working ioread64/iowrite64 implementation.


Thanks,

Logan

Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next,v3,3/3] udp: try to avoid 2 cache miss on dequeue)

2017-06-22 Thread Paolo Abeni

On Thu, 2017-06-22 at 23:06 +1000, Michael Ellerman wrote:
> Paolo wrote:
> > when udp_recvmsg() is executed, on x86_64 and other archs, most skb
> > fields are on cold cachelines.
> > If the skb are linear and the kernel don't need to compute the udp
> > csum, only a handful of skb fields are required by udp_recvmsg().
> > Since we already use skb->dev_scratch to cache hot data, and
> > there are 32 bits unused on 64 bit archs, use such field to cache
> > as much data as we can, and try to prefetch on dequeue the relevant
> > fields that are left out.
> > 
> > This can save up to 2 cache miss per packet.
> > 
> > v1 -> v2:
> >   - changed udp_dev_scratch fields types to u{32,16} variant,
> > replaced bitfiled with bool
> > 
> > Signed-off-by: Paolo Abeni 
> > Acked-by: Eric Dumazet 
> > ---
> >  net/ipv4/udp.c | 114 
> > +++--
> >  1 file changed, 103 insertions(+), 11 deletions(-)
> 
> This appears to break wget on one of my machines.
> 
> Networking in general is working, I'm able to SSH in, but then I can't
> do a wget.

Can you please check if the following patch fixes the issue? Only
compiled tested here.

Thanks!!!
---
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 067a607..80d89fe 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1446,16 +1446,19 @@ static struct sk_buff *__first_packet_length(struct 
sock *sk,
 {
    struct sk_buff *skb;
 
-   while ((skb = skb_peek(rcvq)) != NULL &&
-      udp_lib_checksum_complete(skb)) {
-   __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
-   IS_UDPLITE(sk));
-   __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
-   IS_UDPLITE(sk));
-   atomic_inc(>sk_drops);
-   __skb_unlink(skb, rcvq);
-   *total += skb->truesize;
-   kfree_skb(skb);
+   while ((skb = skb_peek(rcvq)) != NULL) {
+   if (udp_lib_checksum_complete(skb)) {
+   __UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
+   IS_UDPLITE(sk));
+   __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
+   IS_UDPLITE(sk));
+   atomic_inc(>sk_drops);
+   __skb_unlink(skb, rcvq);
+   *total += skb->truesize;
+   kfree_skb(skb);
+   } else {
+   udp_set_dev_scratch(skb);
+   }
    }
    return skb;
 }

Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Logan Gunthorpe


On 6/22/2017 2:36 PM, Alan Cox wrote:

I think that makes sense for the platforms with that problem. I'm not
sure there are many that can't do it for mmio at least. 486SX can't do it
and I guess some ARM32 but I think almost everyone else can including
most 32bit x86.

What's more of a problem is a lot of platforms can do 64bit MMIO via
ioread/write64 but not 64bit port I/O, and it's not clear how you
represent that via an ioread/write API that abstracts it away.


In Patch 2, we call bad_io_access for anyone trying to do 64bit accesses 
on port I/O.


Logan

Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next,v3,3/3] udp: try to avoid 2 cache miss on dequeue)

2017-06-22 Thread Paolo Abeni

On Thu, 2017-06-22 at 18:43 +0200, Paolo Abeni wrote:
> On Thu, 2017-06-22 at 23:06 +1000, Michael Ellerman wrote:
> > Paolo wrote:
> > > when udp_recvmsg() is executed, on x86_64 and other archs, most skb
> > > fields are on cold cachelines.
> > > If the skb are linear and the kernel don't need to compute the udp
> > > csum, only a handful of skb fields are required by udp_recvmsg().
> > > Since we already use skb->dev_scratch to cache hot data, and
> > > there are 32 bits unused on 64 bit archs, use such field to cache
> > > as much data as we can, and try to prefetch on dequeue the relevant
> > > fields that are left out.
> > > 
> > > This can save up to 2 cache miss per packet.
> > > 
> > > v1 -> v2:
> > >   - changed udp_dev_scratch fields types to u{32,16} variant,
> > > replaced bitfiled with bool
> > > 
> > > Signed-off-by: Paolo Abeni 
> > > Acked-by: Eric Dumazet 
> > > ---
> > >  net/ipv4/udp.c | 114 
> > > +++--
> > >  1 file changed, 103 insertions(+), 11 deletions(-)
> > 
> > This appears to break wget on one of my machines.
> > 
> > Networking in general is working, I'm able to SSH in, but then I can't
> > do a wget.
> > 
> > eg:
> > 
> > $ wget google.com
> > --2017-06-22 22:45:39--  http://google.com/
> > Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in 
> > name resolution.
> > wget: unable to resolve host address ‘proxy.pmdw.com’
> > 
> > $ host proxy.pmdw.com
> > proxy.pmdw.com is an alias for raven.pmdw.com.
> > raven.pmdw.com has address 10.1.2.3
> > 
> > $ wget google.com
> > --2017-06-22 22:52:08--  http://google.com/
> > Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in 
> > name resolution.
> > wget: unable to resolve host address ‘proxy.pmdw.com’
> > 
> > Maybe host is using TCP but the man page says it doesn't?
> > 
> > 
> > Everything is OK if I boot back to the previous commit 0a463c78d25b
> > ("udp: avoid a cache miss on dequeue"):
> > 
> > $ wget google.com
> > --2017-06-22 23:00:01--  http://google.com/
> > Resolving proxy.pmdw.com (proxy.pmdw.com)... 10.1.2.3
> > Connecting to proxy.pmdw.com (proxy.pmdw.com)|10.1.2.3|:3128... connected.
> > Proxy request sent, awaiting response... 302 Found
> > Location: http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE 
> > [following]
> > --2017-06-22 23:00:01--  
> > http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE
> > Reusing existing connection to proxy.pmdw.com:3128.
> > Proxy request sent, awaiting response... 200 OK
> > Length: unspecified [text/html]
> > Saving to: ‘index.html’
> > 
> > index.html  [ <=>   
> > ]  11.37K  --.-KB/sin 0.001s  
> > 
> > 2017-06-22 23:00:01 (22.0 MB/s) - ‘index.html’ saved [11640]
> > 
> > $ uname -a
> > Linux 4.12.0-rc4-gcc6-00988-g0a463c7 #88 SMP Thu Jun 22 22:55:12 AEST 2017 
> > ppc64 GNU/Linux
> > 
> > 
> > Haven't had time to debug any further. Any ideas?
> 
> Thank you for this report.
> 
> Can you please specify features of the relevant NIC ? (ethtool -k
> ) 
> 
> I'll try to replicate the issue as soon I'll get hands on suitable HW,

I had my hands on power7, but I can't trivially reproduce the issue so
I'm going to bug you for more info. 

Can you please specify the host CPU, the NIC in use (ethtool -i
), the compiler version used to build the kernel and possibly
provide a tcpdump of the DNS packets received/sent while running wget
and while running the host command?

Do you have the relevant kernel running on others PPC hosts?

Thank you,

Paolo

Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Logan Gunthorpe


On 6/22/2017 2:14 PM, Alan Cox wrote:

If a platform doesn't support 64bit I/O operations from the CPU then you
either need to use some kind of platform/architecture specific interface
if present or accept you don't have one.


Yes, I understand that.

The thing is that every user that's currently using it right now is 
patching in their own version that splits it on non-64bit systems.



It's not safe to split it. Possibly for some use cases you could add an
ioread64_maysplit()


I'm open to doing something like that.


What btw is the actual ARM compiler warning ? Is the compiler also trying
to tell you it's a bad idea ?


It's just the compiler noting that you are mixing volatile and 
non-volatile pointers. Strangely some io{read|write}XX use volatile but 
most do not. But it's nothing crazy.


Logan

Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe


On 6/22/2017 2:08 PM, Alan Cox wrote:

But this does not do the same thing as an ioread64 with regards to
atomicity or side effects on the device. The same is true of the other
hacks. You either have a real 64bit single read/write from MMIO space or
you don't. You can't fake it.


Yes, I know. But is it not better than having every driver that wants to 
use these functions fake it themselves?


Logan

Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe


On 6/22/2017 11:29 AM, Stephen  Bates wrote:

+#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p))


Logan, thanks for taking this cleanup on. I think this should be iowrite64 not 
iowrite32?


Yup, good catch. Thanks. I'll fix it in a v2 of this series.

Logan

Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Stephen Bates

> +#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p))
 
Logan, thanks for taking this cleanup on. I think this should be iowrite64 not 
iowrite32?

Stephen

[PATCH 1/7] drm/tilcdc: don't use volatile with iowrite64

2017-06-22 Thread Logan Gunthorpe

This is a prep patch for adding a universal iowrite64.

The patch is to prevent compiler warnings when we add iowrite64 that
would occur because there is an unnecessary volatile in this driver.

Signed-off-by: Logan Gunthorpe 
Cc: Jyri Sarha 
Cc: Tomi Valkeinen 
Cc: David Airlie 
---
 drivers/gpu/drm/tilcdc/tilcdc_regs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/tilcdc/tilcdc_regs.h 
b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
index 9d528c0a67a4..e9ce725698a9 100644
--- a/drivers/gpu/drm/tilcdc/tilcdc_regs.h
+++ b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
@@ -131,14 +131,14 @@ static inline void tilcdc_write(struct drm_device *dev, 
u32 reg, u32 data)
 static inline void tilcdc_write64(struct drm_device *dev, u32 reg, u64 data)
 {
struct tilcdc_drm_private *priv = dev->dev_private;
-   volatile void __iomem *addr = priv->mmio + reg;
+   void __iomem *addr = priv->mmio + reg;
 
 #ifdef iowrite64
iowrite64(data, addr);
 #else
__iowmb();
/* This compiles to strd (=64-bit write) on ARM7 */
-   *(volatile u64 __force *)addr = __cpu_to_le64(data);
+   *(u64 __force *)addr = __cpu_to_le64(data);
 #endif
 }
 
-- 
2.11.0

[PATCH 7/7] crypto: caam: cleanup CONFIG_64BIT ifdefs when using io{read|write}64

2017-06-22 Thread Logan Gunthorpe

Now that ioread64 and iowrite64 are always available we don't
need the ugly ifdefs to change their implementation when they
are not.

Signed-off-by: Logan Gunthorpe 
Cc: "Horia Geantă" 
Cc: Dan Douglass 
Cc: Herbert Xu 
Cc: "David S. Miller" 
---
 drivers/crypto/caam/regs.h | 29 -
 1 file changed, 29 deletions(-)

diff --git a/drivers/crypto/caam/regs.h b/drivers/crypto/caam/regs.h
index 84d2f838a063..26fc19dd0c39 100644
--- a/drivers/crypto/caam/regs.h
+++ b/drivers/crypto/caam/regs.h
@@ -134,7 +134,6 @@ static inline void clrsetbits_32(void __iomem *reg, u32 
clear, u32 set)
  *base + 0x : least-significant 32 bits
  *base + 0x0004 : most-significant 32 bits
  */
-#ifdef CONFIG_64BIT
 static inline void wr_reg64(void __iomem *reg, u64 data)
 {
if (caam_little_end)
@@ -151,34 +150,6 @@ static inline u64 rd_reg64(void __iomem *reg)
return ioread64be(reg);
 }
 
-#else /* CONFIG_64BIT */
-static inline void wr_reg64(void __iomem *reg, u64 data)
-{
-#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
-   if (caam_little_end) {
-   wr_reg32((u32 __iomem *)(reg) + 1, data >> 32);
-   wr_reg32((u32 __iomem *)(reg), data);
-   } else
-#endif
-   {
-   wr_reg32((u32 __iomem *)(reg), data >> 32);
-   wr_reg32((u32 __iomem *)(reg) + 1, data);
-   }
-}
-
-static inline u64 rd_reg64(void __iomem *reg)
-{
-#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
-   if (caam_little_end)
-   return ((u64)rd_reg32((u32 __iomem *)(reg) + 1) << 32 |
-   (u64)rd_reg32((u32 __iomem *)(reg)));
-   else
-#endif
-   return ((u64)rd_reg32((u32 __iomem *)(reg)) << 32 |
-   (u64)rd_reg32((u32 __iomem *)(reg) + 1));
-}
-#endif /* CONFIG_64BIT  */
-
 #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
 #ifdef CONFIG_SOC_IMX7D
 #define cpu_to_caam_dma(value) \
-- 
2.11.0

[PATCH 6/7] drm/tilcdc: clean up ifdef hacks around iowrite64

2017-06-22 Thread Logan Gunthorpe

Now that we can expect iowrite64 to always exist the hack is no longer
necessary so we just call iowrite64 directly.

Signed-off-by: Logan Gunthorpe 
Cc: Jyri Sarha 
Cc: Tomi Valkeinen 
Cc: David Airlie 
---
 drivers/gpu/drm/tilcdc/tilcdc_regs.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/gpu/drm/tilcdc/tilcdc_regs.h 
b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
index e9ce725698a9..0b901405f30a 100644
--- a/drivers/gpu/drm/tilcdc/tilcdc_regs.h
+++ b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
@@ -133,13 +133,7 @@ static inline void tilcdc_write64(struct drm_device *dev, 
u32 reg, u64 data)
struct tilcdc_drm_private *priv = dev->dev_private;
void __iomem *addr = priv->mmio + reg;
 
-#ifdef iowrite64
iowrite64(data, addr);
-#else
-   __iowmb();
-   /* This compiles to strd (=64-bit write) on ARM7 */
-   *(u64 __force *)addr = __cpu_to_le64(data);
-#endif
 }
 
 static inline u32 tilcdc_read(struct drm_device *dev, u32 reg)
-- 
2.11.0

[PATCH 2/7] iomap: implement ioread64 and iowrite64

2017-06-22 Thread Logan Gunthorpe

Currently, ioread64 and iowrite64 are not impleminted in the generic
iomap implementation. The prototypes are defined if CONFIG_64BIT is set
but there is no actual implementation.

Seeing the functions are not universally available, they are unusable
for driver developers. This leads to ugly hacks such as those at
the top of

drivers/ntb/hw/intel/ntb_hw_intel.c

This patch adds generic implementations for these functions. We add
the obvious version if readq/writeq are implemented and fall back
to using two io32 calls in cases that don't provide direct 64bit
accesses. Thus making the functions universally available to
configurations with CONFIG_GENERIC_IOMAP=y.

For any pio accesses, the 64bit operations remain unsupported and
simply call bad_io_access in cases readq would be called.

Signed-off-by: Logan Gunthorpe 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Arnd Bergmann 
Cc: Suresh Warrier 
Cc: Nicholas Piggin 
---
 arch/powerpc/include/asm/io.h |  2 ++
 include/asm-generic/iomap.h   |  4 ---
 lib/iomap.c   | 62 +++
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 422f99cf9924..11a83667d2c3 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -788,8 +788,10 @@ extern void __iounmap_at(void *ea, unsigned long size);
 
 #define mmio_read16be(addr)readw_be(addr)
 #define mmio_read32be(addr)readl_be(addr)
+#define mmio_read64be(addr)readq_be(addr)
 #define mmio_write16be(val, addr)  writew_be(val, addr)
 #define mmio_write32be(val, addr)  writel_be(val, addr)
+#define mmio_write64be(val, addr)  writeq_be(val, addr)
 #define mmio_insb(addr, dst, count)readsb(addr, dst, count)
 #define mmio_insw(addr, dst, count)readsw(addr, dst, count)
 #define mmio_insl(addr, dst, count)readsl(addr, dst, count)
diff --git a/include/asm-generic/iomap.h b/include/asm-generic/iomap.h
index 650fede33c25..43ec4ea9f6f9 100644
--- a/include/asm-generic/iomap.h
+++ b/include/asm-generic/iomap.h
@@ -30,20 +30,16 @@ extern unsigned int ioread16(void __iomem *);
 extern unsigned int ioread16be(void __iomem *);
 extern unsigned int ioread32(void __iomem *);
 extern unsigned int ioread32be(void __iomem *);
-#ifdef CONFIG_64BIT
 extern u64 ioread64(void __iomem *);
 extern u64 ioread64be(void __iomem *);
-#endif
 
 extern void iowrite8(u8, void __iomem *);
 extern void iowrite16(u16, void __iomem *);
 extern void iowrite16be(u16, void __iomem *);
 extern void iowrite32(u32, void __iomem *);
 extern void iowrite32be(u32, void __iomem *);
-#ifdef CONFIG_64BIT
 extern void iowrite64(u64, void __iomem *);
 extern void iowrite64be(u64, void __iomem *);
-#endif
 
 /*
  * "string" versions of the above. Note that they
diff --git a/lib/iomap.c b/lib/iomap.c
index fc3dcb4b238e..e38e036cb52f 100644
--- a/lib/iomap.c
+++ b/lib/iomap.c
@@ -66,6 +66,7 @@ static void bad_io_access(unsigned long port, const char 
*access)
 #ifndef mmio_read16be
 #define mmio_read16be(addr) be16_to_cpu(__raw_readw(addr))
 #define mmio_read32be(addr) be32_to_cpu(__raw_readl(addr))
+#define mmio_read64be(addr) be64_to_cpu(__raw_readq(addr))
 #endif
 
 unsigned int ioread8(void __iomem *addr)
@@ -93,11 +94,45 @@ unsigned int ioread32be(void __iomem *addr)
IO_COND(addr, return pio_read32be(port), return mmio_read32be(addr));
return 0x;
 }
+
+#ifdef readq
+u64 ioread64(void __iomem *addr)
+{
+   IO_COND(addr, bad_io_access(port, "ioread64"), return readq(addr));
+   return 0xLL;
+}
+u64 ioread64be(void __iomem *addr)
+{
+   IO_COND(addr, bad_io_access(port, "ioread64be"),
+   return mmio_read64be(addr));
+   return 0xLL;
+}
+#else
+u64 ioread64(void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32(addr);
+   high = ioread32(addr + sizeof(u32));
+   return low | (high << 32);
+}
+u64 ioread64be(void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32be(addr + sizeof(u32));
+   high = ioread32be(addr);
+   return low | (high << 32);
+}
+#endif
+
 EXPORT_SYMBOL(ioread8);
 EXPORT_SYMBOL(ioread16);
 EXPORT_SYMBOL(ioread16be);
 EXPORT_SYMBOL(ioread32);
 EXPORT_SYMBOL(ioread32be);
+EXPORT_SYMBOL(ioread64);
+EXPORT_SYMBOL(ioread64be);
 
 #ifndef pio_write16be
 #define pio_write16be(val,port) outw(swab16(val),port)
@@ -107,6 +142,7 @@ EXPORT_SYMBOL(ioread32be);
 #ifndef mmio_write16be
 #define mmio_write16be(val,port) __raw_writew(be16_to_cpu(val),port)
 #define mmio_write32be(val,port) __raw_writel(be32_to_cpu(val),port)
+#define mmio_write64be(val,port) __raw_writeq(be64_to_cpu(val),port)
 #endif
 
 void iowrite8(u8 val, void __iomem *addr)
@@ -129,11

[PATCH 0/7] cleanup issues with io{read|write}64

2017-06-22 Thread Logan Gunthorpe

Hi,

Presently, the 64bit IO functions are not very usable in drivers because
they are not universally available in all architectures. This leads to
a bunch of hacks in the kernel to work around this. (See the last 3
patches in this series.) As part of my switchtec_ntb submission which
added another one of these warts, Greg asked me to look into fixing
it[1].

So this patchset attempts to solve this issue by filling in the missing
implementations in iomap.c and io.h. After that, the alpha architecture is
the only one I found that also needed a fix for this. Finally, this
patchset removes the hacks that have accumulated in the kernel,
thus far, for working around this.

This set is based off of v4.12-rc6.

Thanks,

Logan

[1] https://marc.info/?l=linux-kernel=149774601910663=2

Logan Gunthorpe (7):
  drm/tilcdc: don't use volatile with iowrite64
  iomap: implement ioread64 and iowrite64
  asm-generic/io.h: make ioread64 and iowrite64 universally available
  alpha: provide ioread64 and iowrite64 implementations
  ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks
  drm/tilcdc: clean up ifdef hacks around iowrite64
  crypto: caam: cleanup CONFIG_64BIT ifdefs when using io{read|write}64

 arch/alpha/include/asm/io.h  |  2 ++
 arch/alpha/kernel/io.c   | 18 +++
 arch/powerpc/include/asm/io.h|  2 ++
 drivers/crypto/caam/regs.h   | 29 -
 drivers/gpu/drm/tilcdc/tilcdc_regs.h |  8 +
 drivers/ntb/hw/intel/ntb_hw_intel.c  | 30 -
 include/asm-generic/io.h | 54 ---
 include/asm-generic/iomap.h  |  4 ---
 lib/iomap.c  | 62 
 9 files changed, 127 insertions(+), 82 deletions(-)

--
2.11.0

[PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe

Alpha implements its own io operation and doesn't use the
common library. Thus to make ioread64 and iowrite64 globally
available we need to add implementations for alpha.

For this, we simply use calls that chain two 32-bit operations.
(mostly because I don't really understand the alpha architecture.)

Signed-off-by: Logan Gunthorpe 
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
---
 arch/alpha/include/asm/io.h |  2 ++
 arch/alpha/kernel/io.c  | 18 ++
 2 files changed, 20 insertions(+)

diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h
index ff4049155c84..15588092c062 100644
--- a/arch/alpha/include/asm/io.h
+++ b/arch/alpha/include/asm/io.h
@@ -493,8 +493,10 @@ extern inline void writeq(u64 b, volatile void __iomem 
*addr)
 
 #define ioread16be(p) be16_to_cpu(ioread16(p))
 #define ioread32be(p) be32_to_cpu(ioread32(p))
+#define ioread64be(p) be64_to_cpu(ioread64(p))
 #define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p))
 #define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p))
+#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p))
 
 #define inb_p  inb
 #define inw_p  inw
diff --git a/arch/alpha/kernel/io.c b/arch/alpha/kernel/io.c
index 19c5875ab398..8c28026f7849 100644
--- a/arch/alpha/kernel/io.c
+++ b/arch/alpha/kernel/io.c
@@ -59,6 +59,24 @@ EXPORT_SYMBOL(iowrite8);
 EXPORT_SYMBOL(iowrite16);
 EXPORT_SYMBOL(iowrite32);
 
+u64 ioread64(void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32(addr);
+   high = ioread32(addr + sizeof(u32));
+   return low | (high << 32);
+}
+
+void iowrite64(u64 val, void __iomem *addr)
+{
+   iowrite32(val, addr);
+   iowrite32(val >> 32, addr + sizeof(u32));
+}
+
+EXPORT_SYMBOL(ioread64);
+EXPORT_SYMBOL(iowrite64);
+
 u8 inb(unsigned long port)
 {
return ioread8(ioport_map(port, 1));
-- 
2.11.0

[PATCH 5/7] ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks

2017-06-22 Thread Logan Gunthorpe

Now that ioread64 and iowrite64 are available generically we can
remove the hack at the top of ntb_hw_intel.c that patches them in
when they are not available.

Signed-off-by: Logan Gunthorpe 
Cc: Jon Mason 
Cc: Dave Jiang 
Cc: Allen Hubbe 
---
 drivers/ntb/hw/intel/ntb_hw_intel.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c 
b/drivers/ntb/hw/intel/ntb_hw_intel.c
index c00238491673..56221d540c2b 100644
--- a/drivers/ntb/hw/intel/ntb_hw_intel.c
+++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
@@ -153,35 +153,6 @@ MODULE_PARM_DESC(xeon_b2b_dsd_bar5_addr32,
 static inline enum ntb_topo xeon_ppd_topo(struct intel_ntb_dev *ndev, u8 ppd);
 static int xeon_init_isr(struct intel_ntb_dev *ndev);
 
-#ifndef ioread64
-#ifdef readq
-#define ioread64 readq
-#else
-#define ioread64 _ioread64
-static inline u64 _ioread64(void __iomem *mmio)
-{
-   u64 low, high;
-
-   low = ioread32(mmio);
-   high = ioread32(mmio + sizeof(u32));
-   return low | (high << 32);
-}
-#endif
-#endif
-
-#ifndef iowrite64
-#ifdef writeq
-#define iowrite64 writeq
-#else
-#define iowrite64 _iowrite64
-static inline void _iowrite64(u64 val, void __iomem *mmio)
-{
-   iowrite32(val, mmio);
-   iowrite32(val >> 32, mmio + sizeof(u32));
-}
-#endif
-#endif
-
 static inline int pdev_is_atom(struct pci_dev *pdev)
 {
switch (pdev->device) {
@@ -3008,4 +2979,3 @@ static void __exit intel_ntb_pci_driver_exit(void)
debugfs_remove_recursive(debugfs_dir);
 }
 module_exit(intel_ntb_pci_driver_exit);
-
-- 
2.11.0

[PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Logan Gunthorpe

Currently, ioread64 and iowrite64 are only available io CONFIG_64BIT=y
and CONFIG_GENERIC_IOMAP=n. Thus, seeing the functions are not
universally available, it makes them unusable for driver developers.
This leads to ugly hacks such as those at the top of

drivers/ntb/hw/intel/ntb_hw_intel.c

This patch adds fallback implementations for when CONFIG_64BIT and
CONFIG_GENERIC_IOMAP are not set. These functions use two io32 based
calls to complete the operation.

Note, we do not use the volatile keyword in these functions like the
others in the same file. It is necessary to avoid a compiler warning
on arm.

Signed-off-by: Logan Gunthorpe 
Cc: Arnd Bergmann 
---
 include/asm-generic/io.h | 54 +---
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index 7ef015eb3403..817edaa3da78 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -585,15 +585,24 @@ static inline u32 ioread32(const volatile void __iomem 
*addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef ioread64
 #define ioread64 ioread64
-static inline u64 ioread64(const volatile void __iomem *addr)
+#ifdef readq
+static inline u64 ioread64(const void __iomem *addr)
 {
return readq(addr);
 }
+#else
+static inline u64 ioread64(const void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32(addr);
+   high = ioread32(addr + sizeof(u32));
+   return low | (high << 32);
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef iowrite8
 #define iowrite8 iowrite8
@@ -619,15 +628,21 @@ static inline void iowrite32(u32 value, volatile void 
__iomem *addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef iowrite64
 #define iowrite64 iowrite64
-static inline void iowrite64(u64 value, volatile void __iomem *addr)
+#ifdef writeq
+static inline void iowrite64(u64 value, void __iomem *addr)
 {
writeq(value, addr);
 }
+#else
+static inline void iowrite64(u64 value, void __iomem *addr)
+{
+   iowrite32(value, addr);
+   iowrite32(value >> 32, addr + sizeof(u32));
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef ioread16be
 #define ioread16be ioread16be
@@ -645,15 +660,24 @@ static inline u32 ioread32be(const volatile void __iomem 
*addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef ioread64be
 #define ioread64be ioread64be
-static inline u64 ioread64be(const volatile void __iomem *addr)
+#ifdef readq
+static inline u64 ioread64be(const void __iomem *addr)
 {
return swab64(readq(addr));
 }
+#else
+static inline u64 ioread64be(const void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32be(addr + sizeof(u32));
+   high = ioread32be(addr);
+   return low | (high << 32);
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef iowrite16be
 #define iowrite16be iowrite16be
@@ -671,15 +695,21 @@ static inline void iowrite32be(u32 value, volatile void 
__iomem *addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef iowrite64be
 #define iowrite64be iowrite64be
-static inline void iowrite64be(u64 value, volatile void __iomem *addr)
+#ifdef writeq
+static inline void iowrite64be(u64 value, void __iomem *addr)
 {
writeq(swab64(value), addr);
 }
+#else
+static inline void iowrite64be(u64 value, void __iomem *addr)
+{
+   iowrite32be(value >> 32, addr);
+   iowrite32be(value, addr + sizeof(u32));
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef ioread8_rep
 #define ioread8_rep ioread8_rep
-- 
2.11.0

Re: DNS (?) not working on G5 (64-bit powerpc) (was [net-next,v3,3/3] udp: try to avoid 2 cache miss on dequeue)

2017-06-22 Thread Paolo Abeni

On Thu, 2017-06-22 at 23:06 +1000, Michael Ellerman wrote:
> Paolo wrote:
> > when udp_recvmsg() is executed, on x86_64 and other archs, most skb
> > fields are on cold cachelines.
> > If the skb are linear and the kernel don't need to compute the udp
> > csum, only a handful of skb fields are required by udp_recvmsg().
> > Since we already use skb->dev_scratch to cache hot data, and
> > there are 32 bits unused on 64 bit archs, use such field to cache
> > as much data as we can, and try to prefetch on dequeue the relevant
> > fields that are left out.
> > 
> > This can save up to 2 cache miss per packet.
> > 
> > v1 -> v2:
> >   - changed udp_dev_scratch fields types to u{32,16} variant,
> > replaced bitfiled with bool
> > 
> > Signed-off-by: Paolo Abeni 
> > Acked-by: Eric Dumazet 
> > ---
> >  net/ipv4/udp.c | 114 
> > +++--
> >  1 file changed, 103 insertions(+), 11 deletions(-)
> 
> This appears to break wget on one of my machines.
> 
> Networking in general is working, I'm able to SSH in, but then I can't
> do a wget.
> 
> eg:
> 
> $ wget google.com
> --2017-06-22 22:45:39--  http://google.com/
> Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in 
> name resolution.
> wget: unable to resolve host address ‘proxy.pmdw.com’
> 
> $ host proxy.pmdw.com
> proxy.pmdw.com is an alias for raven.pmdw.com.
> raven.pmdw.com has address 10.1.2.3
> 
> $ wget google.com
> --2017-06-22 22:52:08--  http://google.com/
> Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in 
> name resolution.
> wget: unable to resolve host address ‘proxy.pmdw.com’
> 
> Maybe host is using TCP but the man page says it doesn't?
> 
> 
> Everything is OK if I boot back to the previous commit 0a463c78d25b
> ("udp: avoid a cache miss on dequeue"):
> 
> $ wget google.com
> --2017-06-22 23:00:01--  http://google.com/
> Resolving proxy.pmdw.com (proxy.pmdw.com)... 10.1.2.3
> Connecting to proxy.pmdw.com (proxy.pmdw.com)|10.1.2.3|:3128... connected.
> Proxy request sent, awaiting response... 302 Found
> Location: http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE 
> [following]
> --2017-06-22 23:00:01--  
> http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE
> Reusing existing connection to proxy.pmdw.com:3128.
> Proxy request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: ‘index.html’
> 
> index.html  [ <=>   ] 
>  11.37K  --.-KB/sin 0.001s  
> 
> 2017-06-22 23:00:01 (22.0 MB/s) - ‘index.html’ saved [11640]
> 
> $ uname -a
> Linux 4.12.0-rc4-gcc6-00988-g0a463c7 #88 SMP Thu Jun 22 22:55:12 AEST 2017 
> ppc64 GNU/Linux
> 
> 
> Haven't had time to debug any further. Any ideas?

Thank you for this report.

Can you please specify features of the relevant NIC ? (ethtool -k
) 

I'll try to replicate the issue as soon I'll get hands on suitable HW,
meanwhile can you please try to trace the system behavior with perf?

Something like:

perf probe -a __udp_enqueue_schedule_skb
perf probe -a udp_recvmsg
perf probe -a udpv6_recvmsg
perf record -e probe:__udp_enqueue_schedule_skb -e probe:udp_recvmsg -e 
probe:udpv6_recvmsg -ag wget google.com
perf report --stdio

Thanks,

Paolo

Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd

2017-06-22 Thread Thomas Gleixner

On Thu, 22 Jun 2017, Thiago Jung Bauermann wrote:
> Michael Bringmann provided this information:
> >> It's not hard to backport both this patch and commit fe5595c07400
> >> ("stop_machine: Provide stop_machine_cpuslocked()") from branch
> >> smp/hotplug in tip.git for stable.
> >
> > Yeah but it's not really my business backporting that unfortunately.
> 
> Sorry, I wasn't clear. I was offering to provide backported patches for
> the relevant stable branches.
> 
> Though that will only be necessary if we also backport the topology
> fixes as well.

So shall I pick up the fix and route it through tip smp/hotplug where the
cpu hotplug core changes reside on which that patch depends on?

Thanks,

tglx

Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

2017-06-22 Thread Ram Pai

On Sat, Jun 17, 2017 at 08:54:44AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-06-16 at 12:15 -0700, Ram Pai wrote:
> > gp_regs size is not changed, nor is the layout. A unused field in
> > the gp_regs is used to fill in the AMR contents. Old binaries will not
> > be knowing about this unused field, and hence should not break.
> > 
> > New binaries can leverage this already existing but newly defined
> > field; to read the contents of AMR.
> > 
> > Is it still a concern?
> 
> Calls to sys_swapcontext with a made-up context will end up with a crap
> AMR if done by code who didn't know about that register.

Turns out x86 does not have this problem, because x86 does not implement
sys_swapcontext.  However; unlike x86, powerpc lets signal handler
program the AMR(x86 PKRU equivalent), which can persist even after the
signal handler returns to the kernel through sys_sigreturn.

So I am inclined to deviate from the x86 protection-key semantics.

On x86 the persistent way for the signal handler
to change the key register(PKRU) is through a field in the siginfo structure.

And on powerpc  the persistent way for the signal handler to change the
key register(AMR) will be to directly program the AMR register.

This should resolve your concern on powerpc, since there is no way
a crap AMR value will change the real AMR register, because the powerpc
kernel will not be letting it happen. Acceptable?

RP

Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Richard Henderson


On 06/22/2017 09:48 AM, Logan Gunthorpe wrote:

Alpha implements its own io operation and doesn't use the
common library. Thus to make ioread64 and iowrite64 globally
available we need to add implementations for alpha.

For this, we simply use calls that chain two 32-bit operations.
(mostly because I don't really understand the alpha architecture.)


It's not difficult to provide this interface[*].  I believe the only reason I 
didn't do so from the beginning is that it wasn't used.




r~


* At least for systems other than Jensen, which cannot generate 64-bit I/O.  On 
the other hand, Jensen doesn't have PCI (EISA only), and so won't have any 
devices that care.

Re: [PATCH kernel 0/3 REPOST] vfio-pci: Add support for mmapping MSI-X table

2017-06-22 Thread Alex Williamson

On Thu, 15 Jun 2017 15:48:42 +1000
Alexey Kardashevskiy  wrote:

> Here is a patchset which Yongji was working on before
> leaving IBM LTC. Since we still want to have this functionality
> in the kernel (DPDK is the first user), here is a rebase
> on the current upstream.
> 
> 
> Current vfio-pci implementation disallows to mmap the page
> containing MSI-X table in case that users can write directly
> to MSI-X table and generate an incorrect MSIs.
> 
> However, this will cause some performance issue when there
> are some critical device registers in the same page as the
> MSI-X table. We have to handle the mmio access to these
> registers in QEMU emulation rather than in guest.
> 
> To solve this issue, this series allows to expose MSI-X table
> to userspace when hardware enables the capability of interrupt
> remapping which can ensure that a given PCI device can only
> shoot the MSIs assigned for it. And we introduce a new bus_flags
> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
> for different archs.
> 
> The patch 3 are based on the proposed patchset[1].
> 
> Changelog
> v3:
> - rebased on the current upstream

There's something not forthcoming here, the last version I see from
Yongji is this one:

https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017245.html

Which was a 6-patch series where patches 2-4 tried to apply
PCI_BUS_FLAGS_MSI_REMAP for cases that supported other platforms.  That
doesn't exist here, so it's not simply a rebase.  Patch 1/ seems to
equate this new flag to the IOMMU capability IOMMU_CAP_INTR_REMAP, but
nothing is done here to match them together.  That patch also mentions
the work Eric has done for similar features on ARM, but again those
patches are dropped.  It seems like an incomplete feature now.  Thanks,

Alex

> v2:
> - Make the commit log more clear
> - Replace pci_bus_check_msi_remapping() with pci_bus_msi_isolated()
>   so that we could clearly know what the function does
> - Set PCI_BUS_FLAGS_MSI_REMAP in pci_create_root_bus() instead
>   of iommu_bus_notifier()
> - Reserve VFIO_REGION_INFO_FLAG_CAPS when we allow to mmap MSI-X
>   table so that we can know whether we allow to mmap MSI-X table
>   in QEMU
> 
> [1] 
> https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg1138820.html
> 
> 
> This is based on sha1
> 63f700aab4c1 Linus Torvalds "Merge tag 'xtensa-20170612' of 
> git://github.com/jcmvbkbc/linux-xtensa".
> 
> Please comment. Thanks.
> 
> 
> 
> Yongji Xie (3):
>   PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
>   pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
>   vfio-pci: Allow to expose MSI-X table to userspace if interrupt
> remapping is enabled
> 
>  include/linux/pci.h   |  1 +
>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 
>  drivers/vfio/pci/vfio_pci.c   | 18 +++---
>  drivers/vfio/pci/vfio_pci_rdwr.c  |  3 ++-
>  4 files changed, 26 insertions(+), 4 deletions(-)
>

Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Arnd Bergmann

On Thu, Jun 22, 2017 at 10:09 PM, Logan Gunthorpe  wrote:
> On 6/22/2017 2:08 PM, Alan Cox wrote:
>>
>> But this does not do the same thing as an ioread64 with regards to
>> atomicity or side effects on the device. The same is true of the other
>> hacks. You either have a real 64bit single read/write from MMIO space or
>> you don't. You can't fake it.
>
>
> Yes, I know. But is it not better than having every driver that wants to use
> these functions fake it themselves?

Drivers that want a non-atomic variant should include either
include/linux/io-64-nonatomic-hi-lo.h or include/linux/io-64-nonatomic-lo-hi.h
depending on what they need. Drivers that require 64-bit I/O should
probably just depend on CONFIG_64BIT and maybe use readq/writeq.

I see that there are exactly two drivers calling ioread64/iowrite64:
drivers/crypto/caam/ is architecture specific and
drivers/ntb/hw/intel/ntb_hw_intel.c already has a workaround that should
make it build on alpha.

Arnd

Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Alan Cox

On Thu, 22 Jun 2017 10:48:14 -0600
Logan Gunthorpe  wrote:

> Alpha implements its own io operation and doesn't use the
> common library. Thus to make ioread64 and iowrite64 globally
> available we need to add implementations for alpha.
> 
> For this, we simply use calls that chain two 32-bit operations.
> (mostly because I don't really understand the alpha architecture.)

But this does not do the same thing as an ioread64 with regards to
atomicity or side effects on the device. The same is true of the other
hacks. You either have a real 64bit single read/write from MMIO space or
you don't. You can't fake it.

Alan

Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Alan Cox

On Thu, 22 Jun 2017 14:24:58 -0600
Logan Gunthorpe  wrote:

> On 6/22/2017 2:14 PM, Alan Cox wrote:
> > If a platform doesn't support 64bit I/O operations from the CPU then you
> > either need to use some kind of platform/architecture specific interface
> > if present or accept you don't have one.  
> 
> Yes, I understand that.
> 
> The thing is that every user that's currently using it right now is 
> patching in their own version that splits it on non-64bit systems.
> 
> > It's not safe to split it. Possibly for some use cases you could add an
> > ioread64_maysplit()  
> 
> I'm open to doing something like that.

I think that makes sense for the platforms with that problem. I'm not
sure there are many that can't do it for mmio at least. 486SX can't do it
and I guess some ARM32 but I think almost everyone else can including
most 32bit x86.

What's more of a problem is a lot of platforms can do 64bit MMIO via
ioread/write64 but not 64bit port I/O, and it's not clear how you
represent that via an ioread/write API that abstracts it away.

Alan

[PATCH] powerpc/kernel: Avoid redundancies on giveup_all

2017-06-22 Thread Breno Leitao

Currently giveup_all() calls __giveup_fpu(), __giveup_altivec(), and
__giveup_vsx(). But __giveup_vsx() also calls __giveup_fpu() and
__giveup_altivec() again, in a redudant manner.

Other than giving up FP and Altivec, __giveup_vsx() also disables
MSR_VSX on MSR, but this is already done by __giveup_{fpu,altivec}().
As VSX can not be enabled alone on MSR (without FP and/or VEC
enabled), this is also a redundancy. VSX is never enabled alone (without
FP and VEC) because every time VSX is enabled, as through load_up_vsx()
and restore_math(), FP is also enabled together.

This change improves giveup_all() in average just 3%, but since
giveup_all() is called very frequently, around 8x per CPU per second on
an idle machine, this change might show some noticeable improvement.

Signed-off-by: Breno Leitao 
Signed-off-by: Gustavo Romero 
---
 arch/powerpc/kernel/process.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 2ad725ef4368..5d6af58270e6 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -494,10 +494,6 @@ void giveup_all(struct task_struct *tsk)
if (usermsr & MSR_VEC)
__giveup_altivec(tsk);
 #endif
-#ifdef CONFIG_VSX
-   if (usermsr & MSR_VSX)
-   __giveup_vsx(tsk);
-#endif
 #ifdef CONFIG_SPE
if (usermsr & MSR_SPE)
__giveup_spe(tsk);
-- 
2.11.0

Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Alan Cox

On Thu, 22 Jun 2017 10:48:13 -0600
Logan Gunthorpe  wrote:

> Currently, ioread64 and iowrite64 are only available io CONFIG_64BIT=y
> and CONFIG_GENERIC_IOMAP=n. Thus, seeing the functions are not
> universally available, it makes them unusable for driver developers.
> This leads to ugly hacks such as those at the top of
> 
> drivers/ntb/hw/intel/ntb_hw_intel.c
> 
> This patch adds fallback implementations for when CONFIG_64BIT and
> CONFIG_GENERIC_IOMAP are not set. These functions use two io32 based
> calls to complete the operation.
> 
> Note, we do not use the volatile keyword in these functions like the
> others in the same file. It is necessary to avoid a compiler warning
> on arm.

This is a really really bad idea as per the Alpha comment.

ioread64 and iowrite64 generate a single 64bit bus transaction. There is
hardware where mmio operations have side effects so simply using a pair
of 32bit operations blindly does not work (consider something as trivial
as reading a 64bit performance counter or incrementing pointer).

If a platform doesn't support 64bit I/O operations from the CPU then you
either need to use some kind of platform/architecture specific interface
if present or accept you don't have one.

It's not safe to split it. Possibly for some use cases you could add an

ioread64_maysplit()

but you cannot blindly break ioread64/write64() and expect it to
magically allow you to use drivers that depend upon it.

What btw is the actual ARM compiler warning ? Is the compiler also trying
to tell you it's a bad idea ?

Alan

Re: [PATCH 09/17] cxlflash: Create character device to provide host management interface

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:15 PM, Uma Krishnan  wrote:
> 
> The cxlflash driver currently lacks host management interface. Future
> devices supported by cxlflash will provide a variety of host-wide
> management functions. Examples include LUN provisioning, hardware debug
> support, and firmware download.
> 
> In order to provide a way to manage the device, a character device will
> be created during probe of each adapter. This device will support a set of
> ioctls defined in the SISLite specification from which administrators can
> manage the adapter.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 08/17] cxlflash: Add scsi command abort handler

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:15 PM, Uma Krishnan  wrote:
> 
> To date, CXL flash devices do not support a single command abort operation.
> Instead, the SISLite specification provides a context reset operation to
> cleanup all pending commands for a given context.
> 
> When a context reset is successful, it is guaranteed that the AFU has
> aborted all currently pending I/O. This sequence is less invasive than a
> device or host reset and can be executed to support scsi command abort
> requests. Add eh_abort_handler callback support to process command timeouts
> and abort requests.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 07/17] cxlflash: Flush pending commands in cleanup path

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:14 PM, Uma Krishnan  wrote:
> 
> When the AFU is reset in an error path, pending scsi commands can be
> silently dropped without completion or a formal abort. This puts the onus
> on the cxlflash driver to notify mid-layer and indicating that the command
> can be retried.
> 
> Once the card has been quiesced, the hardware send queue lock is acquired
> to prevent any data movement while the pending commands are processed.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 06/17] cxlflash: Track pending scsi commands in each hardware queue

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:14 PM, Uma Krishnan  wrote:
> 
> Currently, there is no book keeping of the pending scsi commands in the
> cxlflash driver. This lack of tracking in-flight requests is too
> restrictive and requires a heavy-hammer reset each time an adapter error is
> encountered. Additionally, it does not allow for commands to be properly
> retried.
> 
> In order to avoid this problem and to better handle error path command
> cleanup, introduce a linked list for each hardware queue that tracks
> pending commands.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 05/17] cxlflash: Handle AFU sync failures

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:14 PM, Uma Krishnan  wrote:
> 
> AFU sync operations are not currently evaluated for failure. This is
> acceptable for paths where there is not a dependency on the AFU being
> consistent with the host. Examples include link reset events and LUN
> cleanup operations. On paths where there is a dependency, such as a LUN
> open, a sync failure should be acted upon.
> 
> In the event of AFU sync failures, either log or cleanup as appropriate for
> operations that are dependent on a successful sync completion.
> 
> Update documentation to reflect behavior in the event of an AFU sync
> failure.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 04/17] cxlflash: Schedule asynchronous reset of the host

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:14 PM, Uma Krishnan  wrote:
> 
> A context reset failure indicates the AFU is in a bad state. At present,
> when such a situation occurs, no further action is taken. This leaves the
> adapter in an unusable state with no recoverable actions.
> 
> To avoid this situation, context reset failures will be escalated to a host
> reset operation. This will be done asynchronously to allow the acting
> thread to return to the user with a failure.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 03/17] cxlflash: Reset hardware queue context via specified register

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:14 PM, Uma Krishnan  wrote:
> 
> Per the SISLite specification, context_reset() writes 0x1 to the LSB of the
> reset register. When the AFU processes this reset request, it is expected
> to clear the bit after reset is complete. The current implementation simply
> checks that the entire value read back is not 1, instead of masking off the
> LSB and evaluating it for a change to 0. Should the AFU manipulate other
> bits during the reset (reading back a value of 0xF for example), successful
> completion will be prematurely indicated given the existing logic.
> 
> Additionally, in the event that the context reset operation fails, there
> does not currently exist a way to provide feedback to the initiator of the
> reset. This poses a problem for the rare case that a context reset fails as
> the caller will proceed on the assumption that all is well.
> 
> To remedy these issues, refactor the context reset routine to only mask off
> the LSB when evaluating for success and return status to the caller. Also
> update the context reset handler parameters to pass a hardware queue
> reference instead of a single command to better reflect that the entire
> queue associated with the context is impacted by the reset.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 02/17] cxlflash: Update cxlflash_afu_sync() to return errno

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:13 PM, Uma Krishnan  wrote:
> 
> The cxlflash_afu_sync() routine returns a negative one to indicate any kind
> of failure. This makes it impossible to establish why the error occurred.
> 
> Update the return codes to clearly indicate the failure cause to the
> caller.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH 01/17] cxlflash: Combine the send queue locks

2017-06-22 Thread Matthew R. Ochs

> On Jun 21, 2017, at 9:13 PM, Uma Krishnan  wrote:
> 
> Currently there are separate spin locks for the two supported I/O queueing
> models. This makes it difficult to serialize with paths outside the enqueue
> path.
> 
> As a design simplification and to support serialization with enqueue
> operations, move to only a single lock that is used for enqueueing
> regardless of the queueing model.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs

Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd

2017-06-22 Thread Thiago Jung Bauermann

Michael Ellerman  writes:

> Thiago Jung Bauermann  writes:
>
>> Michael Ellerman  writes:
>>> Thiago Jung Bauermann  writes:
>>>
 Calling arch_update_cpu_topology from a CPU hotplug state machine callback
 hits a deadlock because the function tries to get a read lock on
 cpu_hotplug_lock while the state machine still holds a write lock on it.

 Since all callers of arch_update_cpu_topology except rtasd already hold
 cpu_hotplug_lock, this patch changes the function to use
 stop_machine_cpuslocked and creates a separate function for rtasd which
 still tries to obtain the lock.

 Michael Bringmann investigated the bug and provided a detailed analysis
 of the deadlock on this previous RFC for an alternate solution:

 https://patchwork.ozlabs.org/patch/771293/
>>>
>>> Do we know when this broke? Or has it never worked?
>>
>> It's been broken since at least v4.4, I think. I don't know about
>> earlier versions.
>
> OK.
>
> Just to be clear, this is happening on a 4.12-rcX system with no other
> patches?
>
> The code in arch_update_cpu_topology() has used stop_machine() since 
> 30c05350c39d ("powerpc/pseries: Use stop machine to update cpu maps")
> which went into v3.10, about 4 years ago.
>
> Prior to that it used get/put_online_cpus(), since 9eff1a38407c
> ("powerpc/pseries: Poll VPA for topology changes and update NUMA maps"),
> which was 2.6.38 in 2010.
>
> I wouldn't rule out the possibility it's been broken for 7 years, but I
> wonder if something else has changed to cause it to break.
>
> We really need to work it out before we backport anything.

Michael Bringmann provided this information:

We need at least one patch to show the issue in the latest 4.12
codebase:

[PATCH V6 2/2] powerpc/numa: Update CPU topology when VPHN enabled
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-June/159341.html

Reason: Prior to this patch we were not exercising the PowerPC VPHN
hcall nor the associated path through the kernel/sched code that
encounters the problem. All CPUs, whether present at boot or
hot-added, are added to node 0 without this patch. This
representation of the topology is incorrect in many/most cases.

cpu_hotplug_begin() + get_online_cpus() have potentially been broken
since the implementation of multithreading for _cpu_up() and
_cpu_down().

Reason: 'cpu_hotplug.active_writer' check in get_online_cpus() is
dependent upon the nested routines that call get_online_cpus() to
execute in the same thread as the one that invokes
'cpu_hotplug_begin'.

PowerPC's version of arch_update_cpu_topology() has used
stop_machine() or get_online_cpus() for years, since at least 2011.

Practically speaking, until the recent patch, in the 4.12 codebase
PowerPC CPUs are being added only to nodes that were online at boot,
and the topology did not change enough to trigger the paths through
'stop_machine'.

I can't say for certain about the earlier code bases where
'arch_update_cpu_topology' used 'get_online_cpus'/'put_online_cpus'
directly.

>>> Should it go to stable? (can't in its current form AFAICS)
>>
>> It's not hard to backport both this patch and commit fe5595c07400
>> ("stop_machine: Provide stop_machine_cpuslocked()") from branch
>> smp/hotplug in tip.git for stable.
>
> Yeah but it's not really my business backporting that unfortunately.

Sorry, I wasn't clear. I was offering to provide backported patches for
the relevant stable branches.

Though that will only be necessary if we also backport the topology
fixes as well.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

Re: [PATCH 1/2] trace/kprobes: Sanitize derived event names

2017-06-22 Thread Naveen N. Rao

On 2017/06/22 06:29PM, Masami Hiramatsu wrote:
> On Thu, 22 Jun 2017 00:20:27 +0530
> "Naveen N. Rao"  wrote:
> 
> > When we derive event names, convert some expected symbols (such as ':'
> > used to specify module:name and '.' present in some symbols) into
> > underscores so that the event name is not rejected.
> 
> Oops, ok, this is my mistake.
> 
> Acked-by: Masami Hiramatsu 
> 
> This must be marked as bugfix for stable trees.
> 
> Could you also add a testcase for this (module name) bug?
> 
> MODNAME=`lsmod | head -n 2 | tail -n 1 | cut -f 1 -d " "`
> FUNCNAME=`grep -m 1 "\\[$MODNAME\\]" /proc/kallsyms | xargs | cut -f 3 -d " "`
> 
> May gives you a target name :)

Sure. Here is a test.

Thanks for the review,
Naveen

-
[PATCH] selftests/ftrace: Add a test to probe module functions

Add a kprobes test to ensure that we are able to add a probe on a
module function using 'p :' format, without having to
specify a probe name.

Suggested-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 .../testing/selftests/ftrace/test.d/kprobe/probe_module.tc | 14 ++
 1 file changed, 14 insertions(+)
 create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc

diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc 
b/tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc
new file mode 100644
index ..ea7657041ba6
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/probe_module.tc
@@ -0,0 +1,14 @@
+#!/bin/sh
+# description: Kprobe dynamic event - probing module
+
+[ -f kprobe_events ] || exit_unsupported # this is configurable
+
+echo 0 > events/enable
+echo > kprobe_events
+export MOD=`lsmod | head -n 2 | tail -n 1 | cut -f1 -d" "`
+export FUNC=`grep -m 1 ".* t .*\\[$MOD\\]" /proc/kallsyms | xargs | cut -f3 
-d" "`
+[ "x" != "x$MOD" -a "y" != "y$FUNC" ] || exit_untested
+echo p $MOD:$FUNC > kprobe_events
+grep $MOD kprobe_events
+echo > kprobe_events
+clear_trace
-- 
2.13.1

Re: [next-20170609] WARNING: CPU: 3 PID: 71167 at lib/idr.c:157 idr_replace

2017-06-22 Thread Tejun Heo

On Thu, Jun 22, 2017 at 12:31:35AM +0100, Chris Wilson wrote:
> Quoting Tejun Heo (2017-06-13 14:58:49)
> > Cc'ing David Airlie.
> > 
> > This is from drm driver calling in idr_replace() w/ a negative id.
> > Probably a silly bug in error handling path?
> 
> No, this is the validation of an invalid userspace handle. The drm ABI
> for handles is supposed to be a full u32 range with 0 reserved for an
> invalid handle (constrained by the idr_alloc ofc). The WARN was
> introduced by 0a835c4f090a.

Matthew, did idr lost some of its range while being reimplemented
using radix tree?

Thanks.

-- 
tejun

Re: [RFC v3 01/23] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-06-22 Thread Ram Pai

On Thu, Jun 22, 2017 at 07:21:03PM +1000, Balbir Singh wrote:
> On Wed, 2017-06-21 at 18:39 -0700, Ram Pai wrote:
> > Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6,
> > in the 4K backed HPTE pages. These bits continue to be used
> > for 64K backed HPTE pages in this patch,  but will be freed
> > up in the next patch. The  bit  numbers  are big-endian  as
> > defined in the ISA3.0
> > 
> > The patch does the following change to the 64K PTE format
> >
> 
> Why can't we stuff the bits in the VMA and retrieve it from there?
> Basically always get a minor fault in hash and for keys handle
> the fault in do_page_fault() and handle the keys from the VMA?

I think you raise a valid point. We dont necessarily have to program
the pte. the hpte can be programmed directly from the key in the vma.
Just that the code becomes a little ugly to do so, since the
_hash_page_*() functions do not have access to the vma.

However we are also trying to maintain consistency between hpte and rpte
implementation. The keys have to be programmed into the rpte.
The patch is working towards enabling the consistency, so that
the same code can work on both, hpte for now and rpte in the future.

Maybe I can just do what you propose.  However this patch by itself
has value, because it frees up four valuable pte bits, irrespective
of whether we use it for memory keys. Let me see what others have
to say.  

Aneesh: thoughts?

> 
> > H_PAGE_BUSY moves from bit 3 to bit 9
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > of the pte.
> > H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> > second part of the pte.
> > 
> > the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > is  initialized  to  0xF  indicating  an invalid  slot.  If  a HPTE
> > gets cached in a 0xF  slot(i.e  7th  slot  of  secondary),  it   is
> > released immediately. In  other  words, even  though   0xF   is   a
> > valid slot we discard  and consider it as an invalid
> > slot;i.e HPTE(). This  gives  us  an opportunity to not
> > depend on a bit in the primary PTE in order to determine the
> > validity of a slot.
> 
> This is not clear, could you please rephrase? What is the bit in the
> primary key we rely on?

(H_PAGE_F_SECOND|H_PAGE_F_GIX) bits, which is big-endian bits 3 4 5 and
6. They are currently used to track the validitiy of the 4k-hptes backing the
64k-pte.   Each bit tracks four 4k-hptes, for a total of sixteen
4k-hptes.


> 
> > 
> > When  we  release  aHPTE   in the 0xF   slot we also   release a
> > legitimate primary   slot  andunmapthat  entry. This  is  to
> > ensure  that we do get a   legimate   non-0xF  slot the next time we
> > retry for a slot.
> > 
> > Though treating 0xF slot as invalid reduces the number of available
> > slots  and  may  have an effect  on the performance, the probabilty
> > of hitting a 0xF is extermely low.
> > 
> > Compared  to the current scheme, the above described scheme reduces
> > the number of false hash table updates  significantly  and  has the
> > added  advantage  of  releasing  four  valuable  PTE bits for other
> > purpose.
> > 
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellermen and myself.
> >
> 
> It would be helpful if you had a text diagram explaining the PTE bits
> before and after.

ok. will add it in the next version.

> 
> > 4K PTE format remain unchanged currently.
> >
> 
> The code seems to be doing a lot more than the changelog suggests. A few
> functions are completely removed, common code between 64K and 4K has been
> split under #ifndef. It would be good to call all of these out.

ok. will do.

> 
> > Signed-off-by: Ram Pai 
> > 
> > Conflicts:
> > arch/powerpc/include/asm/book3s/64/hash.h
> > ---
> >  arch/powerpc/include/asm/book3s/64/hash-4k.h  |  7 +++
> >  arch/powerpc/include/asm/book3s/64/hash-64k.h | 17 ---
> >  arch/powerpc/include/asm/book3s/64/hash.h | 12 +++--
> >  arch/powerpc/mm/hash64_64k.c  | 70 
> > +++
> >  arch/powerpc/mm/hash_utils_64.c   |  4 +-
> >  5 files changed, 66 insertions(+), 44 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
> > b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > index b4b5e6b..9c2c8f1 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > @@ -16,6 +16,13 @@
> >  #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> >  #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >  
> > +#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG 
> > */
> > +#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT 56
> > +
> > +#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are 
> > busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated

[PATCH 1/4] arm: Reduce ELF_ET_DYN_BASE

2017-06-22 Thread Kees Cook

Now that explicitly executed loaders are loaded in the mmap region,
position PIE binaries lower in the address space to avoid possible
collisions with mmap or stack regions.

Signed-off-by: Kees Cook 
---
 arch/arm/include/asm/elf.h | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/elf.h b/arch/arm/include/asm/elf.h
index d2315ffd8f12..f13ae153fb24 100644
--- a/arch/arm/include/asm/elf.h
+++ b/arch/arm/include/asm/elf.h
@@ -112,12 +112,8 @@ int dump_task_regs(struct task_struct *t, elf_gregset_t 
*elfregs);
 #define CORE_DUMP_USE_REGSET
 #define ELF_EXEC_PAGESIZE  4096
 
-/* This is the location that an ET_DYN program is loaded if exec'ed.  Typical
-   use of this is to invoke "./ld.so someprog" to test out a new version of
-   the loader.  We need to make sure that it is out of the way of the program
-   that it will "exec", and that there is sufficient room for the brk.  */
-
-#define ELF_ET_DYN_BASE(TASK_SIZE / 3 * 2)
+/* This is the base location for PIE (ET_DYN with INTERP) loads. */
+#define ELF_ET_DYN_BASE0x40UL
 
 /* When the program starts, a1 contains a pointer to a function to be 
registered with atexit, as per the SVR4 ABI.  A value of 0 means we 
-- 
2.7.4

[PATCH 2/4] arm64: Reduce ELF_ET_DYN_BASE

2017-06-22 Thread Kees Cook

Now that explicitly executed loaders are loaded in the mmap region,
position PIE binaries lower in the address space to avoid possible
collisions with mmap or stack regions. For 64-bit, align to 4GB to
allow runtimes to use the entire 32-bit address space for 32-bit
pointers.

Signed-off-by: Kees Cook 
---
 arch/arm64/include/asm/elf.h | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 5d1700425efe..f742af8f7c42 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -113,12 +113,13 @@
 #define ELF_EXEC_PAGESIZE  PAGE_SIZE
 
 /*
- * This is the location that an ET_DYN program is loaded if exec'ed.  Typical
- * use of this is to invoke "./ld.so someprog" to test out a new version of
- * the loader.  We need to make sure that it is out of the way of the program
- * that it will "exec", and that there is sufficient room for the brk.
+ * This is the base location for PIE (ET_DYN with INTERP) loads. On
+ * 64-bit, this is raised to 4GB to leave the entire 32-bit address
+ * space open for things that want to use the area for 32-bit pointers.
  */
-#define ELF_ET_DYN_BASE(2 * TASK_SIZE_64 / 3)
+#define ELF_ET_DYN_BASE(test_thread_flag(TIF_32BIT) ?  \
+   0x00040UL : \
+   0x1UL)
 
 #ifndef __ASSEMBLY__
 
@@ -173,8 +174,6 @@ extern int arch_setup_additional_pages(struct linux_binprm 
*bprm,
 
 #ifdef CONFIG_COMPAT
 
-#define COMPAT_ELF_ET_DYN_BASE (2 * TASK_SIZE_32 / 3)
-
 /* AArch32 registers. */
 #define COMPAT_ELF_NGREG   18
 typedef unsigned int   compat_elf_greg_t;
-- 
2.7.4

[PATCH 4/4] s390: Reduce ELF_ET_DYN_BASE

2017-06-22 Thread Kees Cook

Now that explicitly executed loaders are loaded in the mmap region,
position PIE binaries lower in the address space to avoid possible
collisions with mmap or stack regions. For 64-bit, align to 4GB to
allow runtimes to use the entire 32-bit address space for 32-bit
pointers.

Signed-off-by: Kees Cook 
---
 arch/s390/include/asm/elf.h | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index e8f623041769..7c58d599f91b 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -161,14 +161,13 @@ extern unsigned int vdso_enabled;
 #define CORE_DUMP_USE_REGSET
 #define ELF_EXEC_PAGESIZE  4096
 
-/* This is the location that an ET_DYN program is loaded if exec'ed.  Typical
-   use of this is to invoke "./ld.so someprog" to test out a new version of
-   the loader.  We need to make sure that it is out of the way of the program
-   that it will "exec", and that there is sufficient room for the brk. 64-bit
-   tasks are aligned to 4GB. */
-#define ELF_ET_DYN_BASE (is_compat_task() ? \
-   (STACK_TOP / 3 * 2) : \
-   (STACK_TOP / 3 * 2) & ~((1UL << 32) - 1))
+/*
+ * This is the base location for PIE (ET_DYN with INTERP) loads. On
+ * 64-bit, this is raised to 4GB to leave the entire 32-bit address
+ * space open for things that want to use the area for 32-bit pointers.
+ */
+#define ELF_ET_DYN_BASE(is_compat_task() ? 0x00040UL : \
+   0x1UL)
 
 /* This yields a mask that user programs can use to figure out what
instruction set this CPU supports. */
-- 
2.7.4

[PATCH 3/4] powerpc: Reduce ELF_ET_DYN_BASE

2017-06-22 Thread Kees Cook

Now that explicitly executed loaders are loaded in the mmap region,
position PIE binaries lower in the address space to avoid possible
collisions with mmap or stack regions. For 64-bit, align to 4GB to
allow runtimes to use the entire 32-bit address space for 32-bit
pointers.

Signed-off-by: Kees Cook 
---
 arch/powerpc/include/asm/elf.h | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 09bde6e34f5d..548d9a411a0d 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -23,12 +23,13 @@
 #define CORE_DUMP_USE_REGSET
 #define ELF_EXEC_PAGESIZE  PAGE_SIZE
 
-/* This is the location that an ET_DYN program is loaded if exec'ed.  Typical
-   use of this is to invoke "./ld.so someprog" to test out a new version of
-   the loader.  We need to make sure that it is out of the way of the program
-   that it will "exec", and that there is sufficient room for the brk.  */
-
-#define ELF_ET_DYN_BASE0x2000
+/*
+ * This is the base location for PIE (ET_DYN with INTERP) loads. On
+ * 64-bit, this is raised to 4GB to leave the entire 32-bit address
+ * space open for things that want to use the area for 32-bit pointers.
+ */
+#define ELF_ET_DYN_BASE(is_32bit_task() ? 0x00040UL : \
+  0x1UL)
 
 #define ELF_CORE_EFLAGS (is_elf2_task() ? 2 : 0)
 
-- 
2.7.4

[PATCH 0/4] Reduce ELF_ET_DYN_BASE

2017-06-22 Thread Kees Cook

This is a follow-up to "binfmt_elf: Use ELF_ET_DYN_BASE only for PIE"[1],
which allow ELF_ET_DYN_BASE to be reduced from high in the address space.
That patch only changed x86, and this series changes arm, arm64, powerpc,
and s390.

Since these depend on the mentioned patch (which I'm hoping akpm will
pick up), I'm hoping to get arch-maintainer Acks for these (so they can
be carried all together).

Thanks!

-Kees

[1] https://patchwork.kernel.org/patch/9802387/

[PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9

2017-06-22 Thread Michael Neuling

On POWER9 the ERAT may be incorrect on wakeup from some stop states
that lose state. This causes random segvs and illegal instructions
when these stop states are enabled.

This patch invalidates the ERAT on wakeup on POWER9 to prevent this
from causing a problem.

Signed-off-by: Michael Neuling 
---
 arch/powerpc/kernel/idle_book3s.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 1ea14b96f1..ace2ad50c8 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -793,6 +793,9 @@ fastsleep_workaround_at_exit:
  */
 .global pnv_wakeup_loss
 pnv_wakeup_loss:
+BEGIN_FTR_SECTION
+   PPC_INVALIDATE_ERAT
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
ld  r1,PACAR1(r13)
 BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
-- 
2.11.0

RE: [PATCH 5/7] ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks

2017-06-22 Thread Jiang, Dave



> -Original Message-
> From: linux-...@googlegroups.com [mailto:linux-...@googlegroups.com] On 
> Behalf Of Logan Gunthorpe
> Sent: Thursday, June 22, 2017 9:48 AM
> To: linux-ker...@vger.kernel.org; linux-a...@vger.kernel.org; 
> linux-...@googlegroups.com; linux-al...@vger.kernel.org; linuxppc-
> d...@lists.ozlabs.org; linux-cry...@vger.kernel.org; 
> dri-de...@lists.freedesktop.org
> Cc: Arnd Bergmann ; Greg Kroah-Hartman 
> ; Stephen Bates ;
> Logan Gunthorpe ; Jon Mason ; Jiang, 
> Dave ; Allen Hubbe
> 
> Subject: [PATCH 5/7] ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks
> 
> Now that ioread64 and iowrite64 are available generically we can
> remove the hack at the top of ntb_hw_intel.c that patches them in
> when they are not available.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Jon Mason 
> Cc: Dave Jiang 
> Cc: Allen Hubbe 

Thanks for doing this Logan. 

Acked-by: Dave Jiang 

> ---
>  drivers/ntb/hw/intel/ntb_hw_intel.c | 30 --
>  1 file changed, 30 deletions(-)
> 
> diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c 
> b/drivers/ntb/hw/intel/ntb_hw_intel.c
> index c00238491673..56221d540c2b 100644
> --- a/drivers/ntb/hw/intel/ntb_hw_intel.c
> +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
> @@ -153,35 +153,6 @@ MODULE_PARM_DESC(xeon_b2b_dsd_bar5_addr32,
>  static inline enum ntb_topo xeon_ppd_topo(struct intel_ntb_dev *ndev, u8 
> ppd);
>  static int xeon_init_isr(struct intel_ntb_dev *ndev);
> 
> -#ifndef ioread64
> -#ifdef readq
> -#define ioread64 readq
> -#else
> -#define ioread64 _ioread64
> -static inline u64 _ioread64(void __iomem *mmio)
> -{
> - u64 low, high;
> -
> - low = ioread32(mmio);
> - high = ioread32(mmio + sizeof(u32));
> - return low | (high << 32);
> -}
> -#endif
> -#endif
> -
> -#ifndef iowrite64
> -#ifdef writeq
> -#define iowrite64 writeq
> -#else
> -#define iowrite64 _iowrite64
> -static inline void _iowrite64(u64 val, void __iomem *mmio)
> -{
> - iowrite32(val, mmio);
> - iowrite32(val >> 32, mmio + sizeof(u32));
> -}
> -#endif
> -#endif
> -
>  static inline int pdev_is_atom(struct pci_dev *pdev)
>  {
>   switch (pdev->device) {
> @@ -3008,4 +2979,3 @@ static void __exit intel_ntb_pci_driver_exit(void)
>   debugfs_remove_recursive(debugfs_dir);
>  }
>  module_exit(intel_ntb_pci_driver_exit);
> -
> --
> 2.11.0
> 
> --
> You received this message because you are subscribed to the Google Groups 
> "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to linux-ntb+unsubscr...@googlegroups.com.
> To post to this group, send email to linux-...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/linux-ntb/20170622164817.25515-6-
> logang%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

Re: [PATCH 2/2] selftests/ftrace: Update multiple kprobes test for powerpc

2017-06-22 Thread Naveen N. Rao

On 2017/06/22 06:07PM, Masami Hiramatsu wrote:
> On Thu, 22 Jun 2017 00:20:28 +0530
> "Naveen N. Rao"  wrote:
> 
> > KPROBES_ON_FTRACE is only available on powerpc64le. Update comment to
> > clarify this.
> > 
> > Also, we should use an offset of 8 to ensure that the probe does not
> > fall on ftrace location. The current offset of 4 will fall before the
> > function local entry point and won't fire, while an offset of 12 or 16
> > will fall on ftrace location. Offset 8 is currently guaranteed to not be
> > the ftrace location.
> 
> OK, these part seems good to me.
> 
> > 
> > Finally, do not filter out symbols with a dot. Powerpc Elfv1 uses dot
> > prefix for all functions and this prevents us from testing some of those
> > symbols. Furthermore, with the patch to derive event names properly in
> > the presence of ':' and '.', such names are accepted by kprobe_events
> > and constitutes a good test for those symbols.
> 
> Hmm, the reason why I added such filter was to avoid symbols including
> gcc-generated suffixes like as .constprop or .isra etc.

I see.

I do wonder -- is there a problem if we try probing those symbols? On my 
local x86 vm, I don't see an issue probing it especially with the 
previous patch to enable probing with symbols having a '.' or ':'.

Furthermore, since this is for testing kprobe_events, I feel it is good 
to try probing those symbols too to catch any weird errors we may hit.

Thanks for the review!
- Naveen


> So if the Powerpc Elfv1 use dot prefix, that is OK, in that case,
> could you update the filter as "^.*\\..*" ?
> 
> Thank you,
> 
> > 
> > Signed-off-by: Naveen N. Rao 
> > ---
> >  tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc | 8 
> > 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git 
> > a/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc 
> > b/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc
> > index f4d1ff785d67..d209c071b2c0 100644
> > --- a/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc
> > +++ b/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc
> > @@ -2,16 +2,16 @@
> >  # description: Register/unregister many kprobe events
> >  
> >  # ftrace fentry skip size depends on the machine architecture.
> > -# Currently HAVE_KPROBES_ON_FTRACE defined on x86 and powerpc
> > +# Currently HAVE_KPROBES_ON_FTRACE defined on x86 and powerpc64le
> >  case `uname -m` in
> >x86_64|i[3456]86) OFFS=5;;
> > -  ppc*) OFFS=4;;
> > +  ppc64le) OFFS=8;;
> >*) OFFS=0;;
> >  esac
> >  
> >  echo "Setup up to 256 kprobes"
> > -grep t /proc/kallsyms | cut -f3 -d" " | grep -v .*\\..* | \
> > -head -n 256 | while read i; do echo p ${i}+${OFFS} ; done > kprobe_events 
> > ||:
> > +grep t /proc/kallsyms | cut -f3 -d" " | head -n 256 | \
> > +while read i; do echo p ${i}+${OFFS} ; done > kprobe_events ||:
> >  
> >  echo 1 > events/kprobes/enable
> >  echo 0 > events/kprobes/enable
> > -- 
> > 2.13.1
> > 
> 
> 
> -- 
> Masami Hiramatsu 
>

Re: [PATCH v3 6/6] powerpc/64s: Blacklist rtas entry/exit from kprobes

2017-06-22 Thread Naveen N. Rao

On 2017/06/22 01:48PM, Nicholas Piggin wrote:
> On Thu, 22 Jun 2017 00:08:42 +0530
> "Naveen N. Rao"  wrote:
> 
> > We can't take traps with relocation off, so blacklist enter_rtas() and
> > rtas_return_loc(). However, instead of blacklisting all of enter_rtas(),
> > introduce a new symbol __enter_rtas from where on we can't take a trap
> > and blacklist that.
> > 
> > Signed-off-by: Naveen N. Rao 
> > ---
> >  arch/powerpc/kernel/entry_64.S | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index d376f07153d7..49c35450f399 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -1076,6 +1076,8 @@ _GLOBAL(enter_rtas)
> >  rldicr  r9,r9,MSR_SF_LG,(63-MSR_SF_LG)
> > ori r9,r9,MSR_IR|MSR_DR|MSR_FE0|MSR_FE1|MSR_FP|MSR_RI|MSR_LE
> > andcr6,r0,r9
> > +
> > +__enter_rtas:
> > sync/* disable interrupts so SRR0/1 */
> > mtmsrd  r0  /* don't get trashed */
> 
> Along the lines of the system call patch... For consistency, could we
> put the __enter_rtas right after mtmsrd? And I wonder if we shoul

Sure.

> come up with a common prefix or postfix naming convention for these
> such labels used to control probing?

We could, but I am not sure it will help much. On the other hand, such 
symbols may make backtraces pretty distracting.

I'm just using '__' as a prefix to make it less distracting, though it 
isn't all that great either. I'm clearly hopeless with names o_O

The other option is to just blacklist entire functions, but we will then 
lose the ability to probe in many places where we may have wanted to.

> 
> How do opal calls avoid tracing?

It doesn't yet. I'm still going through the initial few symbols and 
identifying what needs blacklisting. Opal is further down.

Thanks,
Naveen

Re: [PATCH V4] cxl: Export library to support IBM XSL

2017-06-22 Thread Frederic Barrat




Le 22/06/2017 à 15:07, Christophe Lombard a écrit :

This patch exports a in-kernel 'library' API which can be called by
other drivers to help interacting with an IBM XSL on a POWER9 system.

The XSL (Translation Service Layer) is a stripped down version of the
PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
Like the PSL, it implements the CAIA architecture, but has a number
of differences, mostly in it's implementation dependent registers.

The XSL also uses a special DMA cxl mode, which uses a slightly
different init sequence for the CAPP and PHB.

Signed-off-by: Andrew Donnellan 
Signed-off-by: Christophe Lombard 

---


Looks ok to me, thanks!
[mpe: just to make sure you notice: there's a prereq patch listed under 
the changelog]


Acked-by: Frederic Barrat 



Changelog[v4]
  - Rebase to latest upstream.
  - Add new Signed-off
  - Update cxl_handle_mm_fault() and cxl_handle_page_fault()

Changelog[v3]
  - Rebase to latest upstream.
  - cxl_handle_mm_fault() is now exported. Remove kernel context
parameter
  - Update comments

Changelog[v2]
  - Rebase to latest upstream.
  - Return -EFAULT in case of NULL pointer in cxllib_handle_fault().
  - Reverse parameters when copro_handle_mm_fault() is called.

  This applies on top of this patch:
  http://patchwork.ozlabs.org/patch/775322/
---
  arch/powerpc/include/asm/opal-api.h |   1 +
  drivers/misc/cxl/Kconfig|   5 +
  drivers/misc/cxl/Makefile   |   2 +-
  drivers/misc/cxl/cxl.h  |   6 +
  drivers/misc/cxl/cxllib.c   | 246 
  drivers/misc/cxl/fault.c|  29 +++--
  drivers/misc/cxl/native.c   |  16 ++-
  drivers/misc/cxl/pci.c  |  41 --
  include/misc/cxllib.h   | 133 +++
  9 files changed, 449 insertions(+), 30 deletions(-)
  create mode 100644 drivers/misc/cxl/cxllib.c
  create mode 100644 include/misc/cxllib.h

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index cb3e624..3e0be78 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -877,6 +877,7 @@ enum {
OPAL_PHB_CAPI_MODE_SNOOP_OFF= 2,
OPAL_PHB_CAPI_MODE_SNOOP_ON = 3,
OPAL_PHB_CAPI_MODE_DMA  = 4,
+   OPAL_PHB_CAPI_MODE_DMA_TVT1 = 5,
  };

  /* OPAL I2C request */
diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
index b75cf83..93397cb 100644
--- a/drivers/misc/cxl/Kconfig
+++ b/drivers/misc/cxl/Kconfig
@@ -11,11 +11,16 @@ config CXL_AFU_DRIVER_OPS
bool
default n

+config CXL_LIB
+   bool
+   default n
+
  config CXL
tristate "Support for IBM Coherent Accelerators (CXL)"
depends on PPC_POWERNV && PCI_MSI && EEH
select CXL_BASE
select CXL_AFU_DRIVER_OPS
+   select CXL_LIB
default m
help
  Select this option to enable driver support for IBM Coherent
diff --git a/drivers/misc/cxl/Makefile b/drivers/misc/cxl/Makefile
index c14fd6b..0b5fd74 100644
--- a/drivers/misc/cxl/Makefile
+++ b/drivers/misc/cxl/Makefile
@@ -3,7 +3,7 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror

  cxl-y += main.o file.o irq.o fault.o native.o
  cxl-y += context.o sysfs.o pci.o trace.o
-cxl-y  += vphb.o phb.o api.o
+cxl-y  += vphb.o phb.o api.o cxllib.o
  cxl-$(CONFIG_PPC_PSERIES) += flash.o guest.o of.o hcalls.o
  cxl-$(CONFIG_DEBUG_FS)+= debugfs.o
  obj-$(CONFIG_CXL) += cxl.o
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index a03f8e7..b1afecc 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -1010,6 +1010,7 @@ static inline void cxl_debugfs_add_afu_regs_psl8(struct 
cxl_afu *afu, struct den

  void cxl_handle_fault(struct work_struct *work);
  void cxl_prefault(struct cxl_context *ctx, u64 wed);
+int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar);

  struct cxl *get_cxl_adapter(int num);
  int cxl_alloc_sst(struct cxl_context *ctx);
@@ -1061,6 +1062,11 @@ int cxl_afu_slbia(struct cxl_afu *afu);
  int cxl_data_cache_flush(struct cxl *adapter);
  int cxl_afu_disable(struct cxl_afu *afu);
  int cxl_psl_purge(struct cxl_afu *afu);
+int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
+ u32 *phb_index, u64 *capp_unit_id);
+int cxl_slot_is_switched(struct pci_dev *dev);
+int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg);
+u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9);

  void cxl_native_irq_dump_regs_psl9(struct cxl_context *ctx);
  void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx);
diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
new file mode 100644
index 000..5dba23c
--- /dev/null
+++

Re: [PATCH v2] of: update ePAPR references to point to Devicetree Specification

2017-06-22 Thread Rob Herring

On Thu, Jun 22, 2017 at 09:15:39AM -0700, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> The Devicetree Specification has superseded the ePAPR as the
> base specification for bindings.  Update files in Documentation
> to reference the new document.
> 
> First reference to ePAPR in Documentation/devicetree/bindings/arm/cci.txt
> is generic, remove it.
> 
> Some files are not updated because there is no hypervisor chapter
> in the Devicetree Specification:
>Documentation/devicetree/bindings/powerpc/fsl/msi-pic.txt
>Documenation/virtual/kvm/api.txt
>Documenation/virtual/kvm/ppc-pv.txt
> 
> Signed-off-by: Frank Rowand 
> ---
> 
> changes from v1:
>- remove boilerplat ePAPR reference from cci.txt
> 
>  Documentation/devicetree/bindings/arm/cci.txt | 15 
> ---
>  Documentation/devicetree/bindings/arm/cpus.txt| 13 +++--
>  Documentation/devicetree/bindings/arm/idle-states.txt |  4 ++--
>  Documentation/devicetree/bindings/arm/l2c2x0.txt  |  4 ++--
>  Documentation/devicetree/bindings/arm/topology.txt|  4 ++--
>  Documentation/devicetree/bindings/bus/simple-pm-bus.txt   |  2 +-
>  Documentation/devicetree/bindings/chosen.txt  |  3 ++-
>  Documentation/devicetree/bindings/common-properties.txt   |  2 +-
>  Documentation/devicetree/bindings/crypto/fsl-sec4.txt |  4 ++--
>  Documentation/devicetree/bindings/crypto/fsl-sec6.txt |  4 ++--
>  .../devicetree/bindings/interrupt-controller/open-pic.txt |  5 ++---
>  Documentation/devicetree/bindings/net/ethernet.txt|  9 ++---
>  Documentation/devicetree/bindings/powerpc/fsl/cpus.txt|  6 +++---
>  Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt |  2 +-
>  .../devicetree/bindings/powerpc/fsl/srio-rmu.txt  |  4 ++--
>  Documentation/devicetree/bindings/powerpc/fsl/srio.txt|  3 ++-
>  Documentation/devicetree/booting-without-of.txt   |  2 +-
>  Documentation/devicetree/usage-model.txt  |  2 +-
>  Documentation/xtensa/mmu.txt  |  6 +++---
>  19 files changed, 46 insertions(+), 48 deletions(-)

Applied.

Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

2017-06-22 Thread Ram Pai

On Thu, Jun 22, 2017 at 02:37:27PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6
> > in the 4K backed hpte pages. These bits continue to be used
> > for 64K backed hpte pages in this patch, but will be freed
> > up in the next patch.
> > 
> > The patch does the following change to the 64K PTE format
> > 
> > H_PAGE_BUSY moves from bit 3 to bit 9
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > of the pte.
> > H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> > second part of the pte.
> > 
> > the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > is  initialized  to  0xF  indicating  an invalid  slot.  If  a hpte
> > gets cached in a 0xF  slot(i.e  7th  slot  of  secondary),  it   is
> > released immediately. In  other  words, even  though   0xF   is   a
> > valid slot we discard  and consider it as an invalid
> > slot;i.e hpte_soft_invalid(). This  gives  us  an opportunity to not
> > depend on a bit in the primary PTE in order to determine the
> > validity of a slot.
> > 
> > When  we  release  ahpte   in the 0xF   slot we also   release a
> > legitimate primary   slot  andunmapthat  entry. This  is  to
> > ensure  that we do get a   legimate   non-0xF  slot the next time we
> > retry for a slot.
> > 
> > Though treating 0xF slot as invalid reduces the number of available
> > slots  and  may  have an effect  on the performance, the probabilty
> > of hitting a 0xF is extermely low.
> > 
> > Compared  to the current scheme, the above described scheme reduces
> > the number of false hash table updates  significantly  and  has the
> > added  advantage  of  releasing  four  valuable  PTE bits for other
> > purpose.
> > 
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellermen and myself.
> > 
> > 4K PTE format remain unchanged currently.
> 
> Scanned through the PTE format again for hash 64K and 4K. It seems
> to me that there might be 5 free bits already present on the PTE
> format. I might have seriously mistaken something here :) Please
> correct me if that is not the case. _RPAGE_RPN* I think is applicable
> only for hash page table format and will not be available for radix
> later.
> 
> +#define _PAGE_FREE_1   0x0040UL /* Not used */
> +#define _RPAGE_SW0 0x2000UL /* Not used */
> +#define _RPAGE_SW1 0x0800UL /* Not used */
> +#define _RPAGE_RPN42   0x0040UL /* Not used */
> +#define _RPAGE_RPN41   0x0020UL /* Not used */
> 

The bits are chosen to future proof for radix implementation.
_RPAGE_SW* will eat into what is available for software in the future,
and these key-bits will certainly be something that the radix
hardware will read, in the future.

The _RPAGE_RPN* bits cannot be relied on for radix.

But finally the bits that we chose (H_PAGE_F_SECOND|H_PAGE_F_GIX) had
the best potential for giving us the highest number of free bits with
relatively less effort.

RP

[PATCH v2] of: update ePAPR references to point to Devicetree Specification

2017-06-22 Thread frowand . list

From: Frank Rowand 

The Devicetree Specification has superseded the ePAPR as the
base specification for bindings.  Update files in Documentation
to reference the new document.

First reference to ePAPR in Documentation/devicetree/bindings/arm/cci.txt
is generic, remove it.

Some files are not updated because there is no hypervisor chapter
in the Devicetree Specification:
   Documentation/devicetree/bindings/powerpc/fsl/msi-pic.txt
   Documenation/virtual/kvm/api.txt
   Documenation/virtual/kvm/ppc-pv.txt

Signed-off-by: Frank Rowand 
---

changes from v1:
   - remove boilerplat ePAPR reference from cci.txt

 Documentation/devicetree/bindings/arm/cci.txt | 15 ---
 Documentation/devicetree/bindings/arm/cpus.txt| 13 +++--
 Documentation/devicetree/bindings/arm/idle-states.txt |  4 ++--
 Documentation/devicetree/bindings/arm/l2c2x0.txt  |  4 ++--
 Documentation/devicetree/bindings/arm/topology.txt|  4 ++--
 Documentation/devicetree/bindings/bus/simple-pm-bus.txt   |  2 +-
 Documentation/devicetree/bindings/chosen.txt  |  3 ++-
 Documentation/devicetree/bindings/common-properties.txt   |  2 +-
 Documentation/devicetree/bindings/crypto/fsl-sec4.txt |  4 ++--
 Documentation/devicetree/bindings/crypto/fsl-sec6.txt |  4 ++--
 .../devicetree/bindings/interrupt-controller/open-pic.txt |  5 ++---
 Documentation/devicetree/bindings/net/ethernet.txt|  9 ++---
 Documentation/devicetree/bindings/powerpc/fsl/cpus.txt|  6 +++---
 Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt |  2 +-
 .../devicetree/bindings/powerpc/fsl/srio-rmu.txt  |  4 ++--
 Documentation/devicetree/bindings/powerpc/fsl/srio.txt|  3 ++-
 Documentation/devicetree/booting-without-of.txt   |  2 +-
 Documentation/devicetree/usage-model.txt  |  2 +-
 Documentation/xtensa/mmu.txt  |  6 +++---
 19 files changed, 46 insertions(+), 48 deletions(-)

diff --git a/Documentation/devicetree/bindings/arm/cci.txt 
b/Documentation/devicetree/bindings/arm/cci.txt
index 0f2153e8fa7e..9600761f2d5b 100644
--- a/Documentation/devicetree/bindings/arm/cci.txt
+++ b/Documentation/devicetree/bindings/arm/cci.txt
@@ -11,13 +11,6 @@ clusters, through memory mapped interface, with a global 
control register
 space and multiple sets of interface control registers, one per slave
 interface.
 
-Bindings for the CCI node follow the ePAPR standard, available from:
-
-www.power.org/documentation/epapr-version-1-1/
-
-with the addition of the bindings described in this document which are
-specific to ARM.
-
 * CCI interconnect node
 
Description: Describes a CCI cache coherent Interconnect component
@@ -50,10 +43,10 @@ specific to ARM.
as a tuple of cells, containing child address,
parent address and the size of the region in the
child address space.
-   Definition: A standard property. Follow rules in the ePAPR for
-   hierarchical bus addressing. CCI interfaces
-   addresses refer to the parent node addressing
-   scheme to declare their register bases.
+   Definition: A standard property. Follow rules in the Devicetree
+   Specification for hierarchical bus addressing. CCI
+   interfaces addresses refer to the parent node
+   addressing scheme to declare their register bases.
 
CCI interconnect node can define the following child nodes:
 
diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
b/Documentation/devicetree/bindings/arm/cpus.txt
index 1030f5f50207..283c520a2224 100644
--- a/Documentation/devicetree/bindings/arm/cpus.txt
+++ b/Documentation/devicetree/bindings/arm/cpus.txt
@@ -6,9 +6,9 @@ The device tree allows to describe the layout of CPUs in a 
system through
 the "cpus" node, which in turn contains a number of subnodes (ie "cpu")
 defining properties for every cpu.
 
-Bindings for CPU nodes follow the ePAPR v1.1 standard, available from:
+Bindings for CPU nodes follow the Devicetree Specification, available from:
 
-https://www.power.org/documentation/epapr-version-1-1/
+https://www.devicetree.org/specifications/
 
 with updates for 32-bit and 64-bit ARM systems provided in this document.
 
@@ -16,8 +16,8 @@ with updates for 32-bit and 64-bit ARM systems provided in 
this document.
 Convention used in this document
 
 
-This document follows the conventions described in the ePAPR v1.1, with
-the addition:
+This document follows the conventions described in the Devicetree
+Specification, with the addition:
 
 - square brackets define bitfields, eg reg[7:0] value of the bitfield in
   the reg property contained in bits 7 down to 0
@@ -26,8 +26,9 @@ the

Re: [PATCH] Documentation: remove overlay-notes reference to non-existent file

2017-06-22 Thread Rob Herring

On Tue, Jun 20, 2017 at 03:40:53PM -0700, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> File dt-object-internal.txt does not exist.  Remove a reference to it
> and fix up tags for references to other files.
> 
> Reported-by: afaer...@suse.de
> Signed-off-by: Frank Rowand 
> ---
>  Documentation/devicetree/overlay-notes.txt | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)

Applied.

Re: [PATCH v6 1/4] of: remove *phandle properties from expanded device tree

2017-06-22 Thread Rob Herring

On Wed, Jun 21, 2017 at 02:57:35PM +1000, Michael Ellerman wrote:
> Hi Frank,
> 
> frowand.l...@gmail.com writes:
> > From: Frank Rowand 
> >
> > Remove "phandle", "linux,phandle", and "ibm,phandle" properties from
> > the internal device tree.  The phandle will still be in the struct
> > device_node phandle field and will still be displayed as if it is
> > a property in /proc/device_tree.
> >
> > This is to resolve the issue found by Stephen Boyd [1] when he changed
> > the type of struct property.value from void * to const void *.  As
> > a result of the type change, the overlay code had compile errors
> > where the resolver updates phandle values.
> >
> >   [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
> >
> > - Add sysfs infrastructure to report np->phandle, as if it was a property.
> > - Do not create "phandle" "ibm,phandle", and "linux,phandle" properties
> >   in the expanded device tree.
> > - Remove phandle properties in of_attach_node(), for nodes dynamically
> >   attached to the live tree.  Add the phandle sysfs entry for these nodes.
> > - When creating an overlay changeset, duplicate the node phandle in
> >   __of_node_dup().
> > - Remove no longer needed checks to exclude "phandle" and "linux,phandle"
> >   properties in several locations.
> > - A side effect of these changes is that the obsolete "linux,phandle" and
> >   "ibm,phandle" properties will no longer appear in /proc/device-tree (they
> >   will appear as "phandle").
> 
> Sorry but I don't think that can work for us.
> 
> Our DLPAR (ie. CPU/memory/device hotplug) stuff on PowerVM uses
> "ibm,phandle", and it's not the same thing as "phandle" /
> "linux,phandle".
> 
> I don't know the code well myself, but the spec (PAPR) says:
> 
>   Note: If the “ibm,phandle” property exists, there are two “phandle”
>   namespaces which must be kept separate. One is that actually used by
>   the OF client interface, the other is properties in the device tree
>   making reference to device tree nodes. These requirements are written
>   to maintain backward compatibility with older FW versions predating
>   these requirements; if the “ibm,phandle” property is not present, the
>   OS may assume that any device tree properties which refer to this node
>   will have a phandle value matching that returned by client interface
>   services.
> 
> I have systems here that still use "ibm,phandle". I also see at least
> some of the userspace code that looks for "ibm,phandle", and nothing
> else.
> 
> The note above actually implies that the current Linux code is wrong,
> when it uses "ibm,phandle" as the value of np->phandle.
> 
> So sorry that's a big mess, but we can't just rip out those properties.
> 
> I think the minimal change would be to treat "ibm,phandle" like a normal
> property, I think that would allow our tools to keep working?
> 
> 
> The other thing that worries me is that by renaming (effectively)
> "linux,phandle" to "phandle", we lose the ability to accurately
> regenerate the device tree from /proc/device-tree. In theory it
> shouldn't matter, but I worry that in practice something will break.

The only scenario I can come up with is booting a kernel with this 
change and kexec'ing to a kernel older than 2010 (when ePAPR phandle was 
added). Some how I doubt that would work without this change.

> What if we just kept a single bit flag somewhere indicating if the name of
> the phandle property we found was "phandle" or "linux,phandle", and
> create the sysfs phandle using that name?

I'd like to move towards dropping 'linux,phandle' including changing 
dtc to stop generating both properties by default. Perhaps we should 
just be more explicit that we are doing that. Stop exposing it first and 
then change how phandles are stored/managed a cycle later. Then we can 
easily revert it if needed.

Rob

Re: [PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes

2017-06-22 Thread Naveen N. Rao

On 2017/06/22 11:14PM, Nicholas Piggin wrote:
> On Thu, 22 Jun 2017 21:14:21 +1000
> Michael Ellerman  wrote:
> 
> > Nicholas Piggin  writes:
> > 
> > > On Thu, 22 Jun 2017 00:08:40 +0530
> > > "Naveen N. Rao"  wrote:
> > >  
> > >> It is actually safe to probe system_call() in entry_64.S, but only till
> > >> we unset MSR_RI. To allow this, add a new symbol system_call_exit()
> > >> after the mtmsrd and blacklist that. Though the mtmsrd instruction
> > >> itself is now whitelisted, we won't be allowed to probe on it as we
> > >> don't allow probing on rfi and mtmsr instructions (checked for in
> > >> arch_prepare_kprobe()).  
> > >
> > > Can you add a little comment to say probes aren't allowed, and it's
> > > located after the mtmsr in order to avoid contaminating traces?

Hmm.. it's actually after the mtmsrd since we can probe until that 
point. In v2, I converted .Lsyscall_exit() into a different label and 
blacklisted all of it. But Michael prefers to only blacklist what we 
must. It is helpful if we ever want to probe after returning from a 
syscall. That said, please see further below (*)

> > >
> > > Also I wonder if a slightly different name would be more instructive?
> > > I don't normally care, but the system_call_common code isn't trivial
> > > to follow. system_call_exit might give the impression that it is the
> > > entire exit path (which would pair with system_call for entry).  
> > 
> > It is the entire path in the happy case isn't it? I'm not sure I know
> > what you mean.
> 
> Oh, yes you're right, I thought it was moved down further.
> 
> > If you're tracing etc. then you'll be in syscall_exit_work, isn't 
> > that sufficient to differentiate the two?

Not sure how much this matters, but the new symbol (system_call_exit) is 
actually a few instructions from after the syscall return 
(.Lsyscall_exit). And I have converted syscall_exit_work to a private 
symbol as well.

In general, from kprobes standpoint, there are places there where we can 
probe safely but doing so may require new symbols to be introduced, 
which could make traces look weird, as well as distribute samples among 
different symbols.

I'm not sure how best to proceed. In this current series, I pretty much 
blacklist all of system_call_common() through to syscall_exit() except 
for the small part in between in system_call(). All other symbols have 
been made private, so nothing else apart from system_call_common(), 
system_call() and system_call_exit() show up in traces.

We could probably retain syscall_dotrace(), syscall_enosys() and parts 
of syscall_exit_work() after RI is restored, which will allow those to 
be probed, but end up showing these symbols in traces.

What would be preferable?

-
(*)
The other suggestion Michael had was to just put a nop after the bctrl 
and to again make .Lsyscall_exit public and blacklist that (though he 
wasn't all for it). I suppose it does solve this issue in a nice way - 
the call traces when in a system call show the proper symbol and we do 
have a 'nop' instruction on which we can probe to catch returns from 
system calls. Is that something we can re-consider? Something like this 
(along with converting other .Lsyscall_exit references):

--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -175,8 +175,9 @@ system_call:/* label this so stack 
traces look sane */
ldx r12,r11,r0  /* Fetch system call handler [ptr] */
mtctr   r12
bctrl   /* Call handler */
+   nop

-.Lsyscall_exit:
+system_call_exit:
std r3,RESULT(r1)
CURRENT_THREAD_INFO(r12, r1)

Thanks,
Naveen

Re: [PATCH] of: update ePAPR references to point to Devicetree Specification

2017-06-22 Thread Rob Herring

On Mon, Jun 19, 2017 at 03:23:41AM -0700, Frank Rowand wrote:
> On 06/18/17 07:05, Rob Herring wrote:
> > On Tue, Jun 13, 2017 at 07:49:04PM -0700, frowand.l...@gmail.com wrote:
> >> From: Frank Rowand 
> >>
> >> The Devicetree Specification has superseded the ePAPR as the
> >> base specification for bindings.  Update files in Documentation
> >> to reference the new document.
> >>
> >> Some files are not updated because there is no hypervisor chapter
> >> in the Devicetree Specification:
> >>Documentation/devicetree/bindings/powerpc/fsl/msi-pic.txt
> >>Documenation/virtual/kvm/api.txt
> >>Documenation/virtual/kvm/ppc-pv.txt

> >> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt 
> >> b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt
> >> index f8cd2397aa04..d63ab1dec16d 100644
> >> --- a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt
> >> +++ b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt
> >> @@ -3,10 +3,10 @@ Power Architecture CPU Binding
> >>  Copyright 2013 Freescale Semiconductor Inc.
> >>  
> >>  Power Architecture CPUs in Freescale SOCs are represented in device trees 
> >> as
> >> -per the definition in ePAPR.
> >> +per the definition in the Devicetree Specification.
> > 
> > Are we sure we didn't remove any PPC specifics that apply here?
> 
> I don't see any.
> 
> Table 3.7.1 "General Properties of CPU nodes" was slightly
> re-ordered, but the same properties are listed in both documents.
> 
> I don't think that the boot requirements removal impacts this
> file.
> 
> Am I missing something?

No, I just wanted to make sure. So I guess there's just that one minor 
thing to drop.

Rob

RE: DPAA: Software Portal selection for network interface

2017-06-22 Thread Madalin-cristian Bucur

Hi Sebastian,

> -Original Message-
> From: Sebastian Huber [mailto:sebastian.hu...@embedded-brains.de]
> Sent: Thursday, June 22, 2017 3:42 PM
> To: linuxppc-dev@lists.ozlabs.org
> Cc: Madalin-cristian Bucur 
> Subject: DPAA: Software Portal selection for network interface
> 
> Hello,
> 
> as far as I understand the software portal selection for a network
> interface is done in
> 
> static int dpaa_eth_probe(struct platform_device *pdev)
> {
>  [...]
>  channel = dpaa_get_channel();
>  if (channel < 0) {
>  dev_err(dev, "dpaa_get_channel() failed\n");
>  err = channel;
>  goto get_channel_failed;
>  }
> 
>  priv->channel = (u16)channel;
> 
>  /* Start a thread that will walk the CPUs with affine portals
>   * and add this pool channel to each's dequeue mask.
>   */
>  dpaa_eth_add_channel(priv->channel);
> 
> with
> 
> static int dpaa_get_channel(void)
> {
>  spin_lock(_pool_channel_init);
>  if (!rx_pool_channel) {
>  u32 pool;
>  int ret;
> 
>  ret = qman_alloc_pool();
> 
>  if (!ret)
>  rx_pool_channel = pool;
>  }
>  spin_unlock(_pool_channel_init);
>  if (!rx_pool_channel)
>  return -ENOMEM;
>  return rx_pool_channel;
> }
> 
> which always returns the same pool channel (e.g. 0x401) if successful.

That's correct.

> This means all the QMan portal_isr() are distributed round-robin to all
> affine portals. Is there some way to configure the software portal for a
> specific network interface, e.g. use processors 0, 1, 2, 3 for one
> interface,and 4, 5, 6, 7 for another?

We've stripped all configurability and advanced features from the initial
DPAA submission. We don't have options to configure this. On the other hand,
the ingress queues are held active in the portal, resulting in one CPU doing
dequeues while there are frames available. This is done to avoid frame 
reordering, improving termination performance (and forwarded TCP performance).

The downside is that we're not saturating all CPUs with traffic. We're
currently working on adding Rx hashing, to be able to maintain intra-flow
frame order while spreading processing on all CPUs.

Meanwhile, would RPS (Receive Packet Steering) be a solution for you?

> --
> Sebastian Huber, embedded brains GmbH
> 
> Address : Dornierstr. 4, D-82178 Puchheim, Germany
> Phone   : +49 89 189 47 41-16
> Fax : +49 89 189 47 41-09
> E-Mail  : sebastian.hu...@embedded-brains.de
> PGP : Public key available on request.
> 
> Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG.


Regards,
Madalin

Re: [PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes

2017-06-22 Thread Naveen N. Rao

On 2017/06/22 11:08PM, Nicholas Piggin wrote:
> On Thu, 22 Jun 2017 21:07:46 +1000
> Michael Ellerman  wrote:
> 
> > Nicholas Piggin  writes:
> > 
> > > On Thu, 22 Jun 2017 00:08:39 +0530
> > > "Naveen N. Rao"  wrote:
> > >  
> > >> Convert some of the symbols into private symbols and blacklist
> > >> system_call_common() and system_call() from kprobes. We can't take a
> > >> trap at parts of these functions as either MSR_RI is unset or the kernel
> > >> stack pointer is not yet setup.
> > >> 
> > >> Reviewed-by: Masami Hiramatsu 
> > >> Signed-off-by: Naveen N. Rao   
> > >
> > > I don't have a problem with this bunch of system call labels
> > > going private. They've never added much for me in profiles.
> > >
> > > Reviewed-by: Nicholas Piggin 

Thanks for the review.

> > >
> > > Semi-related question, why is system_call: where it is?  
> > 
> > Ancient history.
> > 
> > We used to have:
> > 
> > bne syscall_dotrace
> > syscall_dotrace_cont:
> > cmpldi  0,r0,NR_syscalls
> > bge-syscall_enosys
> > 
> > system_call:/* label this so stack traces look sane 
> > */
> > 
> > 
> > So it was there to hide syscall_dotrace_cont from back traces.
> > 
> > But we made syscall_dotrace_cont local in 2012 and then removed it
> > entirely in 2015.
> > 
> > > Should we move it up to right after the mtmsrd / wrteei instruction?
> > > (obviously for another patch). It's pretty common to get PMU
> > > interrupts coming in right after mtmsr and this makes profiles split
> > > the syscall into two which is annoying.  
> > 
> > Move it wherever makes sense and gives good back traces.
> 
> I'd be in favour of moving it to right after the interurpt enable.
> I suppose you'd want a separate patch for that though. But we could
> put it in this series since we're changing a lot of labels.

Sure, I'll add a patch that does that.

- Naveen

Re: [PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor

2017-06-22 Thread Naveen N. Rao

On 2017/06/22 11:06PM, Nicholas Piggin wrote:
> On Thu, 22 Jun 2017 20:59:49 +1000
> Michael Ellerman  wrote:
> 
> > Nicholas Piggin  writes:
> > 
> > > On Thu, 22 Jun 2017 00:08:37 +0530
> > > "Naveen N. Rao"  wrote:
> > >  
> > >> Currently, we assume that the function pointer we receive in
> > >> ppc_function_entry() points to a function descriptor. However, this is
> > >> not always the case. In particular, assembly symbols without the right
> > >> annotation do not have an associated function descriptor. Some of these
> > >> symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().
> > >> When such addresses are subsequently processed through
> > >> arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
> > >> below errors during bootup:
> > >> [0.663963] Failed to find blacklist at 7d9b02a648029b6c
> > >> [0.663970] Failed to find blacklist at a14d03d0394a0001
> > >> [0.663972] Failed to find blacklist at 7d5302a6f94d0388
> > >> [0.663973] Failed to find blacklist at 48027d11e8610178
> > >> [0.663974] Failed to find blacklist at f8010070f8410080
> > >> [0.663976] Failed to find blacklist at 386100704801f89d
> > >> [0.663977] Failed to find blacklist at 7d5302a6f94d00b0
> > >> 
> > >> Fix this by checking if the address in the function descriptor is
> > >> actually a valid kernel address. In the case of assembly symbols, this
> > >> will almost always fail as this ends up being powerpc instructions. In
> > >> that case, return pointer to the address we received, rather than the
> > >> dereferenced value.
> > >> 
> > >> Signed-off-by: Naveen N. Rao 
> > >> ---
> > >>  arch/powerpc/include/asm/code-patching.h | 10 +-
> > >>  1 file changed, 9 insertions(+), 1 deletion(-)
> > >> 
> > >> diff --git a/arch/powerpc/include/asm/code-patching.h 
> > >> b/arch/powerpc/include/asm/code-patching.h
> > >> index abef812de7f8..ec54050be585 100644
> > >> --- a/arch/powerpc/include/asm/code-patching.h
> > >> +++ b/arch/powerpc/include/asm/code-patching.h
> > >> @@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void 
> > >> *func)
> > >>   * On PPC64 ABIv1 the function pointer actually points to the
> > >>   * function's descriptor. The first entry in the descriptor is 
> > >> the
> > >>   * address of the function text.
> > >> + *
> > >> + * However, we may have received a pointer to an assembly symbol
> > >> + * that may not be a function descriptor. Validate that the 
> > >> entry
> > >> + * points to a valid kernel address and if not, return the 
> > >> pointer
> > >> + * we received as is.
> > >>   */
> > >> -return ((func_descr_t *)func)->entry;
> > >> +if (kernel_text_address(((func_descr_t *)func)->entry))
> > >> +return ((func_descr_t *)func)->entry;
> > >> +else
> > >> +return (unsigned long)func;  
> > >
> > > What if "func" is a text section label (bare asm function)?
> > > Won't func->entry load the random instruction located there
> > > and compare it with a kernel address?  
> > 
> > Yes, that's the problem.

Yes, we were currently returning those instructions as the function 
entry address.

> > 
> > > I don't know too much about the v1 ABI, but should we check for
> > > func belonging in the .opd section first and base the check on
> > > that? Alternatively I if "func" is in the kernel text address,
> > > we can recognize it's not in the .opd section... right?  
> > 
> > That sounds like a more robust solution. But I suspect it won't work for
> > modules.
> 
> kernel_text_address() seems to check for module text as well, so it
> might work I think?

Yes, I think that's a very nice idea! I'll check and confirm that it 
does what it's supposed to.

Thanks for the review,
- Naveen

Re: [PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes

2017-06-22 Thread Nicholas Piggin

On Thu, 22 Jun 2017 21:14:21 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > On Thu, 22 Jun 2017 00:08:40 +0530
> > "Naveen N. Rao"  wrote:
> >  
> >> It is actually safe to probe system_call() in entry_64.S, but only till
> >> we unset MSR_RI. To allow this, add a new symbol system_call_exit()
> >> after the mtmsrd and blacklist that. Though the mtmsrd instruction
> >> itself is now whitelisted, we won't be allowed to probe on it as we
> >> don't allow probing on rfi and mtmsr instructions (checked for in
> >> arch_prepare_kprobe()).  
> >
> > Can you add a little comment to say probes aren't allowed, and it's
> > located after the mtmsr in order to avoid contaminating traces?
> >
> > Also I wonder if a slightly different name would be more instructive?
> > I don't normally care, but the system_call_common code isn't trivial
> > to follow. system_call_exit might give the impression that it is the
> > entire exit path (which would pair with system_call for entry).  
> 
> It is the entire path in the happy case isn't it? I'm not sure I know
> what you mean.

Oh, yes you're right, I thought it was moved down further.

Thanks,
Nick

Re: [1/3] powerpc/64s: Use BRANCH_TO_COMMON() for slb_miss_realmode

2017-06-22 Thread Michael Ellerman

On Tue, 2017-06-20 at 12:34:55 UTC, Michael Ellerman wrote:
> All the callers of slb_miss_realmode currently open code the #ifndef
> CONFIG_RELOCATABLE check and the branch via CTR in the RELOCATABLE case.
> We have a macro to do this, BRANCH_TO_COMMON(), so use it.
> 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Nicholas Piggin 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/b102063b47d59752e113c558842227

cheers

Re: [1/9] powerpc/64s: slb_allocate_realmode() preserve r3

2017-06-22 Thread Michael Ellerman

On Sun, 2017-05-21 at 13:15:42 UTC, Nicholas Piggin wrote:
> One fewer registers clobbered by this function means the SLB miss
> handler can save one fewer.
> 
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d59afffdf04c66c09085160706297e

cheers

DPAA: Software Portal selection for network interface

2017-06-22 Thread Sebastian Huber


Hello,

as far as I understand the software portal selection for a network 
interface is done in


static int dpaa_eth_probe(struct platform_device *pdev)
{
[...]
channel = dpaa_get_channel();
if (channel < 0) {
dev_err(dev, "dpaa_get_channel() failed\n");
err = channel;
goto get_channel_failed;
}

priv->channel = (u16)channel;

/* Start a thread that will walk the CPUs with affine portals
 * and add this pool channel to each's dequeue mask.
 */
dpaa_eth_add_channel(priv->channel);

with

static int dpaa_get_channel(void)
{
spin_lock(_pool_channel_init);
if (!rx_pool_channel) {
u32 pool;
int ret;

ret = qman_alloc_pool();

if (!ret)
rx_pool_channel = pool;
}
spin_unlock(_pool_channel_init);
if (!rx_pool_channel)
return -ENOMEM;
return rx_pool_channel;
}

which always returns the same pool channel (e.g. 0x401) if successful.

This means all the QMan portal_isr() are distributed round-robin to all 
affine portals. Is there some way to configure the software portal for a 
specific network interface, e.g. use processors 0, 1, 2, 3 for one 
interface,and 4, 5, 6, 7 for another?


--
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail  : sebastian.hu...@embedded-brains.de
PGP : Public key available on request.

Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG.

Re: [PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes

2017-06-22 Thread Nicholas Piggin

On Thu, 22 Jun 2017 21:07:46 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > On Thu, 22 Jun 2017 00:08:39 +0530
> > "Naveen N. Rao"  wrote:
> >  
> >> Convert some of the symbols into private symbols and blacklist
> >> system_call_common() and system_call() from kprobes. We can't take a
> >> trap at parts of these functions as either MSR_RI is unset or the kernel
> >> stack pointer is not yet setup.
> >> 
> >> Reviewed-by: Masami Hiramatsu 
> >> Signed-off-by: Naveen N. Rao   
> >
> > I don't have a problem with this bunch of system call labels
> > going private. They've never added much for me in profiles.
> >
> > Reviewed-by: Nicholas Piggin 
> >
> > Semi-related question, why is system_call: where it is?  
> 
> Ancient history.
> 
> We used to have:
> 
>   bne syscall_dotrace
> syscall_dotrace_cont:
>   cmpldi  0,r0,NR_syscalls
>   bge-syscall_enosys
> 
> system_call:  /* label this so stack traces look sane */
> 
> 
> So it was there to hide syscall_dotrace_cont from back traces.
> 
> But we made syscall_dotrace_cont local in 2012 and then removed it
> entirely in 2015.
> 
> > Should we move it up to right after the mtmsrd / wrteei instruction?
> > (obviously for another patch). It's pretty common to get PMU
> > interrupts coming in right after mtmsr and this makes profiles split
> > the syscall into two which is annoying.  
> 
> Move it wherever makes sense and gives good back traces.

I'd be in favour of moving it to right after the interurpt enable.
I suppose you'd want a separate patch for that though. But we could
put it in this series since we're changing a lot of labels.

Thanks,
Nick

[PATCH V4] cxl: Export library to support IBM XSL

2017-06-22 Thread Christophe Lombard

This patch exports a in-kernel 'library' API which can be called by
other drivers to help interacting with an IBM XSL on a POWER9 system.

The XSL (Translation Service Layer) is a stripped down version of the
PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
Like the PSL, it implements the CAIA architecture, but has a number
of differences, mostly in it's implementation dependent registers.

The XSL also uses a special DMA cxl mode, which uses a slightly
different init sequence for the CAPP and PHB.

Signed-off-by: Andrew Donnellan 
Signed-off-by: Christophe Lombard 

---
Changelog[v4]
 - Rebase to latest upstream.
 - Add new Signed-off
 - Update cxl_handle_mm_fault() and cxl_handle_page_fault()

Changelog[v3]
 - Rebase to latest upstream.
 - cxl_handle_mm_fault() is now exported. Remove kernel context
   parameter
 - Update comments

Changelog[v2]
 - Rebase to latest upstream.
 - Return -EFAULT in case of NULL pointer in cxllib_handle_fault().
 - Reverse parameters when copro_handle_mm_fault() is called.

 This applies on top of this patch:
 http://patchwork.ozlabs.org/patch/775322/
---
 arch/powerpc/include/asm/opal-api.h |   1 +
 drivers/misc/cxl/Kconfig|   5 +
 drivers/misc/cxl/Makefile   |   2 +-
 drivers/misc/cxl/cxl.h  |   6 +
 drivers/misc/cxl/cxllib.c   | 246 
 drivers/misc/cxl/fault.c|  29 +++--
 drivers/misc/cxl/native.c   |  16 ++-
 drivers/misc/cxl/pci.c  |  41 --
 include/misc/cxllib.h   | 133 +++
 9 files changed, 449 insertions(+), 30 deletions(-)
 create mode 100644 drivers/misc/cxl/cxllib.c
 create mode 100644 include/misc/cxllib.h

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index cb3e624..3e0be78 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -877,6 +877,7 @@ enum {
OPAL_PHB_CAPI_MODE_SNOOP_OFF= 2,
OPAL_PHB_CAPI_MODE_SNOOP_ON = 3,
OPAL_PHB_CAPI_MODE_DMA  = 4,
+   OPAL_PHB_CAPI_MODE_DMA_TVT1 = 5,
 };
 
 /* OPAL I2C request */
diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
index b75cf83..93397cb 100644
--- a/drivers/misc/cxl/Kconfig
+++ b/drivers/misc/cxl/Kconfig
@@ -11,11 +11,16 @@ config CXL_AFU_DRIVER_OPS
bool
default n
 
+config CXL_LIB
+   bool
+   default n
+
 config CXL
tristate "Support for IBM Coherent Accelerators (CXL)"
depends on PPC_POWERNV && PCI_MSI && EEH
select CXL_BASE
select CXL_AFU_DRIVER_OPS
+   select CXL_LIB
default m
help
  Select this option to enable driver support for IBM Coherent
diff --git a/drivers/misc/cxl/Makefile b/drivers/misc/cxl/Makefile
index c14fd6b..0b5fd74 100644
--- a/drivers/misc/cxl/Makefile
+++ b/drivers/misc/cxl/Makefile
@@ -3,7 +3,7 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
 
 cxl-y  += main.o file.o irq.o fault.o native.o
 cxl-y  += context.o sysfs.o pci.o trace.o
-cxl-y  += vphb.o phb.o api.o
+cxl-y  += vphb.o phb.o api.o cxllib.o
 cxl-$(CONFIG_PPC_PSERIES)  += flash.o guest.o of.o hcalls.o
 cxl-$(CONFIG_DEBUG_FS) += debugfs.o
 obj-$(CONFIG_CXL)  += cxl.o
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index a03f8e7..b1afecc 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -1010,6 +1010,7 @@ static inline void cxl_debugfs_add_afu_regs_psl8(struct 
cxl_afu *afu, struct den
 
 void cxl_handle_fault(struct work_struct *work);
 void cxl_prefault(struct cxl_context *ctx, u64 wed);
+int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar);
 
 struct cxl *get_cxl_adapter(int num);
 int cxl_alloc_sst(struct cxl_context *ctx);
@@ -1061,6 +1062,11 @@ int cxl_afu_slbia(struct cxl_afu *afu);
 int cxl_data_cache_flush(struct cxl *adapter);
 int cxl_afu_disable(struct cxl_afu *afu);
 int cxl_psl_purge(struct cxl_afu *afu);
+int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
+ u32 *phb_index, u64 *capp_unit_id);
+int cxl_slot_is_switched(struct pci_dev *dev);
+int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg);
+u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9);
 
 void cxl_native_irq_dump_regs_psl9(struct cxl_context *ctx);
 void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx);
diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
new file mode 100644
index 000..5dba23c
--- /dev/null
+++ b/drivers/misc/cxl/cxllib.c
@@ -0,0 +1,246 @@
+/*
+ * Copyright 2017 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+

Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd

2017-06-22 Thread Michael Ellerman

Thiago Jung Bauermann  writes:

> Michael Ellerman  writes:
>> Thiago Jung Bauermann  writes:
>>
>>> Calling arch_update_cpu_topology from a CPU hotplug state machine callback
>>> hits a deadlock because the function tries to get a read lock on
>>> cpu_hotplug_lock while the state machine still holds a write lock on it.
>>>
>>> Since all callers of arch_update_cpu_topology except rtasd already hold
>>> cpu_hotplug_lock, this patch changes the function to use
>>> stop_machine_cpuslocked and creates a separate function for rtasd which
>>> still tries to obtain the lock.
>>>
>>> Michael Bringmann investigated the bug and provided a detailed analysis
>>> of the deadlock on this previous RFC for an alternate solution:
>>>
>>> https://patchwork.ozlabs.org/patch/771293/
>>
>> Do we know when this broke? Or has it never worked?
>
> It's been broken since at least v4.4, I think. I don't know about
> earlier versions.

OK.

Just to be clear, this is happening on a 4.12-rcX system with no other
patches?

The code in arch_update_cpu_topology() has used stop_machine() since 
30c05350c39d ("powerpc/pseries: Use stop machine to update cpu maps")
which went into v3.10, about 4 years ago.

Prior to that it used get/put_online_cpus(), since 9eff1a38407c
("powerpc/pseries: Poll VPA for topology changes and update NUMA maps"),
which was 2.6.38 in 2010.

I wouldn't rule out the possibility it's been broken for 7 years, but I
wonder if something else has changed to cause it to break.

We really need to work it out before we backport anything.

>> Should it go to stable? (can't in its current form AFAICS)
>
> It's not hard to backport both this patch and commit fe5595c07400
> ("stop_machine: Provide stop_machine_cpuslocked()") from branch
> smp/hotplug in tip.git for stable.

Yeah but it's not really my business backporting that unfortunately.

> Since rtasd only started calling arch_update_cpu_topology since v4.11,
> for earlier versions this patch can be simplified to making that
> function call stop_machine_cpuslocked unconditionally instead of
> defining a separate function.

OK.

cheers

DNS (?) not working on G5 (64-bit powerpc) (was [net-next, v3, 3/3] udp: try to avoid 2 cache miss on dequeue)

2017-06-22 Thread Michael Ellerman

Paolo wrote:
> when udp_recvmsg() is executed, on x86_64 and other archs, most skb
> fields are on cold cachelines.
> If the skb are linear and the kernel don't need to compute the udp
> csum, only a handful of skb fields are required by udp_recvmsg().
> Since we already use skb->dev_scratch to cache hot data, and
> there are 32 bits unused on 64 bit archs, use such field to cache
> as much data as we can, and try to prefetch on dequeue the relevant
> fields that are left out.
> 
> This can save up to 2 cache miss per packet.
> 
> v1 -> v2:
>   - changed udp_dev_scratch fields types to u{32,16} variant,
> replaced bitfiled with bool
> 
> Signed-off-by: Paolo Abeni 
> Acked-by: Eric Dumazet 
> ---
>  net/ipv4/udp.c | 114 
> +++--
>  1 file changed, 103 insertions(+), 11 deletions(-)

This appears to break wget on one of my machines.

Networking in general is working, I'm able to SSH in, but then I can't
do a wget.

eg:

$ wget google.com
--2017-06-22 22:45:39--  http://google.com/
Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in name 
resolution.
wget: unable to resolve host address ‘proxy.pmdw.com’

$ host proxy.pmdw.com
proxy.pmdw.com is an alias for raven.pmdw.com.
raven.pmdw.com has address 10.1.2.3

$ wget google.com
--2017-06-22 22:52:08--  http://google.com/
Resolving proxy.pmdw.com (proxy.pmdw.com)... failed: Temporary failure in name 
resolution.
wget: unable to resolve host address ‘proxy.pmdw.com’

Maybe host is using TCP but the man page says it doesn't?


Everything is OK if I boot back to the previous commit 0a463c78d25b
("udp: avoid a cache miss on dequeue"):

$ wget google.com
--2017-06-22 23:00:01--  http://google.com/
Resolving proxy.pmdw.com (proxy.pmdw.com)... 10.1.2.3
Connecting to proxy.pmdw.com (proxy.pmdw.com)|10.1.2.3|:3128... connected.
Proxy request sent, awaiting response... 302 Found
Location: http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE 
[following]
--2017-06-22 23:00:01--  
http://www.google.com.au/?gfe_rd=cr=Ub9LWbPbLujDXrH1uPgE
Reusing existing connection to proxy.pmdw.com:3128.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

index.html  [ <=>   ]  
11.37K  --.-KB/sin 0.001s  

2017-06-22 23:00:01 (22.0 MB/s) - ‘index.html’ saved [11640]

$ uname -a
Linux 4.12.0-rc4-gcc6-00988-g0a463c7 #88 SMP Thu Jun 22 22:55:12 AEST 2017 
ppc64 GNU/Linux


Haven't had time to debug any further. Any ideas?

cheers

Re: [PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor

2017-06-22 Thread Nicholas Piggin

On Thu, 22 Jun 2017 20:59:49 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > On Thu, 22 Jun 2017 00:08:37 +0530
> > "Naveen N. Rao"  wrote:
> >  
> >> Currently, we assume that the function pointer we receive in
> >> ppc_function_entry() points to a function descriptor. However, this is
> >> not always the case. In particular, assembly symbols without the right
> >> annotation do not have an associated function descriptor. Some of these
> >> symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().
> >> When such addresses are subsequently processed through
> >> arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
> >> below errors during bootup:
> >> [0.663963] Failed to find blacklist at 7d9b02a648029b6c
> >> [0.663970] Failed to find blacklist at a14d03d0394a0001
> >> [0.663972] Failed to find blacklist at 7d5302a6f94d0388
> >> [0.663973] Failed to find blacklist at 48027d11e8610178
> >> [0.663974] Failed to find blacklist at f8010070f8410080
> >> [0.663976] Failed to find blacklist at 386100704801f89d
> >> [0.663977] Failed to find blacklist at 7d5302a6f94d00b0
> >> 
> >> Fix this by checking if the address in the function descriptor is
> >> actually a valid kernel address. In the case of assembly symbols, this
> >> will almost always fail as this ends up being powerpc instructions. In
> >> that case, return pointer to the address we received, rather than the
> >> dereferenced value.
> >> 
> >> Signed-off-by: Naveen N. Rao 
> >> ---
> >>  arch/powerpc/include/asm/code-patching.h | 10 +-
> >>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/arch/powerpc/include/asm/code-patching.h 
> >> b/arch/powerpc/include/asm/code-patching.h
> >> index abef812de7f8..ec54050be585 100644
> >> --- a/arch/powerpc/include/asm/code-patching.h
> >> +++ b/arch/powerpc/include/asm/code-patching.h
> >> @@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void 
> >> *func)
> >> * On PPC64 ABIv1 the function pointer actually points to the
> >> * function's descriptor. The first entry in the descriptor is the
> >> * address of the function text.
> >> +   *
> >> +   * However, we may have received a pointer to an assembly symbol
> >> +   * that may not be a function descriptor. Validate that the entry
> >> +   * points to a valid kernel address and if not, return the pointer
> >> +   * we received as is.
> >> */
> >> -  return ((func_descr_t *)func)->entry;
> >> +  if (kernel_text_address(((func_descr_t *)func)->entry))
> >> +  return ((func_descr_t *)func)->entry;
> >> +  else
> >> +  return (unsigned long)func;  
> >
> > What if "func" is a text section label (bare asm function)?
> > Won't func->entry load the random instruction located there
> > and compare it with a kernel address?  
> 
> Yes, that's the problem.
> 
> > I don't know too much about the v1 ABI, but should we check for
> > func belonging in the .opd section first and base the check on
> > that? Alternatively I if "func" is in the kernel text address,
> > we can recognize it's not in the .opd section... right?  
> 
> That sounds like a more robust solution. But I suspect it won't work for
> modules.

kernel_text_address() seems to check for module text as well, so it
might work I think?

Re: [PATCH] dmaengine: fsldma: set BWC, DAHTS and SAHTS values correctly

2017-06-22 Thread Vinod Koul

On Mon, Jun 19, 2017 at 04:40:04PM +0200, Thomas Breitung wrote:
> The bits of BWC, DAHTS and SAHTS in the DMA mode register must be cleared
> before a new value can be or-ed in.

Applied, thanks

-- 
~Vinod

Re: [PATCH V3] cxl: Export library to support IBM XSL

2017-06-22 Thread Frederic Barrat


Salut Christophe,

Since there's a respin, 2 details below.



diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
new file mode 100644
index 000..4f4c5ca
--- /dev/null
+++ b/drivers/misc/cxl/cxllib.c
@@ -0,0 +1,246 @@
+/*
+ * Copyright 2017 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 


Maybe somebody can comment on this, but I believe the usual order is the 
'linux' headers first, then the 'asm'. Though I don't know if there's a 
valid reason behind it, or just tradition...




diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index c79e39b..077370f 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c



+static int cxl_handle_page_fault(struct cxl_context *ctx,
+struct mm_struct *mm,
+u64 dsisr, u64 dar)
+{
+   trace_cxl_pte_miss(ctx, dsisr, dar);
+
+   if (cxl_handle_mm_fault(mm, dsisr, dar)) {
+   cxl_ack_ae(ctx);
+   } else {
+   pr_devel("Page fault successfully handled for pe: %i!\n", 
ctx->pe);
+   cxl_ops->ack_irq(ctx, CXL_PSL_TFC_An_R, 0);
+   }
+
+   return IRQ_HANDLED;
  }


Why return IRQ_HANDLED? it's not called from an interrupt handler but 
from the bottom half (actually, I think it's coming from 
cxl_handle_segment_miss(), but it's wrong there as well).


  Fred

Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd

2017-06-22 Thread Thomas Gleixner

On Wed, 21 Jun 2017, Thiago Jung Bauermann wrote:
> Michael Ellerman  writes:
> >> Notes:
> >> This patch applies on tip/smp/hotplug, it should probably be carried 
> >> there.
> >
> > stop_machine_cpuslocked() doesn't exist in mainline so I think it has to
> > be carried there right?
> 
> Yes. I said "probably" because I don't know if you want to wait
> until that branch is merged so that you can carry this patch in your
> tree.

I'll pick it up for 4.12 and you folks can figure out the backporting once
that hits Linus tree.

Thanks,

tglx

Re: [RFC PATCH 1/2] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-06-22 Thread Benjamin Herrenschmidt

On Thu, 2017-06-22 at 11:29 +0200, Cédric Le Goater wrote:
> This is the framework for using XIVE in a PowerVM guest. The support
> is very similar to the native one in a much simpler form.

Looks really good. Minor nits & comments...

> Instead of OPAL calls, a set of Hypervisors call are used to configure
> the interrupt sources and the event/notification queues of the guest:
> 
>H_INT_GET_SOURCE_INFO
>H_INT_SET_SOURCE_CONFIG
>H_INT_GET_SOURCE_CONFIG
>H_INT_GET_QUEUE_INFO
>H_INT_SET_QUEUE_CONFIG
>H_INT_GET_QUEUE_CONFIG
>H_INT_RESET

There are the base ones.

> Calls that still need to be addressed :
> 
>H_INT_SET_OS_REPORTING_LINE
>H_INT_GET_OS_REPORTING_LINE

Ah so those have to do with that magic cache line you can register with
the HW so that when you get an interrupt, you can do an MMIO store very
early on in the interrupt entry path to the XIVE, which will
asynchronously write the NSR etc... to that cache line which you can
then poke at later one.

I don't know if it's worth exploiting in Linux, but we should support
it in qemu/kvm.

>H_INT_ESB

This is a h-call that performs the basic ESB operations. Some
interrupts can have a flag telling the OS to do the operations using
that hcall rather than directly. This can be used to workaround HW
issues with some interrupts sources if needed.

>H_INT_SYNC

This will be needed for queue accounting in some cases, such as CPU
hotplug I think etc... For example if you mask an interrupt in the ESB,
a sync will ensure that any previous occurrence of this interrupt has
reached its target queue (and thus is visible in memory).

> As for XICS, the XIVE interface for the guest is described in the
> device tree under the interrupt controller node. A couple of new
> properties are specific to XIVE :
> 
>  - "reg"
> 
>contains the base address and size of the thread interrupt
>managnement areas (TIMA) for the user level for the OS level. Only
>the OS level is taken into account.
> 
>  - "ibm,xive-eq-sizes"
> 
>the size of the event queues.
> 
>  - "ibm,xive-lisn-ranges"
> 
>the interrupt numbers ranges assigned to the guest. These are
>allocated using a simple bitmap.
> 
> This is work in progress. It was only tested with a QEMU XIVE model
> for pseries.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/asm/hvcall.h  |  13 +-
>  arch/powerpc/include/asm/xive.h|   1 +
>  arch/powerpc/platforms/pseries/Kconfig |   1 +
>  arch/powerpc/platforms/pseries/setup.c |   8 +-
>  arch/powerpc/platforms/pseries/smp.c   |  18 +-
>  arch/powerpc/sysdev/xive/Kconfig   |   5 +
>  arch/powerpc/sysdev/xive/Makefile  |   1 +
>  arch/powerpc/sysdev/xive/xive-hv.c | 523 
> +
>  8 files changed, 566 insertions(+), 4 deletions(-)
>  create mode 100644 arch/powerpc/sysdev/xive/xive-hv.c
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index d73755fafbb0..3c019e9f451a 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -280,7 +280,18 @@
>  #define H_RESIZE_HPT_COMMIT  0x370
>  #define H_REGISTER_PROC_TBL  0x37C
>  #define H_SIGNAL_SYS_RESET   0x380
> -#define MAX_HCALL_OPCODE H_SIGNAL_SYS_RESET
> +#define H_INT_GET_SOURCE_INFO   0x3A8
> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
> +#define H_INT_GET_QUEUE_INFO0x3B4
> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
> +#define H_INT_ESB   0x3C8
> +#define H_INT_SYNC  0x3CC
> +#define H_INT_RESET 0x3D0
> +#define MAX_HCALL_OPCODE H_INT_RESET
>  
>  /* H_VIOCTL functions */
>  #define H_GET_VIOA_DUMP_SIZE 0x01
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index c23ff4389ca2..c947952ed934 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -110,6 +110,7 @@ extern bool __xive_enabled;
>  
>  static inline bool xive_enabled(void) { return __xive_enabled; }
>  
> +extern bool xive_hv_init(void);
>  extern bool xive_native_init(void);
>  extern void xive_smp_probe(void);
>  extern int  xive_smp_prepare_cpu(unsigned int cpu);
> diff --git a/arch/powerpc/platforms/pseries/Kconfig 
> b/arch/powerpc/platforms/pseries/Kconfig
> index 913c54e23eea..4eec0283f043 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -7,6 +7,7 @@ config PPC_PSERIES
>   select PCI
>   select PCI_MSI
>   select PPC_XICS
> + select PPC_XIVE_HV
>   select PPC_ICP_NATIVE
>   select PPC_ICP_HV
>   select PPC_ICS_RTAS
> diff --git a/arch/powerpc/platforms/pseries/setup.c 
> b/arch/powerpc/platforms/pseries/setup.c
> index b5d86426e97b..53bad49b84be 100644
> ---

Re: [PATCH V3] cxl: Export library to support IBM XSL

2017-06-22 Thread christophe lombard


Le 22/06/2017 à 13:20, Michael Ellerman a écrit :

Christophe Lombard  writes:


This patch exports a in-kernel 'library' API which can be called by
other drivers to help interacting with an IBM XSL on a POWER9 system.

The XSL (Translation Service Layer) is a stripped down version of the
PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
Like the PSL, it implements the CAIA architecture, but has a number
of differences, mostly in it's implementation dependent registers.

The XSL also uses a special DMA cxl mode, which uses a slightly
different init sequence for the CAPP and PHB.

Signed-off-by: Christophe Lombard 

Andrew sent you a SOB for parts he wrote, so you should include that.

Waiting for an ack from Fred before I merge this.

cheers



You are right (Sorry Andrew !). I will push a new patch

Thanks

Re: [PATCH V3] cxl: Export library to support IBM XSL

2017-06-22 Thread Michael Ellerman

Christophe Lombard  writes:

> This patch exports a in-kernel 'library' API which can be called by
> other drivers to help interacting with an IBM XSL on a POWER9 system.
>
> The XSL (Translation Service Layer) is a stripped down version of the
> PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
> Like the PSL, it implements the CAIA architecture, but has a number
> of differences, mostly in it's implementation dependent registers.
>
> The XSL also uses a special DMA cxl mode, which uses a slightly
> different init sequence for the CAPP and PHB.
>
> Signed-off-by: Christophe Lombard 

Andrew sent you a SOB for parts he wrote, so you should include that.

Waiting for an ack from Fred before I merge this.

cheers

Re: 1M hugepage size being registered on Linux

2017-06-22 Thread Michael Ellerman

Benjamin Herrenschmidt  writes:

> On Thu, 2017-06-22 at 13:59 +1000, Michael Ellerman wrote:
>> It's unsupported in Linux because it doesn't match the page table
>> geometry.
>> 
>> We merged a patch from Aneesh to filter it out in 4.12-rc1:
>> 
>>   a525108cf1cc ("powerpc/mm/hugetlb: Filter out hugepage size not supported 
>> by page table layout")
>> 
>> I guess we should probably send that patch to stable et. al.
>
> It's also not supported in HW afaik, why is it there ?

Because hostboot said so AIUI.

cheers

Re: [PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes

2017-06-22 Thread Michael Ellerman

Nicholas Piggin  writes:

> On Thu, 22 Jun 2017 00:08:40 +0530
> "Naveen N. Rao"  wrote:
>
>> It is actually safe to probe system_call() in entry_64.S, but only till
>> we unset MSR_RI. To allow this, add a new symbol system_call_exit()
>> after the mtmsrd and blacklist that. Though the mtmsrd instruction
>> itself is now whitelisted, we won't be allowed to probe on it as we
>> don't allow probing on rfi and mtmsr instructions (checked for in
>> arch_prepare_kprobe()).
>
> Can you add a little comment to say probes aren't allowed, and it's
> located after the mtmsr in order to avoid contaminating traces?
>
> Also I wonder if a slightly different name would be more instructive?
> I don't normally care, but the system_call_common code isn't trivial
> to follow. system_call_exit might give the impression that it is the
> entire exit path (which would pair with system_call for entry).

It is the entire path in the happy case isn't it? I'm not sure I know
what you mean.

> Perhaps system_call_exit_notrace? No that sucks too :(

A bit :D

If you're tracing etc. then you'll be in syscall_exit_work, isn't that
sufficient to differentiate the two?

cheers

Re: [PATCH v3 5/6] powerpc/64s: Blacklist functions invoked on a trap

2017-06-22 Thread Michael Ellerman

Nicholas Piggin  writes:

> On Thu, 22 Jun 2017 00:08:41 +0530
> "Naveen N. Rao"  wrote:
>
>> Blacklist all functions involved while handling a trap. We:
>> - convert some of the symbols into private symbols,
>> - remove the duplicate 'restore' symbol, and
>> - blacklist most functions involved while handling a trap.
>
> I'm not sure removing "restore" makes it better. 

Yeah it bloats the patch needlessly, we can do a rename patch later.

cheers

Re: [PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes

2017-06-22 Thread Michael Ellerman

Nicholas Piggin  writes:

> On Thu, 22 Jun 2017 00:08:39 +0530
> "Naveen N. Rao"  wrote:
>
>> Convert some of the symbols into private symbols and blacklist
>> system_call_common() and system_call() from kprobes. We can't take a
>> trap at parts of these functions as either MSR_RI is unset or the kernel
>> stack pointer is not yet setup.
>> 
>> Reviewed-by: Masami Hiramatsu 
>> Signed-off-by: Naveen N. Rao 
>
> I don't have a problem with this bunch of system call labels
> going private. They've never added much for me in profiles.
>
> Reviewed-by: Nicholas Piggin 
>
> Semi-related question, why is system_call: where it is?

Ancient history.

We used to have:

bne syscall_dotrace
syscall_dotrace_cont:
cmpldi  0,r0,NR_syscalls
bge-syscall_enosys

system_call:/* label this so stack traces look sane */


So it was there to hide syscall_dotrace_cont from back traces.

But we made syscall_dotrace_cont local in 2012 and then removed it
entirely in 2015.

> Should we move it up to right after the mtmsrd / wrteei instruction?
> (obviously for another patch). It's pretty common to get PMU
> interrupts coming in right after mtmsr and this makes profiles split
> the syscall into two which is annoying.

Move it wherever makes sense and gives good back traces.

cheers

Re: 1M hugepage size being registered on Linux

2017-06-22 Thread Michal Suchánek

On Tue, 20 Jun 2017 10:47:47 -0300
victora  wrote:

> Hi Alistair/Jeremy,
> 
> I am working on a bug related to 1M hugepage size being registered on 
> Linux (Power 8 Baremetal - Garrison).
> 
> I was checking dmesg and it seems that 1M page size is coming from 
> firmware to Linux.
> 
> [0.00] base_shift=20: shift=20, sllp=0x0130,
> avpnm=0x, tlbiel=0, penc=2
> [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0
> pages
> 
> Should Linux support this page size? As afar as I know, this was an 
> unsupported page size in the past isn't it? If this should be
> supported now, is there any specific reason for that?

Hello,

a525108cf1cc powerpc/mm/hugetlb: Filter out hugepage size not supported
by page table layout

Should reject unsupported huge page sizes, right?

Thanks

Michal

Re: [PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor

2017-06-22 Thread Michael Ellerman

Nicholas Piggin  writes:

> On Thu, 22 Jun 2017 00:08:37 +0530
> "Naveen N. Rao"  wrote:
>
>> Currently, we assume that the function pointer we receive in
>> ppc_function_entry() points to a function descriptor. However, this is
>> not always the case. In particular, assembly symbols without the right
>> annotation do not have an associated function descriptor. Some of these
>> symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().
>> When such addresses are subsequently processed through
>> arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
>> below errors during bootup:
>> [0.663963] Failed to find blacklist at 7d9b02a648029b6c
>> [0.663970] Failed to find blacklist at a14d03d0394a0001
>> [0.663972] Failed to find blacklist at 7d5302a6f94d0388
>> [0.663973] Failed to find blacklist at 48027d11e8610178
>> [0.663974] Failed to find blacklist at f8010070f8410080
>> [0.663976] Failed to find blacklist at 386100704801f89d
>> [0.663977] Failed to find blacklist at 7d5302a6f94d00b0
>> 
>> Fix this by checking if the address in the function descriptor is
>> actually a valid kernel address. In the case of assembly symbols, this
>> will almost always fail as this ends up being powerpc instructions. In
>> that case, return pointer to the address we received, rather than the
>> dereferenced value.
>> 
>> Signed-off-by: Naveen N. Rao 
>> ---
>>  arch/powerpc/include/asm/code-patching.h | 10 +-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/include/asm/code-patching.h 
>> b/arch/powerpc/include/asm/code-patching.h
>> index abef812de7f8..ec54050be585 100644
>> --- a/arch/powerpc/include/asm/code-patching.h
>> +++ b/arch/powerpc/include/asm/code-patching.h
>> @@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void *func)
>>   * On PPC64 ABIv1 the function pointer actually points to the
>>   * function's descriptor. The first entry in the descriptor is the
>>   * address of the function text.
>> + *
>> + * However, we may have received a pointer to an assembly symbol
>> + * that may not be a function descriptor. Validate that the entry
>> + * points to a valid kernel address and if not, return the pointer
>> + * we received as is.
>>   */
>> -return ((func_descr_t *)func)->entry;
>> +if (kernel_text_address(((func_descr_t *)func)->entry))
>> +return ((func_descr_t *)func)->entry;
>> +else
>> +return (unsigned long)func;
>
> What if "func" is a text section label (bare asm function)?
> Won't func->entry load the random instruction located there
> and compare it with a kernel address?

Yes, that's the problem.

> I don't know too much about the v1 ABI, but should we check for
> func belonging in the .opd section first and base the check on
> that? Alternatively I if "func" is in the kernel text address,
> we can recognize it's not in the .opd section... right?

That sounds like a more robust solution. But I suspect it won't work for
modules.

Another option might be to canonicalise the blacklist so that it always
points to the text address, not sure how easy that would be.

cheers

Re: [next-20170619][a28e46f] WARNING: CPU: 5 PID: 1 at arch/powerpc/lib/feature-fixups.c:214 .check_features+0xa0/0xbc

2017-06-22 Thread Michael Ellerman

Thanks for testing this configuration Abdul.

Abdul Haleem  writes:
> Hi Michael,
>
> linux-next booted with warnings on a Power LPAR with CMO feature
> enabled.
>
> Test : boot with CMO enabled 
> Kernel: 4.12.0-rc5-next-20170619
> gcc : 4.8.5
> Machine : Power 8 Power VM LPAR (Big Endian)
>
> Steps to recreate:
>
> 1. Enable Shared Memory/CPU Mode for LPAR in HMC (lpar profile)
> 2. Build the kernel for the attached config
> 3. Boot the kernel, boot messages shows below warning
>
> Firmware features changed after feature patching!
> [ cut here ]
> WARNING: CPU: 5 PID: 1 at
> arch/powerpc/lib/feature-fixups.c:214 .check_features+0xa0/0xbc
> Modules linked in:
> CPU: 5 PID: 1 Comm: swapper/5 Not tainted 4.12.0-rc5-next-20170619-autotest #1
> task: c003bbc2 task.stack: c003bbc8
> NIP: c0ce599c LR: c0ce5998 CTR: 006338e4
> REGS: c003bbc83910 TRAP: 0700   Not tainted  
> (4.12.0-rc5-next-20170619-autotest)
> MSR: 8282b032 
>   CR: 2222  XER: 0002  
> CFAR: c024368c SOFTE: 1 
> GPR00: c0ce5998 c003bbc83b90 c132e800 0031 
> GPR04:  80380fe53ec0   
> GPR08:  c11999f8 c11999f8 3ff0 
> GPR12: 2824 ce743480 c000d380  
> GPR16:     
> GPR20:     
> GPR24:  c12aad08 c0cc0b88 00e8 
> GPR28: c0d3ea38 0007  c1375d20 
> NIP [c0ce599c] .check_features+0xa0/0xbc
> LR [c0ce5998] .check_features+0x9c/0xbc
> Call Trace:
> [c003bbc83b90] [c0ce5998] .check_features+0x9c/0xbc
> (unreliable)
> [c003bbc83c10] [c000cb1c] .do_one_initcall+0x5c/0x1c0
> [c003bbc83ce0] [c0cd47d0] .kernel_init_freeable+0x270/0x350
> [c003bbc83db0] [c000d39c] .kernel_init+0x1c/0x140
> [c003bbc83e30] [c000b3e4] .ret_from_kernel_thread+0x58/0x74
> Instruction dump:
> 0fe0 6000 3d02ffa1 e92296e0 e94851d0 e929 7faa4800 41fe0018 
> 3c62ff89 3863e6f0 4b55dcb9 6000 <0fe0> 38210080 3860
> e8010010 
> random: 0x60003d22001e get_random_bytes called with crng_init=0
> ---[ end trace c0f890377edaeb14 ]---
> Running MSI bitmap self-tests ... 
>
> Possible bad commit looks to be:

That's not the bad commit ...

> commit a28e46f109c9637b2539b9995078d5df4f7f6c09
> Author: Michael Ellerman 
> Date:   Tue Jul 26 22:29:18 2016 +1000
>
> powerpc/kernel: Check features don't change after patching
> 
> Early in boot we binary patch some sections of code based on the CPU and
> MMU feature bits. But it is a one-time patching, there is no facility
> for repatching the code later if the set of features change.
> 
> It is a major bug if the set of features changes after we've done the
> code patching - so add a check for it.

... it's the commit that adds the warning that something is dodgy.

In this case it's a false positive, FW_FEATURE_CMO is not used in
assembly, so it's OK (but a bit dicey) that we change it after
do_feature_fixups().

But we should still fix it so that we enable/disable FW_FEATURE_CMO
before patching.

cheers

[linux-next][489e45][PowerPC] build broke on bare-metal with gcc 4.8.5

2017-06-22 Thread Abdul Haleem

Hi,

next-20170621 fails to build on PowerPC bare-metal with gcc 4.8.5

Test : build
Machine : Power 8 bare-metal (BMC)
kernel : 4.12.0-rc6
gcc : 4.8.5
config : Hab-NV-config 


$ make -S -j 
In file included from drivers/net/ethernet/qlogic/qede/qede.h:43:0,
 from drivers/net/ethernet/qlogic/qede/qede_main.c:63:
./include/linux/qed/qede_rdma.h:85:1: error: expected identifier or ‘(’
before ‘{’ token
 {
 ^
./include/linux/qed/qede_rdma.h:84:19: warning: ‘qede_rdma_dev_add’ used
but never defined [enabled by default]
 static inline int qede_rdma_dev_add(struct qede_dev *dev);
   ^
make[5]: *** [drivers/net/ethernet/qlogic/qede/qede_main.o] Error 1
make[5]: *** Waiting for unfinished jobs

Possible bad commit looks to be:

commit 489e45ae42f000a5e045ac203ad5d6f08824cd56
Author: Sudarsana Reddy Kalluru 
Date:   Wed Jun 8 06:22:12 2016 -0400

qede: Add dcbnl support.

This patch adds the interfaces for ieee/cee dcbnl callbacks and registers
them with the kernel.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Yuval Mintz 
Signed-off-by: David S. Miller 

diff --git a/drivers/net/ethernet/qlogic/qede/qede_dcbnl.c 
b/drivers/net/ethernet/qlogic/qede/qede_dcbnl.c
new file mode 100644
index 000..03e8c02
--- /dev/null
+++ b/drivers/net/ethernet/qlogic/qede/qede_dcbnl.c


-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre


#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.9.0-rc8 Kernel Configuration
#
CONFIG_PPC64=y

#
# Processor support
#
CONFIG_PPC_BOOK3S_64=y
# CONFIG_PPC_BOOK3E_64 is not set
# CONFIG_POWER7_CPU is not set
CONFIG_POWER8_CPU=y
CONFIG_PPC_BOOK3S=y
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_VSX=y
# CONFIG_PPC_ICSWX is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_64=y
CONFIG_PPC_RADIX_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_PPC_HAVE_PMU_SUPPORT=y
CONFIG_PPC_PERF_CTRS=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2048
CONFIG_PPC_DOORBELL=y
# CONFIG_CPU_BIG_ENDIAN is not set
CONFIG_CPU_LITTLE_ENDIAN=y
CONFIG_PPC64_BOOT_WRAPPER=y
CONFIG_64BIT=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_MMU=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NR_IRQS=512
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_ARCH_HAS_ILOG2_U64=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK=y
CONFIG_PPC=y
CONFIG_GENERIC_CSUM=y
CONFIG_EARLY_PRINTK=y
CONFIG_PANIC_TIMEOUT=180
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_PPC_UDBG_16550=y
# CONFIG_GENERIC_TBSYNC is not set
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EPAPR_BOOT=y
# CONFIG_DEFAULT_UIMAGE is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_PPC_DCR_NATIVE is not set
# CONFIG_PPC_DCR_MMIO is not set
# CONFIG_PPC_OF_PLATFORM_PCI is not set
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_PPC_EMULATE_SSTEP=y
CONFIG_ZONE_DMA32=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_XZ is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_IRQ_DOMAIN_DEBUG=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_GENERIC_TIME_VSYSCALL_OLD=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_EXPEDITE_BOOT

[RFC v2 3/3] cxl: Add memory barrier to guarantee TLBI scope

2017-06-22 Thread Frederic Barrat

With the hash memory model, all TLBIs become global when the cxl
driver is active, i.e. as soon as one context is open.
It is theoretically possible to send a TLBI with the wrong scope as
there's currently no memory barrier between when the driver is marked
as in use, and attaching a context to the device, therefore we are
exposed to re-ordering. It is highly unlikely as the use count for the
driver is incremented on open() and the attachment to the device
happens on a different system call (ioctl)

Signed-off-by: Frederic Barrat 
---
 include/misc/cxl-base.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/misc/cxl-base.h b/include/misc/cxl-base.h
index b2ebc91fe09a..dcb6d38ab3ad 100644
--- a/include/misc/cxl-base.h
+++ b/include/misc/cxl-base.h
@@ -25,12 +25,28 @@ extern atomic_t cxl_use_count;
 
 static inline bool cxl_ctx_in_use(void)
 {
-   return (atomic_read(_use_count) != 0);
+   /*
+* This is called when sending an TLBI, to know whether it
+* should be global or local.
+*
+* We need to make sure the PTE update is happening before
+* reading the context global flag. Otherwise, reading the
+* flag may be re-ordered and happen first, and we could end
+* up in a situation where the old PTE is seen by the device,
+* but the TLBI is not global.
+*/
+   smp_mb();
+   return (atomic_read(_use_count) != 0);
 }
 
 static inline void cxl_ctx_get(void)
 {
atomic_inc(_use_count);
+   /*
+   * Barrier guarantees that the device will receive all TLBIs
+   * from that point on
+   */
+   smp_wmb();
 }
 
 static inline void cxl_ctx_put(void)
-- 
2.11.0

[RFC v2 2/3] cxl: Mark context requiring global TLBIs

2017-06-22 Thread Frederic Barrat

The PSL needs to see all TLBIs pertinent to the memory contexts used
on the cxl adapter. For the hash memory model, it was done by making
all TLBIs global as soon as the cxl driver is in use. For radix, we
need something similar, but we can refine and only convert to global
the invalidations for contexts actually used by the device.

So mark the contexts being attached to the cxl adapter as requiring
global TLBIs.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/api.c  | 12 ++--
 drivers/misc/cxl/file.c | 12 ++--
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 1a138c83f877..2cebe214195e 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -332,8 +332,17 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
cxl_context_mm_count_get(ctx);
 
/* decrement the use count */
-   if (ctx->mm)
+   if (ctx->mm) {
mmput(ctx->mm);
+#ifdef CONFIG_PPC_BOOK3S_64
+   mm_context_set_global_tlbi(>mm->context);
+   /*
+* Barrier guarantees that the device will
+* receive all TLBIs from that point on
+*/
+   smp_wmb();
+#endif
+   }
}
 
cxl_ctx_get();
@@ -347,7 +356,6 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
cxl_context_mm_count_put(ctx);
goto out;
}
-
ctx->status = STARTED;
 out:
mutex_unlock(>status_mutex);
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 0761271d68c5..3ba6a60e0a6d 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -222,8 +222,17 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
cxl_context_mm_count_get(ctx);
 
/* decrement the use count */
-   if (ctx->mm)
+   if (ctx->mm) {
mmput(ctx->mm);
+#ifdef CONFIG_PPC_BOOK3S_64
+   mm_context_set_global_tlbi(>mm->context);
+   /*
+* Barrier guarantees that the device will receive all
+* TLBIs from that point on
+*/
+   smp_wmb();
+#endif
+   }
 
trace_cxl_attach(ctx, work.work_element_descriptor, 
work.num_interrupts, amr);
 
@@ -236,7 +245,6 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
cxl_context_mm_count_put(ctx);
goto out;
}
-
ctx->status = STARTED;
rc = 0;
 out:
-- 
2.11.0

[RFC v2 1/3] powerpc/mm: Add marker for contexts requiring global TLB invalidations

2017-06-22 Thread Frederic Barrat

Introduce a new 'flags' attribute per context and define its first bit
to be a marker requiring all TLBIs for that context to be broadcasted
globally. Once that marker is set on a context, it cannot be removed.

Such a marker is useful for memory contexts used by devices behind the
NPU and CAPP/PSL. The NPU and the PSL keep their own translation cache
so they need to see all the TLBIs for those contexts.

Signed-off-by: Frederic Barrat 
---
 arch/powerpc/include/asm/book3s/64/mmu.h | 18 ++
 arch/powerpc/include/asm/tlb.h   | 23 +--
 arch/powerpc/mm/mmu_context_book3s64.c   |  1 +
 3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3e3811..cd83f8eb6a3f 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -78,8 +78,12 @@ struct spinlock;
 /* Maximum possible number of NPUs in a system. */
 #define NV_MAX_NPUS 8
 
+/* Bits definition for the context flags */
+#define MM_GLOBAL_TLBIE0   /* TLBI must be global */
+
 typedef struct {
mm_context_id_t id;
+   unsigned long flags;
u16 user_psize; /* page size index */
 
/* NPU NMMU context */
@@ -164,5 +168,19 @@ extern void radix_init_pseries(void);
 static inline void radix_init_pseries(void) { };
 #endif
 
+/*
+ * Mark the memory context as requiring global TLBIs, when used by
+ * GPUs or CAPI accelerators managing their own TLB or ERAT.
+*/
+static inline void mm_context_set_global_tlbi(mm_context_t *ctx)
+{
+   set_bit(MM_GLOBAL_TLBIE, >flags);
+}
+
+static inline bool mm_context_get_global_tlbi(mm_context_t *ctx)
+{
+   return test_bit(MM_GLOBAL_TLBIE, >flags);
+}
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_MMU_H_ */
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 609557569f65..87d4ddcbf7f8 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -71,8 +71,27 @@ static inline int mm_is_core_local(struct mm_struct *mm)
 
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
-   return cpumask_equal(mm_cpumask(mm),
- cpumask_of(smp_processor_id()));
+   int rc;
+
+   rc = cpumask_equal(mm_cpumask(mm),
+   cpumask_of(smp_processor_id()));
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (rc) {
+   /*
+* Check if context requires global TLBI.
+*
+* We need to make sure the PTE update is happening
+* before reading the context global flag. Otherwise,
+* reading the flag may be re-ordered and happen
+* first, and we could end up in a situation where the
+* old PTE was seen by a device, but the TLBI is not
+* global.
+*/
+   smp_mb();
+   rc = !mm_context_get_global_tlbi(>context);
+   }
+#endif
+   return rc;
 }
 
 #else
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index a3edf813d455..c32a3f729d81 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -156,6 +156,7 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
return index;
 
mm->context.id = index;
+   mm->context.flags = 0;
 #ifdef CONFIG_PPC_ICSWX
mm->context.cop_lockp = kmalloc(sizeof(spinlock_t), GFP_KERNEL);
if (!mm->context.cop_lockp) {
-- 
2.11.0

[RFC v2 0/3] powerpc/mm: Mark memory contexts requiring global TLBIs

2017-06-22 Thread Frederic Barrat

capi2 and opencapi require the TLB invalidations being sent for
addresses used on the cxl adapter or opencapi device to be global, as
there's a translation cache in the PSL (for capi2) or NPU (for
opencapi). The CAPP (for PSL) and NPU snoop the power bus.

This is not new: for the hash memory model, as soon as the cxl driver
is active, all local TLBIs become global. We need a similar mechanism
for the radix memory model. This patch tries to improve things a bit
by flagging the contexts requiring global TLBIs, therefore limiting
the "upgrade" and not affecting contexts not used by the card.

A longer-term goal is to modify the current implementation for hash to
follow the same direction, i.e. identify contexts needing global
TLBIs, but that will be for later. It would be required to support
hash for opencapi.

Changelog:
>From previous comments:
 - rename MM_CONTEXT_GLOBAL_TLBI -> MM_GLOBAL_TLBIE
 - add memory barriers to make sure the device doesn't miss any TLBI
 - also add barrier for the hash implemention to fix the same issue
 
Frederic Barrat (3):
  powerpc/mm: Add marker for contexts requiring global TLB invalidations
  cxl: Mark context requiring global TLBIs
  cxl: Add memory barrier to guarantee TLBI scope

 arch/powerpc/include/asm/book3s/64/mmu.h | 18 ++
 arch/powerpc/include/asm/tlb.h   | 23 +--
 arch/powerpc/mm/mmu_context_book3s64.c   |  1 +
 drivers/misc/cxl/api.c   | 12 ++--
 drivers/misc/cxl/file.c  | 12 ++--
 include/misc/cxl-base.h  | 18 +-
 6 files changed, 77 insertions(+), 7 deletions(-)

-- 
2.11.0

[RFC PATCH 0/2] guest exploitation of the XIVE interrupt controller

2017-06-22 Thread Cédric Le Goater

Hello,

I am currently working on a XIVE model for the pseries QEMU machine
and I started hacking Linux to have something to test with. 

Here's a first draft of XIVE support for the guest. This is work in
progress, only tested with the QEMU XIVE model I am working
on. Feedback most welcome. 

Thanks,

C.

Cédric Le Goater (2):
  powerpc/xive: guest exploitation of the XIVE interrupt controller
  powerpc/xive: add XIVE exploitation mode to CAS

 arch/powerpc/include/asm/hvcall.h  |  13 +-
 arch/powerpc/include/asm/xive.h|   1 +
 arch/powerpc/kernel/prom_init.c|  15 +-
 arch/powerpc/platforms/pseries/Kconfig |   1 +
 arch/powerpc/platforms/pseries/setup.c |   8 +-
 arch/powerpc/platforms/pseries/smp.c   |  18 +-
 arch/powerpc/sysdev/xive/Kconfig   |   5 +
 arch/powerpc/sysdev/xive/Makefile  |   1 +
 arch/powerpc/sysdev/xive/xive-hv.c | 523 +
 9 files changed, 580 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/sysdev/xive/xive-hv.c

-- 
2.7.5

1 2 >

1 - 100 of 113 matches

Mail list logo