Re: G5 Quad hangs early on 4.20.2 / 5.0-rc2+

2019-01-15 Thread Benjamin Herrenschmidt
On Tue, 2019-01-15 at 23:49 +0100, Tobias Ulmer wrote:
> Hi,
> 
> both the latest stable 4.20.2 and 5.0 rc2+ hang early on the G5 Quad.
> 
> Surely I'm not the first to run into this, but I couldn't find any
> discussion or bug report. Sorry if you're already aware.
> 
> You can see it hang here (5.0 rc2+, 4.20.2 is nearly identical) until
> the watchdog triggers a reboot:
> 
> https://i.imgur.com/UiCVRuG.jpg
> 
> If I had to make an uneducated guess, it seems to boot into the same
> codepath twice (mpic was already initialized, then it starts again right
> after smp bringup). Maybe on a second CPU?
> 
> To narrow it down a little, my last known good was 4.18.9

I don't think it's an MPIC-related problem, but it does appear to hang
about when interrupts get turned on.

I have one of these critters in the office, but I'm working remotely
this week so I won't be able to dig into this until next week.

It might help if you could bisect in the meantime.

Cheers,
Ben.




Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-08 Thread Benjamin Herrenschmidt
On Wed, 2019-01-09 at 17:32 +1100, Alexey Kardashevskiy wrote:
> I have just moved the "Mellanox Technologies MT27700 Family
> [ConnectX-4]" from garrison to firestone machine and there it does not
> produce an EEH, with the same kernel and skiboot (both upstream + my
> debug). Hm. I cannot really blame the card but I cannot see what could
> cause the difference in skiboot either. I even tried disabling NPU so
> garrison would look like firestone, still EEH'ing.

The systems have a different chip though, firestone is P8 and garrison
is P8', which has a slightly different PHB revision. Worth checking if we
have anything significantly different in our inits and poke at the HW
guys.

BTW. Are the cards behind a switch in either case ?

Cheers,
Ben.




Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-08 Thread Benjamin Herrenschmidt
On Wed, 2019-01-09 at 15:53 +1100, Alexey Kardashevskiy wrote:
> "A PCI completion timeout occurred for an outstanding PCI-E transaction"
> it is.
> 
> This is how I bind the device to vfio:
> 
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.0/driver_override'
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.1/driver_override'
> echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'
> echo '0000:01:00.1' > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
> echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
> echo '0000:01:00.1' > /sys/bus/pci/drivers/vfio-pci/bind
> 
> 
> and I noticed that EEH only happens with the last command. The order
> (.0,.1  or .1,.0) does not matter, it seems that putting one function to
> D3 is fine but putting another one when the first one is already in D3 -
> produces EEH. And I do not recall ever seeing this on the firestone
> machine. Weird.

Putting all functions into D3 is what allows the device to actually go
into D3.
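
As a side note, you can watch the state transitions from userspace:
the power state is in the low two bits of PMCSR (offset 4 into the PM
capability), e.g. with setpci (adjust the addresses to yours):

    setpci -s 0000:01:00.0 CAP_PM+4.w    # bits 1:0: 0 = D0, 3 = D3hot
    setpci -s 0000:01:00.1 CAP_PM+4.w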

Does it work with other devices ? We do have that bug on early P9
revisions where the attempt to bring the link to L1 as part of the
D3 process fails in horrible ways. I thought P8 would be OK, but maybe
not ...

Otherwise, it might be that our timeouts are too low (you may want to
talk to our PCIe guys internally)

Cheers,
Ben.




Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-08 Thread Benjamin Herrenschmidt
On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
> 
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> > 
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
> 
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?

No, it's 100% reproducible on systems with that specific card model,
not card instance, and maybe different systems/cards as well, I'll let
David & Alexey comment further on that.

> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?

Nope.

> Everytime or sometimes?
> 
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
> 
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

We are IBM :-) So far, it seems to be that the card is doing something
not quite right, but we don't know what. We might need to engage
Mellanox themselves.

Cheers,
Ben.




Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-05 Thread Benjamin Herrenschmidt
On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> 
> > Interesting.  I've investigated this further, though I don't have as
> > many new clues as I'd like.  The problem occurs reliably, at least on
> > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > I don't yet know if it occurs with other machines, I'm having trouble
> > getting access to other machines with a suitable card.  I didn't
> > manage to reproduce it on a different POWER8 machine with a
> > ConnectX-5, but I don't know if it's the difference in machine or
> > difference in card revision that's important.
> 
> Making sure the card has the latest firmware is always good advice..
> 
> > So possibilities that occur to me:
> >   * It's something specific about how the vfio-pci driver uses D3
> > state - have you tried rebinding your device to vfio-pci?
> >   * It's something specific about POWER, either the kernel or the PCI
> > bridge hardware
> >   * It's something specific about this particular type of machine
> 
> Does the EEH indicate what happend to actually trigger it?

In a very cryptic way that requires manual parsing using non-public
docs sadly but yes. From the look of it, it's a completion timeout.

Looks to me like we don't get a response to a config space access
during the change of D state. I don't know if it's the write of the D3
state itself or the read back though (it's probably detected on the
read back or a subsequent read, but that doesn't tell me which specific
one failed).

Some extra logging in OPAL might help pin that down by checking the InA
error state in the config accessor after the config write (and polling
on it for a while as from a CPU perspective I don't know if the write is
synchronous, probably not).
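
Roughly along these lines (a sketch only, these helper names are made
up, the real skiboot accessors differ):

    /* after the config write that sets D3hot ... */
    pci_cfg_write16(phb, bdfn, pm_cap + 4, D3HOT);
    for (i = 0; i < 100; i++) {               /* poll for ~100ms */
            if (phb_check_ina_error(phb))     /* made-up accessor */
                    log_ina_state(phb);
            time_wait_ms(1);
    }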

Cheers,
Ben.




Re: Runtime warnings in powerpc code

2018-12-31 Thread Benjamin Herrenschmidt
On Thu, 2018-12-27 at 11:05 -0800, Guenter Roeck wrote:
> Hi,
> 
> I am getting the attached runtime warnings when enabling certain debug
> options in powerpc code. The warnings are seen with pretty much all
> platforms, and all active kernel releases.
> 
> Question: Is it worthwhile to keep building / testing powerpc builds
> with the respective debug options enabled, and report it once in a while,
> or should I just disable the options ?

I've been fixing some issues with ppc32 and some of that stuff. I sent
some experimental patches updating 4xx and Christoph sent 8xx variants;
I still need to go through the other ones.

Cheers,
Ben.


> Thanks,
> Guenter
> 
> ---
> CONFIG_DEBUG_ATOMIC_SLEEP
> 
> [ cut here ]
> do not call blocking ops when !TASK_RUNNING; state=2 set at [<(ptrval)>]
> prepare_to_wait+0x54/0xe4
> WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:6099 __might_sleep+0x94/0x9c
> Modules linked in:
> CPU: 0 PID: 1 Comm: init Not tainted 4.20.0-yocto-standard+ #1
> NIP:  c00667a0 LR: c00667a0 CTR: 
> REGS: cf8df8c0 TRAP: 0700   Not tainted  (4.20.0-yocto-standard+)
> MSR:  00029032   CR: 2822  XER: 2000
> 
> GPR00: c00667a0 cf8df970 cf8e 0062 c0af15c8 0007 fa1ae97e 
> 757148e2 
> GPR08: cf8de000    1f386000   
> cfd83c8c 
> GPR16: 0004 0004 0004  060c 000a cf8dfdb8 
> cf267804 
> GPR24: cf8dfd78 cf8dfd68 cfa88a20 cec70830 0001  01d3 
> c0b444cc 
> NIP [c00667a0] __might_sleep+0x94/0x9c
> LR [c00667a0] __might_sleep+0x94/0x9c
> Call Trace:
> [cf8df970] [c00667a0] __might_sleep+0x94/0x9c (unreliable)
> [cf8df990] [c05beddc] do_ide_request+0x48/0x6bc
> [cf8dfa10] [c0492bcc] __blk_run_queue+0x80/0x10c
> [cf8dfa20] [c049a938] blk_flush_plug_list+0x23c/0x258
> [cf8dfa60] [c006b888] io_schedule_prepare+0x44/0x5c
> [cf8dfa70] [c006b8c0] io_schedule+0x20/0x48
> [cf8dfa80] [c095e1ac] bit_wait_io+0x24/0x74
> [cf8dfa90] [c095dd94] __wait_on_bit+0xac/0x104
> [cf8dfab0] [c095de74] out_of_line_wait_on_bit+0x88/0x98
> [cf8dfae0] [c0229094] bh_submit_read+0xf8/0x104
> [cf8dfaf0] [c028b9a8] ext4_get_branch+0xdc/0x168
> [cf8dfb20] [c028c7fc] ext4_ind_map_blocks+0x2b0/0xc08
> [cf8dfc30] [c029551c] ext4_map_blocks+0x2e0/0x65c
> [cf8dfc80] [c02b8c84] ext4_mpage_readpages+0x5e8/0x97c
> [cf8dfd60] [c016c3cc] read_pages+0x60/0x1a0
> [cf8dfdb0] [c016c6e8] __do_page_cache_readahead+0x1dc/0x208
> [cf8dfe10] [c0159768] filemap_fault+0x418/0x834
> [cf8dfe50] [c02a00fc] ext4_filemap_fault+0x40/0x64
> [cf8dfe60] [c0198d0c] __do_fault+0x34/0xb8
> [cf8dfe70] [c019e264] handle_mm_fault+0xc44/0xf88
> [cf8dfef0] [c001a218] __do_page_fault+0x158/0x7b4
> [cf8dff40] [c00143b4] handle_page_fault+0x14/0x40
> --- interrupt: 301 at 0xb7904a70
> LR = 0xb78ef0c8
> Instruction dump:
> 7fe3fb78 bba10014 7c0803a6 38210020 4bfffd20 808a 3c60c0b0 3941 
> 7cc53378 3863a558 99490001 4bfd03bd <0fe0> 4bc0 7c0802a6 90010004 
> irq event stamp: 126702
> hardirqs last  enabled at (126701): [] console_unlock+0x434/0x5d0
> hardirqs last disabled at (126702): [] reenable_mmu+0x30/0x88
> softirqs last  enabled at (126552): [] __do_softirq+0x42c/0x4a0
> softirqs last disabled at (126529): [] irq_exit+0x104/0x108
> ---[ end trace 4f6c84b7815474d9 ]---
> 
> ---
> #if defined(CONFIG_PROVE_LOCKING) && defined(CONFIG_DEBUG_LOCKDEP) && \
> defined(CONFIG_TRACE_IRQFLAGS)
> 
> [ cut here ]
> DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)
> WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3762
> check_flags.part.25+0x1a0/0x1c4
> Modules linked in:
> CPU: 0 PID: 1 Comm: init Tainted: GW 4.20.0-yocto-standard+ #1
> NIP:  c00839f0 LR: c00839f0 CTR: 
> REGS: cf8dfe00 TRAP: 0700   Tainted: GW
> (4.20.0-yocto-standard+)
> MSR:  00021032   CR: 2828  XER: 2000
> 
> GPR00: c00839f0 cf8dfeb0 cf8e 002f 0001 c00938f4 c1425b76 
> 002f 
> GPR08: 1032  0001 0004 28282828   
> b7927688 
> GPR16:  bfe20a5c bfe20a58 0fe5fff8 1b38 10002178  
> b7929c20 
> GPR24: c095d81c 7c9319ee   b7929ae0 cf8e 9032 
> c0d2 
> NIP [c00839f0] check_flags.part.25+0x1a0/0x1c4
> LR [c00839f0] check_flags.part.25+0x1a0/0x1c4
> Call Trace:
> [cf8dfeb0] [c00839f0] check_flags.part.25+0x1a0/0x1c4 (unreliable)
> [cf8dfec0] [c0085f6c] lock_is_held_type+0x78/0xb4
> [cf8dfee0] [c095d35c] __schedule+0x6cc/0xb44
> [cf8dff30] [c095d81c] schedule+0x48/0xb8
> [cf8dff40] [c0014694] recheck+0x0/0x20
> --- interrupt: 501 at 0xb78f2850
> LR = 0xb78f2a24
> Instruction dump:
> 3c80c0b0 3c60c0af 3884d684 38635f94 4bfb3189 0fe0 4bfffec8 3c80c0b0 
> 3c60c0af 3884d668 38635f94 4bfb316d <0fe0> 4bfffefc 3c80c0b0 3c60c0af 
> irq event stamp: 127630
> hardirqs last  enabled at (127629): []
> _raw_spin_unlock_irq+0x3c/0x94
> hardirqs last disabled at (127630): [] reenable_mmu+0x30/0x88
> softirqs 

Re: [PATCH 2/8] powerpc: remove CONFIG_PCI_QSPAN

2018-12-26 Thread Benjamin Herrenschmidt
On Wed, 2018-10-17 at 10:01 +0200, Christoph Hellwig wrote:
> This option isn't actually used anywhere.

Oh my, that's ancient. It probably didn't make the cut from arch/ppc to
arch/powerpc.

> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/Kconfig | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index a80669209155..e8c8970248bc 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -955,7 +955,6 @@ config PCI
>   bool "PCI support" if PPC_PCI_CHOICE
>   default y if !40x && !CPM2 && !PPC_8xx && !PPC_83xx \
>   && !PPC_85xx && !PPC_86xx && !GAMECUBE_COMMON
> - default PCI_QSPAN if PPC_8xx
>   select GENERIC_PCI_IOMAP
>   help
> Find out whether your system includes a PCI bus. PCI is the name of
> @@ -969,14 +968,6 @@ config PCI_DOMAINS
>  config PCI_SYSCALL
>   def_bool PCI
>  
> -config PCI_QSPAN
> - bool "QSpan PCI"
> - depends on PPC_8xx
> - select PPC_I8259
> - help
> -   Say Y here if you have a system based on a Motorola 8xx-series
> -   embedded processor with a QSPAN PCI interface, otherwise say N.
> -
>  config PCI_8260
>   bool
>   depends on PCI && 8260



Re: trace_hardirqs_on/off vs. extra stack frames

2018-12-21 Thread Benjamin Herrenschmidt
On Thu, 2018-12-20 at 21:02 -0500, Steven Rostedt wrote:
> On Fri, 21 Dec 2018 12:11:35 +1100
> Benjamin Herrenschmidt  wrote:
> 
> > Hi Steven !
> > 
> > I'm trying to untangle something, and I need your help :-)
> > 
> > In commit 3cb5f1a3e58c0bd70d47d9907cc5c65192281dee, you added a dummy
> > stack frame around the assembly calls to trace_hardirqs_on/off on the
> > grounds that when using the latency tracer (irqsoff), you might poke at
> > CALLER_ADDR1 and that could blow up if there's only one frame at hand.
> > 
> > However, I can't see where it would be doing that. lockdep.c only uses
> > CALLER_ADDR0 and irqsoff uses the values passed by it. In fact, that
> > was already the case when the above commit was merged.
> > 
> > I tried on a 32-bit kernel to remove the dummy stack frame with no
> > issue so far  (though I do get stupid values reported with or
> > without a stack frame, but I think that's normal, looking into it).
> 
> BTW, I only had a 64 bit PPC working, so I would have been testing that.
> 
> > The reason I'm asking is that we have other code paths, on return
> > from interrupts for example, at least on 32-bits where we call the
> > tracing without the extra stack frame, and I have yet to see it crash.
> 
> Have you tried enabling the irqsoff tracer and running it for a while?
> 
>  echo irqsoff > /sys/kernel/debug/tracing/current_tracer
> 
> The problem is that when we come from user space, and we disable
> interrupts in the entry code, it calls into the irqsoff tracer:
> 
> [ in userspace ]
> 
> [ in kernel ]
> bl .trace_hardirqs_off
> 
>   kernel/trace/trace_preemptirq.c:
> 
>trace_hardirqs_off(CALLER_ADDR_0, CALLER_ADDR1)
> 
> IIRC, without the stack frame, that CALLER_ADDR1 can end up having the
> kernel read garbage.

You're right, I was looking at too old a tree, where trace_hardirqs_* is
implemented in kernel/locking/lockdep.c and only uses CALLER_ADDR0.
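
For the archives, the relevant code is in kernel/trace/trace_preemptirq.c
(simplified):

    void trace_hardirqs_off(void)
    {
            ...
            trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
    }

CALLER_ADDR1 boils down to __builtin_return_address(1), i.e. a walk one
frame up the stack, which is garbage if the asm caller didn't build a
frame.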

> 
> -- Steve
> 
> 
> > I wonder if the commit and bug fix above relates to some older code
> > that no longer existed even at the point where the commit was
> merged...



Re: [PATCH 01/33] powerpc: use mm zones more sensibly

2018-12-21 Thread Benjamin Herrenschmidt
On Tue, 2018-10-09 at 15:24 +0200, Christoph Hellwig wrote:
>   * Find the least restrictive zone that is entirely below the
> @@ -324,11 +305,14 @@ void __init paging_init(void)
> printk(KERN_DEBUG "Memory hole size: %ldMB\n",
>(long int)((top_of_ram - total_ram) >> 20));
>  
> +#ifdef CONFIG_ZONE_DMA
> +   max_zone_pfns[ZONE_DMA] = min(max_low_pfn, 0x7fffffffUL >> PAGE_SHIFT);
> +#endif
> +   max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
>  #ifdef CONFIG_HIGHMEM
> -   limit_zone_pfn(ZONE_NORMAL, lowmem_end_addr >> PAGE_SHIFT);
> +   max_zone_pfns[ZONE_HIGHMEM] = max_pfn
   ^
Missing a  ";" here  --|

Sorry ... works with that fix on an old laptop with highmem.

>  #endif
> -   limit_zone_pfn(TOP_ZONE, top_of_ram >> PAGE_SHIFT);
> -   zone_limits_final = true;
> +
> free_area_init_nodes(max_zone_pfns);
>  



trace_hardirqs_on/off vs. extra stack frames

2018-12-20 Thread Benjamin Herrenschmidt
Hi Steven !

I'm trying to untangle something, and I need your help :-)

In commit 3cb5f1a3e58c0bd70d47d9907cc5c65192281dee, you added a dummy
stack frame around the assembly calls to trace_hardirqs_on/off on the
grounds that when using the latency tracer (irqsoff), you might poke at
CALLER_ADDR1 and that could blow up if there's only one frame at hand.

However, I can't see where it would be doing that. lockdep.c only uses
CALLER_ADDR0 and irqsoff uses the values passed by it. In fact, that
was already the case when the above commit was merged.

I tried on a 32-bit kernel to remove the dummy stack frame with no
issue so far  (though I do get stupid values reported with or
without a stack frame, but I think that's normal, looking into it).

The reason I'm asking is that we have other code paths, on return
from interrupts for example, at least on 32-bits where we call the
tracing without the extra stack frame, and I have yet to see it crash.

I wonder if the commit and bug fix above relates to some older code
that no longer existed even at the point where the commit was merged...

Any idea ?

Cheers,
Ben.



Re: [RFC/WIP] powerpc: Fix 32-bit handling of MSR_EE on exceptions

2018-12-20 Thread Benjamin Herrenschmidt


> >   /*
> >* MSR_KERNEL is > 0x10000 on 4xx/Book-E since it include MSR_CE.
> > @@ -205,20 +208,46 @@ transfer_to_handler_cont:
> > mflrr9
> > lwz r11,0(r9)   /* virtual address of handler */
> > lwz r9,4(r9)/* where to go when done */
> > +#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PERF_EVENTS)
> > +   mtspr   SPRN_NRI, r0
> > +#endif
> 
> That's not part of your patch, it's already in the tree.

Yup rebase glitch.

 .../...

> I tested it on the 8xx with the below changes in addition. No issue seen 
> so far.

Thanks !

I'll merge that in.

The main obscure area is that business with the irqsoff tracer and thus
the need to create stack frames around calls to trace_hardirqs_* ... we
do it in some places and not others, but I've not managed to make it
crash either. I need to get to the bottom of that, and possibly provide
proper macro helpers like ppc64 has to do it.
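
Something along these lines is probably all the helper needs to be on
32-bit (untested sketch, mirroring what the ppc64 macros do):

    .macro TRACE_WITH_FRAME func
            stwu    r1,-16(r1)      /* dummy frame so CALLER_ADDR1 is sane */
            mflr    r0
            stw     r0,8(r1)        /* stash our LR */
            bl      \func
            lwz     r0,8(r1)
            mtlr    r0
            addi    r1,r1,16
    .endm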

Cheers,
Ben.



[RFC/WIP] powerpc: Fix 32-bit handling of MSR_EE on exceptions

2018-12-19 Thread Benjamin Herrenschmidt
Hi folks !

While trying to figure out why we occasionally had lockdep barf about
interrupt state on ppc32 (440 in my case, but I could reproduce on e500
as well using qemu), I realized that we are still doing something
rather gothic and wrong on 32-bit which we stopped doing on 64-bit
a while ago.

We have that thing where some handlers "copy" the EE value from the
original stack frame into the new MSR before transferring to the
handler.

Thus for a number of exceptions, we enter the handlers with interrupts
enabled.

This is rather fishy, some of the stuff that handlers might do early
on such as irq_enter/exit or user_exit, context tracking, etc... should
be run with interrupts off afaik.

Generally our handlers know when to re-enable interrupts if needed
(though some of the FSL specific SPE ones don't).

The problem we were having is that we assumed these interrupts would
return with interrupts enabled. However that isn't the case.

Instead, this changes things so that we always enter exception handlers
with interrupts *off* with the notable exception of syscalls which are
special (and get a fast path).

Currently, the patch only changes BookE (440 and E5xx tested in qemu),
the same recipe needs to be applied to 6xx, 8xx and 40x.

Also I'm not sure whether we need to create a stack frame around some
of the calls to trace_hardirqs_* in asm. ppc64 does it, due to problems
with the irqsoff tracer, but I haven't managed to reproduce those
issues. We need to look into it a bit more.

I'll work more on this in the next few days, comments appreciated.

Not-signed-off-by: Benjamin Herrenschmidt 

---
 arch/powerpc/kernel/entry_32.S   | 113 ++-
 arch/powerpc/kernel/head_44x.S   |   9 +--
 arch/powerpc/kernel/head_booke.h |  34 ---
 arch/powerpc/kernel/head_fsl_booke.S |  28 -
 arch/powerpc/kernel/traps.c  |   8 +++
 5 files changed, 111 insertions(+), 81 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 3841d74..39b4cb5 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -34,6 +34,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 /*
 * MSR_KERNEL is > 0x10000 on 4xx/Book-E since it include MSR_CE.
@@ -205,20 +208,46 @@ transfer_to_handler_cont:
mflrr9
lwz r11,0(r9)   /* virtual address of handler */
lwz r9,4(r9)/* where to go when done */
+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PERF_EVENTS)
+   mtspr   SPRN_NRI, r0
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS
+   /*
+* When tracing IRQ state (lockdep) we enable the MMU before we call
+* the IRQ tracing functions as they might access vmalloc space or
+* perform IOs for console output.
+*
+* To speed up the syscall path where interrupts stay on, let's check
+* first if we are changing the MSR value at all.
+*/
+   lwz r12,_MSR(r1)
+   xor r0,r10,r12
+   andi.   r0,r0,MSR_EE
+   bne 1f
+
+   /* MSR isn't changing, just transition directly */
+   lwz r0,GPR0(r1)
+   mtspr   SPRN_SRR0,r11
+   mtspr   SPRN_SRR1,r10
+   mtlrr9
+   SYNC
+   RFI
+
+1: /* MSR is changing, re-enable MMU so we can notify lockdep. We need to
+* keep interrupts disabled at this point otherwise we might risk
+* taking an interrupt before we tell lockdep they are enabled.
+*/
lis r12,reenable_mmu@h
ori r12,r12,reenable_mmu@l
+   lis r0,MSR_KERNEL@h
+   ori r0,r0,MSR_KERNEL@l
mtspr   SPRN_SRR0,r12
-   mtspr   SPRN_SRR1,r10
+   mtspr   SPRN_SRR1,r0
SYNC
RFI
-reenable_mmu:  /* re-enable mmu so we can */
-   mfmsr   r10
-   lwz r12,_MSR(r1)
-   xor r10,r10,r12
-   andi.   r10,r10,MSR_EE  /* Did EE change? */
-   beq 1f
 
+reenable_mmu:
/*
 * The trace_hardirqs_off will use CALLER_ADDR0 and CALLER_ADDR1.
 * If from user mode there is only one stack frame on the stack, and
@@ -239,8 +268,29 @@ reenable_mmu:  /* re-enable mmu so we can */
stw r3,16(r1)
stw r4,20(r1)
stw r5,24(r1)
-   bl  trace_hardirqs_off
-   lwz r5,24(r1)
+
+   /* Are we enabling or disabling interrupts ? */
+   andi.   r0,r10,MSR_EE
+   beq 1f
+
+   /* If we are enabling interrupt, this is a syscall. They shouldn't
+* happen while interrupts are disabled, so let's do a warning here.
+*/
+0: trap
+   EMIT_BUG_ENTRY 0b,__FILE__,__LINE__, BUGFLAG_WARNING
+   bl  trace_hardirqs_on
+
+   /* Now enable for real */
+   mfmsr   r10
+   ori r10,r10,MSR_EE
+   mtmsr   r10
+   b   2f
+
+   /* If we are disabling in

Re: [PATCH V4 5/5] arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW upgrade

2018-12-19 Thread Benjamin Herrenschmidt
On Wed, 2018-12-19 at 08:50 +0530, Aneesh Kumar K.V wrote:
> Christoph Hellwig  writes:
> 
> > On Tue, Dec 18, 2018 at 03:11:37PM +0530, Aneesh Kumar K.V wrote:
> > > +EXPORT_SYMBOL(huge_ptep_modify_prot_start);
> > 
> > The only user of this function is the one you added in the last patch
> > in mm/hugetlb.c, so there is no need to export this function.
> > 
> > > +
> > > +void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned 
> > > long addr,
> > > +   pte_t *ptep, pte_t old_pte, pte_t pte)
> > > +{
> > > +
> > > + if (radix_enabled())
> > > + return radix__huge_ptep_modify_prot_commit(vma, addr, ptep,
> > > +old_pte, pte);
> > > + set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
> > > +}
> > > +EXPORT_SYMBOL(huge_ptep_modify_prot_commit);
> > 
> > Same here.
> 
> That was done considering that ptep_modify_prot_start/commit was defined
> in asm-generic/pgtable.h. I was trying to make sure I didn't break
> anything with the patch. Also s390 do have that EXPORT_SYMBOL() for the
> same. hugetlb just inherited that.
> 
> If you feel strongly about it, I can drop the EXPORT_SYMBOL().

At the very least it should be _GPL

Cheers,
Ben.



Re: [PATCH V4 0/5] NestMMU pte upgrade workaround for mprotect

2018-12-18 Thread Benjamin Herrenschmidt
On Tue, 2018-12-18 at 09:17 -0800, Christoph Hellwig wrote:
> This series seems to miss patches 1 and 2.

Odd, I got them...

Ben.




Re: use generic DMA mapping code in powerpc V4

2018-12-11 Thread Benjamin Herrenschmidt
On Tue, 2018-12-11 at 19:17 +0100, Christian Zigotzky wrote:
> X5000 (P5020 board): U-Boot loads the kernel and the dtb file. Then the 
> kernel starts but it doesn't find any hard disks (partitions). That 
> means this is also the bad commit for the P5020 board.

What are the disks hanging off ? A PCIe device of some sort ?

Can you send good & bad dmesg logs ?

Ben.




[PATCH] powerpc/44x/bamboo: Fix PCI range

2018-12-10 Thread Benjamin Herrenschmidt
From 88509506b80b4960004146280eb740be64513a0b Mon Sep 17 00:00:00 2001
The bamboo dts has a bug: it uses a non-naturally aligned range
for PCI memory space. This isn't supported by the code, thus
causing PCI to break on this system.

This is due to the fact that while the chip memory map has 1G
reserved for PCI memory, it's only 512M aligned. The code doesn't
know how to split that into 2 different PMMs and fails, so limit
the region to 512M.
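
For the record, the natural alignment check that trips here boils down
to the base being a multiple of the size:

    0xa0000000 & (0x40000000 - 1) = 0x20000000   /* 1G at 0xa0000000: misaligned */
    0xa0000000 & (0x20000000 - 1) = 0            /* 512M: naturally aligned */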

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/boot/dts/bamboo.dts | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/bamboo.dts b/arch/powerpc/boot/dts/bamboo.dts
index 538e42b..b5861fa 100644
--- a/arch/powerpc/boot/dts/bamboo.dts
+++ b/arch/powerpc/boot/dts/bamboo.dts
@@ -268,8 +268,10 @@
/* Outbound ranges, one memory and one IO,
 * later cannot be changed. Chip supports a second
 * IO range but we don't use it for now
+* The chip also supports a larger memory range but
+* it's not naturally aligned, so our code will break
 */
-   ranges = <0x02000000 0x00000000 0xa0000000 0x00000000 0xa0000000 0x00000000 0x40000000
+   ranges = <0x02000000 0x00000000 0xa0000000 0x00000000 0xa0000000 0x00000000 0x20000000
  0x02000000 0x00000000 0x00000000 0x00000000 0xe0000000 0x00000000 0x00100000
  0x01000000 0x00000000 0x00000000 0x00000000 0xe8000000 0x00000000 0x00010000>;
 




[PATCH] powerpc: Fix bogus usage of MSR_RI on BookE and 40x

2018-12-10 Thread Benjamin Herrenschmidt
BookE and 40x processors lack the MSR:RI bit. However, we have a
few common code places that rely on it.

This fixes it by not defining MSR_RI on those processor types and
using the appropriate ifdef's in those locations.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/include/asm/reg.h   | 2 ++
 arch/powerpc/include/asm/reg_booke.h | 4 ++--
 arch/powerpc/kernel/process.c| 2 +-
 arch/powerpc/kernel/traps.c  | 8 ++--
 arch/powerpc/lib/sstep.c | 2 ++
 5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index de52c31..41d0b2e 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -110,7 +110,9 @@
 #ifndef MSR_PMM
 #define MSR_PMM __MASK(MSR_PMM_LG)  /* Performance monitor */
 #endif
+#if !defined(CONFIG_BOOKE) && !defined(CONFIG_4xx)
 #define MSR_RI __MASK(MSR_RI_LG)   /* Recoverable Exception */
+#endif
 #define MSR_LE __MASK(MSR_LE_LG)   /* Little Endian */
 
 #define MSR_TM __MASK(MSR_TM_LG)   /* Transactional Mem Available */
diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h
index eb2a33d..06967f1 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -46,10 +46,10 @@
 #define MSR_USER32 (MSR_ | MSR_PR | MSR_EE)
 #define MSR_USER64 (MSR_USER32 | MSR_64BIT)
 #elif defined (CONFIG_40x)
-#define MSR_KERNEL (MSR_ME|MSR_RI|MSR_IR|MSR_DR|MSR_CE)
+#define MSR_KERNEL (MSR_ME|MSR_IR|MSR_DR|MSR_CE)
 #define MSR_USER   (MSR_KERNEL|MSR_PR|MSR_EE)
 #else
-#define MSR_KERNEL (MSR_ME|MSR_RI|MSR_CE)
+#define MSR_KERNEL (MSR_ME|MSR_CE)
 #define MSR_USER   (MSR_KERNEL|MSR_PR|MSR_EE)
 #endif
 
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 96f3473..77679a7 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1359,7 +1359,7 @@ static struct regbit msr_bits[] = {
{MSR_IR,"IR"},
{MSR_DR,"DR"},
{MSR_PMM,   "PMM"},
-#ifndef CONFIG_BOOKE
+#if !defined(CONFIG_BOOKE) && !defined(CONFIG_40x)
{MSR_RI,"RI"},
{MSR_LE,"LE"},
 #endif
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 9a86572..2d00696 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -429,10 +429,11 @@ void system_reset_exception(struct pt_regs *regs)
if (get_paca()->in_nmi > 1)
nmi_panic(regs, "Unrecoverable nested System Reset");
 #endif
+#ifdef MSR_RI
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
nmi_panic(regs, "Unrecoverable System Reset");
-
+#endif
if (!nested)
nmi_exit();
 
@@ -478,7 +479,9 @@ static inline int check_io_access(struct pt_regs *regs)
printk(KERN_DEBUG "%s bad port %lx at %p\n",
   (*nip & 0x100)? "OUT to": "IN from",
   regs->gpr[rb] - _IO_BASE, nip);
+#ifdef MSR_RI
regs->msr |= MSR_RI;
+#endif
regs->nip = extable_fixup(entry);
return 1;
}
@@ -763,10 +766,11 @@ void machine_check_exception(struct pt_regs *regs)
if (check_io_access(regs))
goto bail;
 
+#ifdef MSR_RI
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
nmi_panic(regs, "Unrecoverable Machine check");
-
+#endif
if (!nested)
nmi_exit();
 
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index d81568f..c03c453 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3059,9 +3059,11 @@ int emulate_step(struct pt_regs *regs, unsigned int instr)
 
case MTMSR:
val = regs->gpr[op.reg];
+#ifdef MSR_RI
if ((val & MSR_RI) == 0)
/* can't step mtmsr[d] that would clear MSR_RI */
return -1;
+#endif
/* here op.val is the mask of bits to change */
regs->msr = (regs->msr & ~op.val) | (val & op.val);
goto instr_done;




Re: use generic DMA mapping code in powerpc V4

2018-12-10 Thread Benjamin Herrenschmidt
On Mon, 2018-12-10 at 20:33 +0100, Christoph Hellwig wrote:
> On Mon, Dec 10, 2018 at 05:04:46PM +, Rui Salvaterra wrote:
> > Hi, Christoph and Ben,
> > 
> > It just came to my mind (and this is most likely a stupid question,
> > but still)… Is there any possibility of these changes having an
> > (positive) effect on the long-standing problem of Power Mac machines
> > with AGP graphics cards (which have to be limited to PCI transfers,
> > otherwise they'll hang, due to coherence issues)? If so, I have a G4
> > machine where I'd gladly test them.
> 
> These patches themselves are not going to affect that directly.
> But IFF the problem really is that the AGP needs to be treated as not
> cache coherent (I have no idea if that is true) the generic direct
> mapping code has full support for a per-device coherent flag, so
> support for a non-coherent AGP slot could be implemented relatively
> simply.

AGP is a gigantic nightmare :-) It's not just cache coherency issues
(some implementations are coherent, some aren't, Apple's is ... weird).

Apple has all sorts of bugs, and Darwin source code only sheds light on
some of them. Some implementations can only read, not write, I think, for
example. There are issues with transfers crossing some boundaries I
believe, but it's all unclear.

Apple makes this work with a combination of hacks in the AGP "driver"
and the closed source GPU driver, which we don't see.

I have given up trying to make that stuff work reliably a decade ago :)

Cheers,
Ben.




Re: use generic DMA mapping code in powerpc V4

2018-12-08 Thread Benjamin Herrenschmidt
On Tue, 2018-11-27 at 08:42 +0100, Christoph Hellwig wrote:
> Any comments?  I'd like to at least get the ball moving on the easy
> bits.

I completely missed your posting of V4 ! I was wondering what was
taking you so long :)

I'll give it a spin & send acks over the next 2 or 3 days.

Cheers,
Ben.

> On Wed, Nov 14, 2018 at 09:22:40AM +0100, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series switches the powerpc port to use the generic swiotlb and
> > noncoherent dma ops, and to use more generic code for the coherent
> > direct mapping, as well as removing a lot of dead code.
> > 
> > As this series is very large and depends on the dma-mapping tree I've
> > also published a git tree:
> > 
> > git://git.infradead.org/users/hch/misc.git powerpc-dma.4
> > 
> > Gitweb:
> > 
> > 
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.4
> > 
> > Changes since v3:
> >  - rebase on the powerpc fixes tree
> >  - add a new patch to actually make the baseline amigaone config
> >configure without warnings
> >  - only use ZONE_DMA for 64-bit embedded CPUs, on pseries an IOMMU is
> >always present
> >  - fix compile in mem.c for one configuration
> >  - drop the full npu removal for now, will be resent separately
> >  - a few git bisection fixes
> > 
> > The changes since v1 are too big to list and v2 was not posted in public.
> > 
> > ___
> > iommu mailing list
> > io...@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu
> ---end quoted text---



Re: ppc440gp "ebony" board manual/specs

2018-12-04 Thread Benjamin Herrenschmidt
On Tue, 2018-12-04 at 12:19 +0100, Mathieu Malaterre wrote:
> On Tue, Dec 4, 2018 at 12:09 PM Benjamin Herrenschmidt
>  wrote:
> > Hi folks !
> > 
> > Does anybody still have the manual or schematics (or both!) of the old
> > "ebony" ppc440gp eval board around by any chance ?
> 
> A google search for:
> 
> IBM/AMCC "PPC440GP" Evaluation Board datasheet
> 
> leads to:
> 
> https://4donline.ihs.com/images/VipMasterIC/IC/AMCC/AMCCS00334/AMCCS00334-1.pdf?hkey=EF798316E3902B6ED9A73243A3159BB0

Which is the processor data sheet, not the board :)

Cheers,
Ben.




ppc440gp "ebony" board manual/specs

2018-12-04 Thread Benjamin Herrenschmidt
Hi folks !

Does anybody still have the manual or schematics (or both!) of the old
"ebony" ppc440gp eval board around by any chance ?

Cheers
Ben.




Re: use generic DMA mapping code in powerpc V4

2018-12-01 Thread Benjamin Herrenschmidt
On Fri, 2018-11-30 at 11:44 +, Rui Salvaterra wrote:
> Thanks for the quick response! I applied it on top of your
> powerpc-dma.4 branch and retested.
> I'm not seeing nouveau complaining anymore (I'm not using X11 or any
> DE, though).
> In any case and FWIW, this series is
> 
> Tested-by: Rui Salvaterra 

Talking of which ... Christoph, I'm not sure if we can do something about
this at the DMA API level or keep hacks, but some adapters, such as the
nVidia GPUs, have a HW hack we can use to work around their limitations
in that case.

They have a register that can program a fixed value for the top bits
that they don't support.

This works fine for any linear mapping with an offset, provided they
can program the offset in that register and they have enough DMA range
to cover all memory from that offset.
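
In other words, roughly (sketch):

    /* The GPU natively drives dma_bits of address, the bits above
     * that come from the programmed register. With a linear window
     * at 'offset' (bus = phys + offset), this works iff:
     */
    ok = (offset & ((1ULL << dma_bits) - 1)) == 0   /* offset representable */
         && ram_top <= (1ULL << dma_bits);          /* window covers all RAM */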

I can probably get the info about this from them so we can exploit it
in nouveau.

Cheers,
Ben.

> Thanks,
> Rui



[PATCH] powerpc: Look for "stdout-path" when setting up legacy consoles

2018-11-29 Thread Benjamin Herrenschmidt
Commit 78e5dfea8 "powerpc: dts: replace 'linux,stdout-path' with 'stdout-path'"
broke the default console on a number of embedded PowerPC systems, because it
failed to also update the code in arch/powerpc/kernel/legacy_serial.c to
look for that property in addition to the old one.

This fixes it.

Signed-off-by: Benjamin Herrenschmidt 
Fixes: 78e5dfea8 ("powerpc: dts: replace 'linux,stdout-path' with 'stdout-path'")
---
diff --git a/arch/powerpc/kernel/legacy_serial.c b/arch/powerpc/kernel/legacy_serial.c
index 33b34a5..5b9dce1 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -372,6 +372,8 @@ void __init find_legacy_serial_ports(void)
 
/* Now find out if one of these is out firmware console */
path = of_get_property(of_chosen, "linux,stdout-path", NULL);
+   if (path == NULL)
+   path = of_get_property(of_chosen, "stdout-path", NULL);
if (path != NULL) {
stdout = of_find_node_by_path(path);
if (stdout)
@@ -595,8 +597,10 @@ static int __init check_legacy_serial_console(void)
/* We are getting a weird phandle from OF ... */
/* ... So use the full path instead */
name = of_get_property(of_chosen, "linux,stdout-path", NULL);
+   if (name == NULL)
+   name = of_get_property(of_chosen, "stdout-path", NULL);
if (name == NULL) {
-   DBG(" no linux,stdout-path !\n");
+   DBG(" no stdout-path !\n");
return -ENODEV;
}
prom_stdout = of_find_node_by_path(name);



Re: [PATCH 0/5] Guarded Userspace Access Prevention on Radix

2018-11-08 Thread Benjamin Herrenschmidt
On Thu, 2018-11-08 at 18:52 +0100, Christophe LEROY wrote:
> 
> In signal_32.c and signal_64.c, save_user_regs() calls __put_user() to 
> modify code, then calls flush_icache_range() on user addresses.
> 
> Shouldn't flush_icache_range() be performed with userspace access 
> protection unlocked ?

Thankfully this code is pretty much never used these days...

Russell: To trigger that, you need to disable the VDSO.

This brings back the idea however of having a way to "bulk" open the
gate during the whole signal sequence...
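
i.e. rather than toggling the protection around every single
__put_user(), something like this (hypothetical helper names, just to
illustrate the shape):

    unlock_user_access();              /* open the window once */
    err = save_user_regs(regs, frame); /* all the __put_user()s */
    flush_icache_range(start, end);    /* still inside the window */
    lock_user_access();                /* and close it again */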

Cheers,
Ben.




Re: [PATCH 12/24] powerpc/mm: Fix reporting of kernel execute faults

2018-11-07 Thread Benjamin Herrenschmidt
On Wed, 2018-11-07 at 09:35 +0100, Christophe LEROY wrote:
> Hi Ben,
> 
> I have an issue on the 8xx with this change

Ah ouch...

 .../...
  
> > +/* Is this a bad kernel fault ? */
> > +static bool bad_kernel_fault(bool is_exec, unsigned long error_code,
> > +unsigned long address)
> > +{
> > +   if (is_exec && (error_code & (DSISR_NOEXEC_OR_G | DSISR_KEYFAULT))) {
> 
> Do you mind if we had DSISR_PROTFAULT here as well ?

Off the top of my head, I don't see a problem with that... but it would
definitely require an explanation comment.

> > +   printk_ratelimited(KERN_CRIT "kernel tried to execute"
> > +  " exec-protected page (%lx) -"
> > +  "exploit attempt? (uid: %d)\n",
> > +  address, from_kuid(&init_user_ns,
> > + current_uid()));
> > +   }
> > +   return is_exec || (address >= TASK_SIZE);
> > +}
> > +
> >   /*
> >* Define the correct "is_write" bit in error_code based
> >* on the processor family
> > @@ -252,7 +266,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
> >  * The kernel should never take an execute fault nor should it
> >  * take a page fault to a kernel address.
> >  */
> > -   if (!is_user && (is_exec || (address >= TASK_SIZE)))
> > +   if (unlikely(!is_user && bad_kernel_fault(is_exec, error_code, address)))
> > return SIGSEGV;
> >   
> > /* We restore the interrupt state now */
> > @@ -491,11 +505,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
> > return 0;
> > }
> >   
> > -   if (is_exec && (error_code & DSISR_PROTFAULT))
> > -   printk_ratelimited(KERN_CRIT "kernel tried to execute NX-protected"
> > -  " page (%lx) - exploit attempt? (uid: %d)\n",
> > -  address, from_kuid(&init_user_ns, current_uid()));
> > -
> > return SIGSEGV;
> >   }
> >   NOKPROBE_SYMBOL(__do_page_fault);
> > 



Re: PIE binaries are no longer mapped below 4 GiB on ppc64le

2018-10-31 Thread Benjamin Herrenschmidt
On Wed, 2018-10-31 at 18:54 +0100, Florian Weimer wrote:
> 
> It would matter to C code which returns the address of a global variable
> in the main program through an (implicit) int return value.
> 
> The old behavior hid some pointer truncation issues.

Hiding bugs like that is never a good idea..

> > Maybe it would be good idea to generate 64bit relocations on 64bit
> > targets?
> 
> Yes, the Go toolchain definitely needs fixing for PIE.  I don't dispute
> that.

There was never any ABI guarantee that programs would be loaded below
4G... it just *happened*, so that's not per-se an ABI change.

That said, I'm surprised by the choice of address... I would rather have
moved to above 1TB to benefit from 1T segments...

Nick, Anton, do you know anything about that change ?

Ben.




Re: [PATCH] powerpc/npu-dma: Remove NPU DMA ops

2018-10-30 Thread Benjamin Herrenschmidt
On Tue, 2018-10-30 at 13:58 +0100, Christoph Hellwig wrote:
> Please take my patch instead.  We have a kernel polcity to not keep
> dead code around, and everyone including Linus and the attending IBMers
> confirmed this.

Let's call a cat a cat ... ;-)

It's not *dead* code. It's code that is used by an out-of-tree driver
(which also happens not to be open source).

Just making things clear for everybody.

Cheers,
Ben.

> On Tue, Oct 30, 2018 at 10:02:03PM +1100, Alistair Popple wrote:
> > The NPU IOMMU is setup to mirror the parent PCIe device IOMMU
> > setup. Therefore it does not make sense to call dma operations such as
> > dma_map_page, etc. directly on these devices. The existing dma-ops
> > simply print a warning if they are ever called, however this is
> > unnecessary and the warnings are likely to go unnoticed.
> > 
> > It is instead simpler to remove these operations and let the generic
> > DMA code print warnings (eg. via a NULL pointer deref) in cases of
> > buggy drivers attempting dma operations on NVLink devices.
> > 
> > Signed-off-by: Alistair Popple 
> > ---
> >  arch/powerpc/platforms/powernv/npu-dma.c | 64 
> > ++--
> >  1 file changed, 4 insertions(+), 60 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> > index 6f60e0931922..75b935252981 100644
> > --- a/arch/powerpc/platforms/powernv/npu-dma.c
> > +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> > @@ -102,63 +102,6 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index)
> >  }
> >  EXPORT_SYMBOL(pnv_pci_get_npu_dev);
> > 
> > -#define NPU_DMA_OP_UNSUPPORTED()   \
> > -   dev_err_once(dev, "%s operation unsupported for NVLink devices\n", \
> > -   __func__)
> > -
> > -static void *dma_npu_alloc(struct device *dev, size_t size,
> > -  dma_addr_t *dma_handle, gfp_t flag,
> > -  unsigned long attrs)
> > -{
> > -   NPU_DMA_OP_UNSUPPORTED();
> > -   return NULL;
> > -}
> > -
> > -static void dma_npu_free(struct device *dev, size_t size,
> > -void *vaddr, dma_addr_t dma_handle,
> > -unsigned long attrs)
> > -{
> > -   NPU_DMA_OP_UNSUPPORTED();
> > -}
> > -
> > -static dma_addr_t dma_npu_map_page(struct device *dev, struct page *page,
> > -  unsigned long offset, size_t size,
> > -  enum dma_data_direction direction,
> > -  unsigned long attrs)
> > -{
> > -   NPU_DMA_OP_UNSUPPORTED();
> > -   return 0;
> > -}
> > -
> > -static int dma_npu_map_sg(struct device *dev, struct scatterlist *sglist,
> > - int nelems, enum dma_data_direction direction,
> > - unsigned long attrs)
> > -{
> > -   NPU_DMA_OP_UNSUPPORTED();
> > -   return 0;
> > -}
> > -
> > -static int dma_npu_dma_supported(struct device *dev, u64 mask)
> > -{
> > -   NPU_DMA_OP_UNSUPPORTED();
> > -   return 0;
> > -}
> > -
> > -static u64 dma_npu_get_required_mask(struct device *dev)
> > -{
> > -   NPU_DMA_OP_UNSUPPORTED();
> > -   return 0;
> > -}
> > -
> > -static const struct dma_map_ops dma_npu_ops = {
> > -   .map_page   = dma_npu_map_page,
> > -   .map_sg = dma_npu_map_sg,
> > -   .alloc  = dma_npu_alloc,
> > -   .free   = dma_npu_free,
> > -   .dma_supported  = dma_npu_dma_supported,
> > -   .get_required_mask  = dma_npu_get_required_mask,
> > -};
> > -
> >  /*
> >   * Returns the PE assoicated with the PCI device of the given
> >   * NPU. Returns the linked pci device if pci_dev != NULL.
> > @@ -270,10 +213,11 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
> > rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]);
> > 
> > /*
> > -* We don't initialise npu_pe->tce32_table as we always use
> > -* dma_npu_ops which are nops.
> > +* NVLink devices use the same TCE table configuration as
> > +* their parent device so drivers shouldn't be doing DMA
> > +* operations directly on these devices.
> >  */
> > -   set_dma_ops(&npe->pdev->dev, &dma_npu_ops);
> > +   set_dma_ops(&npe->pdev->dev, NULL);
> >  }
> > 
> >  /*
> > --
> > 2.11.0
> 
> ---end quoted text---



Re: NXP P50XX/e5500: SMP doesn't work anymore with the latest Git kernel

2018-10-29 Thread Benjamin Herrenschmidt
On Tue, 2018-10-30 at 02:42 +0100, Christian Zigotzky wrote:
> OF patch for the latest Git kernel: http://www.xenosoft.de/of_v2.patch

This just seems to revert a whole bunch of stuff, not really the right
way to go. Why is of_get_next_cpu_node() not finding your CPUs ? There
must be something wrong with the device-tree...
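
For reference, the iterator expects the standard layout, i.e. nodes
under /cpus either named "cpu" or carrying device_type = "cpu",
something like:

    cpus {
            #address-cells = <1>;
            #size-cells = <0>;

            cpu@0 {
                    device_type = "cpu";
                    reg = <0>;
            };

            cpu@1 {
                    device_type = "cpu";
                    reg = <1>;
            };
    };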

> - of_v2.patch -
> 
> diff -rupN a/drivers/of/base.c b/drivers/of/base.c
> --- a/drivers/of/base.c2018-10-30 02:19:30.827089495 +0100
> +++ b/drivers/of/base.c2018-10-30 02:18:51.666856715 +0100
> @@ -395,7 +395,7 @@ struct device_node *of_get_cpu_node(int
>   {
>   struct device_node *cpun;
> 
> -for_each_of_cpu_node(cpun) {
> +for_each_node_by_type(cpun, "cpu") {
>   if (arch_find_n_match_cpu_physical_id(cpun, cpu, thread))
>   return cpun;
>   }
> @@ -750,45 +750,6 @@ struct device_node *of_get_next_availabl
>   EXPORT_SYMBOL(of_get_next_available_child);
> 
>   /**
> - *of_get_next_cpu_node - Iterate on cpu nodes
> - *@prev:previous child of the /cpus node, or NULL to get first
> - *
> - *Returns a cpu node pointer with refcount incremented, use 
> of_node_put()
> - *on it when done. Returns NULL when prev is the last child. Decrements
> - *the refcount of prev.
> - */
> -struct device_node *of_get_next_cpu_node(struct device_node *prev)
> -{
> -struct device_node *next = NULL;
> -unsigned long flags;
> -struct device_node *node;
> -
> -if (!prev)
> -node = of_find_node_by_path("/cpus");
> -
> -raw_spin_lock_irqsave(&devtree_lock, flags);
> -if (prev)
> -next = prev->sibling;
> -else if (node) {
> -next = node->child;
> -of_node_put(node);
> -}
> -for (; next; next = next->sibling) {
> -if (!(of_node_name_eq(next, "cpu") ||
> -  (next->type && !of_node_cmp(next->type, "cpu"))))
> -continue;
> -if (!__of_device_is_available(next))
> -continue;
> -if (of_node_get(next))
> -break;
> -}
> -of_node_put(prev);
> -raw_spin_unlock_irqrestore(&devtree_lock, flags);
> -return next;
> -}
> -EXPORT_SYMBOL(of_get_next_cpu_node);
> -
> -/**
>* of_get_compatible_child - Find compatible child node
>* @parent:parent node
>* @compatible:compatible string
> diff -rupN a/include/linux/of.h b/include/linux/of.h
> --- a/include/linux/of.h2018-10-30 02:19:32.047096634 +0100
> +++ b/include/linux/of.h2018-10-30 02:18:51.666856715 +0100
> @@ -347,7 +347,6 @@ extern const void *of_get_property(const
>   const char *name,
>   int *lenp);
>   extern struct device_node *of_get_cpu_node(int cpu, unsigned int *thread);
> -extern struct device_node *of_get_next_cpu_node(struct device_node *prev);
> 
>   #define for_each_property_of_node(dn, pp) \
>   for (pp = dn->properties; pp != NULL; pp = pp->next)
> @@ -757,11 +756,6 @@ static inline struct device_node *of_get
>   return NULL;
>   }
> 
> -static inline struct device_node *of_get_next_cpu_node(struct 
> device_node *prev)
> -{
> -return NULL;
> -}
> -
>   static inline int of_n_addr_cells(struct device_node *np)
>   {
>   return 0;
> @@ -1239,10 +1233,6 @@ static inline int of_property_read_s32(c
>   for (child = of_get_next_available_child(parent, NULL); child != 
> NULL; \
>child = of_get_next_available_child(parent, child))
> 
> -#define for_each_of_cpu_node(cpu) \
> -for (cpu = of_get_next_cpu_node(NULL); cpu != NULL; \
> - cpu = of_get_next_cpu_node(cpu))
> -
>   #define for_each_node_with_property(dn, prop_name) \
>   for (dn = of_find_node_with_property(NULL, prop_name); dn; \
>dn = of_find_node_with_property(dn, prop_name))



Re: What is an address translation in powerISA jargon ?

2018-10-16 Thread Benjamin Herrenschmidt
On Tue, 2018-10-16 at 20:58 +0300, Raz wrote:
> Section 5.7.3
> "Storage accesses in real, hypervisor real, and virtual real
> addressing modes are performed in a manner that depends on the
> contents of MSR HV , VPM, VRMASD, HRMOR, RMLS, RMOR (see Chapter 2),
> bit 0 of the
> effective address (EA0),"
> 
> Hello
> 1. If MSR_IR = 0 and MSR_DR = 0, does it mean that addresses are not
> translated by the MMU ?

It depends.

If HV=1 (hypervisor mode), then they are only translated to the extent
that HRMOR is applied if the MSB is 0, and untranslated if the MSB is
1.

If HV=0 (guest mode), then they *are* translated but using a different
mechanism than what's normally used when IR/DR=1. This mechanism
depends on whether you are using the Radix or the Hash MMU, and the top
2 bits are ignored.

With hash MMU, it's using things like VRMASD etc... (RMOR is deprecated
afaik) to look up a "Virtual real mode" area in the hash table. It's
essentially a mapping of the guest "physical" space to real physical
space. It's usually initialized (and maintained) by the HV but the
guest can extend it using things like H_ENTER afaik.

With the radix MMU, it's the guest physical space as mapped by the 2nd
level page tables maintained by the hypervisor.
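
In pseudo-code, the real-mode (IR/DR=0) cases are roughly (a sketch,
see the ISA for the exact bit ranges):

    if (msr & MSR_HV) {
            if (ea_msb_set(ea))
                    ra = ea & REAL_ADDR_MASK;            /* top bits ignored */
            else
                    ra = (ea & REAL_ADDR_MASK) | HRMOR;  /* HRMOR ORed in */
    } else {
            ra = guest_translate(ea);  /* VRMA hash or 2nd-level radix */
    }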

> 2. If EA0 is the 63rd bit of the effective address, does this mean
> that the translation model is derived from the address ? A
> non-privileged context may access privileged memory.

Nope.

Cheers,
Ben.




Re: linux-next: Tree for Oct 15

2018-10-15 Thread Benjamin Herrenschmidt
On Tue, 2018-10-16 at 13:19 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> On Tue, 16 Oct 2018 13:02:16 +1100 Stephen Rothwell  
> wrote:
> > 
> > Reverting fe3d2a45e8079fdd7d4da1ff07f4b40bc3cb499f (and the following 2
> > commits) produces a kernel that boots.
> 
> Instead of that, I applied this patch on top of linux-next and it boots
> and produces a stack trace ...
> 
> From: Stephen Rothwell 
> Date: Tue, 16 Oct 2018 13:07:01 +1100
> Subject: [PATCH] mm/memblock.c: use dump_stack() instead of WARN_ON_ONCE for
>  the alignment checks
> 
> Using WARN_ON_ONCE too early causes the PowerPC kernel to fail.

Interesting ... I thought I had fixed that. Might need to be re-fixed.

> Signed-off-by: Stephen Rothwell 
> ---
>  mm/memblock.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 5fefc70253ee..f2ef3915a356 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1298,8 +1298,10 @@ static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  {
>   phys_addr_t found;
>  
> - if (WARN_ON_ONCE(!align))
> + if (!align) {
> + dump_stack();
>   align = SMP_CACHE_BYTES;
> + }
>  
>   found = memblock_find_in_range_node(size, align, start, end, nid,
>   flags);
> @@ -1423,8 +1425,10 @@ static void * __init memblock_alloc_internal(
>   if (WARN_ON_ONCE(slab_is_available()))
>   return kzalloc_node(size, GFP_NOWAIT, nid);
>  
> - if (WARN_ON_ONCE(!align))
> + if (!align) {
> + dump_stack();
>   align = SMP_CACHE_BYTES;
> + }
>  
>   if (max_addr > memblock.current_limit)
>   max_addr = memblock.current_limit;
> -- 
> 2.18.0
> 
> So, patch "memblock: stop using implicit alignment to SMP_CACHE_BYTES"
> should *not* remove the 0 -> SMP_CACHE_BYTES update from mm/memblock.c
> and just add the dump_stack().



[PATCH 09/12] powerpc/prom_init: Move a few remaining statics to appropriate sections

2018-10-14 Thread Benjamin Herrenschmidt
Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index cbd54fe30367..306562d4d818 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -173,7 +173,7 @@ static unsigned long __prombss prom_tce_alloc_end;
 #endif
 
 #ifdef CONFIG_PPC_PSERIES
-static bool prom_radix_disable __prombss;
+static bool __prombss prom_radix_disable;
 #endif
 
 struct platform_support {
@@ -210,7 +210,7 @@ static int __prombss mem_reserve_cnt;
 
 static cell_t __prombss regbuf[1024];
 
-static bool rtas_has_query_cpu_stopped;
+static bool  __prombss rtas_has_query_cpu_stopped;
 
 
 /*
@@ -525,8 +525,8 @@ static void add_string(char **str, const char *q)
 
 static char *tohex(unsigned int x)
 {
-   static char digits[] = "0123456789abcdef";
-   static char result[9];
+   static const char digits[] __initconst = "0123456789abcdef";
+   static char result[9] __prombss;
int i;
 
result[8] = 0;
@@ -2324,7 +2324,7 @@ static void __init scan_dt_build_struct(phandle node, unsigned long *mem_start,
char *namep, *prev_name, *sstart, *p, *ep, *lp, *path;
unsigned long soff;
unsigned char *valp;
-   static char pname[MAX_PROPERTY_NAME];
+   static char pname[MAX_PROPERTY_NAME] __prombss;
int l, room, has_phandle = 0;
 
dt_push_token(OF_DT_BEGIN_NODE, mem_start, mem_end);
-- 
2.17.1



[PATCH 10/12] powerpc/prom_init: Move __prombss to its own section and store it in .bss

2018-10-14 Thread Benjamin Herrenschmidt
This makes __prombss its own section and, for now, stores
it in .bss.

This will give us the ability later to store it elsewhere
and/or free it after boot (it's about 8KB).

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c   | 2 +-
 arch/powerpc/kernel/vmlinux.lds.S | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 306562d4d818..57762f19f136 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -48,7 +48,7 @@
 #include 
 
 /* All of prom_init bss lives here */
-#define __prombss __initdata
+#define __prombss __section(.bss.prominit)
 
 /*
  * Eventually bump that one up
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 07ae018e550e..ac0ceb31b336 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -4,6 +4,9 @@
 #else
 #define PROVIDE32(x)   PROVIDE(x)
 #endif
+
+#define BSS_FIRST_SECTIONS *(.bss.prominit)
+
 #include 
 #include 
 #include 
-- 
2.17.1



[PATCH 04/12] powerpc/prom_init: Replace __initdata with __prombss when applicable

2018-10-14 Thread Benjamin Herrenschmidt
This replaces all occurrences of __initdata for uninitialized
data with a new __prombss.

Currently __promdata is defined to be __initdata but we'll
eventually change that.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 55 +
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 2e1ebf26e8e4..9942bfca12e4 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -48,6 +48,9 @@
 
 #include 
 
+/* All of prom_init bss lives here */
+#define __prombss __initdata
+
 /*
  * Eventually bump that one up
  */
@@ -87,7 +90,7 @@
 #define OF_WORKAROUNDS 0
 #else
 #define OF_WORKAROUNDS of_workarounds
-static int of_workarounds __promdata;
+static int of_workarounds __prombss;
 #endif
 
 #define OF_WA_CLAIM1   /* do phys/virt claim separately, then map */
@@ -148,26 +151,26 @@ extern void copy_and_flush(unsigned long dest, unsigned long src,
   unsigned long size, unsigned long offset);
 
 /* prom structure */
-static struct prom_t __initdata prom;
+static struct prom_t __prombss prom;
 
-static unsigned long prom_entry __initdata;
+static unsigned long __prombss prom_entry;
 
 #define PROM_SCRATCH_SIZE 256
 
-static char __initdata of_stdout_device[256];
-static char __initdata prom_scratch[PROM_SCRATCH_SIZE];
+static char __prombss of_stdout_device[256];
+static char __prombss prom_scratch[PROM_SCRATCH_SIZE];
 
-static unsigned long __initdata dt_header_start;
-static unsigned long __initdata dt_struct_start, dt_struct_end;
-static unsigned long __initdata dt_string_start, dt_string_end;
+static unsigned long __prombss dt_header_start;
+static unsigned long __prombss dt_struct_start, dt_struct_end;
+static unsigned long __prombss dt_string_start, dt_string_end;
 
-static unsigned long __initdata prom_initrd_start, prom_initrd_end;
+static unsigned long __prombss prom_initrd_start, prom_initrd_end;
 
 #ifdef CONFIG_PPC64
-static int __initdata prom_iommu_force_on;
-static int __initdata prom_iommu_off;
-static unsigned long __initdata prom_tce_alloc_start;
-static unsigned long __initdata prom_tce_alloc_end;
+static int __prombss prom_iommu_force_on;
+static int __prombss prom_iommu_off;
+static unsigned long __prombss prom_tce_alloc_start;
+static unsigned long __prombss prom_tce_alloc_end;
 #endif
 
static bool prom_radix_disable __initdata = !IS_ENABLED(CONFIG_PPC_RADIX_MMU_DEFAULT);
@@ -190,22 +193,22 @@ struct platform_support {
 #define PLATFORM_GENERIC   0x0500
 #define PLATFORM_OPAL  0x0600
 
-static int __initdata of_platform;
+static int __prombss of_platform;
 
-static char __initdata prom_cmd_line[COMMAND_LINE_SIZE];
+static char __prombss prom_cmd_line[COMMAND_LINE_SIZE];
 
-static unsigned long __initdata prom_memory_limit;
+static unsigned long __prombss prom_memory_limit;
 
-static unsigned long __initdata alloc_top;
-static unsigned long __initdata alloc_top_high;
-static unsigned long __initdata alloc_bottom;
-static unsigned long __initdata rmo_top;
-static unsigned long __initdata ram_top;
+static unsigned long __prombss alloc_top;
+static unsigned long __prombss alloc_top_high;
+static unsigned long __prombss alloc_bottom;
+static unsigned long __prombss rmo_top;
+static unsigned long __prombss ram_top;
 
-static struct mem_map_entry __initdata mem_reserve_map[MEM_RESERVE_MAP_SIZE];
-static int __initdata mem_reserve_cnt;
+static struct mem_map_entry __prombss mem_reserve_map[MEM_RESERVE_MAP_SIZE];
+static int __prombss mem_reserve_cnt;
 
-static cell_t __initdata regbuf[1024];
+static cell_t __prombss regbuf[1024];
 
 static bool rtas_has_query_cpu_stopped;
 
@@ -1565,8 +1568,8 @@ static void __init prom_close_stdin(void)
 #ifdef CONFIG_PPC_POWERNV
 
 #ifdef CONFIG_PPC_EARLY_DEBUG_OPAL
-static u64 __initdata prom_opal_base;
-static u64 __initdata prom_opal_entry;
+static u64 __prombss prom_opal_base;
+static u64 __prombss prom_opal_entry;
 #endif
 
 /*
-- 
2.17.1
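
For context, where this is headed (a sketch, assuming the series lands as
described): once the linker script knows about .bss.prominit via the
BSS_FIRST_SECTIONS hunk elsewhere in this series, the temporary alias can
become a real section attribute:

/* Sketch only, not part of this patch */
#define __prombss __section(.bss.prominit)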



[PATCH 03/12] powerpc/prom_init: Make "default_colors" const

2018-10-14 Thread Benjamin Herrenschmidt
It's never modified either

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index e44f55be64ba..2e1ebf26e8e4 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2202,7 +2202,7 @@ static void __init prom_check_displays(void)
ihandle ih;
int i;
 
-   static unsigned char default_colors[] = {
+   static const unsigned char default_colors[] = {
0x00, 0x00, 0x00,
0x00, 0x00, 0xaa,
0x00, 0xaa, 0x00,
-- 
2.17.1



[PATCH 06/12] powerpc/prom_init: Move prom_radix_disable to __prombss

2018-10-14 Thread Benjamin Herrenschmidt
Initialize it dynamically instead of statically

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 9852cdb8bb9e..70f2bde06ab3 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -172,7 +172,9 @@ static unsigned long __prombss prom_tce_alloc_start;
 static unsigned long __prombss prom_tce_alloc_end;
 #endif
 
-static bool prom_radix_disable __initdata = !IS_ENABLED(CONFIG_PPC_RADIX_MMU_DEFAULT);
+#ifdef CONFIG_PPC_PSERIES
+static bool prom_radix_disable __prombss;
+#endif
 
 struct platform_support {
bool hash_mmu;
@@ -665,6 +667,8 @@ static void __init early_cmdline_parse(void)
 #endif
}
 
+#ifdef CONFIG_PPC_PSERIES
+   prom_radix_disable = !IS_ENABLED(CONFIG_PPC_RADIX_MMU_DEFAULT);
opt = strstr(prom_cmd_line, "disable_radix");
if (opt) {
opt += 13;
@@ -680,6 +684,7 @@ static void __init early_cmdline_parse(void)
}
if (prom_radix_disable)
prom_debug("Radix disabled from cmdline\n");
+#endif /* CONFIG_PPC_PSERIES */
 }
 
 #ifdef CONFIG_PPC_PSERIES
-- 
2.17.1
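
The changelog doesn't spell out why the initializer has to move, so for
clarity (a gloss, not from the patch itself): a BSS section is zero-filled
by definition, so a variable destined for .bss.prominit cannot carry a
non-zero static initializer; the default has to be applied at runtime,
which is what the early_cmdline_parse() hunk above does. In miniature:

static bool prom_radix_disable __prombss;	/* BSS: starts out false */

static void __init early_cmdline_parse(void)
{
	/* apply the non-zero default before parsing the command line */
	prom_radix_disable = !IS_ENABLED(CONFIG_PPC_RADIX_MMU_DEFAULT);
	/* ... */
}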



[PATCH 08/12] powerpc/prom_init: Move const structures to __initconst

2018-10-14 Thread Benjamin Herrenschmidt
As they are no longer used past the end of prom_init

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 03b171bb1c29..cbd54fe30367 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -788,7 +788,7 @@ struct ibm_arch_vec {
struct option_vector6 vec6;
 } __packed;
 
-static const struct ibm_arch_vec ibm_architecture_vec_template = {
+static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
.pvrs = {
{
.mask = cpu_to_be32(0xfffe), /* POWER5/POWER5+ */
@@ -963,7 +963,7 @@ static const struct fake_elf {
u32 ignore_me;
} rpadesc;
} rpanote;
-} fake_elf = {
+} fake_elf __initconst = {
.elfhdr = {
.e_ident = { 0x7f, 'E', 'L', 'F',
 ELFCLASS32, ELFDATA2MSB, EV_CURRENT },
@@ -2128,7 +2128,7 @@ static void __init prom_check_displays(void)
ihandle ih;
int i;
 
-   static const unsigned char default_colors[] = {
+   static const unsigned char default_colors[] __initconst = {
0x00, 0x00, 0x00,
0x00, 0x00, 0xaa,
0x00, 0xaa, 0x00,
-- 
2.17.1



[PATCH 05/12] powerpc/prom_init: Remove support for OPAL v2

2018-10-14 Thread Benjamin Herrenschmidt
We removed support for running under any OPAL version
earlier than v3 in 2015 (they never saw the light of day
anyway), but we kept some leftovers of this support in
prom_init.c, so let's take it out.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 125 +++-
 1 file changed, 10 insertions(+), 115 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 9942bfca12e4..9852cdb8bb9e 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -43,7 +43,6 @@
 #include <asm/btext.h>
 #include <asm/sections.h>
 #include <asm/machdep.h>
-#include <asm/opal.h>
 #include <asm/asm-prototypes.h>
 
 #include <linux/linux_logo.h>
@@ -191,7 +190,6 @@ struct platform_support {
 #define PLATFORM_LPAR  0x0001
 #define PLATFORM_POWERMAC  0x0400
 #define PLATFORM_GENERIC   0x0500
-#define PLATFORM_OPAL  0x0600
 
 static int __prombss of_platform;
 
@@ -684,7 +682,7 @@ static void __init early_cmdline_parse(void)
prom_debug("Radix disabled from cmdline\n");
 }
 
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
+#ifdef CONFIG_PPC_PSERIES
 /*
  * The architecture vector has an array of PVR mask/value pairs,
  * followed by # option vectors - 1, followed by the option vectors.
@@ -1228,7 +1226,7 @@ static void __init prom_send_capabilities(void)
}
 #endif /* __BIG_ENDIAN__ */
 }
-#endif /* #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) */
+#endif /* CONFIG_PPC_PSERIES */
 
 /*
  * Memory allocation strategy... our layout is normally:
@@ -1565,88 +1563,6 @@ static void __init prom_close_stdin(void)
}
 }
 
-#ifdef CONFIG_PPC_POWERNV
-
-#ifdef CONFIG_PPC_EARLY_DEBUG_OPAL
-static u64 __prombss prom_opal_base;
-static u64 __prombss prom_opal_entry;
-#endif
-
-/*
- * Allocate room for and instantiate OPAL
- */
-static void __init prom_instantiate_opal(void)
-{
-   phandle opal_node;
-   ihandle opal_inst;
-   u64 base, entry;
-   u64 size = 0, align = 0x1;
-   __be64 val64;
-   u32 rets[2];
-
-   prom_debug("prom_instantiate_opal: start...\n");
-
-   opal_node = call_prom("finddevice", 1, 1, ADDR("/ibm,opal"));
-   prom_debug("opal_node: %x\n", opal_node);
-   if (!PHANDLE_VALID(opal_node))
-   return;
-
-   val64 = 0;
-   prom_getprop(opal_node, "opal-runtime-size", , sizeof(val64));
-   size = be64_to_cpu(val64);
-   if (size == 0)
-   return;
-   val64 = 0;
-   prom_getprop(opal_node, "opal-runtime-alignment", ,sizeof(val64));
-   align = be64_to_cpu(val64);
-
-   base = alloc_down(size, align, 0);
-   if (base == 0) {
-   prom_printf("OPAL allocation failed !\n");
-   return;
-   }
-
-   opal_inst = call_prom("open", 1, 1, ADDR("/ibm,opal"));
-   if (!IHANDLE_VALID(opal_inst)) {
-   prom_printf("opening opal package failed (%x)\n", opal_inst);
-   return;
-   }
-
-   prom_printf("instantiating opal at 0x%llx...", base);
-
-   if (call_prom_ret("call-method", 4, 3, rets,
- ADDR("load-opal-runtime"),
- opal_inst,
- base >> 32, base & 0xffffffff) != 0
-   || (rets[0] == 0 && rets[1] == 0)) {
-   prom_printf(" failed\n");
-   return;
-   }
-   entry = (((u64)rets[0]) << 32) | rets[1];
-
-   prom_printf(" done\n");
-
-   reserve_mem(base, size);
-
-   prom_debug("opal base = 0x%llx\n", base);
-   prom_debug("opal align= 0x%llx\n", align);
-   prom_debug("opal entry= 0x%llx\n", entry);
-   prom_debug("opal size = 0x%llx\n", size);
-
-   prom_setprop(opal_node, "/ibm,opal", "opal-base-address",
-   &base, sizeof(base));
-   prom_setprop(opal_node, "/ibm,opal", "opal-entry-address",
-   &entry, sizeof(entry));
-
-#ifdef CONFIG_PPC_EARLY_DEBUG_OPAL
-   prom_opal_base = base;
-   prom_opal_entry = entry;
-#endif
-   prom_debug("prom_instantiate_opal: end...\n");
-}
-
-#endif /* CONFIG_PPC_POWERNV */
-
 /*
  * Allocate room for and instantiate RTAS
  */
@@ -2153,10 +2069,6 @@ static int __init prom_find_machine_type(void)
}
}
 #ifdef CONFIG_PPC64
-   /* Try to detect OPAL */
-   if (PHANDLE_VALID(call_prom("finddevice", 1, 1, ADDR("/ibm,opal"))))
-   return PLATFORM_OPAL;
-
/* Try to figure out if it's an IBM pSeries or any other
 * PAPR compliant platform. We assume it is if :
 *  - /device_type is "chrp" (please, do NOT use that for future
@@ -2485,7 +2397,7 @@ static void __init scan_dt_build

[PATCH 07/12] powerpc/prom_init: Move ibm_arch_vec to __prombss

2018-10-14 Thread Benjamin Herrenschmidt
Make the existing initialized definition constant and copy
it to a __prombss copy

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 70f2bde06ab3..03b171bb1c29 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -788,7 +788,7 @@ struct ibm_arch_vec {
struct option_vector6 vec6;
 } __packed;
 
-struct ibm_arch_vec __cacheline_aligned ibm_architecture_vec = {
+static const struct ibm_arch_vec ibm_architecture_vec_template = {
.pvrs = {
{
.mask = cpu_to_be32(0xfffe), /* POWER5/POWER5+ */
@@ -926,6 +926,8 @@ struct ibm_arch_vec __cacheline_aligned 
ibm_architecture_vec = {
},
 };
 
+static struct ibm_arch_vec __prombss ibm_architecture_vec ____cacheline_aligned;
+
 /* Old method - ELF header with PT_NOTE sections only works on BE */
 #ifdef __BIG_ENDIAN__
 static const struct fake_elf {
@@ -1135,6 +1137,10 @@ static void __init prom_check_platform_support(void)
};
int prop_len = prom_getproplen(prom.chosen,
   "ibm,arch-vec-5-platform-support");
+
+   /* First copy the architecture vec template */
+   ibm_architecture_vec = ibm_architecture_vec_template;
+
if (prop_len > 1) {
int i;
u8 vec[prop_len];
-- 
2.17.1
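
The reason for the template dance, which the changelog keeps terse (a
gloss): prom_init is only allowed to write to .bss.prominit, so a writable
initialized vector cannot live in .data; the initializers stay in a const
template and one struct assignment populates the BSS working copy before
first use, as the prom_check_platform_support() hunk shows. In miniature:

static const struct ibm_arch_vec ibm_architecture_vec_template = {
	/* all the initializers, safely in read-only data */
};
static struct ibm_arch_vec ibm_architecture_vec __prombss;	/* zeroed */

static void __init prom_check_platform_support(void)
{
	/* one struct assignment copies the whole template into BSS */
	ibm_architecture_vec = ibm_architecture_vec_template;
	/* ... */
}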



[PATCH 12/12] powerpc/prom_init: Generate "phandle" instead of "linux,phandle"

2018-10-14 Thread Benjamin Herrenschmidt
When creating the boot-time FDT from an actual Open Firmware live
tree, let's generate "phandle" properties for the phandles instead
of the old deprecated "linux,phandle".

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 57762f19f136..64409d2ecaee 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2407,14 +2407,12 @@ static void __init scan_dt_build_struct(phandle node, 
unsigned long *mem_start,
has_phandle = 1;
}
 
-   /* Add a "linux,phandle" property if no "phandle" property already
-* existed.
-*/
+   /* Add a "phandle" property if none already exist */
if (!has_phandle) {
-   soff = dt_find_string("linux,phandle");
+   soff = dt_find_string("phandle");
if (soff == 0)
prom_printf("WARNING: Can't find string index for"
-   "  node %s\n", path);
+   "  node %s\n", path);
else {
dt_push_token(OF_DT_PROP, mem_start, mem_end);
dt_push_token(4, mem_start, mem_end);
@@ -2474,9 +2472,9 @@ static void __init flatten_device_tree(void)
dt_string_start = mem_start;
mem_start += 4; /* hole */
 
-   /* Add "linux,phandle" in there, we'll need it */
+   /* Add "phandle" in there, we'll need it */
namep = make_room(&mem_start, &mem_end, 16, 1);
-   strcpy(namep, "linux,phandle");
+   strcpy(namep, "phandle");
mem_start = (unsigned long)namep + strlen(namep) + 1;
 
/* Build string array */
-- 
2.17.1
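
For reference, the dt_push_token() calls above build a standard flattened
device tree property record. Laid out as a struct (a sketch; the field
names are illustrative, all cells are big-endian 32-bit):

#include <stdint.h>

struct fdt_prop_record {
	uint32_t token;		/* OF_DT_PROP (0x00000003) */
	uint32_t len;		/* value length in bytes: 4 for a phandle */
	uint32_t nameoff;	/* soff: offset of "phandle" in the string block */
	uint32_t value;		/* the 32-bit phandle value itself */
};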



[PATCH 11/12] powerpc: Check prom_init for disallowed sections

2018-10-14 Thread Benjamin Herrenschmidt
prom_init.c must not modify the kernel image outside
of the .bss.prominit section. Thus make sure that
prom_init.o doesn't have anything in any of these:

.data
.bss
.init.data

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init_check.sh | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/kernel/prom_init_check.sh 
b/arch/powerpc/kernel/prom_init_check.sh
index acb6b9226352..667df97d2595 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -28,6 +28,18 @@ OBJ="$2"
 
 ERROR=0
 
+function check_section()
+{
+file=$1
+section=$2
+size=$(objdump -h -j $section $file 2>/dev/null | awk "\$2 == \"$section\" {print \$3}")
+size=${size:-0}
+if [ $size -ne 0 ]; then
+   ERROR=1
+   echo "Error: Section $section not empty in prom_init.c" >&2
+fi
+}
+
 for UNDEF in $($NM -u $OBJ | awk '{print $2}')
 do
# On 64-bit nm gives us the function descriptors, which have
@@ -66,4 +78,8 @@ do
fi
 done
 
+check_section $OBJ .data
+check_section $OBJ .bss
+check_section $OBJ .init.data
+
 exit $ERROR
-- 
2.17.1



[PATCH 02/12] powerpc/prom_init: Make "fake_elf" const

2018-10-14 Thread Benjamin Herrenschmidt
It is never modified

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index a1ba56e8c917..e44f55be64ba 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -922,7 +922,7 @@ struct ibm_arch_vec __cacheline_aligned 
ibm_architecture_vec = {
 
 /* Old method - ELF header with PT_NOTE sections only works on BE */
 #ifdef __BIG_ENDIAN__
-static struct fake_elf {
+static const struct fake_elf {
Elf32_Ehdr  elfhdr;
Elf32_Phdr  phdr[2];
struct chrpnote {
-- 
2.17.1



[PATCH 01/12] powerpc/prom_init: Make of_workarounds static

2018-10-14 Thread Benjamin Herrenschmidt
It's not used anywhere else

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/prom_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 9b38a2e5dd35..a1ba56e8c917 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -87,7 +87,7 @@
 #define OF_WORKAROUNDS 0
 #else
 #define OF_WORKAROUNDS of_workarounds
-int of_workarounds;
+static int of_workarounds __promdata;
 #endif
 
 #define OF_WA_CLAIM	1	/* do phys/virt claim separately, then map */
-- 
2.17.1



[PATCH] macintosh/windfarm_smu_sat: Fix debug output

2018-10-14 Thread Benjamin Herrenschmidt
There's some antiquated debug output that's trying
to do a hand-made hexdump and turning into horrible
1-byte-per-line output these days.

Use print_hex_dump() instead

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/macintosh/windfarm_smu_sat.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/macintosh/windfarm_smu_sat.c 
b/drivers/macintosh/windfarm_smu_sat.c
index da7f4fc1a51d..a0f61eb853c5 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -22,14 +22,6 @@
 
 #define VERSION "1.0"
 
-#define DEBUG
-
-#ifdef DEBUG
-#define DBG(args...)   printk(args)
-#else
-#define DBG(args...)   do { } while(0)
-#endif
-
 /* If the cache is older than 800ms we'll refetch it */
 #define MAX_AGE	msecs_to_jiffies(800)
 
@@ -106,13 +98,10 @@ struct smu_sdbp_header *smu_sat_get_sdb_partition(unsigned 
int sat_id, int id,
buf[i+2] = data[3];
buf[i+3] = data[2];
}
-#ifdef DEBUG
-   DBG(KERN_DEBUG "sat %d partition %x:", sat_id, id);
-   for (i = 0; i < len; ++i)
-   DBG(" %x", buf[i]);
-   DBG("\n");
-#endif
 
+   printk(KERN_DEBUG "sat %d partition %x:", sat_id, id);
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, buf, len, false);
if (size)
*size = len;
return (struct smu_sdbp_header *) buf;
@@ -132,13 +121,13 @@ static int wf_sat_read_cache(struct wf_sat *sat)
if (err < 0)
return err;
sat->last_read = jiffies;
+
 #ifdef LOTSA_DEBUG
{
int i;
-   DBG(KERN_DEBUG "wf_sat_get: data is");
-   for (i = 0; i < 16; ++i)
-   DBG(" %.2x", sat->cache[i]);
-   DBG("\n");
+   printk(KERN_DEBUG "wf_sat_get: data is");
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, sat->cache, 16, false);
}
 #endif
return 0;




Re: [PATCH] powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y

2018-10-09 Thread Benjamin Herrenschmidt
On Tue, 2018-10-09 at 21:37 +1100, Michael Ellerman wrote:
> Technically yes. But I thought because we build with mcmodel=medium
> we'll never actually get multiple TOCs in the kernel itself, so it
> doesn't actually matter.
> 
> So this seems to work for now:

Ok, fine. I forgot about DOTSYM. Will do for now.

Cheers,
Ben.

> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 301a6a86a20f..e5566e248a02 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1149,7 +1149,7 @@ TRAMP_REAL_BEGIN(hmi_exception_early)
>   EXCEPTION_PROLOG_COMMON_2(PACA_EXGEN)
>   EXCEPTION_PROLOG_COMMON_3(0xe60)
>   addi	r3,r1,STACK_FRAME_OVERHEAD
> - BRANCH_LINK_TO_FAR(hmi_exception_realmode) /* Function call ABI */
> + BRANCH_LINK_TO_FAR(DOTSYM(hmi_exception_realmode)) /* Function call ABI 
> */
>   cmpdi   cr0,r3,0
>  
>   /* Windup the stack. */
> 
> 
> cheers



Re: [PATCH 5/5] dma-direct: always allow dma mask <= physical memory size

2018-10-08 Thread Benjamin Herrenschmidt
On Wed, 2018-10-03 at 16:10 -0700, Alexander Duyck wrote:
> > -* Because 32-bit DMA masks are so common we expect every 
> > architecture
> > -* to be able to satisfy them - either by not supporting more 
> > physical
> > -* memory, or by providing a ZONE_DMA32.  If neither is the case, 
> > the
> > -* architecture needs to use an IOMMU instead of the direct mapping.
> > -*/
> > -   if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
> > +   u64 min_mask;
> > +
> > +   if (IS_ENABLED(CONFIG_ZONE_DMA))
> > +   min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
> > +   else
> > +   min_mask = DMA_BIT_MASK(32);
> > +
> > +   min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
> > +
> > +   if (mask >= phys_to_dma(dev, min_mask))
> >  return 0;
> > -#endif
> >  return 1;
> >   }
> 
> So I believe I have run into the same issue that Guenter reported. On
> an x86_64 system w/ Intel IOMMU. I wasn't able to complete boot and
> all probe attempts for various devices were failing with -EIO errors.
> 
> I believe the last mask check should be "if (mask < phys_to_dma(dev,
> min_mask))" not a ">=" check.

Right, that test is backwards. I needed to change it here too (powermac
with the rest of the powerpc series).

Cheers,
Ben.




Re: [PATCH v6 0/9] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2018-10-08 Thread Benjamin Herrenschmidt
On Mon, 2018-10-08 at 09:16 +, Christophe Leroy wrote:
> The purpose of this series is to activate CONFIG_THREAD_INFO_IN_TASK which
> moves the thread_info into task_struct.

We need to make sure we don't have code that assumes that we don't take
faults on TI access.

On ppc64, the stack SLB entries are bolted, which means the TI is too.

We might have code that assumes that we don't get SLB faults when
accessing TI. If not, we're fine but that needs a close look.

Ben.

> Moving thread_info into task_struct has the following advantages:
> - It protects thread_info from corruption in the case of stack
> overflows.
> - Its address is harder to determine if stack addresses are
> leaked, making a number of attacks more difficult.
> 
> Changes since v5:
>  - Fixed livepatch_sp setup by using end_of_stack() instead of hardcoding
>  - Fixed PPC_BPF_LOAD_CPU() macro
> 
> Changes since v4:
>  - Fixed a build failure on 32bits SMP when include/generated/asm-offsets.h 
> is not
>  already existing, was due to spaces instead of a tab in the Makefile
> 
> Changes since RFC v3: (based on Nick's review)
>  - Renamed task_size.h to task_size_user64.h to better relate to what it 
> contains.
>  - Handling of the isolation of thread_info cpu field inside CONFIG_SMP 
> #ifdefs moved to a separate patch.
>  - Removed CURRENT_THREAD_INFO macro completely.
>  - Added a guard in asm/smp.h to avoid build failure before _TASK_CPU is 
> defined.
>  - Added a patch at the end to rename 'tp' pointers to 'sp' pointers
>  - Renamed 'tp' into 'sp' pointers in preparation patch when relevant
>  - Fixed a few commit logs
>  - Fixed checkpatch report.
> 
> Changes since RFC v2:
>  - Removed the modification of names in asm-offsets
>  - Created a rule in arch/powerpc/Makefile to append the offset of 
> current->cpu in CFLAGS
>  - Modified asm/smp.h to use the offset set in CFLAGS
>  - Squashed the renaming of THREAD_INFO to TASK_STACK in the preparation patch
>  - Moved the modification of current_pt_regs in the patch activating 
> CONFIG_THREAD_INFO_IN_TASK
> 
> Changes since RFC v1:
>  - Removed the first patch which was modifying header inclusion order in timer
>  - Modified some names in asm-offsets to avoid conflicts when including 
> asm-offsets in C files
>  - Modified asm/smp.h to avoid having to include linux/sched.h (using 
> asm-offsets instead)
>  - Moved some changes from the activation patch to the preparation patch.
> 
> Christophe Leroy (9):
>   book3s/64: avoid circular header inclusion in mmu-hash.h
>   powerpc: Only use task_struct 'cpu' field on SMP
>   powerpc: Prepare for moving thread_info into task_struct
>   powerpc: Activate CONFIG_THREAD_INFO_IN_TASK
>   powerpc: regain entire stack space
>   powerpc: 'current_set' is now a table of task_struct pointers
>   powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU
>   powerpc/64: Remove CURRENT_THREAD_INFO
>   powerpc: clean stack pointers naming
> 
>  arch/powerpc/Kconfig   |  1 +
>  arch/powerpc/Makefile  |  8 ++-
>  arch/powerpc/include/asm/asm-prototypes.h  |  4 +-
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h  |  2 +-
>  arch/powerpc/include/asm/exception-64s.h   |  4 +-
>  arch/powerpc/include/asm/irq.h | 14 ++---
>  arch/powerpc/include/asm/livepatch.h   |  7 ++-
>  arch/powerpc/include/asm/processor.h   | 39 +
>  arch/powerpc/include/asm/ptrace.h  |  2 +-
>  arch/powerpc/include/asm/reg.h |  2 +-
>  arch/powerpc/include/asm/smp.h | 17 +-
>  arch/powerpc/include/asm/task_size_user64.h| 42 ++
>  arch/powerpc/include/asm/thread_info.h | 19 ---
>  arch/powerpc/kernel/asm-offsets.c  | 10 ++--
>  arch/powerpc/kernel/entry_32.S | 66 --
>  arch/powerpc/kernel/entry_64.S | 12 ++--
>  arch/powerpc/kernel/epapr_hcalls.S |  5 +-
>  arch/powerpc/kernel/exceptions-64e.S   | 13 +
>  arch/powerpc/kernel/exceptions-64s.S   |  2 +-
>  arch/powerpc/kernel/head_32.S  | 14 ++---
>  arch/powerpc/kernel/head_40x.S |  4 +-
>  arch/powerpc/kernel/head_44x.S |  8 +--
>  arch/powerpc/kernel/head_64.S  |  1 +
>  arch/powerpc/kernel/head_8xx.S |  2 +-
>  arch/powerpc/kernel/head_booke.h   | 12 +---
>  arch/powerpc/kernel/head_fsl_booke.S   | 16 +++---
>  arch/powerpc/kernel/idle_6xx.S |  8 +--
>  arch/powerpc/kernel/idle_book3e.S  |  2 +-
>  arch/powerpc/kernel/idle_e500.S|  8 +--
>  arch/powerpc/kernel/idle_power4.S  |  2 +-
>  arch/powerpc/kernel/irq.c  | 77 
> +-
>  arch/powerpc/kernel/kgdb.c | 28 --
>  arch/powerpc/kernel/machine_kexec_64.c |  6 +-
>  

Re: [PATCH] powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y

2018-10-08 Thread Benjamin Herrenschmidt
On Mon, 2018-10-08 at 17:04 +1000, Nicholas Piggin wrote:
> On Mon, 08 Oct 2018 15:08:31 +1100
> Benjamin Herrenschmidt  wrote:
> 
> > HMIs will crash the kernel due to
> > 
> > BRANCH_LINK_TO_FAR(hmi_exception_realmode)
> > 
> > Calling into the OPD instead of the actual code.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> > 
> > This hack fixes it for me, but it's not great. Nick, any better idea ?
> 
> Is it a hack because the ifdef gunk, or because there's something
> deeper wrong with using the .sym?

I'd say ifdef gunk, also the KVM use doesn't need it because the kvm entry
isn't an OPD.

> I guess all those handlers that load label address by hand could have
> the bug silently creep in. Can we have them use the DOTSYM() macro?

The KVM one doesn't have a dotsym, does it ?

Also should we load the TOC from the OPD ?

> Thanks,
> Nick
> 
> > 
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> > b/arch/powerpc/kernel/exceptions-64s.S
> > index ea04dfb..752709cc8 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -1119,7 +1119,11 @@ TRAMP_REAL_BEGIN(hmi_exception_early)
> > EXCEPTION_PROLOG_COMMON_2(PACA_EXGEN)
> > EXCEPTION_PROLOG_COMMON_3(0xe60)
> > addi	r3,r1,STACK_FRAME_OVERHEAD
> > +#ifdef PPC64_ELF_ABI_v1
> > +   BRANCH_LINK_TO_FAR(.hmi_exception_realmode) /* Function call ABI */
> > +#else
> > BRANCH_LINK_TO_FAR(hmi_exception_realmode) /* Function call ABI */
> > +#endif
> > cmpdi   cr0,r3,0
> >  
> > /* Windup the stack. */
> > 
> > 



[PATCH] powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y

2018-10-07 Thread Benjamin Herrenschmidt
HMIs will crash the kernel due to

BRANCH_LINK_TO_FAR(hmi_exception_realmode)

Calling into the OPD instead of the actual code.

Signed-off-by: Benjamin Herrenschmidt 
---

This hack fixes it for me, but it's not great. Nick, any better idea ?

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index ea04dfb..752709cc8 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1119,7 +1119,11 @@ TRAMP_REAL_BEGIN(hmi_exception_early)
EXCEPTION_PROLOG_COMMON_2(PACA_EXGEN)
EXCEPTION_PROLOG_COMMON_3(0xe60)
addi	r3,r1,STACK_FRAME_OVERHEAD
+#ifdef PPC64_ELF_ABI_v1
+   BRANCH_LINK_TO_FAR(.hmi_exception_realmode) /* Function call ABI */
+#else
BRANCH_LINK_TO_FAR(hmi_exception_realmode) /* Function call ABI */
+#endif
cmpdi   cr0,r3,0
 
/* Windup the stack. */
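
For readers not steeped in the ABI, the failure mode is this: on ELFv1 the
symbol hmi_exception_realmode resolves to a function descriptor in .opd,
not to code, so the branch above executes data. The descriptor looks
roughly like this (a sketch; field names are illustrative):

struct func_descriptor {
	unsigned long entry;	/* address of the actual code; this is what
				 * the "."-prefixed symbol points at */
	unsigned long toc;	/* TOC (r2) value for the function */
	unsigned long env;	/* environment pointer, unused by C */
};

The DOTSYM() macro discussed elsewhere in the thread expands to the dot
symbol on ABIv1 and to the plain symbol on ABIv2, which is what makes it
the cleaner fix here.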




Re: [PATCH] memblock: stop using implicit alignement to SMP_CACHE_BYTES

2018-10-04 Thread Benjamin Herrenschmidt
On Fri, 2018-10-05 at 00:07 +0300, Mike Rapoport wrote:
> When a memblock allocation APIs are called with align = 0, the alignment is
> implicitly set to SMP_CACHE_BYTES.
> 
> Replace all such uses of memblock APIs with the 'align' parameter explicitly
> set to SMP_CACHE_BYTES and stop implicit alignment assignment in the
> memblock internal allocation functions.
> 
> For the case when memblock APIs are used via helper functions, e.g. like
> iommu_arena_new_node() in Alpha, the helper functions were detected with
> Coccinelle's help and then manually examined and updated where appropriate.
> 
> The direct memblock APIs users were updated using the semantic patch below:

What is the purpose of this ? It sounds rather counter-intuitive...

Ben.




Re: dma mask related fixups (including full bus_dma_mask support) v2

2018-10-01 Thread Benjamin Herrenschmidt
On Mon, 2018-10-01 at 16:32 +0200, Christoph Hellwig wrote:
> FYI, I've pulled this into the dma-mapping tree to make forward
> progress.  All but patch 4 had formal ACKs, and for that one Robin
> was fine even without an explicit ack.  I'll also send a patch to
> better document the zone selection as it confuses even well versed
> people like Ben.

Thanks. I"ll try to dig out some older systems to test.

Cheers,
Ben.



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-27 Thread Benjamin Herrenschmidt
On Thu, 2018-09-27 at 15:49 +0200, Christoph Hellwig wrote:
> On Thu, Sep 27, 2018 at 11:45:15AM +1000, Benjamin Herrenschmidt wrote:
> > I'm not sure this is entirely right.
> > 
> > Let's say the mask is 30 bits. You will return GFP_DMA32, which will
> > fail if you allocate something above 1G (which is legit for
> > ZONE_DMA32).
> 
> And then we will try GFP_DMA further down in the function:
> 
>   if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>   dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
>   !(gfp & GFP_DMA)) {
>   gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
>   goto again;
>   }
> 
> This is and old optimization from x86, because chances are high that
> GFP_DMA32 will give you suitable memory for the infamous 31-bit
> dma mask devices (at least at boot time) and thus we don't have
> to deplete the tiny ZONE_DMA pool.

I see, it's rather confusing :-) Wouldn't it be better to check against
top of 32-bit memory instead here too ?


Cheers,
Ben.



Re: [PATCH 1/5] dma-mapping: make the get_required_mask method available unconditionally

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> This save some duplication for ia64, and makes the interface more
> general.  In the long run we want each dma_map_ops instance to fill this
> out, but this will take a little more prep work.
> 
> Signed-off-by: Christoph Hellwig 

(For powerpc)

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/ia64/include/asm/dma-mapping.h  |  2 --
>  arch/ia64/include/asm/machvec.h  |  7 ---
>  arch/ia64/include/asm/machvec_init.h |  1 -
>  arch/ia64/include/asm/machvec_sn2.h  |  2 --
>  arch/ia64/pci/pci.c  | 26 --
>  arch/ia64/sn/pci/pci_dma.c   |  4 ++--
>  drivers/base/platform.c  | 13 +++--
>  drivers/pci/controller/vmd.c |  4 
>  include/linux/dma-mapping.h  |  2 --
>  9 files changed, 13 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/dma-mapping.h 
> b/arch/ia64/include/asm/dma-mapping.h
> index 76e4d6632d68..522745ae67bb 100644
> --- a/arch/ia64/include/asm/dma-mapping.h
> +++ b/arch/ia64/include/asm/dma-mapping.h
> @@ -10,8 +10,6 @@
>  #include 
>  #include 
>  
> -#define ARCH_HAS_DMA_GET_REQUIRED_MASK
> -
>  extern const struct dma_map_ops *dma_ops;
>  extern struct ia64_machine_vector ia64_mv;
>  extern void set_iommu_machvec(void);
> diff --git a/arch/ia64/include/asm/machvec.h b/arch/ia64/include/asm/machvec.h
> index 267f4f170191..5133739966bc 100644
> --- a/arch/ia64/include/asm/machvec.h
> +++ b/arch/ia64/include/asm/machvec.h
> @@ -44,7 +44,6 @@ typedef void ia64_mv_kernel_launch_event_t(void);
>  
>  /* DMA-mapping interface: */
>  typedef void ia64_mv_dma_init (void);
> -typedef u64 ia64_mv_dma_get_required_mask (struct device *);
>  typedef const struct dma_map_ops *ia64_mv_dma_get_ops(struct device *);
>  
>  /*
> @@ -127,7 +126,6 @@ extern void machvec_tlb_migrate_finish (struct mm_struct 
> *);
>  #  define platform_global_tlb_purge  ia64_mv.global_tlb_purge
>  #  define platform_tlb_migrate_finishia64_mv.tlb_migrate_finish
>  #  define platform_dma_init  ia64_mv.dma_init
> -#  define platform_dma_get_required_mask ia64_mv.dma_get_required_mask
>  #  define platform_dma_get_ops   ia64_mv.dma_get_ops
>  #  define platform_irq_to_vector ia64_mv.irq_to_vector
>  #  define platform_local_vector_to_irq   ia64_mv.local_vector_to_irq
> @@ -171,7 +169,6 @@ struct ia64_machine_vector {
>   ia64_mv_global_tlb_purge_t *global_tlb_purge;
>   ia64_mv_tlb_migrate_finish_t *tlb_migrate_finish;
>   ia64_mv_dma_init *dma_init;
> - ia64_mv_dma_get_required_mask *dma_get_required_mask;
>   ia64_mv_dma_get_ops *dma_get_ops;
>   ia64_mv_irq_to_vector *irq_to_vector;
>   ia64_mv_local_vector_to_irq *local_vector_to_irq;
> @@ -211,7 +208,6 @@ struct ia64_machine_vector {
>   platform_global_tlb_purge,  \
>   platform_tlb_migrate_finish,\
>   platform_dma_init,  \
> - platform_dma_get_required_mask, \
>   platform_dma_get_ops,   \
>   platform_irq_to_vector, \
>   platform_local_vector_to_irq,   \
> @@ -286,9 +282,6 @@ extern const struct dma_map_ops *dma_get_ops(struct 
> device *);
>  #ifndef platform_dma_get_ops
>  # define platform_dma_get_opsdma_get_ops
>  #endif
> -#ifndef platform_dma_get_required_mask
> -# define  platform_dma_get_required_mask ia64_dma_get_required_mask
> -#endif
>  #ifndef platform_irq_to_vector
>  # define platform_irq_to_vector  __ia64_irq_to_vector
>  #endif
> diff --git a/arch/ia64/include/asm/machvec_init.h 
> b/arch/ia64/include/asm/machvec_init.h
> index 2b32fd06b7c6..2aafb69a3787 100644
> --- a/arch/ia64/include/asm/machvec_init.h
> +++ b/arch/ia64/include/asm/machvec_init.h
> @@ -4,7 +4,6 @@
>  
>  extern ia64_mv_send_ipi_t ia64_send_ipi;
>  extern ia64_mv_global_tlb_purge_t ia64_global_tlb_purge;
> -extern ia64_mv_dma_get_required_mask ia64_dma_get_required_mask;
>  extern ia64_mv_irq_to_vector __ia64_irq_to_vector;
>  extern ia64_mv_local_vector_to_irq __ia64_local_vector_to_irq;
>  extern ia64_mv_pci_get_legacy_mem_t ia64_pci_get_legacy_mem;
> diff --git a/arch/ia64/include/asm/machvec_sn2.h 
> b/arch/ia64/include/asm/machvec_sn2.h
> index ece9fa85be88..b5153d300289 100644
> --- a/arch/ia64/include/asm/machvec_sn2.h
> +++ b/arch/ia64/include/asm/machvec_sn2.h
> @@ -55,7 +55,6 @@ extern ia64_mv_readb_t __sn_readb_relaxed;
>  extern ia64_mv_readw_t __sn_readw_relaxed;
>  extern ia64_mv_readl_t __sn_readl_relaxed;
>  extern ia64_mv_readq_t __sn_readq_relaxed;
> -extern ia64_mv_dma_get_requir

Re: [PATCH 5/5] dma-direct: always allow dma mask <= physical memory size

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> This way an architecture with less than 4G of RAM can support dma_mask
> smaller than 32-bit without a ZONE_DMA.  Apparently that is a common
> case on powerpc.

Anything that uses a b43 wifi adapter which has a 31-bit limitation
actually :-)
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/dma/direct.c | 28 
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 64466b7ef67b..d1e103c6b107 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -4,6 +4,7 @@
>   *
>   * DMA operations that map physical memory directly without using an IOMMU.
>   */
> +#include <linux/bootmem.h> /* for max_pfn */
>  #include 
>  #include 
>  #include 
> @@ -283,21 +284,24 @@ int dma_direct_map_sg(struct device *dev, struct 
> scatterlist *sgl, int nents,
>   return nents;
>  }
>  
> +/*
> + * Because 32-bit DMA masks are so common we expect every architecture to be
> + * able to satisfy them - either by not supporting more physical memory, or by
> + * providing a ZONE_DMA32.  If neither is the case, the architecture needs to
> + * use an IOMMU instead of the direct mapping.
> + */
>  int dma_direct_supported(struct device *dev, u64 mask)
>  {
> -#ifdef CONFIG_ZONE_DMA
> - if (mask < phys_to_dma(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
> - return 0;
> -#else
> - /*
> -  * Because 32-bit DMA masks are so common we expect every architecture
> -  * to be able to satisfy them - either by not supporting more physical
> -  * memory, or by providing a ZONE_DMA32.  If neither is the case, the
> -  * architecture needs to use an IOMMU instead of the direct mapping.
> -  */
> - if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
> + u64 min_mask;
> +
> + if (IS_ENABLED(CONFIG_ZONE_DMA))
> + min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
> + else
> + min_mask = min_t(u64, DMA_BIT_MASK(32),
> +  (max_pfn - 1) << PAGE_SHIFT);
> +
> + if (mask >= phys_to_dma(dev, min_mask))
>   return 0;

nitpick ... to be completely "correct", I would have written

if (IS_ENABLED(CONFIG_ZONE_DMA))
min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
else
min_mask = DMA_BIT_MASK(32);

min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);

In "theory" it's also ok to have a mask < ZONE_DMA_BITS as long as it's
big enough to fit all memory :-)

> -#endif
>   return 1;
>  }
>  
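
To put numbers on that theory (values assumed for illustration): a 31-bit
device such as b43 in a 512MB machine with no ZONE_DMA would pass, because
the top-of-RAM clamp shrinks the requirement below the device's mask:

	u64 mask     = DMA_BIT_MASK(31);		/* 0x7fffffff */
	u64 min_mask = DMA_BIT_MASK(32);		/* no ZONE_DMA */

	/* clamp to the top of RAM, 512MB here */
	min_mask = min_t(u64, min_mask, (512ULL << 20) - 1);	/* 0x1fffffff */

	/* mask >= min_mask, so direct DMA works without any DMA zone */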



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> +static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
> +   u64 *phys_mask)
> +{
> +   if (force_dma_unencrypted())
> +   *phys_mask = __dma_to_phys(dev, dma_mask);
> +   else
> +   *phys_mask = dma_to_phys(dev, dma_mask);
> +
> +   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */
> +   if (*phys_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> +   return GFP_DMA;
> +   if (*phys_mask <= DMA_BIT_MASK(32))
> +   return GFP_DMA32;
> +   return 0;
> +}

I'm not sure this is entirely right.

Let's say the mask is 30 bits. You will return GFP_DMA32, which will
fail if you allocate something above 1G (which is legit for
ZONE_DMA32).

I think the logic should be:

if (mask < ZONE_DMA)
fail;
else if (mask < ZONE_DMA32)
use ZONE_DMA or fail if doesn't exist
else if (mask < top_of_ram)
use ZONE_DMA32 or fail if doesn't exist
else
use ZONE_NORMAL

Additionally, we want to fold in the top-of-ram test such that we don't
fail the second case if the mask is 31 bits (smaller than ZONE_DMA32)
but the top of RAM is also small enough.

So the top-of-ram test should take precedence.
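
Spelled out in C, the policy suggested above would look roughly like this
(a sketch only; the "fail" paths and the phys_mask plumbing of the quoted
helper are left out, and top_of_ram stands in for
(max_pfn - 1) << PAGE_SHIFT):

static bool pick_zone(u64 mask, u64 top_of_ram, gfp_t *gfp)
{
	/* top-of-ram test first: a mask covering all of RAM can always
	 * be served from ZONE_NORMAL, however small it is */
	if (mask >= top_of_ram) {
		*gfp = 0;
		return true;
	}
	if (mask < DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
		return false;			/* no zone is small enough */
	if (mask < DMA_BIT_MASK(32)) {
		*gfp = GFP_DMA;			/* needs ZONE_DMA */
		return IS_ENABLED(CONFIG_ZONE_DMA);
	}
	*gfp = GFP_DMA32;			/* needs ZONE_DMA32 */
	return IS_ENABLED(CONFIG_ZONE_DMA32);
}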

Cheers,
Ben.




Re: [PATCH 2/5] dma-direct: add an explicit dma_direct_get_required_mask

2018-09-26 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> This is somewhat modelled after the powerpc version, and differs from
> the legacy fallback in use fls64 instead of pointlessly splitting up the
> address into low and high dwords and in that it takes (__)phys_to_dma
> into account.

This looks like it will be usable if/when we switch powerpc to
dma/direct.c

Acked-by: Benjamin Herrenschmidt 
---
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/dma-direct.h |  1 +
>  kernel/dma/direct.c| 21 ++---
>  2 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index 86a59ba5a7f3..b79496d8c75b 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -55,6 +55,7 @@ static inline void dma_mark_clean(void *addr, size_t size)
>  }
>  #endif /* CONFIG_ARCH_HAS_DMA_MARK_CLEAN */
>  
> +u64 dma_direct_get_required_mask(struct device *dev);
>  void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t 
> *dma_handle,
>   gfp_t gfp, unsigned long attrs);
>  void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index c954f0a6dc62..81b73a5bba54 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -53,11 +53,25 @@ check_addr(struct device *dev, dma_addr_t dma_addr, 
> size_t size,
>   return true;
>  }
>  
> +static inline dma_addr_t phys_to_dma_direct(struct device *dev,
> + phys_addr_t phys)
> +{
> + if (force_dma_unencrypted())
> + return __phys_to_dma(dev, phys);
> + return phys_to_dma(dev, phys);
> +}
> +
> +u64 dma_direct_get_required_mask(struct device *dev)
> +{
> + u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
> +
> + return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
> +}
> +
>  static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t 
> size)
>  {
> - dma_addr_t addr = force_dma_unencrypted() ?
> - __phys_to_dma(dev, phys) : phys_to_dma(dev, phys);
> - return addr + size - 1 <= dev->coherent_dma_mask;
> + return phys_to_dma_direct(dev, phys) + size - 1 <=
> + dev->coherent_dma_mask;
>  }
>  
>  void *dma_direct_alloc_pages(struct device *dev, size_t size,
> @@ -296,6 +310,7 @@ const struct dma_map_ops dma_direct_ops = {
>   .unmap_page = dma_direct_unmap_page,
>   .unmap_sg   = dma_direct_unmap_sg,
>  #endif
> + .get_required_mask  = dma_direct_get_required_mask,
>   .dma_supported  = dma_direct_supported,
>   .mapping_error  = dma_direct_mapping_error,
>   .cache_sync = arch_dma_cache_sync,



Re: [PATCH] MAINTAINERS: Add PPC contacts for PCI core error handling

2018-09-13 Thread Benjamin Herrenschmidt
On Wed, 2018-09-12 at 11:58 -0500, Bjorn Helgaas wrote:
> > Add the generic PCI core error recovery files to the powerpc EEH
> > MAINTAINERS entry so the powerpc folks will be copied on changes to the
> > generic PCI error handling strategy.
> 
> I really want to make sure the powerpc folks are plugged into any PCI core
> error handling discussions.  Please let me know if there's a better way
> than this patch, or if there are other people who should be added.

Sounds good. Oliver, you want to be looped in as well ?

Cheers,
Ben.




Re: [PATCH 0/2] sriov enablement on s390

2018-09-12 Thread Benjamin Herrenschmidt
On Wed, 2018-09-12 at 08:02 -0500, Bjorn Helgaas wrote:
> [+cc Arnd, powerpc folks]

[+Oliver]

> On Wed, Sep 12, 2018 at 02:34:09PM +0200, Sebastian Ott wrote:
> > Hello Bjorn,
> > 
> > On s390 we currently handle SRIOV within firmware. Which means
> > that the PF is under firmware control and not visible to operating
> > systems. SRIOV enablement happens within firmware and VFs are
> > passed through to logical partitions.
> > 
> > I'm working on a new mode were the PF is under operating system
> > control (including SRIOV enablement). However we still need
> > firmware support to access the VFs. The way this is supposed
> > to work is that when firmware traps the SRIOV enablement it
> > will present machine checks to the logical partition that
> > triggered the SRIOV enablement and provide the VFs via hotplug
> > events.
> > 
> > The problem I'm faced with is that the VF detection code in
> > sriov_enable leads to unusable functions in s390.
> 
> We're moving away from the weak function implementation style.  Can
> you take a look at Arnd's work here, which uses pci_host_bridge
> callbacks instead?
> 
>   https://lkml.kernel.org/r/20180817102645.3839621-1-a...@arndb.de
> 
> I cc'd some powerpc folks because they also have a fair amount of
> arch-specific SR-IOV code that might one day move in this direction.
> 
> > Sebastian Ott (2):
> >   pci: provide pcibios_sriov_add_vfs
> >   s390/pci: handle function enumeration after sriov enablement
> > 
> >  arch/s390/pci/pci.c | 11 +++
> >  drivers/pci/iov.c   | 43 +++
> >  include/linux/pci.h |  2 ++
> >  3 files changed, 44 insertions(+), 12 deletions(-)
> > 
> > -- 
> > 2.13.4
> > 



Re: [PATCH v2 5/5] PCI/powerpc/eeh: Add pcibios hooks for preparing to rescan

2018-09-12 Thread Benjamin Herrenschmidt
On Mon, 2018-09-10 at 19:00 +0300, Sergey Miroshnichenko wrote:
> 
> Yes, missing a real EEH event is possible, unfortunately, and it is
> indeed worth mentioning.
> 
> To reduce this probability the next patchset I'll post in a few days
> among other things puts all the affected device drivers to pause during
> rescan, mainly because of moving BARs and bridge windows, but it will
> also help here a bit.

How do you deal with moving BARs etc... within the segmenting
restrictions of EEH ?

It's a horrible mess right now and I don't know if the current code can
even work properly to be honest.

Cheers,
Ben.




Re: [PATCH 5/5] PCI/powerpc/eeh: Add pcibios hooks for preparing to rescan

2018-09-12 Thread Benjamin Herrenschmidt
On Wed, 2018-09-05 at 18:40 +0300, Sergey Miroshnichenko wrote:
> Reading an empty slot returns all ones, which triggers a false
> EEH error event on PowerNV.
> 
> New callbacks pcibios_rescan_prepare/done are introduced to
> pause/resume the EEH during rescan.

This freaks me out a bit. EEH will still happen in HW, you will just
drop the notifications. Disabling the interrupt globally is not a great
idea either as it will affect other unrelated devices or busses.

Just marking the specific bus might work if we deal with unfreezing it
properly but you need some locking there, your code doesn't seem to
have any.

Ben.

> 
> Signed-off-by: Sergey Miroshnichenko 
> ---
>  arch/powerpc/include/asm/eeh.h   |  2 ++
>  arch/powerpc/kernel/eeh.c| 14 ++
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 20 
>  drivers/pci/probe.c  | 14 ++
>  include/linux/pci.h  |  2 ++
>  5 files changed, 52 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 219637ea69a1..926c3e31df99 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -219,6 +219,8 @@ struct eeh_ops {
>   int (*next_error)(struct eeh_pe **pe);
>   int (*restore_config)(struct pci_dn *pdn);
>   int (*notify_resume)(struct pci_dn *pdn);
> + int (*pause)(struct pci_bus *bus);
> + int (*resume)(struct pci_bus *bus);
>  };
>  
>  extern int eeh_subsystem_flags;
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 6ebba3e48b01..dce9b0978cb5 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1831,3 +1831,17 @@ static int __init eeh_init_proc(void)
>   return 0;
>  }
>  __initcall(eeh_init_proc);
> +
> +void pcibios_rescan_prepare(struct pci_bus *bus)
> +{
> + if (eeh_ops && eeh_ops->pause) {
> + eeh_ops->pause(bus);
> + }
> +}
> +
> +void pcibios_rescan_done(struct pci_bus *bus)
> +{
> + if (eeh_ops && eeh_ops->resume) {
> + eeh_ops->resume(bus);
> + }
> +}
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
> b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 3c1beae29f2d..9c9213d92550 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -59,6 +59,24 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
>   eeh_sysfs_add_device(pdev);
>  }
>  
> +static int pnv_eeh_pause(struct pci_bus *bus)
> +{
> + struct pci_controller *hose = pci_bus_to_host(bus);
> + struct pnv_phb *phb = hose->private_data;
> + phb->flags &= ~PNV_PHB_FLAG_EEH;
> + disable_irq(eeh_event_irq);
> + return 0;
> +}
> +
> +static int pnv_eeh_resume(struct pci_bus *bus)
> +{
> + struct pci_controller *hose = pci_bus_to_host(bus);
> + struct pnv_phb *phb = hose->private_data;
> + enable_irq(eeh_event_irq);
> + phb->flags |= PNV_PHB_FLAG_EEH;
> + return 0;
> +}
> +
>  static int pnv_eeh_init(void)
>  {
>   struct pci_controller *hose;
> @@ -1710,6 +1728,8 @@ static struct eeh_ops pnv_eeh_ops = {
>   .write_config   = pnv_eeh_write_config,
>   .next_error = pnv_eeh_next_error,
>   .restore_config = pnv_eeh_restore_config,
> + .pause  = pnv_eeh_pause,
> + .resume = pnv_eeh_resume,
>   .notify_resume  = NULL
>  };
>  
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ec784009a36b..203368566896 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2893,6 +2893,14 @@ void __weak pcibios_remove_bus(struct pci_bus *bus)
>  {
>  }
>  
> +void __weak pcibios_rescan_prepare(struct pci_bus *bus)
> +{
> +}
> +
> +void __weak pcibios_rescan_done(struct pci_bus *bus)
> +{
> +}
> +
>  struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
>   struct pci_ops *ops, void *sysdata, struct list_head *resources)
>  {
> @@ -3147,9 +3155,15 @@ unsigned int pci_rescan_bus_bridge_resize(struct 
> pci_dev *bridge)
>  unsigned int pci_rescan_bus(struct pci_bus *bus)
>  {
>   unsigned int max;
> + struct pci_bus *root = bus;
> + while (!pci_is_root_bus(root)) {
> + root = root->parent;
> + }
>  
> + pcibios_rescan_prepare(root);
>   max = pci_scan_child_bus(bus);
>   pci_assign_unassigned_bus_resources(bus);
> + pcibios_rescan_done(root);
>   pci_bus_add_devices(bus);
>  
>   return max;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index e72ca8dd6241..d7fe72aa53b3 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1952,6 +1952,8 @@ void pcibios_penalize_isa_irq(int irq, int active);
>  int pcibios_alloc_irq(struct pci_dev *dev);
>  void pcibios_free_irq(struct pci_dev *dev);
>  resource_size_t pcibios_default_alignment(void);
> +void 

Re: [PATCH 4/5] powerpc/powernv/pci: Enable reassigning the bus numbers

2018-09-12 Thread Benjamin Herrenschmidt
On Wed, 2018-09-05 at 18:40 +0300, Sergey Miroshnichenko wrote:
> PowerNV doesn't depend on PCIe topology info from DT anymore, and now
> it is able to enumerate the fabric and assign the bus numbers.

No it's not, at least unless we drop P7 support.

P7 has constraints on the bus ranges being aligned power-of-two for the
PE assignment to work, which is why we have to honor the firmware
provided numbers.

Additionally, this breaks the mapping between the firmware idea of the
bus numbers and Linux idea. This will probably break all of the SR-IOV
stuff.

Now we should probably fix it all by removing the FW bits completely
and doing it all from Linux, though we really need to better handle how
we deal with the segmented MMIO space.

I would also be wary of what other parts of the code depend on that
matching between the FW bdfn and the Linux bdfn.

Cheers,
Ben.

> Signed-off-by: Sergey Miroshnichenko 
> ---
>  arch/powerpc/platforms/powernv/pci.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index 6d4280086a08..f6eaca3123cd 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -1104,6 +1104,7 @@ void __init pnv_pci_init(void)
>   struct device_node *np;
>  
>   pci_add_flags(PCI_CAN_SKIP_ISA_ALIGN);
> + pci_add_flags(PCI_REASSIGN_ALL_BUS);
>  
>   /* If we don't have OPAL, eg. in sim, just skip PCI probe */
>   if (!firmware_has_feature(FW_FEATURE_OPAL))



Re: [PATCH 2/5] powerpc/pci: Create pci_dn on demand

2018-09-12 Thread Benjamin Herrenschmidt
On Wed, 2018-09-05 at 18:40 +0300, Sergey Miroshnichenko wrote:
> The pci_dn structures can be created not only from DT, but also
> directly from newly discovered PCIe devices, so allocate them
> dynamically.

I'd rather we moved towards killing pci_dn completely to be honest :)

> Signed-off-by: Sergey Miroshnichenko 
> ---
>  arch/powerpc/kernel/pci_dn.c | 69 +++-
>  1 file changed, 52 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index ab147a1909c8..5ce752874827 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -33,6 +33,8 @@
>  #include 
>  #include 
>  
> +static struct pci_dn* create_pdn(struct pci_dev *pdev, struct pci_dn 
> *parent);
> +
>  /*
>   * The function is used to find the firmware data of one
>   * specific PCI device, which is attached to the indicated
> @@ -58,6 +60,10 @@ static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
>   pbus = pbus->parent;
>   }
>  
> + if (!pbus->self && !pci_is_root_bus(pbus)) {
> + return NULL;
> + }
> +
>   /*
>* Except virtual bus, all PCI buses should
>* have device nodes.
> @@ -65,13 +71,16 @@ static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
>   dn = pci_bus_to_OF_node(pbus);
>   pdn = dn ? PCI_DN(dn) : NULL;
>  
> + if (!pdn && pbus->self) {
> + pdn = pbus->self->dev.archdata.pci_data;
> + }
> +
>   return pdn;
>  }
>  
>  struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
>   int devfn)
>  {
> - struct device_node *dn = NULL;
>   struct pci_dn *parent, *pdn;
>   struct pci_dev *pdev = NULL;
>  
> @@ -80,17 +89,10 @@ struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
>   if (pdev->devfn == devfn) {
>   if (pdev->dev.archdata.pci_data)
>   return pdev->dev.archdata.pci_data;
> -
> - dn = pci_device_to_OF_node(pdev);
>   break;
>   }
>   }
>  
> - /* Fast path: fetch from device node */
> - pdn = dn ? PCI_DN(dn) : NULL;
> - if (pdn)
> - return pdn;
> -
>   /* Slow path: fetch from firmware data hierarchy */
>   parent = pci_bus_to_pdn(bus);
>   if (!parent)
> @@ -128,16 +130,9 @@ struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
>   if (!parent)
>   return NULL;
>  
> - list_for_each_entry(pdn, &parent->child_list, list) {
> - if (pdn->busno == pdev->bus->number &&
> - pdn->devfn == pdev->devfn)
> - return pdn;
> - }
> -
> - return NULL;
> + return create_pdn(pdev, parent);
>  }
>  
> -#ifdef CONFIG_PCI_IOV
>  static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
>  int vf_index,
>  int busno, int devfn)
> @@ -164,7 +159,47 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
> *parent,
>  
>   return pdn;
>  }
> -#endif
> +
> +static struct pci_dn* create_pdn(struct pci_dev *pdev, struct pci_dn *parent)
> +{
> + struct pci_dn *pdn = NULL;
> +
> + pdn = add_one_dev_pci_data(parent, 0, pdev->bus->number, pdev->devfn);
> + dev_info(&pdev->dev, "Create a new pdn for devfn %2x\n", pdev->devfn / 
> 8);
> +
> + if (pdn)
> + {
> + u32 class_code;
> + u16 device_id;
> + u16 vendor_id;
> +
> + struct eeh_dev *edev = eeh_dev_init(pdn);
> + if (!edev) {
> + kfree(pdn);
> + dev_err(&pdev->dev, "%s:%d: Failed to allocate edev\n", 
> __func__, __LINE__);
> + return NULL;
> + }
> +
> + pdn->busno = pdev->bus->busn_res.start;
> +
> + pci_bus_read_config_word(pdev->bus, pdev->devfn, PCI_VENDOR_ID, 
> &vendor_id);
> + pdn->vendor_id = vendor_id;
> +
> + pci_bus_read_config_word(pdev->bus, pdev->devfn, PCI_DEVICE_ID, 
> &device_id);
> + pdn->device_id = device_id;
> +
> + pci_bus_read_config_dword(pdev->bus, pdev->devfn, 
> PCI_CLASS_REVISION, &class_code);
> + class_code >>= 8;
> + pdn->class_code = class_code;
> +
> + pdn->pci_ext_config_space = 0;
> + pdev->dev.archdata.pci_data = pdn;
> + } else {
> + dev_err(&pdev->dev, "%s:%d: Failed to allocate pdn\n", 
> __func__, __LINE__);
> + }
> +
> + return pdn;
> +}
>  
>  struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
>  {



Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-09-09 Thread Benjamin Herrenschmidt
On Fri, 2018-08-31 at 14:58 +1000, Benjamin Herrenschmidt wrote:
> 
> > A long shot, but something to consider, is that I failed to cover the
> > cases of dynamic devicetree updates (removing nodes that contain a
> > phandle) in ways other than overlays.  Michael Ellerman has reported
> > such a problem for powerpc/mobility with of_detach_node().  A patch to
> > fix that is one of the tasks I need to complete.
> 
> The only thing I can think of is booting via the BootX bootloader on
> those ancient macs results in a DT with no phandles. I didn't see an
> obvious reason why that would cause that patch to break though.

Guys, we still don't have a fix for this one on its way upstream...

My test patch just creates phandle properties for all nodes, that was
not intended as a fix, more a way to check if the problem was related
to the lack of phandles.

I don't actually know why the new code causes things to fail when
phandles are absent. This needs to be looked at.

I'm travelling at the moment and generally caught up with other things,
I haven't had a chance to dig, so just a heads up. I don't intend to
submit my patch since it's just a band aid. We need to figure out what
the actual problem is.

Cheers,
Ben.




Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-09-05 Thread Benjamin Herrenschmidt
On Fri, 2018-08-31 at 14:35 +1000, Benjamin Herrenschmidt wrote:
> 
> > If I force output with "-f", the resulting file has no occurrences 
> > of "phandle".
> 
> Are you booting with BootX or Open Firmware ?

Assuming you are using BootX (or miBoot), can you try this patch ?

--- a/arch/powerpc/platforms/powermac/bootx_init.c
+++ b/arch/powerpc/platforms/powermac/bootx_init.c
@@ -37,6 +37,7 @@ static unsigned long __initdata bootx_dt_strend;
 static unsigned long __initdata bootx_node_chosen;
 static boot_infos_t * __initdata bootx_info;
 static char __initdata bootx_disp_path[256];
+static int __initdata bootx_phandle;
 
 /* Is boot-info compatible ? */
 #define BOOT_INFO_IS_COMPATIBLE(bi) \
@@ -258,6 +259,8 @@ static void __init bootx_scan_dt_build_strings(unsigned 
long base,
namep = pp->name ? (char *)(base + pp->name) : NULL;
if (namep == NULL || strcmp(namep, "name") == 0)
goto next;
+   if (!strcmp(namep, "phandle") || !strcmp(namep, "linux,phandle"))
+   bootx_phandle = -1;
/* get/create string entry */
soff = bootx_dt_find_string(namep);
if (soff == 0)
@@ -330,6 +333,12 @@ static void __init bootx_scan_dt_build_struct(unsigned 
long base,
ppp = &pp->next;
}
 
+   /* add a phandle */
+   if (bootx_phandle > 0) {
+   bootx_dt_add_prop("phandle", &bootx_phandle, 4, mem_end);
+   bootx_phandle++;
+   }
+
if (node == bootx_node_chosen) {
bootx_add_chosen_props(base, mem_end);
if (bootx_info->dispDeviceRegEntryOffset == 0)
@@ -385,6 +394,8 @@ static unsigned long __init bootx_flatten_dt(unsigned long 
start)
bootx_dt_add_string("linux,bootx-height", _end);
bootx_dt_add_string("linux,bootx-linebytes", _end);
bootx_dt_add_string("linux,bootx-addr", _end);
+   if (bootx_phandle > 0)
+   bootx_dt_add_string("phandle", _end);
/* Wrap up strings */
hdr->off_dt_strings = bootx_dt_strbase - mem_start;
hdr->dt_strings_size = bootx_dt_strend - bootx_dt_strbase;
@@ -482,6 +493,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4)
bootx_dt_strbase = bootx_dt_strend = 0;
bootx_node_chosen = 0;
bootx_disp_path[0] = 0;
+   bootx_phandle = 1;
 
if (!BOOT_INFO_IS_V2_COMPATIBLE(bi))
bi->logicalDisplayBase = bi->dispDeviceBase;




Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-08-31 Thread Benjamin Herrenschmidt
On Sat, 2018-09-01 at 09:36 +1000, Finn Thain wrote:
> > The patched kernel (Finn's vmlinux-4.18.0-1-gd44cf7e41c19) boots 
> > normally on my Beige G3 Desktop using BootX.
> > 
> 
> Ben sent two patches, so I picked the most recent one and applied it by 
> hand due to corrupted whitespace.

Yup, evolution seems to have broken copy/pasting of patches in a
preformatted email lately ... ugh.

> The patch you tested was the one below.



Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-08-31 Thread Benjamin Herrenschmidt
On Fri, 2018-08-31 at 06:28 -0600, Mac User wrote:
> On 8/30/18 10:49 PM, Benjamin Herrenschmidt wrote:
> 
> > On Fri, 2018-08-31 at 14:35 +1000, Benjamin Herrenschmidt wrote:
> > 
> > ...
> > Assuming you are using BootX (or miBoot), can you try this patch ?
> 
> Yes, I'm using BootX.
> 
> Thanks to Finn for applying the patch (I wouldn't have been sure
> which source tree to apply it to).
> 
> The patched kernel (Finn's vmlinux-4.18.0-1-gd44cf7e41c19)
> boots normally on my Beige G3 Desktop using BootX.
> 
> Thanks for working on this!

Well, it's still a sign of something wrong with the new code; my patch
just band-aids over it by creating fake phandles. It would be interesting
to figure out what causes the new code to barf.

Cheers,
Ben.




Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-08-30 Thread Benjamin Herrenschmidt
On Thu, 2018-08-30 at 21:36 -0700, Frank Rowand wrote:
> > No idea whether that's relevant; I haven't done any further investigation. 
> > Complete dmesg output is attached. Please let me know if there's any more 
> > information you need to help find the bug.
> > 
> > Thanks.
> 
> I don't have any useful answers yet, but I am following the thread and have
> also quickly scanned the two commits for any obvious cause.  I will look
> into this some more, but have a few other tasks that I need to complete
> first.
> 
> A long shot, but something to consider, is that I failed to cover the
> cases of dynamic devicetree updates (removing nodes that contain a
> phandle) in ways other than overlays.  Michael Ellerman has reported
> such a problem for powerpc/mobility with of_detach_node().  A patch to
> fix that is one of the tasks I need to complete.

The only thing I can think of is that booting via the BootX bootloader on
those ancient macs results in a DT with no phandles. I didn't see an
obvious reason why that would cause that patch to break though.

Cheers,
Ben.



Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-08-30 Thread Benjamin Herrenschmidt
On Fri, 2018-08-31 at 14:35 +1000, Benjamin Herrenschmidt wrote:
> 
> > If I force output with "-f", the resulting file has no occurrences 
> > of "phandle".
> 
> Are you booting with BootX or Open Firmware ?

Assuming you are using BootX (or miBoot), can you try this patch ?

--- a/arch/powerpc/platforms/powermac/bootx_init.c
+++ b/arch/powerpc/platforms/powermac/bootx_init.c
@@ -37,6 +37,7 @@ static unsigned long __initdata bootx_dt_strend;
 static unsigned long __initdata bootx_node_chosen;
 static boot_infos_t * __initdata bootx_info;
 static char __initdata bootx_disp_path[256];
+static int __initdata bootx_phandle;
 
 /* Is boot-info compatible ? */
 #define BOOT_INFO_IS_COMPATIBLE(bi) \
@@ -258,6 +259,8 @@ static void __init bootx_scan_dt_build_strings(unsigned long base,
 		namep = pp->name ? (char *)(base + pp->name) : NULL;
 		if (namep == NULL || strcmp(namep, "name") == 0)
 			goto next;
+		if (!strcmp(namep, "phandle") || !strcmp(namep, "linux,phandle"))
+			bootx_phandle = -1;
 		/* get/create string entry */
 		soff = bootx_dt_find_string(namep);
 		if (soff == 0)
@@ -310,6 +313,7 @@ static void __init bootx_scan_dt_build_struct(unsigned long base,
 	*mem_end = _ALIGN_UP((unsigned long)lp + 1, 4);
 
 	/* get and store all properties */
+	has_phandle = false;
 	while (*ppp) {
 		struct bootx_dt_prop *pp =
 			(struct bootx_dt_prop *)(base + *ppp);
@@ -330,6 +334,12 @@ static void __init bootx_scan_dt_build_struct(unsigned long base,
 		ppp = &pp->next;
 	}
 
+	/* add a phandle */
+	if (bootx_phandle > 0) {
+		bootx_dt_add_prop("phandle", &bootx_phandle, 4, mem_end);
+		bootx_phandle++;
+	}
+
 	if (node == bootx_node_chosen) {
 		bootx_add_chosen_props(base, mem_end);
 		if (bootx_info->dispDeviceRegEntryOffset == 0)
@@ -385,6 +395,8 @@ static unsigned long __init bootx_flatten_dt(unsigned long start)
 	bootx_dt_add_string("linux,bootx-height", &mem_end);
 	bootx_dt_add_string("linux,bootx-linebytes", &mem_end);
 	bootx_dt_add_string("linux,bootx-addr", &mem_end);
+	if (bootx_phandle > 0)
+		bootx_dt_add_string("phandle", &mem_end);
 	/* Wrap up strings */
 	hdr->off_dt_strings = bootx_dt_strbase - mem_start;
 	hdr->dt_strings_size = bootx_dt_strend - bootx_dt_strbase;
@@ -482,6 +494,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4)
 	bootx_dt_strbase = bootx_dt_strend = 0;
 	bootx_node_chosen = 0;
 	bootx_disp_path[0] = 0;
+	bootx_phandle = 1;
 
 	if (!BOOT_INFO_IS_V2_COMPATIBLE(bi))
 		bi->logicalDisplayBase = bi->dispDeviceBase;



Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-08-30 Thread Benjamin Herrenschmidt
On Thu, 2018-08-30 at 20:39 -0600, Mac User wrote:
> On 8/29/18 7:05 PM, Rob Herring wrote:
> 
> > On Wed, Aug 29, 2018 at 7:44 PM Finn Thain  
> > wrote:
> > > Hi Frank,
> > > 
> > > Linux v4.17 and later will no longer boot on a G3 PowerMac. The boot hangs
> > > very early, before any video driver loads.
> > > 
> > > Stan and I were able to bisect the regression between v4.16 and v4.17 and
> > > arrived at commit 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of
> > > of_find_node_by_phandle()").
> > > 
> > > I don't see any obvious bug in 0b3ce78e90fc or b9952b5218ad. But if you
> > > revert these from v4.18 (which is also affected) that certainly resolves
> > > the issue.
> > 
> > Perhaps a bad assumption on phandle values causing a problem. Can you
> > provide a dump of all the phandle or linux,phandle values from
> > /proc/device-tree.
> > 
> > Rob
> 
> Rob,
> 
> As suggested by Finn, I installed device-tree-compiler and
> powerpc-ibm-utils.
>  
> Running "dtc -I fs -H both /sys/firmware/devicetree/base"
> resulted in the following errors:
> 
> DTC: fs->dts on file "/sys/firmware/devicetree/base"
> ERROR (name_properties): "name" property in
> /pci/multifunc-device/pci1799,1#1 is incorrect ("pci1799,1" instead of
> base node name)
> ERROR (name_properties): "name" property in /pci/mac-io/ide#1 is
> incorrect ("ide" instead of base node name)
> ERROR (name_properties): "name" property in
> /pci/mac-io/ide#1/atapi-disk#1 is incorrect ("atapi-disk" instead of
> base node name)
> ERROR (name_properties): "name" property in /cpus/PowerPC,750/l2-cache#1
> is incorrect ("l2-cache" instead of base node name)
> ERROR: Input tree has errors, aborting (use -f to force output)
> 
> If I force output with "-f", the resulting file has no occurrences 
> of "phandle".

Are you booting with BootX or Open Firmware ?

> Running "lsprop /proc/device-tree | grep -i phandle" results in no 
> output.
> 
> Please let me know if there's some other way to get information that 
> would be helpful.
> 
> thanks
> 
> -Stan



Re: [PATCH 2/2] powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidation

2018-08-27 Thread Benjamin Herrenschmidt
On Mon, 2018-08-27 at 13:03 +1000, Nicholas Piggin wrote:
> Local radix TLB flush operations that operate on congruence classes
> have explicit ERAT flushes for POWER9. The process scoped LPID flush
> did not have a flush, so add it.

Paul, is that an actual bug ? I think the ERAT is flushed on LPID
changes...

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/mm/tlb-radix.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
> index fef3e1eb3a19..4e798f33c530 100644
> --- a/arch/powerpc/mm/tlb-radix.c
> +++ b/arch/powerpc/mm/tlb-radix.c
> @@ -366,6 +366,7 @@ static inline void _tlbiel_lpid_guest(unsigned long lpid, unsigned long ric)
>   __tlbiel_lpid_guest(lpid, set, RIC_FLUSH_TLB);
>  
>   asm volatile("ptesync": : :"memory");
> + asm volatile(PPC_INVALIDATE_ERAT : : :"memory");
>  }
>  
>  



Re: DT case sensitivity

2018-08-23 Thread Benjamin Herrenschmidt
On Thu, 2018-08-23 at 07:19 -0500, Segher Boessenkool wrote:
> If one implementation does case insensitive, it will most likely just work,
> because people do not make insane names differing only in case on purpose.

Apple did :-)

ide, vs IDE, ata vs ATA, I've seen all sort of crap there, esp. on old
machines.

> Now people write other things that they only test against that implementation,
> and those things now only work with case-insensitive.  And you do not know
> without testing if anything breaks.  (But the laws of big numbers are against
> you here).  Since there isn't really a drawback to doing case-insensitive
> always, that is a much safer way forward, much less work for everyone, even
> if technically the wrong thing to do :-)
> 
> 
> Segher



Re: [PATCH 2/2] powerpc/mm/hash: Only need the Nest MMU workaround for R -> RW transition

2018-08-23 Thread Benjamin Herrenschmidt
On Thu, 2018-08-23 at 19:23 +1000, Nicholas Piggin wrote:
> On Wed, 22 Aug 2018 22:46:05 +0530
> "Aneesh Kumar K.V"  wrote:
> 
> > The Nest MMU workaround is only needed for RW upgrades. Avoid doing that
> > for other pte updates.
> > 
> > We also avoid clearing the pte while marking it invalid. This is because
> > other page table walks would find this pte none, which can result in
> > unexpected behaviour. Instead we clear _PAGE_PRESENT and set the software
> > pte bit _PAGE_INVALID. pte_present is already updated to check for both
> > the bits. This makes sure page table walkers will find the pte present
> > and things like pte_pfn(pte) return the right value.
> > 
> > Based on the original patch from Benjamin Herrenschmidt 
> > 
> > 
> > Signed-off-by: Aneesh Kumar K.V 
> > ---
> >  arch/powerpc/mm/pgtable-radix.c | 8 +---
> 
> This is powerpc/mm/radix, isn't it? Subject says hash.
> 
> Could we make this fix POWER9 only and use a RSV bit for it
> rather than use up a SW bit? Other than that,

Well, the SW bit is necessary for THP as well isn't it ?
> 
> Reviewed-by: Nicholas Piggin 
> 
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> > index 7be99fd9af15..c879979faa73 100644
> > --- a/arch/powerpc/mm/pgtable-radix.c
> > +++ b/arch/powerpc/mm/pgtable-radix.c
> > @@ -1045,20 +1045,22 @@ void radix__ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED |
> >  					      _PAGE_RW | _PAGE_EXEC);
> > +
> > +	unsigned long change = pte_val(entry) ^ pte_val(*ptep);
> >  	/*
> >  	 * To avoid NMMU hang while relaxing access, we need mark
> >  	 * the pte invalid in between.
> >  	 */
> > -	if (atomic_read(&mm->context.copros) > 0) {
> > +	if ((change & _PAGE_RW) && atomic_read(&mm->context.copros) > 0) {
> >  		unsigned long old_pte, new_pte;
> >  
> > -		old_pte = __radix_pte_update(ptep, ~0, 0);
> > +		old_pte = __radix_pte_update(ptep, _PAGE_PRESENT, _PAGE_INVALID);
> >  		/*
> >  		 * new value of pte
> >  		 */
> >  		new_pte = old_pte | set;
> >  		radix__flush_tlb_page_psize(mm, address, psize);
> > -		__radix_pte_update(ptep, 0, new_pte);
> > +		__radix_pte_update(ptep, _PAGE_INVALID, new_pte);
> >  	} else {
> >  		__radix_pte_update(ptep, 0, set);
> >  		/*



Re: DT case sensitivity

2018-08-23 Thread Benjamin Herrenschmidt
On Thu, 2018-08-23 at 06:43 -0500, Rob Herring wrote:
> On Thu, Aug 23, 2018 at 4:02 AM Grant Likely  wrote:
> > 
> > 
> > What problem are you trying to solve?
> 
> I'm looking at removing device_node.name and using full_name instead
> (which now is only the local node name plus unit-address). This means
> replacing of_node_cmp() (and still some strcmp) calls in a lot of
> places. I need to use either strncmp or strncasecmp instead.
> 
> > I would think making everything
> > case insensitive would be the direction to go if you do anything. Least
> > possibility of breaking existing platforms in that scenario.
> 
> Really? Even if all the "new" arches are effectively case sensitive?
> Anything using dtc and libfdt are (and json-schema certainly will be).
> But I frequently say the kernel's job is not DT validation, so you
> pass crap in, you get undefined results.

I tend to agree with Grant. Let's put it this way:

What is the drawback of being case insensitive ?

Do we expect that there exists a case where we will want to distinguish
between nodes that have the same name with a different case ?

If not, I don't see the point of being strict about it.
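
To be clear, by insensitive I mean something like this for node name
matching (a hypothetical helper using the kernel string API, not the
actual of_node_cmp() implementation):

/* Compare the base node name ("IDE@1" vs "ide") ignoring case,
 * stopping at the unit address. Sketch only. */
static bool node_name_ieq(const char *full_name, const char *name)
{
	const char *slash = strrchr(full_name, '/');
	const char *base = slash ? slash + 1 : full_name;
	size_t len = strcspn(base, "@");

	return strncasecmp(base, name, len) == 0 && name[len] == '\0';
}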

Cheers,
Ben.




Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Thu, 2018-08-23 at 07:24 +0200, Christoph Hellwig wrote:
> > Well, iommus can have bypass regions, which we also use for
> > performance, so we do at dma_set_mask() time "swap" the ops around, and
> > in that case, we do want to check the mask against the actual top of
> > memory...
> 
> That is a bit of a powerpc special case (we also had one other arch
> doing that, but it got removed in the great purge, can't rember which
> one right now).  Everyone else has one set of ops, and they just switch
> to the direct mapping inside the iommu ops.

We more or less do that too in some of ours these days because of the whole
coherent_mask vs mask where a given device might need either depending
on the type of mapping.
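
Roughly like this, with an invented function name (a sketch of the
idea, not our actual implementation):

static int ppc_dma_set_mask_sketch(struct device *dev, u64 dma_mask)
{
	/* If the mask covers all of RAM we can use the direct ops
	 * through the iommu bypass window, otherwise translate. */
	if (dma_mask >= phys_to_dma(dev, memblock_end_of_DRAM() - 1))
		set_dma_ops(dev, &dma_nommu_ops);
	else
		set_dma_ops(dev, &dma_iommu_ops);
	*dev->dma_mask = dma_mask;
	return 0;
}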

Ben.



Re: DT case sensitivity

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 20:26 -0500, Rob Herring wrote:
> On Wed, Aug 22, 2018 at 8:14 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > On Wed, 2018-08-22 at 19:47 -0500, Rob Herring wrote:
> > > The default DT string handling in the kernel is node names and
> > > compatibles are case insensitive and property names are case sensitive
> > > (Sparc is the the only variation and is opposite). It seems only PPC
> > > (and perhaps only Power Macs?) needs to support case insensitive
> > > comparisons. It was probably a mistake to follow PPC for new arches
> > > and we should have made everything case sensitive from the start. So I
> > > have a few questions for the DT historians. :)
> > 
> > Open Firmware itself is insensitive.
> 
> Doesn't it depend on the implementation? Otherwise, how is Sparc different?

Not sure ... Forth itself is insensitive for words but maybe not for
string comparisons.

> 
> > > What PPC systems are case insensitive? Can we limit that to certain 
> > > systems?
> > 
> > All PowerMacs at least, the problem is that I don't have DT images or
> > access to all the historical systems (and yes some people occasionally
> > still use them) to properly test a change in that area.
> 
> I'm tempted to break them so I can find folks to provide me with DT dumps. :)

I have a collection of DT dumps but I'm not sure about the legality of
publishing them...

Cheers,
Ben.




Re: DT case sensitivity

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 19:47 -0500, Rob Herring wrote:
> The default DT string handling in the kernel is node names and
> compatibles are case insensitive and property names are case sensitive
> (Sparc is the the only variation and is opposite). It seems only PPC
> (and perhaps only Power Macs?) needs to support case insensitive
> comparisons. It was probably a mistake to follow PPC for new arches
> and we should have made everything case sensitive from the start. So I
> have a few questions for the DT historians. :)

Open Firmware itself is insensitive.

> What PPC systems are case insensitive? Can we limit that to certain systems?

All PowerMacs at least, the problem is that I don't have DT images or
access to all the historical systems (and yes some people occasionally
still use them) to properly test a change in that area.

> AFAICT, dtc at least (if not anything FDT based) has always been case
> sensitive at least for node and property names. I'm not sure about
> compatible strings?
> 
> Anyone see potential issues with switching all platforms except PPC
> and Sparc to case sensitive comparisons?

Cheers,
Ben.



Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:58 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:54:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > We need to take the DMA offset and encryption bit into account when 
> > > selecting
> > > a zone.  Add a helper that takes those into account and use it.
> > 
> > That whole "encryption" stuff seems to be completely specific to the
> > way x86 does memory encryption, or am I mistaken ? It's not clear to me
> > what that does in practice and how it relates to DMA mappings.
> 
> Not even all of x86, but AMD in particular; Intel does it yet another
> way.  But it still is easier to take this into the core with a few
> overrides than duplicating all the code.
> 
> > I'm also not sure about that whole business with ZONE_DMA and
> > ARCH_ZONE_DMA_BITS...
> 
> ZONE_DMA usually (but not always) maps to 24 bits of address space;
> if it doesn't (I'm mostly thinking of s390 with its odd 31 bits),
> the architecture can override it if it cares.
> 
> > On ppc64, unless you enable swiotlb (which we only do currently on
> > some embedded platforms), you have all of memory in ZONE_DMA.
> > 
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x-0x001f]
> > [0.00]   DMA32empty
> > [0.00]   Normal   empty
> > [0.00]   Device   empty
> 
> This is really weird.  Why would you wire up ZONE_DMA like this?

We always did :-) It predates my involvement and I think it predates
even Paul's. It's quite silly actually since the first powerpc machines
actually had ISA devices in them, but that's how it's been forever. I
suppose we could change it but that would mean digging out some old
stuff to test.

> The general scheme that architectures should implement is:
> 
> ZONE_DMA: Any memory below a magic threshold that is lower than
>   32-bit.  Only enabled if actually required (usually
>   either 24-bit for ISA, or some other weird architecture
>   specific value like 32-bit for S/390)

It should have been ZONE_ISA_DMA :-)

> ZONE_DMA32:   Memory <= 32-bit if the architecture supports more than
>   32-bits worth of physical address space.  Should generally
>   be enabled on all 64-bit architectures unless you have
>   a very good reason not to.

Yeah so we sort-of enable the config option but only populate the zone
on platforms using swiotlb (freescale stuff). It's a bit messy at the
moment I must admit.

> ZONE_NORMAL:  Everything above 32-bit not falling into HIGHMEM or
>   MOVEABLE.

Cheers,
Ben.




Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:53 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:44:18AM +1000, Benjamin Herrenschmidt wrote:
> > We do have the occasional device with things like 31-bit DMA
> > limitation. We know they happens to work because those systems
> > can't have enough memory to be a problem. This is why our current
> > DMA direct ops in powerpc just unconditionally return true on ppc32.
> > 
> > The test against a full 32-bit mask here will break them I think.
> > 
> > Thing is, I'm not sure I still have access to one of these things
> > to test, I'll have to dig (from memory things like b43 wifi).
> 
> Yeah, the other platforms that support these devices support ZONE_DMA
> to reliably handle these devices. But there is two other ways the
> current code would actually handle these fine despite the dma_direct
> checks:
> 
>  1) if the device only has physical addresses up to 31-bit anyway
>  2) by trying again to find a lower address.  But this only works
> for coherent allocations and not streaming maps (unless we have
> swiotlb with a buffer below 31-bits).
> 
> It seems powerpc can have ZONE_DMA, though and we will cover these
> devices just fine.  If it didn't have that the current powerpc
> code would not work either.

Not exactly. powerpc has ZONE_DMA covering all of system memory.

What happens in ppc32 is that we somewhat "know" that none of the
systems with those stupid 31-bit limited pieces of HW is capable of
having more than 2GB of memory anyway.

So we get away with just returning "1".
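
Paraphrasing (not quoting) what our dma_supported() hook of that era
boils down to:

static int dma_nommu_dma_supported(struct device *dev, u64 mask)
{
#ifdef CONFIG_PPC64
	/* Does the mask reach the top of RAM (plus the DMA offset) ? */
	return mask >= get_dma_offset(dev) + (memblock_end_of_DRAM() - 1);
#else
	/* 31/32-bit limited devices only show up in machines that
	 * can't have more than 2GB of memory anyway */
	return 1;
#endif
}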

> 
> >  - What is this trying to achieve ?
> > 
> > /*
> >  * Various PCI/PCIe bridges have broken support for > 32bit DMA even
> >  * if the device itself might support it.
> >  */
> > if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
> > return 0;
> > 
> > IE, if the device has a 32-bit limit, we fail an attempt at checking
> > if a >32-bit mask works ? That doesn't quite seem to be the right thing
> > to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?
> > 
> > IE, dma_set_mask() is what a driver uses to establish the device capability,
> > so it makes sense to have dma_32bit_limit just reduce that capability, not
> > fail because the device can do more than what the bridge can 
> 
> If your PCI bridge / PCIe root port doesn't support dma to addresses
> larger than 32-bit the device capabilities above that don't matter, it
> just won't work.  We have this case at least for some old VIA x86 chipsets
> and some relatively modern Xilinx FPGAs with PCIe.

Hrm... that's the usual confusion dma_capable() vs. dma_set_mask().

It's always been perfectly fine for a driver to do a dma_set_mask(64-
bit) on a system where the bridge can only do 32-bits ...

We shouldn't fail there, we should instead "clamp" the mask to 32-bit,
see what I mean ? It doesn't matter that the device itself is capable
of issuing >32-bit addresses, I agree, but what we need to express is that
the combination device+bridge doesn't want addresses above 32-bit, so
it's equivalent to making the device do a set_mask(32-bit).

This will succeed if the system can limit the addresses (for example
because memory is never above 32-bit) and will fail if the system
can't.

So that's equivalent to writing:

if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
mask = phys_to_dma(dev, DMA_BIT_MASK(32)); 

Effectively meaning "don't give me addresses above 32-bit".

Still, your code doesn't check the mask against the memory size. Which
means it will fail for 32-bit masks even on systems that do not have
memory above 4G.
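
To spell out the semantics I'm arguing for (hypothetical sketch, not a
patch):

int dma_set_mask_sketch(struct device *dev, u64 mask)
{
	/* Clamp rather than fail when the bridge is the limiting factor */
	if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
		mask = phys_to_dma(dev, DMA_BIT_MASK(32));

	/* Then test the clamped mask against the top of memory */
	if (mask < phys_to_dma(dev, memblock_end_of_DRAM() - 1))
		return -EIO;	/* would need an iommu or bounce buffering */

	*dev->dma_mask = mask;
	return 0;
}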

> >  - How is that file supposed to work on 64-bit platforms ? From what I can
> > tell, dma_supported() will unconditionally return true if the mask is
> > 32-bit or larger (appart from the above issue). This doesn't look right,
> > the mask needs to be compared to the max memory address. There are a bunch
> > of devices out there with masks anywhere bettween 40 and 64 bits, and
> > some of these will not work "out of the box" if the offseted top
> > of memory is beyond the mask limit. Or am I missing something ?
> 
> Your are not missing anything except for the history of this code.
> 
> Your observation is right, but there always has been the implicit
> assumption that architectures with more than 4GB of physical address
> space must either support and iommu or swiotlb and use that.  It's
> never been document anywhere, but I'm working on integrating all
> this code to make more sense.

Well, iommus can have bypass regions, which we also use for
performance, so we do at dma_set_mask() time "swap" the ops around, and
in that case, we do want to check the mask against the actual top of
memory...

Cheers,
Ben.




Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:45 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:01:16AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> > > It turns out cxl actually uses it.  So for now skip this patch,
> > > although random code in drivers messing with dma ops will need to
> > > be sorted out sooner or later.
> > 
> > CXL devices are "special", they bypass the classic iommu in favor of
> > allowing the device to operate using the main processor page tables
> > using an MMU context (so basically the device can use userspace
> > addresses directly), akin to ATS.
> > 
> > I think the code currently uses the nommu ops as a way to do a simple
> > kernel mapping for kernel drivers using CXL (not userspace stuff)
> > though.
> 
> Its still a horrible idea to have this in drivers/, we need some
> core API to mediate this behavior.  Also if the device supports
> using virtual addresses dma_nommu_ops seems wrong as it won't do
> the right thing for e.g. vmalloc addresses not mapped into the
> kernel linear mapping (which I guess can't currently happen on
> powerpc, but still..)

You are right, it won't do the right thing, but neither will standard
DMA ops, will they ? Drivers know not to try to dma_map vmalloc
addresses without first getting the underlying page, nothing unusual
there.

Yes I agree having this in drivers somewhat sucks though.

Cheers,
Ben.



Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 09:02 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:27:46AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > The requirement to disable local irqs over kmap_atomic is long gone,
> > > so remove those calls.
> > 
> > Really ? I'm trying to verify that and getting lost in a mess of macros
> > from hell in the per-cpu stuff but if you look at our implementation
> > of kmap_atomic_prot(), all it does is a preempt_disable(), and then
> > it uses kmap_atomic_idx_push():
> > 
> > int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
> > 
> > Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
> > ie this is the non-interrupt safe version...
> 
Looks like the powerpc variant indeed isn't safe.
> 
> I did look a bit more through the code and history, and it seems
> like we removed the need to disable irqs when called from process
> context a while ago, but we still require disabling irqs when called
> from irq context.  Given that this code can also be called from
> irq context we'll have to keep the local_irq_save.

This is the same with x86 no ?

32-bit x86 kmap_atomic_prot is the same as ours...

In fact I wonder why the preempt_disable() is in there, since it needs to
be protected against interrupts ?

Or is it that we never actually call kmap_atomic_* these days from
interrupt, and the atomic versions are just about dealing with
spinlocks ?
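
For reference, the pattern in question in the noncoherent sync path is
something like this (a sketch, assuming the caller may run in irq
context; page/offset/size/direction come from the mapping request):

	unsigned long flags;
	void *vaddr;

	local_irq_save(flags);	/* kmap_atomic's __this_cpu_* isn't irq-safe */
	vaddr = kmap_atomic(page);
	__dma_sync(vaddr + offset, size, direction);
	kunmap_atomic(vaddr);
	local_irq_restore(flags);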

Cheers,
Ben.
 



Re: [PATCH] powerpc/64s/hash: convert SLB miss handlers to C

2018-08-21 Thread Benjamin Herrenschmidt
On Tue, 2018-08-21 at 15:13 +1000, Nicholas Piggin wrote:
> This patch moves SLB miss handlers completely to C, using the standard
> exception handler macros to set up the stack and branch to C.
> 
> This can be done because the segment containing the kernel stack is
> always bolted, so accessing it with relocation on will not cause an
> SLB exception.
> 
> Arbitrary kernel memory may not be accessed when handling kernel space
> SLB misses, so care should be taken there. However user SLB misses can
> access any kernel memory, which can be used to move some fields out of
> the paca (in later patches).
> 
> User SLB misses could quite easily reconcile IRQs and set up a first
> class kernel environment and exit via ret_from_except, however that
> doesn't seem to be necessary at the moment, so we only do that if a
> bad fault is encountered.
> 
> [ Credit to Aneesh for bug fixes, error checks, and improvements to bad
>   address handling, etc ]
> 
> Signed-off-by: Nicholas Piggin 
> 
> Since RFC:
> - Send patch 1 by itself to focus on the big change.
> - Added MSR[RI] handling
> - Fixed up a register loss bug exposed by irq tracing (Aneesh)
> - Reject misses outside the defined kernel regions (Aneesh)
> - Added several more sanity checks and error handling (Aneesh), we may
>   look at consolidating these tests and tightening up the code but for
>   a first pass we decided it's better to check carefully.
> ---
>  arch/powerpc/include/asm/asm-prototypes.h |   2 +
>  arch/powerpc/kernel/exceptions-64s.S  | 202 +++--
>  arch/powerpc/mm/Makefile  |   2 +-
>  arch/powerpc/mm/slb.c | 257 +
>  arch/powerpc/mm/slb_low.S | 335 --
>  5 files changed, 185 insertions(+), 613 deletions(-)
^^^^^^

Nice ! :-)

Cheers,
Ben.



Re: [PATCH] powerpc/powernv: Don't select the cpufreq governors

2018-08-20 Thread Benjamin Herrenschmidt
On Tue, 2018-08-21 at 11:44 +0930, Joel Stanley wrote:
> Deciding which governors should be built into the kernel can be left to
> users to configure.
> 
> Fixes: 81f359027a3a ("cpufreq: powernv: Select CPUFreq related Kconfig 
> options for powernv")
> Signed-off-by: Joel Stanley 

Can you add them to the defconfigs then ?
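
Presumably something along these lines in the affected defconfigs,
mirroring the selects being removed below:

CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y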

> ---
>  arch/powerpc/platforms/powernv/Kconfig | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
> index f8dc98d3dc01..028ac941c05c 100644
> --- a/arch/powerpc/platforms/powernv/Kconfig
> +++ b/arch/powerpc/platforms/powernv/Kconfig
> @@ -15,11 +15,6 @@ config PPC_POWERNV
>   select PPC_SCOM
>   select ARCH_RANDOM
>   select CPU_FREQ
> - select CPU_FREQ_GOV_PERFORMANCE
> - select CPU_FREQ_GOV_POWERSAVE
> - select CPU_FREQ_GOV_USERSPACE
> - select CPU_FREQ_GOV_ONDEMAND
> - select CPU_FREQ_GOV_CONSERVATIVE
>   select PPC_DOORBELL
>   select MMU_NOTIFIER
>   select FORCE_SMP



[PATCH v2] powerpc/powernv/pci: Work around races in PCI bridge enabling

2018-08-17 Thread Benjamin Herrenschmidt
The generic code is racy when multiple children of a PCI bridge try to
enable it simultaneously.

This leads to drivers trying to access a device through a not-yet-enabled
bridge, and to EEH errors under various circumstances when using parallel
driver probing.

There is work going on to fix that properly in the PCI core but it will
take some time.

x86 gets away with it because (outside of hotplug), the BIOS enables all
the bridges at boot time.

This patch does the same thing on powernv by enabling all bridges that
have child devices at boot time, thus avoiding subsequent races. It's suitable
for backporting to stable and distros, while the proper PCI fix will probably
be significantly more invasive.

Signed-off-by: Benjamin Herrenschmidt 
CC: sta...@vger.kernel.org
---

v2. Fix unused variables warning.

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5bd0eb6681bc..ca720d530c6d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3367,12 +3367,49 @@ static void pnv_pci_ioda_create_dbgfs(void)
 #endif /* CONFIG_DEBUG_FS */
 }
 
+static void pnv_pci_enable_bridge(struct pci_bus *bus)
+{
+	struct pci_dev *dev = bus->self;
+	struct pci_bus *child;
+
+	/* Empty bus ? bail */
+	if (list_empty(&bus->devices))
+		return;
+
+	/*
+	 * If there's a bridge associated with that bus enable it. This works
+	 * around races in the generic code if the enabling is done during
+	 * parallel probing. This can be removed once those races have been
+	 * fixed.
+	 */
+	if (dev) {
+		int rc = pci_enable_device(dev);
+		if (rc)
+			pci_err(dev, "Error enabling bridge (%d)\n", rc);
+		pci_set_master(dev);
+	}
+
+	/* Perform the same to child busses */
+	list_for_each_entry(child, &bus->children, node)
+		pnv_pci_enable_bridge(child);
+}
+
+static void pnv_pci_enable_bridges(void)
+{
+	struct pci_controller *hose;
+
+	list_for_each_entry(hose, &hose_list, list_node)
+		pnv_pci_enable_bridge(hose->bus);
+}
+
 static void pnv_pci_ioda_fixup(void)
 {
 	pnv_pci_ioda_setup_PEs();
 	pnv_pci_ioda_setup_iommu_api();
 	pnv_pci_ioda_create_dbgfs();
 
+	pnv_pci_enable_bridges();
+
 #ifdef CONFIG_EEH
 	pnv_eeh_post_init();
 #endif



[PATCH] powerpc/powernv/pci: Work around races in PCI bridge enabling

2018-08-16 Thread Benjamin Herrenschmidt
The generic code is racy when multiple children of a PCI bridge try to
enable it simultaneously.

This leads to drivers trying to access a device through a not-yet-enabled
bridge, and to EEH errors under various circumstances when using parallel
driver probing.

There is work going on to fix that properly in the PCI core but it will
take some time.

x86 gets away with it because (outside of hotplug), the BIOS enables all
the bridges at boot time.

This patch does the same thing on powernv by enabling all bridges that
have child devices at boot time, thus avoiding subsequent races. It's suitable
for backporting to stable and distros, while the proper PCI fix will probably
be significantly more invasive.

Signed-off-by: Benjamin Herrenschmidt 
CC: sta...@vger.kernel.org
---

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 70b2e1e0f23c..0975b0aaf210 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3368,12 +3368,52 @@ static void pnv_pci_ioda_create_dbgfs(void)
 #endif /* CONFIG_DEBUG_FS */
 }
 
+static void pnv_pci_enable_bridge(struct pci_bus *bus)
+{
+	struct pci_dev *dev = bus->self;
+	struct pci_bus *child;
+
+	/* Empty bus ? bail */
+	if (list_empty(&bus->devices))
+		return;
+
+	/*
+	 * If there's a bridge associated with that bus enable it. This works
+	 * around races in the generic code if the enabling is done during
+	 * parallel probing. This can be removed once those races have been
+	 * fixed.
+	 */
+	if (dev) {
+		int rc = pci_enable_device(dev);
+		if (rc)
+			pci_err(dev, "Error enabling bridge (%d)\n", rc);
+		pci_set_master(dev);
+	}
+
+	/* Perform the same to child busses */
+	list_for_each_entry(child, &bus->children, node)
+		pnv_pci_enable_bridge(child);
+}
+
+static void pnv_pci_enable_bridges(void)
+{
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+	struct pci_bus *bus;
+	struct pci_dev *pdev;
+
+	list_for_each_entry(hose, &hose_list, list_node)
+		pnv_pci_enable_bridge(hose->bus);
+}
+
 static void pnv_pci_ioda_fixup(void)
 {
 	pnv_pci_ioda_setup_PEs();
 	pnv_pci_ioda_setup_iommu_api();
 	pnv_pci_ioda_create_dbgfs();
 
+	pnv_pci_enable_bridges();
+
 #ifdef CONFIG_EEH
 	pnv_eeh_post_init();
 #endif



Re: [PATCH v6 1/2] powerpc: Detect the presence of big-cores via "ibm,thread-groups"

2018-08-13 Thread Benjamin Herrenschmidt
On Mon, 2018-08-13 at 17:06 +0530, Gautham R Shenoy wrote:
> Hi Srikar,
> 
> Thanks for reviewing the patch.
> 
> On Thu, Aug 09, 2018 at 06:27:43AM -0700, Srikar Dronamraju wrote:
> > * Gautham R. Shenoy  [2018-08-09 11:02:07]:
> > 
> > > 
> > >  int threads_per_core, threads_per_subcore, threads_shift;
> > > +bool has_big_cores;
> > >  cpumask_t threads_core_mask;
> > >  EXPORT_SYMBOL_GPL(threads_per_core);
> > >  EXPORT_SYMBOL_GPL(threads_per_subcore);
> > >  EXPORT_SYMBOL_GPL(threads_shift);
> > > +EXPORT_SYMBOL_GPL(has_big_cores);
> > 
> > Why do we need EXPORT_SYMBOL_GPL?
> 
> As Christoph pointed out, I was blindly following suit.
> 
> You are right, we don't need to export it at the moment. The remaining
> EXPORT_SYMBOL_GPL are required by KVM. However, as of now, there is no
> need for "has_big_cores" in the KVM.

There is actually. KVM needs to refuse to start on big cores, at least
HV KVM. And when KVM grows support for big cores (may or may not
happen), it will need to know. So keep the GPL export.

> Will remove this in the next version.
> > 
> > >  EXPORT_SYMBOL_GPL(threads_core_mask);
> > > 
> > > + *
> > > + * Returns 0 on success, -EINVAL if the property does not exist,
> > > + * -ENODATA if property does not have a value, and -EOVERFLOW if the
> > > + * property data isn't large enough.
> > > + */
> > > +int parse_thread_groups(struct device_node *dn,
> > > + struct thread_groups *tg)
> > > +{
> > > + unsigned int nr_groups, threads_per_group, property;
> > > + int i;
> > > + u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
> > > + u32 *thread_list;
> > > + size_t total_threads;
> > > + int ret;
> > > +
> > > + ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> > > +  thread_group_array, 3);
> > > +
> > > + if (ret)
> > > + goto out_err;
> > > +
> > > + property = thread_group_array[0];
> > > + nr_groups = thread_group_array[1];
> > > + threads_per_group = thread_group_array[2];
> > > + total_threads = nr_groups * threads_per_group;
> 
> 
> > > +
> > 
> > Shouldnt we check for property and nr_groups
> > If the property is not 1 and nr_groups < 1, we should error out
> > No point in calling a of_property read if property is not right.
> 
> Yes, the nr_groups < 1 check can be moved into this function.
> 
> However, this function merely parses the thread group structure
> exposed by the device tree. So it should error out only if there is a
> failure in parsing, or as you said the parsed values are incorrect
> (nr_groups < 1, threads_per_group < 1, etc). Whether the thread group
> is relevant or not (in this case we are interested in thread groups
> that share L1 cache, translation etc) is something for the caller to
> decide.
> 
> However, I see what you mean. We can avoid parsing the remainder of
> the array if the property in the device-tree isn't the property that
> the caller is interested in.
> 
> This can be solved by passing the interested property value as a
> parameter and so that the code errors out if this property doesn't
> match the property in the device-tree. Will add this in the next
> version.
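
For reference, the layout being parsed here is along these lines
(values invented for illustration):

	/* [0] = property (1 = threads sharing L1, translation, ...),
	 * [1] = nr_groups, [2] = threads_per_group, followed by the
	 * flattened per-group thread lists */
	ibm,thread-groups = <1 2 4
	                     0 2 4 6
	                     1 3 5 7>;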
> 
> > 
> > 
> > Nit: 
> > Cant we directly assign to tg->property, and hence avoid local
> > variables, property, nr_groups and threads_per_group?
> 
> Will clean this up. This was from an older version where I added the
> local variables so that the statements referencing them don't need to
> be split across multiple lines. However, the code has been optimized
> since then. So, the local variables are not needed.
> 
> > 
> > > + ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> > > +  thread_group_array,
> > > +  3 + total_threads);
> > > +
> > > +static inline bool dt_has_big_core(struct device_node *dn,
> > > +struct thread_groups *tg)
> > > +{
> > > + if (parse_thread_groups(dn, tg))
> > > + return false;
> > > +
> > > + if (tg->property != 1)
> > > + return false;
> > > +
> > > + if (tg->nr_groups < 1)
> > > + return false;
> > 
> > Can avoid these check if we can check in parse_thread_groups.
> > 
> > >  /**
> > >   * setup_cpu_maps - initialize the following cpu maps:
> > >   *  cpu_possible_mask
> > > @@ -457,6 +605,7 @@ void __init smp_setup_cpu_maps(void)
> > >   int cpu = 0;
> > >   int nthreads = 1;
> > > 
> > > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> > > index 755dc98..f5717de 100644
> > > --- a/arch/powerpc/kernel/sysfs.c
> > > +++ b/arch/powerpc/kernel/sysfs.c
> > > @@ -18,6 +18,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > 
> > >  #include "cacheinfo.h"
> > >  #include "setup.h"
> > > @@ -1025,6 +1026,33 @@ static ssize_t show_physical_id(struct device *dev,
> > >  }
> > >  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
> > > 
> > > +static ssize_t 

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Benjamin Herrenschmidt
On Thu, 2018-08-09 at 08:13 +1000, Benjamin Herrenschmidt wrote:
> > For completeness, virtio could also have its own bounce buffer
> > outside of DMA API one. I don't see lots of benefits to this
> > though.
> 
> Not fan of that either...

To elaborate a bit ...

For our secure VMs, we will need bounce buffering for everything
anyway: virtio, emulated PCI, or vfio.

By ensuring that we create an identity mapping in the IOMMU for
the bounce buffering pool, we enable virtio "legacy/direct" to
use the same mapping ops as things using the iommu.

That said, we still need somewhere in arch/powerpc a set of dma
ops which we'll attach to all PCI devices of a secure VM to force
bouncing always, rather than just based on address (which is what
the standard swiotlb ones do)... Unless we can tweak the swiotlb
"threshold" for example by using an empty mask.

We'll need the same set of DMA ops for VIO devices too, not just PCI.
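
Sketching the idea (the bounce helper below is a hypothetical name,
the point is the unconditional test):

static dma_addr_t secure_dma_map_page(struct device *dev, struct page *page,
				      unsigned long offset, size_t size,
				      enum dma_data_direction dir,
				      unsigned long attrs)
{
	phys_addr_t phys = page_to_phys(page) + offset;

	/* Unlike swiotlb there is no "is phys above the limit ?"
	 * check: for a secure VM we bounce unconditionally.
	 * secure_bounce_map() stands in for the swiotlb pool code. */
	return secure_bounce_map(dev, phys, size, dir);
}

static const struct dma_map_ops secure_dma_ops = {
	.map_page	= secure_dma_map_page,
	/* .unmap_page, .map_sg, ... would mirror the swiotlb ops */
};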

Cheers,
Ben.




Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Thu, 2018-08-09 at 10:54 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > These are identical to the arch specific ones, so remove them.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Acked-by: Benjamin Herrenschmidt 

Note: We will still need to implement some custom variant of this
for our secure VMs ... 

Basically we'll need to use the existing bounce buffering as-is but
the condition will be different, it won't be whether the address is
below a certain limit, it will be *always*.

Cheers,
Ben.

> > ---
> >  arch/powerpc/include/asm/dma-direct.h |  4 
> >  arch/powerpc/include/asm/swiotlb.h|  2 --
> >  arch/powerpc/kernel/dma-swiotlb.c | 28 ++-
> >  arch/powerpc/sysdev/fsl_pci.c |  2 +-
> >  4 files changed, 7 insertions(+), 29 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/dma-direct.h 
> > b/arch/powerpc/include/asm/dma-direct.h
> > index 0fba19445ae8..657f84ddb20d 100644
> > --- a/arch/powerpc/include/asm/dma-direct.h
> > +++ b/arch/powerpc/include/asm/dma-direct.h
> > @@ -30,4 +30,8 @@ static inline phys_addr_t __dma_to_phys(struct device 
> > *dev, dma_addr_t daddr)
> > return daddr - PCI_DRAM_OFFSET;
> > return daddr - dev->archdata.dma_offset;
> >  }
> > +
> > +u64 swiotlb_powerpc_get_required(struct device *dev);
> > +#define swiotlb_get_required_mask swiotlb_powerpc_get_required
> > +
> >  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> > diff --git a/arch/powerpc/include/asm/swiotlb.h 
> > b/arch/powerpc/include/asm/swiotlb.h
> > index f65ecf57b66c..1d8c1da26ab3 100644
> > --- a/arch/powerpc/include/asm/swiotlb.h
> > +++ b/arch/powerpc/include/asm/swiotlb.h
> > @@ -13,8 +13,6 @@
> >  
> >  #include 
> >  
> > -extern const struct dma_map_ops powerpc_swiotlb_dma_ops;
> > -
> >  extern unsigned int ppc_swiotlb_enable;
> >  int __init swiotlb_setup_bus_notifier(void);
> >  
> > diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> > b/arch/powerpc/kernel/dma-swiotlb.c
> > index 25986fcd1e5e..0c269de61f39 100644
> > --- a/arch/powerpc/kernel/dma-swiotlb.c
> > +++ b/arch/powerpc/kernel/dma-swiotlb.c
> > @@ -24,7 +24,7 @@
> >  
> >  unsigned int ppc_swiotlb_enable;
> >  
> > -static u64 swiotlb_powerpc_get_required(struct device *dev)
> > +u64 swiotlb_powerpc_get_required(struct device *dev)
> >  {
> > u64 end, mask, max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
> >  
> > @@ -38,30 +38,6 @@ static u64 swiotlb_powerpc_get_required(struct device 
> > *dev)
> > return mask;
> >  }
> >  
> > -/*
> > - * At the moment, all platforms that use this code only require
> > - * swiotlb to be used if we're operating on HIGHMEM.  Since
> > - * we don't ever call anything other than map_sg, unmap_sg,
> > - * map_page, and unmap_page on highmem, use normal dma_ops
> > - * for everything else.
> > - */
> > -const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> > -   .alloc = dma_direct_alloc,
> > -   .free = dma_direct_free,
> > -   .mmap = dma_nommu_mmap_coherent,
> > -   .map_sg = swiotlb_map_sg_attrs,
> > -   .unmap_sg = swiotlb_unmap_sg_attrs,
> > -   .dma_supported = swiotlb_dma_supported,
> > -   .map_page = swiotlb_map_page,
> > -   .unmap_page = swiotlb_unmap_page,
> > -   .sync_single_for_cpu = swiotlb_sync_single_for_cpu,
> > -   .sync_single_for_device = swiotlb_sync_single_for_device,
> > -   .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
> > -   .sync_sg_for_device = swiotlb_sync_sg_for_device,
> > -   .mapping_error = swiotlb_dma_mapping_error,
> > -   .get_required_mask = swiotlb_powerpc_get_required,
> > -};
> > -
> >  void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
> >  {
> > struct pci_controller *hose;
> > @@ -88,7 +64,7 @@ static int ppc_swiotlb_bus_notify(struct notifier_block 
> > *nb,
> >  
> > /* May need to bounce if the device can't address all of DRAM */
> > if ((dma_get_mask(dev) + 1) < memblock_end_of_DRAM())
> > -	set_dma_ops(dev, &powerpc_swiotlb_dma_ops);
> > +	set_dma_ops(dev, &swiotlb_dma_ops);
> >  
> > return NOTIFY_DONE;
> >  }
> > diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
> > index 918be816b097..daf44bc0108d 100644
> > --- a/arch/powerpc/sysdev/fsl_pci.c
> > +++ b/arch/powerpc/sysdev/fsl_pci.c
> > @@ -118,7 +118,7 @@ static void setup_swiotlb_ops(struct pci_controller 
> > *hose)
> >  {
> > if (ppc_swiotlb_enable) {
> > hose->controller_ops.dma_dev_setup = pci_dma_dev_setup_swiotlb;
> > -	set_pci_dma_ops(&powerpc_swiotlb_dma_ops);
> > +	set_pci_dma_ops(&swiotlb_dma_ops);
> > }
> >  }
> >  #else



Re: [PATCH 20/20] powerpc/dma: remove dma_nommu_mmap_coherent

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The remaining implementation for coherent caches is functionally
> identical to the default provided in common code.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h |  7 ---
>  arch/powerpc/kernel/dma-iommu.c|  1 -
>  arch/powerpc/kernel/dma.c  | 13 -
>  arch/powerpc/platforms/pseries/vio.c   |  1 -
>  4 files changed, 22 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 879c4efba785..e62e23aa3714 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -18,13 +18,6 @@
>  #include 
>  #include 
>  
> -/* Some dma direct funcs must be visible for use in other dma_ops */
> -extern int dma_nommu_mmap_coherent(struct device *dev,
> - struct vm_area_struct *vma,
> - void *cpu_addr, dma_addr_t handle,
> - size_t size, unsigned long attrs);
> -
> -
>  static inline unsigned long device_to_mask(struct device *dev)
>  {
>   if (dev->dma_mask && *dev->dma_mask)
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index f9fe2080ceb9..bf5234e1f71b 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -114,7 +114,6 @@ int dma_iommu_mapping_error(struct device *dev, 
> dma_addr_t dma_addr)
>  struct dma_map_ops dma_iommu_ops = {
>   .alloc  = dma_iommu_alloc_coherent,
>   .free   = dma_iommu_free_coherent,
> - .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_iommu_map_sg,
>   .unmap_sg   = dma_iommu_unmap_sg,
>   .dma_supported  = dma_iommu_dma_supported,
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 08b12cbd7abf..5b71c9d1b8cc 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -70,18 +70,6 @@ static void dma_nommu_free_coherent(struct device *dev, 
> size_t size,
>   iommu_free_coherent(iommu, size, vaddr, dma_handle);
>  }
>  
> -int dma_nommu_mmap_coherent(struct device *dev, struct vm_area_struct *vma,
> -  void *cpu_addr, dma_addr_t handle, size_t size,
> -  unsigned long attrs)
> -{
> - unsigned long pfn = page_to_pfn(virt_to_page(cpu_addr));
> -
> - return remap_pfn_range(vma, vma->vm_start,
> -pfn + vma->vm_pgoff,
> -vma->vm_end - vma->vm_start,
> -vma->vm_page_prot);
> -}
> -
>  /* note: needs to be called arch_get_required_mask for dma-noncoherent.c */
>  u64 arch_get_required_mask(struct device *dev)
>  {
> @@ -98,7 +86,6 @@ u64 arch_get_required_mask(struct device *dev)
>  const struct dma_map_ops dma_nommu_ops = {
>   .alloc  = dma_nommu_alloc_coherent,
>   .free   = dma_nommu_free_coherent,
> - .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_direct_map_sg,
>   .map_page   = dma_direct_map_page,
>   .get_required_mask  = arch_get_required_mask,
> diff --git a/arch/powerpc/platforms/pseries/vio.c 
> b/arch/powerpc/platforms/pseries/vio.c
> index 49e04ec19238..51d564313bd0 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -618,7 +618,6 @@ static u64 vio_dma_get_required_mask(struct device *dev)
>  static const struct dma_map_ops vio_dma_mapping_ops = {
>   .alloc = vio_dma_iommu_alloc_coherent,
>   .free  = vio_dma_iommu_free_coherent,
> - .mmap  = dma_nommu_mmap_coherent,
>   .map_sg= vio_dma_iommu_map_sg,
>   .unmap_sg  = vio_dma_iommu_unmap_sg,
>   .map_page  = vio_dma_iommu_map_page,



Re: [PATCH 18/20] powerpc/dma-noncoherent: use generic dma_noncoherent_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The generic dma-noncoherent code provides all that is needed by powerpc.
> 
> Note that the cache maintenance in the existing code is a bit odd
> as it implements both the sync_to_device and sync_to_cpu callouts,
> but never flushes caches when unmapping.  This patch keeps both
> directions around, which will lead to more flushing than the previous
> implementation.  Someone more familiar with the required CPUs should
> eventually take a look and optimize the cache flush handling if needed.

The original code looks bogus indeed.

I think we got away with it because those older CPUs wouldn't speculate
or prefetch aggressively enough (or at all) so the flush on map was
sufficient; the stuff wouldn't come back into the cache.

But safe is better than sorry, so ... tentative Ack, I do need to try
to dig one of these things to test, which might take a while.

Acked-by: Benjamin Herrenschmidt 


> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/Kconfig   |  2 +-
>  arch/powerpc/include/asm/dma-mapping.h | 29 -
>  arch/powerpc/kernel/dma.c  | 59 +++---
>  arch/powerpc/kernel/pci-common.c   |  5 ++-
>  arch/powerpc/kernel/setup-common.c |  4 ++
>  arch/powerpc/mm/dma-noncoherent.c  | 52 +--
>  arch/powerpc/platforms/44x/warp.c  |  2 +-
>  arch/powerpc/platforms/Kconfig.cputype |  6 ++-
>  8 files changed, 60 insertions(+), 99 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index bbfa6a8df4da..33c6017ffce6 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -129,7 +129,7 @@ config PPC
>   # Please keep this list sorted alphabetically.
>   #
>   select ARCH_HAS_DEVMEM_IS_ALLOWED
> - select ARCH_HAS_DMA_SET_COHERENT_MASK
> + select ARCH_HAS_DMA_SET_COHERENT_MASK if !NOT_COHERENT_CACHE
>   select ARCH_HAS_ELF_RANDOMIZE
>   select ARCH_HAS_FORTIFY_SOURCE
>   select ARCH_HAS_GCOV_PROFILE_ALL
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index f0bf7ac2686c..879c4efba785 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -19,40 +19,11 @@
>  #include 
>  
>  /* Some dma direct funcs must be visible for use in other dma_ops */
> -extern void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> -  dma_addr_t *dma_handle, gfp_t flag,
> -  unsigned long attrs);
> -extern void __dma_nommu_free_coherent(struct device *dev, size_t size,
> -void *vaddr, dma_addr_t dma_handle,
> -unsigned long attrs);
>  extern int dma_nommu_mmap_coherent(struct device *dev,
>   struct vm_area_struct *vma,
>   void *cpu_addr, dma_addr_t handle,
>   size_t size, unsigned long attrs);
>  
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> -/*
> - * DMA-consistent mapping functions for PowerPCs that don't support
> - * cache snooping.  These allocate/free a region of uncached mapped
> - * memory space for use with DMA devices.  Alternatively, you could
> - * allocate the space "normally" and use the cache management functions
> - * to ensure it is consistent.
> - */
> -struct device;
> -extern void __dma_sync(void *vaddr, size_t size, int direction);
> -extern void __dma_sync_page(struct page *page, unsigned long offset,
> -  size_t size, int direction);
> -extern unsigned long __dma_get_coherent_pfn(unsigned long cpu_addr);
> -
> -#else /* ! CONFIG_NOT_COHERENT_CACHE */
> -/*
> - * Cache coherent cores.
> - */
> -
> -#define __dma_sync(addr, size, rw)   ((void)0)
> -#define __dma_sync_page(pg, off, sz, rw) ((void)0)
> -
> -#endif /* ! CONFIG_NOT_COHERENT_CACHE */
>  
>  static inline unsigned long device_to_mask(struct device *dev)
>  {
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 2b90a403cdac..b2e88075b2ea 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -36,12 +36,7 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>* we can really use the direct ops
>*/
>   if (dma_direct_supported(dev, dev->coherent_dma_mask))
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - return __dma_nommu_alloc_coherent(dev, size, dma_handle,
> -flag, attrs);
> -#else
>   return dma_

Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These are identical to the arch specific ones, so remove them.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-direct.h |  4 
>  arch/powerpc/include/asm/swiotlb.h|  2 --
>  arch/powerpc/kernel/dma-swiotlb.c | 28 ++-
>  arch/powerpc/sysdev/fsl_pci.c |  2 +-
>  4 files changed, 7 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-direct.h 
> b/arch/powerpc/include/asm/dma-direct.h
> index 0fba19445ae8..657f84ddb20d 100644
> --- a/arch/powerpc/include/asm/dma-direct.h
> +++ b/arch/powerpc/include/asm/dma-direct.h
> @@ -30,4 +30,8 @@ static inline phys_addr_t __dma_to_phys(struct device *dev, 
> dma_addr_t daddr)
>   return daddr - PCI_DRAM_OFFSET;
>   return daddr - dev->archdata.dma_offset;
>  }
> +
> +u64 swiotlb_powerpc_get_required(struct device *dev);
> +#define swiotlb_get_required_mask swiotlb_powerpc_get_required
> +
>  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> diff --git a/arch/powerpc/include/asm/swiotlb.h 
> b/arch/powerpc/include/asm/swiotlb.h
> index f65ecf57b66c..1d8c1da26ab3 100644
> --- a/arch/powerpc/include/asm/swiotlb.h
> +++ b/arch/powerpc/include/asm/swiotlb.h
> @@ -13,8 +13,6 @@
>  
>  #include 
>  
> -extern const struct dma_map_ops powerpc_swiotlb_dma_ops;
> -
>  extern unsigned int ppc_swiotlb_enable;
>  int __init swiotlb_setup_bus_notifier(void);
>  
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> b/arch/powerpc/kernel/dma-swiotlb.c
> index 25986fcd1e5e..0c269de61f39 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -24,7 +24,7 @@
>  
>  unsigned int ppc_swiotlb_enable;
>  
> -static u64 swiotlb_powerpc_get_required(struct device *dev)
> +u64 swiotlb_powerpc_get_required(struct device *dev)
>  {
>   u64 end, mask, max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
>  
> @@ -38,30 +38,6 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
>   return mask;
>  }
>  
> -/*
> - * At the moment, all platforms that use this code only require
> - * swiotlb to be used if we're operating on HIGHMEM.  Since
> - * we don't ever call anything other than map_sg, unmap_sg,
> - * map_page, and unmap_page on highmem, use normal dma_ops
> - * for everything else.
> - */
> -const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> - .alloc = dma_direct_alloc,
> - .free = dma_direct_free,
> - .mmap = dma_nommu_mmap_coherent,
> - .map_sg = swiotlb_map_sg_attrs,
> - .unmap_sg = swiotlb_unmap_sg_attrs,
> - .dma_supported = swiotlb_dma_supported,
> - .map_page = swiotlb_map_page,
> - .unmap_page = swiotlb_unmap_page,
> - .sync_single_for_cpu = swiotlb_sync_single_for_cpu,
> - .sync_single_for_device = swiotlb_sync_single_for_device,
> - .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
> - .sync_sg_for_device = swiotlb_sync_sg_for_device,
> - .mapping_error = swiotlb_dma_mapping_error,
> - .get_required_mask = swiotlb_powerpc_get_required,
> -};
> -
>  void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
>  {
>   struct pci_controller *hose;
> @@ -88,7 +64,7 @@ static int ppc_swiotlb_bus_notify(struct notifier_block *nb,
>  
>   /* May need to bounce if the device can't address all of DRAM */
>   if ((dma_get_mask(dev) + 1) < memblock_end_of_DRAM())
> -	set_dma_ops(dev, &powerpc_swiotlb_dma_ops);
> +	set_dma_ops(dev, &swiotlb_dma_ops);
>  
>   return NOTIFY_DONE;
>  }
> diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
> index 918be816b097..daf44bc0108d 100644
> --- a/arch/powerpc/sysdev/fsl_pci.c
> +++ b/arch/powerpc/sysdev/fsl_pci.c
> @@ -118,7 +118,7 @@ static void setup_swiotlb_ops(struct pci_controller *hose)
>  {
>   if (ppc_swiotlb_enable) {
>   hose->controller_ops.dma_dev_setup = pci_dma_dev_setup_swiotlb;
> - set_pci_dma_ops(&powerpc_swiotlb_dma_ops);
> + set_pci_dma_ops(&swiotlb_dma_ops);
>   }
>  }
>  #else



Re: [PATCH 16/20] powerpc/dma: use dma_direct_{alloc,free}

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These do the same functionality as the existing helpers, but do it
> simpler, and also allow the (optional) use of CMA.
> 
> Note that the swiotlb code now calls into the dma_direct code directly,
> given that it doesn't work with noncoherent caches at all, and isn't called
> when we have an iommu either, so the iommu special case in
> dma_nommu_alloc_coherent isn't required for swiotlb.

I am not convinced that this will produce the same results due to
the way the zone picking works.
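
To make that concrete, here is a rough sketch of the two selection
schemes, distilled from the code quoted below and from patch 02
(illustration only, not from the patch; ARCH_ZONE_DMA_BITS defaults
to 24):

	/* Old powerpc path: derive a pfn limit from the device's own
	 * mask and map it to whatever zone actually covers it. */
	u64 pfn  = (dev->coherent_dma_mask >> PAGE_SHIFT) + 1;
	int zone = dma_pfn_limit_to_zone(pfn); /* ZONE_DMA, ZONE_DMA32, ... */

	/* Generic dma_direct path: compare the mask against fixed
	 * cutoffs, regardless of where the platform's zones end. */
	if (dev->coherent_dma_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
		gfp |= GFP_DMA;
	else if (dev->coherent_dma_mask <= DMA_BIT_MASK(32))
		gfp |= GFP_DMA32;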

As for the interaction with swiotlb, we'll need the FSL guys to have
a look. Scott, do you remember what this is about ?

> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/include/asm/pgtable.h |  1 -
>  arch/powerpc/kernel/dma-swiotlb.c  |  4 +-
>  arch/powerpc/kernel/dma.c  | 78 --
>  arch/powerpc/mm/mem.c  | 19 
>  4 files changed, 11 insertions(+), 91 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 14c79a7dc855..123de4958d2e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -38,7 +38,6 @@ extern unsigned long empty_zero_page[];
>  extern pgd_t swapper_pg_dir[];
>  
>  void limit_zone_pfn(enum zone_type zone, unsigned long max_pfn);
> -int dma_pfn_limit_to_zone(u64 pfn_limit);
>  extern void paging_init(void);
>  
>  /*
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
> index f6e0701c5303..25986fcd1e5e 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -46,8 +46,8 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
>   * for everything else.
>   */
>  const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> - .alloc = __dma_nommu_alloc_coherent,
> - .free = __dma_nommu_free_coherent,
> + .alloc = dma_direct_alloc,
> + .free = dma_direct_free,
>   .mmap = dma_nommu_mmap_coherent,
>   .map_sg = swiotlb_map_sg_attrs,
>   .unmap_sg = swiotlb_unmap_sg_attrs,
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 2cfc45acbb52..2b90a403cdac 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -26,75 +26,6 @@
>   * can set archdata.dma_data to an unsigned long holding the offset. By
>   * default the offset is PCI_DRAM_OFFSET.
>   */
> -
> -static u64 __maybe_unused get_pfn_limit(struct device *dev)
> -{
> - u64 pfn = (dev->coherent_dma_mask >> PAGE_SHIFT) + 1;
> - struct dev_archdata __maybe_unused *sd = &dev->archdata;
> -
> -#ifdef CONFIG_SWIOTLB
> - if (sd->max_direct_dma_addr && dev->dma_ops == &powerpc_swiotlb_dma_ops)
> - pfn = min_t(u64, pfn, sd->max_direct_dma_addr >> PAGE_SHIFT);
> -#endif
> -
> - return pfn;
> -}
> -
> -#ifndef CONFIG_NOT_COHERENT_CACHE
> -void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> -   dma_addr_t *dma_handle, gfp_t flag,
> -   unsigned long attrs)
> -{
> - void *ret;
> - struct page *page;
> - int node = dev_to_node(dev);
> -#ifdef CONFIG_FSL_SOC
> - u64 pfn = get_pfn_limit(dev);
> - int zone;
> -
> - /*
> -  * This code should be OK on other platforms, but we have drivers that
> -  * don't set coherent_dma_mask. As a workaround we just ifdef it. This
> -  * whole routine needs some serious cleanup.
> -  */
> -
> - zone = dma_pfn_limit_to_zone(pfn);
> - if (zone < 0) {
> - dev_err(dev, "%s: No suitable zone for pfn %#llx\n",
> - __func__, pfn);
> - return NULL;
> - }
> -
> - switch (zone) {
> - case ZONE_DMA:
> - flag |= GFP_DMA;
> - break;
> -#ifdef CONFIG_ZONE_DMA32
> - case ZONE_DMA32:
> - flag |= GFP_DMA32;
> - break;
> -#endif
> - };
> -#endif /* CONFIG_FSL_SOC */
> -
> - page = alloc_pages_node(node, flag, get_order(size));
> - if (page == NULL)
> - return NULL;
> - ret = page_address(page);
> - memset(ret, 0, size);
> - *dma_handle = phys_to_dma(dev,__pa(ret));
> -
> - return ret;
> -}
> -
> -void __dma_nommu_free_coherent(struct device *dev, size_t size,
> - void *vaddr, dma_addr_t dma_handle,
> - unsigned long attrs)
> -{
> - free_pages((unsigned long)vaddr, get_order(size));
> -}
> -#endif /* !CONFIG_NOT_COHERENT_CACHE */
> -
>  static void *dma_nommu_alloc_coherent(struct device *dev, size_t size,
>  dma_addr_t *dma_handle, gfp_t flag,
>  unsigned long attrs)
> @@ -105,8 +36,12 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>* we can really use the direct ops
>*/
>   if (dma_direct_supported(dev, dev->coherent_dma_mask))
> 

Re: [PATCH 15/20] powerpc/dma: remove the unused unmap_page and unmap_sg methods

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These methods are optional to start with, no need to implement no-op
> versions.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/kernel/dma.c | 16 
>  1 file changed, 16 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 511a4972560d..2cfc45acbb52 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -178,12 +178,6 @@ static int dma_nommu_map_sg(struct device *dev, struct 
> scatterlist *sgl,
>   return nents;
>  }
>  
> -static void dma_nommu_unmap_sg(struct device *dev, struct scatterlist *sg,
> - int nents, enum dma_data_direction direction,
> - unsigned long attrs)
> -{
> -}
> -
>  static u64 dma_nommu_get_required_mask(struct device *dev)
>  {
>   u64 end, mask;
> @@ -209,14 +203,6 @@ static inline dma_addr_t dma_nommu_map_page(struct 
> device *dev,
>   return phys_to_dma(dev, page_to_phys(page)) + offset;
>  }
>  
> -static inline void dma_nommu_unmap_page(struct device *dev,
> -  dma_addr_t dma_address,
> -  size_t size,
> -  enum dma_data_direction direction,
> -  unsigned long attrs)
> -{
> -}
> -
>  #ifdef CONFIG_NOT_COHERENT_CACHE
>  static inline void dma_nommu_sync_sg(struct device *dev,
>   struct scatterlist *sgl, int nents,
> @@ -242,10 +228,8 @@ const struct dma_map_ops dma_nommu_ops = {
>   .free   = dma_nommu_free_coherent,
>   .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_nommu_map_sg,
> - .unmap_sg   = dma_nommu_unmap_sg,
>   .dma_supported  = dma_direct_supported,
>   .map_page   = dma_nommu_map_page,
> - .unmap_page = dma_nommu_unmap_page,
>   .get_required_mask  = dma_nommu_get_required_mask,
>  #ifdef CONFIG_NOT_COHERENT_CACHE
>   .sync_single_for_cpu= dma_nommu_sync_single,



Re: [PATCH 14/20] powerpc/dma: replace dma_nommu_dma_supported with dma_direct_supported

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The ppc32 case of dma_nommu_dma_supported already was a no-op, and the
> 64-bit case came to the same conclusion as dma_direct_supported, so
> replace it with the generic version.

It's not at all equivalent (see my review of your earlier patch), or
am I missing something ?

 - ppc32 always returns 1, but dma_direct_supported() will not for
devices with a <32-bit mask (and yes, ppc32 isn't quite right to do
so, it should check against the memory size, but in practice it
worked, as the only limited devices we deal with on systems we still
support have a 31-bit limitation)

 - ppc64 needs to check against the end of DRAM, as some devices will
fail the check; dma_direct_supported() doesn't seem to be doing that.

Also, as I mentioned, I'm not sure about the business with ZONE_DMA
and that arbitrary 24-bit limit, since our entire memory is in
ZONE_DMA, but that's a different can of worms I suppose.
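
For reference, the ppc64 check being removed boils down to this
(sketch only, lifted from the hunk below):

	/* Usable for direct DMA only if the device's mask covers the
	 * DMA address of the last byte of RAM. */
	u64 limit = phys_to_dma(dev, memblock_end_of_DRAM() - 1);
	if (mask >= limit)
		return 1;	/* every RAM address is reachable */
	return 0;		/* some of RAM is beyond the device */

whereas dma_direct_supported() appears to compare the mask only
against fixed bit-width cutoffs.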

> Signed-off-by: Christoph Hellwig 


> ---
>  arch/powerpc/Kconfig  |  1 +
>  arch/powerpc/kernel/dma.c | 28 +++-
>  2 files changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index f9cae7edd735..bbfa6a8df4da 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -158,6 +158,7 @@ config PPC
>   select CLONE_BACKWARDS
>   select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
>   select DYNAMIC_FTRACE   if FUNCTION_TRACER
> + select DMA_DIRECT_OPS
>   select EDAC_ATOMIC_SCRUB
>   select EDAC_SUPPORT
>   select GENERIC_ATOMIC64 if PPC32
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 3487de83bb37..511a4972560d 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -40,28 +40,6 @@ static u64 __maybe_unused get_pfn_limit(struct device *dev)
>   return pfn;
>  }
>  
> -static int dma_nommu_dma_supported(struct device *dev, u64 mask)
> -{
> -#ifdef CONFIG_PPC64
> - u64 limit = phys_to_dma(dev, (memblock_end_of_DRAM() - 1));
> -
> - /* Limit fits in the mask, we are good */
> - if (mask >= limit)
> - return 1;
> -
> -#ifdef CONFIG_FSL_SOC
> - /* Freescale gets another chance via ZONE_DMA/ZONE_DMA32, however
> -  * that will have to be refined if/when they support iommus
> -  */
> - return 1;
> -#endif
> - /* Sorry ... */
> - return 0;
> -#else
> - return 1;
> -#endif
> -}
> -
>  #ifndef CONFIG_NOT_COHERENT_CACHE
>  void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t flag,
> @@ -126,7 +104,7 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>   /* The coherent mask may be smaller than the real mask, check if
>* we can really use the direct ops
>*/
> - if (dma_nommu_dma_supported(dev, dev->coherent_dma_mask))
> + if (dma_direct_supported(dev, dev->coherent_dma_mask))
>   return __dma_nommu_alloc_coherent(dev, size, dma_handle,
>  flag, attrs);
>  
> @@ -148,7 +126,7 @@ static void dma_nommu_free_coherent(struct device *dev, 
> size_t size,
>   struct iommu_table *iommu;
>  
>   /* See comments in dma_nommu_alloc_coherent() */
> - if (dma_nommu_dma_supported(dev, dev->coherent_dma_mask))
> + if (dma_direct_supported(dev, dev->coherent_dma_mask))
>   return __dma_nommu_free_coherent(dev, size, vaddr, dma_handle,
> attrs);
>   /* Maybe we used an iommu ... */
> @@ -265,7 +243,7 @@ const struct dma_map_ops dma_nommu_ops = {
>   .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_nommu_map_sg,
>   .unmap_sg   = dma_nommu_unmap_sg,
> - .dma_supported  = dma_nommu_dma_supported,
> + .dma_supported  = dma_direct_supported,
>   .map_page   = dma_nommu_map_page,
>   .unmap_page = dma_nommu_unmap_page,
>   .get_required_mask  = dma_nommu_get_required_mask,



Re: [PATCH 13/20] powerpc/dma: remove get_dma_offset

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Just fold the calculation into __phys_to_dma/__dma_to_phys as those are
> the only places that should know about it.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-direct.h  |  8 ++--
>  arch/powerpc/include/asm/dma-mapping.h | 16 
>  2 files changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-direct.h b/arch/powerpc/include/asm/dma-direct.h
> index 7702875aabb7..0fba19445ae8 100644
> --- a/arch/powerpc/include/asm/dma-direct.h
> +++ b/arch/powerpc/include/asm/dma-direct.h
> @@ -19,11 +19,15 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>  
>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>  {
> - return paddr + get_dma_offset(dev);
> + if (!dev)
> + return paddr + PCI_DRAM_OFFSET;
> + return paddr + dev->archdata.dma_offset;
>  }
>  
>  static inline phys_addr_t __dma_to_phys(struct device *dev, dma_addr_t daddr)
>  {
> - return daddr - get_dma_offset(dev);
> + if (!dev)
> + return daddr - PCI_DRAM_OFFSET;
> + return daddr - dev->archdata.dma_offset;
>  }
>  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index dacd0f93f2b2..f0bf7ac2686c 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -80,22 +80,6 @@ static inline const struct dma_map_ops 
> *get_arch_dma_ops(struct bus_type *bus)
>   return NULL;
>  }
>  
> -/*
> - * get_dma_offset()
> - *
> - * Get the dma offset on configurations where the dma address can be 
> determined
> - * from the physical address by looking at a simple offset.  Direct dma and
> - * swiotlb use this function, but it is typically not used by implementations
> - * with an iommu.
> - */
> -static inline dma_addr_t get_dma_offset(struct device *dev)
> -{
> - if (dev)
> - return dev->archdata.dma_offset;
> -
> - return PCI_DRAM_OFFSET;
> -}
> -
>  static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>  {
>   if (dev)



Re: [PATCH 12/20] powerpc/dma: use phys_to_dma instead of get_dma_offset

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Use the standard portable helper instead of the powerpc specific one,
> which is about to go away.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/kernel/dma-swiotlb.c |  5 ++---
>  arch/powerpc/kernel/dma.c | 12 ++--
>  2 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
> index 88f3963ca30f..f6e0701c5303 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -11,7 +11,7 @@
>   *
>   */
>  
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -31,9 +31,8 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
> end = memblock_end_of_DRAM();
> if (max_direct_dma_addr && end > max_direct_dma_addr)
> end = max_direct_dma_addr;
> -   end += get_dma_offset(dev);
>  
> -   mask = 1ULL << (fls64(end) - 1);
> +   mask = 1ULL << (fls64(phys_to_dma(dev, end)) - 1);
> mask += mask - 1;
>  
> return mask;
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index eceaa92e6986..3487de83bb37 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -6,7 +6,7 @@
>   */
>  
>  #include 
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -43,7 +43,7 @@ static u64 __maybe_unused get_pfn_limit(struct device *dev)
>  static int dma_nommu_dma_supported(struct device *dev, u64 mask)
>  {
>  #ifdef CONFIG_PPC64
> -   u64 limit = get_dma_offset(dev) + (memblock_end_of_DRAM() - 1);
> +   u64 limit = phys_to_dma(dev, (memblock_end_of_DRAM() - 1));
>  
> /* Limit fits in the mask, we are good */
> if (mask >= limit)
> @@ -104,7 +104,7 @@ void *__dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
> return NULL;
> ret = page_address(page);
> memset(ret, 0, size);
> -   *dma_handle = __pa(ret) + get_dma_offset(dev);
> +   *dma_handle = phys_to_dma(dev,__pa(ret));
>  
> return ret;
>  }
> @@ -188,7 +188,7 @@ static int dma_nommu_map_sg(struct device *dev, struct 
> scatterlist *sgl,
> int i;
>  
> for_each_sg(sgl, sg, nents, i) {
> -   sg->dma_address = sg_phys(sg) + get_dma_offset(dev);
> +   sg->dma_address = phys_to_dma(dev, sg_phys(sg));
> sg->dma_length = sg->length;
>  
> if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
> @@ -210,7 +210,7 @@ static u64 dma_nommu_get_required_mask(struct device *dev)
>  {
> u64 end, mask;
>  
> -   end = memblock_end_of_DRAM() + get_dma_offset(dev);
> +   end = phys_to_dma(dev, memblock_end_of_DRAM());
>  
> mask = 1ULL << (fls64(end) - 1);
> mask += mask - 1;
> @@ -228,7 +228,7 @@ static inline dma_addr_t dma_nommu_map_page(struct device 
> *dev,
> if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> __dma_sync_page(page, offset, size, dir);
>  
> -   return page_to_phys(page) + offset + get_dma_offset(dev);
> +   return phys_to_dma(dev, page_to_phys(page)) + offset;
>  }
>  
>  static inline void dma_nommu_unmap_page(struct device *dev,
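
(Aside: the required-mask computation in the hunk above rounds the
highest reachable DMA address up to a contiguous all-ones mask. A
standalone worked example, not from the patch:

	u64 end  = 0x123456789ULL;	     /* top DMA address */
	u64 mask = 1ULL << (fls64(end) - 1); /* fls64() == 33, so mask
						is 1ULL << 32, i.e.
						0x100000000 */
	mask += mask - 1;		     /* 0x1ffffffff, all ones
						up to and including
						the top bit of end */

i.e. the result is the smallest 2^n - 1 mask that still covers end.)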



Re: [PATCH 11/20] powerpc/dma: split the two __dma_alloc_coherent implementations

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The implemementation for the CONFIG_NOT_COHERENT_CACHE case doesn't share
> any code with the one for systems with coherent caches.  Split it off
> and merge it with the helpers in dma-noncoherent.c that have no other
> callers.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h |  5 -
>  arch/powerpc/kernel/dma.c  | 14 ++
>  arch/powerpc/mm/dma-noncoherent.c  | 15 +++
>  arch/powerpc/platforms/44x/warp.c  |  2 +-
>  4 files changed, 10 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index f2a4a7142b1e..dacd0f93f2b2 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -39,9 +39,6 @@ extern int dma_nommu_mmap_coherent(struct device *dev,
>   * to ensure it is consistent.
>   */
>  struct device;
> -extern void *__dma_alloc_coherent(struct device *dev, size_t size,
> -   dma_addr_t *handle, gfp_t gfp);
> -extern void __dma_free_coherent(size_t size, void *vaddr);
>  extern void __dma_sync(void *vaddr, size_t size, int direction);
>  extern void __dma_sync_page(struct page *page, unsigned long offset,
>size_t size, int direction);
> @@ -52,8 +49,6 @@ extern unsigned long __dma_get_coherent_pfn(unsigned long 
> cpu_addr);
>   * Cache coherent cores.
>   */
>  
> -#define __dma_alloc_coherent(dev, gfp, size, handle) NULL
> -#define __dma_free_coherent(size, addr)  ((void)0)
>  #define __dma_sync(addr, size, rw)   ((void)0)
>  #define __dma_sync_page(pg, off, sz, rw) ((void)0)
>  
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 3939589aab04..eceaa92e6986 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -62,18 +62,12 @@ static int dma_nommu_dma_supported(struct device *dev, 
> u64 mask)
>  #endif
>  }
>  
> +#ifndef CONFIG_NOT_COHERENT_CACHE
>  void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t flag,
> unsigned long attrs)
>  {
>   void *ret;
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - ret = __dma_alloc_coherent(dev, size, dma_handle, flag);
> - if (ret == NULL)
> - return NULL;
> - *dma_handle += get_dma_offset(dev);
> - return ret;
> -#else
>   struct page *page;
>   int node = dev_to_node(dev);
>  #ifdef CONFIG_FSL_SOC
> @@ -113,19 +107,15 @@ void *__dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>   *dma_handle = __pa(ret) + get_dma_offset(dev);
>  
>   return ret;
> -#endif
>  }
>  
>  void __dma_nommu_free_coherent(struct device *dev, size_t size,
>   void *vaddr, dma_addr_t dma_handle,
>   unsigned long attrs)
>  {
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - __dma_free_coherent(size, vaddr);
> -#else
>   free_pages((unsigned long)vaddr, get_order(size));
> -#endif
>  }
> +#endif /* !CONFIG_NOT_COHERENT_CACHE */
>  
>  static void *dma_nommu_alloc_coherent(struct device *dev, size_t size,
>  dma_addr_t *dma_handle, gfp_t flag,
> diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
> index d1c16456abac..cfc48a253707 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -29,7 +29,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  #include 
> @@ -151,8 +151,8 @@ static struct ppc_vm_region *ppc_vm_region_find(struct 
> ppc_vm_region *head, unsi
>   * Allocate DMA-coherent memory space and return both the kernel remapped
>   * virtual and bus address for that space.
>   */
> -void *
> -__dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, 
> gfp_t gfp)
> +void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> + dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>  {
>   struct page *page;
>   struct ppc_vm_region *c;
> @@ -223,7 +223,7 @@ __dma_alloc_coherent(struct device *dev, size_t size, 
> dma_addr_t *handle, gfp_t
>   /*
>* Set the "dma handle"
>*/
> - *handle = page_to_phys(page);
> + *dma_handle = phys_to_dma(dev, page_to_phys(page));
>  
>   do {
>  

Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The requirement to disable local irqs over kmap_atomic is long gone,
> so remove those calls.

Really ? I'm trying to verify that and getting lost in a mess of
macros from hell in the per-cpu stuff, but if you look at our
implementation of kmap_atomic_prot(), all it does is a
preempt_disable(), and then it uses kmap_atomic_idx_push():

int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;

Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
i.e. this is the non-interrupt-safe version...
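
Schematically, what "not interrupt safe" means for that increment
(whether an irq-context kmap_atomic() can actually interleave
destructively here is exactly the question):

	/* __this_cpu_inc_return() is, generically, a plain
	 * read-modify-write on the per-cpu counter: */
	idx = __kmap_atomic_idx;	/* load */
					/* <- an irq here could run
					 * its own kmap_atomic_idx_
					 * push()/pop() against the
					 * same counter */
	__kmap_atomic_idx = idx + 1;	/* store */

	/* this_cpu_inc_return() would make the update irq-safe. */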

Ben.

> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/mm/dma-noncoherent.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
> index 382528475433..d1c16456abac 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -357,12 +357,10 @@ static inline void __dma_sync_page_highmem(struct page 
> *page,
>  {
>   size_t seg_size = min((size_t)(PAGE_SIZE - offset), size);
>   size_t cur_size = seg_size;
> - unsigned long flags, start, seg_offset = offset;
> + unsigned long start, seg_offset = offset;
>   int nr_segs = 1 + ((size - seg_size) + PAGE_SIZE - 1)/PAGE_SIZE;
>   int seg_nr = 0;
>  
> - local_irq_save(flags);
> -
>   do {
>   start = (unsigned long)kmap_atomic(page + seg_nr) + seg_offset;
>  
> @@ -378,8 +376,6 @@ static inline void __dma_sync_page_highmem(struct page 
> *page,
>   cur_size += seg_size;
>   seg_offset = 0;
>   } while (seg_nr < nr_segs);
> -
> - local_irq_restore(flags);
>  }
>  #endif /* CONFIG_HIGHMEM */
>  



Re: [PATCH 09/20] powerpc/dma: remove the unused ISA_DMA_THRESHOLD export

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 


> ---
>  arch/powerpc/kernel/setup_32.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
> index 74457485574b..3c2d093f74c7 100644
> --- a/arch/powerpc/kernel/setup_32.c
> +++ b/arch/powerpc/kernel/setup_32.c
> @@ -55,7 +55,6 @@ unsigned long ISA_DMA_THRESHOLD;
>  unsigned int DMA_MODE_READ;
>  unsigned int DMA_MODE_WRITE;
>  
> -EXPORT_SYMBOL(ISA_DMA_THRESHOLD);
>  EXPORT_SYMBOL(DMA_MODE_READ);
>  EXPORT_SYMBOL(DMA_MODE_WRITE);



Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-08 Thread Benjamin Herrenschmidt
On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> It turns out cxl actually uses it.  So for now skip this patch,
> although random code in drivers messing with dma ops will need to
> be sorted out sooner or later.

CXL devices are "special": they bypass the classic iommu in favor of
letting the device operate directly on the main processor page tables
via an MMU context (so, basically, the device can use userspace
addresses directly), akin to ATS.

I think the code currently uses the nommu ops as a way to do a simple
kernel mapping for kernel drivers using CXL (not userspace stuff)
though.
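
If memory serves, the cxl vPHB code does something along these lines
when a device is plugged (sketch, not verified against the current
tree; the PAGE_OFFSET offset makes bus addresses equal kernel
linear-map addresses):

	set_dma_ops(&dev->dev, &dma_nommu_ops);
	set_dma_offset(&dev->dev, PAGE_OFFSET);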

Ben.




Re: [PATCH 07/20] powerpc/dma: remove the unused ARCH_HAS_DMA_MMAP_COHERENT define

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa394520af6..f2a4a7142b1e 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -112,7 +112,5 @@ extern int dma_set_mask(struct device *dev, u64 dma_mask);
>  
>  extern u64 __dma_get_required_mask(struct device *dev);
>  
> -#define ARCH_HAS_DMA_MMAP_COHERENT
> -
>  #endif /* __KERNEL__ */
>  #endif   /* _ASM_DMA_MAPPING_H */



Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> We need to take the DMA offset and encryption bit into account when selecting
> a zone.  Add a helper that takes those into account and use it.

That whole "encryption" stuff seems to be completely specific to the
way x86 does memory encryption, or am I mistaken ? It's not clear to me
what that does in practice and how it relates to DMA mappings.

I'm also not sure about that whole business with ZONE_DMA and
ARCH_ZONE_DMA_BITS...

On ppc64, unless you enable swiotlb (which we only do currently on
some embedded platforms), you have all of memory in ZONE_DMA.

[0.00] Zone ranges:
[0.00]   DMA  [mem 0x-0x001f]
[0.00]   DMA32empty
[0.00]   Normal   empty
[0.00]   Device   empty

I'm not sure how this will work with that dma direct code.

I also see a number of tests against a 64-bit mask rather than the
top of memory...
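
Put differently (sketch only, not from the patch): on a box like the
above, the meaningful bound is where RAM ends, not a fixed mask
width, i.e. something closer to

	/* All of memory reachable? Compare against the end of DRAM,
	 * not against DMA_BIT_MASK(64). */
	dev->coherent_dma_mask >= phys_to_dma(dev, memblock_end_of_DRAM() - 1)

rather than a test like dma_coherent_below(dev, DMA_BIT_MASK(64)).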

Ben.

> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/dma/direct.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index d32d4f0d2c0c..c2c1df8827f2 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -58,6 +58,14 @@ static bool dma_coherent_ok(struct device *dev, 
> phys_addr_t phys, size_t size)
>   return addr + size - 1 <= dev->coherent_dma_mask;
>  }
>  
> +static bool dma_coherent_below(struct device *dev, u64 mask)
> +{
> + dma_addr_t addr = force_dma_unencrypted() ?
> + __phys_to_dma(dev, mask) : phys_to_dma(dev, mask);
> +
> + return dev->coherent_dma_mask <= addr;
> +}
> +
>  void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t 
> *dma_handle,
>   gfp_t gfp, unsigned long attrs)
>  {
> @@ -70,9 +78,9 @@ void *dma_direct_alloc(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>   gfp &= ~__GFP_ZERO;
>  
>   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */
> - if (dev->coherent_dma_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> + if (dma_coherent_below(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
>   gfp |= GFP_DMA;
> - if (dev->coherent_dma_mask <= DMA_BIT_MASK(32) && !(gfp & GFP_DMA))
> + if (dma_coherent_below(dev, DMA_BIT_MASK(32)) && !(gfp & GFP_DMA))
>   gfp |= GFP_DMA32;
>  
>  again:
> @@ -92,14 +100,14 @@ void *dma_direct_alloc(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>   page = NULL;
>  
>   if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
> - dev->coherent_dma_mask < DMA_BIT_MASK(64) &&
> + dma_coherent_below(dev, DMA_BIT_MASK(64)) &&
>   !(gfp & (GFP_DMA32 | GFP_DMA))) {
>   gfp |= GFP_DMA32;
>   goto again;
>   }
>  
>   if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> - dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
> + dma_coherent_below(dev, DMA_BIT_MASK(32)) &&
>   !(gfp & GFP_DMA)) {
>   gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
>   goto again;


