Re: a3b2cb30 broken panic reporting for qemu guests

2017-12-03 Thread David Gibson
On Mon, Dec 04, 2017 at 04:12:06PM +1000, Nicholas Piggin wrote:
> On Mon, 4 Dec 2017 16:49:14 +1100
> David Gibson  wrote:
> 
> > On Fri, Dec 01, 2017 at 09:40:38PM +1000, Nicholas Piggin wrote:
> > > On Fri, 01 Dec 2017 22:11:50 +1100
> > > Michael Ellerman  wrote:
> > >   
> > > > David Gibson  writes:
> > > >   
> > > > > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote:
> > > > >> On Wed, 29 Nov 2017 15:06:52 +1100
> > > > >> David Gibson  wrote:
> > > > >> 
> > > > >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic 
> > > > >> > notifier"
> > > > >> > purports to fix a problem when the kernel panics with fadump not
> > > > >> > registered, but it breaks something else instead.  I _think_ it was
> > > > >> > working on the incorrect assumption that ppc_md.panic was (or 
> > > > >> > should
> > > > >> > be) only used with fadump, but I'm not really sure.
> > > > >> > 
> > > > >> > Panic works with kdump enabled, and (I think) with fadump enabled).
> > > > >> > However, with neither of these enabled, we always go to the generic
> > > > >> > panic logic.
> > > > >> 
> > > > >> Yeah thanks, I can't remember what assumption I was working on tbh.
> > > > >>  
> > > > >> > That's incorrect for PAPR guests - they should call ibm,os-term via
> > > > >> > RTAS.  Under qemu this leads to a "GUEST_PANICKED" event 
> > > > >> > notification
> > > > >> > which higher-level management pays attention to.  Since a3b2cb30 we
> > > > >> > now reboot instead of reporting that.
> > > > >> > 
> > > > >> > I believe it will also break panic for PS3 machines, but since that
> > > > >> > platform basically no longer exists, we probably don't care.
> > > > >> 
> > > > >> I (hope) it should just go down to the normal panic path and not do
> > > > >> much worse than it already does -- although it won't print out that
> > > > >> message.
> > > > >> 
> > > > >> > I'm not entirely sure how to fix this.  I _think_ what we want is 
> > > > >> > to
> > > > >> > call ppc_md.panic from a late panic notifier, the way this patch 
> > > > >> > does
> > > > >> > for fadump_panic_event() if fadump is registered.
> > > > >> 
> > > > >> The problem I had there is that some of the printk and console stuff
> > > > >> wasn't getting flushed out, so I was getting a blank screen. This was
> > > > >> probably in conjunction with panicing from NMI context that we're now
> > > > >> starting to introduce.
> > > > >> 
> > > > >> So it's a bit annoying. There's other ugliness we have for being 
> > > > >> unable
> > > > >> to control panic code well enough from arch code
> > > > >> (arch/powerpc/platforms/powernv/opal.c)
> > > > >> 
> > > > >> I guess a really minimal fix is to put an #ifdef powerpc down the 
> > > > >> bottom
> > > > >> there (/me *cries*).
> > > > >
> > > > > Um.. right.  I'm not really sure from that how to go forward from
> > > > > here.  We want to fix this for RHEL7.5, which doesn't give us a lot of
> > > > > time.
> > > > >
> > > > > Adding the #ifdef at the bottom of the generic panic code is gross,
> > > > > but there's already a bunch of that, so maybe adequate until a better
> > > > > solution can be found?
> > > > 
> > > > I think you mean put an #ifdef at the bottom of panic(). If so that
> > > > won't work. Our default panic_timeout is 180 so we never get to the
> > > > bottom of panic(), we call emergency_restart().
> > > > 
> > > > You *could* put an #ifdef powerpc before that, but that's even more
> > > > gross because it's in a different place to the sparc/s390 #ifdefs.
> > > > 
> > > > I notice we don't implement machine_emergency_restart(), it just
> > > > becomes machine_restart(NULL).
> > > > 
> > > > So it seems like that's the place we should be hooking,
> > > > machine_emergency_restart(). That's what x86 does.
> > > > 
> > > > But panic() is not the only caller of emergency_restart(), so it's not
> > > > an entirely straight forward conversion.
> > > > 
> > > > So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a
> > > > bigger rework for 4.16.
> > > > 
> > > > Nick that shouldn't break your original aim too badly I think? ie. worst
> > > > case is we panic() but don't see the output if we came from NMI?  
> > > 
> > > If the fix is not pretty obvious, then I guess revert. We actually
> > > do have a bit of a regression though, since we've started marking
> > > system reset interrupts as NMI. Previously a system reset would have
> > > a better chance of printing something there.
> > > 
> > > So I wonder is an ifdef really all that much worse just because it's not
> > > in the same exact place as the others? We do get bug reports that were
> > > triggered by a system reset from hypervisor console. Hopefully they would
> > > be running with crash dumps usually.  
> > 
> > I don't think an ifdef there is really correct though.  

Re: [PATCH 0/4] Fix use after free in HPT resizing code and related minor improvements

2017-12-03 Thread David Gibson
On Wed, Nov 29, 2017 at 11:38:22AM -0500, Serhii Popovych wrote:
> It is possible to trigger use after free during HPT resize
> causing host kernel to crash. More details and analysis of
> the problem can be found in change with corresponding subject
> (KVM: PPC: Book3S HV: Fix use after free in case of multiple
> resize requests).
> 
> We need some changes to prepare for the fix, especially
> make ->error in HPT resize instance single point for
> tracking allocation state, improve kvmppc_allocate_hpt()
> and kvmppc_free_hpt() so they can be used more safely.
> 
> See individual commit description message to get more
> information on changes presented.

I spoke with Paul Mackerras about these patches on IRC today.  We want
this as a fix, ASAP, in 4.15.  However, he's uncomfortable with
pushing some of extra cleanups which aren't necessary for the bug fix
this late for 4.15, and was having trouble following what was the core
of the fix.  He was also nervous about the addition of more BUG_ON()s.

To avoid the round trip to Ukraine time and back, I've made revised
versions of patches 1 & 3 which should apply standalone, replaced the
BUG_ON()s with WARN_ON()s and revised the commit messages to better
explain the crucial part of the fix.

However, I've run out of time to test them.

Serhii,  I'll send you my revised patches shortly.  Can you please
test them and repost.  Then you can rebase patches 2 & 4 from this
series on top of the revised patches and post those separately (as a
cleanup with less urgency than the actual fix).

A couple of people have also suggested CCing k...@vger.kernel.org on
the next round in addition to the lists already included.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: a3b2cb30 broken panic reporting for qemu guests

2017-12-03 Thread Nicholas Piggin
On Mon, 4 Dec 2017 16:49:14 +1100
David Gibson  wrote:

> On Fri, Dec 01, 2017 at 09:40:38PM +1000, Nicholas Piggin wrote:
> > On Fri, 01 Dec 2017 22:11:50 +1100
> > Michael Ellerman  wrote:
> >   
> > > David Gibson  writes:
> > >   
> > > > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote:
> > > >> On Wed, 29 Nov 2017 15:06:52 +1100
> > > >> David Gibson  wrote:
> > > >> 
> > > >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier"
> > > >> > purports to fix a problem when the kernel panics with fadump not
> > > >> > registered, but it breaks something else instead.  I _think_ it was
> > > >> > working on the incorrect assumption that ppc_md.panic was (or should
> > > >> > be) only used with fadump, but I'm not really sure.
> > > >> > 
> > > >> > Panic works with kdump enabled, and (I think) with fadump enabled).
> > > >> > However, with neither of these enabled, we always go to the generic
> > > >> > panic logic.
> > > >> 
> > > >> Yeah thanks, I can't remember what assumption I was working on tbh.
> > > >>  
> > > >> > That's incorrect for PAPR guests - they should call ibm,os-term via
> > > >> > RTAS.  Under qemu this leads to a "GUEST_PANICKED" event notification
> > > >> > which higher-level management pays attention to.  Since a3b2cb30 we
> > > >> > now reboot instead of reporting that.
> > > >> > 
> > > >> > I believe it will also break panic for PS3 machines, but since that
> > > >> > platform basically no longer exists, we probably don't care.
> > > >> 
> > > >> I (hope) it should just go down to the normal panic path and not do
> > > >> much worse than it already does -- although it won't print out that
> > > >> message.
> > > >> 
> > > >> > I'm not entirely sure how to fix this.  I _think_ what we want is to
> > > >> > call ppc_md.panic from a late panic notifier, the way this patch does
> > > >> > for fadump_panic_event() if fadump is registered.
> > > >> 
> > > >> The problem I had there is that some of the printk and console stuff
> > > >> wasn't getting flushed out, so I was getting a blank screen. This was
> > > >> probably in conjunction with panicing from NMI context that we're now
> > > >> starting to introduce.
> > > >> 
> > > >> So it's a bit annoying. There's other ugliness we have for being unable
> > > >> to control panic code well enough from arch code
> > > >> (arch/powerpc/platforms/powernv/opal.c)
> > > >> 
> > > >> I guess a really minimal fix is to put an #ifdef powerpc down the 
> > > >> bottom
> > > >> there (/me *cries*).
> > > >
> > > > Um.. right.  I'm not really sure from that how to go forward from
> > > > here.  We want to fix this for RHEL7.5, which doesn't give us a lot of
> > > > time.
> > > >
> > > > Adding the #ifdef at the bottom of the generic panic code is gross,
> > > > but there's already a bunch of that, so maybe adequate until a better
> > > > solution can be found?
> > > 
> > > I think you mean put an #ifdef at the bottom of panic(). If so that
> > > won't work. Our default panic_timeout is 180 so we never get to the
> > > bottom of panic(), we call emergency_restart().
> > > 
> > > You *could* put an #ifdef powerpc before that, but that's even more
> > > gross because it's in a different place to the sparc/s390 #ifdefs.
> > > 
> > > I notice we don't implement machine_emergency_restart(), it just
> > > becomes machine_restart(NULL).
> > > 
> > > So it seems like that's the place we should be hooking,
> > > machine_emergency_restart(). That's what x86 does.
> > > 
> > > But panic() is not the only caller of emergency_restart(), so it's not
> > > an entirely straight forward conversion.
> > > 
> > > So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a
> > > bigger rework for 4.16.
> > > 
> > > Nick that shouldn't break your original aim too badly I think? ie. worst
> > > case is we panic() but don't see the output if we came from NMI?  
> > 
> > If the fix is not pretty obvious, then I guess revert. We actually
> > do have a bit of a regression though, since we've started marking
> > system reset interrupts as NMI. Previously a system reset would have
> > a better chance of printing something there.
> > 
> > So I wonder is an ifdef really all that much worse just because it's not
> > in the same exact place as the others? We do get bug reports that were
> > triggered by a system reset from hypervisor console. Hopefully they would
> > be running with crash dumps usually.  
> 
> I don't think an ifdef there is really correct though.  Sounds like we
> might instead need to move some console flushes before calling of all
> the notifiers, so that things like platforms/pseries or pvpanic can
> insert non-returning notifiers.

Well, I don't necessarily think that's the right thing to do either.
If we have a crash dump handler, we should get the console memory
and 

Re: [PATCH 2/4] powerpc/64: do not trace irqs-off at interrupt return to soft-disabled context

2017-12-03 Thread Nicholas Piggin
On Mon, 04 Dec 2017 16:09:57 +1100
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > When an interrupt is returning to a soft-disabled context (which can
> > happen for non-maskable interrupts or synchronous interrupts), it goes
> > through the motions of soft-disabling again, including calling
> > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
> >
> > This is not necessary, because we must already be soft-disabled in the
> > interrupt context, it also may be causing crashes in the irq tracing
> > code to re-enter as an nmi. Replace it with a warning to ensure that
> > soft-interrupts are still disabled.
> >
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/kernel/entry_64.S | 10 +++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)  
> 
> So this patch is the core of the bug fix I gather.
> 
> Git blames says:
> 
>   Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state 
> getting out of sync")
>   Cc: sta...@vger.kernel.org # v3.4+
> 
> But I'm wondering how this has been broken that long without us
> noticing? You hit it doing some sort of perf stress test I think - so is
> it just that we've never pushed hard enough? Or did something change to
> expose this? Or we're just not sure?

I'm not really sure. A customer hit it, during either a stress test or
long running workload with lockdep irq tracing and perf running at the
same time. I don't have a lot more details but we might be able to get
some offline if necessary.

Thanks,
Nick


[PATCH V2] powerpc/mm: Invalidate subpage_prot() system call on radix platforms

2017-12-03 Thread Anshuman Khandual
Radix enabled platforms don't support subpage_prot() system calls. But
at present the system call goes through without an error and fails
later on while validating expected subpage accesses. Lets not allow
the system call on powerpc radix platforms to begin with to prevent
this confusion in user space.

Signed-off-by: Anshuman Khandual 
---
Changes in V2:

Now it returns -ENOENT error instead as per Michael Ellerman.

 arch/powerpc/mm/subpage-prot.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index 781532d..f14a07c 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -195,6 +195,9 @@ long sys_subpage_prot(unsigned long addr, unsigned long 
len, u32 __user *map)
unsigned long next, limit;
int err;
 
+   if (radix_enabled())
+   return -ENOENT;
+
/* Check parameters */
if ((addr & ~PAGE_MASK) || (len & ~PAGE_MASK) ||
addr >= mm->task_size || len >= mm->task_size ||
-- 
1.8.3.1



Re: a3b2cb30 broken panic reporting for qemu guests

2017-12-03 Thread David Gibson
On Fri, Dec 01, 2017 at 10:11:50PM +1100, Michael Ellerman wrote:
> David Gibson  writes:
> 
> > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote:
> >> On Wed, 29 Nov 2017 15:06:52 +1100
> >> David Gibson  wrote:
> >> 
> >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier"
> >> > purports to fix a problem when the kernel panics with fadump not
> >> > registered, but it breaks something else instead.  I _think_ it was
> >> > working on the incorrect assumption that ppc_md.panic was (or should
> >> > be) only used with fadump, but I'm not really sure.
> >> > 
> >> > Panic works with kdump enabled, and (I think) with fadump enabled).
> >> > However, with neither of these enabled, we always go to the generic
> >> > panic logic.
> >> 
> >> Yeah thanks, I can't remember what assumption I was working on tbh.
> >>  
> >> > That's incorrect for PAPR guests - they should call ibm,os-term via
> >> > RTAS.  Under qemu this leads to a "GUEST_PANICKED" event notification
> >> > which higher-level management pays attention to.  Since a3b2cb30 we
> >> > now reboot instead of reporting that.
> >> > 
> >> > I believe it will also break panic for PS3 machines, but since that
> >> > platform basically no longer exists, we probably don't care.
> >> 
> >> I (hope) it should just go down to the normal panic path and not do
> >> much worse than it already does -- although it won't print out that
> >> message.
> >> 
> >> > I'm not entirely sure how to fix this.  I _think_ what we want is to
> >> > call ppc_md.panic from a late panic notifier, the way this patch does
> >> > for fadump_panic_event() if fadump is registered.
> >> 
> >> The problem I had there is that some of the printk and console stuff
> >> wasn't getting flushed out, so I was getting a blank screen. This was
> >> probably in conjunction with panicing from NMI context that we're now
> >> starting to introduce.
> >> 
> >> So it's a bit annoying. There's other ugliness we have for being unable
> >> to control panic code well enough from arch code
> >> (arch/powerpc/platforms/powernv/opal.c)
> >> 
> >> I guess a really minimal fix is to put an #ifdef powerpc down the bottom
> >> there (/me *cries*).
> >
> > Um.. right.  I'm not really sure from that how to go forward from
> > here.  We want to fix this for RHEL7.5, which doesn't give us a lot of
> > time.
> >
> > Adding the #ifdef at the bottom of the generic panic code is gross,
> > but there's already a bunch of that, so maybe adequate until a better
> > solution can be found?
> 
> I think you mean put an #ifdef at the bottom of panic(). If so that
> won't work. Our default panic_timeout is 180 so we never get to the
> bottom of panic(), we call emergency_restart().
> 
> You *could* put an #ifdef powerpc before that, but that's even more
> gross because it's in a different place to the sparc/s390 #ifdefs.
> 
> I notice we don't implement machine_emergency_restart(), it just
> becomes machine_restart(NULL).
> 
> So it seems like that's the place we should be hooking,
> machine_emergency_restart(). That's what x86 does.

Only sort of.  machine_emergency_restart() is in basically the right
place, but AFAICT it is supposed to restart the machine, which os-term
(by design) does not.  x86 uses this for a restart, when using pvpanic
which provides similar functionality to os-term, that gets called
before reaching emergency_reset().

> But panic() is not the only caller of emergency_restart(), so it's not
> an entirely straight forward conversion.
> 
> So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a
> bigger rework for 4.16.

I've sent a revert for now.

> Nick that shouldn't break your original aim too badly I think? ie. worst
> case is we panic() but don't see the output if we came from NMI?
> 
> cheers
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: a3b2cb30 broken panic reporting for qemu guests

2017-12-03 Thread David Gibson
On Fri, Dec 01, 2017 at 09:40:38PM +1000, Nicholas Piggin wrote:
> On Fri, 01 Dec 2017 22:11:50 +1100
> Michael Ellerman  wrote:
> 
> > David Gibson  writes:
> > 
> > > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote:  
> > >> On Wed, 29 Nov 2017 15:06:52 +1100
> > >> David Gibson  wrote:
> > >>   
> > >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier"
> > >> > purports to fix a problem when the kernel panics with fadump not
> > >> > registered, but it breaks something else instead.  I _think_ it was
> > >> > working on the incorrect assumption that ppc_md.panic was (or should
> > >> > be) only used with fadump, but I'm not really sure.
> > >> > 
> > >> > Panic works with kdump enabled, and (I think) with fadump enabled).
> > >> > However, with neither of these enabled, we always go to the generic
> > >> > panic logic.  
> > >> 
> > >> Yeah thanks, I can't remember what assumption I was working on tbh.
> > >>
> > >> > That's incorrect for PAPR guests - they should call ibm,os-term via
> > >> > RTAS.  Under qemu this leads to a "GUEST_PANICKED" event notification
> > >> > which higher-level management pays attention to.  Since a3b2cb30 we
> > >> > now reboot instead of reporting that.
> > >> > 
> > >> > I believe it will also break panic for PS3 machines, but since that
> > >> > platform basically no longer exists, we probably don't care.  
> > >> 
> > >> I (hope) it should just go down to the normal panic path and not do
> > >> much worse than it already does -- although it won't print out that
> > >> message.
> > >>   
> > >> > I'm not entirely sure how to fix this.  I _think_ what we want is to
> > >> > call ppc_md.panic from a late panic notifier, the way this patch does
> > >> > for fadump_panic_event() if fadump is registered.  
> > >> 
> > >> The problem I had there is that some of the printk and console stuff
> > >> wasn't getting flushed out, so I was getting a blank screen. This was
> > >> probably in conjunction with panicing from NMI context that we're now
> > >> starting to introduce.
> > >> 
> > >> So it's a bit annoying. There's other ugliness we have for being unable
> > >> to control panic code well enough from arch code
> > >> (arch/powerpc/platforms/powernv/opal.c)
> > >> 
> > >> I guess a really minimal fix is to put an #ifdef powerpc down the bottom
> > >> there (/me *cries*).  
> > >
> > > Um.. right.  I'm not really sure from that how to go forward from
> > > here.  We want to fix this for RHEL7.5, which doesn't give us a lot of
> > > time.
> > >
> > > Adding the #ifdef at the bottom of the generic panic code is gross,
> > > but there's already a bunch of that, so maybe adequate until a better
> > > solution can be found?  
> > 
> > I think you mean put an #ifdef at the bottom of panic(). If so that
> > won't work. Our default panic_timeout is 180 so we never get to the
> > bottom of panic(), we call emergency_restart().
> > 
> > You *could* put an #ifdef powerpc before that, but that's even more
> > gross because it's in a different place to the sparc/s390 #ifdefs.
> > 
> > I notice we don't implement machine_emergency_restart(), it just
> > becomes machine_restart(NULL).
> > 
> > So it seems like that's the place we should be hooking,
> > machine_emergency_restart(). That's what x86 does.
> > 
> > But panic() is not the only caller of emergency_restart(), so it's not
> > an entirely straight forward conversion.
> > 
> > So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a
> > bigger rework for 4.16.
> > 
> > Nick that shouldn't break your original aim too badly I think? ie. worst
> > case is we panic() but don't see the output if we came from NMI?
> 
> If the fix is not pretty obvious, then I guess revert. We actually
> do have a bit of a regression though, since we've started marking
> system reset interrupts as NMI. Previously a system reset would have
> a better chance of printing something there.
> 
> So I wonder is an ifdef really all that much worse just because it's not
> in the same exact place as the others? We do get bug reports that were
> triggered by a system reset from hypervisor console. Hopefully they would
> be running with crash dumps usually.

I don't think an ifdef there is really correct though.  Sounds like we
might instead need to move some console flushes before calling of all
the notifiers, so that things like platforms/pseries or pvpanic can
insert non-returning notifiers.

Or else we need a different mechanism from the existing panic
notifiers for hooking in a "how to die" function from platform or
device.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: a3b2cb30 broken panic reporting for qemu guests

2017-12-03 Thread David Gibson
On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote:
> On Wed, 29 Nov 2017 15:06:52 +1100
> David Gibson  wrote:
> 
> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier"
> > purports to fix a problem when the kernel panics with fadump not
> > registered, but it breaks something else instead.  I _think_ it was
> > working on the incorrect assumption that ppc_md.panic was (or should
> > be) only used with fadump, but I'm not really sure.
> > 
> > Panic works with kdump enabled, and (I think) with fadump enabled).
> > However, with neither of these enabled, we always go to the generic
> > panic logic.
> 
> Yeah thanks, I can't remember what assumption I was working on tbh.
>  
> > That's incorrect for PAPR guests - they should call ibm,os-term via
> > RTAS.  Under qemu this leads to a "GUEST_PANICKED" event notification
> > which higher-level management pays attention to.  Since a3b2cb30 we
> > now reboot instead of reporting that.
> > 
> > I believe it will also break panic for PS3 machines, but since that
> > platform basically no longer exists, we probably don't care.
> 
> I (hope) it should just go down to the normal panic path and not do
> much worse than it already does -- although it won't print out that
> message.

Sounds plausible.

> > I'm not entirely sure how to fix this.  I _think_ what we want is to
> > call ppc_md.panic from a late panic notifier, the way this patch does
> > for fadump_panic_event() if fadump is registered.
> 
> The problem I had there is that some of the printk and console stuff
> wasn't getting flushed out, so I was getting a blank screen. This was
> probably in conjunction with panicing from NMI context that we're now
> starting to introduce.

Ok.  What was the exact bit of panic() that wasn't getting invoked
that needed to be?

AFAICT ppc_md.panic was already being called at the end of the panic
notifiers, by using INT_MIN priority.  Note that this is the same way
that the pvpanic device (used on x86 for similar panic notification
functionality) does it.  Well, pvpanic uses priority 1, which seems
less thorough than INT_MIN.

> So it's a bit annoying. There's other ugliness we have for being unable
> to control panic code well enough from arch code
> (arch/powerpc/platforms/powernv/opal.c)
> 
> I guess a really minimal fix is to put an #ifdef powerpc down the bottom
> there (/me *cries*).

That would work for the PAPR os-term thing, but wouldn't if we ever
had a specific device that worked like pvpanic.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PATCH] Revert "powerpc: Do not call ppc_md.panic in fadump panic notifier"

2017-12-03 Thread David Gibson
This reverts commit a3b2cb30f252b21a6f962e0dd107c8b897ca65e4.

The earlier patch tried to fix problems with panic on powerpc in
certain circumstances, where some output from the generic panic code
was being dropped.

Unfortunately, it breaks things worse in other circumstances.  In
particular when running a PAPR guest, it will now attempt to reboot
instead of informing the hypervisor (KVM or PowerVM) that the guest
has crashed.  The crash notification is important to some
virtualization management layers.

Since the circumstances in which the original patch helped are
somewhat obscure, revert it for now until we figure out how to do it
properly.

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/machdep.h |  1 +
 arch/powerpc/include/asm/setup.h   |  1 +
 arch/powerpc/kernel/fadump.c   | 22 --
 arch/powerpc/kernel/setup-common.c | 27 +++
 arch/powerpc/platforms/ps3/setup.c | 15 +++
 arch/powerpc/platforms/pseries/setup.c |  1 +
 6 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 73b92017b6d7..cd2fc1cc1cc7 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -76,6 +76,7 @@ struct machdep_calls {
 
void __noreturn (*restart)(char *cmd);
void __noreturn (*halt)(void);
+   void(*panic)(char *str);
void(*cpu_die)(void);
 
long(*time_init)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 257d23dbf55d..cf00ec26303a 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -24,6 +24,7 @@ extern void reloc_got2(unsigned long);
 
 void check_for_initrd(void);
 void initmem_init(void);
+void setup_panic(void);
 #define ARCH_PANIC_TIMEOUT 180
 
 #ifdef CONFIG_PPC_PSERIES
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 04ea5c04fd24..3c2c2688918f 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1462,25 +1462,6 @@ static void fadump_init_files(void)
return;
 }
 
-static int fadump_panic_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
-   /*
-* If firmware-assisted dump has been registered then trigger
-* firmware-assisted dump and let firmware handle everything
-* else. If this returns, then fadump was not registered, so
-* go through the rest of the panic path.
-*/
-   crash_fadump(NULL, ptr);
-
-   return NOTIFY_DONE;
-}
-
-static struct notifier_block fadump_panic_block = {
-   .notifier_call = fadump_panic_event,
-   .priority = INT_MIN /* may not return; must be done last */
-};
-
 /*
  * Prepare for firmware-assisted dump.
  */
@@ -1513,9 +1494,6 @@ int __init setup_fadump(void)
init_fadump_mem_struct(, fw_dump.reserve_dump_area_start);
fadump_init_files();
 
-   atomic_notifier_chain_register(_notifier_list,
-   _panic_block);
-
return 1;
 }
 subsys_initcall(setup_fadump);
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 2075322cd225..9d213542a48b 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -704,6 +704,30 @@ int check_legacy_ioport(unsigned long base_port)
 }
 EXPORT_SYMBOL(check_legacy_ioport);
 
+static int ppc_panic_event(struct notifier_block *this,
+ unsigned long event, void *ptr)
+{
+   /*
+* If firmware-assisted dump has been registered then trigger
+* firmware-assisted dump and let firmware handle everything else.
+*/
+   crash_fadump(NULL, ptr);
+   ppc_md.panic(ptr);  /* May not return */
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block ppc_panic_block = {
+   .notifier_call = ppc_panic_event,
+   .priority = INT_MIN /* may not return; must be done last */
+};
+
+void __init setup_panic(void)
+{
+   if (!ppc_md.panic)
+   return;
+   atomic_notifier_chain_register(_notifier_list, _panic_block);
+}
+
 #ifdef CONFIG_CHECK_CACHE_COHERENCY
 /*
  * For platforms that have configurable cache-coherency.  This function
@@ -848,6 +872,9 @@ void __init setup_arch(char **cmdline_p)
/* Probe the machine type, establish ppc_md. */
probe_machine();
 
+   /* Setup panic notifier if requested by the platform. */
+   setup_panic();
+
/*
 * Configure ppc_md.power_save (ppc32 only, 64-bit machines do
 * it from their respective probe() function.
diff --git a/arch/powerpc/platforms/ps3/setup.c 
b/arch/powerpc/platforms/ps3/setup.c
index 9dabea6e1443..6244bc849469 100644
--- a/arch/powerpc/platforms/ps3/setup.c
+++ 

Re: [PATCH V3 0/9] powerpc: Support for ibm,dynamic-memory-v2

2017-12-03 Thread Michael Ellerman
Nathan Fontenot  writes:

> This patch set provides a series of updates to de-couple the LMB
> information provided in the device tree property from the device
> tree property format. This eases the ability to support a new
> format for the dynamic memory property, ibm,dynamic-memory-v2.

Something in here is still blowing up for me in a KVM guest:

OF stdout device is: /vdevice/vty@7100
Preparing to boot Linux version 4.14.0-rc2-gcc6x-g9e1fc7e 
(kerkins@alpine1-p1) (gcc version 6.4.1 20171202 (Custom 6328ca9eaa476138)) #1 
SMP Sun Dec 3 21:45:32 AEDT 2017
Detected machine type: 0101
command line: 
Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
memory layout at init:
  memory_limit :  (16 MB aligned)
  alloc_bottom : 015c
  alloc_top: 3000
  alloc_top_hi : 0001
  rmo_top  : 3000
  ram_top  : 0001
instantiating rtas at 0x2fff... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x017d -> 0x017d09d8
Device tree struct  0x017e -> 0x017f
Quiescing Open Firmware ...
Booting Linux via __start() @ 0x0040 ...
[0.00] bootconsole [udbg0] enabled
[0.00] Allocated 2883584 bytes for 2048 pacas at cfd4
[0.00] hash-mmu: Page sizes from device-tree:
[0.00] hash-mmu: base_shift=12: shift=12, sllp=0x, 
avpnm=0x, tlbiel=1, penc=0
[0.00] hash-mmu: base_shift=16: shift=16, sllp=0x0110, 
avpnm=0x, tlbiel=1, penc=1
[0.00]  -> fw_vec5_feature_init()
[0.00]  <- fw_vec5_feature_init()
[0.00]  -> fw_hypertas_feature_init()
[0.00]  <- fw_hypertas_feature_init()
[0.00] Page orders: linear mapping = 16, virtual = 16, io = 16, 
vmemmap = 16
[0.00] Using 1TB segments
[0.00] hash-mmu: Initializing hash mmu with SLB
[0.00] Linux version 4.14.0-rc2-gcc6x-g9e1fc7e (kerkins@alpine1-p1) 
(gcc version 6.4.1 20171202 (Custom 6328ca9eaa476138)) #1 SMP Sun Dec 3 
21:45:32 AEDT 2017
[0.00] Found initrd at 0xc15c:0xc178d70b
[0.00] Machine is LPAR !
[0.00]  -> pseries_init()
[0.00]  -> fw_cmo_feature_init()
[0.00] CMO not available
[0.00]  <- fw_cmo_feature_init()
[0.00]  <- pseries_init()
[0.00] Using pSeries machine description
[0.00] Partition configured for 16 cpus.
[0.00] CPU maps initialized for 8 threads per core
[0.00]  (thread shift is 3)
[0.00] Freed 2818048 bytes for unused pacas
[0.00] -
[0.00] ppc64_pft_size= 0x19
[0.00] phys_mem_size = 0x1
[0.00] dcache_bsize  = 0x80
[0.00] icache_bsize  = 0x80
[0.00] cpu_features  = 0x17dc7aec18500249
[0.00]   possible= 0xdfdf18500649
[0.00]   always  = 0x18100040
[0.00] cpu_user_features = 0xdc0065c2 0xef00
[0.00] mmu_features  = 0x78006001
[0.00] firmware_features = 0x0001405a440b
[0.00] htab_hash_mask= 0x3
[0.00] -
[0.00] numa:   NODE_DATA [mem 0xfff6a300-0xfff73fff]
[0.00]  -> smp_init_pSeries()
[0.00]  <- smp_init_pSeries()
[0.00] PCI host bridge /pci@8002000  ranges:
[0.00]   IO 0x01008000..0x01008000 -> 
0x
[0.00]  MEM 0x0100a000..0x01101fff -> 
0x8000 
[0.00] PPC64 nvram contains 65536 bytes
[0.00] Top of RAM: 0x1, Total RAM: 0x1
[0.00] Memory hole size: 0MB
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x-0x]
[0.00]   DMA32empty
[0.00]   Normal   empty
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x-0x]
[0.00] Initmem setup node 0 [mem 
0x-0x]
[0.00] On node 0 totalpages: 65536
[0.00]   DMA zone: 64 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 65536 pages, LIFO batch:1
[0.00] percpu: Embedded 4 pages/cpu @c000ffb0 s167064 r0 
d95080 u262144
[0.00] pcpu-alloc: s167064 

Re: [PATCH 2/4] powerpc/64: do not trace irqs-off at interrupt return to soft-disabled context

2017-12-03 Thread Michael Ellerman
Nicholas Piggin  writes:

> When an interrupt is returning to a soft-disabled context (which can
> happen for non-maskable interrupts or synchronous interrupts), it goes
> through the motions of soft-disabling again, including calling
> TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
>
> This is not necessary, because we must already be soft-disabled in the
> interrupt context, it also may be causing crashes in the irq tracing
> code to re-enter as an nmi. Replace it with a warning to ensure that
> soft-interrupts are still disabled.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/entry_64.S | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)

So this patch is the core of the bug fix I gather.

Git blames says:

  Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting 
out of sync")
  Cc: sta...@vger.kernel.org # v3.4+

But I'm wondering how this has been broken that long without us
noticing? You hit it doing some sort of perf stress test I think - so is
it just that we've never pushed hard enough? Or did something change to
expose this? Or we're just not sure?

cheers

> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 3320bcac7192..36878b6ee8b8 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -911,9 +911,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>   beq 1f
>   rlwinm  r7,r7,0,~PACA_IRQ_HARD_DIS
>   stb r7,PACAIRQHAPPENED(r13)
> -1:   li  r0,0
> - stb r0,PACASOFTIRQEN(r13);
> - TRACE_DISABLE_INTS
> +1:
> +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_BUG)
> + /* The interrupt should not have soft enabled. */
> + lbz r7,PACASOFTIRQEN(r13)
> +1:   tdnei   r7,0
> + EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING
> +#endif
>   b   .Ldo_restore
>  
>   /*
> -- 
> 2.15.0


Re: [PATCH v5 0/3] Prepartion for SR-IOV PowerVM Enablement

2017-12-03 Thread Russell Currey
On Thu, 2017-11-09 at 08:00 -0600, Bryant G. Ly wrote:
> v1 - Initial patch
> v2 - Addressed Bjorn's comment on creating a highly platform
>  dependent global exported symbol.
> v3 - Based patch off linux-ppc/master
> v4 - Using the sriov-drivers_autoprobe mechanism per Bjorn's request
> v5 - Fixed comments and commit message

Looks good to me, for the whole series:

Acked-by: Russell Currey 

> 
> Bryant G. Ly (3):
>   powerpc/kernel: Separate SR-IOV Calls
>   pseries: Add PSeries SR-IOV Machine dependent calls
>   PCI/IOV: Add pci_vf_drivers_autoprobe() interface
> 
>  arch/powerpc/include/asm/machdep.h   |  7 ++
>  arch/powerpc/include/asm/pci-bridge.h|  4 +---
>  arch/powerpc/kernel/eeh_driver.c |  4 ++--
>  arch/powerpc/kernel/pci-common.c | 23
> +++
>  arch/powerpc/kernel/pci_dn.c |  6 -
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 33 ++--
> 
>  arch/powerpc/platforms/powernv/pci-ioda.c|  6 +++--
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 24
> 
>  arch/powerpc/platforms/pseries/pci.c | 31
> ++
>  drivers/pci/iov.c| 11 ++
>  include/linux/pci.h  |  2 ++
>  11 files changed, 118 insertions(+), 33 deletions(-)
> 


Re: [PATCH] powerpc/40x: acadia: Fix the 'interrupt-parent' property

2017-12-03 Thread Michael Ellerman
Fabio Estevam  writes:

> Hi Michael,
>
> On Tue, Oct 17, 2017 at 11:01 AM, Fabio Estevam  wrote:
>> 'interrupts-parent' property does not exist. Fix the typo.
>>
>> Fixes: 00f3ca740a9c26 ("powerpc/40x: AMCC PowerPC 405EZ Acadia DTS")
>> Signed-off-by: Fabio Estevam 
>> ---
>>  arch/powerpc/boot/dts/acadia.dts | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/boot/dts/acadia.dts 
>> b/arch/powerpc/boot/dts/acadia.dts
>> index 57291f6..8626615 100644
>> --- a/arch/powerpc/boot/dts/acadia.dts
>> +++ b/arch/powerpc/boot/dts/acadia.dts
>> @@ -183,7 +183,7 @@
>> usb@ef603000 {
>> compatible = "ohci-be";
>> reg = <0xef603000 0x80>;
>> -   interrupts-parent = <>;
>> +   interrupt-parent = <>;
>
> Any comments?

Nope :)

Will pick it up for 4.16.

cheers


Re: [PATCH v1] PCI: Make PCI_SCAN_ALL_PCIE_DEVS work for Root as well as Downstream Ports

2017-12-03 Thread Michael Ellerman
Bjorn Helgaas  writes:

> On Fri, Dec 01, 2017 at 06:27:10PM -0600, Bjorn Helgaas wrote:
>> From: Bjorn Helgaas 
>> 
>> PCIe Downstream Ports normally have only a Device 0 below them.  To
>> optimize enumeration, we don't scan for other devices *unless* the
>> PCI_SCAN_ALL_PCIE_DEVS flag is set by set by quirks or the
>> "pci=pcie_scan_all" kernel parameter.
>> 
>> Previously PCI_SCAN_ALL_PCIE_DEVS only affected scanning below Switch
>> Downstream Ports, not Root Ports.
>> 
>> But the "Nemo" system, also known as the AmigaOne X1000, has a PA Semi Root
>> Port whose link leads to an AMD/ATI SB600 South Bridge.  The Root Port is a
>> PCIe device, of course, but the SB600 contains only conventional PCI
>> devices with no visible PCIe port.
>> 
>> Simplify and restructure only_one_child() so that we scan for all possible
>> devices below Root Ports as well as Switch Downstream Ports when
>> PCI_SCAN_ALL_PCIE_DEVS is set.
>> 
>> This is enough to make Nemo work with "pci=pcie_scan_all".  We would also
>> like to add a quirk to set PCI_SCAN_ALL_PCIE_DEVS automatically on Nemo so
>> users wouldn't have to use the "pci=pcie_scan_all" parameter, but we don't
>> have that yet.
>> 
>> Link: 
>> https://lkml.kernel.org/r/CAErSpo55Q8Q=5p6_+uu7ahnw+53ibvdnrxxrzrv9qnur_9e...@mail.gmail.com
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=198057
>> Reported-and-Tested-by: Christian Zigotzky 
>> Signed-off-by: Bjorn Helgaas 
>
> Applied to pci/enumeration for v4.16.

Thanks Bjorn.

cheers


Re: [PATCH] powerpc/powernv : Add support to enable sensor groups

2017-12-03 Thread Stewart Smith
Shilpasri G Bhat  writes:
> On 11/28/2017 05:07 PM, Michael Ellerman wrote:
>> Shilpasri G Bhat  writes:
>> 
>>> Adds support to enable/disable a sensor group. This can be used to
>>> select the sensor groups that needs to be copied to main memory by
>>> OCC. Sensor groups like power, temperature, current, voltage,
>>> frequency, utilization can be enabled/disabled at runtime.
>>>
>>> Signed-off-by: Shilpasri G Bhat 
>>> ---
>>> The skiboot patch for the opal call is posted below:
>>> https://lists.ozlabs.org/pipermail/skiboot/2017-November/009713.html
>> 
>> Can you remind me why we're doing this with a completely bespoke sysfs
>> API, rather than using some generic sensors API?
>> 
>
> Disabling/Enabling sensor groups is not supported in the current generic 
> sensors
> API. And also we dont export all type of sensors in HWMON as not all of them 
> are
> environment sensors (like performance).

Are there barriers to adding such concepts to the generic sensors API?

-- 
Stewart Smith
OPAL Architect, IBM.



Re: [rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's

2017-12-03 Thread Nicholas Piggin
On Mon, 4 Dec 2017 11:37:01 +1100
Balbir Singh  wrote:

> On Sun, Dec 3, 2017 at 1:36 PM, Nicholas Piggin  wrote:
> > Seems like a reasonable approach. Why do we only do this for
> > powernv? It seems like a good idea in general to pull all
> > offlined CPUs out and into the same state for all platforms
> > and for all shutdown/restart/crash paths.
> >  
> 
> The reason is largely wake-up related, do we expect offline CPUs to wake
> up in the kdump kernel. Largely the infrastructure allows us to selectively
> decide what platforms need this support. I did not want to break the world
> by enabling it across platforms (pseries for example) without good reason.

What happens if a pseries offlined CPU gets an exception for some reason
though? It seems like it would return into pseries_mach_cpu_die of the
old kernel which will go wrong.

Maybe the platform has stronger guarantees that it won't wake up there,
like requiring a specific hcall or something?

I was just thinking trying to move all platforms in general to the same
scheme would be preferable, unless there is a good reason not to. Just
for sharing code and behaviour.

> 
> > Also I wonder if there is anything we should do on the other
> > side of the equation for the kdump kernel to pull CPUs into a
> > known state rather than rely on the crash kernel to do it for
> > us. We might have a better ability to do that with system
> > reset IPIs now.
> >  
> 
> Yes, but do we need to do that or quickly dump the vmcore to a file
> and exit?

Well if the previous kernel did not shut them down properly, we need
to do that. Don't we? My point is the previous kernel crashed somehow,
we should be trying to fix everything up rather than hoping it crashed
"nicely" for us.

Yes we shouldn't disturb things as much as possible, but we've booted
an entire new kernel in its own reserved memory, so I'm not sure if
it's such a concern to try fixing up wayward CPUs.

Thanks,
Nick


[PATCH v2] powerpc/64s: ISAv3 initialize MMU registers before setting partition table

2017-12-03 Thread Nicholas Piggin
kexec can leave MMU registers set when booting into a new kernel, PIDR
in particular. The boot sequence does not zero PIDR, so it only gets
set when CPUs first switch to a userspace processes (until then it's
running a kernel thread with effective PID = 0).

This leaves a window where a process table entry and page tables are
set up due to user processes running on other CPUs, that happen to
match with a stale PID. The CPU with that PID may cause speculative
accesses that address quadrant 0, which will result in cached
translations and PWC for that process, on a CPU which is not in the
mm_cpumask and so they will not get invalidated properly.

The most common result is the kernel hanging in infinite page fault
loops soon after kexec (usually in schedule_tail, which is usually the
first non-speculative quardant 0 access to a new PID) due to a stale
PWC. However being a stale translation erorr, it could result in
anything up to security and data corruption errors.

Fix this by zeroing out PIDR before setting PTCR.

LPIDR is also not initialized, and may cause a similar issue with
speculative access to quadrant 1/2. This has not been observed, but
LPIDR is cleared to prevent that possibility.

Signed-off-by: Nicholas Piggin 
---
Since v1:
- Aneesh noticed hash was not clearing LPID
- Improved changelog
- Improved comments

Please review.

Thanks,
Nick

 arch/powerpc/mm/hash_utils_64.c | 19 +--
 arch/powerpc/mm/pgtable-radix.c | 10 ++
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 655a5a9a183d..d9f3a72f3981 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1029,13 +1029,24 @@ void __init hash__early_init_mmu(void)
pci_io_base = ISA_IO_BASE;
 #endif
 
+   /*
+* kexec kernels can boot into the new kernel with PID and LPID
+* registers non-zero. Zero them to prevent speculative accesses
+* setting up cached translations when we turn the MMU on and
+* process/partition table entries start being added.
+*/
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   mtspr(SPRN_PID, 0);
+
/* Select appropriate backend */
if (firmware_has_feature(FW_FEATURE_PS3_LV1))
ps3_early_mm_init();
else if (firmware_has_feature(FW_FEATURE_LPAR))
hpte_init_pseries();
-   else if (IS_ENABLED(CONFIG_PPC_NATIVE))
+   else if (IS_ENABLED(CONFIG_PPC_NATIVE)) {
+   mtspr(SPRN_LPID, 0);
hpte_init_native();
+   }
 
if (!mmu_hash_ops.hpte_insert)
panic("hash__early_init_mmu: No MMU hash ops defined!\n");
@@ -1055,11 +1066,15 @@ void __init hash__early_init_mmu(void)
 void hash__early_init_mmu_secondary(void)
 {
/* Initialize hash table for that CPU */
-   if (!firmware_has_feature(FW_FEATURE_LPAR)) {
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   mtspr(SPRN_PID, 0);
 
+   if (!firmware_has_feature(FW_FEATURE_LPAR)) {
if (cpu_has_feature(CPU_FTR_POWER9_DD1))
update_hid_for_hash();
 
+   mtspr(SPRN_LPID, 0);
+
if (!cpu_has_feature(CPU_FTR_ARCH_300))
mtspr(SPRN_SDR1, _SDR1);
else
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index cfbbee941a76..0c13355ddc2a 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -563,10 +563,18 @@ void __init radix__early_init_mmu(void)
__pte_frag_nr = H_PTE_FRAG_NR;
__pte_frag_size_shift = H_PTE_FRAG_SIZE_SHIFT;
 
+   /*
+* kexec kernels can boot into the new kernel with PID and LPID
+* registers non-zero. Zero them to prevent speculative accesses
+* setting up cached translations when we turn the MMU on and
+* process/partition table entries start being added.
+*/
+   mtspr(SPRN_PID, 0);
if (!firmware_has_feature(FW_FEATURE_LPAR)) {
radix_init_native();
if (cpu_has_feature(CPU_FTR_POWER9_DD1))
update_hid_for_radix();
+   mtspr(SPRN_LPID, 0);
lpcr = mfspr(SPRN_LPCR);
mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR);
radix_init_partition_table();
@@ -587,10 +595,12 @@ void radix__early_init_mmu_secondary(void)
/*
 * update partition table control register and UPRT
 */
+   mtspr(SPRN_PID, 0);
if (!firmware_has_feature(FW_FEATURE_LPAR)) {
 
if (cpu_has_feature(CPU_FTR_POWER9_DD1))
update_hid_for_radix();
+   mtspr(SPRN_LPID, 0);
 
lpcr = mfspr(SPRN_LPCR);
mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR);
-- 
2.15.0



[PATCH 3/3] powerpc/mm/ Add proper pte access check helper

2017-12-03 Thread Aneesh Kumar K.V
pte_access_premitted get called in get_user_pages_fast path. If we have marked
the pte PROT_NONE, we should not allow a read access on the address. With
the current implementation we are not checking the READ and only check for
WRITE. This is needed on archs like ppc64 that implement PROT_NONE using
_PAGE_USER access instead of _PAGE_PRESENT. Also add pte_user check just to 
make sure
we are not accessing kernel mapping.

Even though there is code duplication, keeping the low level pte accessors
different for different platforms helps in code readability.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/32/pgtable.h | 23 +++
 arch/powerpc/include/asm/nohash/pgtable.h| 23 +++
 2 files changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 016579ef16d3..30a155c0a6b0 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -311,6 +311,29 @@ static inline int pte_present(pte_t pte)
return pte_val(pte) & _PAGE_PRESENT;
 }
 
+/*
+ * We only find page table entry in the last level
+ * Hence no need for other accessors
+ */
+#define pte_access_permitted pte_access_permitted
+static inline bool pte_access_permitted(pte_t pte, bool write)
+{
+   unsigned long pteval = pte_val(pte);
+   /*
+* A read-only access is controlled by _PAGE_USER bit.
+* We have _PAGE_READ set for WRITE and EXECUTE
+*/
+   unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_USER;
+
+   if (write)
+   need_pte_bits |= _PAGE_WRITE;
+
+   if ((pteval & need_pte_bits) != need_pte_bits)
+   return false;
+
+   return true;
+}
+
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
  *
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 5c68f4a59f75..fc4376c8d444 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -45,6 +45,29 @@ static inline int pte_present(pte_t pte)
return pte_val(pte) & _PAGE_PRESENT;
 }
 
+/*
+ * We only find page table entry in the last level
+ * Hence no need for other accessors
+ */
+#define pte_access_permitted pte_access_permitted
+static inline bool pte_access_permitted(pte_t pte, bool write)
+{
+   unsigned long pteval = pte_val(pte);
+   /*
+* A read-only access is controlled by _PAGE_USER bit.
+* We have _PAGE_READ set for WRITE and EXECUTE
+*/
+   unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_USER;
+
+   if (write)
+   need_pte_bits |= _PAGE_WRITE;
+
+   if ((pteval & need_pte_bits) != need_pte_bits)
+   return false;
+
+   return true;
+}
+
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
  *
-- 
2.14.3



[PATCH 2/3] powerpc/mm/book3s/64: Add proper pte access check helper

2017-12-03 Thread Aneesh Kumar K.V
pte_access_premitted get called in get_user_pages_fast path. If we have marked
the pte PROT_NONE, we should not allow a read access on the address. With
the current implementation we are not checking the READ and only check for
WRITE. This is needed on archs like ppc64 that implement PROT_NONE using
RWX access instead of _PAGE_PRESENT. Also add pte_user check just to make sure
we are not accessing kernel mapping.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 41 
 1 file changed, 41 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 2006fec40a00..efa5bd6bc1d3 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -520,6 +520,30 @@ static inline int pte_present(pte_t pte)
 {
return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT));
 }
+
+#define pte_access_permitted pte_access_permitted
+static inline bool pte_access_permitted(pte_t pte, bool write)
+{
+   unsigned long pteval = pte_val(pte);
+   /* Also check for pte_user */
+   unsigned long clear_pte_bits = _PAGE_PRIVILEGED;
+   /*
+* _PAGE_READ is needed for any access and will be
+* cleared for PROT_NONE
+*/
+   unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_READ;
+
+   if (write)
+   need_pte_bits |= _PAGE_WRITE;
+
+   if ((pteval & need_pte_bits) != need_pte_bits)
+   return false;
+
+   if ((pteval & clear_pte_bits) == clear_pte_bits)
+   return false;
+   return true;
+}
+
 /*
  * Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
@@ -824,6 +848,11 @@ static inline int pud_bad(pud_t pud)
return hash__pud_bad(pud);
 }
 
+#define pud_access_permitted pud_access_permitted
+static inline bool pud_access_permitted(pud_t pud, bool write)
+{
+   return pte_access_permitted(pud_pte(pud), write);
+}
 
 #define pgd_write(pgd) pte_write(pgd_pte(pgd))
 static inline void pgd_set(pgd_t *pgdp, unsigned long val)
@@ -863,6 +892,12 @@ static inline int pgd_bad(pgd_t pgd)
return hash__pgd_bad(pgd);
 }
 
+#define pgd_access_permitted pgd_access_permitted
+static inline bool pgd_access_permitted(pgd_t pgd, bool write)
+{
+   return pte_access_permitted(pgd_pte(pgd), write);
+}
+
 extern struct page *pgd_page(pgd_t pgd);
 
 /* Pointers in the page table tree are physical addresses */
@@ -984,6 +1019,12 @@ static inline int pmd_protnone(pmd_t pmd)
 #define __pmd_write(pmd)   __pte_write(pmd_pte(pmd))
 #define pmd_savedwrite(pmd)pte_savedwrite(pmd_pte(pmd))
 
+#define pmd_access_permitted pmd_access_permitted
+static inline bool pmd_access_permitted(pmd_t pmd, bool write)
+{
+   return pte_access_permitted(pmd_pte(pmd), write);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
-- 
2.14.3



[PATCH 1/3] arch/powerpc/hugetlb: Use pte_access_permitted for hugetlb access check

2017-12-03 Thread Aneesh Kumar K.V
No functional change in this patch. This update gup_hugepte to use the
helper. This will help later when we add memory keys.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a9b9083c5e49..c7e5afe5e118 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -855,9 +855,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned 
long addr,
 
pte = READ_ONCE(*ptep);
 
-   if (!pte_present(pte) || !pte_read(pte))
-   return 0;
-   if (write && !pte_write(pte))
+   if (!pte_access_permitted(pte, write))
return 0;
 
/* hugepages are never "special" */
-- 
2.14.3



Re: [PATCH] KVM: PPC: Book3S HV: check for XIVE device before executing the XICS hcalls

2017-12-03 Thread Paul Mackerras
On Mon, Nov 27, 2017 at 08:30:17AM +0100, Cédric Le Goater wrote:
> When QEMU is started with the option kernel_irqchip=òff, the kvm XICS
> hcalls are being used even though a kvm XICS device has not been
> created on the host, resulting quickly in a failure and a broken
> guest.
> 
> The test checking if there is a XIVE device in the VM before executing
> the XICS hcalls is missing from the recent XICS-over-XIVE glue.
> 
> Signed-off-by: Cédric Le Goater 

I think this is fixing the same bug that commit 00bb6ae50062 ("KVM:
PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not
enabled", 2017-10-26) addresses.

Do you think this patch is needed in addition to 00bb6ae50062?

Paul.


Re: [rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's

2017-12-03 Thread Balbir Singh
On Sun, Dec 3, 2017 at 1:36 PM, Nicholas Piggin  wrote:
> Seems like a reasonable approach. Why do we only do this for
> powernv? It seems like a good idea in general to pull all
> offlined CPUs out and into the same state for all platforms
> and for all shutdown/restart/crash paths.
>

The reason is largely wake-up related, do we expect offline CPUs to wake
up in the kdump kernel. Largely the infrastructure allows us to selectively
decide what platforms need this support. I did not want to break the world
by enabling it across platforms (pseries for example) without good reason.

> Also I wonder if there is anything we should do on the other
> side of the equation for the kdump kernel to pull CPUs into a
> known state rather than rely on the crash kernel to do it for
> us. We might have a better ability to do that with system
> reset IPIs now.
>

Yes, but do we need to do that or quickly dump the vmcore to a file
and exit? Ideally we want the original kernel to wake offline cpus
as appropriate (as we do with kexec) and send them to opal_reinit_cpus
and make them spin on kexec_spin_wait(). The kdump kernel boots
with maxcpus=1 and leaves them spinning

> We still need to support platforms without NMI IPIs, so we
> still need this patch as well.
>

True

Thanks for the review!
Balbir Singh


Re: Memory corruption in powerpc guests with virtio_balloon (was Re: [PATCH v3] virtio_balloon: fix deadlock on OOM)

2017-12-03 Thread Michael Ellerman
"Michael S. Tsirkin"  writes:
> On Fri, Dec 01, 2017 at 11:31:08PM +1100, Michael Ellerman wrote:
>> "Michael S. Tsirkin"  writes:
>> 
>> > fill_balloon doing memory allocations under balloon_lock
>> > can cause a deadlock when leak_balloon is called from
>> > virtballoon_oom_notify and tries to take same lock.
...
>> 
>> 
>> Somehow this commit seems to be killing powerpc guests.
...
>
> Thanks for the report!
> A fix was just posted:
> virtio_balloon: fix increment of vb->num_pfns in fill_balloon()
>
> Would appreciate testing.

Yep that fixes it. Thanks.

Tested-by: Michael Ellerman 

cheers


[PATCH] SB600 for the Nemo board has non-zero devices on non-root bus

2017-12-03 Thread Christian Zigotzky

On 3. Dec 2017, at 00:02, Olof Johansson  wrote:
> 
>> On Sat, Dec 2, 2017 at 3:00 PM, Olof Johansson  wrote:
>> 
>> The below patch, together with Bjorn's, should do it. Christian, can you test
>> and report back?
>> 
>> I'm guessing it won't do any harm to set this on non-X1000 platforms. My
>> test system is currently powered down so I can't check.
>> 
>> 
>> From a3b390277627b0342c8ccfc16e58679e0d8abdde Mon Sep 17 00:00:00 2001
>> From: Olof Johansson 
>> Date: Sat, 2 Dec 2017 14:56:36 -0800
>> Subject: [PATCH] powerpc/pasemi: set PCI_SCAN_ALL_PCI_DEVS
>> 
>> Needed on Amiga X1000 with SB600.
>> 
>> Reported-by: Christian Zigotzky 
>> Cc: Bjorn Helgaas 
>> Signed-off-by: Olof Johansson 
>> ---
>> arch/powerpc/platforms/pasemi/pci.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/arch/powerpc/platforms/pasemi/pci.c 
>> b/arch/powerpc/platforms/pasemi/pci.c
>> index 5ff6108..ea54ed2 100644
>> --- a/arch/powerpc/platforms/pasemi/pci.c
>> +++ b/arch/powerpc/platforms/pasemi/pci.c
>> @@ -224,6 +224,8 @@ void __init pas_pci_init(void)
>>return;
>>}
>> 
>> +   pci_set_flag(PCI_SCAN_ALL_PCIE_DEVS):
> 
> Typo, should be ';', not ':'. I obviously didn't even try compiling this. :)
> 
> 
> -Olof

Hi Olof,

Thanks a lot for your patch! I will test it on Wednesday.

Cheers,
Christian