Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-07 Thread Michael Ellerman
Nicholas Piggin  writes:
> On Mon, 07 Aug 2017 19:56:28 +1000
> Michael Ellerman  wrote:
>> Nicholas Piggin  writes:
>> > On Fri, 04 Aug 2017 21:54:57 +0200
>> > Andreas Schwab  wrote:
>> >  
>> >> No, this is really a 4.13-rc1 regression.
>> >
>> > SLB miss with MSR[RI]=0 on
>> >
>> > lbz r0,THREAD+THREAD_LOAD_FP(r7)
>> >
>> > Caused by bc4f65e4cf9d6cc43e0e9ba0b8648cf9201cd55f  
>> 
>> > Hmm, I'll see if something can be done, but that MSR_RI stuff in syscall
>> > exit makes things fairly difficult (and will reduce performance improvement
>> > of this patch anyway).
>> >
>> > I'm trying to work to a point where we have a soft-RI bit for these kinds of
>> > uses that would avoid all this complexity. Until then it may be best to
>> > just revert this patch.  
>> 
>> OK. Let me know in the next day or two what you want to do.
>> 
>> One option would be to load THREAD_LOAD_FP/THREAD_LOAD_VEC before we
>> turn off RI.
>
> Yeah, although that's a couple of unnecessary loads when we haven't
> used the fp regs.
>
> This path hits often on return from context switch, but for general
> syscalls it's less clear. And considering it's fairly tricky code at
> this point I'm thinking maybe just revert it for now?

Yeah OK.

Related thought, why the hell do we use 0x4100 for unrecoverable SLB?
That is really confusing now that we have AIL.
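
Illustration (not part of the original mail): a minimal sketch of why the value is ambiguous. It assumes that with AIL enabled the relocated interrupt vectors are delivered at an effective address whose low part is 0x4000 plus the usual vector offset, so 0x4100 in regs->trap can be misread as vector 0x100 taken with relocation on. The function name and the returned strings are made up for this example.

```python
# Assumption: relocated (AIL) vectors sit at 0x4000 + the normal vector offset
# in the low bits of the address, so a trap-style value >= 0x4000 looks relocated.
AIL_LOW_OFFSET = 0x4000

# Software-chosen marker discussed in the thread; not a hardware vector.
UNRECOV_SLB_TRAP = 0x4100


def possible_readings(trap):
    """Return the ways a reader could interpret a regs->trap value."""
    readings = []
    if trap == UNRECOV_SLB_TRAP:
        readings.append("unrecoverable SLB marker")
    if trap >= AIL_LOW_OFFSET:
        readings.append("vector %#x relocated by AIL" % (trap - AIL_LOW_OFFSET))
    return readings


print(possible_readings(0x4100))
```

Two readings for the same number is the whole problem; a marker value outside the range any relocated vector could produce would leave only one.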

... lots of git blaming ...

Looks like it first appeared in the commit below, a classic :)

We should really change it to some other value.

cheers


  37b9416e7d6efb2168119ef12ce0b093da28ea19
  Author: Andrew Morton 
  AuthorDate: Thu Mar 18 14:58:53 2004 -0800
  Commit: Linus Torvalds 
  CommitDate: Thu Mar 18 14:58:53 2004 -0800

  [PATCH] ppc64: Fix SLB reload bug
  
  From: Paul Mackerras 
  
  Recently we found a particularly nasty bug in the segment handling in the
  ppc64 kernel.  It would only happen rarely under heavy load, but when it
  did the machine would lock up with the whole of memory filled with
  exception stack frames.
  
  The primary cause was that we were losing the translation for the kernel
  stack from the SLB, but we still had it in the ERAT for a while longer.
  Now, there is a critical region in various exception exit paths where we
  have loaded the SRR0 and SRR1 registers from GPRs and we are loading those
  GPRs and the stack pointer from the exception frame on the kernel stack.
  If we lose the ERAT entry for the kernel stack in that region, we take an
  SLB miss on the next access to the kernel stack.  Taking the exception
  overwrites the values we have put into SRR0 and SRR1, which means we lose
  state.  In fact we ended up repeating that last section of the exception
  exit path, but using the user stack pointer this time.  That caused another
  exception (or if it didn't, we loaded a new value from the user stack and
  then went around and tried to use that).  And it spiralled downwards from
  there.
  
  The patch below fixes the primary problem by making sure that we really
  never cast out the SLB entry for the kernel stack.  It also improves
  debuggability in case anything like this happens again by:
  
  - In our exception exit paths, we now check whether the RI bit in the
    SRR1 value is 0.  We already set the RI bit to 0 before starting the
    critical region, but we never checked it.  Now, if we do ever get an
    exception in one of the critical regions, we will detect it before
    returning to the critical region, and instead we will print a nasty
    message and oops.
  
  - In the exception entry code, we now check that the kernel stack pointer
    value we're about to use isn't a userspace address.  If it is, we print a
    nasty message and oops.
  
  This has been tested on G5 and pSeries (both with and without hypervisor)
  and compile-tested on iSeries.


...

+unrecov_stab:
+   EXCEPTION_PROLOG_COMMON
+   li  r6,0x4100   <- ends up in regs->trap
+   li  r20,0
+   bl  .save_remaining_regs
+1: addi    r3,r1,STACK_FRAME_OVERHEAD
+   bl  .unrecoverable_exception
+   b   1b
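
Illustration (not part of the original mail): the two debugging checks described in the commit message can be sketched as below. MSR_RI is bit 1 of the MSR (0x2). The kernel-address test assumes that 64-bit kernel addresses have the top bit set, and the two addresses in the usage note are hypothetical. The archived register values in the oops reports at the end of this thread appear to have lost digits, so only their low 16 bits are used here.

```python
MSR_RI = 1 << 1  # recoverable-interrupt bit in the MSR / SRR1 image


def exit_is_recoverable(srr1):
    """Exit-path check: RI=0 in SRR1 means a critical region was interrupted."""
    return bool(srr1 & MSR_RI)


def looks_like_kernel_address(addr):
    """Entry-path check. Assumption: kernel addresses have bit 63 set (0xc000...)."""
    return bool(addr & (1 << 63))


# Low 16 bits of the MSR lines of the two oopses, and of GPR08 in the first one.
samples = [
    ("oops #1 MSR", 0x1030),
    ("oops #2 MSR", 0x3030),
    ("oops #1 GPR08", 0xd032),
]
for name, low_bits in samples:
    print(name, "RI set" if exit_is_recoverable(low_bits) else "RI clear")
```

Both MSR values decode as RI clear, which is consistent with the exception being reported as unrecoverable. A stack pointer such as the hypothetical 0x00003fffa8c0dca0 would fail looks_like_kernel_address(), while 0xc000000000abc000 would pass.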


Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-07 Thread Michael Ellerman
Andreas Schwab  writes:

> On Aug 07 2017, Michael Ellerman  wrote:
>
>> Ah of course. Not sure why I haven't seen it in any of my testing :/
>
> It took me a whole gcc bootstrap to trigger it.

OK. I tried a few kernel builds, but I guess a GCC bootstrap takes a
while longer ;)

The other thing is most of our systems now have 1T segments, which means
you're less likely to take an SLB miss in the first place. Though we do
have disable_1t_segments for turning them off, precisely to test this
kind of code.
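
Illustration (not part of the original mail): a rough sketch of why larger segments mean fewer SLB misses. Each SLB entry maps one segment, so the number of segments a working set touches is the number of entries it needs. The segment sizes are 256 MB and 1 TB; the 8 GiB working set is an arbitrary example.

```python
SEG_256M = 1 << 28  # 256 MB segment
SEG_1T = 1 << 40    # 1 TB segment


def segments_touched(start, length, seg_size):
    """Number of distinct segments needed to map [start, start + length)."""
    if length <= 0:
        return 0
    first = start // seg_size
    last = (start + length - 1) // seg_size
    return last - first + 1


eight_gib = 8 << 30
print(segments_touched(0, eight_gib, SEG_256M))  # entries needed with 256 MB segments
print(segments_touched(0, eight_gib, SEG_1T))    # entries needed with 1 TB segments
```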

cheers


Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-07 Thread Andreas Schwab
On Aug 07 2017, Michael Ellerman  wrote:

> Ah of course. Not sure why I haven't seen it in any of my testing :/

It took me a whole gcc bootstrap to trigger it.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-07 Thread Nicholas Piggin
On Mon, 07 Aug 2017 19:56:28 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > On Fri, 04 Aug 2017 21:54:57 +0200
> > Andreas Schwab  wrote:
> >  
> >> No, this is really a 4.13-rc1 regression.
> >> 
> >> Andreas.
> >>   
> >
> > SLB miss with MSR[RI]=0 on
> >
> > lbz r0,THREAD+THREAD_LOAD_FP(r7)
> >
> > Caused by bc4f65e4cf9d6cc43e0e9ba0b8648cf9201cd55f  
> 
> Ah of course. Not sure why I haven't seen it in any of my testing :/
> 
> > Hmm, I'll see if something can be done, but that MSR_RI stuff in syscall
> > exit makes things fairly difficult (and will reduce performance improvement
> > of this patch anyway).
> >
> > I'm trying to work to a point where we have a soft-RI bit for these kinds of
> > uses that would avoid all this complexity. Until then it may be best to
> > just revert this patch.  
> 
> OK. Let me know in the next day or two what you want to do.
> 
> One option would be to load THREAD_LOAD_FP/THREAD_LOAD_VEC before we
> turn off RI.

Yeah, although that's a couple of unnecessary loads when we haven't
used the fp regs.

This path hits often on return from context switch, but for general
syscalls it's less clear. And considering it's fairly tricky code at
this point I'm thinking maybe just revert it for now?

Thanks,
Nick


Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-07 Thread Michael Ellerman
Nicholas Piggin  writes:

> On Fri, 04 Aug 2017 21:54:57 +0200
> Andreas Schwab  wrote:
>
>> No, this is really a 4.13-rc1 regression.
>> 
>> Andreas.
>> 
>
> SLB miss with MSR[RI]=0 on
>
> lbz r0,THREAD+THREAD_LOAD_FP(r7)
>
> Caused by bc4f65e4cf9d6cc43e0e9ba0b8648cf9201cd55f

Ah of course. Not sure why I haven't seen it in any of my testing :/

> Hmm, I'll see if something can be done, but that MSR_RI stuff in syscall
> exit makes things fairly difficult (and will reduce performance improvement
> of this patch anyway).
>
> I'm trying to work to a point where we have a soft-RI bit for these kinds of
> uses that would avoid all this complexity. Until then it may be best to
> just revert this patch.

OK. Let me know in the next day or two what you want to do.

One option would be to load THREAD_LOAD_FP/THREAD_LOAD_VEC before we
turn off RI.

It would be ugly, but if we think we can clean it up in the medium term
with soft-RI then maybe that would be OK.
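
Illustration (not part of the original mail): a toy model of the ordering option above. It treats the exit path as a list of steps and reports which loads run while RI is clear, since a fault on any of those (an SLB miss, for instance) cannot be recovered. The step encoding and both sequences are simplifications made up for this sketch, not the real syscall exit code.

```python
def loads_done_with_ri_clear(steps):
    """Return the names of loads that execute after RI has been cleared."""
    ri = True
    risky = []
    for op, arg in steps:
        if op == "set_ri":
            ri = arg
        elif op == "load" and not ri:
            risky.append(arg)
    return risky


# Simplified view of the failing order: RI is cleared, then the thread fields are read.
failing_order = [
    ("set_ri", False),
    ("load", "THREAD_LOAD_FP"),
    ("load", "THREAD_LOAD_VEC"),
]

# Suggested order: read the thread fields first, then clear RI.
suggested_order = [
    ("load", "THREAD_LOAD_FP"),
    ("load", "THREAD_LOAD_VEC"),
    ("set_ri", False),
]

print(loads_done_with_ri_clear(failing_order))
print(loads_done_with_ri_clear(suggested_order))
```

The cost Nick points out is visible in the second sequence: both loads now happen on every exit, whether or not the FP state is needed.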

cheers


Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-05 Thread Nicholas Piggin
On Fri, 04 Aug 2017 21:54:57 +0200
Andreas Schwab  wrote:

> No, this is really a 4.13-rc1 regression.
> 
> Andreas.
> 

SLB miss with MSR[RI]=0 on

lbz r0,THREAD+THREAD_LOAD_FP(r7)

Caused by bc4f65e4cf9d6cc43e0e9ba0b8648cf9201cd55f

Hmm, I'll see if something can be done, but that MSR_RI stuff in syscall
exit makes things fairly difficult (and will reduce performance improvement
of this patch anyway).

I'm trying to work to a point where we have a soft-RI bit for these kinds of
uses that would avoid all this complexity. Until then it may be best to
just revert this patch.

Thanks for the report


Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-04 Thread Andreas Schwab
No, this is really a 4.13-rc1 regression.

Andreas.



Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-04 Thread Andreas Schwab
This is actually a 4.13-rc2 regression.

Andreas.



Re: 4.13-rc3: Unrecoverable exception 4100

2017-08-04 Thread Benjamin Herrenschmidt
On Fri, 2017-08-04 at 12:59 +0200, Andreas Schwab wrote:
> I'm getting a lot of Unrecoverable exception 4100 with 4.13-rc3:

Hi Andreas !

Any chance you can bisect this ?

Thanks !

Cheers,
Ben.



4.13-rc3: Unrecoverable exception 4100

2017-08-04 Thread Andreas Schwab
I'm getting a lot of Unrecoverable exception 4100 with 4.13-rc3:

[13483.295173] Unrecoverable exception 4100 at c000a1ec
[13483.295186] Oops: Unrecoverable exception, sig: 6 [#1]
[13483.295190] SMP NR_CPUS=2
[13483.295191] PowerMac
[13483.295197] Modules linked in: nfsd auth_rpcgss oid_registry lockd grace 
nfs_acl sunrpc tun af_packet ip6table_mangle nf_conntrack_ipv6 nf_defrag_ipv6 
ip6t_REJECT nf_log_ipv6 ip6table_filter ip6_tables xt_TCPMSS iptable_mangle 
snd_aoa_fabric_layout snd_aoa_i2sbus snd_aoa_soundbus snd_pcm_oss snd_pcm 
sr_mod snd_aoa_codec_tas cdrom snd_aoa snd_seq snd_timer snd_seq_device 
xt_recent xt_nat snd_mixer_oss firewire_ohci snd sungem sungem_phy pata_macio 
firewire_core crc_itu_t soundcore xt_conntrack ipt_REJECT nf_log_ipv4 
nf_log_common xt_LOG xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_filter ip_tables x_tables sg 
linear md_mod hid_generic usbhid ohci_pci ohci_hcd ehci_pci ehci_hcd usbcore 
usb_common dm_snapshot dm_bufio dm_mirror dm_region_hash
[13483.295297]  dm_log dm_mod sata_svw
[13483.295306] CPU: 0 PID: 18626 Comm: rm Not tainted 4.13.0-rc3 #1
[13483.295311] task: c0018335e080 task.stack: c00139e5
[13483.295314] NIP: c000a1ec LR: c000a118 CTR: 
[13483.295318] REGS: c00139e53bb0 TRAP: 4100   Not tainted  (4.13.0-rc3)
[13483.295321] MSR: 90001030 
[13483.295329]   CR: 2444  XER: 2000
[13483.295333] SOFTE: 1
GPR00:  c00139e53e30 c0abb500 fffe
GPR04: c001eb866298   c0018335e080
GPR08: 9000d032  0002 f001
GPR12: c00139e5 c000 3fffa8c0dca0 3fffa8c0dc88
GPR16: 1000 0001 3fffa8c0eaa0 
GPR20: 3fffa8c27528 3fffa8c27b00  
GPR24: 3fffa8c0d918 31b3efa0 3fffa8c26d68 
GPR28: 3fffa8c249e8 3fffa8c263d0 3fffa8c27550 31b3ef10
[13483.295393] NIP [c000a1ec] system_call_exit+0xc0/0x21c
[13483.295398] LR [c000a118] system_call+0x58/0x6c
[13483.295400] Call Trace:
[13483.295405] [c00139e53e30] [c000a118] system_call+0x58/0x6c (unreliable)
[13483.295410] Instruction dump:
[13483.295415] 64a51000 7c6300d0 f8a101a0 4b9c 3c00 6006 780007c6 6400
[13483.295425] 6000 7c004039 4082001c e8ed0170 <88070b78> 88c70b79 7c003214 2c20
[13483.295437] ---[ end trace 79af5598e0243808 ]---
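
Illustration (not part of the original report): the word in angle brackets in the instruction dump is the instruction at NIP, and it can be decoded by hand. The sketch below splits a D-form PowerPC instruction into its fields; 34 is the primary opcode of lbz. That the displacement 0xb78 corresponds to THREAD+THREAD_LOAD_FP in this particular build is an assumption drawn from the analysis elsewhere in the thread.

```python
def decode_d_form(word):
    """Split a 32-bit PowerPC D-form instruction into (opcode, rt, ra, displacement)."""
    opcode = (word >> 26) & 0x3f
    rt = (word >> 21) & 0x1f
    ra = (word >> 16) & 0x1f
    d = word & 0xffff
    if d & 0x8000:  # the displacement is a signed 16-bit field
        d -= 0x10000
    return opcode, rt, ra, d


LBZ = 34  # primary opcode of lbz

opcode, rt, ra, d = decode_d_form(0x88070b78)
print(opcode == LBZ, "lbz r%d,%#x(r%d)" % (rt, d, ra))
```

The next word in the dump, 88c70b79, decodes the same way to a byte load from the following offset off r7.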

[13697.100080] Unrecoverable exception 4100 at c000a1ec
[13697.100093] Oops: Unrecoverable exception, sig: 6 [#2]
[13697.100096] SMP NR_CPUS=2
[13697.100098] PowerMac
[13697.100104] Modules linked in: nfsd auth_rpcgss oid_registry lockd grace 
nfs_acl sunrpc tun af_packet ip6table_mangle nf_conntrack_ipv6 nf_defrag_ipv6 
ip6t_REJECT nf_log_ipv6 ip6table_filter ip6_tables xt_TCPMSS iptable_mangle 
snd_aoa_fabric_layout snd_aoa_i2sbus snd_aoa_soundbus snd_pcm_oss snd_pcm 
sr_mod snd_aoa_codec_tas cdrom snd_aoa snd_seq snd_timer snd_seq_device 
xt_recent xt_nat snd_mixer_oss firewire_ohci snd sungem sungem_phy pata_macio 
firewire_core crc_itu_t soundcore xt_conntrack ipt_REJECT nf_log_ipv4 
nf_log_common xt_LOG xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_filter ip_tables x_tables sg 
linear md_mod hid_generic usbhid ohci_pci ohci_hcd ehci_pci ehci_hcd usbcore 
usb_common dm_snapshot dm_bufio dm_mirror dm_region_hash
[13697.100205]  dm_log dm_mod sata_svw
[13697.100214] CPU: 0 PID: 21001 Comm: sh Tainted: G  D 4.13.0-rc3 #1
[13697.100219] task: c00173f9f080 task.stack: c001eb7b8000
[13697.100222] NIP: c000a1ec LR: c000a118 CTR: 
[13697.100226] REGS: c001eb70 TRAP: 4100   Tainted: G  D  (4.13.0-rc3)
[13697.100229] MSR: 92003030 
[13697.100239]   CR: 24024482  XER: 2000
[13697.100243] SOFTE: 1
GPR00:  c001eb7bbe30 c0abb500 
GPR04: c0007b51 c019fa04 c00173f9fe10 c00173f9f080
GPR08: b000d032  0002 f001
GPR12: c001eb7b8000 c000 1002e380 1002e350
GPR16: 1002e318 1002e2e0  0001
GPR20: 1002bca8 1002bd30 10044c58 1002bc80
GPR24: 10044c60 1ac15fd0 10047790 1ac23490
GPR28: 1ac20d20 1ac23020 1ac20d20 3d882280
[13697.100301] NIP [c000a1ec] system_call_exit+0xc0/0x21c
[13697.100306] LR [c000a118] system_call+0x58/0x6c
[13697.100309] Call Trace:
[13697.100314] [c001eb7bbe30] [c000a118] system_call+0x58/0x6c (unreliable)
[13697.100319] Instruction dump: