Re: TARGET_ARCH=powerpc head -r317820 production-style kernel: periodic panics always in pid=11 (the Idle threads)

2017-05-20 Thread Mark Millard
On 2017-May-19, at 9:42 PM, Mark Millard  wrote:

> On 2017-May-9, at 2:00 PM, Mark Millard  wrote:
> 
> . . .
>> fatal kernel trap:
>>  exception   = 0x903a64e (unknown)
>>  srr0        = 0x7ff760
>>  srr1        = 0xc1007c
>>  lr          = 0x907f
>>  curthread   = 0x147d6c0
>> pid = 11, comm = idle: cpu0
>> [ thread pid 11 tid 13 ]
>> Stopped at  ffs_truncate+0x1080:stw r11, 0xf8(r31)
>> 
>> 1 contains (cpu1 instead of cpu0, so different tid):
>> 
>> fatal kernel trap:
>>  exception   = 0x903a64e (unknown)
>>  srr0        = 0x7ff760
>>  srr1        = 0xc1007c
>>  lr          = 0x907f
>>  curthread   = 0x147d360
>> pid = 11, comm = idle: cpu1
>> [ thread pid 11 tid 14 ]
>> Stopped at  ffs_truncate+0x1080:stw r11, 0xf8(r31)
>> 
>> 1 contains:
> 
> I've discovered where to find the trapframe
> in the vmcore.* files for these specific
> examples with 0x903a64e as the exception
> and such.
> 
> In the vmcore the memory image starts at
> byte offset 0x1000.
> 
> The only starting point in the image file
> that reproduces the reported values at the
> field offsets of the powerpc trapframe is:
> 
> offset 0x1001 in the vmcore.* file.
> 
> So memory address 0x1 is being used
> as the trapframe address when that
> odd exception information is being
> displayed. Yep: misaligned.
> 
> The decoding is not of the actual
> trapframe: it is garbage that is
> not to be believed.
> 
> 
> Note: I lucked out after the above and
> got a somewhat different odd trap report
> that led to actually getting a backtrace
> that included the actual pid 11 cpu 1 kernel
> thread stack bt associated with that odd
> information display.

Typo: That should have been "cpu 2".

> I'll send a separate reply for that information
> as it will take some transcribing from camera
> pictures and such.


As indicated, I got a different odd trap report
that gave a backtrace. . .

fatal user trap

exception = 0x421 (unknown)
srr0      = 0xc1007c09
srr1      = 0x3a64e80
lr        = 0xc0807fc9
curthread = 0x147d000
pid = 11, comm = idle: cpu 2

Now at this point it attempted to
db_print_loc_and_inst and got
another exception (at offset +0x60
in the routine).

So the backtrace has both the
consequences of that and what led
up to that: an EXI trap was
attempting to report trap frame
information but was using a bad
address for the supposed frame.

The details of the backtrace:

panic: data storage interrupt trap
cpuid = 2
time = 145187154
KDB: stack backtrace
0xdf5ef2c0: at kdb_backtrace+0x5c
0xdf5ef3a0: at panic+0x54
0xdf5ef3f0: at trap_fatal+0x1cc
0xdf5ef420: at powerpc_interrupt+0x180
0xdf5ef5c0: kernel DSI read trap @ 0xc1007c09
by db_disasm+0x30:
srr1=0x1032
r1  =0xdf5ef6b0
cr  =0x24009022
xer =0
ctr =0x1852cc
sr  =0x4000
0xdf5ef6b0: at 0x1007480
0xdf5ef6d0: at db_print_loc_and_inst+0x60
0xdf5ef700: at db_trap+0x104
0xdf5ef790: at kdb_trap+0x1bc
0xdf5ef810: at trap_fatal+0x1b0
0xdf5ef840: at trap+0x1184
0xdf5ef870: kernel EXI trap
by cpu_idle_60x+0x88:
srr1=0x1032
r1  =0xdf5ef930
cr  =0x4042
xer =0x2000
ctr =0x8e3bd8
saved LR(0x2) is invalid.

So an EXI trap was attempting to
report a trap frame.

(Note: the LR's for pid 11 cpu threads
normally report an invalid LR in ddb.)



The actual EXI trapframe starts at 013f0878 in
vmcore.5:

013f0870  df 5e f9 30 00 10 08 f8  00 04 90 32 df 5e f9 30  |.^.0.......2.^.0|
013f0880  01 47 d0 00 00 00 00 00  25 94 48 3f 00 00 00 00  |.G......%.H?....|
013f0890  25 94 48 3f 00 4a a9 c8  00 00 00 00 00 00 00 44  |%.H?.J.........D|
013f08a0  01 fc a0 55 00 00 90 32  df 5d 1d 00 00 00 00 00  |...U...2.]......|
013f08b0  00 d4 bd ec 00 cb 98 98  00 c9 66 bc 00 c4 5d 08  |..........f...].|
013f08c0  00 c9 66 bc 00 d4 c5 3c  df 5e f9 e0 00 eb a7 80  |..f....<.^......|
013f08d0  00 c9 66 bc 01 47 d0 00  df 5e f9 8c 00 00 00 06  |..f..G...^......|
013f08e0  00 00 00 06 00 eb b5 80  00 00 00 00 00 8e 3b d8  |..............;.|
013f08f0  00 d2 6b f0 df 5e f9 30  00 8e 3b f4 40 00 00 42  |..k..^.0..;.@..B|
013f0900  20 00 00 00 00 8e 3b d8  00 8e 3c 60 00 00 90 32  | .....;...<`...2|
013f0910  00 00 05 00 41 a1 d5 d4  42 00 00 00 00 00 00 00  |....A...B.......|

So:

r0    = 0x00049032
r1    = 0xdf5ef930
r2    = 0x0147d000
r3    = 0x00000000
r4    = 0x2594483f
r5    = 0x00000000
r6    = 0x2594483f
r7    = 0x004aa9c8
r8    = 0x00000000
r9    = 0x00000044
r10   = 0x01fca055
r11   = 0x00009032
r12   = 0xdf5d1d00
r13   = 0x00000000
r14   = 0x00d4bdec
r15   = 0x00cb9898
r16   = 0x00c966bc
r17   = 0x00c45d08
r18   = 0x00c966bc
r19   = 0x00d4c53c
r20   = 0xdf5ef9e0
r21   = 0x00eba780
r22   = 0x00c966bc
r23   = 0x0147d000
r24   = 0xdf5ef98c
r25   = 0x00000006 (this value shows up later in a bad spot)
r26   
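
As a cross-check on the hand decoding above, here is a minimal Python sketch that reads the same hex dump rows as big-endian 32-bit words starting at file offset 0x013f0878. The r0..r31 order follows the listing above. The order of the fields after r31 (lr, cr, xer, ctr, srr0, srr1, exc) is an assumption about the 32-bit powerpc trapframe layout; it is at least consistent with the dump, since it yields ctr = 0x8e3bd8 (as in the backtrace) and exc = 0x500. The names DUMP, load_dump and decode_frame are made up for this illustration.

```python
import struct

# Hex dump rows from vmcore.5 as shown in the message (ASCII column omitted).
DUMP = """\
013f0870  df 5e f9 30 00 10 08 f8  00 04 90 32 df 5e f9 30
013f0880  01 47 d0 00 00 00 00 00  25 94 48 3f 00 00 00 00
013f0890  25 94 48 3f 00 4a a9 c8  00 00 00 00 00 00 00 44
013f08a0  01 fc a0 55 00 00 90 32  df 5d 1d 00 00 00 00 00
013f08b0  00 d4 bd ec 00 cb 98 98  00 c9 66 bc 00 c4 5d 08
013f08c0  00 c9 66 bc 00 d4 c5 3c  df 5e f9 e0 00 eb a7 80
013f08d0  00 c9 66 bc 01 47 d0 00  df 5e f9 8c 00 00 00 06
013f08e0  00 00 00 06 00 eb b5 80  00 00 00 00 00 8e 3b d8
013f08f0  00 d2 6b f0 df 5e f9 30  00 8e 3b f4 40 00 00 42
013f0900  20 00 00 00 00 8e 3b d8  00 8e 3c 60 00 00 90 32
013f0910  00 00 05 00 41 a1 d5 d4  42 00 00 00 00 00 00 00
"""

FRAME_OFFSET = 0x013F0878  # file offset where the EXI trapframe starts

# Assumed field order: 32 general registers, then the special registers.
FIELDS = [f"r{i}" for i in range(32)] + [
    "lr", "cr", "xer", "ctr", "srr0", "srr1", "exc",
]


def load_dump(text):
    """Return (base file offset, raw bytes) for contiguous hexdump rows."""
    rows = [line.split() for line in text.strip().splitlines()]
    base = int(rows[0][0], 16)
    data = bytes(int(b, 16) for row in rows for b in row[1:17])
    return base, data


def decode_frame(text, frame_offset):
    """Decode big-endian 32-bit trapframe words into a name -> value dict."""
    base, data = load_dump(text)
    words = struct.unpack_from(f">{len(FIELDS)}I", data, frame_offset - base)
    return dict(zip(FIELDS, words))


if __name__ == "__main__":
    for name, value in decode_frame(DUMP, FRAME_OFFSET).items():
        print(f"{name:<5} = 0x{value:08x}")
```

Decoded this way, r2 and r23 both come out as 0x0147d000, which is the curthread value in the trap report.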

Re: TARGET_ARCH=powerpc head -r317820 production-style kernel: periodic panics always in pid=11 (the Idle threads)

2017-05-19 Thread Mark Millard

On 2017-May-9, at 2:00 PM, Mark Millard  wrote:

. . .
> fatal kernel trap:
>   exception   = 0x903a64e (unknown)
>   srr0        = 0x7ff760
>   srr1        = 0xc1007c
>   lr          = 0x907f
>   curthread   = 0x147d6c0
>  pid = 11, comm = idle: cpu0
> [ thread pid 11 tid 13 ]
> Stopped at  ffs_truncate+0x1080:stw r11, 0xf8(r31)
> 
> 1 contains (cpu1 instead of cpu0, so different tid):
> 
> fatal kernel trap:
>   exception   = 0x903a64e (unknown)
>   srr0        = 0x7ff760
>   srr1        = 0xc1007c
>   lr          = 0x907f
>   curthread   = 0x147d360
>  pid = 11, comm = idle: cpu1
> [ thread pid 11 tid 14 ]
> Stopped at  ffs_truncate+0x1080:stw r11, 0xf8(r31)
> 
> 1 contains:

I've discovered where to find the trapframe
in the vmcore.* files for these specific
examples with 0x903a64e as the exception
and such.

In the vmcore the memory image starts at
byte offset 0x1000.

The only starting point in the image file
that reproduces the reported values at the
field offsets of the powerpc trapframe is:

offset 0x1001 in the vmcore.* file.

So memory address 0x1 is being used
as the trapframe address when that
odd exception information is being
displayed. Yep: misaligned.

The decoding is not of the actual
trapframe: it is garbage that is
not to be believed.
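
The offset arithmetic above can be written down as a small Python sketch. It assumes only what is stated here: the memory image begins at byte offset 0x1000 of the vmcore.* file, so a memory address maps to a file offset by adding 0x1000. The helper names are made up for illustration, and the 4-byte alignment check reflects 32-bit powerpc register words.

```python
IMAGE_START = 0x1000  # per the message: memory image begins at this file offset


def file_offset(mem_addr):
    """File offset in vmcore.* that holds the byte at memory address mem_addr."""
    return IMAGE_START + mem_addr


def mem_address(offset):
    """Memory address stored at the given vmcore.* file offset."""
    if offset < IMAGE_START:
        raise ValueError("offset lies in the header, before the memory image")
    return offset - IMAGE_START


def is_word_aligned(addr, width=4):
    """True if addr is a multiple of width bytes (4 for 32-bit words)."""
    return addr % width == 0


if __name__ == "__main__":
    addr = mem_address(0x1001)
    print(hex(addr), "aligned" if is_word_aligned(addr) else "misaligned")
```

For file offset 0x1001 this gives memory address 0x1, which is not word aligned.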


Note: I lucked out after the above and
got a somewhat different odd trap report
that led to actually getting a backtrace
that included the actual pid 11 cpu 1 kernel
thread stack bt associated with that odd
information display.

I'll send a separate reply for that information
as it will take some transcribing from camera
pictures and such.

===
Mark Millard
markmi at dsl-only.net

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


TARGET_ARCH=powerpc head -r317820 production-style kernel: periodic panics always in pid=11 (the Idle threads)

2017-05-09 Thread Mark Millard
kgdb is not working for powerpc, neither the
system one nor the ports one. I've used "strings"
to extract the information below about the failures.
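
For anyone without strings(1) at hand, a rough Python equivalent of that extraction step is sketched below. It is not what was actually run here: the function names and the keyword list are made up for illustration, and it only pulls runs of printable ASCII out of the dump and keeps the ones that look like trap or panic output.

```python
import re


def printable_runs(data, min_len=4):
    """Yield runs of at least min_len printable ASCII bytes, like strings(1)."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


def trap_related(data, keywords=("fatal kernel trap", "exception",
                                 "comm =", "panic:")):
    """Return the printable runs that contain any of the given keywords."""
    return [s for s in printable_runs(data) if any(k in s for k in keywords)]


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py <vmcore file>
    with open(sys.argv[1], "rb") as f:
        for line in trap_related(f.read()):
            print(line)
```

Reading the whole file into memory is fine for small dumps; for a large vmcore, processing it in chunks would be the obvious refinement.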

The time frames to failure are widely variable,
minutes to hours.

I've never seen the below with a debug kernel, only with
production-style. I have not seen any such problems for
powerpc64, aarch64 (with -mcpu=cortex-a53 ), armv6
(with -mcpu=cortex-a7 ), or amd64. Just powerpc.
The powerpc and powerpc64 hardware is (e.g.) the same
old PowerMac G5 so-called "Quad Core" used with
two different boot SSDs.

Note: This reproduces for me with pure gcc 4.2.1
  based builds. My usual clang-targeting-powerpc
  experiments are not involved here.

I'd not updated for a long time before this due to
the status of the clang compiler not changing and
its powerpc stack code-generation problems being
difficult to work around.

My kernels are unusual in having both sc and
vt in the build and ps3 disabled. I happen to be
using sc because it works with the 2560x1440
display that is currently connected, whereas vt
fails to boot at that size.

Of 7 example vmcore.* files. . .
(Note that all are pid 11 Idle-process thread
failures)

3 contain:

fatal kernel trap:
   exception   = 0x903a64e (unknown)
   srr0        = 0x7ff760
   srr1        = 0xc1007c
   lr          = 0x907f
   curthread   = 0x147d6c0
  pid = 11, comm = idle: cpu0
[ thread pid 11 tid 13 ]
Stopped at  ffs_truncate+0x1080:stw r11, 0xf8(r31)

1 contains (cpu1 instead of cpu0, so different tid):

fatal kernel trap:
   exception   = 0x903a64e (unknown)
   srr0        = 0x7ff760
   srr1        = 0xc1007c
   lr          = 0x907f
   curthread   = 0x147d360
  pid = 11, comm = idle: cpu1
[ thread pid 11 tid 14 ]
Stopped at  ffs_truncate+0x1080:stw r11, 0xf8(r31)

1 contains:

fatal kernel trap:
   exception   = 0x2100 (unknown)
   srr0        = 0x7c0903
   srr1        = 0xa64e8004
   lr          = 0x807fc9e7
   curthread   = 0x147d000
  pid = 11, comm = idle: cpu2
[ thread pid 11 tid 15 ]
Stopped at  audit_commit+0x24f: illegal instruction 4915f00

1 contains:

fatal kernel trap:
   exception       = 0x300 (data storage interrupt)
   virtual address = 0x7ff76000
   dsisr           = 0x4000
   srr0            = 0x8e3cf8
   srr1            = 0x1032
   lr              = 0x8e3ce8
   curthread       = 0x147d6c0
  pid = 11, comm = idle: cpu0
panic: data storage interrupt trap
cpuid = 0
time = 1494057319
KDB: stack backtrace:
 0xdf5e52c0: at kdb_backtrace+0x5c
0xdf5e5330: at vpanic+0x1ec
0xdf5e53a0: at panic+0x54
0xdf5e53f0: at trap_fatal+0x1cc
0xdf5e5420: at trap+0x122c
0xdf5e55c0: at powerpc_interrupt+0x180
0xdf5e55f0: kernel DSI read trap @ 0x7ff76000 by db_disasm+0x30: srr1=0x1032
r1=0xdf5e56b0 cr=0x24009022 xer=0 ctr=0x1852cc sr=0x4000
0xdf5e56b0: at 0x1007460
0xdf5e56d0: at db_print_loc_and_inst+0x60
0xdf5e5700: at db_trap+0x104
0xdf5e5790: at kdb_trap+0x1bc
0xdf5e5810: at trap_fatal+0x1b0
0xdf5e5840: at trap+0x1184
0xdf5e5870: kernel DECR trap by cpu_idle_60x+0x88: srr1=0x9032
r1=0xdf5e5930 cr=0x4042 xer=0x2000 ctr=0x8e3bd8
saved LR(0xfffe) is invalid

And 1 contains:

fatal kernel trap:
   exception   = 0x0 (unknown)
   srr0        = 0x903a64e
   srr1        = 0x80042100
   lr          = 0xc9e7c800
   curthread   = 0x147d360
  pid = 11, comm = idle: cpu1
[ thread pid 11 tid 14 ]
Stopped at  0x903a64e:
fatal kernel trap:
   exception       = 0x300 (data storage interrupt)
   virtual address = 0x903a64e
   dsisr           = 0x4000
   srr0            = 0x8e3cf8
   srr1            = 0x1032
   lr              = 0x8e3ce8
   curthread       = 0x147d360
  pid = 11, comm = idle: cpu1
panic: data storage interrupt trap
cpuid = 1
time = 1494132014
KDB: stack backtrace:
  0xdf5ea2c0: at kdb_backtrace+0x5c
0xdf5ea330: at vpanic+0x1ec
0xdf5ea3a0: at panic+0x54
0xdf5ea3f0: at trap_fatal+0x1cc
0xdf5ea420: at trap+0x122c
0xdf5ea5c0: at powerpc_interrupt+0x180
0xdf5ea5f0: kernel DSI read trap @ 0x903a64e by db_disasm+0x30: srr1=0x1032
r1=0xdf5ea6b0 cr=0x24009022 xer=0 ctr=0x1852cc sr=0x4000
0xdf5ea6b0: at 0x1007460
0xdf5ea6d0: at db_print_loc_and_inst+0x60
0xdf5ea700: at db_trap+0x104
0xdf5ea790: at kdb_trap+0x1bc
0xdf5ea810: at trap_fatal+0x1b0
0xdf5ea840: at trap+0x122c
0xdf5ea870: kernel EXI trap by cpu_idle_60x+0x88: srr1=0x9032
r1=0xdf5ea930 cr=0x4042 xer=0x2000 ctr=0x8e3bd8
saved LR(0x5) is invalid


Most (but not all) of the above were while the
old PowerMac was sitting unused.

The pid 11 Idle thread commonality suggests to me
some sort of interrupt oddity messing up when the
idle threads were put to use for the interrupt.

The /usr/src/sys/powerpc/conf/* files in use
are (-NODBG for production style and -DBG for
debug style):


# more