Re: 2.6.31-rc5-git2 crash on a idle system.

2009-08-06 Thread Sachin Sant

Benjamin Herrenschmidt wrote:

Thanks. Since it's a memory corruption (or seems to be) however, it's
possible that the bisection will mislead you. IE. The culprit could be
somewhere else, and the commit you'll find via bisection just happens to
move things around in the kernel in such a way that the corruption hits
that code path instead of another rarely used one.

I would suggest using printk to print out the content of memory where
the code appears to have been smashed at different stages during boot
(maybe even in the initcalls loop in init/main.c) to try to point out
what appears to be causing the corruption.
  

By the time machine is up and running the particular memory location
in question is already overwritten. So seems like the corruption occurs
during the boot.

I added few printks in the initcall debug code patch. The o/p suggests
that by the time first initicall debug message is printed the code is
already corrupted. Further debug suggests, when start_kernel() is
called the code at address(0xc060) is already corrupted.
About 28 bytes of code starting from the above address is overwritten. 


I will try to add few more debug statements to find the place where
this corruption might me happening.

Thanks
-Sachin


--

-
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
-

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: 2.6.31-rc5-git2 crash on a idle system.

2009-08-06 Thread Michael Ellerman
On Thu, 2009-08-06 at 19:03 +0530, Sachin Sant wrote:
 Benjamin Herrenschmidt wrote:
  Thanks. Since it's a memory corruption (or seems to be) however, it's
  possible that the bisection will mislead you. IE. The culprit could be
  somewhere else, and the commit you'll find via bisection just happens to
  move things around in the kernel in such a way that the corruption hits
  that code path instead of another rarely used one.
 
  I would suggest using printk to print out the content of memory where
  the code appears to have been smashed at different stages during boot
  (maybe even in the initcalls loop in init/main.c) to try to point out
  what appears to be causing the corruption.

 By the time machine is up and running the particular memory location
 in question is already overwritten. So seems like the corruption occurs
 during the boot.
 
 I added few printks in the initcall debug code patch. The o/p suggests
 that by the time first initicall debug message is printed the code is
 already corrupted. Further debug suggests, when start_kernel() is
 called the code at address(0xc060) is already corrupted.
 About 28 bytes of code starting from the above address is overwritten. 
 
 I will try to add few more debug statements to find the place where
 this corruption might me happening.

Is it always the exact same pattern at the exact same address? Or does
it change and if so how?

cheers


signature.asc
Description: This is a digitally signed message part
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: 2.6.31-rc5-git2 crash on a idle system.

2009-08-06 Thread Benjamin Herrenschmidt
On Thu, 2009-08-06 at 19:03 +0530, Sachin Sant wrote:
 I added few printks in the initcall debug code patch. The o/p suggests
 that by the time first initicall debug message is printed the code is
 already corrupted. Further debug suggests, when start_kernel() is
 called the code at address(0xc060) is already corrupted.
 About 28 bytes of code starting from the above address is
 overwritten. 
 
 I will try to add few more debug statements to find the place where
 this corruption might me happening.

Hrm... start_kernel is very very early... strange. Can you double check
that the actual kernel image contains the right stuff ? Also, what
distro are you using to test that and how are you booting that kernel ?

You can always add something to prom_init.c to test (though beware you
aren't relocated yet so you need to offset the addresses).

Cheers,
Ben.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: 2.6.31-rc5-git2 crash on a idle system.

2009-08-05 Thread Sachin Sant

Benjamin Herrenschmidt wrote:

On Tue, 2009-08-04 at 17:57 +0530, Sachin Sant wrote:
  

I have a power6 blade [IBM,7998-61X] running 2.6.31-rc5-git2
kernel (a33a052f19a21d727847391c8c1aff3fb221c472). After some
period of inactivity the machine drops into xmon with following
traces.



Looks like code has been overwritten with data. Would be useful
to try to track down when that happens.

Ben.

2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) works fine
on this box without any problem. So the problem was introduced between
2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) and
2.6.31-rc5-git2 (a33a052f19a21d727847391c8c1aff3fb221c472).

Looking at the changelog all the changes are confined to
drivers/md/ directory ( probably following five commits ). I do have
all the CONFIG_MD* options enabled in my kernel.

449aad3e25358812c43afc60918c5ad3819488e7
70471dafe3390243c598a3165dfb86b8b8b3f4fe
3673f305faf1bc66ead751344f8262ace851ff44
3a981b03f38dc3b8a69b77cbc679e66c1318a44a
ac5e7113e74872928844d00085bd47c988f12728

I will try to find out the exact commit.

Thanks
-Sachin


--

-
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
-

[r...@mjs22lp1 home]# cpu 0x0: Vector: 700 (Program Check) at [ca90]
pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc
lr: c00c6038: .run_timer_softirq+0x20c/0x2f4
sp: cd10
   msr: 80089032
  current = 0xc0f58b70
  paca= 0xc10b2400
pid   = 0, comm = swapper
enter ? for help
[cda0] c00c6038 .run_timer_softirq+0x20c/0x2f4
[cea0] c00be8a0 .__do_softirq+0x174/0x2c8
[cf90] c00307b0 .call_do_softirq+0x14/0x24
[c1013870] c000ecf8 .do_softirq+0xa0/0x104
[c1013910] c00be1d0 .irq_exit+0x74/0xd4
[c1013990] c002ce20 .timer_interrupt+0x1cc/0x200
[c1013a30] c0003728 decrementer_common+0x128/0x180
--- Exception: 901 (Decrementer) at c000ec3c 
.raw_local_irq_restore+0xc0/0xdc
[c1013dc0] c0015824 .cpu_idle+0x13c/0x1e0
[c1013e60] c0009fe8 .rest_init+0x94/0xcc
[c1013ee0] c0990cf4 .start_kernel+0x484/0x4a8
[c1013f90] c0008408 .start_here_common+0x2c/0xa4
0:mon e
cpu 0x0: Vector: 700 (Program Check) at [ca90]
pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc
lr: c00c6038: .run_timer_softirq+0x20c/0x2f4
sp: cd10
   msr: 80089032
  current = 0xc0f58b70
  paca= 0xc10b2400
pid   = 0, comm = swapper
0:mon di c060 
c060  1010  .long 0x1010
c064  0008  .long 0x8
c068  1013  .long 0x1013
c06c  000f  .long 0xf
c0600010  7961626f  rldimi. r1,r11,44,41
c0600014  6f74  xoris   r20,r27,0
c0600018  00101600  .long 0x101600
c060001c  0c00  .long 0xc00
c0600020  0400  .long 0x400
c0600024  00101100  .long 0x101100
c0600028  08e9  .long 0x8e9
c060002c  6000  nop
c0600030  e93e8050  ld  r9,-32688(r30)
c0600034  7c6507b4  extsw   r5,r3
c0600038  78ab4da4  rldicr  r11,r5,9,54
c060003c  396b0040  addir11,r11,64
0:mon r
R00 = c00c6038   R16 = 
R01 = cd10   R17 = 0100
R02 = c100   R18 = c1059300
R03 =    R19 = c1102f98
R04 = cd60   R20 = c1103398
R05 =    R21 = c1103798
R06 = 0700   R22 = c1103b98
R07 = 0001   R23 = 
R08 = 0083   R24 = cfffc000
R09 = c072f890   R25 = 
R10 = c120fbf8   R26 = 00200200
R11 = c0600090   R27 = ce10
R12 = 2822   R28 = 0001
R13 = c10b2400   R29 = c10805e8
R14 = 00382800   R30 = c0fb4f78
R15 =    R31 = cd10
pc  = c060 .flow_cache_new_hashrnd+0x3c/0xcc
lr  = c00c6038 .run_timer_softirq+0x20c/0x2f4
msr = 80089032   cr  = 2824
ctr = c05fffc4   xer =    trap =  700
0:mon dl
6Phyp-dump disabled at boot time
6Using pSeries machine description
7Page orders: linear mapping = 24, virtual = 16, io = 12, vmemmap = 24
6Using 1TB segments
4Found initrd at 0xc200:0xc2382800
6console [udbg0] enabled
6Partition configured for 8 cpus.
6CPU maps initialized for 2 threads per core
7 (thread shift is 1)
4Starting Linux PPC64 #1 SMP Tue Aug 4 12:15:55 IST 2009
4-
4ppc64_pft_size= 0x1a

Re: 2.6.31-rc5-git2 crash on a idle system.

2009-08-05 Thread Benjamin Herrenschmidt

 2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) works fine
 on this box without any problem. So the problem was introduced between
 2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) and
 2.6.31-rc5-git2 (a33a052f19a21d727847391c8c1aff3fb221c472).
 
 Looking at the changelog all the changes are confined to
 drivers/md/ directory ( probably following five commits ). I do have
 all the CONFIG_MD* options enabled in my kernel.
 
 449aad3e25358812c43afc60918c5ad3819488e7
 70471dafe3390243c598a3165dfb86b8b8b3f4fe
 3673f305faf1bc66ead751344f8262ace851ff44
 3a981b03f38dc3b8a69b77cbc679e66c1318a44a
 ac5e7113e74872928844d00085bd47c988f12728
 
 I will try to find out the exact commit.

Thanks. Since it's a memory corruption (or seems to be) however, it's
possible that the bisection will mislead you. IE. The culprit could be
somewhere else, and the commit you'll find via bisection just happens to
move things around in the kernel in such a way that the corruption hits
that code path instead of another rarely used one.

I would suggest using printk to print out the content of memory where
the code appears to have been smashed at different stages during boot
(maybe even in the initcalls loop in init/main.c) to try to point out
what appears to be causing the corruption.

Cheers,
Ben.

 Thanks
 -Sachin
 
 
 plain text document attachment (rc5-git2-log)
 [r...@mjs22lp1 home]# cpu 0x0: Vector: 700 (Program Check) at 
 [ca90]
 pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc
 lr: c00c6038: .run_timer_softirq+0x20c/0x2f4
 sp: cd10
msr: 80089032
   current = 0xc0f58b70
   paca= 0xc10b2400
 pid   = 0, comm = swapper
 enter ? for help
 [cda0] c00c6038 .run_timer_softirq+0x20c/0x2f4
 [cea0] c00be8a0 .__do_softirq+0x174/0x2c8
 [cf90] c00307b0 .call_do_softirq+0x14/0x24
 [c1013870] c000ecf8 .do_softirq+0xa0/0x104
 [c1013910] c00be1d0 .irq_exit+0x74/0xd4
 [c1013990] c002ce20 .timer_interrupt+0x1cc/0x200
 [c1013a30] c0003728 decrementer_common+0x128/0x180
 --- Exception: 901 (Decrementer) at c000ec3c 
 .raw_local_irq_restore+0xc0/0xdc
 [c1013dc0] c0015824 .cpu_idle+0x13c/0x1e0
 [c1013e60] c0009fe8 .rest_init+0x94/0xcc
 [c1013ee0] c0990cf4 .start_kernel+0x484/0x4a8
 [c1013f90] c0008408 .start_here_common+0x2c/0xa4
 0:mon e
 cpu 0x0: Vector: 700 (Program Check) at [ca90]
 pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc
 lr: c00c6038: .run_timer_softirq+0x20c/0x2f4
 sp: cd10
msr: 80089032
   current = 0xc0f58b70
   paca= 0xc10b2400
 pid   = 0, comm = swapper
 0:mon di c060 
 c060  1010  .long 0x1010
 c064  0008  .long 0x8
 c068  1013  .long 0x1013
 c06c  000f  .long 0xf
 c0600010  7961626f  rldimi. r1,r11,44,41
 c0600014  6f74  xoris   r20,r27,0
 c0600018  00101600  .long 0x101600
 c060001c  0c00  .long 0xc00
 c0600020  0400  .long 0x400
 c0600024  00101100  .long 0x101100
 c0600028  08e9  .long 0x8e9
 c060002c  6000  nop
 c0600030  e93e8050  ld  r9,-32688(r30)
 c0600034  7c6507b4  extsw   r5,r3
 c0600038  78ab4da4  rldicr  r11,r5,9,54
 c060003c  396b0040  addir11,r11,64
 0:mon r
 R00 = c00c6038   R16 = 
 R01 = cd10   R17 = 0100
 R02 = c100   R18 = c1059300
 R03 =    R19 = c1102f98
 R04 = cd60   R20 = c1103398
 R05 =    R21 = c1103798
 R06 = 0700   R22 = c1103b98
 R07 = 0001   R23 = 
 R08 = 0083   R24 = cfffc000
 R09 = c072f890   R25 = 
 R10 = c120fbf8   R26 = 00200200
 R11 = c0600090   R27 = ce10
 R12 = 2822   R28 = 0001
 R13 = c10b2400   R29 = c10805e8
 R14 = 00382800   R30 = c0fb4f78
 R15 =    R31 = cd10
 pc  = c060 .flow_cache_new_hashrnd+0x3c/0xcc
 lr  = c00c6038 .run_timer_softirq+0x20c/0x2f4
 msr = 80089032   cr  = 2824
 ctr = c05fffc4   xer =    trap =  700
 0:mon dl
 6Phyp-dump disabled at boot time
 6Using pSeries machine description
 7Page orders: linear mapping = 24, virtual = 16, io = 12, vmemmap = 24
 6Using 1TB segments
 4Found initrd at 0xc200:0xc2382800
 6console [udbg0] enabled
 6Partition configured for 8 cpus.
 6CPU maps initialized for 2 threads per core