Re: 2.6.31-rc5-git2 crash on a idle system.
Benjamin Herrenschmidt wrote: Thanks. Since it's a memory corruption (or seems to be) however, it's possible that the bisection will mislead you. IE. The culprit could be somewhere else, and the commit you'll find via bisection just happens to move things around in the kernel in such a way that the corruption hits that code path instead of another rarely used one. I would suggest using printk to print out the content of memory where the code appears to have been smashed at different stages during boot (maybe even in the initcalls loop in init/main.c) to try to point out what appears to be causing the corruption. By the time machine is up and running the particular memory location in question is already overwritten. So seems like the corruption occurs during the boot. I added few printks in the initcall debug code patch. The o/p suggests that by the time first initicall debug message is printed the code is already corrupted. Further debug suggests, when start_kernel() is called the code at address(0xc060) is already corrupted. About 28 bytes of code starting from the above address is overwritten. I will try to add few more debug statements to find the place where this corruption might me happening. Thanks -Sachin -- - Sachin Sant IBM Linux Technology Center India Systems and Technology Labs Bangalore, India - ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: 2.6.31-rc5-git2 crash on a idle system.
On Thu, 2009-08-06 at 19:03 +0530, Sachin Sant wrote: Benjamin Herrenschmidt wrote: Thanks. Since it's a memory corruption (or seems to be) however, it's possible that the bisection will mislead you. IE. The culprit could be somewhere else, and the commit you'll find via bisection just happens to move things around in the kernel in such a way that the corruption hits that code path instead of another rarely used one. I would suggest using printk to print out the content of memory where the code appears to have been smashed at different stages during boot (maybe even in the initcalls loop in init/main.c) to try to point out what appears to be causing the corruption. By the time machine is up and running the particular memory location in question is already overwritten. So seems like the corruption occurs during the boot. I added few printks in the initcall debug code patch. The o/p suggests that by the time first initicall debug message is printed the code is already corrupted. Further debug suggests, when start_kernel() is called the code at address(0xc060) is already corrupted. About 28 bytes of code starting from the above address is overwritten. I will try to add few more debug statements to find the place where this corruption might me happening. Is it always the exact same pattern at the exact same address? Or does it change and if so how? cheers signature.asc Description: This is a digitally signed message part ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: 2.6.31-rc5-git2 crash on a idle system.
On Thu, 2009-08-06 at 19:03 +0530, Sachin Sant wrote: I added few printks in the initcall debug code patch. The o/p suggests that by the time first initicall debug message is printed the code is already corrupted. Further debug suggests, when start_kernel() is called the code at address(0xc060) is already corrupted. About 28 bytes of code starting from the above address is overwritten. I will try to add few more debug statements to find the place where this corruption might me happening. Hrm... start_kernel is very very early... strange. Can you double check that the actual kernel image contains the right stuff ? Also, what distro are you using to test that and how are you booting that kernel ? You can always add something to prom_init.c to test (though beware you aren't relocated yet so you need to offset the addresses). Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: 2.6.31-rc5-git2 crash on a idle system.
Benjamin Herrenschmidt wrote: On Tue, 2009-08-04 at 17:57 +0530, Sachin Sant wrote: I have a power6 blade [IBM,7998-61X] running 2.6.31-rc5-git2 kernel (a33a052f19a21d727847391c8c1aff3fb221c472). After some period of inactivity the machine drops into xmon with following traces. Looks like code has been overwritten with data. Would be useful to try to track down when that happens. Ben. 2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) works fine on this box without any problem. So the problem was introduced between 2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) and 2.6.31-rc5-git2 (a33a052f19a21d727847391c8c1aff3fb221c472). Looking at the changelog all the changes are confined to drivers/md/ directory ( probably following five commits ). I do have all the CONFIG_MD* options enabled in my kernel. 449aad3e25358812c43afc60918c5ad3819488e7 70471dafe3390243c598a3165dfb86b8b8b3f4fe 3673f305faf1bc66ead751344f8262ace851ff44 3a981b03f38dc3b8a69b77cbc679e66c1318a44a ac5e7113e74872928844d00085bd47c988f12728 I will try to find out the exact commit. Thanks -Sachin -- - Sachin Sant IBM Linux Technology Center India Systems and Technology Labs Bangalore, India - [r...@mjs22lp1 home]# cpu 0x0: Vector: 700 (Program Check) at [ca90] pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc lr: c00c6038: .run_timer_softirq+0x20c/0x2f4 sp: cd10 msr: 80089032 current = 0xc0f58b70 paca= 0xc10b2400 pid = 0, comm = swapper enter ? for help [cda0] c00c6038 .run_timer_softirq+0x20c/0x2f4 [cea0] c00be8a0 .__do_softirq+0x174/0x2c8 [cf90] c00307b0 .call_do_softirq+0x14/0x24 [c1013870] c000ecf8 .do_softirq+0xa0/0x104 [c1013910] c00be1d0 .irq_exit+0x74/0xd4 [c1013990] c002ce20 .timer_interrupt+0x1cc/0x200 [c1013a30] c0003728 decrementer_common+0x128/0x180 --- Exception: 901 (Decrementer) at c000ec3c .raw_local_irq_restore+0xc0/0xdc [c1013dc0] c0015824 .cpu_idle+0x13c/0x1e0 [c1013e60] c0009fe8 .rest_init+0x94/0xcc [c1013ee0] c0990cf4 .start_kernel+0x484/0x4a8 [c1013f90] c0008408 .start_here_common+0x2c/0xa4 0:mon e cpu 0x0: Vector: 700 (Program Check) at [ca90] pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc lr: c00c6038: .run_timer_softirq+0x20c/0x2f4 sp: cd10 msr: 80089032 current = 0xc0f58b70 paca= 0xc10b2400 pid = 0, comm = swapper 0:mon di c060 c060 1010 .long 0x1010 c064 0008 .long 0x8 c068 1013 .long 0x1013 c06c 000f .long 0xf c0600010 7961626f rldimi. r1,r11,44,41 c0600014 6f74 xoris r20,r27,0 c0600018 00101600 .long 0x101600 c060001c 0c00 .long 0xc00 c0600020 0400 .long 0x400 c0600024 00101100 .long 0x101100 c0600028 08e9 .long 0x8e9 c060002c 6000 nop c0600030 e93e8050 ld r9,-32688(r30) c0600034 7c6507b4 extsw r5,r3 c0600038 78ab4da4 rldicr r11,r5,9,54 c060003c 396b0040 addir11,r11,64 0:mon r R00 = c00c6038 R16 = R01 = cd10 R17 = 0100 R02 = c100 R18 = c1059300 R03 = R19 = c1102f98 R04 = cd60 R20 = c1103398 R05 = R21 = c1103798 R06 = 0700 R22 = c1103b98 R07 = 0001 R23 = R08 = 0083 R24 = cfffc000 R09 = c072f890 R25 = R10 = c120fbf8 R26 = 00200200 R11 = c0600090 R27 = ce10 R12 = 2822 R28 = 0001 R13 = c10b2400 R29 = c10805e8 R14 = 00382800 R30 = c0fb4f78 R15 = R31 = cd10 pc = c060 .flow_cache_new_hashrnd+0x3c/0xcc lr = c00c6038 .run_timer_softirq+0x20c/0x2f4 msr = 80089032 cr = 2824 ctr = c05fffc4 xer = trap = 700 0:mon dl 6Phyp-dump disabled at boot time 6Using pSeries machine description 7Page orders: linear mapping = 24, virtual = 16, io = 12, vmemmap = 24 6Using 1TB segments 4Found initrd at 0xc200:0xc2382800 6console [udbg0] enabled 6Partition configured for 8 cpus. 6CPU maps initialized for 2 threads per core 7 (thread shift is 1) 4Starting Linux PPC64 #1 SMP Tue Aug 4 12:15:55 IST 2009 4- 4ppc64_pft_size= 0x1a
Re: 2.6.31-rc5-git2 crash on a idle system.
2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) works fine on this box without any problem. So the problem was introduced between 2.6.31-rc5-git1 (4905f92ed752d49ebe9cce4fe78a4bc39e710523) and 2.6.31-rc5-git2 (a33a052f19a21d727847391c8c1aff3fb221c472). Looking at the changelog all the changes are confined to drivers/md/ directory ( probably following five commits ). I do have all the CONFIG_MD* options enabled in my kernel. 449aad3e25358812c43afc60918c5ad3819488e7 70471dafe3390243c598a3165dfb86b8b8b3f4fe 3673f305faf1bc66ead751344f8262ace851ff44 3a981b03f38dc3b8a69b77cbc679e66c1318a44a ac5e7113e74872928844d00085bd47c988f12728 I will try to find out the exact commit. Thanks. Since it's a memory corruption (or seems to be) however, it's possible that the bisection will mislead you. IE. The culprit could be somewhere else, and the commit you'll find via bisection just happens to move things around in the kernel in such a way that the corruption hits that code path instead of another rarely used one. I would suggest using printk to print out the content of memory where the code appears to have been smashed at different stages during boot (maybe even in the initcalls loop in init/main.c) to try to point out what appears to be causing the corruption. Cheers, Ben. Thanks -Sachin plain text document attachment (rc5-git2-log) [r...@mjs22lp1 home]# cpu 0x0: Vector: 700 (Program Check) at [ca90] pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc lr: c00c6038: .run_timer_softirq+0x20c/0x2f4 sp: cd10 msr: 80089032 current = 0xc0f58b70 paca= 0xc10b2400 pid = 0, comm = swapper enter ? for help [cda0] c00c6038 .run_timer_softirq+0x20c/0x2f4 [cea0] c00be8a0 .__do_softirq+0x174/0x2c8 [cf90] c00307b0 .call_do_softirq+0x14/0x24 [c1013870] c000ecf8 .do_softirq+0xa0/0x104 [c1013910] c00be1d0 .irq_exit+0x74/0xd4 [c1013990] c002ce20 .timer_interrupt+0x1cc/0x200 [c1013a30] c0003728 decrementer_common+0x128/0x180 --- Exception: 901 (Decrementer) at c000ec3c .raw_local_irq_restore+0xc0/0xdc [c1013dc0] c0015824 .cpu_idle+0x13c/0x1e0 [c1013e60] c0009fe8 .rest_init+0x94/0xcc [c1013ee0] c0990cf4 .start_kernel+0x484/0x4a8 [c1013f90] c0008408 .start_here_common+0x2c/0xa4 0:mon e cpu 0x0: Vector: 700 (Program Check) at [ca90] pc: c060: .flow_cache_new_hashrnd+0x3c/0xcc lr: c00c6038: .run_timer_softirq+0x20c/0x2f4 sp: cd10 msr: 80089032 current = 0xc0f58b70 paca= 0xc10b2400 pid = 0, comm = swapper 0:mon di c060 c060 1010 .long 0x1010 c064 0008 .long 0x8 c068 1013 .long 0x1013 c06c 000f .long 0xf c0600010 7961626f rldimi. r1,r11,44,41 c0600014 6f74 xoris r20,r27,0 c0600018 00101600 .long 0x101600 c060001c 0c00 .long 0xc00 c0600020 0400 .long 0x400 c0600024 00101100 .long 0x101100 c0600028 08e9 .long 0x8e9 c060002c 6000 nop c0600030 e93e8050 ld r9,-32688(r30) c0600034 7c6507b4 extsw r5,r3 c0600038 78ab4da4 rldicr r11,r5,9,54 c060003c 396b0040 addir11,r11,64 0:mon r R00 = c00c6038 R16 = R01 = cd10 R17 = 0100 R02 = c100 R18 = c1059300 R03 = R19 = c1102f98 R04 = cd60 R20 = c1103398 R05 = R21 = c1103798 R06 = 0700 R22 = c1103b98 R07 = 0001 R23 = R08 = 0083 R24 = cfffc000 R09 = c072f890 R25 = R10 = c120fbf8 R26 = 00200200 R11 = c0600090 R27 = ce10 R12 = 2822 R28 = 0001 R13 = c10b2400 R29 = c10805e8 R14 = 00382800 R30 = c0fb4f78 R15 = R31 = cd10 pc = c060 .flow_cache_new_hashrnd+0x3c/0xcc lr = c00c6038 .run_timer_softirq+0x20c/0x2f4 msr = 80089032 cr = 2824 ctr = c05fffc4 xer = trap = 700 0:mon dl 6Phyp-dump disabled at boot time 6Using pSeries machine description 7Page orders: linear mapping = 24, virtual = 16, io = 12, vmemmap = 24 6Using 1TB segments 4Found initrd at 0xc200:0xc2382800 6console [udbg0] enabled 6Partition configured for 8 cpus. 6CPU maps initialized for 2 threads per core