Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
Hello Attila,

On Thu, 2 Aug 2007, Attila Nagy wrote:
> On 2007.08.01. 0:08, Roger Heflin wrote:
>> Attila Nagy wrote:
>>> HARDWARE ERROR
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> CPU 1 BANK 0 TSC 1167e915e93ce
>>> MCG status:RIPV MCIP
>>> MCi status:
>>> Uncorrected error
>>> Error enabled
>>> Processor context corrupt
>>> MCA: Internal Timer error
>>> STATUS b2401400 MCGSTATUS 5
>>> This is not a software problem!
>>> Run through mcelog --ascii to decode and contact your hardware vendor
>>> HARDWARE ERROR
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> CPU 1 BANK 5 TSC 1167e915e9ea8
>>> MCG status:RIPV MCIP
>>> MCi status:
>>> Uncorrected error
>>> Error enabled
>>> Processor context corrupt
>>> MCA: Internal Timer error
>>> STATUS b200221024080400 MCGSTATUS 5
>>> This is not a software problem!
>>> Run through mcelog --ascii to decode and contact your hardware vendor
>>
>> Attila,
>>
>> We had some issues with very similar boards; all of the problems
>> seemed to be around the PCI-X bus area of the machine. Setting the
>> PCI-X buses to 66 MHz in the BIOS made things stable (but slow).
>> Not using the PCI-X bus also seemed to make things work. We got
>> MCEs and other odd crashes under heavy I/O loads. I believe turning
>> things down to 100 MHz made things more stable, but things still
>> crashed.
>>
>> Supermicro reported being able to fix the issue by setting
>> PCI Configuration -> PCI-e I/O performance to Coalesce 128B.
>>
>> I am not exactly sure where to set it, as we did not try it: we had
>> already changed to a different motherboard that did not have the
>> issue.
>>
>> If this works, please tell me.
>
> Roger, you are my hero. :)
> With that PCI-e setting (again, for the record, this is on a
> Supermicro X7DBE motherboard, and the BIOS setting is PCIe I/O
> performance, which has two states: Coalesce and Payload 256B), all
> four machines have survived half a day of continuous bashing.
> Previously one or two machines typically fell off after that amount
> of I/O load, so it looks promising so far. I hope this won't change
> over time.
>
> BTW, this is still with 2.6.21.5, because the SCSI target stuff I use
> (SCST) has some (I hope temporary) problems with changed (deleted)
> interfaces in newer kernels.
> Should the DEBUG_SHIRQ problem in e1000 affect stability (or
> performance)?
>
> Thanks,

I too have a SuperMicro MB, but it is an X7DB8. Same symptoms.
Reported MCE problems here a couple of times.
I set the BIOS setting 'PCIe I/O performance' to 'Coalesce'.
For everyone's information, stability went way up; SCSI I/O is ~half,
but if there's no stability ...
I'm going to try their 1.3b BIOS update & see if that helps any.
IIRC, some said they'd already acquired the latest for their MB & that
did not help them at all. What the heck, I'll give it a try anyway.

Hth, JimL
--
+-----------------------------------------------------------------+
| James W. Laferriere | System Techniques   | Give me VMS         |
| Network Engineer    | 663 Beaumont Blvd   | Give me Linux       |
| [EMAIL PROTECTED]   | Pacifica, CA. 94044 | only on AXP         |
+-----------------------------------------------------------------+
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
On 2007.08.01. 0:08, Roger Heflin wrote:
> Attila Nagy wrote:
>> HARDWARE ERROR
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 1 BANK 0 TSC 1167e915e93ce
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b2401400 MCGSTATUS 5
>> This is not a software problem!
>> Run through mcelog --ascii to decode and contact your hardware vendor
>> HARDWARE ERROR
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 1 BANK 5 TSC 1167e915e9ea8
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200221024080400 MCGSTATUS 5
>> This is not a software problem!
>> Run through mcelog --ascii to decode and contact your hardware vendor
>
> Attila,
>
> We had some issues with very similar boards; all of the problems
> seemed to be around the PCI-X bus area of the machine. Setting the
> PCI-X buses to 66 MHz in the BIOS made things stable (but slow).
> Not using the PCI-X bus also seemed to make things work. We got
> MCEs and other odd crashes under heavy I/O loads. I believe turning
> things down to 100 MHz made things more stable, but things still
> crashed.
>
> Supermicro reported being able to fix the issue by setting
> PCI Configuration -> PCI-e I/O performance to Coalesce 128B.
>
> I am not exactly sure where to set it, as we did not try it: we had
> already changed to a different motherboard that did not have the
> issue.
>
> If this works, please tell me.

Roger, you are my hero. :)
With that PCI-e setting (again, for the record, this is on a
Supermicro X7DBE motherboard, and the BIOS setting is PCIe I/O
performance, which has two states: Coalesce and Payload 256B), all
four machines have survived half a day of continuous bashing.
Previously one or two machines typically fell off after that amount
of I/O load, so it looks promising so far. I hope this won't change
over time.

BTW, this is still with 2.6.21.5, because the SCSI target stuff I use
(SCST) has some (I hope temporary) problems with changed (deleted)
interfaces in newer kernels.
Should the DEBUG_SHIRQ problem in e1000 affect stability (or
performance)?

Thanks,
Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
Attila Nagy wrote:
> On 2007.07.30. 18:19, Alan Cox wrote:
>>> MCE:
>>> [153103.918654] HARDWARE ERROR
>>> [153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
>>> [153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
>>> [153104.145699] TSC 1167e915e93ce
>>> [153104.183554] This is not a software problem!
>>> [153104.234724] Run through mcelog --ascii to decode and contact your hardware vendor
>>
>> If you run it through mcelog as it suggests, it will decode the
>> meaning of the MCE data and that should give you some idea.
>> Generally speaking, MCE errors are real hardware errors but can
>> certainly be caused by external factors (power supply glitches,
>> heat, etc.).
>
> Sorry, of course I ran that through mcelog, but inadvertently
> attached the original version.
> I've tried the machines with two types of power sources (different
> UPSes, line filtering, etc., and the chassis have redundant PSUs)
> and monitored the temperatures (they seem to be OK; the CPUs don't
> go over 30 °C even under load). I have the latest BIOS for the
> motherboard.
> But I will recheck everything.
>
> BTW, here's the output from mcelog; I see this occasionally on all
> four machines:
> HARDWARE ERROR
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 1 BANK 0 TSC 1167e915e93ce
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b2401400 MCGSTATUS 5
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
> HARDWARE ERROR
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 1 BANK 5 TSC 1167e915e9ea8
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b200221024080400 MCGSTATUS 5
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
>
> Thanks,

Attila,

We had some issues with very similar boards; all of the problems
seemed to be around the PCI-X bus area of the machine. Setting the
PCI-X buses to 66 MHz in the BIOS made things stable (but slow).
Not using the PCI-X bus also seemed to make things work. We got
MCEs and other odd crashes under heavy I/O loads. I believe turning
things down to 100 MHz made things more stable, but things still
crashed.

Supermicro reported being able to fix the issue by setting
PCI Configuration -> PCI-e I/O performance to Coalesce 128B.

I am not exactly sure where to set it, as we did not try it: we had
already changed to a different motherboard that did not have the
issue.

If this works, please tell me.

Roger
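
To put "stable (but slow)" into rough numbers, here is a back-of-the-envelope sketch. It assumes a 64-bit wide PCI-X bus that moves one bus word per clock and uses nominal clock figures; neither the bus width of the cards in this thread nor any measured throughput is stated in the messages, and real transfers come in below these theoretical peaks. The function name is hypothetical.

```python
def pcix_peak_mb_per_s(clock_mhz, width_bits=64):
    """Theoretical peak in MB/s: one transfer of width_bits per bus clock.

    width_bits=64 is an assumption, not something stated in the thread.
    """
    return clock_mhz * width_bits // 8


# Nominal PCI-X clock rates mentioned in the thread (66 and 100 MHz),
# plus the usual full-speed 133 MHz for comparison.
for clock_mhz in (66, 100, 133):
    print(f"{clock_mhz} MHz -> {pcix_peak_mb_per_s(clock_mhz)} MB/s peak")
```

Under these assumptions, dropping a bus from 133 MHz to 66 MHz roughly halves its theoretical ceiling.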
Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
On 2007.07.30. 18:19, Alan Cox wrote:
>> MCE:
>> [153103.918654] HARDWARE ERROR
>> [153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
>> [153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
>> [153104.145699] TSC 1167e915e93ce
>> [153104.183554] This is not a software problem!
>> [153104.234724] Run through mcelog --ascii to decode and contact your hardware vendor
>
> If you run it through mcelog as it suggests, it will decode the
> meaning of the MCE data and that should give you some idea.
> Generally speaking, MCE errors are real hardware errors but can
> certainly be caused by external factors (power supply glitches,
> heat, etc.).

Sorry, of course I ran that through mcelog, but inadvertently
attached the original version.
I've tried the machines with two types of power sources (different
UPSes, line filtering, etc., and the chassis have redundant PSUs)
and monitored the temperatures (they seem to be OK; the CPUs don't
go over 30 °C even under load). I have the latest BIOS for the
motherboard.
But I will recheck everything.

BTW, here's the output from mcelog; I see this occasionally on all
four machines:
HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 TSC 1167e915e93ce
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b2401400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 5 TSC 1167e915e9ea8
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

Thanks,
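
The flags mcelog prints can be cross-checked by hand against the raw STATUS and MCGSTATUS values. The following is a minimal sketch, assuming the bit layout documented for the Intel machine-check architecture (VAL, OVER, UC, EN, MISCV, ADDRV and PCC in bits 63 down to 57 of MCi_STATUS; RIPV, EIPV and MCIP in bits 0 to 2 of MCG_STATUS; the MCA error code in the low 16 bits, with 0x0400 meaning internal timer error). The helper names are hypothetical and are not part of mcelog. Only the bank 5 value is decoded, because the bank 0 value appears shortened in this archive.

```python
# Bit positions assumed from the Intel machine-check architecture layout.
MCI_STATUS_BITS = {
    63: "VAL (register valid)",
    62: "OVER (overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
INTERNAL_TIMER_CODE = 0x0400  # MCA error code for "internal timer error"


def set_flags(value, table):
    """Return the names of the bits from `table` set in `value`, lowest bit first."""
    return [table[bit] for bit in sorted(table) if (value >> bit) & 1]


def mca_error_code(status):
    """Return the low 16 bits of an MCi_STATUS value (the MCA error code)."""
    return status & 0xFFFF


status = 0xB200221024080400   # "STATUS" from the bank 5 record
mcgstatus = 0x5               # "MCGSTATUS" from the same record

print(set_flags(status, MCI_STATUS_BITS))
print(set_flags(mcgstatus, MCG_STATUS_BITS))
print(hex(mca_error_code(status)), mca_error_code(status) == INTERNAL_TIMER_CODE)
```

For the bank 5 value this gives PCC, EN, UC and VAL, with RIPV and MCIP in MCGSTATUS and error code 0x400, which agrees with what mcelog reports: uncorrected error, error enabled, processor context corrupt, internal timer error.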
Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
Attila Nagy wrote:
> Hello,
>
> I have four identical machines, based on Supermicro X7DBE
> motherboards. All of the machines have two Xeon 5130 CPUs, 16GB RAM
> and two add-on cards: an Areca 1261 SATA RAID HBA and a Qlogic
> QLA2342 fibre channel HBA.
>
> I would like to use these as file servers (via FC), but during the
> performance and reliability tests it turned out that the machines
> are very unreliable, even though they seemed to be OK hardware-wise
> (memtest and the usual stuff).
>
> During the debugging of this (seemingly) high I/O load related
> problem, I have observed the following:
> - when MSI is enabled (the first iteration), the machines sometimes
>   "hang", but not the whole system, just the SCSI target subsystem
>   (SCST), which does heavy I/O on the Areca (arcmsr) and the Qlogic
>   (patched qla2xxx) HBAs
> - when MSI is disabled, I couldn't reproduce that hung state;
>   instead the machines sometimes throw an MCE (see below), but I
>   couldn't find its cause
> - when MSI is disabled and CONFIG_DEBUG_SHIRQ is enabled, the
>   machines can't even boot normally; I get an oops instantly during
>   kernel initialization
> - with MSI disabled, sometimes the machines fail to respond, the ssh
>   sessions terminate and on the console I can't type for many
>   seconds. I have nearly all debugging turned on, but can't see
>   anything in the logs or on the console. The machine recovers from
>   this hang automatically. The whole thing looks like what happens
>   when there is high (e.g. network) interrupt activity on a heavily
>   loaded machine, but I could observe this even after a fresh boot,
>   without anything (of course minus the standard stuff, sshd, and
>   the others) running on the machine.
>
> The kernel is 2.6.21.5 (I've tried 2.6.18, the effects are the
> same), running in 64 bit mode.

Something is definitely not happy on this system. There was an e1000
fix related to DEBUG_SHIRQ in 2.6.22, so I definitely advise you to
test 2.6.22.1 immediately - however:

> The oops I get with MSI disabled and CONFIG_DEBUG_SHIRQ enabled:
> [ 92.681320] NET: Registered protocol family 17
> [ 93.491658] Unable to handle kernel NULL pointer dereference at RIP:
> [ 93.557402] [<>]
> [ 93.626770] PGD 0
> [ 93.651106] Oops: 0010 [1] SMP
> [ 93.689170] CPU 1
> [ 93.713506] Modules linked in:
> [ 93.750322] Pid: 1, comm: swapper Not tainted 2.6.21.5 #1
> [ 93.815011] RIP: 0010:[<>] [<>]
> [ 93.887187] RSP: 0018:81042fc5dc68 EFLAGS: 00010002
> [ 93.950836] RAX: 81042fbe6b70 RBX: 0202 RCX: 81042fbe6b70
> [ 94.036323] RDX: c204 RSI: 81042f51cdf8 RDI: 81042fbe6800
> [ 94.121812] RBP: 81042fc5dd10 R08: R09: 81042f4c0ea8
> [ 94.207298] R10: R11: 81042fbe6800 R12: fff4
> [ 94.292788] R13: 81042fbe6000 R14: 0001 R15: 80399450
> [ 94.378275] FS: () GS:81042fc694c8() knlGS:
> [ 94.475307] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
> [ 94.544153] CR2: CR3: 00201000 CR4: 06e0
> [ 94.629643] Process swapper (pid: 1, threadinfo 81042fc5c000, task 81042fc58040)
> [ 94.726673] Stack: 80399559 81042fc5dca0 802121b2 81042f4c0ea0
> [ 94.823603] 81042fc5dca0 80221cd7 802addb5 81042fbe6800
> [ 94.913042] 8020c9bc 81042fbe6b70 0246 81042fbe6b70
> [ 95.000194] Call Trace:
> [ 95.031814] [80399559] e1000_intr+0x109/0x590
> [ 95.095461] [802121b2] poison_obj+0x42/0x60
> [ 95.157027] [80221cd7] dbg_redzone1+0x17/0x30
> [ 95.220676] [802addb5] request_irq+0x95/0x150
> [ 95.284324] [8020c9bc] cache_alloc_debugcheck_after+0x17c/0x1c0
> [ 95.366690] [8020a43d] kmem_cache_alloc+0xcd/0xf0
> [ 95.434500] [80399450] e1000_intr+0x0/0x590
> [ 95.496067] [802ade00] request_irq+0xe0/0x150
> [ 95.559716] [8039558c] e1000_request_irq+0x3c/0x80
> [ 95.628564] [803985bc] e1000_open+0x5c/0x100
> [ 95.691172] [8041d937] dev_open+0x37/0x80
> [ 95.750661] [8041becd] dev_change_flags+0x6d/0x150
> [ 95.819508] [80616565] ip_auto_config+0x175/0xea0
> [ 95.887317] [80442f88] tcp_set_default_congestion_control+0x18/0x70
> [ 95.973947] [80442fcf] tcp_set_default_congestion_control+0x5f/0x70
> [ 96.060582] [80265236] _spin_unlock+0x26/0x30
> [ 96.124227] [805f1754] init+0x1a4/0x2b0
> [ 96.181635] [802a0e7b] trace_hardirqs_on+0x14b/0x180
> [ 96.252563] [8025ff28] child_rip+0xa/0x12
> [ 96.312051] [8026563b] _spin_unlock_irq+0x2b/0x40
> [ 96.379859] [8025f63c] restore_args+0x0/0x30
> [ 96.442467] [805f15b0] init+0x0/0x2b0
> [ 96.497795] [8025ff1e] child_rip+0x0/0x12
> [ 96.557282]
> [ 96.575170]
> [ 96.575171] Code: Bad RIP value.
> [ 96.633203] RIP [<>]
> [ 96.677297] RSP
> [ 96.719105] CR2:
> [ 96.758835] Kernel panic - not syncing: Attempted to kill init!
>
> MCE:
> [153103.918654] HARDWARE ERROR
> [153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
> [153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
> [153104.145699] TSC 1167e915e93ce
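
To pick the symbols out of a call trace like the one quoted above, a small script is usually enough. This is a minimal sketch, assuming trace entries of the form `[timestamp] [address] symbol+0xoffset/0xsize` as they appear in this archive, where the address field may be empty; `parse_frames` is a hypothetical helper, not an existing kernel tool. It matches on the entries themselves rather than on line breaks, so it also works on a trace that has been flattened into one line.

```python
import re

# One call-trace entry: "[ 95.031814] [80399559] e1000_intr+0x109/0x590".
# The bracketed address may be empty ("[]") in some copies of the log.
FRAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<addr>[0-9a-f]*)\]\s+"
    r"(?P<sym>\w+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(text):
    """Return (timestamp, symbol, offset, size) for every trace entry in text."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append(
            (
                float(m.group("ts")),
                m.group("sym"),
                int(m.group("off"), 16),
                int(m.group("size"), 16),
            )
        )
    return frames


# A few entries copied from the oops above, in flattened form.
sample = (
    "[ 95.031814] [80399559] e1000_intr+0x109/0x590 "
    "[ 95.095461] [] poison_obj+0x42/0x60 "
    "[ 95.496067] [802ade00] request_irq+0xe0/0x150"
)

for ts, sym, off, size in parse_frames(sample):
    print(f"{ts:10.6f}  {sym} (+{off:#x} of {size:#x})")
```

The entries come out in log order, so the first one is the top of the trace (`e1000_intr` here).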
Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
> MCE:
> [153103.918654] HARDWARE ERROR
> [153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
> [153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
> [153104.145699] TSC 1167e915e93ce
> [153104.183554] This is not a software problem!
> [153104.234724] Run through mcelog --ascii to decode and contact your hardware vendor

If you run it through mcelog as it suggests, it will decode the
meaning of the MCE data and that should give you some idea.
Generally speaking, MCE errors are real hardware errors but can
certainly be caused by external factors (power supply glitches,
heat, etc.).