Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

2007-08-04 Thread Mr. James W. Laferriere

Hello Attila,

On Thu, 2 Aug 2007, Attila Nagy wrote:

On 2007.08.01. 0:08, Roger Heflin wrote:

Attila Nagy wrote:

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 TSC 1167e915e93ce
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b2401400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 5 TSC 1167e915e9ea8
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor


Attila,

We had some issues with very similar boards. All of the problems
seem to be around the PCI-X bus area of the machine. Setting the
PCI-X buses to 66 MHz in the BIOS made things stable (but slow). Not using
the PCI-X bus also seemed to make things work. We got MCEs and
other odd crashes under heavy IO loads. I believe turning things
down to 100 MHz made things more stable, but things still crashed.

Supermicro reported being able to fix the issue by setting
PCI Configuration -> PCI-e I/O performance
to Coalesce 128B.

I am not exactly sure where to set it; we did not try it,
as we had already changed to a different motherboard that did not
have the issue.

If this works please tell me.

Roger, you are my hero. :)
With that PCI-e setting (again, for the record, this is on a Supermicro X7DBE
motherboard, and the BIOS setting is PCIe I/O performance, which has two
states: Coalesce and Payload 256B) all four machines have survived half a day
of continuous bashing. Previously one or two machines typically fell over
after that amount of IO load, so it looks promising so far.

I hope this won't change over time.

BTW, this is still with 2.6.21.5, because the SCSI target stuff I use (SCST)
has some (I hope temporary) problems with changed (deleted) interfaces in
newer kernels.


Should the DEBUG_SHIRQ problem in e1000 affect stability (or performance)?

Thanks,


I too have a SuperMicro MB, but it is an X7DB8. Same symptoms.
I have reported MCE problems here a couple of times.

I set the BIOS setting 'PCIe I/O performance' to 'Coalesce'.
For everyone's information, stability went way up; SCSI IO is roughly half
of what it was, but if there's no stability ...


I'm going to try their 1.3b BIOS update & see if that helps any.
IIRC, some said they'd already acquired the latest for their MB &
that did not help them at all. What the heck, I'll give it a try anyway.

Hth ,  JimL
--
+-+
| James   W.   Laferriere | System   Techniques | Give me VMS |
| NetworkEngineer | 663  Beaumont  Blvd |  Give me Linux  |
| [EMAIL PROTECTED] | Pacifica, CA. 94044 |   only  on  AXP |
+-+
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

2007-08-02 Thread Attila Nagy

On 2007.08.01. 0:08, Roger Heflin wrote:

Attila Nagy wrote:

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 TSC 1167e915e93ce
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b2401400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 5 TSC 1167e915e9ea8
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor


Attila,

We had some issues with very similar boards. All of the problems
seem to be around the PCI-X bus area of the machine. Setting the
PCI-X buses to 66 MHz in the BIOS made things stable (but slow). Not using
the PCI-X bus also seemed to make things work. We got MCEs and
other odd crashes under heavy IO loads. I believe turning things
down to 100 MHz made things more stable, but things still crashed.

Supermicro reported being able to fix the issue by setting
PCI Configuration -> PCI-e I/O performance
to Coalesce 128B.

I am not exactly sure where to set it; we did not try it,
as we had already changed to a different motherboard that did not
have the issue.

If this works please tell me.

Roger, you are my hero. :)
With that PCI-e setting (again, for the record, this is on a Supermicro X7DBE
motherboard, and the BIOS setting is PCIe I/O performance, which has two
states: Coalesce and Payload 256B) all four machines have survived half a day
of continuous bashing. Previously one or two machines typically fell over
after that amount of IO load, so it looks promising so far.
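
For anyone who wants to see from the running system what a BIOS change like this did, one possible approach is to compare the PCIe payload sizes that `lspci -vv` reports before and after. This is only a sketch: the thread does not confirm that the "PCIe I/O performance" option maps to the PCIe Max Payload Size, the sample text below is a made-up excerpt in the style of `lspci -vv` output (not taken from these machines), and the function name is hypothetical.

```python
import re

# Hypothetical excerpt in the style of `lspci -vv`; not from the machines
# discussed in this thread.
SAMPLE = """\
03:00.0 RAID bus controller: Example device
\tCapabilities: [d0] Express (v1) Endpoint, MSI 00
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0, Latency L0s <128ns, L1 unlimited
\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-
\t\t\tMaxPayload 128 bytes, MaxReadReq 512 bytes
"""


def max_payload_by_device(text):
    """Map each device address to its DevCap/DevCtl MaxPayload values in bytes."""
    result = {}
    device = None
    section = None
    for line in text.splitlines():
        # A device header is the only kind of line that is not indented.
        if line and not line[0].isspace():
            device = line.split()[0]
            section = None
            continue
        stripped = line.strip()
        for name in ("DevCap", "DevCtl"):
            if stripped.startswith(name + ":"):
                section = name
        match = re.search(r"MaxPayload (\d+) bytes", stripped)
        if match and device and section:
            result.setdefault(device, {})[section] = int(match.group(1))
    return result


for dev, sizes in max_payload_by_device(SAMPLE).items():
    print(dev, "supports", sizes.get("DevCap"), "bytes, currently uses",
          sizes.get("DevCtl"), "bytes")
```

On a real system the input would be the saved output of `lspci -vv` (run as root so the capability blocks are visible). DevCap is what the device supports and DevCtl is the value currently programmed.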

I hope this won't change over time.

BTW, this is still with 2.6.21.5, because the SCSI target stuff I use (SCST)
has some (I hope temporary) problems with changed (deleted) interfaces in
newer kernels.


Should the DEBUG_SHIRQ problem in e1000 affect stability (or performance)?

Thanks,


Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

2007-07-31 Thread Roger Heflin

Attila Nagy wrote:

On 2007.07.30. 18:19, Alan Cox wrote:

MCE:

[153103.918654] HARDWARE ERROR
[153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
[153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
[153104.145699] TSC 1167e915e93ce
[153104.183554] This is not a software problem!
[153104.234724] Run through mcelog --ascii to decode and contact your hardware vendor



If you run it through mcelog as it suggests, it will decode the meaning of the
MCE data and that should give you some idea. Generally speaking MCE
errors are real hardware errors but can certainly be caused by external
factors (power supply glitches, heat etc)
  
Sorry, of course I ran that through mcelog, but inadvertently attached
the original version.

I've tried the machines with two types of power sources (different UPSes,
line filtering, etc., and the chassis have redundant PSUs) and monitored the
temperatures (they seem to be OK; the CPUs don't go over 30 °C even under
load). I have the latest BIOS for the motherboard.
But I will recheck everything.

BTW, here's the output from mcelog; I see this occasionally on all four
machines:


HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 TSC 1167e915e93ce
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b2401400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 5 TSC 1167e915e9ea8
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

Thanks,




Attila,

We had some issues with very similar boards. All of the problems
seem to be around the PCI-X bus area of the machine. Setting the
PCI-X buses to 66 MHz in the BIOS made things stable (but slow). Not using
the PCI-X bus also seemed to make things work. We got MCEs and
other odd crashes under heavy IO loads. I believe turning things
down to 100 MHz made things more stable, but things still crashed.

Supermicro reported being able to fix the issue by setting
PCI Configuration -> PCI-e I/O performance
to Coalesce 128B.

I am not exactly sure where to set it; we did not try it,
as we had already changed to a different motherboard that did not
have the issue.

If this works please tell me.

Roger


Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

2007-07-31 Thread Attila Nagy

On 2007.07.30. 18:19, Alan Cox wrote:

MCE:

[153103.918654] HARDWARE ERROR
[153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
[153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
[153104.145699] TSC 1167e915e93ce
[153104.183554] This is not a software problem!
[153104.234724] Run through mcelog --ascii to decode and contact your hardware vendor



If you run it through mcelog as it suggests, it will decode the meaning of the
MCE data and that should give you some idea. Generally speaking MCE
errors are real hardware errors but can certainly be caused by external
factors (power supply glitches, heat etc)
  
Sorry, of course I ran that through mcelog, but inadvertently attached
the original version.

I've tried the machines with two types of power sources (different UPSes,
line filtering, etc., and the chassis have redundant PSUs) and monitored the
temperatures (they seem to be OK; the CPUs don't go over 30 °C even under
load). I have the latest BIOS for the motherboard.
But I will recheck everything.

BTW, here's the output from mcelog; I see this occasionally on all four
machines:


HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 TSC 1167e915e93ce
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b2401400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 5 TSC 1167e915e9ea8
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
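
For readers who want to check the mcelog decoding by hand, here is a minimal Python sketch that picks apart the Bank 5 STATUS value and the MCGSTATUS value shown above. The bit positions follow the architectural MCi_STATUS and MCG_STATUS layouts described in the Intel SDM; the helper names are made up for this illustration and are not part of mcelog. The Bank 0 value is left out because it appears truncated in the archive.

```python
# Flag bits of an MCi_STATUS register (architectural layout, Intel SDM).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds an address
    57: "PCC",    # processor context corrupt
}

# Flag bits of the global MCG_STATUS register.
MCG_STATUS_FLAGS = {
    0: "RIPV",    # restart instruction pointer valid
    1: "EIPV",    # error instruction pointer valid
    2: "MCIP",    # machine check in progress
}

INTERNAL_TIMER_ERROR = 0x0400  # simple MCA error code for "Internal Timer error"


def set_flags(value, table):
    """Return the names of all flags from `table` that are set in `value`."""
    return [name for bit, name in table.items() if (value >> bit) & 1]


def mca_error_code(status):
    """The MCA error code lives in the low 16 bits of MCi_STATUS."""
    return status & 0xFFFF


# Values copied from the Bank 5 record above.
bank5_status = 0xB200221024080400
mcg_status = 0x5

print("MCi flags:", set_flags(bank5_status, MCI_STATUS_FLAGS))
print("MCG flags:", set_flags(mcg_status, MCG_STATUS_FLAGS))
print("MCA code: ", hex(mca_error_code(bank5_status)))
print("internal timer error:", mca_error_code(bank5_status) == INTERNAL_TIMER_ERROR)
```

Run as-is, this reports VAL, UC, EN and PCC for the bank status, RIPV and MCIP for the global status, and an MCA error code of 0x400, which agrees with the "Uncorrected error", "Error enabled", "Processor context corrupt" and "Internal Timer error" lines in the mcelog output.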

Thanks,



Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

2007-07-30 Thread Kok, Auke

Attila Nagy wrote:

Hello,

I have four identical machines, based on Supermicro X7DBE motherboards.
All of the machines have two Xeon 5130 CPUs, 16GB RAM and two add-on
cards: an Areca 1261 SATA RAID HBA and a Qlogic QLA2342 fibre channel HBA.

I would like to use these as file servers (via FC), but during the performance
and reliability tests it turned out that the machines are very unreliable,
even though they seemed to be OK hardware-wise (memtest and the usual stuff).

During the debugging of this (seemingly) high IO load related problem, I have
observed the following:
- when MSI is enabled (the first iteration), the machines sometimes "hang",
  but not the whole system, just the SCSI target subsystem (SCST), which does
  heavy IO on the Areca (arcmsr) and the Qlogic (patched qla2xxx) HBAs
- when MSI is disabled, I couldn't reproduce that hung-up state; instead the
  machines sometimes throw an MCE (see below), but I couldn't find its cause
- when MSI is disabled and CONFIG_DEBUG_SHIRQ is enabled, the machines
  can't even boot normally; I get an oops instantly during kernel
  initialization
- with MSI disabled, the machines sometimes fail to respond: the ssh sessions
  terminate and on the console I can't type for many seconds. I have nearly
  all debugging turned on, but can't see anything in the logs or on the
  console. The machine recovers from this hang automatically. The whole thing
  looks like what happens when high (e.g. network) interrupt activity hits a
  highly loaded machine, but I could observe this even after a fresh boot,
  with nothing (apart from the standard stuff, sshd and the others) running
  on the machine.


The kernel is 2.6.21.5 (I've tried 2.6.18, the effects are the same),
running in 64-bit mode.


Something is definitely not happy on this system. There was an e1000 fix related
to DEBUG_SHIRQ in 2.6.22, so I definitely advise you to test 2.6.22.1
immediately - however:



The oops I get with MSI disabled and CONFIG_DEBUG_SHIRQ enabled:

[   92.681320] NET: Registered protocol family 17
[   93.491658] Unable to handle kernel NULL pointer dereference at 
 RIP:
[   93.557402]  [<>]
[   93.626770] PGD 0
[   93.651106] Oops: 0010 [1] SMP
[   93.689170] CPU 1
[   93.713506] Modules linked in:
[   93.750322] Pid: 1, comm: swapper Not tainted 2.6.21.5 #1
[   93.815011] RIP: 0010:[<>]  [<>]
[   93.887187] RSP: 0018:81042fc5dc68  EFLAGS: 00010002
[   93.950836] RAX: 81042fbe6b70 RBX: 0202 RCX: 81042fbe6b70
[   94.036323] RDX: c204 RSI: 81042f51cdf8 RDI: 81042fbe6800
[   94.121812] RBP: 81042fc5dd10 R08:  R09: 81042f4c0ea8
[   94.207298] R10:  R11: 81042fbe6800 R12: fff4
[   94.292788] R13: 81042fbe6000 R14: 0001 R15: 80399450
[   94.378275] FS:  () GS:81042fc694c8() knlGS:
[   94.475307] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[   94.544153] CR2:  CR3: 00201000 CR4: 06e0
[   94.629643] Process swapper (pid: 1, threadinfo 81042fc5c000, task 81042fc58040)
[   94.726673] Stack:  80399559 81042fc5dca0 802121b2 81042f4c0ea0
[   94.823603]  81042fc5dca0 80221cd7 802addb5 81042fbe6800
[   94.913042]  8020c9bc 81042fbe6b70 0246 81042fbe6b70
[   95.000194] Call Trace:
[   95.031814]  [80399559] e1000_intr+0x109/0x590
[   95.095461]  [802121b2] poison_obj+0x42/0x60
[   95.157027]  [80221cd7] dbg_redzone1+0x17/0x30
[   95.220676]  [802addb5] request_irq+0x95/0x150
[   95.284324]  [8020c9bc] cache_alloc_debugcheck_after+0x17c/0x1c0
[   95.366690]  [8020a43d] kmem_cache_alloc+0xcd/0xf0
[   95.434500]  [80399450] e1000_intr+0x0/0x590
[   95.496067]  [802ade00] request_irq+0xe0/0x150
[   95.559716]  [8039558c] e1000_request_irq+0x3c/0x80
[   95.628564]  [803985bc] e1000_open+0x5c/0x100
[   95.691172]  [8041d937] dev_open+0x37/0x80
[   95.750661]  [8041becd] dev_change_flags+0x6d/0x150
[   95.819508]  [80616565] ip_auto_config+0x175/0xea0
[   95.887317]  [80442f88] tcp_set_default_congestion_control+0x18/0x70
[   95.973947]  [80442fcf] tcp_set_default_congestion_control+0x5f/0x70
[   96.060582]  [80265236] _spin_unlock+0x26/0x30
[   96.124227]  [805f1754] init+0x1a4/0x2b0
[   96.181635]  [802a0e7b] trace_hardirqs_on+0x14b/0x180
[   96.252563]  [8025ff28] child_rip+0xa/0x12
[   96.312051]  [8026563b] _spin_unlock_irq+0x2b/0x40
[   96.379859]  [8025f63c] restore_args+0x0/0x30
[   96.442467]  [805f15b0] init+0x0/0x2b0
[   96.497795]  [8025ff1e] child_rip+0x0/0x12
[   96.557282]
[   96.575170]
[   96.575171] Code:  Bad RIP value.
[   96.633203] RIP  [<>]
[   96.677297]  RSP 
[   96.719105] CR2: 
[   96.758835] Kernel panic - not syncing: Attempted to kill init!
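
A quick arithmetic check ties the stack dump to the call trace. The Python sketch below uses values copied from the oops: the first word after "Stack:", the value of R15, and the first Call Trace entry. The archive has truncated the upper address bits, so the numbers are treated as opaque hex values, and the helper name is made up for this illustration.

```python
import re

# Values copied from the oops above (upper address bits are missing in the
# archive, so only the low-order part is compared).
STACK_TOP = 0x80399559                   # first word after "Stack:"
R15 = 0x80399450                         # R15 at the time of the oops
TRACE_ENTRY = "e1000_intr+0x109/0x590"   # first Call Trace entry


def parse_trace_entry(entry):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size)."""
    match = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", entry)
    if match is None:
        raise ValueError(f"unrecognised trace entry: {entry!r}")
    return match.group(1), int(match.group(2), 16), int(match.group(3), 16)


symbol, offset, size = parse_trace_entry(TRACE_ENTRY)

# Subtracting the offset from the return address gives the function start.
function_start = STACK_TOP - offset

print(f"{symbol} starts at {function_start:#x}")
print("matches R15:", function_start == R15)
print("offset lies inside the function:", 0 <= offset < size)
```

Run as-is, the computed start is 0x80399450, equal to R15, so the first stack word is a return address inside e1000_intr. That fits a call through a NULL function pointer made from the interrupt handler, although the trace alone does not prove it.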


MCE:
[153103.918654] HARDWARE ERROR
[153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
[153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
[153104.145699] TSC 1167e915e93ce

Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

2007-07-30 Thread Alan Cox

> MCE:
> [153103.918654] HARDWARE ERROR
> [153103.918655] CPU 1: Machine Check Exception:5 Bank 0: b2401400
> [153104.066037] RIP !INEXACT! 10:802569e6 {mwait_idle+0x46/0x60}
> [153104.145699] TSC 1167e915e93ce
> [153104.183554] This is not a software problem!
> [153104.234724] Run through mcelog --ascii to decode and contact your hardware vendor

If you run it through mcelog as it suggests, it will decode the meaning of the
MCE data and that should give you some idea. Generally speaking MCE
errors are real hardware errors but can certainly be caused by external
factors (power supply glitches, heat etc)

