Re: what causes Machine Check exception? revisited (2.2.18)

2001-05-08 Thread Mike Fedyk

On Mon, May 07, 2001 at 11:57:17AM +0100, Alan Cox wrote:
> Generally it indicates a CPU problem but I've seen it caused by overclocking
> and poorly fitted heatsinks.
I've been able to trigger a Machine check error on PPC when trying to boot
directly from OF with a COFF kernel.  The system has worked perfectly with
BootX.

I wonder why this is the first non-x86 report...

Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Simon Richter

On Mon, 7 May 2001, Dan Hollis wrote:

> > Erm, it was bad RAM every time it happened to me. On standard PCs, you
> > don't see those because you don't have ECC and the error is simply not
> > detected.

> So a 440BX motherboard with ECC RAM is a non-standard PC?

I bet the board doesn't force you to use ECC RAM, so manufacturers will
not use it because it's too expensive and the average customer doesn't
understand what memory is and what it's used for. So yes, it's
non-standard.

   Simon

-- 
GPG public key available from http://phobos.fs.tum.de/pgp/Simon.Richter.asc
 Fingerprint: DC26 EB8D 1F35 4F44 2934  7583 DBB6 F98D 9198 3292
Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread!




RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread nick

Yep, totally.  I've worked on hundreds of systems and fewer than 20 of the
workstations or PCs were using ECC.  Most servers do, but not even
all of them.
Nick

On Mon, 7 May 2001, Dan Hollis wrote:

> On Mon, 7 May 2001, Simon Richter wrote:
> > On Mon, 7 May 2001, Bene, Martin wrote:
> > > Definitely not caused by:
> > >   Bad RAM, motherboard chipset.
> > Erm, it was bad RAM every time it happened to me. On standard PCs, you
> > don't see those because you don't have ECC and the error is simply not
> > detected.
> 
> So a 440BX motherboard with ECC RAM is a non-standard PC?
> 
> -Dan



RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Dan Hollis

On Mon, 7 May 2001, Simon Richter wrote:
> On Mon, 7 May 2001, Bene, Martin wrote:
> > Definitely not caused by:
> > Bad RAM, motherboard chipset.
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

So a 440BX motherboard with ECC RAM is a non-standard PC?

-Dan




RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Simon Richter

On Mon, 7 May 2001, Bene, Martin wrote:

[MCE caused by bad RAM]

> I don't think there is a way a machine check exception can be triggered by
> software - which it would have to be in order to be caused by bad RAMs.

An MCE is triggered by an ECC error - no software involved. A good trap
handler will then see if the error is recoverable (one-bit errors are),
notify userspace (so the admin gets mailed) and move the data out of this
page.
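
To illustrate why one-bit errors are recoverable, here is a minimal sketch of single-error correction with a Hamming(7,4) code. This code is an assumption chosen for brevity: real ECC memory commonly uses a wider SECDED code over 64 data bits, and the function names below are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Bit i of the result is codeword position i + 1.
    Positions 1, 2 and 4 hold parity; 3, 5, 6 and 7 hold data.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(word):
    """Return (data nibble, syndrome) after correcting at most one flipped bit.

    A syndrome of 0 means no error was seen; otherwise it is the
    1-based position of the bit that was flipped back.
    """
    bits = [(word >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # repair the single-bit error
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    codeword = hamming74_encode(0b1011)
    damaged = codeword ^ (1 << 4)  # flip the bit at position 5
    print(hamming74_decode(damaged))
```

A plain Hamming(7,4) code would miscorrect a two-bit error; SECDED schemes add an overall parity bit so that double-bit errors are detected and can be reported as unrecoverable instead.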

   Simon




RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Ricardo Galli

>> Definitely not caused by:
>> Bad RAM, motherboard chipset.
>
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

I had the same problem with an SMP Intel 440LX that had run without any
problem since 1998. When I installed 2.2.18 it could not run for more than 5
minutes (Alan suggested to me it was ...).

I am not sure it's a RAM problem, because it never gave/gives a SEGFAULT
compiling the kernel. I brought it back to 2.2.16 and it's running happily.

Could it be some SMP/BIOS-related problem? If it's the RAM or chipset, I am
scared how we could use it for three years and suddenly it hangs with a new
version of the kernel... Should we blame Intel?


--ricardo
http://m3d.uib.es/~gallir/




Re: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Alan Cox

> You get SIG11 errors when running programs (kernel compile seems to be a good
> example), you get crashing processes, you get all sorts of weird funnies but
> you really shouldn't get machine check exceptions.
> 
> I don't think there is a way a machine check exception can be triggered by
> software - which it would have to be in order to be caused by bad RAMs.

Bad ECC memory and unrecoverable ECC faults can certainly be reported back to
the processor electrically. Also an L2 cache load failing when the RAM fails
to ack the signals is quite visible to a processor.




RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Bene, Martin

Hi Simon,

> On Mon, 7 May 2001, Bene, Martin wrote:
> 
> > Definitely not caused by:
> > Bad RAM, motherboard chipset.
> 
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

Strange - definitely, strange. Of course you're correct about memory errors
going undetected on standard PC hardware, and usually these undetected
errors lead to other failures later on:

You get SIG11 errors when running programs (kernel compile seems to be a good
example), you get crashing processes, you get all sorts of weird funnies but
you really shouldn't get machine check exceptions.

I don't think there is a way a machine check exception can be triggered by
software - which it would have to be in order to be caused by bad RAMs.

Bye, Martin



RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Simon Richter

On Mon, 7 May 2001, Bene, Martin wrote:

> Definitely not caused by:
>   Bad RAM, motherboard chipset.

Erm, it was bad RAM every time it happened to me. On standard PCs, you
don't see those because you don't have ECC and the error is simply not
detected.

   Simon




Re: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Alan Cox

> After searching the archives of the list I found some similar reports
> from September and December 2000 but as far as I understood the cause of
> the error was blamed on the CPU. Is this the most probable case? 

A machine check (trap 18) is signalled by the processor when it thinks it is
in an invalid state. Many x86 cpus have checking circuitry and the default
behaviour is to either reboot or continue-and-pray. 

Linux enables notification of these events. So yes, your processor was unhappy.
But it can be unhappy because of wrong voltages, electrical noise, overheating
and many other things.

Generally it indicates a CPU problem but I've seen it caused by overclocking
and poorly fitted heatsinks.




RE: what causes Machine Check exception? revisited (2.2.18)

2001-05-07 Thread Bene, Martin

Hi Juhan,

> After searching the archives of the list I found some similar reports
> from September and December 2000 but as far as I understood the cause of
> the error was blamed on the CPU. Is this the most probable case?
> 
> Best regards,
> 
> Juhan Ernits
> 
>   -- /var/log/kern.log
> 
> May  6 06:47:25 market kernel: CPU 0: Machine Check Exception: 0004
> May  6 06:47:25 market kernel: Bank 4: b2040151<0>Kernel panic: CPU context corrupt

Yes, the consensus of the messages I received is that it's the CPU flagging an
internal hardware problem.
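
The "CPU context corrupt" panic in the quoted log suggests the processor reported that its own state could not be trusted. Here is a minimal sketch of how the flag bits of a 64-bit machine-check bank status value can be decoded, assuming the bit layout documented for IA32_MCi_STATUS in the Intel architecture manuals. The function name and the sample value are hypothetical: the sample only reuses the leading byte b2 from the logged word, and the log line itself is too truncated to decode reliably.

```python
# Flag bits in the top byte of a 64-bit machine-check bank status value.
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error record
    62: "OVER",   # an earlier error record was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the bank's MISC register holds extra detail
    58: "ADDRV",  # the bank's ADDR register holds an address
    57: "PCC",    # processor context corrupt: state is unreliable
}


def decode_mc_status(status):
    """Split a 64-bit status value into named flags and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


if __name__ == "__main__":
    # Hypothetical sample: only the top byte (0xb2) is populated.
    sample = 0xB2 << 56
    print(decode_mc_status(sample))
```

With a top byte of b2 the VAL, UC, EN and PCC flags are set, which would be consistent with an uncorrected error that left the processor context unreliable, assuming that byte really is the top of the status register.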

Suggested causes include:
overclocking
thermal problems
CPU actually bad

Definitely not caused by:
Bad RAM, motherboard chipset.

In my case the error only occurred once and never again - chalked it up to bad
karma on that day.

Bye, Martin