Re: what causes Machine Check exception? revisited (2.2.18)
On Mon, May 07, 2001 at 11:57:17AM +0100, Alan Cox wrote:
> Generally it indicates a CPU problem but I've seen it caused by overclocking
> and poorly fitted heatsinks

I've been able to trigger a Machine check error on PPC when trying to boot directly from OF with a COFF kernel. The system has worked perfectly with BootX. I wonder why this is the first non-x86 report...

Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: what causes Machine Check exception? revisited (2.2.18)
On Mon, 7 May 2001, Dan Hollis wrote:
> > Erm, it was bad RAM every time it happened to me. On standard PCs, you
> > don't see those because you don't have ECC and the error is simply not
> > detected.
> So a 440bx motherboard with ECC ram is a non-standard PC?

I bet the board doesn't force you to use ECC RAM, so manufacturers will not use it because it's too expensive and the average customer doesn't understand what memory is and what it's used for. So yes, it's non-standard.

Simon
--
GPG public key available from http://phobos.fs.tum.de/pgp/Simon.Richter.asc
Fingerprint: DC26 EB8D 1F35 4F44 2934 7583 DBB6 F98D 9198 3292
Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread!
RE: what causes Machine Check exception? revisited (2.2.18)
Yep, totally. I've worked on hundreds of systems and fewer than 20 of the workstations or PCs have been using ECC. Most servers do, but not even all of them.

Nick

On Mon, 7 May 2001, Dan Hollis wrote:
> On Mon, 7 May 2001, Simon Richter wrote:
> > On Mon, 7 May 2001, Bene, Martin wrote:
> > > Definitely not caused by:
> > > Bad Rams, mb-chipset.
> > Erm, it was bad RAM every time it happened to me. On standard PCs, you
> > don't see those because you don't have ECC and the error is simply not
> > detected.
>
> So a 440bx motherboard with ECC ram is a non-standard PC?
>
> -Dan
RE: what causes Machine Check exception? revisited (2.2.18)
On Mon, 7 May 2001, Simon Richter wrote:
> On Mon, 7 May 2001, Bene, Martin wrote:
> > Definitely not caused by:
> > Bad Rams, mb-chipset.
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

So a 440bx motherboard with ECC ram is a non-standard PC?

-Dan
RE: what causes Machine Check exception? revisited (2.2.18)
On Mon, 7 May 2001, Bene, Martin wrote:

[MCE caused by bad RAM]

> I don't think there is a way a machine check exception can be triggered by
> software - which it would have to be in order to be caused by bad RAMs.

An MCE is triggered by an ECC error - no software involved. A good trap handler will then see if the error is recoverable (one-bit errors are), notify userspace (so the admin gets mailed) and move the data out of this page.

Simon
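The distinction drawn above between recoverable one-bit errors and everything else can be illustrated with a toy single-error-correct, double-error-detect (SECDED) code. This is only a sketch under stated assumptions: it protects 4 data bits with an extended Hamming(8,4) code, whereas real ECC memory protects much wider words (typically 64 data bits plus 8 check bits), and the function names `encode` and `check` are made up for this illustration rather than taken from any kernel or chipset.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def check(code):
    """Classify a received codeword as ok, corrected, or uncorrectable."""
    # The syndrome is the XOR of the positions of all set bits in 1..7;
    # it is zero for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        return "ok", list(code)
    if parity_odd:
        # An odd number of flips: treat it as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed = list(code)
        fixed[syndrome] ^= 1
        return "corrected", fixed
    # Even parity but a non-zero syndrome: two bits flipped, cannot be repaired.
    return "uncorrectable", list(code)
```

In this toy model a handler would carry on after a "corrected" result and escalate only on "uncorrectable", which mirrors the recoverable versus fatal split described in the message.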
RE: what causes Machine Check exception? revisited (2.2.18)
>> Definitely not caused by:
>> Bad Rams, mb-chipset.
>
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

I had the same problem with an SMP Intel 440LX which has run without any problem since 1998. When I installed 2.2.18 it could not run for more than 5 minutes (Alan suggested to me it was .

I am not sure it's a RAM problem, because it never gave/gives a SEGFAULT compiling the kernel.

I brought it back to 2.2.16 and it's running happily.

Could it be some SMP/BIOS related problem? If it's the RAM or chipset, I am scared about how we could use it for three years and suddenly it hangs with a new version of the kernel... Blame Intel?

--ricardo
http://m3d.uib.es/~gallir/
Re: what causes Machine Check exception? revisited (2.2.18)
> You get SIG11 errors when running programs (kernel compile seems to be a good
> example), you get crashing processes, you get all sorts of weird funnies but
> you really shouldn't get machine check exceptions.
>
> I don't think there is a way a machine check exception can be triggered by
> software - which it would have to be in order to be caused by bad RAMs.

Bad ECC memory and unrecoverable ECC faults can certainly be reported back to the processor electrically. Also an L2 cache load failing when the RAM fails to ack the signals is quite visible to a processor.
RE: what causes Machine Check exception? revisited (2.2.18)
Hi Simon,

> On Mon, 7 May 2001, Bene, Martin wrote:
>
> > Definitely not caused by:
> > Bad Rams, mb-chipset.
>
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

Strange - definitely, strange.

Of course you're correct about memory errors going undetected on standard PC hardware, and usually these undetected errors lead to other failures later on:

You get SIG11 errors when running programs (kernel compile seems to be a good example), you get crashing processes, you get all sorts of weird funnies but you really shouldn't get machine check exceptions.

I don't think there is a way a machine check exception can be triggered by software - which it would have to be in order to be caused by bad RAMs.

Bye, Martin
RE: what causes Machine Check exception? revisited (2.2.18)
On Mon, 7 May 2001, Bene, Martin wrote:
> Definitely not caused by:
> Bad Rams, mb-chipset.

Erm, it was bad RAM every time it happened to me. On standard PCs, you don't see those because you don't have ECC and the error is simply not detected.

Simon
Re: what causes Machine Check exception? revisited (2.2.18)
> After searching the archives of the list I found some similar reports
> from September and December 2000 but as far as I understood the cause of
> the error was blamed on the CPU. Is this the most probable case?

A machine check (trap 18) is signalled by the processor when it thinks it is in an invalid state. Many x86 CPUs have checking circuitry and the default behaviour is to either reboot or continue-and-pray. Linux enables notification of these events.

So yes, your processor was unhappy. But it can be unhappy because of wrong voltages, electrical noise, overheating and many other things.

Generally it indicates a CPU problem but I've seen it caused by overclocking and poorly fitted heatsinks
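As a rough way to see whether a given x86 CPU advertises the checking circuitry mentioned above, one can look for the `mce` (machine check exception) and `mca` (machine check architecture) feature flags that Linux lists in /proc/cpuinfo. The sketch below is a minimal illustration, not kernel code: the helper name `machine_check_flags` and the sample text are assumptions made up for this example, and it only understands the x86 style "flags" line.

```python
def machine_check_flags(cpuinfo_text):
    """Report whether 'mce' and 'mca' appear on any 'flags' line.

    cpuinfo_text is the content of /proc/cpuinfo (x86 layout assumed).
    """
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            found.update(value.split())
    return {name: name in found for name in ("mce", "mca")}


# Hypothetical sample in the style of /proc/cpuinfo, for illustration only.
sample = (
    "processor\t: 0\n"
    "flags\t\t: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov\n"
)
print(machine_check_flags(sample))
```

On a live system the same function could be fed the result of `open("/proc/cpuinfo").read()`.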
RE: what causes Machine Check exception? revisited (2.2.18)
Hi Juhan,

> After searching the archives of the list I found some similar reports
> from September and December 2000 but as far as I understood the cause of
> the error was blamed on the CPU. Is this the most probable case?
>
> Best regards,
>
> Juhan Ernits
>
> -- /var/log/kern.log
>
> May 6 06:47:25 market kernel: CPU 0: Machine Check Exception: 0004
> May 6 06:47:25 market kernel: Bank 4: b2040151<0>Kernel panic: CPU context corrupt

Yes. The consensus of the messages I received is that it's the CPU flagging an internal hardware problem.

Suggested causes include:

overclocking
thermal problems
CPU actually bad

Definitely not caused by:
Bad Rams, mb-chipset.

In my case the error only occurred once and never again - I chalked it up to bad karma on that day.

Bye, Martin
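The two hex values in the quoted log can be picked apart bit by bit. The sketch below assumes, and this is only an assumption about what the 2.2.18 message prints, that "0004" is the low word of the machine check global status register (MCG_STATUS) and that "b2040151" is the upper 32 bits of the bank's status register (MCi_STATUS). The bit positions follow the Intel machine check architecture layout; the helper names are made up for this illustration.

```python
# Flag bits of MCG_STATUS (low word).
MCG_FLAGS = {
    0: "RIPV",  # restart instruction pointer valid
    1: "EIPV",  # error instruction pointer valid
    2: "MCIP",  # machine check in progress
}

# Flag bits of MCi_STATUS, numbered within its upper 32-bit half
# (bit 63 of the full 64-bit register is bit 31 here).
STATUS_HIGH_FLAGS = {
    31: "VAL",    # register contains valid error information
    30: "OVER",   # an earlier error was overwritten
    29: "UC",     # error was not corrected by hardware
    28: "EN",     # reporting for this error was enabled
    27: "MISCV",  # the MISC register holds extra information
    26: "ADDRV",  # the ADDR register holds the failing address
    25: "PCC",    # processor context corrupt
}


def decode(value, flags):
    """Return the names of all flags set in value, highest bit first."""
    return [name for bit, name in sorted(flags.items(), reverse=True)
            if value & (1 << bit)]


print(decode(0x0004, MCG_FLAGS))              # ['MCIP']
print(decode(0xB2040151, STATUS_HIGH_FLAGS))  # ['VAL', 'UC', 'EN', 'PCC']
```

Under that assumption the PCC bit lines up with the "CPU context corrupt" panic text in the log, and the absence of RIPV suggests the processor did not consider the interrupted code safely restartable.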