Re: unexpected machine check on 5.0 alpha
Andrew Gallatin wrote:

> No, that's a 660. (system machine check).
> A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
> It's not always overheating.

It's looking like at least my troubles are not from FreeBSD, but from
the hardware, probably the SCSI card.  I tried
"dd if=/dev/zero of=/dev/da3" and got a pair of 670 machine checks,
shown below.  After I pressed the reset button, the SRM said
"I/O-detected PCI bus data parity error on IOD0" just after looking at
the Symbios SCSI card to which the hard drives are attached (I had
gotten this before, when I had tried replacing the Ethernet card).
Then there was a 660 machine check, then the SRM crashed--
<http://people.freebsd.org/~trevor/alpha/4100-20030116-cu2.log>.

-- begin log --
(noperiph:sym1:0:-1:-1): SCSI BUS reset detected.
sym1: unable to abort current chip operation.

unexpected machine check:

    mces    = 0x1
    vector  = 0x670
    param   = 0xfc004e10
    pc      = 0xfc642970
    ra      = 0xfc406f70
    curproc = 0xfc001f169200
        pid = 23, comm = intr: sym1

panic: machine check
cpuid = 1;
boot() called on cpu#1

syncing disks, buffers remaining... panic: bwrite: buffer is not busy???
cpuid = 1;
boot() called on cpu#1
Uptime: 1h42m51s
(noperiph:sym1:0:-1:-1): SCSI BUS reset detected.
sym1: unable to abort current chip operation.

unexpected machine check:

    mces    = 0x1
    vector  = 0x670
    param   = 0xfc004e10
    pc      = 0xfc642970
    ra      = 0xfc406f70
    curproc = 0xfc001f169200
        pid = 23, comm = intr: sym1

panic: machine check
cpuid = 1;
boot() called on cpu#1
Uptime: 1h42m53s
panic: bremfree: removing a buffer not on a queue
cpuid = 1;
boot() called on cpu#1
Uptime: 1h43m16s
sym1: suspicious SCSI data while resetting the BUS.
sym1: dp1,d15-8,dp0,d7-0,rst,req,ack,bsy,sel,atn,msg,c/d,i/o = 0x7ff, expecting 0x100
-- end log --

--
Trevor Johnson

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
Re: unexpected machine check on 5.0 alpha
On Thu, Jan 16, 2003 at 09:39:36AM -0800, Nate Lawson wrote the words in effect of:
> On Thu, 16 Jan 2003, Trevor Johnson wrote:
> > Before adding the ccd line to my kernel configuration file, I had
> > attempted to run ccdconfig while using just the GENERIC kernel (also
> > 5.0-RC3).  I suppose I shouldn't have been surprised that it didn't work:
> >
> > -- begin log --
> > # ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5
> >
> > fatal kernel trap:
> >
> >     trap entry = 0x4 (unaligned access fault)
> >     cpuid      = 1
> >     faulting va= 0xe4a00ed
> >     opcode     = 0x29
> >     register   = 0x1b
> >     pc         = 0xfe0002bd1f1c
> >     ra         = 0xfe0002bd1eec
> >     sp         = 0xfe00140898a0
> >     usp        = 0x11fff9f8
> >     curthread  = 0xfc0017efe1f0
> >         pid = 3658, comm = ccdconfig
> >
> > panic: trap
> > cpuid = 1;
>
> Something in the automatic kldload then?  Unaligned access is usually a
> programming error.

I don't know how much this info can help, but I recently ported
NetBSD's BUS_SPACE_DEBUG functionality, which helped them a lot in
fixing unaligned access faults.  The patch needs a lil' cleaning up
for other architectures.

Most of the drivers in FreeBSD are heavily used on i386, so it is
beneficial to use BUS space debug, so that we can easily find errors
and fix 'em.  It reports something like this (taken from a NetBSD
sample):

    "buffer not aligned to 2 bytes ../../../../dev/ic/aic6360.c:1426"

Let me know if anyone is interested in those patches.

Cheers.

--
Hiten Pandya ([EMAIL PROTECTED], [EMAIL PROTECTED])
http://www.unixdaemons.com/~hiten/
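
As a rough illustration of the kind of test such a debug facility makes, the sketch below checks whether an address is a multiple of the access width. It is not the NetBSD or FreeBSD code, and the helper name `is_aligned` is hypothetical. It only shows why the faulting address from the quoted trap (0xe4a00ed, an odd value) cannot satisfy any access wider than one byte.

```python
def is_aligned(addr: int, width: int) -> bool:
    """Return True if addr is a multiple of width (a power of two)."""
    if width <= 0 or width & (width - 1):
        raise ValueError("width must be a positive power of two")
    # For power-of-two widths, the low bits must all be zero.
    return addr & (width - 1) == 0


# "faulting va" exactly as printed in the quoted trap output.
faulting_va = 0xE4A00ED

for width in (1, 2, 4, 8):
    state = "aligned" if is_aligned(faulting_va, width) else "not aligned"
    print(f"{faulting_va:#x} is {state} to {width} byte(s)")
```

A real bus-space check would also record the file and line of the caller, as in the NetBSD sample message quoted above.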
Re: unexpected machine check on 5.0 alpha
Mike Tibor writes:
 > On Thu, 16 Jan 2003, Andrew Gallatin wrote:
 >
 > (I wrote: )
 > > > I believe a 670 machine check can also result from a read of a
 > > > non-existent I/O space.  I'm not a programmer, but could that be the
 > > > problem here?
 > >
 > > No, that's a 660. (system machine check).
 > > A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
 > > It's not always overheating.
 >
 > Hmm... well, I got that from Jay Estabrook (works at DEC/Compaq/HP) via
 > the [EMAIL PROTECTED]  The archived message is here:
 >
 > http://www.lib.uaa.alaska.edu/axp-list/archive/1998-12/0491.html

That's nice.  If you'd care to see the console logs from a "new" type
of machine I've been bringing up (multi-hose AS2100A) which crashes
with 660s when I read from a bad IO address, I'd be more than happy to
share...

Drew
Re: unexpected machine check on 5.0 alpha
On Thu, Jan 16, 2003 at 11:48:12AM -0900, Mike Tibor wrote:
> On Thu, 16 Jan 2003, Andrew Gallatin wrote:
>
> (I wrote: )
> > > I believe a 670 machine check can also result from a read of a
> > > non-existent I/O space.  I'm not a programmer, but could that be the
> > > problem here?
> >
> > No, that's a 660. (system machine check).
> > A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
> > It's not always overheating.
>
> Hmm... well, I got that from Jay Estabrook (works at DEC/Compaq/HP) via
> the [EMAIL PROTECTED]  The archived message is here:

Single-bit errors in RAM are non-fatal and are reported by FreeBSD as
a processor correctable error.  In most cases you don't get fatal
memory errors without also seeing some non-fatal errors.  You may want
to remove and reinsert the SIMMs.

--
B.Walter  COSMO-Project  http://www.cosmo-project.de
[EMAIL PROTECTED]  Usergroup  [EMAIL PROTECTED]
Re: unexpected machine check on 5.0 alpha
On Thu, 16 Jan 2003, Andrew Gallatin wrote:

(I wrote: )
> > I believe a 670 machine check can also result from a read of a
> > non-existent I/O space.  I'm not a programmer, but could that be the
> > problem here?
>
> No, that's a 660. (system machine check).
> A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
> It's not always overheating.

Hmm... well, I got that from Jay Estabrook (works at DEC/Compaq/HP) via
the [EMAIL PROTECTED]  The archived message is here:

http://www.lib.uaa.alaska.edu/axp-list/archive/1998-12/0491.html

FWIW...

Mike
Re: unexpected machine check on 5.0 alpha
On Thu, 16 Jan 2003, Trevor Johnson wrote:
> Before adding the ccd line to my kernel configuration file, I had
> attempted to run ccdconfig while using just the GENERIC kernel (also
> 5.0-RC3).  I suppose I shouldn't have been surprised that it didn't work:
>
> -- begin log --
> # ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5
>
> fatal kernel trap:
>
>     trap entry = 0x4 (unaligned access fault)
>     cpuid      = 1
>     faulting va= 0xe4a00ed
>     opcode     = 0x29
>     register   = 0x1b
>     pc         = 0xfe0002bd1f1c
>     ra         = 0xfe0002bd1eec
>     sp         = 0xfe00140898a0
>     usp        = 0x11fff9f8
>     curthread  = 0xfc0017efe1f0
>         pid = 3658, comm = ccdconfig
>
> panic: trap
> cpuid = 1;

Something in the automatic kldload then?  Unaligned access is usually a
programming error.

-Nate
Re: unexpected machine check on 5.0 alpha
On Thu, Jan 16, 2003 at 04:27:37AM -0500, Trevor Johnson wrote:
> I was doing the "dd" in an attempt to follow the recipe posted by Andre
> Albsmeier on
> <http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&th=440939b0f4db6bdb&seekm=arg34b%2410hi%241%40FreeBSD.csie.NCTU.edu.tw&frame=off>.
> I had experienced the same problem that John De Boskey had been having:

See the ccdconfig manpage, or the FreeBSD FAQ entries about ccd(4).
That "recipe" has problems (but it would not cause you these crashes).
Re: unexpected machine check on 5.0 alpha
Mike Tibor writes:
 > On Thu, 16 Jan 2003, Trevor Johnson wrote:
 >
 > > I just got a similar crash:
 > >
 > > unexpected machine check:
 > >
 > >     mces    = 0x1
 > >     vector  = 0x670

 > I believe a 670 machine check can also result from a read of a
 > non-existent I/O space.  I'm not a programmer, but could that be the
 > problem here?

No, that's a 660. (system machine check).
A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
It's not always overheating.

Drew
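
The distinction drawn in this exchange can be written down as a small lookup. This is only a sketch: the table holds just the two vectors named in the thread, the descriptions paraphrase the messages, and `describe_vector` is a hypothetical helper rather than part of any FreeBSD tool.

```python
# Only the two vectors discussed in this thread; descriptions paraphrase
# the messages and are not taken from any official table.
MACHINE_CHECK_VECTORS = {
    0x660: "system machine check (for example, a read from a bad I/O address)",
    0x670: "CPU machine check (more likely bad RAM, bad cache, bad CPU, or overheating)",
}


def describe_vector(vector: int) -> str:
    """Map a machine check vector to the description given in the thread."""
    return MACHINE_CHECK_VECTORS.get(vector, "vector not discussed in this thread")


for vector in (0x660, 0x670):
    print(f"vector {vector:#x}: {describe_vector(vector)}")
```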
Re: unexpected machine check on 5.0 alpha
On Thu, 16 Jan 2003, Trevor Johnson wrote:

> I just got a similar crash:
>
> unexpected machine check:
>
>     mces    = 0x1
>     vector  = 0x670
>     param   = 0xfc004e10
>     pc      = 0xfc4069bc
>     ra      = 0xfc4069b4
>     curproc = 0xfc001f169200
>         pid = 23, comm = intr: sym1

> I was sitting next to it when it crashed, and I would have noticed if the
> fans had stopped.  From earlier logs, I see that the temperature has
> varied between 21 and 27 Celsius.  I don't know what it is now, but it
> feels warmish when I'm wearing just a tee shirt.  There's a fan that
> constantly blows air into the room, which points toward the back of the
> computer.  Should I run the air conditioner more often?

I believe a 670 machine check can also result from a read of a
non-existent I/O space.  I'm not a programmer, but could that be the
problem here?

Mike
Re: unexpected machine check on 5.0 alpha
I wrote:

> "dd if=zero of=da2"

I forgot to mention that after the crash, the LED on disk da2 remains
lit, as does the one on da0 (which contains /, /tmp, /usr, and /var
but not /home).  When I had the same disk drives attached to a PC, I
could write to them at about 45 kilobytes per second.  Multiplying
that by 19 hours gives about 3 gigabytes, and it was a 4 gigabyte disk.

I was doing the "dd" in an attempt to follow the recipe posted by
Andre Albsmeier on
<http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&th=440939b0f4db6bdb&seekm=arg34b%2410hi%241%40FreeBSD.csie.NCTU.edu.tw&frame=off>.
I had experienced the same problem that John De Boskey had been having:

-- begin log --
# ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5
# ccdconfig -g >/etc/ccd.conf
# cat /etc/ccd.conf
ccd0 128 2 /dev/da2 /dev/da3 /dev/da4 /dev/da5
# ls /dev/ccd*
/dev/ccd0c
# newfs /dev/ccd0c
/dev/ccd0c: 16380.2MB (33546752 sectors) block size 16384, fragment size 2048
        using 90 cylinder groups of 183.62MB, 11752 blks, 23552 inodes.
super-block backups (for fsck -b #) at:
 32, 376096, 752160, 1128224, 1504288, 1880352, 2256416, 2632480, 3008544,
 3384608, 3760672, 4136736, 4512800, 4888864, 5264928, 5640992, 6017056,
 6393120, 6769184, 7145248, 7521312, 7897376, 8273440, 8649504, 9025568,
 9401632, 9777696, 10153760, 10529824, 10905888, 11281952, 11658016,
 12034080, 12410144, 12786208, 13162272, 13538336, 13914400, 14290464,
 14666528, 15042592, 15418656, 15794720, 16170784, 16546848, 16922912,
 17298976, 17675040, 18051104, 18427168, 18803232, 19179296, 19555360,
 19931424, 20307488, 20683552, 21059616, 21435680, 21811744, 22187808,
 22563872, 22939936, 23316000, 23692064, 24068128, 24444192, 24820256,
 25196320, 25572384, 25948448, 26324512, 26700576, 27076640, 27452704,
 27828768, 28204832, 28580896, 28956960, 29333024, 29709088, 30085152,
 30461216, 30837280, 31213344, 31589408, 31965472, 32341536, 32717600,
 33093664, 33469728
newfs: ioctl (WDINFO): /dev/ccd0c: can't rewrite disk label: No such process
-- end log --

Before adding the ccd line to my kernel configuration file, I had
attempted to run ccdconfig while using just the GENERIC kernel (also
5.0-RC3).  I suppose I shouldn't have been surprised that it didn't
work:

-- begin log --
# ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5

fatal kernel trap:

    trap entry = 0x4 (unaligned access fault)
    cpuid      = 1
    faulting va= 0xe4a00ed
    opcode     = 0x29
    register   = 0x1b
    pc         = 0xfe0002bd1f1c
    ra         = 0xfe0002bd1eec
    sp         = 0xfe00140898a0
    usp        = 0x11fff9f8
    curthread  = 0xfc0017efe1f0
        pid = 3658, comm = ccdconfig

panic: trap
cpuid = 1;
boot() called on cpu#1

syncing disks, buffers remaining... panic: bwrite: buffer is not busy???
cpuid = 1;
boot() called on cpu#1
Uptime: 18h56m35s
Automatic reboot in 15 seconds - press a key on the console to abort
-- end log --

--
Trevor Johnson
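
The arithmetic in this message can be checked with a few lines. This is a sketch under two assumptions that the message does not state: a kilobyte is taken as 1000 bytes, and a sector is taken as 512 bytes. The variable names are illustrative only.

```python
# Figures quoted in the message.
rate_kb_per_s = 45        # "about 45 kilobytes per second"
hours = 19                # "Multiplying that by 19 hours"
sectors = 33546752        # from the newfs output

# Assumptions: 1 kilobyte = 1000 bytes, 1 sector = 512 bytes.
BYTES_PER_KB = 1000
BYTES_PER_SECTOR = 512

bytes_written = rate_kb_per_s * BYTES_PER_KB * hours * 3600
print(f"written in {hours} h: {bytes_written / 1e9:.2f} GB")

ccd_size_mb = sectors * BYTES_PER_SECTOR / 2**20
print(f"ccd0c size: {ccd_size_mb:.2f} MB")
```

Under those assumptions the estimate of about 3 gigabytes holds, and the sector count agrees with the size newfs reports.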
Re: unexpected machine check on 5.0 alpha
Wilko Bulte wrote:

> Time to check the fan for the CPU, and the air'tunnel' feeding the
> air to the heatsink. I had one come loose after servicing the machine.

I just got a similar crash:

unexpected machine check:

    mces    = 0x1
    vector  = 0x670
    param   = 0xfc004e10
    pc      = 0xfc4069bc
    ra      = 0xfc4069b4
    curproc = 0xfc001f169200
        pid = 23, comm = intr: sym1

panic: machine check
cpuid = 1;
boot() called on cpu#1

syncing disks, buffers remaining... panic: bdwrite: buffer is not busy
cpuid = 1;
boot() called on cpu#1
Uptime: 19h27m51s

It's an Alphaserver 4100, dual 5/300 with 512 MB RAM and one power
supply, running 5.0-RC3 (kernel is GENERIC plus the line from ccd(4)).
I had been trying to compile a few large things from the ports
collection, Beonex for example, and all the compilations had finished
so the main task was

    "dd if=zero of=da2"

which had been running from /dev since a few minutes after boot (only
a 4 GB disk--I wonder why it's so slow).

$ strings /tmp/4100-20030113-cu4.log | grep degrees
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 24 degrees C

I was sitting next to it when it crashed, and I would have noticed if
the fans had stopped.  From earlier logs, I see that the temperature
has varied between 21 and 27 Celsius.  I don't know what it is now,
but it feels warmish when I'm wearing just a tee shirt.  There's a fan
that constantly blows air into the room, which points toward the back
of the computer.  Should I run the air conditioner more often?

--
Trevor Johnson
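
For anyone who wants the readings as numbers rather than grep output, a minimal sketch follows. It assumes the console log lines look exactly like the five shown in this message; `temperatures` is a hypothetical helper and the sample text is copied from the message rather than read from the log file.

```python
import re

# Matches lines of the form shown in the message above.
TEMP_RE = re.compile(r"System temperature is (\d+) degrees C")


def temperatures(text: str) -> list[int]:
    """Return every temperature reading found in text, in order."""
    return [int(match.group(1)) for match in TEMP_RE.finditer(text)]


sample = """\
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 24 degrees C
"""

readings = temperatures(sample)
print(f"{len(readings)} readings, min {min(readings)} C, max {max(readings)} C")
```

Run against a whole log, the same function would show whether the temperature climbed before a crash.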
Re: unexpected machine check on 5.0 alpha
On Wed, Jan 15, 2003 at 02:19:19PM -0800, Kris Kennaway wrote:
> On Wed, Jan 15, 2003 at 05:15:56PM -0500, Andrew Gallatin wrote:
> >
> > Kris Kennaway writes:
> >  > I just got this on one of the axp machines [*]:
> >  >
> >  > unexpected machine check:
> >  >
> >  >     mces    = 0x1
> >  >     vector  = 0x670
> >
> > 670 is a "cpu machine check" -- that's most likely an uncorrectable
> > memory parity error or some other (intermittent) hardware failure
> > caused by overheating, etc.
> >
> > This is what?  A miata?
>
> Yes.

Time to check the fan for the CPU, and the air 'tunnel' feeding the
air to the heatsink. I had one come loose after servicing the machine.
Impractical..

--
|   / o / /_  _   [EMAIL PROTECTED]
|/|/ / / /(  (_)  Bulte
Re: unexpected machine check on 5.0 alpha
On Wed, Jan 15, 2003 at 05:15:56PM -0500, Andrew Gallatin wrote:
>
> Kris Kennaway writes:
>  > I just got this on one of the axp machines [*]:
>  >
>  > unexpected machine check:
>  >
>  >     mces    = 0x1
>  >     vector  = 0x670
>
> 670 is a "cpu machine check" -- that's most likely an uncorrectable
> memory parity error or some other (intermittent) hardware failure
> caused by overheating, etc.
>
> This is what?  A miata?

Yes.

Kris
Re: unexpected machine check on 5.0 alpha
Kris Kennaway writes:
 > I just got this on one of the axp machines [*]:
 >
 > unexpected machine check:
 >
 >     mces    = 0x1
 >     vector  = 0x670

670 is a "cpu machine check" -- that's most likely an uncorrectable
memory parity error or some other (intermittent) hardware failure
caused by overheating, etc.

This is what?  A miata?

Drew
unexpected machine check on 5.0 alpha
I just got this on one of the axp machines [*]:

unexpected machine check:

    mces    = 0x1
    vector  = 0x670
    param   = 0xfc006068
    pc      = 0xfc466840
    ra      = 0xfc451048
    curproc = 0xfc0005472948
        pid = 59810, comm = ssh

Stopped at      soreceive+0x700:        ldl     t0,0x18(s0)     <0xfe042018>
db> trace
soreceive() at soreceive+0x700
soo_read() at soo_read+0x68
dofileread() at dofileread+0xf4
read() at read+0x64
syscall() at syscall+0x338
XentSys() at XentSys+0x64
--- syscall (3, FreeBSD ELF64, read) ---
--- user mode ---
db>

Any ideas?

Kris

[*] thereby ruining my stability streak of no panics for almost a
month on the bento cluster.