Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Trevor Johnson
Andrew Gallatin wrote:

> No, that's a 660. (system machine check).
> A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
> It's not always overheating.

It's looking like at least my troubles are not from FreeBSD, but from the
hardware, probably the SCSI card.

I tried "dd if=/dev/zero of=/dev/da3" and got a pair of 670 machine
checks, shown below.  After I pressed the reset button, the SRM said
"I/O-detected PCI bus data parity error on IOD0" just after looking at the
Symbios SCSI card to which the hard drives are attached (I had gotten this
before, when I had tried replacing the Ethernet card).  Then there was a
660 machine check, then the SRM
crashed--<http://people.freebsd.org/~trevor/alpha/4100-20030116-cu2.log>.

-- begin log --
(noperiph:sym1:0:-1:-1): SCSI BUS reset detected.
sym1: unable to abort current chip operation.

unexpected machine check:

mces    = 0x1
vector  = 0x670
param   = 0xfc004e10
pc      = 0xfc642970
ra      = 0xfc406f70
curproc = 0xfc001f169200
pid = 23, comm = intr: sym1

panic: machine check
cpuid = 1;
boot() called on cpu#1

syncing disks, buffers remaining... panic: bwrite: buffer is not busy???
cpuid = 1;
boot() called on cpu#1
Uptime: 1h42m51s
(noperiph:sym1:0:-1:-1): SCSI BUS reset detected.
sym1: unable to abort current chip operation.

unexpected machine check:

mces    = 0x1
vector  = 0x670
param   = 0xfc004e10
pc      = 0xfc642970
ra      = 0xfc406f70
curproc = 0xfc001f169200
pid = 23, comm = intr: sym1

panic: machine check
cpuid = 1;
boot() called on cpu#1
Uptime: 1h42m53s
panic: bremfree: removing a buffer not on a queue
cpuid = 1;
boot() called on cpu#1
Uptime: 1h43m16s
sym1: suspicious SCSI data while resetting the BUS.
sym1: dp1,d15-8,dp0,d7-0,rst,req,ack,bsy,sel,atn,msg,c/d,i/o = 0x7ff,
expecting 0x100
-- end log --

-- 
Trevor Johnson



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Hiten Pandya
On Thu, Jan 16, 2003 at 09:39:36AM -0800, Nate Lawson wrote the words in effect of:
> On Thu, 16 Jan 2003, Trevor Johnson wrote:
> > Before adding the ccd line to my kernel configuration file, I had
> > attempted to run ccdconfig while using just the GENERIC kernel (also
> > 5.0-RC3).  I suppose I shouldn't have been surprised that it didn't work:
> > 
> > -- begin log --
> > # ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5
> > fatal kernel trap:
> > trap entry = 0x4 (unaligned access fault)
> > cpuid  = 1
> > faulting va= 0xe4a00ed
> > opcode = 0x29
> > register   = 0x1b
> > pc = 0xfe0002bd1f1c
> > ra = 0xfe0002bd1eec
> > sp = 0xfe00140898a0
> > usp= 0x11fff9f8
> > curthread  = 0xfc0017efe1f0
> > pid = 3658, comm = ccdconfig
> > panic: trap
> > cpuid = 1;
> 
> Something in the automatic kldload then?  Unaligned access is usually a
> programming error.

I don't know how much this info can help, but I recently ported NetBSD's
BUS_SPACE_DEBUG functionality, which helped them a lot in fixing
unaligned access faults.  The patch needs a lil' cleaning up for other
architectures.  Most of the drivers in FreeBSD are heavily used on i386,
so it is beneficial to use BUS space debug, so that we can easily find
out errors, and fix 'em.

It reports something like this (taken from a NetBSD sample):

"buffer  not aligned to 2 bytes
../../../../dev/ic/aic6360.c:1426"

Let me know if anyone is interested in those patches.
Cheers.

-- 
Hiten Pandya ([EMAIL PROTECTED], [EMAIL PROTECTED])
http://www.unixdaemons.com/~hiten/




Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Andrew Gallatin

Mike Tibor writes:
 > On Thu, 16 Jan 2003, Andrew Gallatin wrote:
 > 
 > (I wrote: )
 > >  > I believe a 670 machine check can also result from a read of a
 > >  > non-existent I/O space.  I'm not a programmer, but could that be the
 > >  > problem here?
 > >
 > > No, that's a 660. (system machine check).
 > > A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
 > > It's not always overheating.
 > 
 > Hmm... well, I got that from Jay Estabrook (works at DEC/Compaq/HP) via
 > the [EMAIL PROTECTED]  The archived message is here:
 > 
 > http://www.lib.uaa.alaska.edu/axp-list/archive/1998-12/0491.html

That's nice.

If you'd care to see the console logs from a "new" type of machine
I've been bringing up (multi-hose AS2100A) which crashes with 660s
when I read from a bad IO address, I'd be more than happy to share...

Drew




Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Bernd Walter
On Thu, Jan 16, 2003 at 11:48:12AM -0900, Mike Tibor wrote:
> On Thu, 16 Jan 2003, Andrew Gallatin wrote:
> 
> (I wrote: )
> >  > I believe a 670 machine check can also result from a read of a
> >  > non-existent I/O space.  I'm not a programmer, but could that be the
> >  > problem here?
> >
> > No, that's a 660. (system machine check).
> > A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
> > It's not always overheating.
> 
> Hmm... well, I got that from Jay Estabrook (works at DEC/Compaq/HP) via
> the [EMAIL PROTECTED]  The archived message is here:

Single-bit errors in RAM are non-fatal and are reported by FreeBSD as
processor correctable errors.
In most cases you don't get fatal memory errors without some
non-fatal errors.
You may want to remove and reinsert the simms.

-- 
B.Walter  COSMO-Project http://www.cosmo-project.de
[EMAIL PROTECTED] Usergroup   [EMAIL PROTECTED]





Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Mike Tibor
On Thu, 16 Jan 2003, Andrew Gallatin wrote:

(I wrote: )
>  > I believe a 670 machine check can also result from a read of a
>  > non-existent I/O space.  I'm not a programmer, but could that be the
>  > problem here?
>
> No, that's a 660. (system machine check).
> A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
> It's not always overheating.

Hmm... well, I got that from Jay Estabrook (works at DEC/Compaq/HP) via
the [EMAIL PROTECTED]  The archived message is here:

http://www.lib.uaa.alaska.edu/axp-list/archive/1998-12/0491.html

FWIW...

Mike





Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Nate Lawson
On Thu, 16 Jan 2003, Trevor Johnson wrote:
> Before adding the ccd line to my kernel configuration file, I had
> attempted to run ccdconfig while using just the GENERIC kernel (also
> 5.0-RC3).  I suppose I shouldn't have been surprised that it didn't work:
> 
> -- begin log --
> # ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5
> fatal kernel trap:
> trap entry = 0x4 (unaligned access fault)
> cpuid  = 1
> faulting va= 0xe4a00ed
> opcode = 0x29
> register   = 0x1b
> pc = 0xfe0002bd1f1c
> ra = 0xfe0002bd1eec
> sp = 0xfe00140898a0
> usp= 0x11fff9f8
> curthread  = 0xfc0017efe1f0
> pid = 3658, comm = ccdconfig
> panic: trap
> cpuid = 1;

Something in the automatic kldload then?  Unaligned access is usually a
programming error.
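
The trap dump above is consistent with that reading. A rough check, assuming
opcode 0x29 is the Alpha LDQ (8-byte load) instruction and taking the faulting
address exactly as it is printed in the log; the table and the function name
are illustrative only:

```python
# Access width in bytes for a few Alpha load/store opcodes.
# Assumed from the Alpha instruction encoding, not taken from this thread.
ACCESS_SIZE = {
    0x28: 4,  # ldl
    0x29: 8,  # ldq
    0x2C: 4,  # stl
    0x2D: 8,  # stq
}


def misalignment(addr, opcode):
    """Return addr's byte offset past the last naturally aligned boundary."""
    return addr % ACCESS_SIZE[opcode]


if __name__ == "__main__":
    # "faulting va" and "opcode" as printed in the ccdconfig trap above.
    off = misalignment(0xe4a00ed, 0x29)
    print(f"offset within the {ACCESS_SIZE[0x29]}-byte word: {off}")
```

A non-zero offset means the access was not naturally aligned, which Alpha
traps on and i386 silently tolerates.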

-Nate





Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread David O'Brien
On Thu, Jan 16, 2003 at 04:27:37AM -0500, Trevor Johnson wrote:
> I was doing the "dd" in an attempt to follow the recipe posted by Andre
> Albsmeier on
> 
> <http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&th=440939b0f4db6bdb&seekm=arg34b%2410hi%241%40FreeBSD.csie.NCTU.edu.tw&frame=off>.
> I had experienced the same problem that John De Boskey had been having:

See the ccdconfig manpage, or the FreeBSD FAQ entries about ccd(4).
That "recipe" has problems.  (but it would not cause you these crashes)




Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Andrew Gallatin

Mike Tibor writes:
 > On Thu, 16 Jan 2003, Trevor Johnson wrote:
 > 
 > > I just got a similar crash:
 > >
 > >unexpected machine check:
 > >
 > >mces= 0x1
 > >vector  = 0x670
 > I believe a 670 machine check can also result from a read of a
 > non-existent I/O space.  I'm not a programmer, but could that be the
 > problem here?

No, that's a 660. (system machine check).
A 670 is much more likely to be bad ram, bad cache, bad CPU, etc.
It's not always overheating.
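
For reference, the distinction can be written down as a small lookup. This is
only an illustrative sketch: the 0x660 and 0x670 descriptions are the ones
given in this thread, the 0x620 and 0x630 entries are assumed from the usual
Alpha SCB vector layout, and the function names are made up for the example.

```python
import re

# Alpha error vectors. 0x660 and 0x670 are as described in this thread;
# 0x620 and 0x630 are an assumption based on the usual Alpha SCB layout.
MACHINE_CHECK_VECTORS = {
    0x620: "system correctable error",
    0x630: "processor correctable error",
    0x660: "system machine check",
    0x670: "processor machine check",
}

# Matches the "vector  = 0x670" line of a machine-check dump,
# whatever the amount of padding around the equals sign.
_VECTOR_RE = re.compile(r"vector\s*=\s*0x([0-9a-fA-F]+)")


def classify_vector(vector):
    """Return a short description for a machine-check vector number."""
    return MACHINE_CHECK_VECTORS.get(vector, "unknown vector")


def vectors_in_log(text):
    """Return every vector value found in a console log, in order."""
    return [int(m.group(1), 16) for m in _VECTOR_RE.finditer(text)]


if __name__ == "__main__":
    sample = "mces    = 0x1\nvector  = 0x670\n"
    for v in vectors_in_log(sample):
        print(hex(v), classify_vector(v))
```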

Drew




Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Mike Tibor
On Thu, 16 Jan 2003, Trevor Johnson wrote:

> I just got a similar crash:
>
>   unexpected machine check:
>
>   mces= 0x1
>   vector  = 0x670
>   param   = 0xfc004e10
>   pc  = 0xfc4069bc
>   ra  = 0xfc4069b4
>   curproc = 0xfc001f169200
>   pid = 23, comm = intr: sym1


> I was sitting next to it when it crashed, and I would have noticed if the
> fans had stopped.  From earlier logs, I see that the temperature has
> varied between 21 and 27 Celsius.  I don't know what it is now, but it
> feels warmish when I'm wearing just a tee shirt.  There's a fan that
> constantly blows air into the room, which points toward the back of the
> computer.  Should I run the air conditioner more often?

I believe a 670 machine check can also result from a read of a
non-existent I/O space.  I'm not a programmer, but could that be the
problem here?

Mike





Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Trevor Johnson
I wrote:

> "dd if=zero of=da2"

I forgot to mention that after the crash, the LED on disk da2 remains lit,
as is the one on da0 (which contains /, /tmp, /usr, and /var but not
/home).  When I had the same disk drives attached to a PC, I could write
to them at about 45 kilobytes per second.  Multiplying that by 19 hours
gives about 3 gigabytes, and it was a 4 gigabyte disk.
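
As a quick check of that estimate, assuming decimal kilobytes and a steady
write rate over the whole period (the variable names are only for
illustration):

```python
RATE_BYTES_PER_SEC = 45 * 1000   # "about 45 kilobytes per second"
ELAPSED_SEC = 19 * 3600          # "19 hours"

written_bytes = RATE_BYTES_PER_SEC * ELAPSED_SEC
written_gb = written_bytes / 1e9

print(f"{written_gb:.2f} GB written")  # 3.08 GB written
```

With binary units (1024-byte kilobytes, 2**30-byte gigabytes) the figure comes
out slightly lower, but either way it is most of a 4 GB disk.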

I was doing the "dd" in an attempt to follow the recipe posted by Andre
Albsmeier on
<http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&th=440939b0f4db6bdb&seekm=arg34b%2410hi%241%40FreeBSD.csie.NCTU.edu.tw&frame=off>.
I had experienced the same problem that John De Boskey had been having:

-- begin log --
# ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5
# ccdconfig -g >/etc/ccd.conf
# cat /etc/ccd.conf
ccd0    128     2       /dev/da2 /dev/da3 /dev/da4 /dev/da5
# ls /dev/ccd*
/dev/ccd0c
# newfs /dev/ccd0c
/dev/ccd0c: 16380.2MB (33546752 sectors) block size 16384, fragment size
2048
using 90 cylinder groups of 183.62MB, 11752 blks, 23552 inodes.
super-block backups (for fsck -b #) at:
 32, 376096, 752160, 1128224, 1504288, 1880352, 2256416, 2632480, 3008544,
 3384608, 3760672, 4136736, 4512800, 4888864, 5264928, 5640992, 6017056,
 6393120, 6769184, 7145248, 7521312, 7897376, 8273440, 8649504, 9025568,
 9401632, 9777696, 10153760, 10529824, 10905888, 11281952, 11658016,
12034080,
 12410144, 12786208, 13162272, 13538336, 13914400, 14290464, 14666528,
 15042592, 15418656, 15794720, 16170784, 16546848, 16922912, 17298976,
 17675040, 18051104, 18427168, 18803232, 19179296, 19555360, 19931424,
 20307488, 20683552, 21059616, 21435680, 21811744, 22187808, 22563872,
 22939936, 23316000, 23692064, 24068128, 24444192, 24820256, 25196320,
 25572384, 25948448, 26324512, 26700576, 27076640, 27452704, 27828768,
 28204832, 28580896, 28956960, 29333024, 29709088, 30085152, 30461216,
 30837280, 31213344, 31589408, 31965472, 32341536, 32717600, 33093664,
33469728
newfs: ioctl (WDINFO): /dev/ccd0c: can't rewrite disk label: No such
process
-- end log --
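
The figures in the newfs output above are internally consistent, which is a
cheap sanity check that the ccd came up with the expected size. A sketch of
the arithmetic, assuming 512-byte sectors (newfs does not print the sector
size) and using made-up variable names:

```python
SECTOR_SIZE = 512            # bytes; an assumption, not printed by newfs

# Values copied from the newfs output above.
total_sectors = 33546752
block_size = 16384           # bytes
blocks_per_group = 11752
groups = 90
first_backup = 32            # sector of the first super-block backup

total_mb = total_sectors * SECTOR_SIZE / 2**20
group_mb = blocks_per_group * block_size / 2**20

# One backup per cylinder group, spaced one group apart.
group_sectors = blocks_per_group * block_size // SECTOR_SIZE
backups = [first_backup + i * group_sectors for i in range(groups)]

print(f"total size : {total_mb:.2f} MB")
print(f"group size : {group_mb:.3f} MB ({group_sectors} sectors)")
print(f"backups    : {backups[0]}, {backups[1]}, ..., {backups[-1]}")
```

The computed spacing reproduces the first, second and last offsets in the
list, and the last backup still falls inside the device.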

Before adding the ccd line to my kernel configuration file, I had
attempted to run ccdconfig while using just the GENERIC kernel (also
5.0-RC3).  I suppose I shouldn't have been surprised that it didn't work:

-- begin log --
# ccdconfig ccd0 128 CCDF_UNIFORM /dev/da2 /dev/da3 /dev/da4 /dev/da5
fatal kernel trap:
trap entry     = 0x4 (unaligned access fault)
cpuid          = 1
faulting va    = 0xe4a00ed
opcode         = 0x29
register       = 0x1b
pc             = 0xfe0002bd1f1c
ra             = 0xfe0002bd1eec
sp             = 0xfe00140898a0
usp            = 0x11fff9f8
curthread      = 0xfc0017efe1f0
pid = 3658, comm = ccdconfig
panic: trap
cpuid = 1;
boot() called on cpu#1
syncing disks, buffers remaining... panic: bwrite: buffer is not busy???
cpuid = 1;
boot() called on cpu#1
Uptime: 18h56m35s
Automatic reboot in 15 seconds - press a key on the console to abort
-- end log --
-- 
Trevor Johnson





Re: unexpected machine check on 5.0 alpha

2003-01-16 Thread Trevor Johnson
Wilko Bulte wrote:

> Time to check the fan for the CPU, and the air'tunnel' feeding the
> air to the heatsink. I had one come loose after servicing the machine.

I just got a similar crash:

unexpected machine check:

mces    = 0x1
vector  = 0x670
param   = 0xfc004e10
pc      = 0xfc4069bc
ra      = 0xfc4069b4
curproc = 0xfc001f169200
pid = 23, comm = intr: sym1

panic: machine check
cpuid = 1;
boot() called on cpu#1

syncing disks, buffers remaining... panic: bdwrite: buffer is not busy
cpuid = 1;
boot() called on cpu#1
Uptime: 19h27m51s

It's an Alphaserver 4100, dual 5/300 with 512 MB RAM and one power supply,
running 5.0-RC3 (kernel is GENERIC plus the line from ccd(4)).  I had been
trying to compile a few large things from the ports collection, Beonex for
example, and all the compilations had finished so the main task was "dd
if=zero of=da2" which had been running from /dev since a few minutes after
boot (only a 4 GB disk--I wonder why it's so slow).

$ strings /tmp/4100-20030113-cu4.log | grep degrees
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 23 degrees C
System temperature is 24 degrees C
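
The same extraction can be done with a short script that also reports the
range, which is handy when comparing several console logs. This is a minimal
sketch: it assumes the log lines have exactly the form shown above, and the
function name is hypothetical.

```python
import re

# Matches console lines such as "System temperature is 23 degrees C".
TEMP_RE = re.compile(r"System temperature is (\d+) degrees C")


def temperatures(lines):
    """Return the temperature readings (in Celsius) found in the given lines."""
    readings = []
    for line in lines:
        match = TEMP_RE.search(line)
        if match:
            readings.append(int(match.group(1)))
    return readings


if __name__ == "__main__":
    sample = [
        "System temperature is 23 degrees C",
        "System temperature is 23 degrees C",
        "System temperature is 24 degrees C",
    ]
    temps = temperatures(sample)
    print(f"{len(temps)} readings, min {min(temps)} C, max {max(temps)} C")
```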

I was sitting next to it when it crashed, and I would have noticed if the
fans had stopped.  From earlier logs, I see that the temperature has
varied between 21 and 27 Celsius.  I don't know what it is now, but it
feels warmish when I'm wearing just a tee shirt.  There's a fan that
constantly blows air into the room, which points toward the back of the
computer.  Should I run the air conditioner more often?
-- 
Trevor Johnson





Re: unexpected machine check on 5.0 alpha

2003-01-15 Thread Wilko Bulte
On Wed, Jan 15, 2003 at 02:19:19PM -0800, Kris Kennaway wrote:
> On Wed, Jan 15, 2003 at 05:15:56PM -0500, Andrew Gallatin wrote:
> > 
> > Kris Kennaway writes:
> >  > I just got this on one of the axp machines [*]:
> >  > 
> >  > unexpected machine check:
> >  > 
> >  > mces= 0x1
> >  > vector  = 0x670
> > 
> > 670 is a "cpu machine check" -- that's most likely an uncorrectable
> > memory parity error or some other (intermittent) hardware failure
> > caused by overheating, etc.
> > 
> > This is what?  A miata?
> 
> Yes.

Time to check the fan for the CPU, and the air'tunnel' feeding the
air to the heatsink. I had one come loose after servicing the machine.
Impractical..

-- 
|   / o / /_  _ [EMAIL PROTECTED]
|/|/ / / /(  (_)  Bulte 




Re: unexpected machine check on 5.0 alpha

2003-01-15 Thread Kris Kennaway
On Wed, Jan 15, 2003 at 05:15:56PM -0500, Andrew Gallatin wrote:
> 
> Kris Kennaway writes:
>  > I just got this on one of the axp machines [*]:
>  > 
>  > unexpected machine check:
>  > 
>  > mces= 0x1
>  > vector  = 0x670
> 
> 670 is a "cpu machine check" -- that's most likely an uncorrectable
> memory parity error or some other (intermittent) hardware failure
> caused by overheating, etc.
> 
> This is what?  A miata?

Yes.

Kris





Re: unexpected machine check on 5.0 alpha

2003-01-15 Thread Andrew Gallatin

Kris Kennaway writes:
 > I just got this on one of the axp machines [*]:
 > 
 > unexpected machine check:
 > 
 > mces= 0x1
 > vector  = 0x670

670 is a "cpu machine check" -- that's most likely an uncorrectable
memory parity error or some other (intermittent) hardware failure
caused by overheating, etc.

This is what?  A miata?

Drew





unexpected machine check on 5.0 alpha

2003-01-15 Thread Kris Kennaway
I just got this on one of the axp machines [*]:

unexpected machine check:

mces    = 0x1
vector  = 0x670
param   = 0xfc006068
pc      = 0xfc466840
ra      = 0xfc451048
curproc = 0xfc0005472948
pid = 59810, comm = ssh

Stopped at  soreceive+0x700:ldl t0,0x18(s0) <0xfe042018>   
 
db> trace
soreceive() at soreceive+0x700
soo_read() at soo_read+0x68
dofileread() at dofileread+0xf4
read() at read+0x64
syscall() at syscall+0x338
XentSys() at XentSys+0x64
--- syscall (3, FreeBSD ELF64, read) ---
--- user mode ---
db>

Any ideas?

Kris

[*] thereby ruining my stability streak of no panics for almost a
month on the bento cluster.




