Re: 4.0 frozen

2006-12-17 Thread Marc Balmer
* Stephen Schaff wrote:

> So, I thought I would post my dmesg here and see if it grabs the
> attention of anyone who knows better than I do. Any insight would be
> much appreciated. It turns my stomach to think I'd have to reinstall
> with a different OS.

If this system is critical for you, you might consider installing a
hardware watchdog timer which will then reboot the machine if it hangs.



Re: 4.0 frozen

2006-12-17 Thread Jacob Yocom-Piatt
---- Original message ----
> Date: Sun, 17 Dec 2006 02:57:56 +0100
> From: Dimitry Andric [EMAIL PROTECTED]
> Subject: Re: 4.0 frozen
> To: Stephen Schaff [EMAIL PROTECTED]
> Cc: misc@openbsd.org
>
> Stephen Schaff wrote:
> > Yesterday it inexplicably went dark.
> ...
> > wd0(pciide1:0:0): timeout
> > type: ata
> > c_bcount: 65536
> > c_skip: 0
> > pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> > wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
> > bn 235334857; cn 14648 tn 233 sn 58), retrying
> > wd0: soft error (corrected)
> > wd0(pciide1:0:0): timeout
> > type: ata
> > c_bcount: 65536
> > c_skip: 0
> > pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> ... more of those IDE errors ...
>
> Maybe dying disks?


i must second this suggestion. almost every time i've seen these IDE timeout
messages, it means that the disk(s) are damaged, close to dead or totally dead.

i find that doing disk intensive operations, e.g. extracting src.tar.gz, with
the machine in question will likely reproduce the timeouts if this is the case.

cheers,
jake
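
A minimal sketch of the disk-exercise idea above, using a plain sequential read in Python instead of a tar extraction (a swapped-in technique, not what jake ran). It reads a file or raw device in 64 KiB chunks, the same size as the c_bcount in the log, and records any read that stalls. The function name `timed_read`, the one-second threshold and the device path mentioned below are assumptions for illustration only.

```python
import time


def timed_read(path, chunk_bytes=65536, slow_seconds=1.0):
    """Read `path` sequentially and note reads that take suspiciously long.

    Returns (total_bytes_read, slow_reads), where slow_reads is a list of
    (byte_offset, seconds) tuples for reads at or above `slow_seconds`.
    """
    total = 0
    slow_reads = []
    # buffering=0 so each read() call is passed straight to the OS
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.monotonic()
            chunk = f.read(chunk_bytes)
            elapsed = time.monotonic() - start
            if not chunk:
                break  # end of file or device
            if elapsed >= slow_seconds:
                slow_reads.append((total, elapsed))
            total += len(chunk)
    return total, slow_reads
```

One way to use it would be to call `timed_read()` on a large file on the suspect filesystem, or as root on the raw device (for example /dev/rwd0c, assumed here to be the whole-disk partition of wd0), while watching the console for new timeout messages. Reads that are served from the buffer cache never touch the disk, so a file much larger than RAM, or the raw device, is the more useful target.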



Re: 4.0 frozen

2006-12-17 Thread Dimitry Andric
Jacob Yocom-Piatt wrote:
> > > wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
> > > bn 235334857; cn 14648 tn 233 sn 58), retrying
> > > wd0: soft error (corrected)
> > Maybe dying disks?
> i must second this suggestion. almost every time i've seen these IDE timeout
> messages, it means that the disk(s) are damaged, close to dead or totally
> dead.

Note that these errors can also be caused by any other part of the IDE
subsystem, e.g. the controller, the cables, etc.  Or even by bad RAM...
For sanity's sake, do a full hardware diagnostic of the machine.



Re: 4.0 frozen

2006-12-17 Thread Travers Buda
> ---- Original message ----
> > Date: Sun, 17 Dec 2006 02:57:56 +0100
> > From: Dimitry Andric [EMAIL PROTECTED]
> > Subject: Re: 4.0 frozen
> > To: Stephen Schaff [EMAIL PROTECTED]
> > Cc: misc@openbsd.org
> >
> > Stephen Schaff wrote:
> > > Yesterday it inexplicably went dark.
> > ...
> > > wd0(pciide1:0:0): timeout
> > > type: ata
> > > c_bcount: 65536
> > > c_skip: 0
> > > pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> > > wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
> > > bn 235334857; cn 14648 tn 233 sn 58), retrying
> > > wd0: soft error (corrected)
> > > wd0(pciide1:0:0): timeout
> > > type: ata
> > > c_bcount: 65536
> > > c_skip: 0
> > > pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> > ... more of those IDE errors ...
> >
> > Maybe dying disks?
 

Running # atactl wd0 smartstatus is also a quick way to check. I've got
something in rc.local for that...

Travers Buda
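
A rough sketch of what such a periodic check could look like, written in Python purely for illustration (the rc.local snippet mentioned above is not shown in this thread and is more likely a few lines of shell). It assumes the behaviour described in atactl(8): `smartstatus` exits with status 0 when no threshold is exceeded and with status 2 when the drive reports an exceeded threshold. The names `classify_smart` and `check_disk` are made up for this example.

```python
import subprocess


def classify_smart(returncode):
    """Map an atactl exit status to a label (assumed semantics, see atactl(8))."""
    if returncode == 0:
        return "ok"
    if returncode == 2:
        # assumption: status 2 means a SMART threshold was exceeded
        return "failing"
    return "unknown"


def check_disk(device="wd0", runner=subprocess.run):
    """Run `atactl <device> smartstatus` and classify the result.

    `runner` is injectable so the function can be exercised without atactl.
    """
    result = runner(
        ["atactl", device, "smartstatus"],
        capture_output=True,
        text=True,
    )
    return classify_smart(result.returncode)
```

Anything other than "ok" would be a reason to look at the drive. SMART may need to be enabled on the drive before the status query returns anything useful, and a clean SMART status does not rule out a failing disk, cable or controller.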



Re: 4.0 frozen

2006-12-17 Thread Stephen Schaff
Yeah. I did some testing last night - to no avail. When it bailed
today, I restarted it, expecting the raid to rebuild as it always
does. This time it didn't! It booted right up using wd1 and marked
wd0 as failed in raid0.


Kinda makes me happy I built it that way (special thanks to this  
page: http://www.argon18.com/raid_openbsd.html ).


So, I think that wd0 may be the cause of the whole problem, and I'll  
replace it right away and keep an eye on it to make sure that there  
aren't other problems.



Thanks everyone for your great suggestions. I've been exploring them  
all.



Best Regards,
Stephen

On 17-Dec-06, at 12:48 PM, Artur Grabowski wrote:


> Stephen Schaff [EMAIL PROTECTED] writes:
>
> > wd0(pciide1:0:0): timeout
> > type: ata
> > c_bcount: 65536
> > c_skip: 0
> > pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> > wd0d: device timeout reading fsbn 234162112 of 234162112-234162239
> > (wd0 bn 235334857; cn 14648 tn 233 sn 58), retrying
> > wd0: soft error (corrected)
> > wd0(pciide1:0:0): timeout
> > type: ata
> > c_bcount: 65536
> > c_skip: 0
> > pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> > wd0d: device timeout reading fsbn 234997440 of 234997440-234997567
> > (wd0 bn 236170185; cn 14700 tn 233 sn 6), retrying
> > wd0: soft error (corrected)
> > wd0(pciide1:0:0): timeout
> > type: ata
> > c_bcount: 65536
> > c_skip: 0
> > pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> > wd0d: device timeout reading fsbn 235719872 of 235719872-23571
> > (wd0 bn 236892617; cn 14745 tn 225 sn 17), retrying
> > wd0: soft error (corrected)
>
> This is a pretty good indication of what's going wrong. Your disk
> is sad.
>
> //art
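
The numbers in those log lines can be cross-checked with a few lines of Python. This is an illustrative sketch, assuming 512-byte sectors and the message layout exactly as quoted in this thread; the regular expression and the name `parse_timeouts` are inventions for this example, not part of any OpenBSD tool.

```python
import re

SECTOR_BYTES = 512  # assumption: 512-byte sectors

# Matches e.g. "wd0d: device timeout reading fsbn 234162112 of
# 234162112-234162239 (wd0 bn 235334857; cn ... ), retrying"
TIMEOUT_RE = re.compile(
    r"(?P<part>wd\d+[a-p]): device timeout reading fsbn (?P<fsbn>\d+) "
    r"of (?P<first>\d+)-(?P<last>\d+) "
    r"\((?P<disk>wd\d+) bn (?P<bn>\d+);"
)


def parse_timeouts(dmesg_text):
    """Return one dict per wd timeout message found in `dmesg_text`."""
    # Mail clients wrap long dmesg lines, so collapse all whitespace first.
    flat = " ".join(dmesg_text.split())
    events = []
    for m in TIMEOUT_RE.finditer(flat):
        fsbn, first, last, bn = (int(m[k]) for k in ("fsbn", "first", "last", "bn"))
        events.append({
            "partition": m["part"],
            "disk": m["disk"],
            # size of the failed transfer, in bytes
            "bytes": (last - first + 1) * SECTOR_BYTES,
            # difference between the disk-relative and partition-relative
            # block numbers, presumably the start of the partition
            "partition_offset": bn - fsbn,
        })
    return events


log = """
wd0d: device timeout reading fsbn 234162112 of 234162112-234162239
(wd0 bn 235334857; cn 14648 tn 233 sn 58), retrying
wd0d: device timeout reading fsbn 234997440 of 234997440-234997567
(wd0 bn 236170185; cn 14700 tn 233 sn 6), retrying
"""
for event in parse_timeouts(log):
    print(event)
```

Both events come out as 65536-byte transfers, matching the c_bcount value in the log, and both give the same offset of 1172745 sectors, which is consistent with every failure being a read on the same partition (wd0d) of the same disk. The third message in the thread is truncated ("235719872-23571"), so it is left out of the sample input.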




4.0 frozen

2006-12-16 Thread Stephen Schaff
I've got 4.0 running nicely on a server sitting in a data centre,  
thanks to the help of the members of this list.

It's been up since Nov. 22nd and in production.

Yesterday it inexplicably went dark. I went down to check it out, and  
hooked up the monitor and keyboard. I could see the welcoming login  
prompt, but it wouldn't accept any input. It wasn't accepting any  
pings from a remote system on the network either. The only word I  
have for that is frozen - if there's better terminology out there -  
please let me know.


Anyway, after hard booting the machine, and rebuilding the raid - I  
checked all the log files I could think of and can't find a thing.  
Nada. Then - it went down again today! I'm not sure what to do now.


So, I thought I would post my dmesg here and see if it grabs the  
attention of anyone who knows better than I do. Any insight would be  
much appreciated. It turns my stomach to think I'd have to reinstall  
with a different OS.




Best Regards,
Stephen


, addr 1
uhub1: 8 ports with 8 removable, self powered
pciide0 at pci0 dev 13 function 0 NVIDIA MCP51 IDE rev 0xa1: DMA, channel 0 configured to compatibility, channel 1 configured to compatibility
atapiscsi0 at pciide0 channel 0 drive 0
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: HL-DT-ST, CD-ROM GCR-8525B, 1.02 SCSI0 5/cdrom removable
cd0(pciide0:0:0): using PIO mode 4, DMA mode 2
pciide0: channel 1 disabled (no drives)
pciide1 at pci0 dev 14 function 0 NVIDIA MCP51 SATA rev 0xa1: DMA
pciide1: using irq 11 for native-PCI interrupt
wd0 at pciide1 channel 0 drive 0: ST3250823AS
wd0: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
wd0(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 5
pciide2 at pci0 dev 15 function 0 NVIDIA MCP51 SATA rev 0xa1: DMA
pciide2: using irq 10 for native-PCI interrupt
wd1 at pciide2 channel 0 drive 0: ST3250823AS
wd1: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
wd1(pciide2:0:0): using PIO mode 4, Ultra-DMA mode 5
wd2 at pciide2 channel 1 drive 0: ST3250823AS
wd2: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
wd2(pciide2:1:0): using PIO mode 4, Ultra-DMA mode 5
ppb3 at pci0 dev 16 function 0 NVIDIA MCP51 PCI-PCI rev 0xa2
pci4 at ppb3 bus 4
VIA VT6306 FireWire rev 0x80 at pci4 dev 5 function 0 not configured
em0 at pci4 dev 9 function 0 Intel PRO/1000GT (82541GI) rev 0x05: irq 5, address 00:0e:0c:b1:4e:e6
azalia0 at pci0 dev 16 function 1 NVIDIA MCP51 HD Audio rev 0xa2: irq 5
azalia0: host: High Definition Audio rev. 1.0
azalia0: codec: 0x04x/0x11d4 (rev. 5.0), HDA version 1.0
audio0 at azalia0
nfe0 at pci0 dev 20 function 0 NVIDIA MCP51 LAN rev 0xa1: irq 5, address 00:13:d4:ff:0f:4b
eephy0 at nfe0 phy 1: Marvell 88E Gigabit PHY, rev. 2
pchb0 at pci0 dev 24 function 0 AMD AMD64 HyperTransport rev 0x00
pchb1 at pci0 dev 24 function 1 AMD AMD64 Address Map rev 0x00
pchb2 at pci0 dev 24 function 2 AMD AMD64 DRAM Cfg rev 0x00
pchb3 at pci0 dev 24 function 3 AMD AMD64 Misc Cfg rev 0x00
isa0 at pcib0
isadma0 at isa0
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
lm0 at isa0 port 0x290/8: unknown Winbond chip (ID 0xa1)
npx0 at isa0 port 0xf0/16: using exception 16
pccom0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
biomask ff6d netmask ff6d ttymask ffef
pctr: user-level cycle counter enabled
Kernelized RAIDframe activated
cd0(atapiscsi0:0:0): Check Condition (error 0x70) on opcode 0x0
SENSE KEY: Not Ready
 ASC/ASCQ: Medium Not Present
raid0 (root): (RAID Level 1) total number of sectors is 487219200 (237900 MB) as root
dkcsum: wd0 matches BIOS drive 0x80
dkcsum: wd1 matches BIOS drive 0x81
dkcsum: wd2 matches BIOS drive 0x82
WARNING: / was not properly unmounted
swapmount: no device
raid0: Device already configured!
wd0(pciide1:0:0): timeout
type: ata
c_bcount: 65536
c_skip: 0
pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0 bn 235334857; cn 14648 tn 233 sn 58), retrying
wd0: soft error (corrected)
wd0(pciide1:0:0): timeout
type: ata
c_bcount: 65536
c_skip: 0
pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
wd0d: device timeout reading fsbn 234997440 of 234997440-234997567 (wd0 bn 236170185; cn 14700 tn 233 sn 6), retrying
wd0: soft error (corrected)
wd0(pciide1:0:0): timeout
type: ata
c_bcount: 65536
c_skip: 0
pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
wd0d: device timeout reading fsbn 235719872 of 235719872-23571 (wd0 bn 236892617; cn 14745 tn 225 sn 17), retrying
wd0: soft error (corrected)
Warning: truncating spare disk /dev/wd2d to 487219200 blocks.
OpenBSD 4.0 (GENERIC) #0: Thu Nov 23 01:28:38 

Re: 4.0 frozen

2006-12-16 Thread K K

On 12/16/06, Stephen Schaff [EMAIL PROTECTED] wrote:

> Yesterday it inexplicably went dark. I went down to check it out, and
> hooked up the monitor and keyboard. I could see the welcoming login
> prompt, but it wouldn't accept any input. It wasn't accepting any
> pings from a remote system on the network either. The only word I
> have for that is frozen - if there's better terminology out there -
> please let me know.
>
> Anyway, after hard booting the machine, and rebuilding the raid - I
> checked all the log files I could think of and can't find a thing.
> Nada. Then - it went down again today! I'm not sure what to do now.


Sounds like a physical problem.  I've seen this type of hard freeze
with bad power, RAM, motherboard, or CPU.  The problem is often
related to heat.

If you can take it out of production for half a day or so, I would try
UBCD, starting with the memory tests.

http://www.ultimatebootcd.com/

Kevin



Re: 4.0 frozen

2006-12-16 Thread Dimitry Andric
Stephen Schaff wrote:
> Yesterday it inexplicably went dark.
...
> wd0(pciide1:0:0): timeout
> type: ata
> c_bcount: 65536
> c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
> bn 235334857; cn 14648 tn 233 sn 58), retrying
> wd0: soft error (corrected)
> wd0(pciide1:0:0): timeout
> type: ata
> c_bcount: 65536
> c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
... more of those IDE errors ...

Maybe dying disks?



Re: 4.0 frozen

2006-12-16 Thread Federico Giannici

Stephen Schaff wrote:
> I've got 4.0 running nicely on a server sitting in a data centre, thanks
> to the help of the members of this list.
>
> It's been up since Nov. 22nd and in production.
>
> Yesterday it inexplicably went dark. I went down to check it out, and
> hooked up the monitor and keyboard. I could see the welcoming login
> prompt, but it wouldn't accept any input. It wasn't accepting any pings
> from a remote system on the network either. The only word I have for
> that is frozen - if there's better terminology out there - please let me
> know.


Welcome to the club!  :-(

A couple of minutes ago I restarted a frozen PC of mine.
This happens to different PCs, and I replaced ALL the hardware, but 
nothing changed.
It seems to happen usually during high disk/network activity, but I'm 
not sure.

The freezes certainly became much more frequent after the upgrade from 3.9 to 4.0.
I sent several emails here, but nobody seemed to have any real clue...


Bye.

--
Federico Giannici  [EMAIL PROTECTED]
http://www.neomedia.it



Re: 4.0 frozen

2006-12-16 Thread Andreas Maus

Hi Stephen.

On 12/17/06, Stephen Schaff [EMAIL PROTECTED] wrote:

> wd0(pciide1:0:0): timeout
> type: ata
> c_bcount: 65536
> c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239
> (wd0 bn 235334857; cn 14648 tn 233 sn 58), retrying
> wd0: soft error (corrected)
> wd0(pciide1:0:0): timeout
> type: ata
> c_bcount: 65536
> c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 234997440 of 234997440-234997567
> (wd0 bn 236170185; cn 14700 tn 233 sn 6), retrying
> wd0: soft error (corrected)
> wd0(pciide1:0:0): timeout
> type: ata
> c_bcount: 65536
> c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 235719872 of 235719872-23571
> (wd0 bn 236892617; cn 14745 tn 225 sn 17), retrying
> wd0: soft error (corrected)

I guess wd0 holds your root file system, right?

I had the same problem with my OpenBSD access point over one
year ago. After replacing the disk my system works like a charm :)

I suggest that you replace the dying harddisk with a new one and give
it a try.

HTH,

Andreas.


--
Hobbes : Shouldn't we read the instructions?
Calvin : Do I look like a sissy?



Re: 4.0 frozen

2006-12-16 Thread STeve Andre'
On Saturday 16 December 2006 20:24, Stephen Schaff wrote:
> I've got 4.0 running nicely on a server sitting in a data centre,
> thanks to the help of the members of this list.
> It's been up since Nov. 22nd and in production.
>
> Yesterday it inexplicably went dark. I went down to check it out, and
> hooked up the monitor and keyboard. I could see the welcoming login
> prompt, but it wouldn't accept any input. It wasn't accepting any
> pings from a remote system on the network either. The only word I
> have for that is frozen - if there's better terminology out there -
> please let me know.
>
> Anyway, after hard booting the machine, and rebuilding the raid - I
> checked all the log files I could think of and can't find a thing.
> Nada. Then - it went down again today! I'm not sure what to do now.
>
> So, I thought I would post my dmesg here and see if it grabs the
> attention of anyone who knows better than I do. Any insight would be
> much appreciated. It turns my stomach to think I'd have to reinstall
> with a different OS.
>
> Best Regards,
> Stephen

If things have been running for nearly a month and now you've
crashed twice in two days, that says that the system was just
fine, and now things have gone to hell.  You have new hardware
problems.

I'd first suspect ram.  Get memtest86 and run it for 24 hours or 
so.  I'd also take the raid array and stuff it into another identical
computer.  You do have a spare system for this production
service, don't you?

--STeve Andre'



Re: 4.0 frozen

2006-12-16 Thread Travers Buda
On Sat, 16 Dec 2006 21:31:28 -0500
STeve Andre' [EMAIL PROTECTED] wrote:

 
> If things have been running for nearly a month and now you've
> crashed twice in two days, that says that the system was just
> fine, and now things have gone to hell.  You have new hardware
> problems.
>
> I'd first suspect ram.  Get memtest86 and run it for 24 hours or
> so.  I'd also take the raid array and stuff it into another identical
> computer.  You do have a spare system for this production
> service, don't you?
 

That's some good advice--if the problems are just now showing with
great frequency, it's the hardware. I'd check the disk, ram, and PSU in
that order. 

Travers Buda