Re: Complete disk disaster

2005-08-30 Thread Ramiro Aceves
 I hope you are not storing any valuable data on a 10 year old hdd...
 

Yes, of course.

I have a definitive answer now. After some days of use, the disk failed
again. I moved the drive to another computer, and after compiling some
ports, disk read failures came back, causing segfaults. I was
paranoid, and just to confirm, I tried to install Debian Linux on it. I
could not even finish the install because disk read failures led
to segmentation faults.
The disk is now disassembled on my desk. The enclosure is removed. I am
looking at the spinning disk, the heads, the control system. It is
indeed an incredibly beautiful machine that man created. Just to
destroy it, I plugged in the cables with the enclosure open. I created an FFS
file system on it, mounted it, and copied some files to it; some were
copied, some were not, and the errors were frequent. It has been an amazing
experience seeing how the heads move to find the data on the disk. The disk
is in the trash now.

2 weeks of free time wasted, but many things learned!

Thank you very much.

Tomorrow I will buy a new HD only for OpenBSD.

Ramiro.



Re: Complete disk disaster

2005-08-24 Thread Ramiro Aceves
First, thank you very much for your interesting responses.

Yesterday evening I installed OpenBSD again on the same disk,
just to see whether I could reproduce the errors. Yes! I did not have to
wait long. The errors appeared after some hours of use. I
installed the ports tree and ran the locate.updatedb command, just to keep
the disk heads moving. I also added some audio files just to fill the disk space.

Last night, there were only two corrupted files, immediately after
the install:
/usr/libdata/perl5/AnyDBM_File.pm and
/usr/libdata/perl5/Attribute
Those files disappeared:


# pwd
/usr/libdata/perl5
# ls A*
ls: AnyDBM_File.pm: Bad file descriptor
ls: Attribute: Bad file descriptor
AutoLoader.pm   AutoSplit.pm


This morning, the error count rose exponentially. I was even able to record
these errors:



wd1(pciide0:0:1): timeout
type: ata
c_bcount: 2048
c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
type: ata
c_bcount: 2048
c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn
1486239; cn 1474 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
type: ata
c_bcount: 2048
c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
type: ata
c_bcount: 2048
c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1486376 of 1486376-1486379 (wd1 bn
1486439; cn 1474 tn 10 sn 17), retrying
wd1: soft error (corrected)
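To see whether the timeouts cluster on the same blocks, the kernel messages can be tallied. The following is only a minimal sketch, not something used in this thread: it assumes the messages have been saved as text, and the names `TIMEOUT_RE`, `tally_timeouts` and `SAMPLE` are made up for illustration. The sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the first half of the wrapped wd timeout lines quoted above, e.g.
# "wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn"
TIMEOUT_RE = re.compile(
    r"^(wd\d+[a-p]): device timeout reading fsbn (\d+) of (\d+)-(\d+)"
)


def tally_timeouts(log_text):
    """Count how often each (partition, first block) pair timed out."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.match(line.strip())
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# The four timeout lines from the log above (continuation lines omitted).
SAMPLE = """\
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
wd1a: device timeout reading fsbn 1486376 of 1486376-1486379 (wd1 bn
"""

if __name__ == "__main__":
    for (partition, block), count in tally_timeouts(SAMPLE).most_common():
        print(partition, block, count)
```

Blocks that show up repeatedly, or that sit close together, hint at a damaged region rather than random noise, although a bad cable could plausibly produce similar messages.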






Here comes the fsck output, full of errors. It seems that the file system
gets corrupted more quickly as the hard disk approaches its maximum capacity.

The system is even unable to do a clean halt. It drops into ddb.


# fsck /dev/wd1a

** /dev/rwd1a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
UNALLOCATED  I=62208  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/libdata/perl5/AnyDBM_File.pm

REMOVE? no

UNALLOCATED  I=62209  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/libdata/perl5/Attribute

REMOVE? no

UNALLOCATED  I=61952  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/bin/lam

REMOVE? no

UNALLOCATED  I=61953  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/bin/last

REMOVE? no

UNALLOCATED  I=61954  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/bin/lastcomm

REMOVE? no

UNALLOCATED  I=61955  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/bin/ldd

REMOVE? no

UNALLOCATED  I=85076  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/include/dev/ic/mpt_ioctl.h

REMOVE? no

UNALLOCATED  I=85077  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/include/dev/ic/mpt_mpilib.h

REMOVE? no

UNALLOCATED  I=85078  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/include/dev/ic/mpt_openbsd.h

REMOVE? no

UNALLOCATED  I=85079  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/include/dev/ic/mpuvar.h

REMOVE? no

UNALLOCATED  I=87776  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat1/mkdep.0

REMOVE? no

UNALLOCATED  I=8  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat1/mkdir.0

REMOVE? no

UNALLOCATED  I=87778  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat1/mkfifo.0

REMOVE? no

UNALLOCATED  I=87779  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat1/mktemp.0

REMOVE? no

UNALLOCATED  I=89396  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat8/named.0

REMOVE? no

UNALLOCATED  I=89397  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat8/ncheck.0

REMOVE? no

UNALLOCATED  I=89397  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat8/ncheck.0

REMOVE? no

UNALLOCATED  I=89398  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat8/ndp.0

REMOVE? no

UNALLOCATED  I=89399  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/share/man/cat8/netgroup_mkdb.0

REMOVE? no

UNALLOCATED  I=92099  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/ports/benchmarks/ubench

REMOVE? no

UNALLOCATED  I=92097  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/ports/benchmarks/randread/distinfo

REMOVE? no

UNALLOCATED  I=92098  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970
NAME=/usr/ports/benchmarks/randread/Makefile

REMOVE? no

UNALLOCATED  I=92096  OWNER=root MODE=0
SIZE=0 MTIME=Jan  1 01:00 1970

Re: Complete disk disaster

2005-08-24 Thread Ramiro Aceves
Edd Barrett wrote:
Oh, thanks, but I tried to do it a month ago from my Linux box and this
is an old disk that does not have the SMART thing. :-(
 
 
 At the price of storage media these days, you may as well just buy another
 disk.
 
 Regards
 
 Edd
 

Yes, disks are indeed very cheap. I had this spare disk just to try
OpenBSD and get comfortable with it without the risk of breaking my
Linux install. Now that I like OpenBSD, I am going to buy a disk for
OpenBSD only. I am also considering ordering the CD. I do not know whether
to wait for the new release to come out.

Ramiro.
EA1ABZ



Re: Complete disk disaster

2005-08-24 Thread Stuart Henderson

--On 24 August 2005 10:37 +0200, Ramiro Aceves wrote:


pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
type: ata
c_bcount: 2048
c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn
1486239; cn 1474 tn 7 sn 6), retrying
wd1: soft error (corrected)

[etc]

All hard drives have bad blocks, and most hard drives now have some spare
capacity. As the drive detects bad or failing blocks, it automatically
remaps them to spare blocks. This is internal to the drive: by the time
you start noticing drive errors, the drive is usually unable to remap
any more blocks.
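The remapping behaviour described above can be pictured with a toy model. This is purely an illustration under simplifying assumptions, not how any real drive firmware works; the class `ToyDrive` and its methods are hypothetical names.

```python
class ToyDrive:
    """Toy model of a drive with a fixed pool of spare blocks."""

    def __init__(self, spare_blocks):
        self.spares_left = spare_blocks
        self.remapped = set()       # failures the drive hid from the host
        self.visible_errors = []    # failures the host actually sees

    def block_goes_bad(self, block):
        """Return True if the drive hid the failure by remapping the block."""
        if self.spares_left > 0:
            self.spares_left -= 1
            self.remapped.add(block)
            return True
        # No spares left: from now on the operating system sees the error.
        self.visible_errors.append(block)
        return False


if __name__ == "__main__":
    drive = ToyDrive(spare_blocks=2)
    for bad_block in (10, 20, 30):
        hidden = drive.block_goes_bad(bad_block)
        print(bad_block, "remapped" if hidden else "visible error")
```

In this model the first failures are silent, which is why visible errors tend to mean the spare pool is already exhausted.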


Sometimes the manufacturer's drive-test tools can be useful
(Hitachi/IBM's DFT can do some basic tests on drives from other
manufacturers too). There's also a commercial program, SpinRite, which
claims to have good stress tests.




Re: Complete disk disaster

2005-08-24 Thread Alexandre Ratchov
On Wed, Aug 24, 2005 at 10:37:46AM +0200, Ramiro Aceves wrote:
 First, thank you very much for your interesting responses.
 
 Yesterday evening I installed OpenBSD again on the same disk,
 just to see whether I could reproduce the errors. Yes! I did not have to
 wait long. The errors appeared after some hours of use. I
 installed the ports tree and ran the locate.updatedb command, just to keep
 the disk heads moving. I also added some audio files just to fill the disk space.
 
 Last night, there were only two corrupted files, immediately after
 the install:
 /usr/libdata/perl5/AnyDBM_File.pm and
 /usr/libdata/perl5/Attribute
 Those files disappeared:
 
 wd1(pciide0:0:1): timeout
   type: ata
   c_bcount: 2048
   c_skip: 0
 pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
 wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
 1489263; cn 1477 tn 7 sn 6), retrying
 wd1: soft error (corrected)
 wd1(pciide0:0:1): timeout
   type: ata
   c_bcount: 2048
   c_skip: 0
 pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61

hello, 

Are you using a slow disk and a fast disk on the same cable? I remember
that I experienced similar problems when I tried to put a slow 1.6 GB disk
together with a fast 40 GB disk on the same cable.

Are you using an 80-conductor cable?

-- 
Alexandre



Re: Complete disk disaster

2005-08-24 Thread Ramiro Aceves
Alexandre Ratchov wrote:
 On Wed, Aug 24, 2005 at 10:37:46AM +0200, Ramiro Aceves wrote:
 
First, thank you very much for your interesting responses.

Yesterday evening I installed OpenBSD again on the same disk,
just to see whether I could reproduce the errors. Yes! I did not have to
wait long. The errors appeared after some hours of use. I
installed the ports tree and ran the locate.updatedb command, just to keep
the disk heads moving. I also added some audio files just to fill the disk space.

Last night, there were only two corrupted files, immediately after
the install:
/usr/libdata/perl5/AnyDBM_File.pm and
/usr/libdata/perl5/Attribute
Those files disappeared:

wd1(pciide0:0:1): timeout
  type: ata
  c_bcount: 2048
  c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
  type: ata
  c_bcount: 2048
  c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
 
 
 hello, 
 
 Are you using a slow disk and a fast disk on the same cable? I remember
 that I experienced similar problems when I tried to put a slow 1.6 GB disk
 together with a fast 40 GB disk on the same cable.
 
 Are you using an 80-conductor cable?
 

Yes! I am using a 40 GB disk (approx. 4 years old) as master, and a 1 GB
disk (around 10 years old) as slave. The cable is 40-conductor, I think.
Both are on the same cable.

Thanks

Ramiro.



Re: Complete disk disaster

2005-08-24 Thread Alexandre Ratchov
On Wed, Aug 24, 2005 at 12:53:45PM +0200, Ramiro Aceves wrote:
 
 Yes! I am using a 40 GB disk (approx. 4 years old) as master, and a 1 GB
 disk (around 10 years old) as slave. The cable is 40-conductor, I think.
 Both are on the same cable.
 

hmmm... can you try putting slow devices and fast devices on separate cables?
By slow devices I mean CD-ROMs and old hard disks.

-- 
Alexandre



Re: Complete disk disaster

2005-08-24 Thread Matty

On Wed, 24 Aug 2005, Stuart Henderson wrote:


--On 24 August 2005 10:37 +0200, Ramiro Aceves wrote:


pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn
1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
type: ata
c_bcount: 2048
c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn
1486239; cn 1474 tn 7 sn 6), retrying
wd1: soft error (corrected)

[etc]

All hard drives have bad blocks, and most hard drives now have some spare
capacity. As the drive detects bad or failing blocks, it automatically
remaps them to spare blocks. This is internal to the drive: by the time
you start noticing drive errors, the drive is usually unable to remap
any more blocks.


smartmontools does a great job of notifying you before this occurs.
When you start up smartd to alert you when S.M.A.R.T. attributes change, you
can watch the drive slowly die over time. smartmontools is part of the OpenBSD
ports tree, in case you are interested in giving it a spin.



Sometimes the manufacturer's drive-test tools can be useful (Hitachi/IBM's
DFT can do some basic tests on drives from other manufacturers too). There's
also a commercial program, SpinRite, which claims to have good stress tests.




Re: Complete disk disaster

2005-08-23 Thread Alexandre Ratchov
On Mon, Aug 22, 2005 at 03:34:34PM +0200, Ramiro Aceves wrote:
 Hello Friends.
 
 I am new to OpenBSD (but not to Unixes); my experience with this OS is
 only a month. I was getting more and more comfortable with the OS, and
 falling in love with it, but today I experienced a very weird and
 strange thing.
 
 My OpenBSD testing system is installed on the second IDE disk (1GB).
 
 I was enjoying a happy X Window fluxbox session. I installed the links
 web browser package with pkg_add -v ftp://. , as usual. I was
 surfing the net for some time (ppp connection). I stopped the web browser and
 opened an xterm window in order to search for a certain man page. I was
 surprised because I could not see any man page! The error was something
 like: /etc/man.conf/ Not a directory. I stopped the X Window session
 and attempted to log in at the console. I was not able to do it. It seemed
 that the /etc/ directory had suffered some kind of damage.
 
 Login: root
 Aug 22 14:44:42 openbsd-remigio login: cannot stat /etc/login.conf: Not
 a directory
 
 Aug 22 14:44:42 openbsd-remigio passwd: /etc/pwd.db: Not a directory.
 
 Login incorrect
 Login:
 
 
 and so on.
 
 I started thinking that something serious could have happened, but I
 trusted in a reboot. I rebooted the system and it prompted for single
 user mode (I do not know if this is the right word; that is what I called
 it on Linux). I ran # fsck /dev/wd1a and it discovered plenty of
 errors in the /etc/ directory and some other directories. It created a
 lost+found with the garbage it found.
 
 After the cleaning, I rebooted again, but the /etc/ directory was wiped out.
 
 The /var/ directory also disappeared. I have searched for /var/log/*
 information in the lost+found directory but had no luck.
 
 Luckily, this system is only a system for fun. ;-).
 
 What could cause this disaster?
 
 Please, feel free to ask me for any information that you need before I
 wipe the entire disk and install a fresh OpenBSD again.
 

hello, 

Last year I had similar problems because of a bad IDE cable. Within a few
hours there were randomly corrupted files, but no disk error messages in the
log.

Finally I changed the cable and installed a fresh OpenBSD.

regards,

-- 
Alexandre



Re: Complete disk disaster

2005-08-23 Thread Ramiro Aceves
Alexandre Ratchov wrote:
 On Mon, Aug 22, 2005 at 03:34:34PM +0200, Ramiro Aceves wrote:
 
Hello Friends.

I am new to OpenBSD (but not to Unixes); my experience with this OS is
only a month. I was getting more and more comfortable with the OS, and
falling in love with it, but today I experienced a very weird and
strange thing.

My OpenBSD testing system is installed on the second IDE disk (1GB).

I was enjoying a happy X Window fluxbox session. I installed the links
web browser package with pkg_add -v ftp://. , as usual. I was
surfing the net for some time (ppp connection). I stopped the web browser and
opened an xterm window in order to search for a certain man page. I was
surprised because I could not see any man page! The error was something
like: /etc/man.conf/ Not a directory. I stopped the X Window session
and attempted to log in at the console. I was not able to do it. It seemed
that the /etc/ directory had suffered some kind of damage.

Login: root
Aug 22 14:44:42 openbsd-remigio login: cannot stat /etc/login.conf: Not
a directory

Aug 22 14:44:42 openbsd-remigio passwd: /etc/pwd.db: Not a directory.

Login incorrect
Login:


and so on.

I started thinking that something serious could have happened, but I
trusted in a reboot. I rebooted the system and it prompted for single
user mode (I do not know if this is the right word; that is what I called
it on Linux). I ran # fsck /dev/wd1a and it discovered plenty of
errors in the /etc/ directory and some other directories. It created a
lost+found with the garbage it found.

After the cleaning, I rebooted again, but the /etc/ directory was wiped out.

The /var/ directory also disappeared. I have searched for /var/log/*
information in the lost+found directory but had no luck.

Luckily, this system is only a system for fun. ;-).

What could cause this disaster?

Please, feel free to ask me for any information that you need before I
wipe the entire disk and install a fresh OpenBSD again.

 
 
 hello, 
 
 Last year I had similar problems because of a bad IDE cable. Within a few
 hours there were randomly corrupted files, but no disk error messages in the
 log.
 
 Finally I changed the cable and installed a fresh OpenBSD.
 
 regards,
 

Hello Alexandre and OpenBSD fans:

Many thanks for the information. As you and other OpenBSD friends said, I
must look for a disk failure or a cable failure. I am going to install a
fresh OpenBSD 3.7 and see whether I can reproduce the file corruption.
This IDE disk is the slave of my main master disk (first IDE cable), so
they are sharing the same cable. Of course, the slave disk
connector could be broken (loose connection). I am going to do some disk
transfers and move the cable back and forth and see whether it triggers
the corruption problem.

Do you know of any disk test or utility program that can stress the disk
to work hard until it fails?

Any suggestions will be appreciated.
Regards

Ramiro.



Re: Complete disk disaster

2005-08-23 Thread Nick Holland
Ramiro Aceves wrote:
 Alexandre Ratchov wrote:
...
What could cause this disaster?

Please, feel free to ask me for any information that you need before I
wipe the entire disk and install a fresh OpenBSD again.

 
 
 hello, 
 
 Last year I had similar problems because of a bad IDE cable. Within a few
 hours there were randomly corrupted files, but no disk error messages in the
 log.
 
 Finally I changed the cable and installed a fresh OpenBSD.
 
 regards,
 
 
 Hello Alexandre and OpenBSD fans:
 
 Many thanks for the information. As you and other OpenBSD friends said, I
 must look for a disk failure or a cable failure. I am going to install a
 fresh OpenBSD 3.7 and see whether I can reproduce the file corruption.
 This IDE disk is the slave of my main master disk (first IDE cable), so
 they are sharing the same cable. Of course, the slave disk
 connector could be broken (loose connection). I am going to do some disk
 transfers and move the cable back and forth and see whether it triggers
 the corruption problem.
 
 Do you know of any disk test or utility program that can stress the disk
 to work hard until it fails?

I'd agree, looks like a hardware problem.
Note the age of your 1G drive...it has got to be close to ten years old.

OpenBSD's file systems are very solid.  What you saw is a very
extraordinary event.  I've seen something close to that only a very few
times in many years and many machines of working with OpenBSD, and it
always involved a power-off in the middle of some disk activity, and
that's not what happened in your case.  I routinely freak people out by
tapping the power button on an OpenBSD machine if it is not convenient to
log in to do a proper shutdown (yeah, I make sure it isn't busy at the
time, but otherwise, just hit the button).

Good way to work a hard disk: Unpack ports or source tar.gz files,
'specially with softdeps off.
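A rough way to imitate that kind of write-heavy workload and check the result is sketched below. It is not a tool mentioned in the thread: the function `write_and_verify` and its parameters are hypothetical, and it assumes the target directory lives on the disk under suspicion. One caveat: the read-back may be served from the buffer cache rather than the platters, so files should add up to more than the machine's RAM for a meaningful test.

```python
import hashlib
import os


def write_and_verify(directory, file_count=16, file_size=64 * 1024):
    """Write files of random data, read them back, return paths that fail."""
    expected = {}
    for i in range(file_count):
        path = os.path.join(directory, f"stress_{i:04d}.bin")
        data = os.urandom(file_size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the disk
        expected[path] = hashlib.sha256(data).hexdigest()

    bad = []
    for path, digest in expected.items():
        try:
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            bad.append(path)  # the read itself failed
            continue
        if actual != digest:
            bad.append(path)  # data came back, but it is not what was written
    return bad
```

An empty return value only means this particular pass found nothing; as the thread shows, errors may take hours of activity to appear.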

Nick.



Re: Complete disk disaster

2005-08-23 Thread Josh Grosse
On Tue, Aug 23, 2005 at 11:29:18AM +0200, Ramiro Aceves wrote:
...
 Do you know of any disk test or utility program that can stress the disk
 to work hard until it fails?

Smartmontools is available as an OBSD package.  From the port readme:

--

smartmontools-5.33 -- control and monitor storage systems using SMART

The smartmontools package contains two utility programs (smartctl and
smartd) to control and monitor storage systems using the
Self-Monitoring, Analysis and Reporting Technology System (SMART) built
into most modern ATA and SCSI hard disks.  In many cases, these
utilities will provide advanced warning of disk degradation and failure.

See http://smartmontools.sourceforge.net/ for details.



Re: Complete disk disaster

2005-08-23 Thread Bryan Irvine
 Good way to work a hard disk: Unpack ports or source tar.gz files,
 'specially with softdeps off.

And once you are done unpacking run /usr/libexec/locate.updatedb a few times :-)

This would cause the system to 'touch' every file on your drive and
you will almost surely see errors if there is a disk problem.
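The same idea can be pushed one step further by reading the contents of every file, not only walking the directory tree. This is a minimal sketch under that assumption, not what locate.updatedb itself does; `read_all_files` is a made-up name.

```python
import os


def read_all_files(root, chunk_size=1 << 20):
    """Read every regular file under root; return (path, error) per failure."""
    failures = []

    def on_walk_error(err):
        # Directories that cannot be listed are recorded as failures too.
        failures.append((err.filename, err.strerror))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip FIFOs, sockets, devices, broken symlinks
            try:
                with open(path, "rb") as f:
                    while f.read(chunk_size):
                        pass  # discard the data; only the read matters
            except OSError as err:
                failures.append((path, err.strerror))
    return failures
```

Calling `read_all_files("/usr")` on a failing disk would be expected to return entries similar to the "Bad file descriptor" errors shown earlier in the thread, while the kernel logs the underlying timeouts.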


--Bryan



Re: Complete disk disaster

2005-08-23 Thread Will H. Backman
Most drives keep track of errors and are able to warn you of trouble
before they fail completely.  SMART is not always reliable, but should
warn you of coming problems.
See the atactl man page



Re: Complete disk disaster

2005-08-23 Thread Ramiro Aceves
Josh Grosse wrote:
 On Tue, Aug 23, 2005 at 11:29:18AM +0200, Ramiro Aceves wrote:
 
...
Do you know of any disk test or utility program that can stress the disk
to work hard until it fails?
 


Oh, thanks, but I tried to do it a month ago from my Linux box and this
is an old disk that does not have the SMART thing. :-(

Thank you very much.

Ramiro.


 
 Smartmontools is available as an OBSD package.  From the port readme:
 
 --
 
 smartmontools-5.33 -- control and monitor storage systems using SMART
 
 The smartmontools package contains two utility programs (smartctl and
 smartd) to control and monitor storage systems using the
 Self-Monitoring, Analysis and Reporting Technology System (SMART) built
 into most modern ATA and SCSI hard disks.  In many cases, these
 utilities will provide advanced warning of disk degradation and failure.
 
 See http://smartmontools.sourceforge.net/ for details.



Re: Complete disk disaster

2005-08-23 Thread Edd Barrett
 Oh, thanks, but I tried to do it a month ago from my Linux box and this
 is an old disk that does not have the SMART thing. :-(

At the price of storage media these days, you may as well just buy another disk.

Regards

Edd