Re: Complete disk disaster
> I hope you are not storing any valuable data on a 10 year old hdd...

Yes, of course. I have a definitive answer now. After some days of use, the disk failed again. I moved the drive to another computer, and after compiling some ports, disk read failures appeared again, causing segfaults. I was paranoid, and just to confirm, I tried to install Debian Linux on it. I could not even finish the install because disk read failures led to segmentation faults.

The disk is now disassembled on my desk. The enclosure is removed. I am looking at the spinning disk, the heads, the control system. It is indeed an incredibly beautiful machine that man created. Just to destroy it, I plugged in the cables with the enclosure open. I created an FFS file system on it, mounted it, and copied some files to it; some were copied, some were not, and the errors were frequent. It has been an amazing experience seeing how the heads move to find the data on the disk.

The disk is in the trash now. Two weeks of free time wasted, but many things learned! Thank you very much. Tomorrow I will buy a new HD only for OpenBSD.

Ramiro.
Re: Complete disk disaster
First, thank you very much for your interesting responses.

Yesterday evening I installed OpenBSD again on the same disk, just to see whether I could reproduce the errors. Yes! I did not have to wait long. The errors appeared after some hours of use. I installed the ports tree and ran the locate.updatedb command, just to keep the disk heads moving. I also added some audio files to fill the disk space.

Last night there were only two corrupted files, immediately after the install: /usr/libdata/perl5/AnyDBM_File.pm and /usr/libdata/perl5/Attribute

Those files disappeared:

# pwd
/usr/libdata/perl5
# ls A*
ls: AnyDBM_File.pm: Bad file descriptor
ls: Attribute: Bad file descriptor
AutoLoader.pm AutoSplit.pm

This morning, the error count rose sharply. I could even record these errors:

wd1(pciide0:0:1): timeout
    type: ata
    c_bcount: 2048
    c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn 1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
    type: ata
    c_bcount: 2048
    c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn 1486239; cn 1474 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
    type: ata
    c_bcount: 2048
    c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn 1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1(pciide0:0:1): timeout
    type: ata
    c_bcount: 2048
    c_skip: 0
pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
wd1a: device timeout reading fsbn 1486376 of 1486376-1486379 (wd1 bn 1486439; cn 1474 tn 10 sn 17), retrying
wd1: soft error (corrected)

Here comes the fsck output full of errors.
It seems that the filesystem gets corrupted faster as the hard disk approaches its maximum capacity. The system is not even able to do a clean halt; it drops into ddb.

# fsck /dev/wd1a
** /dev/rwd1a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
UNALLOCATED I=62208 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/libdata/perl5/AnyDBM_File.pm
REMOVE? no
UNALLOCATED I=62209 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/libdata/perl5/Attribute
REMOVE? no
UNALLOCATED I=61952 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/bin/lam
REMOVE? no
UNALLOCATED I=61953 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/bin/last
REMOVE? no
UNALLOCATED I=61954 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/bin/lastcomm
REMOVE? no
UNALLOCATED I=61955 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/bin/ldd
REMOVE? no
UNALLOCATED I=85076 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/include/dev/ic/mpt_ioctl.h
REMOVE? no
UNALLOCATED I=85077 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/include/dev/ic/mpt_mpilib.h
REMOVE? no
UNALLOCATED I=85078 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/include/dev/ic/mpt_openbsd.h
REMOVE? no
UNALLOCATED I=85079 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/include/dev/ic/mpuvar.h
REMOVE? no
UNALLOCATED I=87776 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat1/mkdep.0
REMOVE? no
UNALLOCATED I=87777 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat1/mkdir.0
REMOVE? no
UNALLOCATED I=87778 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat1/mkfifo.0
REMOVE? no
UNALLOCATED I=87779 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat1/mktemp.0
REMOVE? no
UNALLOCATED I=89396 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat8/named.0
REMOVE? no
UNALLOCATED I=89397 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat8/ncheck.0
REMOVE? no
UNALLOCATED I=89398 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat8/ndp.0
REMOVE? no
UNALLOCATED I=89399 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/share/man/cat8/netgroup_mkdb.0
REMOVE? no
UNALLOCATED I=92099 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/ports/benchmarks/ubench
REMOVE? no
UNALLOCATED I=92097 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/ports/benchmarks/randread/distinfo
REMOVE? no
UNALLOCATED I=92098 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970 NAME=/usr/ports/benchmarks/randread/Makefile
REMOVE? no
UNALLOCATED I=92096 OWNER=root MODE=0 SIZE=0 MTIME=Jan 1 01:00 1970
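
The wd1a timeout messages in this post can be tallied to see whether the failures hit the same places repeatedly. The following is only a rough sketch in Python, not a tool anyone in the thread used: the regular expression was written against the lines pasted above and is an assumption about their format (other drivers or releases may word the message differently), and `tally_timeouts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Sample lines copied from the kernel messages quoted in this post.
LOG = """\
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn 1489263; cn 1477 tn 7 sn 6), retrying
wd1: soft error (corrected)
wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn 1486239; cn 1474 tn 7 sn 6), retrying
wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn 1489263; cn 1477 tn 7 sn 6), retrying
wd1a: device timeout reading fsbn 1486376 of 1486376-1486379 (wd1 bn 1486439; cn 1474 tn 10 sn 17), retrying
"""

# Assumption: this pattern only matches the message wording shown above.
TIMEOUT_RE = re.compile(
    r"^(?P<part>\w+): device timeout reading fsbn (?P<fsbn>\d+) of \d+-\d+ "
    r"\((?P<disk>\w+) bn (?P<bn>\d+); cn (?P<cn>\d+) tn (?P<tn>\d+) sn (?P<sn>\d+)\)"
)


def tally_timeouts(text):
    """Count timeout messages per (cylinder, disk block) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = TIMEOUT_RE.match(line.strip())
        if match:
            key = (int(match.group("cn")), int(match.group("bn")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (cylinder, block), hits in sorted(tally_timeouts(LOG).items()):
        print(f"cylinder {cylinder}  block {block}  timeouts {hits}")
```

On the lines quoted here, all the timeouts fall on cylinders 1474 and 1477, and one block is reported twice.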
Re: Complete disk disaster
Edd Barrett wrote:
>> Oh, thanks, but I tried to do it a month ago from my Linux box and this is an old disk that does not have the SMART thing. :-(
> At the price of storage media these days, you may as well just buy another disk.
> Regards
> Edd

Yes, disks are indeed very cheap. I had this spare disk just to try OpenBSD and get comfortable with it without the risk of breaking my Linux install. Now that I like OpenBSD, I am going to buy a disk for OpenBSD only. I am also considering ordering the CD, though I do not know whether to wait for the new release.

Ramiro. EA1ABZ
Re: Complete disk disaster
--On 24 August 2005 10:37 +0200, Ramiro Aceves wrote:
> pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
> wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn 1489263; cn 1477 tn 7 sn 6), retrying
> wd1: soft error (corrected)
> wd1(pciide0:0:1): timeout
>     type: ata
>     c_bcount: 2048
>     c_skip: 0
> pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
> wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn 1486239; cn 1474 tn 7 sn 6), retrying
> wd1: soft error (corrected)
[etc]

All hard drives have bad blocks; most hard drives now have some spare capacity. As the drive detects bad or failing blocks, the spare blocks are automatically remapped over the bad blocks. This is internal to the drive - by the time you start noticing drive errors, the drive is usually unable to remap any more blocks.

Sometimes the manufacturer's drive-test tools can be useful (Hitachi/IBM's DFT can do some basic tests on drives from other manufacturers too). There's also a commercial program, Spinrite, which claims to have good stress-tests.
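
The remapping behaviour described above can be pictured with a toy model. This is purely an illustration in Python under invented assumptions: `ToyDrive`, its spare-pool size, and the block numbers are all hypothetical, and real drive firmware is far more involved. It only shows why the host sees nothing until the spare pool runs out.

```python
class ToyDrive:
    """Toy model of spare-block remapping; not how any real firmware works."""

    def __init__(self, spares):
        self.spares_left = spares   # hypothetical size of the spare pool
        self.remapped = set()       # bad blocks hidden behind a spare
        self.visible_errors = []    # bad blocks the host actually notices

    def block_goes_bad(self, block):
        """Record a failing block, remapping it silently while spares remain."""
        if block in self.remapped or block in self.visible_errors:
            return  # already accounted for in this simplified model
        if self.spares_left > 0:
            self.spares_left -= 1
            self.remapped.add(block)
        else:
            self.visible_errors.append(block)


def simulate(spares, bad_blocks):
    """Feed a sequence of failing blocks to a fresh ToyDrive and return it."""
    drive = ToyDrive(spares)
    for block in bad_blocks:
        drive.block_goes_bad(block)
    return drive
```

With `simulate(spares=3, bad_blocks=range(100, 105))`, the first three failures are absorbed silently and only blocks 103 and 104 end up in `visible_errors`. That mirrors the point made above: errors that reach the operating system suggest the drive has little or no room left to hide them.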
Re: Complete disk disaster
On Wed, Aug 24, 2005 at 10:37:46AM +0200, Ramiro Aceves wrote:
> ...
> wd1(pciide0:0:1): timeout
>     type: ata
>     c_bcount: 2048
>     c_skip: 0
> pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
> wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn 1489263; cn 1477 tn 7 sn 6), retrying
> wd1: soft error (corrected)
> wd1(pciide0:0:1): timeout
>     type: ata
>     c_bcount: 2048
>     c_skip: 0
> pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61

hello,

are you using a slow disk and a fast disk on the same cable? I remember that I experienced similar problems when I tried to put a slow 1.6 GB disk together with a fast 40 GB disk on the same cable.

are you using an 80-conductor cable?

-- Alexandre
Re: Complete disk disaster
Alexandre Ratchov wrote:
> On Wed, Aug 24, 2005 at 10:37:46AM +0200, Ramiro Aceves wrote:
>> ...
> hello,
> are you using a slow disk and a fast disk on the same cable? I remember that I experienced similar problems when I tried to put a slow 1.6 GB disk together with a fast 40 GB disk on the same cable.
> are you using an 80-conductor cable?

Yes! I am using a 40 GB disk (approx. 4 years old) as master and a 1 GB disk (around 10 years old) as slave. The cable is 40-conductor, I think. Both are on the same cable.

Thanks
Ramiro.
Re: Complete disk disaster
On Wed, Aug 24, 2005 at 12:53:45PM +0200, Ramiro Aceves wrote:
> Yes! I am using a 40 GB disk (approx. 4 years old) as master and a 1 GB disk (around 10 years old) as slave. The cable is 40-conductor, I think. Both are on the same cable.

hmmm... can you try putting slow devices and fast devices on separate cables? By slow devices I mean cdroms and old hard disks.

-- Alexandre
Re: Complete disk disaster
On Wed, 24 Aug 2005, Stuart Henderson wrote:
> --On 24 August 2005 10:37 +0200, Ramiro Aceves wrote:
>> pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
>> wd1a: device timeout reading fsbn 1489200 of 1489200-1489203 (wd1 bn 1489263; cn 1477 tn 7 sn 6), retrying
>> wd1: soft error (corrected)
>> wd1(pciide0:0:1): timeout
>>     type: ata
>>     c_bcount: 2048
>>     c_skip: 0
>> pciide0:0:1: bus-master DMA error: missing interrupt, status=0x61
>> wd1a: device timeout reading fsbn 1486176 of 1486176-1486179 (wd1 bn 1486239; cn 1474 tn 7 sn 6), retrying
>> wd1: soft error (corrected)
> [etc]
>
> All hard drives have bad blocks; most hard drives now have some spare capacity. As the drive detects bad or failing blocks, the spare blocks are automatically remapped over the bad blocks. This is internal to the drive - by the time you start noticing drive errors, the drive is usually unable to remap any more blocks.

smartmontools does a great job of notifying you before this occurs. When you start up smartd to alert you when S.M.A.R.T. attributes change, you can watch the drive slowly die over time. smartmontools is part of the OpenBSD ports tree in case you are interested in giving it a spin.

> Sometimes the manufacturer's drive-test tools can be useful (Hitachi/IBM's DFT can do some basic tests on drives from other manufacturers too). There's also a commercial program, Spinrite, which claims to have good stress-tests.
Re: Complete disk disaster
On Mon, Aug 22, 2005 at 03:34:34PM +0200, Ramiro Aceves wrote:
> Hello Friends.
>
> I am new to OpenBSD (but not to Unixes); my experience with this OS is only a month. I was getting more and more comfortable with the OS, and falling in love with it, but today I experienced a very strange thing.
>
> My OpenBSD testing system is installed on the second IDE disk (1GB). I was enjoying a happy X-window fluxbox session. I installed the links web browser package with pkg_add -v ftp://. , as usual. I surfed the net for some time (ppp connection). I stopped the web browser and opened an xterm window in order to search for a certain man page. I was surprised because I could not see any man page! The error was something like:
>
> /etc/man.conf/ Not a directory.
>
> I stopped the X-window session and attempted to log in at the console. I was not able to do it. It seemed that the /etc/ directory had suffered some kind of damage.
>
> Login: root
> Aug 22 14:44:42 openbsd-remigio login: cannot stat /etc/login.conf: Not a directory
> Aug 22 14:44:42 openbsd-remigio passwd: /etc/pwd.db: Not a directory.
> Login incorrect
> Login:
>
> and so on.
>
> I started thinking that something serious could have happened, but I put my trust in a reboot. I rebooted the system and it prompted for single user mode (I do not know if this is the right word; that is what I called it on Linux). I ran
>
> # fsck /dev/wd1a
>
> and it discovered plenty of errors in the /etc/ directory and some other directories. It created a lost+found with the garbage it found. After the cleaning, I rebooted again, but the /etc/ directory was wiped out. The /var/ directory also disappeared. I have searched for /var/log/* information in the lost+found directory but had no luck.
>
> Luckily, this system is only a system for fun. ;-)
>
> What could cause this disaster? Please, feel free to ask me for any information that you need before I wipe the entire disk and install a fresh OpenBSD again.

hello,

Last year I had similar problems because of a bad IDE cable. Within a few hours there were randomly corrupted files, but no disk error messages in the log. Finally I changed the cable and installed a fresh OpenBSD.

regards,

-- Alexandre
Re: Complete disk disaster
Alexandre Ratchov wrote:
> On Mon, Aug 22, 2005 at 03:34:34PM +0200, Ramiro Aceves wrote:
>> ...
>> What could cause this disaster? Please, feel free to ask me for any information that you need before I wipe the entire disk and install a fresh OpenBSD again.
> hello,
> Last year I had similar problems because of a bad IDE cable. Within a few hours there were randomly corrupted files, but no disk error messages in the log. Finally I changed the cable and installed a fresh OpenBSD.
> regards,

Hello Alexandre and OpenBSD fans:

Many thanks for the information. As you and other OpenBSD friends said, I must look for a disk failure or a cable failure. I am going to install a fresh OpenBSD 3.7 and see whether I can reproduce the file corruption. This IDE disk is the slave of my main master disk (first IDE cable), so they share the same cable. Of course, the slave disk connector could be broken (a loose connection). I am going to do some disk transfers while moving the cable back and forth and see whether that triggers the corruption problem.

Do you know of any disk test or utility program that can stress the disk to work hard until it fails?

Any suggestions will be appreciated.

Regards
Ramiro.
Re: Complete disk disaster
Ramiro Aceves wrote:
> Alexandre Ratchov wrote:
>>> ...
>>> What could cause this disaster? Please, feel free to ask me for any information that you need before I wipe the entire disk and install a fresh OpenBSD again.
>> hello,
>> Last year I had similar problems because of a bad IDE cable. Within a few hours there were randomly corrupted files, but no disk error messages in the log. Finally I changed the cable and installed a fresh OpenBSD.
>> regards,
> Hello Alexandre and OpenBSD fans:
> Many thanks for the information. As you and other OpenBSD friends said, I must look for a disk failure or a cable failure. I am going to install a fresh OpenBSD 3.7 and see whether I can reproduce the file corruption. This IDE disk is the slave of my main master disk (first IDE cable), so they share the same cable. Of course, the slave disk connector could be broken (a loose connection). I am going to do some disk transfers while moving the cable back and forth and see whether that triggers the corruption problem.
> Do you know of any disk test or utility program that can stress the disk to work hard until it fails?

I'd agree, looks like a hardware problem. Note the age of your 1G drive...it has got to be close to ten years old.

OpenBSD's file systems are very solid. What you saw is a very extraordinary event. I've seen something close to that only a very few times in many years of working with OpenBSD on many machines, and it always involved a power-off in the middle of some disk activity, and that's not what happened in your case. I routinely freak people out by tapping the power button on an OpenBSD machine if it is not convenient to log in to do a proper shutdown (yeah, I make sure it isn't busy at the time, but otherwise, just hit the button).

Good way to work a hard disk: Unpack ports or source tar.gz files, 'specially with softdeps off.

Nick.
Re: Complete disk disaster
On Tue, Aug 23, 2005 at 11:29:18AM +0200, Ramiro Aceves wrote:
> ...
> Do you know of any disk test or utility program that can stress the disk to work hard until it fails?

Smartmontools is available as an OBSD package. From the port readme:

-- smartmontools-5.33 --
control and monitor storage systems using SMART

The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI hard disks. In many cases, these utilities will provide advanced warning of disk degradation and failure.

See http://smartmontools.sourceforge.net/ for details.
Re: Complete disk disaster
> Good way to work a hard disk: Unpack ports or source tar.gz files, 'specially with softdeps off.

And once you are done unpacking, run /usr/libexec/locate.updatedb a few times :-) This would cause the system to 'touch' every file on your drive and you will almost surely see errors if there is a disk problem.

--Bryan
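
The same idea of visiting every file can be taken one step further by reading each file's contents and collecting the I/O errors in one place. The following Python sketch is an assumption-laden illustration, not what locate.updatedb itself does (that only walks the tree). `read_every_file` is a hypothetical helper name and the chunk size is arbitrary.

```python
import os
import stat


def read_every_file(root, chunk_size=1 << 16):
    """Read every regular file under root; return (path, error) pairs for failures."""
    failures = []

    def on_walk_error(err):
        # os.walk reports directories it cannot list through this callback.
        failures.append((err.filename, err))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError as err:
                # A failing lstat is itself a symptom worth recording.
                failures.append((path, err))
                continue
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symlinks, devices, fifos and sockets
            try:
                with open(path, "rb") as handle:
                    while handle.read(chunk_size):
                        pass  # data is discarded; only read errors matter
            except OSError as err:
                failures.append((path, err))
    return failures
```

Calling `read_every_file("/usr")` as root and printing the returned pairs gives a list of unreadable paths. Files served from the buffer cache may not touch the disk at all, so a clean run is weaker evidence than a failing one.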
Re: Complete disk disaster
Most drives keep track of errors and are able to warn you of trouble before they fail completely. SMART is not always reliable, but it should warn you of coming problems. See the atactl man page.
Re: Complete disk disaster
Josh Grosse wrote:
> On Tue, Aug 23, 2005 at 11:29:18AM +0200, Ramiro Aceves wrote:
>> ...
>> Do you know of any disk test or utility program that can stress the disk to work hard until it fails?
>
> Smartmontools is available as an OBSD package. From the port readme:
>
> -- smartmontools-5.33 --
> control and monitor storage systems using SMART
>
> The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI hard disks. In many cases, these utilities will provide advanced warning of disk degradation and failure.
>
> See http://smartmontools.sourceforge.net/ for details.

Oh, thanks, but I tried to do it a month ago from my Linux box and this is an old disk that does not have the SMART thing. :-(

Thank you very much.

Ramiro.
Re: Complete disk disaster
> Oh, thanks, but I tried to do it a month ago from my Linux box and this is an old disk that does not have the SMART thing. :-(

At the price of storage media these days, you may as well just buy another disk.

Regards
Edd