detecting bad disks
Trying to copy a file from one filesystem to another, I kept getting input/output errors. I noticed these messages in the logs:

wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; cn 762 tn 5 sn 6), retrying
wd1a: uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; cn 762 tn 5 sn 6)

Okay, so clearly wd1 has some issues (wd1a is the only filesystem on that disk). I've already started moving the data to a different disk.

Now, I thought I was going to be alerted to this sort of thing automatically because of an entry like this one in the crontab:

0 * * * * /sbin/atactl /dev/wd0c smartstatus >/dev/null

However, when I run this by hand, I get

$ sudo /sbin/atactl /dev/wd1 smartstatus
No SMART threshold exceeded

So clearly, the SMART stuff wasn't going to tell me about this. However:

$ sudo /sbin/atactl /dev/wd1 readattr
Attributes table revision: 16
ID   Attribute name                    Threshold  Value  Raw
  1  Raw Read Error Rate                      51    199  0x0081
  3  Spin Up Time                             21    123  0x1127
  4  Start/Stop Count                         40     99  0x056f
  5  Reallocated Sector Count                140    200  0x
  7  Seek Error Rate                          51    200  0x
  9  Power-on Hours Count                      0     73  0x4da4
 10  Spin Retry Count                         51    100  0x
 11  Unknown                                  51    100  0x
 12  Device Power Cycle Count                  0     99  0x056e
194  Temperature                               0    101  0x0031
196  Reallocation Event Count                  0    200  0x
197  Current Pending Sector Count              0    197  0x0068
198  Off-line Scan Uncorrectable Sect          0    199  0x0032
199  Ultra DMA CRC Error Count                 0    200  0x

I see a number of values that exceed the preset thresholds. But I see the same kinds of values on the other three drives:

$ sudo /sbin/atactl /dev/wd0 readattr
Attributes table revision: 16
ID   Attribute name                    Threshold  Value  Raw
  1  Raw Read Error Rate                      51    200  0x
  3  Spin Up Time                             21     96  0x175f
  4  Start/Stop Count                         40     96  0x110f
  5  Reallocated Sector Count                140    196  0x003a
  7  Seek Error Rate                          51    200  0x
  9  Power-on Hours Count                      0     80  0x3a71
 10  Spin Retry Count                         51    100  0x
 11  Unknown                                  51    100  0x
 12  Device Power Cycle Count                  0     99  0x0585
196  Reallocation Event Count                  0    181  0x0013
197  Current Pending Sector Count              0    200  0x
198  Off-line Scan Uncorrectable Sect          0    200  0x
199  Ultra DMA CRC Error Count                 0    200  0x0001
200  Unknown                                  51    200  0x

$ sudo /sbin/atactl /dev/wd2 readattr
Attributes table revision: 16
ID   Attribute name                    Threshold  Value  Raw
  3  Spin Up Time                             63    200  0x1b3c
  4  Start/Stop Count                          0    253  0x0020
  5  Reallocated Sector Count                 63    253  0x
  6  Unknown                                 100    253  0x
  7  Seek Error Rate                           0    253  0x
  8  Seek Time Performance                   187    253  0xaa64
  9  Power-on Hours Count                      0    217  0xb2b8
 10  Spin Retry Count                        157    253  0x
 11  Unknown                                 223    253  0x
 12  Device Power Cycle Count                  0    253  0x003b
192  Power-off Retract Count                   0    253  0x
193  Load Cycle
Re: detecting bad disks
knitti wrote:
> - SMART didn't catch the errors. no monitoring is perfect, but it seems
>   unlikely that it won't notice read errors

Also, SMART thresholds are defined by the vendor. Setting them too high reduces the number of warranty claims.

You'll notice wd1 has raw read errors and pending/offline uncorrectable sectors; they're an indicator of drive problems. wd0 also has 0x13 reallocated sectors; depending on the size of the drive, that may be okay.

It might be a good idea to do an offline self-test on all your drives and see if the numbers change.
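To make the raw counters mentioned above easier to compare, here is a minimal Python sketch that decodes the hexadecimal Raw column into decimal counts. The values are copied from the wd1 and wd0 readattr output quoted in this thread. The names RAW_HEX, ATTRIBUTE_NAMES, decode_raw and nonzero_counts are made up for this illustration, and it assumes the raw fields are complete as archived (entries that appear only as "0x" are left out).

```python
# Raw column values as they appear in the readattr output quoted in this thread.
# Attributes whose raw field is shown only as "0x" are omitted, since their
# digits are not recoverable from the archive.
RAW_HEX = {
    "wd1": {1: "0x0081", 197: "0x0068", 198: "0x0032"},
    "wd0": {5: "0x003a", 196: "0x0013", 199: "0x0001"},
}

ATTRIBUTE_NAMES = {
    1: "Raw Read Error Rate",
    5: "Reallocated Sector Count",
    196: "Reallocation Event Count",
    197: "Current Pending Sector Count",
    198: "Off-line Scan Uncorrectable Sect",
    199: "Ultra DMA CRC Error Count",
}


def decode_raw(raw_hex):
    """Convert a raw field such as '0x0068' to an integer."""
    return int(raw_hex, 16)


def nonzero_counts(disk):
    """Return {attribute_id: count} for the listed raw fields that are nonzero."""
    counts = {attr: decode_raw(raw) for attr, raw in RAW_HEX[disk].items()}
    return {attr: n for attr, n in counts.items() if n > 0}


if __name__ == "__main__":
    for disk in sorted(RAW_HEX):
        for attr, n in sorted(nonzero_counts(disk).items()):
            print(f"{disk}: {ATTRIBUTE_NAMES[attr]} = {n}")
```

On this data, wd1's pending-sector field decodes to 104 and its offline-uncorrectable field to 50, while wd0's reallocation event field decodes to 19 (the 0x13 mentioned above).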
Re: detecting bad disks
All drives develop read errors over time. When you write to these blocks, the drive may automatically remap them and the errors disappear. Just because you get some read errors doesn't mean the drive is necessarily about to die. But if you develop new bad blocks with any frequency, you might want to replace the drive.
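One way to act on the "new bad blocks with any frequency" advice is to record the raw counters periodically and compare them. A small Python sketch follows. The function name new_bad_blocks and both snapshots are hypothetical (the later reading is invented purely for illustration), and the choice of watched attribute IDs is an assumption about which counters matter.

```python
# Attribute IDs assumed to track bad-block activity:
# 5 = reallocated sectors, 196 = reallocation events,
# 197 = current pending sectors, 198 = offline uncorrectable sectors.
WATCHED = (5, 196, 197, 198)


def new_bad_blocks(earlier, later, watched=WATCHED):
    """Return {attribute_id: increase} for watched counters that grew."""
    growth = {}
    for attr in watched:
        if attr in earlier and attr in later and later[attr] > earlier[attr]:
            growth[attr] = later[attr] - earlier[attr]
    return growth


if __name__ == "__main__":
    # Hypothetical readings taken some time apart (decimal raw counts).
    earlier = {5: 0, 197: 104, 198: 50}
    later = {5: 0, 197: 110, 198: 50}  # invented later reading
    print(new_bad_blocks(earlier, later))
```

A counter that stays flat between readings is less worrying than one that keeps climbing, which is the distinction the reply draws.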
Re: detecting bad disks
On 11/8/07, Derick Siddoway wrote:
> Trying to copy a file from one filesystem to another, I kept getting
> input/output errors. I noticed these messages in the logs:
>
> wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
> wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
> ...
> However, when I run this by hand, I get
>
> $ sudo /sbin/atactl /dev/wd1 smartstatus
> No SMART threshold exceeded
>
> So clearly, the SMART stuff wasn't going to tell me about this.
> ...
> I see a number of values that exceed the preset thresholds. But I see
> the same kinds of values on the other three drives:

Not all SMART thresholds define an upper value; some values are a sort of quality measurement and go downwards. Indeed, your SMART values indicate no error. Two possibilities:

- SMART didn't catch the errors. No monitoring is perfect, but it seems unlikely that it won't notice read errors.
- Everything is OK with the disk, but something else is not. Try a different cable, look for faulty RAM or a dying PSU. Put the disk into another machine and see whether you can read everything fine.

--knitti
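The "values go downwards" point can be sketched in Python. This assumes the usual SMART convention that an attribute counts as failing when its normalized value drops to or below a nonzero vendor threshold. The rows are wd1's Threshold and Value columns as read from the quoted output (the archive ran the two columns together, so the split is an assumption), and is_failing and failing_attributes are illustrative names, not part of atactl.

```python
# (id, threshold, value) for wd1, split from the fused columns in the archive.
WD1_ROWS = [
    (1, 51, 199), (3, 21, 123), (4, 40, 99), (5, 140, 200),
    (7, 51, 200), (9, 0, 73), (10, 51, 100), (11, 51, 100),
    (12, 0, 99), (194, 0, 101), (196, 0, 200), (197, 0, 197),
    (198, 0, 199), (199, 0, 200),
]


def is_failing(threshold, value):
    """Normalized values start high and decay; failure means value <= threshold.

    A threshold of 0 is treated as "never fails" (informational attribute).
    """
    return threshold > 0 and value <= threshold


def failing_attributes(rows):
    """Return the IDs of attributes at or below their threshold."""
    return [attr for attr, threshold, value in rows if is_failing(threshold, value)]


if __name__ == "__main__":
    print(failing_attributes(WD1_ROWS))
```

For wd1 this returns an empty list, which matches "No SMART threshold exceeded": every value sits above its threshold rather than in breach of it.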
Re: detecting bad disks
Try a new IDE cable. Or, if you can take the system offline and can boot off CD/floppy, I would suggest trying MHDD from http://hddguru.com/ as it does some pretty nice low-level diagnostics. I've fixed some disks with it {crosses fingers}.

The docs are very basic, but there's more info on the forum: http://forum.hddguru.com/?sid=9996c0fa3656ff12a72b5227b745a49b

Floppy and CD .iso images are here: http://hddguru.com/content/en/software/2006.02.10-Magic-Boot-Disk/

Of course, be aware of what you're doing with these tools.

mike
Re: detecting bad disks
2007/11/8, Derick Siddoway:
> the filesystem. What's the best way to do this short of monitoring?

sysutils/smartmontools

Best
Martin