detecting bad disks

2007-11-08 Thread Derick Siddoway
Trying to copy a file from one filesystem to another, I kept getting
input/output errors.  I noticed these messages in the logs:

wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; 
cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; 
cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; 
cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; 
cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; 
cn 762 tn 5 sn 6), retrying
wd1a:  uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; 
cn 762 tn 5 sn 6)

Okay, so clearly wd1 has some issues (wd1a is the only filesystem on that
disk).  I've already started moving the data to a different disk.
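
To see how many distinct blocks are actually failing, rather than how many retries were logged, the messages can be reduced to a set of block numbers. The following is only a sketch in Python: the regular expression is inferred from the sample lines above (not from the kernel source), it assumes each message sits on a single unwrapped line as it would in the log file, and the name `bad_blocks` is made up for this example.

```python
import re

# Pattern inferred from the wd error lines quoted above (an assumption,
# not taken from the driver source). search() is used so that a syslog
# timestamp/host prefix in front of the message does not matter.
ERR = re.compile(
    r"(?P<part>wd\d+[a-p]):\s+uncorrectable data error reading "
    r"fsbn (?P<fsbn>\d+) of \d+-\d+ \((?P<disk>wd\d+) bn (?P<bn>\d+);"
)

def bad_blocks(lines):
    """Return {disk: sorted list of distinct failing block numbers}."""
    found = {}
    for line in lines:
        m = ERR.search(line)
        if m:
            found.setdefault(m.group("disk"), set()).add(int(m.group("bn")))
    return {disk: sorted(blocks) for disk, blocks in found.items()}

sample = [
    "wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 "
    "(wd1 bn 768416; cn 762 tn 5 sn 5), retrying",
    "wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 "
    "(wd1 bn 768416; cn 762 tn 5 sn 5), retrying",
    "wd1a:  uncorrectable data error reading fsbn 768417 of 768384-0 "
    "(wd1 bn 768417; cn 762 tn 5 sn 6)",
]
print(bad_blocks(sample))
```

For the messages quoted above this reports two adjacent blocks on wd1, 768416 and 768417.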

Now, I thought I was going to be alerted to this sort of thing automatically
because of an entry like this one in the crontab:

0 * * * *   /sbin/atactl /dev/wd0c smartstatus >/dev/null

However, when I run this by hand, I get

[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 smartstatus
No SMART threshold exceeded

So clearly, the SMART stuff wasn't going to tell me about this.

However:
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 readattr
Attributes table revision: 16
ID  Attribute name                   Threshold  Value  Raw
  1 Raw Read Error Rate                     51    199  0x0081
  3 Spin Up Time                            21    123  0x1127
  4 Start/Stop Count                        40     99  0x056f
  5 Reallocated Sector Count               140    200  0x
  7 Seek Error Rate                         51    200  0x
  9 Power-on Hours Count                     0     73  0x4da4
 10 Spin Retry Count                        51    100  0x
 11 Unknown                                 51    100  0x
 12 Device Power Cycle Count                 0     99  0x056e
194 Temperature                              0    101  0x0031
196 Reallocation Event Count                 0    200  0x
197 Current Pending Sector Count             0    197  0x0068
198 Off-line Scan Uncorrectable Sect         0    199  0x0032
199 Ultra DMA CRC Error Count                0    200  0x

I see a number of values that exceed the preset thresholds.
But I see the same kinds of values on the other three drives:

[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd0 readattr
Attributes table revision: 16
ID  Attribute name                   Threshold  Value  Raw
  1 Raw Read Error Rate                     51    200  0x
  3 Spin Up Time                            21     96  0x175f
  4 Start/Stop Count                        40     96  0x110f
  5 Reallocated Sector Count               140    196  0x003a
  7 Seek Error Rate                         51    200  0x
  9 Power-on Hours Count                     0     80  0x3a71
 10 Spin Retry Count                        51    100  0x
 11 Unknown                                 51    100  0x
 12 Device Power Cycle Count                 0     99  0x0585
196 Reallocation Event Count                 0    181  0x0013
197 Current Pending Sector Count             0    200  0x
198 Off-line Scan Uncorrectable Sect         0    200  0x
199 Ultra DMA CRC Error Count                0    200  0x0001
200 Unknown                                 51    200  0x
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd2 readattr
Attributes table revision: 16
ID  Attribute name                   Threshold  Value  Raw
  3 Spin Up Time                            63    200  0x1b3c
  4 Start/Stop Count                         0    253  0x0020
  5 Reallocated Sector Count                63    253  0x
  6 Unknown                                100    253  0x
  7 Seek Error Rate                          0    253  0x
  8 Seek Time Performance                  187    253  0xaa64
  9 Power-on Hours Count                     0    217  0xb2b8
 10 Spin Retry Count                       157    253  0x
 11 Unknown                                223    253  0x
 12 Device Power Cycle Count                 0    253  0x003b
192 Power-off Retract Count                  0    253  0x
193 Load Cycle 

Re: detecting bad disks

2007-11-08 Thread Steve Shockley

knitti wrote:

> - SMART didn't catch the errors. no monitoring is perfect, but it
> seems unlikely that it won't notice read errors


Also, SMART thresholds are defined by the vendor.  A lenient threshold
means fewer drives are ever flagged as failing, which reduces the number
of warranty claims.


You'll notice wd1 has raw read errors and pending/offline uncorrectable
sectors; these are indicators of drive problems.  wd0 also has 0x3a
reallocated sectors (from 0x13 reallocation events); depending on the
size of the drive that may be okay.  It might be a good idea to run an
offline self-test on all your drives and see if the numbers change.
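
A rough way to apply this advice is to look at the raw counters directly instead of waiting for a threshold to trip. The Python sketch below is only an illustration: the choice of attribute IDs to watch and the name `nonzero_counters` are assumptions of this example, and only raw values that are fully legible in the posted tables are used.

```python
# Attribute IDs whose raw counter one would normally hope to see at zero.
# Names follow the atactl output quoted in this thread; the selection
# itself is an assumption of this sketch.
WATCHED = {
    5: "Reallocated Sector Count",
    196: "Reallocation Event Count",
    197: "Current Pending Sector Count",
    198: "Off-line Scan Uncorrectable Sect",
}

def nonzero_counters(raw_by_id):
    """raw_by_id maps attribute ID -> raw value (int).
    Returns {attribute name: raw count} for watched counters above zero."""
    return {
        WATCHED[attr_id]: raw
        for attr_id, raw in raw_by_id.items()
        if attr_id in WATCHED and raw > 0
    }

# Raw values as posted; entries shown only as "0x" in the archive are
# left out rather than guessed.
wd1 = {1: 0x0081, 197: 0x0068, 198: 0x0032}
wd0 = {5: 0x003A, 196: 0x0013, 199: 0x0001}

print("wd1:", nonzero_counters(wd1))
print("wd0:", nonzero_counters(wd0))
```

With the posted numbers, wd1 shows 104 pending and 50 offline-uncorrectable sectors, while wd0 shows 58 reallocated sectors from 19 reallocation events.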




Re: detecting bad disks

2007-11-08 Thread Chris Cappuccio
All drives develop read errors over time.  When you write to these blocks,
the drive may automatically remap them and the errors disappear.  Just
because you get some read errors doesn't mean the drive is necessarily
about to die.  But if you develop new bad blocks with any frequency, you
might want to replace the drive.
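
Whether new bad blocks are appearing "with any frequency" can be judged by comparing raw counters between two readings. The Python sketch below shows one possible way to do that; the function name `new_bad_sectors`, the set of watched attribute IDs, and the two sample readings are all hypothetical and not taken from any drive in this thread.

```python
def new_bad_sectors(previous, current, watched=(5, 197, 198)):
    """Compare two {attribute ID: raw count} snapshots and return
    {attribute ID: increase} for every watched counter that grew."""
    growth = {}
    for attr_id in watched:
        delta = current.get(attr_id, 0) - previous.get(attr_id, 0)
        if delta > 0:
            growth[attr_id] = delta
    return growth

# Hypothetical readings taken some time apart (illustrative numbers only).
last_week = {5: 58, 197: 0, 198: 0}
today = {5: 58, 197: 4, 198: 0}

print(new_bad_sectors(last_week, today))
```

A stable reallocated count with no growth elsewhere matches the "remapped and done" case; counters that keep climbing between readings are the case where replacing the drive is worth considering.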


Re: detecting bad disks

2007-11-08 Thread knitti
On 11/8/07, Derick Siddoway [EMAIL PROTECTED] wrote:
> Trying to copy a file from one filesystem to another, I kept getting
> input/output errors.  I noticed these messages in the logs:
>
> wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn
> 768416; cn 762 tn 5 sn 5), retrying
> wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn
> 768416; cn 762 tn 5 sn 5), retrying
...

> However, when I run this by hand, I get
>
> [EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 smartstatus
> No SMART threshold exceeded
>
> So clearly, the SMART stuff wasn't going to tell me about this.
>
> ...
>
> I see a number of values that exceed the preset thresholds.
> But I see the same kinds of values on the other three drives:

not all SMART thresholds define an upper limit; some values
are a sort of quality measurement and count downwards. Indeed,
your SMART values indicate no error. Two possibilities:

- SMART didn't catch the errors. no monitoring is perfect,
but it seems unlikely that it wouldn't notice read errors

- everything is OK with the disk, but something else
is not. Try a different cable, and look for faulty RAM or a
dying PSU. Put the disk into another machine and see
whether you can read everything fine.
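
The point about values that count downwards can be made concrete with a short Python sketch. It assumes the usual ATA convention that an attribute only trips when its normalized value has fallen to or below a non-zero threshold, and it reads the wd1 figures posted earlier in the thread as threshold 51 and value 199 for attribute 1, and so on. The function name `failing` is made up for this example.

```python
def failing(attrs):
    """attrs: iterable of (attribute ID, threshold, normalized value).
    Returns the IDs whose value has dropped to or below the threshold.
    Rows with threshold 0 are treated as informational (an assumption)."""
    return [
        attr_id
        for attr_id, threshold, value in attrs
        if threshold > 0 and value <= threshold
    ]

# (ID, threshold, value) for wd1, as read from the posted table.
wd1 = [
    (1, 51, 199),
    (3, 21, 123),
    (4, 40, 99),
    (5, 140, 200),
    (7, 51, 200),
    (197, 0, 197),
    (198, 0, 199),
]

print(failing(wd1))              # every value is still above its threshold
print(failing([(5, 140, 139)]))  # hypothetical row that would trip
```

Under that reading, a value larger than its threshold is the healthy case, which is consistent with atactl reporting "No SMART threshold exceeded" for wd1.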

--knitti



Re: detecting bad disks

2007-11-08 Thread michael hamerski
try a new IDE cable. or, if you can take the system offline and can
boot off cd/floppy, I would suggest trying MHDD from
http://hddguru.com/ which does some pretty nice low-level diagnostics.
I've fixed some disks with this {crosses fingers}

docs are very basic but there's more info on the forum
http://forum.hddguru.com/?sid=9996c0fa3656ff12a72b5227b745a49b.

floppy and cd .iso here
http://hddguru.com/content/en/software/2006.02.10-Magic-Boot-Disk/

of course, be careful with these tools and make sure you know what
they will do to the disk.

mike



Re: detecting bad disks

2007-11-08 Thread Martin Schröder
2007/11/8, Derick Siddoway [EMAIL PROTECTED]:
> the filesystem.  What's the best way to do this short of monitoring?

sysutils/smartmontools

Best
   Martin
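
For anyone wiring smartmontools into a script, `smartctl` reports its findings as a bit mask in its exit status. The Python sketch below decodes that mask; the bit meanings are paraphrased from the smartctl manual page, while the function names and the device path `/dev/wd1c` in the usage note are assumptions of this example.

```python
import subprocess

# Meaning of each bit in smartctl's exit status (bit 0 first),
# paraphrased from the smartctl manual page.
FLAGS = [
    "command line did not parse",
    "device open failed",
    "a SMART or ATA command to the disk failed",
    "SMART status check returned DISK FAILING",
    "prefail attributes at or below threshold",
    "attributes were at or below threshold in the past",
    "device error log contains errors",
    "device self-test log contains errors",
]

def decode_status(status):
    """Return the list of conditions encoded in a smartctl exit status."""
    return [msg for bit, msg in enumerate(FLAGS) if status & (1 << bit)]

def check(device):
    """Run `smartctl -H -A` on device and decode the exit status.
    Needs smartmontools installed and sufficient privileges."""
    proc = subprocess.run(
        ["smartctl", "-H", "-A", device],
        capture_output=True,
        text=True,
    )
    return decode_status(proc.returncode)

# Decoding works without touching any hardware:
print(decode_status(0b01001000))
```

A cron job could call something like `check("/dev/wd1c")` and send mail whenever the returned list is not empty.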