[zfs-discuss] Checksum errors on and after resilver

2010-04-14 Thread bonso
Hi all,
 I recently experienced a disk failure on my home server and observed checksum 
errors while resilvering the pool and on the first scrub after the resilver had 
completed. Now everything seems fine, but I'm posting this to get help calming 
my nerves and detecting any possible future faults.

 Let's start with some specs.
OSOL 2009.06
Intel SASUC8i (w LSI 1.30IT FW)
Gigabyte MA770-UD3 mobo w 8GB ECC RAM
Hitachi P7K500 harddrives

 When checking the condition of my pool some days ago (yes, I should make it 
mail me if something like this happens again), one disk in my pool was labeled 
as Removed with a small number of read errors, nineish I think; all other 
disks were fine. I removed and tested the disk (DFT crashed, so it seemed very 
broken), replaced the drive and started a resilver.
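
 A minimal sketch of the "mail me" check mentioned above. It assumes that 
`zpool status -x` prints "all pools are healthy" when nothing needs attention; 
verify that on your own release. The names needs_alert and check_pools are made 
up for illustration, and delivering the mail is left to cron or another mailer.

```python
import subprocess

# Assumption: this is the line `zpool status -x` prints when nothing is wrong.
HEALTHY_MESSAGE = "all pools are healthy"


def needs_alert(status_output):
    """Return True when the output reports anything other than healthy pools."""
    return HEALTHY_MESSAGE not in status_output


def check_pools():
    """Run `zpool status -x`; return its output if it needs attention, else None."""
    result = subprocess.run(
        ["zpool", "status", "-x"], capture_output=True, text=True
    )
    output = result.stdout + result.stderr
    return output if needs_alert(output) else None
```

 Run from a cron job, a small wrapper that prints the return value of 
check_pools() when it is not None would typically be enough, since cron 
usually mails any output to the job's owner.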

 Checking the status of the resilver everything looked good from the start but 
when it was finished the status report looked like this:
  pool: sasuc8i
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 4h9m with 0 errors on Mon Apr 12 18:12:26 2010
config:

        NAME         STATE     READ WRITE CKSUM
        sasuc8i      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c12t4d0  ONLINE       0     0     5  108K resilvered
            c12t8d0  ONLINE       0     0     0  254G resilvered
            c12t6d0  ONLINE       0     0     0
            c12t7d0  ONLINE       0     0     0
            c12t0d0  ONLINE       0     0     1  21.5K resilvered
            c12t1d0  ONLINE       0     0     2  43K resilvered
            c12t2d0  ONLINE       0     0     4  86K resilvered
            c12t3d0  ONLINE       0     0     1  21.5K resilvered

errors: No known data errors

 All I really cared about at this point was the "Applications are unaffected" 
and "No known data errors" lines, and I thought the checksum errors might be 
down to the failing drive (c12t5d0 failed; the controller labeled the new drive 
as c12t8d0) going out during a write. Then again, ZFS writes are atomic, so I 
decided to clear the errors and run a scrub. It came out like this:
  pool: sasuc8i
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 1h16m with 0 errors on Tue Apr 13 01:29:32 2010
config:

        NAME         STATE     READ WRITE CKSUM
        sasuc8i      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c12t4d0  ONLINE       0     0     5
            c12t8d0  ONLINE       0     0     0
            c12t6d0  ONLINE       0     0     0
            c12t7d0  ONLINE       0     0     4  86K repaired
            c12t0d0  ONLINE       0     0     1
            c12t1d0  ONLINE       0     0     6  86K repaired
            c12t2d0  ONLINE       0     0     4
            c12t3d0  ONLINE       0     0     6  108K repaired

errors: No known data errors
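
 For tracking these counters between scrubs, here is a rough sketch that pulls 
the devices with a nonzero CKSUM column out of saved `zpool status` text. It 
assumes rows shaped like the ones above (NAME STATE READ WRITE CKSUM, then an 
optional note). The function name cksum_errors is hypothetical, and rows whose 
counters are not plain integers (some versions abbreviate large counts) are 
simply skipped.

```python
def cksum_errors(status_text):
    """Map device name -> CKSUM count for every config row with a nonzero count."""
    errors = {}
    for line in status_text.splitlines():
        fields = line.split()
        # A config row looks like: NAME STATE READ WRITE CKSUM [note ...]
        if len(fields) < 5 or not all(f.isdigit() for f in fields[2:5]):
            continue  # header, blank line, or other non-table text
        cksum = int(fields[4])
        if cksum > 0:
            errors[fields[0]] = cksum
    return errors


# A few rows copied from the scrub report above.
sample = """
NAME STATE READ WRITE CKSUM
sasuc8i  ONLINE   0 0 0
c12t4d0  ONLINE   0 0 5
c12t8d0  ONLINE   0 0 0
c12t7d0  ONLINE   0 0 4  86K repaired
errors: No known data errors
"""
print(cksum_errors(sample))
```

 Saving the result after each scrub and comparing it with the previous one 
shows whether the counters are still climbing or have settled.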

 Now I'm getting nervous. Checksum errors, some repaired, others not. Am I going 
to end up with multiple drive failures or what the * is going on here?

 I ran one more scrub and everything came up roses.
 I checked the SMART status on the drives with checksum errors and they are 
fine, although I expect only read/write errors would show up there.

 I'm not sure how to turn this into a proper question, but what I'm after is: 
is this normal after a resilver, and can I start breathing again? As far as I 
can gather, checksum errors mean dodgy data on disk, while read/write errors 
point to something in the physical link (more or less).

Thank you!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Checksum errors on and after resilver

2010-04-14 Thread Richard Elling
[this seems to be the question of the day, today...]

On Apr 14, 2010, at 2:57 AM, bonso wrote:

> Now I'm getting nervous. Checksum errors, some repaired, others not. Am I
> going to end up with multiple drive failures or what the * is going on here?

When I see many disks suddenly reporting errors, I suspect a common
element: HBA, cables, backplane, mobo, CPU, power supply, etc.

If you search the zfs-discuss archives you can find instances where
HBA firmware, driver issues, or firmware+driver interactions caused
such reports. Cabling and power supplies are less commonly reported.
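
The "many disks at once" heuristic can be sketched as a simple count. This is
only an illustration: the function name suspect_shared_component and the
threshold of three devices are assumptions, not anything ZFS itself provides.

```python
def suspect_shared_component(errors_by_device, min_devices=3):
    """Return True when errors are spread over at least `min_devices` devices.

    Errors on one disk point at that disk; errors on many disks at once
    point at something they share (HBA, cables, backplane, power supply).
    """
    affected = [dev for dev, count in errors_by_device.items() if count > 0]
    return len(affected) >= min_devices


# CKSUM counts from the resilver report quoted in this thread.
after_resilver = {
    "c12t4d0": 5, "c12t8d0": 0, "c12t6d0": 0, "c12t7d0": 0,
    "c12t0d0": 1, "c12t1d0": 2, "c12t2d0": 4, "c12t3d0": 1,
}
print(suspect_shared_component(after_resilver))
```

With five of the eight disks reporting checksum errors, the resilver report
falls on the "common element" side of that line.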

> I ran one more scrub and everything came up roses.
> I checked the SMART status on the drives with checksum errors and they are
> fine, although I expect only read/write errors would show up there.
>
> I'm not sure how to turn this into a proper question, but what I'm after is:
> is this normal after a resilver, and can I start breathing again? As far as I
> can gather, checksum errors mean dodgy data on disk, while read/write errors
> point to something in the physical link (more or less).

Breathing is good.  Then check your firmware releases.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 




