Re: [zfs-discuss] Recurring checksum errors on RAIDZ2 vdev

2011-01-26 Thread Ashley Nicholls
fmdump -eV is very verbose and far too long to post here :) Here are snippets
of a SCSI error and a ZFS checksum error from the log, though.

Jan 20 2011 18:50:16.276742278 ereport.io.scsi.cmd.disk.dev.rqs.merr
nvlist version: 0
class = ereport.io.scsi.cmd.disk.dev.rqs.merr
ena = 0xf83e2f0e78101c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = 
/pci@0,0/pci8086,340e@7/pci1000,3080@0/iport@f0/disk@w5000c50010384d1d,0
devid = id1,sd@n5000c50010384d1f
(end detector)

driver-assessment = fatal
op-code = 0x28
cdb = 0x28 0x0 0x9 0xcb 0x6f 0x0 0x0 0x0 0x80 0x0
pkt-reason = 0x0
pkt-state = 0x3f
pkt-stats = 0x0
stat-code = 0x2
key = 0x3
asc = 0x11
ascq = 0x0
sense-data = 0xf0 0x0 0x3 0x9 0xcb 0x6f 0x77 0xa 0x0 0x0 0x0 0x0 0x11 
0x0 0x81 0x80 0x0 0x9d 0xdd 0xba
lba = 0x9cb6f00
__ttl = 0x1
__tod = 0x4d3883e8 0x107ec086


Jan 21 2011 11:20:35.197434250 ereport.fs.zfs.checksum
nvlist version: 0
class = ereport.fs.zfs.checksum
ena = 0x58ee5cfa5e102801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0xbfe55017c0fcfabe
vdev = 0x7d0e37f8d297b23b
(end detector)

pool = zpool002
pool_guid = 0xbfe55017c0fcfabe
pool_context = 0
pool_failmode = wait
vdev_guid = 0x7d0e37f8d297b23b
vdev_type = disk
vdev_path = /dev/dsk/c8t5000C50021177297d0s0
vdev_devid = id1,sd@n5000c50021177297/a
parent_guid = 0x35d57beca57a5f2a
parent_type = raidz
zio_err = 0
zio_offset = 0x2ca23e00
zio_size = 0x200
zio_objset = 0x31
zio_object = 0x1
zio_level = 0
zio_blkid = 0x31ca3f
bad_ranges = 0x0 0x200
bad_ranges_min_gap = 0x8
bad_range_sets = 0x615
bad_range_clears = 0x241
bad_set_histogram = 0x13 0x19 0x18 0x10 0x1c 0x1f 0x21 0x1c 0x17 0x15 
0x1a 0x16 0x1d 0x1f 0x1d 0x1d 0x16 0x16 0x1c 0x16 0x1d 0x15 0x1a 0x11 0x19 0x20 
0x19 0x1c 0x1b 0x18 0x17 0x18 0x15 0x13 0x13 0x19 0x15 0x1a 0x14 0x12 0x13 0x19 
0x16 0x19 0x19 0x16 0x12 0x13 0x16 0x22 0x17 0x15 0x1f 0x18 0x19 0x17 0x1e 0x19 
0x1e 0x14 0x14 0x19 0x19 0x1a
bad_cleared_histogram = 0xb 0xc 0x6 0xa 0x8 0x9 0x7 0xa 0xa 0xb 0xa 0xa 
0x8 0x6 0x6 0x4 0xd 0xb 0x9 0xe 0x8 0x6 0xa 0x9 0xa 0x7 0x9 0xb 0xa 0xc 0xa 0xa 
0xf 0xc 0x9 0xb 0x6 0xa 0x7 0xa 0xd 0xb 0x6 0x6 0x7 0x6 0xa 0x9 0xa 0x7 0x6 0x9 
0x4 0xa 0x2 0xa 0x7 0x9 0x7 0x9 0xa 0xc 0xa 0xa
__ttl = 0x1
__tod = 0x4d396c03 0xbc49b8a
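
To line the two event streams up by time, filtering on the event classes is
probably the way to go; roughly (going from memory of fmdump(1M)'s -c/-t/-T
options, so the exact syntax may need tweaking):

    # just the ZFS checksum ereports, one line per event
    fmdump -e -c ereport.fs.zfs.checksum

    # the SCSI ereports of the class shown above, limited to a time window
    fmdump -e -c ereport.io.scsi.cmd.disk.dev.rqs.merr -t 20Jan11 -T 21Jan11

For what it's worth, the sense data in the SCSI ereport decodes as sense key
0x3 / asc 0x11, i.e. a medium error (unrecovered read error) at the reported
lba.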


The SCSI errors do not appear to coincide with the ZFS checksum errors. Also,
iostat -xenC showed errors on a single disk (which, so far, has not been kicked
from the array) and on the controller itself (which I believe may be virtual,
as I'm using a dual-pathed system?):

                         extended device statistics                           ---- errors ----
    r/s    w/s    kr/s    kw/s wait  actv wsvc_t asvc_t  %w   %b s/w h/w trn tot device
 2704.3 2474.1 31196.8 54735.8  0.0 103.7    0.0   20.0   0 1742   0  70  55 125 c8
   56.8   59.6   656.0  1385.1  0.0   2.3    0.0   19.6   0   39   0   0   0   0 c8t5000C500104E3D83d0
   14.2    8.2   147.1   171.1  0.0   0.4    0.0   16.8   0    7   0   0   0   0 c8t5000C50021177453d0
   62.2   62.2   711.4  1429.1  0.0   2.5    0.0   20.2   0   42   0   0   0   0 c8t5000C500104D7BEFd0
   64.7   62.8   721.0  1439.1  0.0   2.9    0.0   22.9   0   46   0  70  55 125 c8t5000C50010384D1Fd0
   61.7   64.2   729.9  1438.9  0.0   2.6    0.0   20.5   0   43   0   0   0   0 c8t5000C500211792AFd0
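
Since c8 may just be the MPxIO pseudo-controller, the next step is probably to
watch the error counters over an interval and to confirm the path count on the
suspect disk; a rough sketch (assuming mpathadm is present on this NCP
install):

    # per-disk soft/hard/transport error counters, sampled every 10s for a minute
    iostat -xenC 10 6

    # check that the suspect disk really has two paths behind the
    # scsi_vhci pseudo-controller (device name taken from the output above)
    mpathadm list lu
    mpathadm show lu /dev/rdsk/c8t5000C50010384D1Fd0s2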


Re: [zfs-discuss] Recurring checksum errors on RAIDZ2 vdev

2011-01-25 Thread Richard Elling
On Jan 24, 2011, at 9:52 AM, Ashley Nicholls wrote:

 Hello all,
 
 I'm having a problem that I find difficult to diagnose.
 
 I have an IBM x3550 M3 running nexenta core platform 3.0.1 (134f) with 7x6 
 disk RAIDZ2 vdevs (see listing at bottom).
 Every day a disk fails with too many checksum errors, is marked as degraded,
 and is rebuilt onto a hot spare. I've been doing 'zpool detach zpool002
 <degraded disk>' to remove it from the zpool and return the pool's status to
 'ONLINE'. Later that day (or sometimes the next day), a disk is marked as
 degraded due to checksum errors and is rebuilt onto a hot spare again; rinse,
 repeat.
 
 We've been logging this stuff for the past few days, and there are a few
 things to note, however:
 1. The disk that fails appears to be the hot spare that we rebuilt onto the
 previous time.
 2. If I don't detach the degraded disk, then the newly rebuilt hot spare does
 not seem to fail.
 
 I'm just doing a scrub now to confirm there are no further checksum errors,
 and then I will detach the 'degraded' drive from the pool and see whether the
 new hot spare fails in the next 24 hours. Just wondering if anyone has seen
 this before?

I've seen this with SATA disks. Check the output of fmdump -eV and look at the
error reports for the ZFS checksum errors. They should show the type of
corruption detected. The type of corruption leads to further analysis
opportunities.
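
Something along these lines should pull out just the interesting fields per
event (class and field names as in your snippet; the egrep pattern is only a
convenience, so adjust it to taste):

    fmdump -eV -c ereport.fs.zfs.checksum | \
        egrep 'vdev_path|zio_offset|zio_size|bad_ranges|bad_range_sets|bad_range_clears'
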
 -- richard


 
 Thanks,
 Ashley
 
 pool: zpool002
 state: DEGRADED
 status: One or more devices has experienced an unrecoverable error.  An
 attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
 using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
 scan: scrub in progress since Mon Jan 24 17:17:39 2011
 25.3G scanned out of 3.91T at 25.9M/s, 43h38m to go
 0 repaired, 0.63% done
 config:
 
 NAME                         STATE     READ WRITE CKSUM
 zpool002                     DEGRADED     0     0     0
   raidz2-0                   ONLINE       0     0     0
     c8t5000C50020C780C3d0    ONLINE       0     0     0
     c8t5000C50020C785FBd0    ONLINE       0     0     0
     c8t5000C50020C7610Bd0    ONLINE       0     0     0
     c8t5000C50020C77413d0    ONLINE       0     0     0
     c8t5000C50020C77437d0    ONLINE       0     0     0
     c8t5000C50020DC9AE7d0    ONLINE       0     0     0
   raidz2-1                   DEGRADED     0     0     0
     c8t5000C50020DCBDCFd0    ONLINE       0     0     0
     c8t5000C50020E3E85Fd0    ONLINE       0     0     0
     c8t5000C50020E3F5FBd0    ONLINE       0     0     0
     c8t5000C50020E3F37Bd0    ONLINE       0     0     0
     c8t5000C50020E3F337d0    ONLINE       0     0     0
     spare-5                  DEGRADED     0     0   202
       c8t5000C5001034370Bd0  DEGRADED     0     0    23  too many errors
       c8t5000C50020E3F617d0  ONLINE       0     0     0
   raidz2-2                   ONLINE       0     0     0
     c8t5000C50020E9E6FFd0    ONLINE       0     0     0
     c8t5000C50020E33C97d0    ONLINE       0     0     0
     c8t5000C50020E94A63d0    ONLINE       0     0     0
     c8t5000C50020E94E4Bd0    ONLINE       0     0     0
     c8t5000C50020E233CFd0    ONLINE       0     0     0
     c8t5000C50020E3447Fd0    ONLINE       0     0     0
   raidz2-3                   ONLINE       0     0     0
     c8t5000C50020E9549Bd0    ONLINE       0     0     0
     c8t5000C50020E20003d0    ONLINE       0     0     0
     c8t5000C50020E28723d0    ONLINE       0     0     0
     c8t5000C50020E32873d0    ONLINE       0     0     0
     c8t5000C50020E95887d0    ONLINE       0     0     0
     c8t5000C50020E96577d0    ONLINE       0     0     0
   raidz2-4                   ONLINE       0     0     0
     c8t5000C50010384D1Fd0    ONLINE       0     0     0
     c8t5000C50021176F43d0    ONLINE       0     0     0
     c8t5000C50021177B3Bd0    ONLINE       0     0     0
     c8t5000C500211785F3d0    ONLINE       0     0     0
     c8t5000C500211792AFd0    ONLINE       0     0     0
     c8t5000C500211795C3d0    ONLINE       0     0     0
   raidz2-5                   ONLINE       0     0     0
     c8t5000C50025CCFEEBd0    ONLINE       0     0     0
     c8t5000C500104D7BEFd0    ONLINE       0     0     0
     c8t5000C500104D7FE7d0    ONLINE       0     0     0
     c8t5000C500104DD5AFd0    ONLINE       0     0     0
     c8t5000C500104DD43Bd0    ONLINE       0     0     0
     c8t5000C500104DD78Bd0    ONLINE       0     0     0
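
For reference, the detach-and-verify cycle described above boils down to
something like this (device name taken from the listing; treat it as a sketch,
and let the scrub finish before detaching anything):

    zpool scrub zpool002
    zpool status -v zpool002       # wait for the scrub, watch the CKSUM counters
    zpool detach zpool002 c8t5000C5001034370Bd0   # drop the degraded disk; the spare is promoted
    zpool clear zpool002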
  

Re: [zfs-discuss] Recurring checksum errors on RAIDZ2 vdev

2011-01-24 Thread Ian Collins

 On 01/25/11 06:52 AM, Ashley Nicholls wrote:

Hello all,

I'm having a problem that I find difficult to diagnose.

[rest of the original message snipped; it is quoted in full in Richard
Elling's reply above]

Just wondering if anyone has seen this before?


I used to see these all the time on a Thumper.  They magically vanished 
when I upgraded the drive firmware.


Check to see if your drives' firmware is up to date.
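
On Solaris-derived systems, iostat -En is a quick way to see which firmware
revision each drive reports; a sketch (trim the egrep pattern to taste):

    # one block per device: error counters plus Vendor/Product/Revision
    iostat -En | egrep 'c8t5000C5|Vendor'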

--
Ian.
