Hi,
I need some help figuring out what happened to my zpool yesterday.  After a
planned shutdown to replace a failing drive, all the disks in one of two raidz1
vdevs were marked faulted and now my pool is basically toast, though it
continues to limp along.  I've never seen this behavior before, so I'm hoping
someone can offer advice.

The system is a Supermicro 3U server with 16 drive bays.  Two bays hold the OS
(rpool) drives, attached to the X7SB4 board's onboard SATA.  The remaining 14
drives are 1.5TB Western Digital WD15EADS drives attached to two Supermicro
AOC-SAT2-MV8 cards:

pci bus 0x0003 cardnum 0x01 function 0x00: vendor 0x11ab device 0x6081
 Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller

The data pool drives are split across two raidz1 vdevs of 7 drives each (one
group on one controller, one group on the other).  The host is running snv_111.

Yesterday we shut down the system for the aforementioned maintenance.  The
failing drive (c0t5d0) had been throwing thousands of write errors, so we wanted
to get it out.  I've heard that the MV8 cards do not support hot-swap, so we did
the swap with the system powered off.  Upon booting back up, the pool marked
c0t5d0 as UNAVAIL (expected, since it's a brand-new disk), but it also showed
checksum errors at the top-level vdev without any corresponding errors on any of
the individual disks:

  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     1
          raidz1    DEGRADED     0     0     3
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  UNAVAIL      0   289     0  cannot open
            c0t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t6d0  ONLINE       1     0     0

That's my first mystery: what do checksum errors at the top-level vdev mean
when none of the individual disks report any checksum errors?

We proceeded to resilver the new drive, and things rapidly fell apart.  After 7
hours, the resilver "finished" and the pool looked like this:

  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 7h6m with 740484 errors on Mon Jul 20 23:04:40 2009
config:

        NAME              STATE     READ WRITE CKSUM
        data              DEGRADED     0     0  723K
          raidz1          DEGRADED     0     0 1.42M
            c0t0d0        DEGRADED     0     0    90  too many errors
            c0t1d0        DEGRADED     0     0     0  too many errors
            c0t2d0        DEGRADED     0     0     1  too many errors
            c0t3d0        DEGRADED     0     0     2  too many errors
            c0t4d0        DEGRADED     0     0     0  too many errors
            replacing     DEGRADED     0     0 14.1M
              c0t5d0s0/o  FAULTED      0 32.4M     0  corrupted data
              c0t5d0      ONLINE       0     0     0  208G resilvered
            c0t6d0        DEGRADED     0     0     0  too many errors
          raidz1          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
            c1t2d0        ONLINE       0     0     0
            c1t3d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
            c1t6d0        ONLINE      18     0     0  6K resilvered

errors: 740484 data errors, use '-v' for a list


Note that c0t5d0 never shows as fully replaced, presumably because of all the
other errors.  What I can't figure out is why a single checksum error (or none
at all; see c0t4d0 above) would cause a device to be marked DEGRADED with "too
many errors".

FMA reported major faults on 7 devices, but I am not sure how to translate
something like "zfs://pool=data/vdev=7f98aad990665c5f" into a c-t-d device name.
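
My working assumption (and it's only a guess; I haven't verified the flags or
the mapping) is that the hex string after "vdev=" is the vdev GUID, which I
should be able to cross-reference against the GUIDs zdb prints for the pool
config (those appear to be decimal, so the hex value would need converting), or
against the vdev_guid/vdev_path fields in the underlying ereports.  Roughly:

  # pool configuration, including a guid line for each vdev (decimal, I think)
  zdb -C data

  # the ZFS ereports should carry vdev_guid and vdev_path in their payloads
  fmdump -eV | egrep 'vdev_(guid|path)'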

A subsequent reboot did not help; in fact another resilver of c0t5d0 kicked off
and behaved the same way.  This morning I found that yet another device has
failed in the other top-level vdev (c1t6d0, which had been throwing a small
number of read errors, as you can see above).

The output of 'iostat -xne' shows no s/w or h/w errors on any devices except 
c1t6d0.

Am I just screwed here, or is there any hope of recovering this pool?  I find it
very hard to believe that so many drives suffered simultaneous major failures
without showing any pre-failure signs.  The planned drive replacement didn't
disturb any internal parts; it was just a tray/sled swap, done without even
removing the server from the rack.

Thanks,
Eric

-- 
Eric Sproul
Lead Site Reliability Engineer
OmniTI Computer Consulting, Inc.
Web Applications & Internet Architectures
http://omniti.com
P: +1.443.325.1357 x207   F: +1.410.872.4911