On 28.05.12 00:35, Richard Elling wrote:


On May 27, 2012, at 12:52 PM, Stephan Budach wrote:

Hi,

today I issued a scrub on one of my zpools, and after some time I noticed that one of the vdevs had become degraded due to a drive having cksum errors. The spare kicked in and the drive was resilvered, but why does the spare drive now also show almost the same number of cksum errors as the degraded drive?

The answer is not available via zpool status. You will need to look at the FMA diagnosis:
fmadm faulty

More clues can be found in the FMA error reports:
fmdump -eV
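As a sketch of how these fit together (the ereport class filter below is an assumption; the exact class name can differ between releases):

```shell
# Show the FMA diagnosis engine's current list of faults
fmadm faulty

# Dump every error report in full detail
fmdump -eV

# Narrow the error log to ZFS checksum ereports only
# (class name is an assumption -- verify with an unfiltered fmdump -e first)
fmdump -eV -c ereport.fs.zfs.checksum
```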

Thanks - I had taken a look at the FMA diagnosis, but hadn't shared it in my first post. FMA shows only one instance, from yesterday:

root@solaris11c:~# fmadm faulty |less
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Mai 27 10:24:24 f0601f5f-cb8b-67bc-bd63-e71948ea8428  ZFS-8000-GH    Major

Host        : solaris11c
Platform    : SUN-FIRE-X4170-M2-SERVER  Chassis_id  : 1046FMM0NH
Product_sn  : 1046FMM0NH

Fault class : fault.fs.zfs.vdev.checksum
Affects     : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service
Problem in  : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Mär 15 16:34:52 5ad04cb0-af03-e84b-cd8a-a07aff7aec2c  PCIEX-8000-J5  Major

I take this to be the instance from when the vdev initially became degraded; there have been no further errors since, while the resilver took place, so I tend to think that the spare drive is indeed okay.

Thanks,
budy



 -- richard



root@solaris11c:~# zpool status obelixData
  pool: obelixData
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
scan: resilvered 1,12T in 10h50m with 0 errors on Sun May 27 21:15:32 2012
config:

        NAME                         STATE     READ WRITE CKSUM
        obelixData                   DEGRADED     0     0     0
          mirror-0                   ONLINE       0     0     0
            c9t2100001378AC02DDd1    ONLINE       0     0     0
            c9t2100001378AC02F4d1    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c9t2100001378AC02F4d0    ONLINE       0     0     0
            c9t2100001378AC02DDd0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c9t2100001378AC02DDd2    ONLINE       0     0     0
            c9t2100001378AC02F4d2    ONLINE       0     0     0
          mirror-3                   ONLINE       0     0     0
            c9t2100001378AC02DDd3    ONLINE       0     0     0
            c9t2100001378AC02F4d3    ONLINE       0     0     0
          mirror-4                   ONLINE       0     0     0
            c9t2100001378AC02DDd5    ONLINE       0     0     0
            c9t2100001378AC02F4d5    ONLINE       0     0     0
          mirror-5                   ONLINE       0     0     0
            c9t2100001378AC02DDd4    ONLINE       0     0     0
            c9t2100001378AC02F4d4    ONLINE       0     0     0
          mirror-6                   ONLINE       0     0     0
            c9t2100001378AC02DDd6    ONLINE       0     0     0
            c9t2100001378AC02F4d6    ONLINE       0     0     0
          mirror-7                   ONLINE       0     0     0
            c9t2100001378AC02DDd7    ONLINE       0     0     0
            c9t2100001378AC02F4d7    ONLINE       0     0     0
          mirror-8                   ONLINE       0     0     0
            c9t2100001378AC02DDd8    ONLINE       0     0     0
            c9t2100001378AC02F4d8    ONLINE       0     0     0
          mirror-9                   DEGRADED     0     0     0
            c9t2100001378AC02DDd9    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0    10
              c9t2100001378AC02F4d9  DEGRADED     0     0    22  too many errors
              c9t2100001378AC02BFd1  ONLINE       0     0    23
          mirror-10                  ONLINE       0     0     0
            c9t2100001378AC02DDd10   ONLINE       0     0     0
            c9t2100001378AC02F4d10   ONLINE       0     0     0
          mirror-11                  ONLINE       0     0     0
            c9t2100001378AC02DDd11   ONLINE       0     0     0
            c9t2100001378AC02F4d11   ONLINE       0     0     0
          mirror-12                  ONLINE       0     0     0
            c9t2100001378AC02DDd12   ONLINE       0     0     0
            c9t2100001378AC02F4d12   ONLINE       0     0     0
          mirror-13                  ONLINE       0     0     0
            c9t2100001378AC02DDd13   ONLINE       0     0     0
            c9t2100001378AC02F4d13   ONLINE       0     0     0
          mirror-14                  ONLINE       0     0     0
            c9t2100001378AC02DDd14   ONLINE       0     0     0
            c9t2100001378AC02F4d14   ONLINE       0     0     0
        logs
          mirror-15                  ONLINE       0     0     0
            c9t2100001378AC02D9d0    ONLINE       0     0     0
            c9t2100001378AC02BFd0    ONLINE       0     0     0
        spares
          c9t2100001378AC02BFd1      INUSE     currently in use


What would be the best way to proceed? The drive c9t2100001378AC02BFd1 is the spare; it is tagged as ONLINE, but it shows 23 cksum errors, while the drive that became degraded shows only 22.

Would one now first run another scrub and detach the degraded drive afterwards, or detach the degraded drive immediately and run a scrub afterwards?
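One common sequence, assuming the old drive is to be retired and the spare kept, would look roughly like this (a sketch, not authoritative advice; the device name is taken from the zpool status output above):

```shell
# Detach the degraded drive; the in-use spare then becomes a
# permanent member of mirror-9.
zpool detach obelixData c9t2100001378AC02F4d9

# Clear the pool's error counters so that any new errors on the
# spare stand out.
zpool clear obelixData

# Scrub to verify every block is readable on the remaining devices,
# then check the results.
zpool scrub obelixData
zpool status obelixData
```

Running the scrub after the detach means the cksum counters start from zero, so a healthy spare should finish the scrub with no new errors.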

Thanks,
budy



_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
