So I've been working on solving a problem we noticed: when using
certain hot-pluggable busses (think SAS/SATA hotplug here), removing a
drive did not trigger any response from either FMA or ZFS *until*
something tried to use that device.  (Removing a drive can be thought
of as simulating a disk, bus, HBA, or cable failure.)

This means that if you have an idle pool, you'll not find out about
these failures until you do some I/O to the drive.  For a hot spare,
this may not occur until you actually do a scrub.  That's really
unfortunate.

Note that I think the "disk-monitor.so" FMA module may solve this
problem, but it seems to be tied to specific Oracle hardware containing
certain IPMI features that are not necessarily general.

So, I've come up with a solution, which involves the creation of a new
FMA module, and a fix for a bug in the ZFS FMRI scheme module.  I'd like
thoughts here.  (I'm happy to post the code as well; there is no reason
this can't be pushed upstream as far as I or my employer are concerned.)

The module is zfs-monitor.so; it runs at a configurable interval
(currently 10 seconds for my debugging).  What it does is parse the ZFS
configuration to identify all physical disks that are associated with
ZFS vdevs.  For each such device, if ZFS believes the vdev is healthy
(ONLINE, AVAIL, or even DEGRADED in the zpool status output, although
it uses libzfs directly to get this), it opens the underlying raw
device and attempts to read the first 512 bytes (one block) from the
unit.  If this works, the disk is presumed to be working, and we're
done.
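
For the curious, the probe itself is about as simple as it sounds.
Here's a rough sketch (the function name is just for illustration, the
/dev/rdsk raw path is assumed to be in hand already, and error
reporting back to the module is elided):

#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch of the per-device probe: open the raw device and read the
 * first 512-byte block.  Returns 0 if the device responds, -1 if
 * either the open() or the read() fails.
 */
static int
probe_disk(const char *rawpath)
{
	char buf[512];
	ssize_t n;
	int fd;

	if ((fd = open(rawpath, O_RDONLY)) < 0)
		return (-1);

	n = pread(fd, buf, sizeof (buf), 0);
	(void) close(fd);

	return (n == (ssize_t)sizeof (buf) ? 0 : -1);
}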

For units that fail either the open() or the read(), we use libzfs to
mark the vdev FAULTED (which impacts higher-level vdevs appropriately),
and we post an FMA ereport (so that the ZFS diagnosis and retire
modules can do their thing).
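
The faulting half is basically a single libzfs call.  Roughly this,
assuming a libzfs from recent builds where zpool_vdev_fault() takes an
aux argument, and with the pool handle and vdev guid already in hand
(the ereport posting isn't shown):

#include <libzfs.h>

/*
 * Sketch: mark the vdev FAULTED.  zhp comes from zpool_open() and
 * vdev_guid from the vdev's config nvlist (ZPOOL_CONFIG_GUID).
 * VDEV_AUX_EXTERNAL notes that the fault was imposed from outside
 * ZFS's own error accounting.  Posting the ereport to fmd happens
 * separately and is omitted here.
 */
static int
fault_vdev(zpool_handle_t *zhp, uint64_t vdev_guid)
{
	return (zpool_vdev_fault(zhp, vdev_guid, VDEV_AUX_EXTERNAL));
}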

Of course, one side effect of this change is that disks may be spun up
more frequently than they need to be, so it can have a negative impact
on power savings.  However, in theory, since we're only exchanging a
single block, and always the same block, that data *ought* to be in the
drive's cache.  (This has a drawback as well, though -- it means we
might not find errors on the spinning platters themselves.  But it's
still far better, since it catches the more common problem of a drive
that has gone completely off the bus, been removed, or been
accidentally disconnected.)

The one bug we had to fix in the ZFS FMRI scheme module was that it was
failing to identify hot spare devices associated with a zpool, so,
because of certain logic in the ZFS diagnosis module, nothing was
happening for those spares.
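
For background, the spares live in the pool config nvlist under a
separate ZPOOL_CONFIG_SPARES array rather than under
ZPOOL_CONFIG_CHILDREN, so a walker that only descends the children
array never sees them.  Something along these lines (a sketch, not the
actual scheme module code) picks them up:

#include <libzfs.h>
#include <libnvpair.h>

/*
 * Sketch: enumerate a pool's hot spares.  They hang off the root
 * vdev in their own ZPOOL_CONFIG_SPARES nvlist array, separate from
 * ZPOOL_CONFIG_CHILDREN, which is easy to miss.
 */
static void
walk_spares(zpool_handle_t *zhp)
{
	nvlist_t *config, *nvroot, **spares;
	uint_t i, nspares;
	char *path;

	config = zpool_get_config(zhp, NULL);
	if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
	    &nvroot) != 0)
		return;
	if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
	    &spares, &nspares) != 0)
		return;		/* pool has no spares */

	for (i = 0; i < nspares; i++) {
		if (nvlist_lookup_string(spares[i], ZPOOL_CONFIG_PATH,
		    &path) == 0) {
			/* treat "path" like any other leaf vdev ... */
		}
	}
}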

Anyway, I'm happy to share the code, and even go through the
request-sponsor process to push this upstream.  I would like the
opinions of the ZFS and FMA teams though... is the approach I'm using
sane, or have I missed some important design principle?  Certainly it
*seems* to work well on the systems I've tested, and we (Nexenta) think
that it fixes what appears to us to be a critical deficiency in the ZFS
error detection and handling.  But I'd like to hear other thoughts.

        - Garrett


