So I've been working on solving a problem we noticed: when using certain hot-pluggable buses (think SAS/SATA hotplug here), removing a drive did not trigger any response from either FMA or ZFS *until* something tried to use that device. (Removing a drive this way can be thought of as simulating a disk, bus, HBA, or cable failure.)
This means that if you have an idle pool, you'll not find out about these failures until you do some I/O to the drive. For a hot spare, this may not happen until you actually do a scrub. That's really unfortunate. Note that I think the "disk-monitor.so" FMA module may solve this problem, but it seems to be tied to specific Oracle hardware containing certain IPMI features that are not necessarily general.

So, I've come up with a solution, which involves the creation of a new FMA module and a fix for a bug in the ZFS FMRI scheme module. I'd like thoughts here. (I'm happy to post the code as well; there is no reason this can't be pushed upstream as far as I or my employer are concerned.)

The new module is zfs-monitor.so, and it runs at a configurable interval (currently 10 seconds for my debugging). It parses the ZFS configuration to identify all physical disks associated with ZFS vdevs. For each such device, if ZFS believes the vdev is healthy (ONLINE, AVAIL, or even DEGRADED in the zpool status output, though it uses libzfs directly to get this), it opens the underlying raw device and attempts to read the first 512-byte block from the unit. If this works, the disk is presumed to be working, and we're done. For units that fail either the open() or the read(), we use libzfs to mark the vdev FAULTED (which will impact higher-level vdevs appropriately), and we post an FMA ereport so that the ZFS diagnosis and retire modules can do their thing. (A rough sketch of the probe loop is at the end of this message.)

Of course, one side effect of this change is that disks may be spun up more often than they need to be, so it can have a negative impact on power savings. However, in theory, since we're only exchanging a single block, and always the same block, that data *ought* to be in cache. (This has a drawback as well: it means we might not find errors on the spinning platters themselves. But it's still far better, since it catches the more common problem of a drive that has gone completely off the bus, been removed, or been accidentally disconnected.)

The one bug in the ZFS FMRI scheme module that we had to fix was that it was failing to identify hot spare devices associated with a zpool, so nothing was happening for those spares, because of certain logic in the ZFS diagnosis module.

Anyway, I'm happy to share the code, and even go through the request-sponsor process to push this upstream. I would like the opinions of the ZFS and FMA teams, though: is the approach I'm using sane, or have I missed some important design principle? It certainly *seems* to work well on the systems I've tested, and we (Nexenta) think it fixes what appears to us to be a critical deficiency in ZFS error detection and handling. But I'd like to hear other thoughts.

- Garrett
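P.S. In case it helps the discussion, here's a very rough sketch of the kind of probe loop I'm describing. It is not the actual zfs-monitor.so source: it's a standalone program rather than an fmd plugin, the check_pool()/check_vdev() names are just for illustration, and the raw-device mapping and error handling are simplified.

/*
 * Illustrative sketch only.  Walks each imported pool's vdev tree and
 * tries to read the first 512-byte block of every leaf disk through its
 * raw device.  The real module runs from a timer at the configured
 * interval, skips vdevs ZFS already considers faulted, and on a failed
 * probe marks the vdev FAULTED via libzfs and posts an FMA ereport.
 */
#include <libzfs.h>
#include <sys/param.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static void
check_vdev(zpool_handle_t *zhp, nvlist_t *nv)
{
	nvlist_t **child;
	uint_t c, children;
	char *type, *path;
	char raw[MAXPATHLEN];
	char block[512];
	int fd;

	/* Recurse through interior vdevs (mirrors, raidz, etc.). */
	if (nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
	    &child, &children) == 0) {
		for (c = 0; c < children; c++)
			check_vdev(zhp, child[c]);
		return;
	}

	/* Only probe leaf vdevs that are physical disks with a device path. */
	if (nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type) != 0 ||
	    strcmp(type, VDEV_TYPE_DISK) != 0 ||
	    nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &path) != 0)
		return;

	/* Use the raw device; this /dev/dsk -> /dev/rdsk mapping is simplified. */
	if (strncmp(path, "/dev/dsk/", 9) == 0)
		(void) snprintf(raw, sizeof (raw), "/dev/rdsk/%s", path + 9);
	else
		(void) strlcpy(raw, path, sizeof (raw));

	if ((fd = open(raw, O_RDONLY | O_NDELAY)) < 0 ||
	    pread(fd, block, sizeof (block), 0) != sizeof (block)) {
		/*
		 * Probe failed: this is where the real module marks the
		 * vdev FAULTED through libzfs and posts an ereport so the
		 * ZFS diagnosis and retire modules can take over.
		 */
		(void) printf("pool %s: probe of %s failed\n",
		    zpool_get_name(zhp), raw);
	}
	if (fd >= 0)
		(void) close(fd);
}

/* ARGSUSED */
static int
check_pool(zpool_handle_t *zhp, void *unused)
{
	nvlist_t *config, *nvroot, **spares;
	uint_t i, nspares;

	if ((config = zpool_get_config(zhp, NULL)) != NULL &&
	    nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
	    &nvroot) == 0) {
		check_vdev(zhp, nvroot);

		/* Hot spares live in their own array under the vdev tree. */
		if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
		    &spares, &nspares) == 0) {
			for (i = 0; i < nspares; i++)
				check_vdev(zhp, spares[i]);
		}
	}

	zpool_close(zhp);
	return (0);
}

int
main(void)
{
	libzfs_handle_t *hdl = libzfs_init();

	if (hdl == NULL)
		return (1);
	(void) zpool_iter(hdl, check_pool, NULL);
	libzfs_fini(hdl);
	return (0);
}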