Hey folks, I'm not sure of the best place to post this, so I'm trying storage-discuss first.
I've seen a couple of problems posted now on the ZFS forums where Solaris has hung due to an unexpected event on the storage side. The first was on my own server, where it appears the Solaris Marvell driver can't cope with hotplug events and hangs the OS until the drive is reinserted. The second was an almost identical case, but for SATA drives plugged into a motherboard running in IDE mode. Again the hardware seems to cope with the removal fine, but Solaris hangs until the device is re-inserted.

While I've raised a bug report for the Marvell driver, what I'd like to ask is whether there's any way Solaris could be improved so that an unexpected event from *any* storage driver (boot device excepted) does not hang the OS. Any time the OS hangs like this it makes troubleshooting that much more difficult, and it has a real effect on the perception of ZFS as a reliable filesystem. Hanging the storage driver would be fine; just don't hang the rest of the OS.

It seems to me that one of the benefits of ZFS is working against it here. It's such a flexible system that it's being used with many, many types of devices, on varied (and even consumer grade) hardware. That means there's a whole host of drivers in use, and consequently a lot of scope for bugs in those drivers.

I know work is being done on FMA to handle all kinds of errors, but I'm not talking about that. It seems to me that FMA involves proper detection, reporting and handling of known problems, and it looks like it involves knowing in advance what failure modes are being managed. What I'm looking for is something much more basic: something that's just able to keep the OS running when it encounters unexpected or unhandled behaviour from storage drivers or hardware.

Is there any possibility of putting a layer of error checking code above storage drivers, in such a way that unexpected events can be trapped and handled gracefully without hanging the OS? (There's a rough sketch of the kind of thing I mean at the end of this message.) I know that ultimately any driver issues will need to be sorted individually, but I can't help feeling it would look better for ZFS and Solaris if the OS could keep running when these problems occur, and simply offline the device and report that an error has been found with the driver or hardware.

And finally, I suspect I don't need to say this, but if this could be done, it would also be good if storage drivers were isolated, so that a driver or hardware problem with one controller doesn't necessarily affect other controllers of the same type. The end result, if this could be achieved, would be that in my case, instead of the entire OS hanging and my server going offline when I pulled a drive, just the single controller with the bad driver would have been taken offline (and hopefully reactivated when I reinserted the drive, although that could be a whole new ball game). The driver problem would still have been very obvious, since a whole host of drives would have gone offline, but my ZFS pool (and server) would have stayed operational because the other controller would have kept running. There's also the possibility that this would work for the boot device too, provided you have a mirrored boot pool running on separate controllers.

To me that seems a much better way of handling storage device errors. I'd love to hear some feedback on this idea, and I do accept this is probably something that's much easier to say than to do.
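To make it a bit more concrete, here's a rough user-level sketch of the kind of thing I have in mind. This is not real Solaris kernel code, and driver_submit(), supervised_io() and device_online are all just made-up names standing in for a buggy driver and the supervising layer above it. The principle is simply: put a deadline on every request handed to the driver, and if the driver never completes it, offline that one device and fail the I/O with EIO instead of letting the caller (and the rest of the OS) hang.

    /* Sketch only: deadline on driver I/O, offline the device on a hang. */
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    struct io_request {
        pthread_mutex_t lock;
        pthread_cond_t  done_cv;
        int             done;       /* set by the driver on completion */
        int             result;     /* driver's completion status */
    };

    static int device_online = 1;   /* cleared when the driver misbehaves */

    /* Stand-in for a buggy driver: on a simulated hotplug it never completes. */
    static void *driver_submit(void *arg)
    {
        struct io_request *req = arg;
        int simulated_hotplug = 1;

        if (!simulated_hotplug) {
            pthread_mutex_lock(&req->lock);
            req->done = 1;
            req->result = 0;
            pthread_cond_signal(&req->done_cv);
            pthread_mutex_unlock(&req->lock);
        }
        /* else: hangs silently, like the Marvell driver did for me */
        return NULL;
    }

    /* Supervisory layer: submit the I/O but refuse to wait forever. */
    static int supervised_io(int timeout_sec)
    {
        struct io_request req = { PTHREAD_MUTEX_INITIALIZER,
                                  PTHREAD_COND_INITIALIZER, 0, 0 };
        pthread_t tid;
        struct timespec deadline;
        int rc = 0;

        if (!device_online)
            return EIO;                     /* already offlined, fail fast */

        pthread_create(&tid, NULL, driver_submit, &req);
        pthread_detach(tid);

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_sec;

        pthread_mutex_lock(&req.lock);
        while (!req.done && rc == 0)
            rc = pthread_cond_timedwait(&req.done_cv, &req.lock, &deadline);
        pthread_mutex_unlock(&req.lock);

        if (rc == ETIMEDOUT) {
            device_online = 0;              /* offline the device, not the OS */
            fprintf(stderr, "driver unresponsive: device offlined\n");
            return EIO;
        }
        return req.result;
    }

    int main(void)
    {
        int err = supervised_io(2);
        printf("I/O returned %d; system still running\n", err);
        return 0;
    }

Obviously the real work would be in the driver framework rather than a user program, and deciding when to offline (and when to bring the device back) is the hard part, but hopefully it shows what I mean by trapping the failure above the driver rather than waiting for every driver to be fixed.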
Ross
