Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion.
I think Ralf made a very good point in the other thread: ZFS can guarantee data integrity, but what it can't do is guarantee data availability. The problem is that, the way ZFS is marketed, people expect it to do just that.

This turned into a longer post than expected, so I'll start with what I'm asking for and then attempt to explain my thinking. I'm essentially asking for two features to improve the availability of ZFS pools:

- Isolation of storage drivers, so that buggy drivers do not bring down the OS.
- ZFS timeouts, to improve pool availability when no timely response is received from storage drivers.

My reason for asking for these is that there are now many, many posts on here about people experiencing either a total system lockup or a ZFS lockup after removing a hot-swap drive, and while some of them are using consumer hardware, others have reported problems with server-grade kit that definitely should be able to handle these errors:

Aug 2008: AMD SB600 - System hang - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008: Supermicro SAT2-MV8 - System hang - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008: Sun hardware - ZFS hang - http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008: iSCSI - ZFS hang - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007: Supermicro SAT2-MV8 - System hang - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007: Fibre channel - http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc.

While the root cause of each of these may be slightly different, I feel it would still be good to address this if possible, as it's going to affect the perception of ZFS as a reliable system. The common factor in all of them is that either the Solaris driver hangs and locks the OS, or ZFS hangs and locks the pool. Most of these are on hardware that should handle such failures fine (mine occurred on hardware that definitely works fine under Windows), so I'm wondering: is there anything that can be done to prevent either type of lockup in these situations?

Firstly, for the OS: if a storage component (hardware or driver) fails for a non-essential part of the system, the entire OS should not hang. I appreciate there isn't a lot you can do if the OS is using the same driver as its storage, but in some of the cases above the OS and the data are using different drivers, and I expect more examples of that could be found with a bit of work. Is there any way storage drivers could be isolated, such that the OS (and hence ZFS) can report a problem with that particular driver without hanging the entire system?

Please note: I know work is being done on FMA to handle all kinds of failures, and I'm not talking about that. It seems to me that FMA involves proper detection and reporting of faults, which means knowing in advance what the problems are and how to report them. What I'm looking for is something much simpler: something that's able to keep the OS running when it encounters unexpected or unhandled behaviour from storage drivers or hardware.

It seems to me that one of the benefits of ZFS is working against it here. It's such a flexible system that it's being used with many, many types of devices, and that means there are a whole host of drivers in use, and a lot of scope for bugs in those drivers. I know that ultimately any driver issues will need to be sorted individually, but what I'm wondering is whether there's any possibility of putting some error-checking code at a layer above the drivers, in such a way that it can trap major problems without hanging the OS - i.e. update ZFS/Solaris so they can handle storage-layer bugs gracefully without downing the entire system. A rough sketch of what I mean is below.
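To make that concrete, here is a minimal userland sketch of the kind of guard I have in mind. It's purely illustrative, not Solaris or ZFS code, and simulate_driver_read() is just a made-up stand-in for a driver call that never returns: run the blocking call on a worker thread, wait for it with a deadline, and if the deadline passes, report the device as unresponsive rather than hanging with it.

/*
 * Illustrative sketch only -- not Solaris or ZFS code.  Shows the idea
 * of bounding a blocking driver call with a watchdog timeout so the
 * caller can report a fault instead of hanging.  simulate_driver_read()
 * is a hypothetical stand-in for a driver call that never returns.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

typedef struct io_request {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    int             done;    /* set by the worker when the call returns */
    int             result;  /* driver return code */
} io_request_t;

/* Pretend driver call that has hung and will not return in time. */
static int
simulate_driver_read(void)
{
    sleep(30);
    return (0);
}

static void *
io_worker(void *arg)
{
    io_request_t *req = arg;
    int r = simulate_driver_read();

    pthread_mutex_lock(&req->lock);
    req->result = r;
    req->done = 1;
    pthread_cond_signal(&req->done_cv);
    pthread_mutex_unlock(&req->lock);
    return (NULL);
}

/*
 * Issue the call on a worker thread and wait at most timeout_sec for it.
 * Returns the driver's result, or ETIMEDOUT if no timely answer arrived.
 */
static int
guarded_io(int timeout_sec)
{
    io_request_t *req = calloc(1, sizeof (*req));
    pthread_t tid;
    struct timespec deadline;
    int rc = 0;

    pthread_mutex_init(&req->lock, NULL);
    pthread_cond_init(&req->done_cv, NULL);
    pthread_create(&tid, NULL, io_worker, req);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&req->lock);
    while (!req->done && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&req->done_cv, &req->lock, &deadline);
    pthread_mutex_unlock(&req->lock);

    if (rc == ETIMEDOUT) {
        /*
         * Don't hang: report the device as unresponsive.  The stuck
         * worker is detached and req is deliberately leaked so the
         * worker still has somewhere safe to write if it ever wakes up.
         */
        printf("no response in %d s; marking device unresponsive\n",
            timeout_sec);
        pthread_detach(tid);
        return (ETIMEDOUT);
    }

    pthread_join(tid, NULL);
    rc = req->result;
    free(req);
    return (rc);
}

int
main(void)
{
    return (guarded_io(5) == ETIMEDOUT ? 1 : 0);
}

Obviously inside the kernel it wouldn't look anything like this, but the principle is what I'm after: no single call into a storage driver should be able to block everything else indefinitely.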
My second suggestion is to ask whether ZFS can be made to handle unexpected events more gracefully. In the past I've suggested that ZFS have a separate timeout, so that a redundant pool can continue working even if one device is not responding, and I really think that would be worthwhile. My idea is to have a "WAITING" status flag for drives: if one isn't responding quickly, ZFS flags it as "WAITING" and attempts to read or write the same data elsewhere in the pool. That would work alongside the existing failure modes, and would allow ZFS to handle hung drivers much more smoothly, preventing redundant pools from hanging when a single drive fails. (There's a rough sketch of the sort of logic I'm imagining in the P.S. below.)

The ZFS change feels particularly appropriate to me. ZFS already uses checksumming because it doesn't trust drivers or hardware to always return the correct data, yet it then trusts those same drivers and hardware absolutely when it comes to the availability of the pool. I believe ZFS should apply the same tough standards to pool availability as it does to data integrity. A bad checksum makes ZFS read the data from elsewhere, so why shouldn't a timeout do the same thing?

Ross
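P.S. To make the WAITING idea a bit more concrete, here is a very rough sketch of the sort of logic I'm imagining. It's purely illustrative and looks nothing like the real vdev code; CHILD_TIMEOUT_MS and the child_read()/mirror_read() helpers are invented for the example. The point is simply: if a mirror child doesn't answer within a soft timeout, flag it WAITING and fetch the data from another child, just as a bad checksum already triggers a read from elsewhere.

/*
 * Illustrative pseudocode only -- nothing like the real ZFS vdev code.
 * Sketches the "WAITING" idea: if one mirror child doesn't answer a read
 * within a soft timeout, flag it and satisfy the read from another child
 * instead of blocking the whole pool.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHILD_TIMEOUT_MS 2000   /* hypothetical per-device soft timeout */

typedef enum { CH_ONLINE, CH_WAITING, CH_FAULTED } child_state_t;

typedef struct child {
    const char     *name;
    child_state_t   state;
} child_t;

/*
 * Stand-in for issuing a read to one child and waiting up to 'ms' for it.
 * Here, child "c1" pretends to be a hung device that never answers, so
 * its read "times out" and the function returns false.
 */
static bool
child_read(child_t *c, int ms, char *buf, size_t len)
{
    (void) ms; (void) buf; (void) len;
    return (strcmp(c->name, "c1") != 0);
}

static bool
mirror_read(child_t *children, int nchildren, char *buf, size_t len)
{
    for (int i = 0; i < nchildren; i++) {
        child_t *c = &children[i];

        if (c->state != CH_ONLINE)
            continue;               /* skip devices already flagged */

        if (child_read(c, CHILD_TIMEOUT_MS, buf, len))
            return (true);          /* got the data from this child */

        /*
         * No timely answer.  Don't hang: flag the child as WAITING
         * (not FAULTED, so the flag can be cleared if it recovers)
         * and try the next child.
         */
        c->state = CH_WAITING;
        printf("%s: no response in %d ms, marked WAITING, retrying "
            "the read on another child\n", c->name, CHILD_TIMEOUT_MS);
    }
    return (false);                 /* no child could supply the data */
}

int
main(void)
{
    child_t kids[] = { { "c1", CH_ONLINE }, { "c2", CH_ONLINE } };
    char buf[512];

    if (mirror_read(kids, 2, buf, sizeof (buf)))
        printf("read satisfied despite an unresponsive device\n");
    return (0);
}

The WAITING state is deliberately provisional: if the device eventually responds, the flag could simply be cleared, so it would sit alongside the existing failure handling rather than replace it.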