Since somebody else has just posted about their entire system locking up when 
pulling a drive, I thought I'd raise this for discussion.

I think Ralf made a very good point in the other thread.  ZFS can guarantee 
data integrity, but it can't guarantee data availability.  The problem is, 
the way ZFS is marketed, people expect it to do both.

This turned into a longer post than expected, so I'll start with what I'm 
asking for, and then attempt to explain my thinking.  I'm essentially asking 
for two features to improve the availability of ZFS pools:

- Isolation of storage drivers so that buggy drivers do not bring down the OS.

- ZFS timeouts to improve pool availability when no timely response is received 
from storage drivers.

And my reason for asking for these is that there are now many, many posts on 
here about people experiencing either a total system lockup or a ZFS lockup 
after removing a hot-swap drive.  While some of them are using consumer 
hardware, others have reported problems with server-grade kit that definitely 
should be able to handle these errors:

Aug 2008:  AMD SB600 - system hang
 - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008:  Supermicro SAT2-MV8 - system hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008:  Sun hardware - ZFS hang
 - http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008:  iSCSI - ZFS hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007:  Supermicro SAT2-MV8 - system hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007:  Fibre channel
 - http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc

Now, while the root cause of each of these may be slightly different, I feel it 
would still be good to address this if possible, as it's going to affect the 
perception of ZFS as a reliable system.

The common factor in all of these is that either the Solaris driver hangs and 
locks the OS, or ZFS hangs and locks the pool.  Most of these are for hardware 
that should handle these failures fine (mine occurred on hardware that 
definitely works fine under Windows), so I'm wondering:  is there anything that 
can be done to prevent either type of lockup in these situations?

Firstly, for the OS:  if a storage component (hardware or driver) fails for a 
non-essential part of the system, the entire OS should not hang.  I appreciate 
there isn't a lot you can do if the OS is using the same driver as its 
storage, but certainly in some of the cases above the OS and the data are 
using different drivers, and I expect more examples of that could be found with 
a bit of work.  Is there any way storage drivers could be isolated such that 
the OS (and hence ZFS) can report a problem with that particular driver without 
hanging the entire system?

Please note:  I know work is being done on FMA to handle all kinds of faults, 
and I'm not talking about that.  It seems to me that FMA is about proper 
detection and reporting of faults, which requires knowing in advance what the 
problems are and how to report them.  What I'm looking for is something much 
simpler:  something that's able to keep the OS running when it encounters 
unexpected or unhandled behaviour from storage drivers or hardware.

It seems to me that one of the benefits of ZFS is working against it here.  
It's such a flexible system that it's being used with many, many types of 
devices, which means there is a whole host of drivers in use and a lot of 
scope for bugs in those drivers.  I know that ultimately any driver issues will 
need to be sorted individually, but what I'm wondering is whether there's any 
possibility of putting some error-checking code at a layer above the drivers, 
in such a way that it's able to trap major problems without hanging the OS?  
i.e. update ZFS/Solaris so they can handle storage-layer bugs gracefully 
without taking down the entire system.

My second suggestion is to ask whether ZFS can be made to handle unexpected 
events more gracefully.  In the past I've suggested that ZFS have a separate 
timeout so that a redundant pool can continue working even if one device is not 
responding, and I really think that would be worthwhile.  My idea is to have a 
"WAITING" status flag for drives, so that if one isn't responding quickly, ZFS 
can flag it as "WAITING" and attempt to read or write the same data from 
elsewhere in the pool.  That would work alongside the existing failure modes, 
and would allow ZFS to handle hung drivers much more smoothly, preventing 
redundant pools from hanging when a single drive fails.
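
To make the idea concrete, here's a rough sketch in C of how such a timeout 
might behave.  It's purely illustrative and not real ZFS code (all of the 
names, such as vdev_t, mirror_read and READ_DEADLINE_MS, are made up, and the 
"devices" are just simulated latencies), but it shows the behaviour I'm after: 
if one side of a mirror misses its deadline it gets marked WAITING and the 
read is satisfied from the other side, rather than the whole pool blocking.

/* Illustrative sketch only; not actual ZFS code.  Every name here is
 * hypothetical, and the devices are simulated by a fixed latency. */
#include <stdio.h>
#include <stdbool.h>

typedef enum { VDEV_ONLINE, VDEV_WAITING, VDEV_FAULTED } vdev_state_t;

typedef struct {
    const char   *name;
    vdev_state_t  state;
    int           latency_ms;           /* simulated response time */
} vdev_t;

#define READ_DEADLINE_MS 2000           /* hypothetical per-I/O deadline */

/* Simulate a read: only succeed if the device answers within the deadline. */
static bool vdev_read(vdev_t *vd, int block)
{
    if (vd->state == VDEV_FAULTED)
        return false;
    if (vd->latency_ms > READ_DEADLINE_MS) {
        vd->state = VDEV_WAITING;       /* don't block the pool on it... */
        printf("%s: no response within %d ms, marking WAITING\n",
               vd->name, READ_DEADLINE_MS);
        return false;                   /* ...try another copy instead */
    }
    printf("%s: block %d read OK\n", vd->name, block);
    return true;
}

/* Mirror read: walk the children until one answers in time. */
static bool mirror_read(vdev_t *children, int nchildren, int block)
{
    for (int i = 0; i < nchildren; i++)
        if (vdev_read(&children[i], block))
            return true;
    return false;   /* only now does the whole I/O fail */
}

int main(void)
{
    vdev_t mirror[] = {
        { "disk0", VDEV_ONLINE, 60000 },    /* hung: never answers in time */
        { "disk1", VDEV_ONLINE, 10 },       /* healthy */
    };

    if (!mirror_read(mirror, 2, 12345))
        printf("pool: I/O failed on all copies\n");
    return 0;
}

A FAULTED drive is already bypassed today; the WAITING state would simply give 
the same treatment to a drive that hasn't failed outright but has stopped 
answering, and it could be cleared again if the drive starts responding.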

The ZFS change in particular feels appropriate to me.  ZFS already uses 
checksumming because it doesn't trust drivers or hardware to always return the 
correct data.  Yet ZFS then trusts those same drivers and hardware absolutely 
when it comes to the availability of the pool.

I believe ZFS should apply the same tough standards to pool availability as it 
does to data integrity.  A bad checksum makes ZFS read the data from elsewhere; 
why shouldn't a timeout do the same thing?

Ross
 
 