Hey folks,

I'm not sure of the best place to post this, so I'm trying storage-discuss first.

I've seen a couple of problems posted now on the ZFS forums where Solaris has 
hung due to an unexpected event on the storage side.  The first of these was on 
my own server, where it appears the Solaris Marvell driver can't cope with 
hotplug events and hangs the OS until the drive is reinserted.

The second was an almost identical case, but for SATA drives plugged into a 
motherboard running in IDE mode.  Again it seems that the hardware copes with 
the removal fine, but Solaris hangs until the device is re-inserted.

While I've raised a bug report for the Marvell driver, what I'd like to ask is 
whether there's any way Solaris could be improved so that an unexpected event 
from *any* storage driver (boot device excepted) does not hang the OS.  Any 
time the OS hangs like this it makes troubleshooting that much more difficult, 
and it has a real effect on the perception of ZFS as a reliable filesystem.  
Hanging the storage driver would be fine, just don't hang the rest of the OS.

It seems to me that one of the benefits of ZFS is working against it here.  
It's such a flexible system it's being used for many, many types of devices, on 
varied (and even consumer grade) hardware, and that means there are a whole 
host of drivers being used, and consequently a lot of scope for bugs in those 
drivers. 

I know work is being done on FMA to handle all kinds of errors, but I'm not 
talking about that. It seems to me that FMA involves proper detection, 
reporting and handling of known problems, and it looks like it involves knowing 
in advance what failure modes are being managed.  What I'm looking for is something 
much more basic, something that's just able to keep the OS running when it 
encounters unexpected or unhandled behaviour from storage drivers or hardware.

Is there any possibility of putting a layer of error checking code above 
storage drivers in such a way that unexpected events can be trapped and handled 
gracefully without hanging the OS?

I know that ultimately any driver issues will need to be sorted individually, 
but I can't help feeling it would look better for ZFS and Solaris if the OS could 
keep running when these problems occur, and simply offline the device and 
report that an error has been found with the driver or hardware.
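
To make the idea a bit more concrete, here is a very rough user-space sketch in 
Python.  It is only an illustration of the behaviour I'm describing, not how 
Solaris drivers or FMA actually work, and every name in it (GuardedDevice, 
DeviceOffline, the timeout value) is made up for the example.  Each driver call 
runs with a deadline; if it hangs or fails unexpectedly, that one device is 
marked offline and the caller gets an error back instead of blocking forever.

```python
import threading


class DeviceOffline(Exception):
    """Raised to the caller instead of blocking on a faulted device."""


class GuardedDevice:
    """Hypothetical wrapper that runs each driver call with a deadline."""

    def __init__(self, name, timeout_s=1.0):
        self.name = name
        self.timeout_s = timeout_s
        self.online = True
        self.faults = []  # human-readable reasons, for reporting

    def call(self, driver_fn, *args):
        if not self.online:
            raise DeviceOffline(self.name)

        result = {}

        def worker():
            try:
                result["value"] = driver_fn(*args)
            except Exception as exc:  # unexpected driver behaviour
                result["error"] = exc

        # Daemon thread: a call that never returns must not pin the process.
        t = threading.Thread(target=worker, daemon=True)
        t.start()
        t.join(self.timeout_s)

        if t.is_alive():
            self._offline("driver call timed out")
        elif "error" in result:
            self._offline("driver raised %r" % (result["error"],))
        else:
            return result["value"]
        raise DeviceOffline(self.name)

    def _offline(self, reason):
        # Offline only this device; other GuardedDevice objects are untouched.
        self.online = False
        self.faults.append(reason)
```

With one wrapper per device (or per controller), a call that hangs on one of 
them leaves the others usable, which is the isolation I'm after.  A real 
implementation would obviously have to live in the kernel and deal with the 
stuck call itself, which this sketch simply abandons.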

And finally, I suspect I don't need to say this, but if this could be done, it 
would also be good if storage drivers were isolated so that a driver or 
hardware problem with one controller doesn't necessarily affect other 
controllers of the same type.  In my case, that would have meant that when I 
pulled a drive, instead of the entire OS hanging and my server going offline, 
just a single controller would have been taken offline due to the bad driver 
(and hopefully reactivated when I reinserted the drive, although that could be 
a whole new ball game).

For me it would have meant that the driver problem was very obvious as a whole 
host of drives would have gone offline, but it would have kept my ZFS pool (and 
server) operational as the other controller would have remained running.  
There's also the possibility that this would work for the boot device too, 
provided you have a mirrored boot pool running on separate controllers.

To me that seems a much better way of handling storage device errors.  I'd love 
to hear some feedback on this idea, and I do accept this is probably something 
that's much easier to say than to do.

Ross
--
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
