On 25-Nov-08, at 5:10 AM, Ross Smith wrote:

> Hey Jeff,
>
> Good to hear there's work going on to address this.
>
> What did you guys think of my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's waiting for FMA or the driver to
> fault the drive?
> ...
>
> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay
>
> The first sets whether to use this feature, and configures the maximum
> time ZFS will wait for a response from a device before putting it in a
> "waiting" status.


The shortcomings of timeouts have been discussed on this list before.  
How do you tell the difference between a drive that is dead and a  
path that is just highly loaded?

I seem to recall it being argued strongly here in the past that making
decisions based on a timeout alone can provoke various undesirable
cascade effects.

>   The second would be optional and is the maximum
> time ZFS will wait before faulting a device (at which point it's
> replaced by a hot spare).
>
> The reason I think this will work well with the FMA work is that you
> can implement this now and have a real improvement in ZFS
> availability.  Then, as the other work brings better modeling of
> drive timeouts, the parameters can either be removed or set
> automatically by ZFS.
> ... it should be possible for ZFS to read from or
> write to other devices while it's waiting for an 'official' result
> from any one faulty component.

Sounds good - devil, meet details, etc.
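
That said, the two-threshold behavior Ross is describing is easy enough
to state precisely. Here is a rough sketch of how I read it (Python,
purely illustrative: the Device class, the state names and the numeric
values are mine; only the two tunable names come from Ross's mail):

    ONLINE, WAITING, FAULTED = "online", "waiting", "faulted"

    TIMEOUT    = 3.0    # zfs-auto-device-timeout: mark the device "waiting"
    FAIL_DELAY = 60.0   # zfs-auto-device-timeout-fail-delay: fault it

    class Device:
        def __init__(self, name):
            self.name = name
            self.state = ONLINE
            self.oldest_pending = None   # start time of oldest outstanding I/O

        def io_issued(self, now):
            if self.oldest_pending is None:
                self.oldest_pending = now

        def io_completed(self):
            self.oldest_pending = None
            self.state = ONLINE          # any response clears "waiting"

        def check(self, now):
            """Return the state the device should be in at time 'now'."""
            if self.oldest_pending is not None:
                stalled = now - self.oldest_pending
                if stalled >= FAIL_DELAY:
                    self.state = FAULTED    # FMA / hot-spare logic takes over
                elif stalled >= TIMEOUT:
                    self.state = WAITING    # pool keeps serving I/O from the
                                            # other devices in the meantime
            return self.state

    d = Device("c1t0d0")
    d.io_issued(now=0.0)
    print(d.check(now=4.0))     # -> "waiting"
    print(d.check(now=70.0))    # -> "faulted"

The hard part, as Jeff notes below, isn't expressing that; it's choosing
the two values so they don't misfire under the kind of load described
above.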

--Toby

>
> Ross
>
>
> On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick  
> <[EMAIL PROTECTED]> wrote:
>> I think we (the ZFS team) all generally agree with you. ...
>>
>> The reason this is all so much harder than it sounds is that we're
>> trying to provide increasingly optimal behavior given a collection of
>> devices whose failure modes are largely ill-defined.  (Is the disk
>> dead or just slow?  Gone or just temporarily disconnected? ...
>>
>> Jeff
>>
>> On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
>>> But that's exactly the problem, Richard: AFAIK.
>>>
>>> Can you state absolutely, categorically, that there is no failure
>>> mode out there (caused by hardware faults or bad drivers) that
>>> will lock a drive up for hours?  You can't, obviously, which is
>>> why we keep saying that ZFS should have this kind of timeout
>>> feature.
>>> ...
