On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote:
> 
> Personally, if a SATA disk wasn't responding to any requests after 2
> seconds I really don't care if an error has been detected, as far as
> I'm concerned that disk is faulty.

Unless you have power management enabled, or there's a bad region of the
disk, or the bus was reset, or...

> I do have a question though.  From what you're saying, the response
> time can't be consistent across all hardware, so you're once again at
> the mercy of the storage drivers.  Do you know how long B_FAILFAST
> takes to return a response on iSCSI?  If that's over 1-2
> seconds I would still consider that too slow I'm afraid.

B_FAILFAST's main function is in how it handles retryable errors.  If the
drive responds with a retryable error (or any error at all), the driver
won't attempt to retry the command.  If you have a device that is taking
arbitrarily long to respond to successful commands (or to notice that a
command won't succeed), B_FAILFAST won't help you.
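For what it's worth, here's a rough, purely illustrative sketch of how a
kernel consumer might issue a read with B_FAILFAST set (the function is
made up and the buf setup/error handling simplified -- this is not lifted
from vdev_disk.c).  The point is that the flag only changes retry
behavior, not how long a hung command can take:

/*
 * Illustrative sketch only (not the actual vdev_disk.c code): issuing a
 * read with B_FAILFAST set.  The flag tells the target driver not to keep
 * retrying a command that has already failed with a retryable error; it
 * does not impose any timeout on a command that simply never comes back.
 */
#include <sys/types.h>
#include <sys/buf.h>
#include <sys/kmem.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static int
issue_failfast_read(dev_t dev, caddr_t data, size_t size, daddr_t blkno)
{
	struct buf *bp = getrbuf(KM_SLEEP);	/* allocate a raw buf */
	int error;

	bp->b_flags = B_READ | B_BUSY | B_FAILFAST;
	bp->b_un.b_addr = data;
	bp->b_bcount = size;
	bp->b_blkno = blkno;
	bp->b_edev = dev;

	(void) bdev_strategy(bp);	/* hand off to the block driver */
	error = biowait(bp);		/* fails fast on a retryable error,
					 * but can still block indefinitely
					 * if the device never responds */
	freerbuf(bp);
	return (error);
}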

> I understand that Sun in general don't want to add fault management to
> ZFS, but I don't see how this particular timeout does anything other
> than help ZFS when it's dealing with such a diverse range of media.  I
> agree that ZFS can't know itself what should be a valid timeout, but
> that's exactly why this needs to be an optional administrator set
> parameter.  The administrator of a storage array who wants to set this
> certainly knows what a valid timeout is for them, and these timeouts
> are likely to be several orders of magnitude larger than the standard
> response times.  I would configure very different values for my SATA
> drives than for my iSCSI connections, but in each case I would be
> happier knowing that ZFS has more of a chance of catching bad drivers
> or unexpected scenarios.

The main problem with exposing tunables like this is that they have a
direct correlation to service actions, and mis-diagnosing failures costs
everybody (admins, companies, Sun, etc.) lots of time and money.  Once you
expose such a tunable, it becomes impossible to trust any FMA diagnosis,
because you can no longer tell whether a fault was real or the product of a
badly chosen tunable.

A better option would be to not use this for FMA diagnosis, but instead to
work it into the mirror child selection code.  This has been alluded to
before, but it would be cool to keep track of latency over time, and use it
to both a) prefer one child over another when selecting which side of the
mirror to read from, and b) proactively time out or ignore results from a
child and switch to the other if it's taking, say, several standard
deviations longer than its historical average.  This steers clear of
diagnosing drives as faulty, but still allows ZFS to make better choices
and maintain response times.  It shouldn't be hard to keep track of the
average and/or standard deviation and use them for selection; proactively
timing out the slow I/Os is much trickier.
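To make that concrete, here's the kind of bookkeeping I mean (the struct
and function names below are made up, not the actual vdev_mirror.c code):
an exponentially weighted mean and variance of per-child latency, a
selection function that prefers the lower-latency child, and a check for a
child that has drifted far outside its historical norm.

#include <math.h>
#include <stdint.h>

#define	EWMA_ALPHA	0.125		/* weight given to the newest sample */

typedef struct child_latency {
	double		cl_mean;	/* smoothed mean latency (usec) */
	double		cl_var;		/* smoothed variance */
	uint64_t	cl_samples;	/* observations folded in so far */
} child_latency_t;

/* Fold a completed I/O's latency into the child's running statistics. */
static void
child_latency_update(child_latency_t *cl, double latency_us)
{
	if (cl->cl_samples++ == 0) {
		cl->cl_mean = latency_us;
		cl->cl_var = 0.0;
		return;
	}
	double diff = latency_us - cl->cl_mean;
	cl->cl_mean += EWMA_ALPHA * diff;
	cl->cl_var = (1.0 - EWMA_ALPHA) *
	    (cl->cl_var + EWMA_ALPHA * diff * diff);
}

/* Pick the mirror child with the lowest smoothed latency for the next read. */
static int
child_select(const child_latency_t *children, int nchildren)
{
	int best = 0;
	for (int c = 1; c < nchildren; c++) {
		if (children[c].cl_mean < children[best].cl_mean)
			best = c;
	}
	return (best);
}

/* Has this child's latest latency drifted far outside its historical norm? */
static int
child_is_slow(const child_latency_t *cl, double latency_us)
{
	return (cl->cl_samples > 16 &&
	    latency_us > cl->cl_mean + 3.0 * sqrt(cl->cl_var));
}

The EWMA weights recent behavior over ancient history, which is what you
want if a drive degrades gradually; the hard part, as I said, is acting on
child_is_slow() for an I/O that is already in flight.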

As others have mentioned, things get more difficult with writes.  If I
issue a write to both halves of a mirror, should I return when the first
one completes, or when both complete?  One possibility is to expose this
as a tunable, but any such "best effort RAS" is a little dicey, because
you have very little visibility into the state of the pool in this
scenario: "is my data protected?" becomes a very difficult question to
answer.
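Just to illustrate the two policies (the names here are hypothetical, not
ZFS code): a per-write completion callback could acknowledge the logical
write either after the first child completes or only after all of them do.

#include <pthread.h>

typedef enum { WRITE_ACK_ANY, WRITE_ACK_ALL } write_ack_policy_t;

typedef struct mirror_write {
	pthread_mutex_t		mw_lock;
	pthread_cond_t		mw_cv;
	int			mw_outstanding;	/* children still in flight */
	int			mw_completed;	/* children finished so far */
	write_ack_policy_t	mw_policy;
} mirror_write_t;

/* Called once per child as its copy of the write finishes. */
static void
mirror_write_done(mirror_write_t *mw)
{
	pthread_mutex_lock(&mw->mw_lock);
	mw->mw_completed++;
	mw->mw_outstanding--;
	if ((mw->mw_policy == WRITE_ACK_ANY && mw->mw_completed == 1) ||
	    mw->mw_outstanding == 0)
		pthread_cond_broadcast(&mw->mw_cv);
	pthread_mutex_unlock(&mw->mw_lock);
}

/* Block until the chosen policy says the logical write is complete. */
static void
mirror_write_wait(mirror_write_t *mw)
{
	pthread_mutex_lock(&mw->mw_lock);
	while (mw->mw_policy == WRITE_ACK_ANY ?
	    mw->mw_completed < 1 : mw->mw_outstanding > 0)
		pthread_cond_wait(&mw->mw_cv, &mw->mw_lock);
	pthread_mutex_unlock(&mw->mw_lock);
}

Even with WRITE_ACK_ANY the slow child's write is still outstanding, which
is exactly why "is my data protected?" gets murky.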

- Eric

--
Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock