>>>>> "es" == Eric Schrock <[EMAIL PROTECTED]> writes:

    es> Are you running your experiments on build 101 or later?

no.

Aside from that quick one for copies=2, I'm pretty bad about running
well-designed experiments, and I do have old builds.  I need to buy
more hardware.

It's hard to know how to get the most stable system.  I bet it'll be a
year before this b101 stuff makes it into stable Solaris, yet the
bleeding-edge improvements are all stability-related, so for
mostly-ZFS workloads maybe it's better to run SXCE than Solaris 10 in
production.  I suppose I should be happy about that, since it means
more people will have some source. :)

    es> P.S. I'm also not sure that B_FAILFAST behaves in the way you
    es> think it does.  My reading of sd.c seems to imply that much of
    es> what you suggest is actually how it currently behaves,

Yeah, I got a private email pointing me to the spec for
PSARC/2002/126, which already includes both pieces I hoped for
(killing queued CDBs, and statefully tracking each device as
failed/good), so I take back what I said about B_FAILFAST being
useless: it should be able to help with the ZFS availability problems
we've seen.

The PSARC case says B_FAILFAST is implemented in the ``disk driver'',
which AIUI sits above the controller, just as I hoped.  But there is
more than one ``disk driver'', so the B_FAILFAST logic is not factored
out into one spot the way a vdev-level scheme would be; instead it's
punted downward and copy-and-pasted into sd, ssd, dad, ...., so
whatever experience you get with it isn't necessarily portable to
disks with a different kind of attachment.
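
To make that concrete, here's a rough sketch (mine, not actual sd or
ZFS source) of how a consumer sitting above the disk driver asks for
fast-fail semantics: it just sets B_FAILFAST on the buf before handing
it down, and the driver applies its own shortened-retry and
failed-device policy per PSARC/2002/126.  The LDI handle and the bare
error handling here are simplifying assumptions for illustration:

    #include <sys/types.h>
    #include <sys/buf.h>
    #include <sys/kmem.h>
    #include <sys/sunldi.h>

    /*
     * Sketch only: issue one fast-fail read through a layered-driver
     * handle.  The disk driver (sd, ssd, dad, ...) sees B_FAILFAST and
     * decides what to do with it; we just set the flag.
     */
    static int
    failfast_read(ldi_handle_t lh, caddr_t addr, size_t len, daddr_t blkno)
    {
            buf_t *bp = getrbuf(KM_SLEEP);
            int err;

            bp->b_flags = B_READ | B_BUSY | B_FAILFAST;
            bp->b_un.b_addr = addr;
            bp->b_bcount = len;
            bp->b_lblkno = blkno;

            (void) ldi_strategy(lh, bp);   /* the disk driver sees B_FAILFAST */
            (void) biowait(bp);            /* returns quickly if the target is flagged failed */

            err = geterror(bp);
            freerbuf(bp);
            return (err);
    }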

I still think the vdev-layer logic could make better decisions by
using more than one bit of information per device, but maybe one-bit
B_FAILFAST is enough to make me accept the shortfall as an arguable
feature rather than a unanimous bug.  Also, if it can fix my (1) and
(2) with FMA, then maybe the gap between B_FAILFAST and real
NetApp-like drive diagnosis can be closed partly in userspace, the way
developers seem to want.
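
For the record, by ``more than one bit'' I mean something like the
following.  This is purely hypothetical (none of these fields exist in
the real vdev_t); it's just meant to show the kind of state that would
let the vdev layer treat a slow-but-alive device differently from a
dead one:

    #include <sys/types.h>
    #include <sys/time.h>

    /*
     * Hypothetical per-device health record a vdev layer could keep,
     * instead of relying on the single pass/fail verdict that
     * B_FAILFAST-style tracking gives it today.
     */
    typedef struct vdev_health {
            hrtime_t  vh_last_success;   /* when the last I/O completed OK */
            hrtime_t  vh_smoothed_lat;   /* running average service time */
            uint64_t  vh_soft_errors;    /* retryable errors since last clear */
            uint64_t  vh_outstanding;    /* I/Os currently queued to the device */
            boolean_t vh_failed;         /* the one bit we effectively get today */
    } vdev_health_t;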

The problems this doesn't cover are write-related:

 * What should we do about implicit and explicit fsync()s where all
   the data is already on stable storage, but not with full
   redundancy, because one device won't finish writing?

   I think there should not be transparent recovery from this, though
   maybe others disagree.  But the pool-level failmode property doesn't
   settle the issue:

   (a) *When* will you take the failure action (if failmode != wait)?
       The property says *what* to do, not *when* to do it.

   (b) There isn't any vdev-level failure, only device-level, so it's
       not appropriate to consult the failmode property in the first
       place; the situation is different.  The question is: do we keep
       trying, or do we transition the device to FAULTED and the vdev
       to DEGRADED so that fsync()s can proceed without that device
       and hot-spare resilvering kicks in?  (A rough sketch of that
       decision follows this list.)

   (c) Inside the time interval between when the device starts writing
       slowly and when you take the (b) action, how well can you
       isolate the failure?  For example, can you ensure that
       read-only access remains instantaneous, even though atime
       updates involve writing, even though those 5-second txg flushes
       are blocked, and even though the admin might (gasp!) type
       'zpool status', or even a label-writing command like 'zpool
       attach'?  Or will one of those three things cause a pool-wide
       or ZFS-wide hang that blocks read access which could
       theoretically still work?

 * Commands like zpool attach, detach, replace, offline, and export:

    (a) should not be able to hang uninterruptibly;

    (b) should not let problems in one pool spill over into another; and

    (c) should be forcible even when they can't write everything they'd
        like to, so that rebooting isn't a necessary move in certain
        kinds of failure-recovery pool gymnastics.
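
Here's the kind of decision I mean in (b) above, written out as a
sketch rather than as real ZFS code; the names, the tunable, and the
30-second bound are all invented for illustration.  The idea: once one
device has stalled a txg/fsync write past some limit while enough
healthy siblings remain, fault the device, degrade the vdev, and let
the sync finish on the survivors, with hot-spare resilvering picking
up afterwards.

    #include <sys/types.h>
    #include <sys/time.h>

    typedef enum { DEV_KEEP_WAITING, DEV_FAULT_AND_PROCEED } stall_action_t;

    /* Made-up tunable: how long one device may stall a sync. */
    static const hrtime_t stall_limit = 30LL * NANOSEC;

    /*
     * Hypothetical policy: should a write that one device won't finish
     * keep waiting, or should that device be FAULTED (vdev DEGRADED) so
     * the fsync/txg sync completes against the remaining redundancy?
     */
    static stall_action_t
    stalled_write_policy(hrtime_t issued_at, int healthy_siblings,
        int needed_for_redundancy)
    {
            hrtime_t waited = gethrtime() - issued_at;

            /* Never fault our way below the redundancy needed to stay writable. */
            if (healthy_siblings < needed_for_redundancy)
                    return (DEV_KEEP_WAITING);

            if (waited > stall_limit)
                    return (DEV_FAULT_AND_PROCEED);

            return (DEV_KEEP_WAITING);
    }

Whether 30 seconds, or any fixed number, is the right bound is exactly
the *when* question in (a) above.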

I expect there's some quiet work on this in b101 also; at least,
someone said 'zpool status' isn't supposed to hang anymore, so I'll
have to try it out.  But B_FAILFAST isn't enough to settle the whole
issue, even setting aside the marginal performance improvements that
more ambitiously wacky schemes might promise us.
