>>>>> "rs" == Ross Smith <[EMAIL PROTECTED]> writes:
>>>>> "nw" == Nicolas Williams <[EMAIL PROTECTED]> writes:

    rs> I disagree Bob, I think this is a very different function to
    rs> that which FMA provides.

I see two problems.

 (1) FMA doesn't seem to work very well, and it was used as an excuse
     to keep proper exception handling out of ZFS for a couple of
     years, so I'm somewhat skeptical whenever it's brought up as a
     panacea.

 (2) The FMA model of collecting telemetry, taking it into user-space,
     chin-strokingly contemplating it for a while, then decreeing a
     diagnosis, is actually a rather limited one.  I can think of two
     kinds of limit:

     (a) You may be diagnosing the very pool FMA is running on: FMA
         lives on the root pool, but the root pool won't unfreeze
         until FMA diagnoses it.

         In practice it's much worse, because problems in one pool's
         devices can freeze all of ZFS, even other pools.  Or if the
         system is NFS-rooted and also exporting ZFS filesystems over
         NFS, maybe all of NFS freezes.  Problems like that knock out
         FMA.  In-kernel diagnosis is harder to knock out.

     (b) Calls sleep uninterruptibly in the path that returns events
         to FMA.  The plan is: ``call down into the controller driver,
         wait for it to return success or failure, then count the
         event and call back to FMA as appropriate.  If something's
         borked, FMA will eventually return a diagnosis.''  That plan
         is useless if the controller simply freezes: FMA never sees
         anything.  You are analyzing faults, yes, but only with
         hindsight.  When do you do the FMA callback?  To implement a
         real timeout this way, you'd have to do a callback before and
         after each I/O, which is obviously too expensive.

         Likewise, when FMA returns the diagnosis, are you prepared to
         act on it?  Or are you busy right now, and you're going to
         act on it just as soon as that controller returns success or
         failure?

         You can't abstract the notion of time out of your diagnosis.
         Trying to compartmentalize diagnosis interferes with working
         timeouts into the low-level event loops where they're
         sometimes needed (rough sketch below).
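
To make (b) concrete, here's a rough C sketch (hypothetical names, not
actual Solaris or FMA code) of the difference between ``report the
fault from the completion path'' and ``check a deadline from the
driver's own timer.''  Only the second form still fires if the
controller never completes the command at all:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct io_cmd {
        uint64_t       ic_deadline_ns;  /* issue time + per-command budget */
        struct io_cmd *ic_next;
    } io_cmd_t;

    /* Hypothetical telemetry hook, standing in for an FMA ereport. */
    void report_stalled_io(io_cmd_t *cmd);

    /*
     * Runs from a periodic timer, not from an I/O completion interrupt,
     * so a wedged controller cannot suppress it.
     */
    static void
    io_watchdog(io_cmd_t *pending, uint64_t now_ns)
    {
        for (io_cmd_t *c = pending; c != NULL; c = c->ic_next) {
            if (now_ns > c->ic_deadline_ns)
                report_stalled_io(c);   /* fault visible now, no hindsight */
        }
    }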

It's not a matter of where things taxonomically belong, or where it
feels clean to put some functionality in your compartmentalized,
layered tower.  Certain things just aren't achievable from certain
places.

    nw> If we're talking isolated, or even clumped-but-relatively-few
    nw> bad sectors, then having a short timeout for writes and
    nw> remapping should be possible 

I'm not sure I understand the state machine for the remapping plan,
but I think your idea is: try to write to some spot on the disk.  If
it takes too long, cancel the write and try writing somewhere else
instead.  Then do bad-block remapping: fix up all the pointers for the
new location, mark the spot that took too long as poisonous, all that.
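
In rough pseudo-C, the plan as I read it looks like this (purely
illustrative; none of these functions are real ZFS or driver
interfaces, and cancel_write() is exactly the operation I argue below
doesn't exist):

    #include <stdint.h>

    typedef uint64_t lba_t;

    /* Hypothetical primitives the plan assumes exist. */
    int   write_with_timeout(lba_t lba, const void *buf, int timeout_ms);
    void  cancel_write(lba_t lba);       /* the step that doesn't exist */
    void  mark_poisoned(lba_t lba);
    lba_t allocate_new_lba(void);
    void  fixup_block_pointers(lba_t old_lba, lba_t new_lba);

    static void
    remap_on_slow_write(lba_t lba, const void *buf)
    {
        /* Keep trying new locations until a write completes in time. */
        while (write_with_timeout(lba, buf, 10000) != 0) {
            cancel_write(lba);           /* assumed cancellable (it isn't) */
            mark_poisoned(lba);          /* remember the slow spot */
            lba_t new_lba = allocate_new_lba();
            fixup_block_pointers(lba, new_lba);
            lba = new_lba;               /* try writing somewhere else */
        }
    }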

I don't think it'll work.  First, you can't cancel the write.  Once
you dispatch a write that hangs, you've locked up, at a minimum, the
drive trying to write.  You don't get the option of remapping and
writing elsewhere, because the drive's stopped listening to you.
Likely, you've also locked up the bus (if the drive's on PATA or
SCSI), or maybe the whole controller.  (This is IMHO the best reason
for laying out a RAID to survive a controller failure---interaction
with a bad drive could freeze a whole controller.)

Even if you could cancel the write, when do you cancel it?  If you
know your drive and controller so well that you can convince them to
ignore you for 10 seconds instead of two minutes when they hit a block
they can't write, you've still got approximately the same problem,
because you don't know where the poison sectors are.  You'll probably
hit another one.  And even a ten-second write means the drive's
performance is shot by almost three orders of magnitude---it's not
workable.  (A healthy drive completes a random write in roughly ten
milliseconds; ten seconds per write is about a thousand times slower.)

Finally, this approach interferes with diagnosis.  The drives have
their own retry state machines.  If you start muddling all this
ad-hoc stuff on top of them, you can't tell the difference between
drive failures, cabling problems, and controller failures.  You end
up with normal thermal-recalibration events being treated as some
kind of ``spurious late read,'' and you invent all these strange,
unexplained failure terms that make it impossible to write a paper
like the NetApp or Google papers on UNCs we used to cite in here all
the time, because your failure statistics no longer correspond to a
single layer of the storage stack and can't be compared with anyone
else's.  Also, remember that we suspect, and wish to tolerate, drives
that operate many standard deviations outside their specification,
even when they're not broken, suspect, or about to break.  There are
two reasons: first, we think they might actually do that; second,
otherwise you can't collect performance statistics you can compare
with others'.

That's why the added failure handling I suggested is only to ignore
drives---either for a little while, or permanently.  Merely ignoring a
drive, without telling the drive you're ignoring it, doesn't interfere
with collecting statistics from it.

The two queues inside the drive (retryable and deadline) would let you
do this bad-block remapping, but no drive implements them, and they're
probably impossible to implement because of the sorts of things drives
do while ``retrying.''  I described the drive-QoS idea to explain why
this B_FAILFAST-ish plan of supervising the drive's recovery behavior,
or any plan involving ``cancelling'' CDBs, is never going to work.

Here is one variant of this remapping plan that I think could work,
which somewhat preserves the existing storage-stack layering (a rough
code sketch follows the list):

 * add a timeout to B_FAILFAST CDBs above the controller driver, a
   short one, like a couple of seconds.

 * when a drive is busy on a non-B_FAILFAST transaction for longer
   than the B_FAILFAST timeout, walk through the CDB queue and
   instantly fail all the B_FAILFAST transactions, without even
   sending them to the drive.

 * when a drive blows a B_FAILFAST timeout, admit no more B_FAILFAST
   transactions until it successfully completes a non-B_FAILFAST
   transaction.  If the drive is marked timeout-blown, and no
   transactions are queued for it, wait 60 seconds and then make up a
   fake transaction for it, like ``read one sector in the middle of
   the disk.''
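
A self-contained sketch of that state machine, with made-up names
(this is not the actual sd or scsi_vhci code, just the shape of the
idea; for brevity it gates B_FAILFAST commands at admission rather
than walking an existing CDB queue):

    #include <stdbool.h>
    #include <stdint.h>

    #define FAILFAST_TIMEOUT_NS (2ULL * 1000000000ULL)    /* ~2 seconds  */
    #define PROBE_INTERVAL_NS   (60ULL * 1000000000ULL)   /* ~60 seconds */

    typedef struct drive_state {
        bool     ds_timeout_blown;     /* a B_FAILFAST deadline was missed */
        uint64_t ds_busy_since_ns;     /* start of current ordinary cmd, 0 if idle */
        uint64_t ds_last_activity_ns;  /* last completion or probe */
    } drive_state_t;

    typedef enum { ADMIT, FAIL_NOW } verdict_t;

    /* Decide, above the controller driver, what to do with a new command. */
    static verdict_t
    admit_command(drive_state_t *ds, bool failfast, uint64_t now_ns)
    {
        if (!failfast)
            return (ADMIT);
        if (ds->ds_timeout_blown)        /* drive known-bad for failfast work */
            return (FAIL_NOW);
        if (ds->ds_busy_since_ns != 0 &&
            now_ns - ds->ds_busy_since_ns > FAILFAST_TIMEOUT_NS)
            return (FAIL_NOW);           /* fail instantly, don't even send it */
        return (ADMIT);
    }

    /* A B_FAILFAST command blew its timeout: stop admitting failfast work. */
    static void
    failfast_timeout_blown(drive_state_t *ds)
    {
        ds->ds_timeout_blown = true;
    }

    /* An ordinary command completed successfully: readmit failfast work. */
    static void
    ordinary_command_succeeded(drive_state_t *ds, uint64_t now_ns)
    {
        ds->ds_timeout_blown = false;
        ds->ds_busy_since_ns = 0;
        ds->ds_last_activity_ns = now_ns;
    }

    /*
     * Called from a periodic timer: if the drive is marked timeout-blown
     * and nothing is queued for it, make up a fake transaction (read one
     * sector in the middle of the disk) every ~60 seconds as a probe.
     */
    static bool
    want_probe(const drive_state_t *ds, uint64_t now_ns, int queued_cmds)
    {
        return (ds->ds_timeout_blown && queued_cmds == 0 &&
            now_ns - ds->ds_last_activity_ns > PROBE_INTERVAL_NS);
    }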

I like the vdev-layer ideas better than the block-layer ideas though.

    nw> What should be the failure mode of a jbod disappearing due to
    nw> a pulled cable (or power supply failure)?  A pause in
    nw> operation (hangs)?  Or faulting of all affected vdevs, and if
    nw> you're mirrored across different jbods, incurring the need to
    nw> re-silver later, with degraded operation for hours on end?

The resilvering should only include things written during the outage,
so the degraded operation will last a time proportional to the length
of the outage.  Resilvering is already supposed to work this way.

The argument, I think, will be over the idea of auto-onlining things.
My opinion: if you are dealing with a failure by deciding to return
success from fsync() with fewer copies of the data written, then
getting back to normal should require either a spare rebuild or
manually issuing 'zpool clear'.  Certain kinds of rocking
behavior---like changes to the mirror round-robin, or delaying writes
of non-fsync() data---are okay, but rocking back and forth between
redundancy states automatically during normal operation is probably
unacceptable.

The counter-opinion, I suppose, might be that we get better MTTDL
(mean time to data loss) by writing as quickly as possible to as many
places as possible, so automatic onlining is good.  But I don't think
so.
