On Tue, Nov 25, 2008 at 11:55:17AM +0100, [EMAIL PROTECTED] wrote:
> >My idea is simply to allow the pool to continue operation while
> >waiting for the drive to fault, even if that's a faulty write.  It
> >just means that the rest of the operations (reads and writes) can keep
> >working for the minute (or three) it takes for FMA and the rest of the
> >chain to flag a device as faulty.
> 
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog
> for a single disk.

If we're talking about isolated, or even clumped-but-relatively-few
bad sectors, then a short timeout on writes followed by remapping
should be doable without running out of memory to cache those
writes.  But...

...writes to bad sectors will happen when txgs flush, and depending on
how bad sector remapping is done (say, by picking a new block address
and updating the blkptrs that referred to the old one), that might mean
redoing large chunks of the txg in the next one, which in turn could
delay fsync() by an additional 5 seconds or so.  Even if that's not the
case, writes to mirrors are supposed to be synchronous, so one would
expect bad block remapping to be synchronous as well; thus there must
be some delay on writes to bad blocks no matter what -- though that
delay could be tuned to be no more than a few seconds.
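
To make that concrete, here's a toy model (not ZFS code; the structs
and names are made up): giving a bad leaf block a new address changes
its parent's embedded blkptr, which dirties the parent, which dirties
the grandparent, and so on up the tree -- that's the work that spills
into the next txg.

    /* Toy model of remap-as-reallocate; not actual ZFS code. */
    #include <stdio.h>
    #include <stdbool.h>

    struct blk {
        struct blk *parent;     /* NULL at the root */
        unsigned long addr;     /* pretend block address (DVA) */
        bool dirty;             /* must be rewritten next txg */
    };

    static unsigned long next_free_addr = 1000;

    /* "Remap" a block that failed to write: pick a fresh address and
     * dirty every ancestor whose embedded blkptr now has to change. */
    static void remap_bad_block(struct blk *b)
    {
        b->addr = next_free_addr++;
        for (struct blk *p = b; p != NULL; p = p->parent)
            p->dirty = true;
    }

    int main(void)
    {
        struct blk root   = { NULL,    0, false };
        struct blk branch = { &root,   1, false };
        struct blk leaf   = { &branch, 2, false };

        remap_bad_block(&leaf);

        printf("leaf remapped to %lu; branch dirty=%d, root dirty=%d\n",
            leaf.addr, branch.dirty, root.dirty);
        return 0;
    }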

That points to a possibly decent heuristic on writes: vdev-level
timeouts that result in bad block remapping, but if the queue of
outstanding bad block remappings grows too large, treat the disk
as faulted and degrade the pool.
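
Something like this sketch, maybe (the names and the threshold are
invented for illustration, not existing ZFS interfaces):

    #include <stdio.h>

    #define REMAP_QUEUE_LIMIT 128   /* arbitrary; would be tunable */

    enum vdev_state { VDEV_HEALTHY, VDEV_FAULTED };

    struct vdev {
        int pending_remaps;         /* outstanding bad-block remaps */
        enum vdev_state state;
    };

    /* Called when a write to this vdev times out. */
    static void on_write_timeout(struct vdev *vd)
    {
        if (vd->state == VDEV_FAULTED)
            return;

        if (vd->pending_remaps < REMAP_QUEUE_LIMIT) {
            vd->pending_remaps++;   /* queue a bad-block remap */
        } else {
            /* Too many outstanding remaps: give up on the disk. */
            vd->state = VDEV_FAULTED;
            printf("vdev faulted with %d remaps outstanding\n",
                vd->pending_remaps);
        }
    }

    int main(void)
    {
        struct vdev vd = { 0, VDEV_HEALTHY };

        for (int i = 0; i < 200; i++)
            on_write_timeout(&vd);
        return 0;
    }

The limit is really a statement about how much write backlog you're
willing to hold in memory while waiting for the disk, so it would
want to be tunable per pool.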

Sounds simple, but it needs to be combined at a higher layer with
information from other vdevs.  Unplugging a whole jbod shouldn't
necessarily fault all the vdevs on it -- perhaps it should cause
pool operation to pause until the jbod is plugged back in... which
should then cause those outstanding bad block remappings to be
rolled back since they weren't bad blocks after all.
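
Roughly this sort of decision, I imagine (again just a sketch with
made-up names, not real ZFS logic):

    #include <stdbool.h>
    #include <stdio.h>

    struct disk {
        int jbod_id;            /* which enclosure the disk sits behind */
        bool responding;
        int pending_remaps;     /* remaps queued while it timed out */
    };

    /*
     * Pause rather than fault if every disk behind the affected jbod
     * has gone quiet at the same time -- that pattern looks like a
     * pulled cable or dead power supply, not bad sectors.
     */
    static bool should_pause(const struct disk *d, int n, int jbod)
    {
        for (int i = 0; i < n; i++)
            if (d[i].jbod_id == jbod && d[i].responding)
                return false;   /* that jbod still has a live disk */
        return true;
    }

    /* When the jbod comes back, roll back the queued remappings. */
    static void jbod_returned(struct disk *d, int n, int jbod)
    {
        for (int i = 0; i < n; i++) {
            if (d[i].jbod_id == jbod) {
                d[i].responding = true;
                d[i].pending_remaps = 0;   /* not bad blocks after all */
            }
        }
    }

    int main(void)
    {
        struct disk disks[] = {
            { 0, false, 3 },    /* jbod 0: every disk has gone quiet */
            { 0, false, 1 },
            { 1, true,  0 },    /* jbod 1: still responding */
        };

        printf("pause? %s\n", should_pause(disks, 3, 0) ? "yes" : "no");
        jbod_returned(disks, 3, 0);
        return 0;
    }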

That's a lot of fault detection and handling logic across many layers.

Incidentally, cables do fall out, or, rather, get pulled out
accidentally.  What should be the failure mode of a jbod disappearing
due to a pulled cable (or power supply failure)?  A pause in operation
(hangs)?  Or faulting of all affected vdevs, and if you're mirrored
across different jbods, incurring the need to resilver later, with
degraded operation for hours on end?  I bet answers will vary.  The best
answer is to provide enough redundancy (multiple power supplies,
multi-pathing, ...) to make such situations less likely, but that's not
a complete answer.

Nico
-- 