Re: [zfs-discuss] hot spares - in standby?

Richard Elling Mon, 05 Feb 2007 20:47:57 -0800

Torrey McMahon wrote:

Richard Elling wrote:
Good question. If you consider that mechanical wear out is what ultimately
causes many failure modes, then the argument can be made that a spun down
disk should last longer. The problem is that there are failure modes which
are triggered by a spin up.  I've never seen field data showing the difference
between the two.
Often, the spare is up and running but for whatever reason you'll have a
bad block on it and you'll die during the reconstruct. Periodically
checking the spare means reading from and writing to it over time in order
to make sure it's still ok. (You take the spare out of the trunk, you look
at it, you check the tire pressure, etc.) The issue I see coming down the
road is that we'll start getting into a "Golden Gate paint job" where it
takes so long to check the spare that we'll just keep the process going
constantly. Not as much wear and tear as real i/o but it will still be up
and running the entire time and you won't be able to spin the spare down.

In my experience, checking the spare tire leads to getting a flat and needing
the spare about a week later :-)  It has happened to me twice in the past
few years... I suspect a conspiracy... :-)

Back to the topic, I'd believe that some combination of hot, warm, and
cold spares would be optimal.

Anton B. Rang wrote:
> Shouldn't SCSI/ATA block sparing handle this?  Reconstruction should be
> purely a matter of writing, so "bit rot" shouldn't be an issue; or are
> there cases I'm not thinking of? (Yes, I know there are a limited number of
> spare blocks, but I wouldn't expect a spare which is turned off to develop
> severe media problems...am I wrong?)

In the disk, at the disk block level, there is fairly substantial ECC.
Yet, we still see data loss.  There are many mechanisms at work here.  One
that we have studied in some detail is superparamagnetic decay -- the medium
wishes to decay to a lower-energy state, losing information in the process.
One way to "prevent" this is to rewrite the data -- basically resetting the
decay clock.  The study we did on this says that rewriting your data once
per year is reasonable.  Note that ZFS is COW, and scrubbing is currently a
read operation which will only write when data needs to be reconstructed.
I look at this as: rewrite-style scrubbing is preventative, read and verify
style scrubbing is prescriptive.  Either is better than neither.
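
To make the distinction concrete, here is a toy model in Python. It is only
a sketch under loud assumptions: the Block record, the CRC32 checksum, the
"mirror" list of known-good copies, and the function names are all
hypothetical, and none of this is how ZFS is implemented. It just shows that
a read-and-verify pass writes only when a checksum fails, while a
rewrite-style pass also refreshes healthy blocks older than some interval
(one year here, following the study mentioned above).

```python
import zlib
from dataclasses import dataclass

YEAR_DAYS = 365  # rewrite interval taken from the "once per year" figure


@dataclass
class Block:
    data: bytes
    checksum: int     # CRC32 recorded when the block was written
    written_day: int  # day number of the last write (the "decay clock")


def write_block(data, day):
    """Write data and reset the decay clock to `day`."""
    return Block(data, zlib.crc32(data), day)


def read_verify_scrub(blocks, mirror, today):
    """Read every block; rewrite only those whose checksum fails.

    `mirror` is a hypothetical list of known-good copies used for repair.
    """
    repaired = 0
    for i, blk in enumerate(blocks):
        if zlib.crc32(blk.data) != blk.checksum:
            blocks[i] = write_block(mirror[i], today)
            repaired += 1
    return repaired


def rewrite_scrub(blocks, mirror, today, max_age=YEAR_DAYS):
    """Repair bad blocks and also rewrite healthy blocks older than max_age."""
    rewritten = 0
    for i, blk in enumerate(blocks):
        bad = zlib.crc32(blk.data) != blk.checksum
        old = today - blk.written_day >= max_age
        if bad or old:
            source = mirror[i] if bad else blk.data
            blocks[i] = write_block(source, today)
            rewritten += 1
    return rewritten
```

In this model a block that never fails its checksum is never rewritten by
read_verify_scrub, so its decay clock keeps running; rewrite_scrub is the
only pass that resets it.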

In short, use spares and scrub.
 -- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
