2012-05-24 18:55, Richard Elling wrote:
This is a big assumption -- that the disk will operate normally, even
for data it cannot read. In my experience, this assumption is not valid
for the majority of HDD failure modes. Also, in the case of consumer-grade
disks, a single sector media error could take a very long time to
retry/fail.

Indeed it is, and I covered this earlier in the thread: the bulk
copying phase ("DD-phase") should monitor its real progress, and if
it detects lag compared to the average or expected speed (expected =
some tuning variable, e.g. 50Mb/s), the process should either skip
over some (arbitrary) range of sectors and continue from another
location (the skipped sectors are indeed in danger until the
scrub-phase detects and reconstructs them), or fall back to the
original resilver method completely. I described that in as much
detail as I had thought through at the time of that posting, and I
can't add much to it yet.
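
To make the skip-on-lag idea concrete, here is a minimal sketch in
Python. It is not ZFS code: dd_phase(), copy_chunk(), the 512-byte
sector size and all the tuning parameters are hypothetical names and
values chosen for illustration. It compares each chunk only against
a fixed expected rate (not a running average), and it conservatively
treats a slow chunk as untrusted, leaving it to the scrub-phase
together with the rest of the skipped range.

```python
from typing import Callable, List, Tuple

SECTOR_BYTES = 512  # assumed sector size, for illustration only


def dd_phase(
    total_sectors: int,
    chunk_sectors: int,
    copy_chunk: Callable[[int, int], float],
    min_bytes_per_sec: float,
    skip_sectors: int,
    max_skips: int,
) -> Tuple[List[Tuple[int, int]], bool]:
    """Bulk-copy loop that skips ahead when throughput lags.

    copy_chunk(start, count) is a hypothetical callback that copies
    `count` sectors starting at `start` and returns the seconds taken.
    Returns (skipped_ranges, fell_back), where each skipped range is a
    half-open (start, end) sector interval left for the scrub-phase.
    """
    skipped: List[Tuple[int, int]] = []
    pos = 0
    while pos < total_sectors:
        count = min(chunk_sectors, total_sectors - pos)
        elapsed = copy_chunk(pos, count)
        if elapsed > 0:
            rate = count * SECTOR_BYTES / elapsed
        else:
            rate = float("inf")
        if rate >= min_bytes_per_sec:
            pos += count  # chunk copied at an acceptable speed
            continue
        # Lag detected: mark this chunk and the range after it as
        # skipped, then continue from another location.
        end = min(pos + max(skip_sectors, count), total_sectors)
        skipped.append((pos, end))
        if len(skipped) > max_skips:
            return skipped, True  # too many slow spots: fall back
        pos = end
    return skipped, False
```

A caller would pass its own copy_chunk() and treat fell_back == True
as the signal to abandon the DD-phase for one of the older methods.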

From what I've seen, faulty sectors are usually either single errors
or a "scratched" range which can be worked around, e.g. by
partitioning for legacy FSes (if the SMART relocation doesn't deal
with them properly for any reason), while most of the rest of the
disk is okay. Retries may be lengthy, ranging from several seconds
up to a minute, but they are often confined to a few locations and
*may* add little delay in the overall scheme of things. If the delay
is more than acceptable and/or we can't find a "working location" on
the source disk, we just fall back to the old method: either the
original resilver or, if much data has already been copied to the new
disk, the new selective scrub (which is much like the resilver, but
takes into account those sectors on the target disk which may have
been copied over correctly).
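
The choice between the two fallbacks could be sketched as follows.
Again this is only an illustration: choose_fallback() and the 50%
threshold standing in for "much data has been copied" are my
assumptions, and the skipped ranges are assumed to be non-overlapping
and to lie below the position the copy had reached.

```python
from typing import List, Tuple


def choose_fallback(
    reached: int,
    skipped: List[Tuple[int, int]],
    total_sectors: int,
    min_copied_fraction: float = 0.5,  # assumed threshold for "much data"
) -> str:
    """Pick a fallback after the bulk copy is abandoned.

    `reached` is the sector position where the copy stopped; `skipped`
    holds half-open (start, end) ranges that were not copied.
    """
    if total_sectors <= 0:
        raise ValueError("total_sectors must be positive")
    skipped_count = sum(end - start for start, end in skipped)
    copied = reached - skipped_count
    if copied / total_sectors >= min_copied_fraction:
        # Enough data is already on the target to be worth reusing.
        return "selective_scrub"
    return "original_resilver"
```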

A somewhat worse case is intermittent errors at random times and
logical disk locations due to who knows what: overheating, firmware
overflow errors, bus resets, or whatever. These are perhaps the main
reason for scrub-validating the data after a mass migration (as well
as a reason for preventive regular scrubs)...

//Jim
