I hope some good comes out of this thread after all - see below...
I wonder if anyone else thinks the following proposal is reasonable? ;)
2012-05-18 10:18, Daniel Carosone wrote:
> Let's go over those, and clarify terminology, before going through
> the rest of your post:
> ...* Replace: A device has gone, and needs to be completely
> reconstructed.
As I detail below, I see Replace happening when a device
is going to be gone - but is still available and is being
proactively replaced.
> Scrub is very similar to normal reads, apart from checking all copies
> rather than serving the data from whichever copy successfully returns
> first. Errors are not expected, but are counted and repaired as/if found.
>
> Resilver and Replace are very similar, and the terms are often used
> interchangeably. Replace is essentially resilver with a starting TXG of
> 0 (plus some labelling). In both cases, an error is expected or
> assumed from the device in question, and repair is initiated
> unconditionally (and without incrementing error counters).
>
> You're suggesting an asymmetry between Resilver and Replace to exploit
> the possible speedup of sequential access; OK, seems attractive at
> first blush, let's explore the idea.
Well, I went to the swimming pool today to swim my half-mile
and clear my head (metaphorically at least), and from the
depths I emerged with another idea.
From what I see with the pool I'm upgrading (in another
thread), there is also a "Replace" mode for hotspare devices,
namely:
* I attached the hotspare to the pool
zpool add poolname spare c1t2d0
* I asked the pool to migrate a flaky disk's data to the new disk:
zpool replace poolname c5t6d0 c1t2d0
* I asked the pool to forget the old disk so it can be removed:
zpool detach poolname c5t6d0
(cfgadm, removal, plug in the new disk, cfgadm, etc.)
From iostat I see that all of the TLVDEV's existing drives, including
the one being replaced, are actively thrashed by reads for many
hours, with some writes pouring onto the new disk.
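(For reference, I am watching this with a plain extended iostat;
the 10-second interval is an arbitrary choice:
    iostat -xn 10
)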
SO THE IDEA IS as follows: a disk being explicitly replaced,
as in upgrades of the pool to larger drives, should first be
copied onto the new media "DD-style" - sequential IO for both
devices, bandwidth-bound and rather fast. Then there should be
a selective scrub, reading and checking allocated blocks from
this TLVDEV only - like resilver does today - and repairing any
discrepancies (the pool was likely live during the "DD stage",
and read errors were possible on the source drive as on any
other drive). After this selective scrub the process is complete.
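To illustrate the two phases, here is a rough manual analogue
that one could try on an exported pool (device names follow the
example above; note that a raw copy duplicates the ZFS labels
and GUIDs too, so this only works if the old disk is physically
removed afterwards - in a live pool ZFS itself would have to
drive the copy):

    # Phase 1: sequential, bandwidth-bound copy of the whole old disk;
    # conv=noerror,sync pads unreadable sectors instead of aborting
    dd if=/dev/rdsk/c5t6d0s0 of=/dev/rdsk/c1t2d0s0 bs=1024k conv=noerror,sync

    # Phase 2: swap the disks, re-import, and let ZFS walk the
    # allocated blocks, repairing whatever is stale or was skipped
    zpool import poolname
    zpool scrub poolname

The proposed mode would do the same online, and would limit the
second phase to the allocated blocks of the affected TLVDEV
rather than scrubbing the whole pool.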
BENEFITS:
* The pool quickly gets a more-or-less good copy of the original
disk, provided it has not died completely and can serve reads
for the DD-style copying. This shrinks the window during which
the TLVDEV is exposed to complete failure due to reduced
redundancy, and can already help to salvage much of the data
from a partly bad source disk.
That is, after the DD-style copy the new disk can already serve
much of the valid data, and discrepancies should be repairable
through the normal checksum-mismatch path - even if the old disk
kicks the bucket and/or is removed before the selective scrub
completes and gracefully finishes the replacement procedure.
The standard scrubbing approach after the DD-copy ensures that
by the end of the procedure the new disk's data is fully valid.
It also means we need not worry about the source disk being
updated at locations ahead of or behind the current copy
position - some corrections by the selective scrub are expected
anyway. (Though, arguably, incoming writes could be directed to
both the source disk and its syncing-up replacement, landing in
the correct sector locations right from the start.)
* Instead of scheduling many random writes - which may be slower
due to sync requirements, caching priorities, etc. - we lean
towards many random reads, which would be issued anyway under
the current replace/resilver mode. Arguably, reads can be
reordered more aggressively by the ZFS pipeline and by HDD
NCQ/TCQ, and more safely than random writes can.
* This method should benefit raidz as well as mirrors, although
the latter have more options to cheaply recover bad sectors
(detected as HDD IO errors on the source disk) on the fly,
during the DD phase.
CAVEATS:
* This mode benefits users whose pools are rather fragmented
and full, so that a sequential copy is noticeably faster than
BP-tree-walk-based resilvering; it is about 30x quicker on the
well-utilized servers and home NASes that I see. For example,
on the Thumper in my other thread, resilvering a 250GB disk
(partition) takes 15-17 hours, while writing files and zfs-send
streams into a single-disk ZFS pool on the same 3TB drive fills
it up in 24 hours; a full scrub of the original pool (45 x 250GB)
takes 24-27 hours. (Back of the envelope: 250GB at a sustained
sequential ~100MB/s is well under an hour, versus 15-17 hours
of random-IO resilvering.) Time matters.
The ZFS "marketing" says it repairs faster because it only
tests and copies the allocated data, not the whole disk as
other RAID systems do - well, this holds only as long as the
pools are kept relatively empty. While I can see the benefit of
limiting disk seeks via partitioning - i.e. buying a 100TB array
and using only 10TB by allocating smaller disk slices - I see no
good reason to buy a 100TB array and consistently keep only
10TB of it used, sorry.
Perhaps this mode with the DD-style preamble should be triggered
by a separate command-line option (at the admin's discretion);
or, if it ever becomes a default, it should replace the original
resilver-only method above some watermark of disk utilization
and/or known fragmentation. A possible invocation is sketched
below.
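For illustration only - this flag does not exist today, the
name is invented:

    # hypothetical: -S asks for the sequential "DD-style" preamble
    # followed by the selective scrub, instead of a plain resilver
    zpool replace -S poolname c5t6d0 c1t2d0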
* If the original disk (being replaced) is a piece of faulty
hardware, it can cause problems during the DD stage, such as:
** Lags - HDD retries on bad sectors can take a considerable
amount of time, depending on firmware settings/capabilities.
** Loss of device visibility from the controller, reset storms,
etc., or physical failure of the original disk during the copy -
these would make it impossible to continue reading the disk.
** Erroneous reads - returned garbage will be compensated for
by the scrub that follows the DDing phase.
If the DDing process detects that the average read speed has
dropped to some unacceptable level, or has stalled completely,
it can try seeking to another location on the original disk,
and/or abandon the DD phase and fall back to the original
resilvering from the other vdevs. This does not mean that
retries should be avoided, or that the first encountered error
(even a connection error) should abort or restart the DD phase;
a crude sketch of such a skip-ahead loop follows this item.
It is not yet critical if some sectors were skipped during the
DDing phase - the following selective scrub will (should)
recover them, possibly by retrying the original disk as well;
and perhaps by that time the drive will have managed to recover
and relocate the bad sectors in the background.
Even if the DD phase was aborted after a non-trivial amount of
copying, the subsequent scrub/resilver should, IMHO, also read
from the partially filled new disk and only rewrite those
sectors that actually need rewriting (especially important for
media sensitive to write-wearing).
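A crude userland sketch of such a skip-ahead copy loop (Solaris
sh; device names and sizes are examples, and a real
implementation would live inside ZFS and also watch throughput,
not just exit codes):

    #!/bin/sh
    SRC=/dev/rdsk/c5t6d0s0      # flaky source disk (example)
    DST=/dev/rdsk/c1t2d0s0      # replacement disk (example)
    TOTAL_MB=238475             # device size in 1MB blocks (example)
    CHUNK_MB=1024               # copy in 1GB chunks
    OFF=0
    while [ $OFF -lt $TOTAL_MB ]; do
        # conv=noerror,sync pads unreadable blocks instead of aborting;
        # iseek/oseek are the Solaris dd operands (GNU dd: skip=/seek=)
        dd if=$SRC of=$DST bs=1024k count=$CHUNK_MB \
            iseek=$OFF oseek=$OFF conv=noerror,sync 2>/dev/null ||
            echo "chunk at ${OFF}MB failed; leaving it to the scrub" >&2
        OFF=`expr $OFF + $CHUNK_MB`
    done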
* The overall (wallclock) length of this replacement is likely
to be greater than with the original method - the scrub should
take about as long as a resilver would, plus the time for the
DDing, perhaps plagued by retries etc. when hitting faulty
sectors. However, a milestone of relative data safety is
reached a lot sooner (if the source disk is substantially
readable).
* Errors (on the target disk) are expected during the selective
scrub stage; they should be fixed quietly, without bumping the
CKSUM error counters or causing other panicky clutter.
This is so far a relatively raw idea and I've probably missed
something. Do you think it is worth pursuing and asking some
ZFS developers to make a POC? ;)
Thanks,
//Jim Klimov