I hope there is some good outcome of this thread after all, below...
I wonder if anyone else thinks the following proposal is reasonable? ;)

2012-05-18 10:18, Daniel Carosone wrote:
> Let's go over those, and clarify terminology, before going through the
> rest of your post:
> ...* Replace: A device has gone, and needs to be completely
>      reconstructed.

As I detail below, I see Replace also happening when a device
is going to be gone - i.e. it is still available but is being
proactively replaced.


> Scrub is very similar to normal reads, apart from checking all copies
> rather than serving the data from whichever copy successfully returns
> first. Errors are not expected, are counted and repaired as/if found.
>
> Resilver and Replace are very similar, and the terms are often used
> interchangeably. Replace is essentially resilver with a starting TXG of
> 0 (plus some labelling). In both cases, an error is expected or
> assumed from the device in question, and repair initiated
> unconditionally (and without incrementing error counters).
>
> You're suggesting an asymmetry between Resilver and Replace to exploit
> the possible speedup of sequential access; ok, seems attractive at
> first blush, let's explore the idea.

Well, I went to the swimming pool today to swim the half-mile
and clear my head (metaphorically at least), and from the
depths I emerged with another idea:

From what I see with the pool I'm upgrading (in another
thread), there is also a "Replace" mode involving hot-spare
devices, namely:
* I attached the hotspare to the pool
  zpool add poolname spare c1t2d0
* I asked the pool to migrate a flaky disk's data to the new disk:
  zpool replace poolname c5t6d0 c1t2d0
* I asked the pool to forget the old disk so it can be removed:
  zpool detach poolname c5t6d0
  (cfgadm, removal, plug in the new disk, cfgadm, etc.)
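
For the record, nothing exotic is needed to watch such a replacement
in progress - these are just the standard commands I use (pool and
device names as in the example above):
  zpool status -v poolname
  iostat -xn 5
The first shows the resilvering progress and per-device error counters,
the second the per-device IO rates I refer to below.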

From iostat I see that all of the TLVDEV's existing drives, including
the one being replaced, are actively thrashed by reads for many
hours, with a stream of writes pouring onto the new disk.


SO THE IDEA IS as follows: the disk being explicitly replaced,
as in upgrades of the pool to larger drives, should first be
copied onto the new media "DD-style", which would be sequential IO
for both devices, bandwidth-bound and rather fast. Then there
should be a selective scrub, reading and checking allocated
blocks from this TLVDEV only - like resilver does today - and
repairing possible discrepancies (since the pool was likely
live during the "DD stage", and errors were possible on the
source drive as well as on any other), and after this
selective scrub the process is complete.
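
As a rough manual analogy of the "DD stage" (this is of course not
what the kernel would literally do, since ZFS would have to coordinate
the copy with incoming writes; device and slice names are just taken
from my example above):
  dd if=/dev/rdsk/c5t6d0s0 of=/dev/rdsk/c1t2d0s0 bs=1024k
i.e. one long sequential read of the old disk and one long sequential
write to the new one, both bandwidth-bound rather than seek-bound.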

BENEFITS:
* The pool quickly gets a more-or-less good copy of the original
  disk, if it has not died completely and is able to serve reads
  for the DD-style copying. This decreases the window of exposure
  of the TLVDEV to complete failure due to decreased redundancy,
  and can already help to salvage much of the data in the case
  of a partly-bad source disk.

  That is, after the DD-style copy the new disk may be able to
  serve much of the valid data, and discrepancies might be easy
  to repair via the normal checksum-mismatch repair path - even
  if the old disk kicks the bucket and/or is removed before the
  selective scrub completes and gracefully finishes the
  replacement procedure.

  The standard scrubbing approach after the DD-copy takes care
  of ensuring that by the end of the procedure the new disk's
  data is fully valid. It also means we need not worry about
  the source disk being updated in locations ahead of or behind
  the point we are currently reading - some corrections by the
  selective scrub are expected anyway.

  However, arguably, incoming writes could be placed both on the
  source disk and on its syncing-up spare replacement (into the
  correct sector locations right from the start).

* Instead of scheduling many random writes, which may be slower
  due to sync requirements, caching priorities, etc., we lean
  towards many random reads - which would be issued anyway in
  the original replace/resilver mode. Arguably, the reads can be
  optimized better by the ZFS pipeline and HDD NCQ/TCQ, and in a
  safer manner than (random) write optimizations.

* This method should benefit raidz as well as mirrors, although
  the latter may have more options to cheaply recover, on the fly
  during the DD phase, bad sectors detected (as HDD IO errors) on
  the source media of the disk being replaced.

CAVEATS:

* This mode benefits users whose pools are rather fragmented
  and full, so that a sequential copy is noticeably faster than
  BP-tree-walk based resilvering. The sequential copy is about
  30x quicker on the well-utilized servers and home NASes that
  I see.

  For example, on a Thumper in my other thread, resilvering
  of a 250GB disk (partition) takes 15-17 hours, while writing
  files and zfs-sends into a single-disk ZFS pool located on
  the same 3TB drive fills it up in 24 hours. A full scrub of
  the original pool (45*250GB) takes 24-27 hours. Time matters.
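
  To put rough numbers on this (assuming roughly 100MB/s of
  sustained sequential throughput, which is just my guesstimate
  for these drives):
    250GB / 100MB/s = 2500 sec, i.e. about 40-45 minutes
  for a DD-style copy of the whole 250GB slice, versus the 15-17
  hours of BP-tree-walking resilver observed for the same data.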

  The ZFS "marketing" states that it is quicker to repair
  because it only tests and copies the allocated data and
  not the whole disk as other RAID systems - well, this is
  only good as long as the pools are kept relatively empty.
  While I can agree with benefits of limiting the disk seeks
  via partitioning, i.e. by buying a 100Tb array and using
  only 10Tb by allocating smaller disk slices, I don't see
  a good reason to allocate the 100Tb array and consistently
  keep it used at 10Tb, sorry.

  Perhaps this mode with the DD-style preamble should be
  triggered by a separate command-line request (at the admin's
  discretion); or, if it ever becomes a default option, it
  should replace the original resilver-only method only above
  some watermark value of disk utilization and/or known
  fragmentation.
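
  For example, this could look like a new flag to zpool (the flag
  name below is purely made up to illustrate the interface I have
  in mind; no such option exists today):
    zpool replace -C poolname c5t6d0 c1t2d0
  where "-C" would request the copy-first (DD-style) replacement
  instead of the plain resilver-based one.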

* If the original disk (being replaced) is a piece of faulty
  hardware, it can cause problems during the DD stage, such as:
** Lags - HDD retries on bad sectors can take a considerable
   amount of time, depending on firmware settings/capabilities.
** Loss of device visibility from the controller, reset storms,
   etc.; physical failure of the original disk during the copy -
   these would make it impossible to continue reading the disk.
** Erroneous reads - any returned garbage will be compensated
   for by the scrub following the DDing phase.

  If the DDing process detects that the average read speed has
  dropped to some unacceptable level or has stalled completely,
  it can try to seek to another original-disk location and/or
  abandon the DD phase and fall back to the original resilvering
  from the TLVDEV's remaining devices. This does not mean that
  retries should be avoided, or that the first encountered error
  (even a connection error) should be cause for aborting or
  restarting the DD phase.

  It is not yet critical if some sectors were skipped during
  the DDing phase - the following selective scrub will (should)
  recover them, possibly by retrying the original disk as well,
  which may by then have managed to recover and relocate the
  bad sectors in the background.
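
  Again as a crude manual analogy (not a description of the
  actual implementation), this best-effort pass would behave
  much like:
    dd if=/dev/rdsk/c5t6d0s0 of=/dev/rdsk/c1t2d0s0 bs=1024k conv=noerror,sync
  where "conv=noerror" keeps going past unreadable sectors and
  "sync" pads the failed blocks - and such padded/garbage regions
  are exactly what the selective scrub would later detect and
  repair.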

  Even if the DD-phase was aborted after a non-trivial amount
  of copying, the scrub/resilver should, IMHO, also read from
  the partially filled new disk and only rewrite those sectors
  that require rewriting (especially important for media that
  is sensitive to write-wearing).

* The overall (wall-clock) length of this replacement is likely
  to be greater than that of the original method - since about
  the same amount of time will be needed for the selective scrub
  as for a resilver, plus some time for the DDing, perhaps
  plagued with retries etc. when hitting faulty sectors. However,
  a milestone of relative data safety will be reached a lot
  sooner (if the source disk is substantially readable).

* Errors (on the target disk) are expected during the selective
  scrub stage, and should be fixed quietly, without bumping the
  CKSUM error counters or producing other panicky clutter.


This is so far a relatively raw idea and I've probably missed
something. Do you think it is worth pursuing and asking some
zfs developers to make a POC? ;)

Thanks,
//Jim Klimov