I hope there is some good outcome of this thread after all, below...
I wonder if anyone else thinks the following proposal is reasonable? ;)
2012-05-18 10:18, Daniel Carosone wrote:
> Let's go over those, and clarify terminology, before going through the
> rest of your post:
> ...* Replace: A device has gone, and needs to be completely

As I detail below, I see Replace happening when a device
is going to be gone - but is still available and is being

> Scrub is very similar to normal reads, apart from checking all copies
> rather than serving the data from whichever copy successfully returns
> first. Errors are not expected; they are counted and repaired as/if found.
> Resilver and Replace are very similar, and the terms are often used
> interchangeably. Replace is essentially resilver with a starting TXG of
> 0 (plus some labelling). In both cases, an error is expected or
> assumed from the device in question, and repair is initiated
> unconditionally (and without incrementing error counters).
> You're suggesting an asymmetry between Resilver and Replace to exploit
> the possible speedup of sequential access; ok, seems attractive at
> first blush, let's explore the idea.
Well, I went to a swimming pool today to swim the half-mile
and clear my head (metaphorically at least), and from the
depths I emerged with another idea:

From what I see with the pool I'm upgrading (in the other
thread), there is also a "Replace" mode involving hotspare devices:
* I attached the hotspare to the pool
zpool add poolname spare c1t2d0
* I asked the pool to migrate a flaky disk's data to the new disk:
zpool replace poolname c5t6d0 c1t2d0
* I asked the pool to forget the old disk so it can be removed:
zpool detach poolname c5t6d0
(cfgadm, physical removal, plug in the new disk, cfgadm again, etc.)
From iostat I see that all of the TLVDEV's drives, including
the one being replaced, are actively thrashed by reads for many
hours, with some writes pouring onto the new disk.
SO THE IDEA IS as follows: a disk being explicitly replaced,
as in upgrades of the pool to larger drives, should first be
copied onto the new media "DD-style", which is sequential IO
for both devices, bandwidth-bound and rather fast. Then there
should be a selective scrub that reads and checks allocated
blocks from this TLVDEV only - like resilver does today - and
repairs any discrepancies (the pool was likely live during the
"DD stage", and read errors were as possible on the source
drive as on any other). After this selective scrub the
replacement is complete.
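To make the DD phase concrete, here is a minimal sketch in Python (file paths stand in for raw device nodes; the function and its error handling are illustrative, not ZFS code): read the outgoing disk in large sequential blocks and, on a read error, pad with zeros and move on - much like dd conv=noerror,sync - leaving the skipped regions for the follow-up scrub.

```python
# Illustrative sketch of the proposed "DD phase" (not actual ZFS code):
# copy the outgoing disk sequentially in large blocks; on a read error,
# remember the block, write zeros, and continue - the selective scrub
# is expected to repair those regions later.
def sequential_copy(src_path, dst_path, block_size=1 << 20):
    bad_blocks = []
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        index = 0
        while True:
            try:
                block = src.read(block_size)
            except OSError:
                # unreadable region: note it, skip ahead, pad with zeros
                bad_blocks.append(index)
                src.seek((index + 1) * block_size)
                block = b"\0" * block_size
            if not block:
                break
            dst.write(block)
            index += 1
    return bad_blocks  # candidates for the follow-up selective scrub
```

On a healthy source disk this degenerates into a plain bandwidth-bound streaming copy, which is exactly the attraction.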
* The pool quickly gets a more-or-less good copy of the original
disk - provided it has not died completely and can serve reads
for the DD-style copy. This shrinks the window during which the
TLVDEV is exposed to complete failure due to reduced redundancy,
and can already salvage much of the data from a partly bad
source disk.
That is, after the DD-style copy the new disk can serve much of
the valid data, and discrepancies should be repairable via the
normal checksum-mismatch path - even if the old disk kicks the
bucket and/or is removed before the selective scrub completes
and gracefully finishes the replacement procedure.
The standard scrubbing approach after the DD copy ensures that
by the end of the procedure the new disk's data is fully valid.
It also means we need not worry about the source disk being
updated at locations ahead of or behind the point we are
currently reading - some corrections by the selective scrub are
expected anyway. Alternatively, arguably, incoming writes could
be placed on both the source disk and its syncing-up spare
replacement (into the correct sector locations right from the
start).
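The selective-scrub phase might then look like this sketch (purely illustrative data structures; real ZFS would walk the block-pointer tree of the affected top-level vdev and use the checksums stored in the block pointers): verify each allocated block on the new disk and quietly rewrite only the ones that fail verification.

```python
# Illustrative selective scrub (not ZFS code): verify allocated blocks on
# the new disk against expected checksums; repair only mismatches, reading
# the good copy from surviving redundancy (mirror side / raidz rebuild).
import hashlib

def selective_scrub(new_disk, expected, read_good_copy):
    """new_disk: dict offset -> bytes; expected: dict offset -> sha256 hex;
    read_good_copy(offset): fetch the block from remaining redundancy."""
    repaired = 0
    for off, digest in expected.items():
        data = new_disk.get(off, b"")
        if hashlib.sha256(data).hexdigest() != digest:
            new_disk[off] = read_good_copy(off)  # fix quietly, no CKSUM bump
            repaired += 1
    return repaired
```

Blocks that are already valid after the DD phase are left untouched, which is also what makes the write-wear argument later in this post work.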
* Instead of scheduling many random writes - which may be slower
due to sync requirements, caching priorities, etc. - we lean
towards many random reads, which would be issued anyway in the
original replace/resilver mode. Arguably, reads can be optimized
better by the ZFS pipeline and by HDD NCQ/TCQ, and more safely
than (random) write reordering.
* This method should benefit raidz as well as mirrors, although
the latter have cheaper options to recover bad sectors (detected
as HDD IO errors) on the source disk on the fly, during the
DD phase.
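For mirrors, this on-the-fly recovery is trivial to sketch (hypothetical helper names): a block the source disk cannot read is simply fetched from the other mirror side instead of being deferred to the scrub.

```python
# Illustrative mirror shortcut during the DD phase (hypothetical helpers):
# on a source read error, fall back to the other mirror side immediately
# instead of leaving the block for the selective scrub.
def dd_read_mirror(read_source, read_other_side, index):
    data = read_source(index)
    if data is None:                   # source disk I/O error on this block
        data = read_other_side(index)  # cheap recovery, mirrors only
    return data
```

Raidz has no such cheap per-block shortcut during a raw copy, since reconstruction needs the other columns plus parity math.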
* This mode benefits users whose pools are rather fragmented
and full, where a sequential copy is noticeably faster than a
BP-tree-walk-based resilver - about 30x quicker on the
well-utilized servers and home NASes that I see.
For example, on the Thumper in my other thread, resilvering
a 250Gb disk (partition) takes 15-17 hours, while writing
files and zfs-sends into a single-disk ZFS pool located on
the same 3Tb drive fills it up in 24 hours. A full scrub of
the original pool (45*250Gb) takes 24-27 hours. Time matters.
ZFS "marketing" states that it repairs quickly because it
tests and copies only the allocated data, not the whole disk
as other RAID systems do - but this holds only as long as
pools are kept relatively empty. While I can agree there are
benefits to limiting disk seeks via partitioning - i.e. buying
a 100Tb array and using only 10Tb by allocating smaller disk
slices - I see no good reason to allocate a 100Tb array and
deliberately keep it at 10Tb used, sorry.
Perhaps this mode with the DD-style preamble should be
triggered by a separate command-line request (at the admin's
discretion); or, if it ever becomes a default option, it should
replace the resilver-only method above some watermark of disk
utilization and/or known fragmentation.
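As a back-of-envelope check using the figures above (assuming the 3Tb drive's 24-hour fill rate approximates its sustained sequential throughput; the ~30x figure presumably comes from boxes with faster sequential media):

```python
# Rough estimate from the numbers quoted above (assumptions, not measurements):
fill_rate_gb_per_h = 3000 / 24        # 3 Tb filled in 24 h -> ~125 GB/h
dd_hours = 250 / fill_rate_gb_per_h   # DD-copying a 250 Gb slice -> ~2 h
resilver_hours = 16                   # midpoint of the observed 15-17 h
print(dd_hours, resilver_hours / dd_hours)  # ~2 h vs 16 h: roughly 8x here
```

Even on this particular box the DD phase would reach the "data is mostly copied" milestone about 8x sooner than the resilver completes.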
* If the original disk (being replaced) is a piece of faulty
hardware, it can cause problems during the DD stage, such as:
** Lags - HDD retries on bad sectors can take a considerable
amount of time, depending on firmware settings/capabilities.
** Loss of device visibility from the controller, reset storms,
etc., or physical failure of the original disk during the copy -
these make it impossible to continue reading the disk.
** Erroneous reads - returned garbage will be compensated for
by the scrub that follows the DDing phase.
If the DDing process detects that the average read speed has
dropped to some unacceptable level, or has stalled completely,
it can try seeking to another location on the original disk,
and/or abandon the DD phase and fall back to the original
resilvering from the other vdevs. This does not mean that
retries should be avoided, or that the first error encountered
(even a connection error) should abort or restart the DD phase.
It is not yet critical if some sectors are skipped during the
DDing phase - the subsequent selective scrub will (should)
recover them, possibly by retrying the original disk, which by
then may even have recovered and relocated the bad sectors in
the background.
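This skip-ahead/fall-back policy might be sketched as follows (the threshold and helper names are invented for illustration, not ZFS code):

```python
# Illustrative DD-phase watchdog (invented threshold, not ZFS code):
# skip lagging/unreadable blocks for the later scrub to fix, and abandon
# the whole DD phase only when too much of the disk proves unreadable.
def dd_with_watchdog(read_block, nblocks, max_skipped_frac=0.25):
    """read_block(i) -> bytes, or None if the block lagged or failed."""
    skipped = []
    for i in range(nblocks):
        if read_block(i) is None:
            skipped.append(i)                  # defer to the selective scrub
            if len(skipped) > max_skipped_frac * nblocks:
                return None                    # fall back to plain resilver
    return skipped                             # DD phase succeeded
```

The key point is that isolated bad regions only grow the scrub's worklist, while a substantially unreadable disk demotes the whole operation to the existing resilver path.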
Even if the DD phase was aborted after a non-trivial amount
of copying, the scrub/resilver should, IMHO, also read from
the partially filled new disk and rewrite only those sectors
that actually require rewriting (especially important for
media sensitive to write-wearing).
* The overall (wallclock) length of this replacement is likely
to be greater than with the original method - about the same
time is needed for the scrub as for a resilver, plus the time
added by the DDing, possibly plagued by retries etc. on faulty
sectors. However, a milestone of relative data safety is
reached much sooner (if the source disk is substantially
readable).
* Errors (on the target disk) are expected during the
selective-scrub stage; they should be fixed quietly, without
bumping CKSUM error counters or causing other panicky clutter.
This is so far a relatively raw idea and I've probably missed
something. Do you think it is worth pursuing and asking some
zfs developers to make a POC? ;)
zfs-discuss mailing list