On 2013-01-19 18:17, Bob Friesenhahn wrote:
Resilver may in fact be just verifying that the pool disks are coherent
via metadata. This might happen if the fiber channel is flapping.
Correction: that (verification) would be scrubbing ;)
The way I get it, resilvering is related to scrubbing but limited
in impact such that it "rebuilds" a particular top-level vdev (i.e.
one of the component mirrors) with an "assigned-bad" and new device.
So they both should walk the block-pointer tree from the uberblock
(current BP tree root) until they ultimately read all the BP entries
and validate the userdata with checksums. But while scrub walks and
verifies the whole pool and fixes discrepancies (logging checksum
errors), the resilver verifies a particular TLVdev (and maybe has
a cut-off "earliest" TXG for disks which fell out of the pool and
later returned into it - with a known latest TXT that is assumed
valid on this disk) and the process expects there to be errors -
it is intent on (partially) rewriting one of the devices in it.
Hmmm... Maybe that's why there are no errors logged? I don't know :)
As for practice, I also have one Thumper that logs errors on a
couple of drives upon every scrub. I think it was related to
connectors, at least replugging the disks helped a lot (counts
went from tens per scrub to 0-3). One of the original 250Gb disks
was replaced with a 3Tb one and a 250Gb partition became part of
the old pool (the remainder became a new test pool over a single
device). Scrubbing the pools yields errors in those new 250Gb,
but never on the 2.75Tb single-disk pool... so go figure :)
Overall, intermittent errors might be attibuted to non-ECC RAM/CPUs
(not our case), temperature affecting the mechanics and electronics
(conditioned server room - not our case), electric power variations
and noise (other systems in the room on the same and other UPSes
don't complain like this), and cable/connector/HBA degradation
(oxydization, wear, etc. - likely all that remains for our causes).
This example regards internal disks of the Thumper, so at least we
are certain to attribute no problems related to further breakage
components - external cables, disk trays, etc...
zfs-discuss mailing list