On 2013-01-19 18:17, Bob Friesenhahn wrote:
Resilver may in fact be just verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping.
Correction: that (verification) would be scrubbing ;)

The way I get it, resilvering is related to scrubbing but limited in impact, such that it "rebuilds" a particular top-level vdev (i.e. one of the component mirrors) that has an "assigned-bad" or new device. Both should walk the block-pointer tree from the uberblock (the current BP tree root) until they have read all the BP entries and validated the userdata against its checksums. But while scrub walks and verifies the whole pool and fixes discrepancies (logging checksum errors), resilver verifies a particular TLVdev (and maybe has a cut-off "earliest" TXG for disks which fell out of the pool and later returned into it, with a known latest TXG that is assumed valid on this disk), and the process expects there to be errors: it is intent on (partially) rewriting one of the devices in it. Hmmm... Maybe that's why there are no errors logged? I don't know :)

As for practice, I also have one Thumper that logs errors on a couple of drives upon every scrub. I think it was related to connectors; at least replugging the disks helped a lot (counts went from tens per scrub to 0-3). One of the original 250GB disks was replaced with a 3TB one, and a 250GB partition on it became part of the old pool (the remainder became a new test pool over a single device). Scrubbing the pools yields errors in that new 250GB partition, but never on the 2.75TB single-disk pool... so go figure :)

Overall, intermittent errors might be attributed to non-ECC RAM/CPUs (not our case), temperature affecting the mechanics and electronics (conditioned server room, so not our case), electric power variations and noise (other systems in the room on the same and other UPSes don't complain like this), and cable/connector/HBA degradation (oxidation, wear, etc., which is likely all that remains as a cause for us).
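Going back to the scrub vs. resilver distinction at the top of this reply, here is a toy Python sketch of how I understand it. It is only a model under my own assumptions, not how ZFS actually implements either operation: `BlockPointer`, `walk`, `scrub`, `resilver` and the `min_txg` cut-off are names made up for illustration, and a mirror is reduced to a dict of per-device copies.

```python
from dataclasses import dataclass, field
from hashlib import sha256


def checksum(data: bytes) -> str:
    # Stand-in for the checksum stored in a block pointer.
    return sha256(data).hexdigest()


@dataclass
class BlockPointer:
    # Hypothetical, heavily simplified block pointer for a 2-way mirror.
    birth_txg: int
    cksum: str
    copies: dict  # device name -> bytes held on that side of the mirror
    children: list = field(default_factory=list)


def walk(bp):
    # Depth-first walk of the block-pointer tree, starting from the root.
    yield bp
    for child in bp.children:
        yield from walk(child)


def scrub(root):
    # Verify every copy of every block, count checksum errors per device,
    # and repair a bad copy from a good one when a good one exists.
    errors = {}
    for bp in walk(root):
        good = next((d for d in bp.copies.values()
                     if checksum(d) == bp.cksum), None)
        for dev, data in list(bp.copies.items()):
            if checksum(data) != bp.cksum:
                errors[dev] = errors.get(dev, 0) + 1
                if good is not None:
                    bp.copies[dev] = good
    return errors


def resilver(root, target, min_txg=0):
    # Rewrite only the target device, and only for blocks born at or after
    # min_txg. Mismatches are expected here, so none are counted as errors.
    rewritten = 0
    for bp in walk(root):
        if bp.birth_txg < min_txg:
            continue  # assumed already valid on the returning disk
        good = next((d for dev, d in bp.copies.items()
                     if dev != target and checksum(d) == bp.cksum), None)
        if good is not None and bp.copies.get(target) != good:
            bp.copies[target] = good
            rewritten += 1
    return rewritten
```

In this model a resilver with a cut-off TXG leaves older blocks alone, so bit rot in an old block on the same disk would only show up (and be counted) on a later scrub.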
The Thumper example concerns its internal disks, so at least we can be certain that none of the problems come from further components that could break: external cables, disk trays, etc.

HTH,
//Jim
_______________________________________________
zfs-discuss mailing list
email@example.com
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss