My problematic home NAS (which long-time list readers might still
remember from a year or two ago) is back online, thanks to a friend
who fixed it and powered it up. I'm going to do some more research
on that failure I had with the 6-disk raidz2 set, when it suddenly
couldn't read and recover some blocks (presumably scratched or
otherwise damaged on all disks at once while the heads hovered over
similar locations).
My plan is to dig the needed sectors of the broken block out of each
of the 6 disks and try any and all reasonable recombinations of
redundancy and data sectors to try and match the checksum. This
should give me a definitive answer on whether ZFS (of that
oi151.1.3-based build) does all I think it can to save data, or not.
Either I put the last nail into my itching question's coffin, or
I nail down a bug to yell about ;)
So... here are some applied questions:
1) With dd I found the logical offset past which reads of the damaged
file fail, because ZFS doesn't trust the block. That's 3840 sectors
(@512b), i.e. 0x1e0000. With zdb I listed the file inode's block
tree and got this excerpt in particular:
# zdb -ddddd -bbbbbb -e 1601233584937321596/export/DUMP 27352564 \
> /var/tmp/brokenfile.ZDBreport 2>&1
1c0000 L0 DVA=<0:acbc2a46000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4a00P (txg, cksum)
1e0000 L0 DVA=<0:acbc2a4f000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4c00P
200000 L0 DVA=<0:acbc2a58000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4c00P (txg, cksum)
So... how DO I properly interpret this to select the sector ranges
to dd into my test area from each of the 6 disks in the raidz2 set?
On one hand, the DVA states the block's allocated length is 0x9000,
and this matches the offsets of the neighboring blocks.
On the other hand, the compressed "physical" data size is 0x4c00 for
this block, and ranges over 0x4800-0x5000 for the file's other blocks.
Even multiplied by 1.5 (for raidz2 over 6 disks: 4 data columns plus
2 parity) that is only about 0x7200, way smaller than 0x9000. For
uncompressed files I think I saw entries like "size=20000L/30000P",
so I'm not even sure my 1.5x multiplication above is valid, and the
discrepancy between the DVA size/interval and the "physical"
allocation size reaches about 2x.
So... again... how many sectors should I fetch from each disk for
my research of this one block?
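As a starting point, here is my own transcription of the column layout
logic from vdev_raidz_map_alloc() in vdev_raidz.c, as a Python sketch.
This assumes ashift=9, no gang blocks, and that the compressed psize is
the relevant I/O size; the variable names follow my reading of the code,
so please correct me if I misread it:

```python
# Sketch (my transcription, not the real code) of the raidz column
# mapping from vdev_raidz_map_alloc() in vdev_raidz.c; assumes ashift=9.
ASHIFT = 9

def raidz_map(io_offset, psize, dcols, nparity):
    """Yield (devidx, byte_offset_on_leaf, nbytes) per column for one block.

    io_offset is the DVA offset; psize is the compressed "physical" size.
    The first `nparity` columns yielded are the parity columns.
    """
    b = io_offset >> ASHIFT            # first sector within the top-level vdev
    s = psize >> ASHIFT                # number of data sectors
    f = b % dcols                      # column holding the first sector
    o = (b // dcols) << ASHIFT         # starting byte offset on each leaf
    q, r = divmod(s, dcols - nparity)  # full rows, plus r leftover sectors
    bc = 0 if r == 0 else r + nparity  # columns that get one extra sector
    acols = dcols if q > 0 else bc     # actual number of columns used
    for c in range(acols):
        col, coff = f + c, o
        if col >= dcols:               # wrap around to column 0, next row
            col -= dcols
            coff += 1 << ASHIFT
        nbytes = (q + (1 if c < bc else 0)) << ASHIFT
        yield col, coff, nbytes

# The damaged block from the zdb output above (add the 4 MB front-label
# area, 0x400000, to each leaf offset to get the raw on-disk address):
for col, coff, nbytes in raidz_map(0xacbc2a46000, 0x4c00, 6, 2):
    print(f"disk {col}: leaf offset 0x{coff:x}, 0x{nbytes:x} bytes")
```

If I transcribed this right, the per-column sizes sum to 0x7400 for this
block, which still doesn't explain the 0x9000 in the DVA, so part of
question 1 stands either way.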
2) Do I understand correctly that, for the purposes of offset
calculation, sectors in a top-level VDEV (which is all of my pool)
are numbered in rows across the component disks? Like this:
0 1 2 3 4 5
6 7 8 9 10 11...
That is, "sector_number % ndisks = disknum"?
If true, does such a numbering scheme apply across the whole TLVDEV,
so that for my block on the 6-disk raidz2 set, its sectors start
at (roughly rounded) "offset_from_DVA / 6" on each disk?
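To make that concrete, this is just my question-2 hypothesis spelled out
for the block at hand (ashift=9, i.e. 512-byte allocation units, is an
assumption here):

```python
# My numbering hypothesis: sectors counted row-major across the 6 disks,
# so the DVA offset picks both a starting column and a per-disk row.
ndisks = 6
dva_offset = 0xacbc2a46000
first_sector = dva_offset >> 9             # sector index within the TLVDEV
print("starting disk (column):", first_sector % ndisks)
print("row offset on each disk: 0x%x" % ((first_sector // ndisks) << 9))
```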
3) Then, if I read the ZFS on-disk spec correctly, the sectors of
the first disk holding anything from this block would contain the
raid-algo1 parity of the four data sectors, the sectors of the
second disk would contain the raid-algo2 parity for those 4
sectors, and the remaining 4 disks would contain the data sectors?
The redundancy algos should in fact cover the other redundancy
disk too (in order to sustain the loss of any 2 disks), correct?
In particular, is it true that the redundancy-protected stripes
each involve a single sector from every disk, repeated for the
length of the block's portion on each disk (rather than, say, a
whole 32kb from one disk being the redundancy for 4*32kb of data
from the other disks)?
I think this is what I hope to catch: if certain non-overlapping
sectors got broken on each disk, but ZFS compares larger ranges
when trying to recover data, then the two approaches would work
on very different data.
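For reference, here is my reading of the parity math in vdev_raidz.c (P
is a plain XOR across the data columns; Q accumulates in GF(2^8) with
the 0x11d polynomial), as a byte-at-a-time Python sketch. The function
names are mine and the simplification may hide details of the real code:

```python
def gf_mul2(x):
    """Multiply by 2 in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11d)."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x & 0xff

def raidz2_parity(data_columns):
    """data_columns: equal-length byte strings, in on-disk column order.
    Returns the (P, Q) parity sectors: P is the XOR of all data bytes;
    Q folds each new column in after a GF(2^8) multiply-by-2."""
    p = bytearray(len(data_columns[0]))
    q = bytearray(len(data_columns[0]))
    for d in data_columns:                 # column order matters for Q
        for i, byte in enumerate(d):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)
```

If this is right, each byte position forms an independent 6-byte stripe
(4 data + P + Q), which is exactly the property I'm asking about above.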
4) Where are the redundancy algorithms specified? Is there any simple
tool that would recombine a given algo-N redundancy sector with
some other 4 sectors from a 6-sector stripe in order to try and
recalculate the sixth sector's contents? (Perhaps as part of some
existing toolset?)
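Lacking a ready-made tool, the single-failure case at least seems
trivial to script; here is a hypothetical helper (my own, not from any
ZFS codebase) that rebuilds one lost data sector from the P parity and
the surviving data sectors. Double failures need the GF(2^8) algebra
from the vdev_raidz.c reconstruction routines, which I haven't
transcribed:

```python
def rebuild_from_p(p_sector, surviving_data):
    """Recover a single missing data sector: it is the XOR of the P
    parity with all surviving data sectors (single-failure case only)."""
    out = bytearray(p_sector)
    for d in surviving_data:
        for i, byte in enumerate(d):
            out[i] ^= byte
    return bytes(out)
```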
5) Is there any magic to the checksum algorithms? I.e., if I pass
some 128KB block's logical (userdata) contents to the command-line
"sha256" or "openssl sha256" programs, should I get the same
checksum as ZFS computes and stores?
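My current understanding (from reading zio_checksum_SHA256, so please
correct me) is that ZFS stores the plain SHA-256 digest of the checksummed
bytes, just packed as four big-endian 64-bit words in the block pointer.
If so, a command-line digest of the right byte range should match, modulo
formatting:

```python
import hashlib
import struct

def zfs_style_sha256(buf):
    """SHA-256 digest repacked as the four 64-bit big-endian words that
    (as I understand it) zdb shows for a block pointer's checksum."""
    digest = hashlib.sha256(buf).digest()
    return struct.unpack('>4Q', digest)

words = zfs_style_sha256(b'example block contents')
print(':'.join('%016x' % w for w in words))
```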
6) What exactly does the checksum apply to: the 128KB userdata block,
or the 15-20KB (lzjb-)compressed portion of data? I am sure it's
the latter, but I'm asking just in case I'm missing something... :)
As I said, in the end I hope to have from-disk and guessed userdata
sectors (a gazillion or so for the given logical offsets inside a
128KB userdata block), which I would then recombine and hash with
sha256 to see whether some combination yields the value saved in the
block pointer and ZFS missed something, or no such combo exists and
ZFS indeed does what it should, exhaustively and correctly ;)
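The recombination loop itself should be simple; a sketch, under the
assumption (per question 6) that the checksummed input is the reassembled
physical bytes. The candidate lists, slot layout and helper name are all
hypothetical here:

```python
import hashlib
import itertools

def find_matching_combo(candidates_per_slot, expected_digest):
    """Try every combination of candidate sector contents, one per slot,
    and return the first whose concatenation hashes to the block
    pointer's SHA-256 value, or None if no combination matches."""
    for combo in itertools.product(*candidates_per_slot):
        if hashlib.sha256(b''.join(combo)).digest() == expected_digest:
            return combo
    return None
```

With a handful of candidate readings for each of a few suspect sectors
this stays tractable, though the combinatorics explode quickly beyond that.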
Thanks a lot in advance for any info, ideas, insights,
and just for reading this long post to the end ;)
zfs-discuss mailing list