My problematic home NAS (which long-time list readers might still
remember from a year or two ago) is back online, thanks to a friend
who fixed it and powered it up. I'm going to do some more research
on that failure I had with the 6-disk raidz2 set, when it suddenly
couldn't read and recover some blocks (presumably scratched or
otherwise damaged on all disks at once while the heads hovered over
similar locations).
My plan is to dig the needed sectors of the broken block out of each
of the 6 disks and try any and all reasonable recombinations of
redundancy and data sectors to try and match the checksum. This
should give me a definitive answer on whether ZFS (of that
oi151.1.3-based build) does all I think it can to save data, or not.
Either I put the last nail into my itching question's coffin, or
I nail down a bug to yell about ;)
So... here are some applied questions:
1) With dd I found the logical offset past which reads of the damaged
file fail, because ZFS doesn't trust the block. That's 3840 sectors
(@512b), i.e. 0x1e0000. With zdb I listed the file inode's block
tree and got this excerpt in particular:
# zdb -ddddd -bbbbbb -e 1601233584937321596/export/DUMP 27352564 \
> /var/tmp/brokenfile.ZDBreport 2>&1
1c0000 L0 DVA=<0:acbc2a46000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4a00P (txg, cksum)
1e0000 L0 DVA=<0:acbc2a4f000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4c00P
200000 L0 DVA=<0:acbc2a58000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4c00P (txg, cksum)
So... how DO I properly interpret this to select the sector ranges
to dd into my test area from each of the 6 disks in the raidz2 set?
On one hand, the DVA states the block's allocated length is 0x9000,
and this matches the offsets of the neighboring blocks.
On the other hand, the compressed "physical" data size is 0x4c00 for
this block, and ranges over 0x4800-0x5000 for the file's other blocks.
Even multiplied by 1.5 (for raidz2 over 6 disks: 4 data columns plus
2 parity) that is only about 0x7200, way smaller than 0x9000. For
uncompressed files I think I saw entries like "size=20000L/30000P",
so I'm not even sure my 1.5x multiplication above is valid, and the
discrepancy between the DVA size/interval and the "physical"
allocation size reaches about 2x.
So... again... how many sectors should I fetch from each disk for
my research of this one block?
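As a starting point, here is my own transcription of the column layout
logic from vdev_raidz_map_alloc() in vdev_raidz.c, as a Python sketch.
This assumes ashift=9, no gang blocks, and that the compressed psize is
the relevant I/O size; the variable names follow my reading of the code,
so please correct me if I misread it:

```python
# Sketch (my transcription, not the real code) of the raidz column
# mapping from vdev_raidz_map_alloc() in vdev_raidz.c; assumes ashift=9.
ASHIFT = 9

def raidz_map(io_offset, psize, dcols, nparity):
    """Yield (devidx, byte_offset_on_leaf, nbytes) per column for one block.

    io_offset is the DVA offset; psize is the compressed "physical" size.
    The first `nparity` columns yielded are the parity columns.
    """
    b = io_offset >> ASHIFT            # first sector within the top-level vdev
    s = psize >> ASHIFT                # number of data sectors
    f = b % dcols                      # column holding the first sector
    o = (b // dcols) << ASHIFT         # starting byte offset on each leaf
    q, r = divmod(s, dcols - nparity)  # full rows, plus r leftover sectors
    bc = 0 if r == 0 else r + nparity  # columns that get one extra sector
    acols = dcols if q > 0 else bc     # actual number of columns used
    for c in range(acols):
        col, coff = f + c, o
        if col >= dcols:               # wrap around to column 0, next row
            col -= dcols
            coff += 1 << ASHIFT
        nbytes = (q + (1 if c < bc else 0)) << ASHIFT
        yield col, coff, nbytes

# The damaged block from the zdb output above (add the 4 MB front-label
# area, 0x400000, to each leaf offset to get the raw on-disk address):
for col, coff, nbytes in raidz_map(0xacbc2a46000, 0x4c00, 6, 2):
    print(f"disk {col}: leaf offset 0x{coff:x}, 0x{nbytes:x} bytes")
```

If I transcribed this right, the per-column sizes sum to 0x7400 for this
block, which still doesn't explain the 0x9000 in the DVA, so part of
question 1 stands either way.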
2) Do I understand correctly that, for the purposes of offset
calculation, sectors in a top-level VDEV (which is all of my pool)
are numbered in rows across the component disks? Like this:
0 1 2 3 4 5
6 7 8 9 10 11...
That is, "sector_number % ndisks = disknum"?
If true, does such a numbering scheme apply across the whole TLVDEV,
so that for my block on the 6-disk raidz2 set, its sectors start
at (roughly rounded) "offset_from_DVA / 6" on each disk?
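To make that concrete, this is just my question-2 hypothesis spelled out
for the block at hand (ashift=9, i.e. 512-byte allocation units, is an
assumption here):

```python
# My numbering hypothesis: sectors counted row-major across the 6 disks,
# so the DVA offset picks both a starting column and a per-disk row.
ndisks = 6
dva_offset = 0xacbc2a46000
first_sector = dva_offset >> 9             # sector index within the TLVDEV
print("starting disk (column):", first_sector % ndisks)
print("row offset on each disk: 0x%x" % ((first_sector // ndisks) << 9))
```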
3) Then, if I read the ZFS on-disk spec correctly, the sectors of
the first disk holding anything from this block would contain the
raid-algo1 parity of the four data sectors, the sectors of the
second disk would contain the raid-algo2 parity for those 4
sectors, and the remaining 4 disks would contain the data sectors?
The redundancy algos should in fact cover the other redundancy
disk too (in order to sustain the loss of any 2 disks), correct?
In particular, is it true that the redundancy-protected stripes
each involve a single sector from every disk, repeated for the
length of the block's portion on each disk (rather than, say, a
whole 32kb from one disk being the redundancy for 4*32kb of data
from the other disks)?
I think this is what I hope to catch: if certain non-overlapping
sectors got broken on each disk, but ZFS compares larger ranges
when trying to recover data, then the two approaches would work
on very different data.
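For reference, here is my reading of the parity math in vdev_raidz.c (P
is a plain XOR across the data columns; Q accumulates in GF(2^8) with
the 0x11d polynomial), as a byte-at-a-time Python sketch. The function
names are mine and the simplification may hide details of the real code:

```python
def gf_mul2(x):
    """Multiply by 2 in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11d)."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x & 0xff

def raidz2_parity(data_columns):
    """data_columns: equal-length byte strings, in on-disk column order.
    Returns the (P, Q) parity sectors: P is the XOR of all data bytes;
    Q folds each new column in after a GF(2^8) multiply-by-2."""
    p = bytearray(len(data_columns[0]))
    q = bytearray(len(data_columns[0]))
    for d in data_columns:                 # column order matters for Q
        for i, byte in enumerate(d):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)
```

If this is right, each byte position forms an independent 6-byte stripe
(4 data + P + Q), which is exactly the property I'm asking about above.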
4) Where are the redundancy algorithms specified? Is there any simple
tool that would recombine a given algo-N redundancy sector with
some other 4 sectors from a 6-sector stripe in order to try and
recalculate the sixth sector's contents? (Perhaps as part of some
existing toolset?)
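Lacking a ready-made tool, the single-failure case at least seems
trivial to script; here is a hypothetical helper (my own, not from any
ZFS codebase) that rebuilds one lost data sector from the P parity and
the surviving data sectors. Double failures need the GF(2^8) algebra
from the vdev_raidz.c reconstruction routines, which I haven't
transcribed:

```python
def rebuild_from_p(p_sector, surviving_data):
    """Recover a single missing data sector: it is the XOR of the P
    parity with all surviving data sectors (single-failure case only)."""
    out = bytearray(p_sector)
    for d in surviving_data:
        for i, byte in enumerate(d):
            out[i] ^= byte
    return bytes(out)
```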
5) Is there any magic to the checksum algorithms? I.e., if I pass
some 128KB block's logical (userdata) contents to the command-line
"sha256" or "openssl sha256" programs, should I get the same
checksum as ZFS computes and stores?
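My current understanding (from reading zio_checksum_SHA256, so please
correct me) is that ZFS stores the plain SHA-256 digest of the checksummed
bytes, just packed as four big-endian 64-bit words in the block pointer.
If so, a command-line digest of the right byte range should match, modulo
formatting:

```python
import hashlib
import struct

def zfs_style_sha256(buf):
    """SHA-256 digest repacked as the four 64-bit big-endian words that
    (as I understand it) zdb shows for a block pointer's checksum."""
    digest = hashlib.sha256(buf).digest()
    return struct.unpack('>4Q', digest)

words = zfs_style_sha256(b'example block contents')
print(':'.join('%016x' % w for w in words))
```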
6) What exactly does the checksum apply to: the 128KB userdata block,
or the 15-20KB (lzjb-)compressed portion of data? I am sure it's
the latter, but I'm asking just in case I'm missing something... :)
As I said, in the end I hope to have from-disk and guessed userdata
sectors (a gazillion or so for the given logical offsets inside a
128KB userdata block), which I would then recombine and hash with
sha256 to see whether some combination yields the value saved in the
block pointer and ZFS missed something, or no such combo exists and
ZFS indeed does what it should, exhaustively and correctly ;)
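The recombination loop itself should be simple; a sketch, under the
assumption (per question 6) that the checksummed input is the reassembled
physical bytes. The candidate lists, slot layout and helper name are all
hypothetical here:

```python
import hashlib
import itertools

def find_matching_combo(candidates_per_slot, expected_digest):
    """Try every combination of candidate sector contents, one per slot,
    and return the first whose concatenation hashes to the block
    pointer's SHA-256 value, or None if no combination matches."""
    for combo in itertools.product(*candidates_per_slot):
        if hashlib.sha256(b''.join(combo)).digest() == expected_digest:
            return combo
    return None
```

With a handful of candidate readings for each of a few suspect sectors
this stays tractable, though the combinatorics explode quickly beyond that.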
Thanks a lot in advance for any info, ideas, insights,
and just for reading this long post to the end ;)
zfs-discuss mailing list