[zfs-discuss] Digging in the bowels of ZFS

Jim Klimov Sat, 01 Dec 2012 20:43:01 -0800

My problematic home-NAS (that old list readers might still remember
about from a year-two ago) is back online, thanks to a friend who
fixed and turned it on. I'm going to try some more research on that
failure I had with the 6-disk raidz2 set, when it suddenly couldn't
read and recover some blocks (presumed scratched or otherwise damaged
on all disks at once while the heads hovered over similar locations).


My plan is to dig out the needed sectors of the broken block from
each of the 6 disks and try any and all reasonable recombinations
of redundancy and data sectors to try and match the checksum - this
should be my definite answer on whether ZFS (of that oi151.1.3-based
build) does all I think it can to save data or not. Either I put the
last nail into my itching question's coffin, or I'd nail a bug to
yell about ;)

So... here are some applied questions:

1) With DD I found the logical offset after which it fails to read
   data from a damaged file, because ZFS doesn't trust the block.
   That's 3840 sectors (@512b), or 0x1e0000. With ZDB I listed the
   file inode's block tree and got this, in particular:

# zdb -ddddd -bbbbbb -e 1601233584937321596/export/DUMP 27352564 \
  > /var/tmp/brokenfile.ZDBreport 2>&1
...
          1c0000   L0 DVA[0]=<0:acbc2a46000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4a00P (txg, cksum)

          1e0000   L0 DVA[0]=<0:acbc2a4f000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4c00P
birth=324721364L/324721364P fill=1
cksum=705e79361b8f028e:5a45c8f863a4035f:41b2961480304d7:be685ec248f00e78

          200000   L0 DVA[0]=<0:acbc2a58000:9000> [L0 ZFS plain file]
sha256 lzjb LE contiguous dedup single size=20000L/4c00P (txg, cksum)
...

   So... how DO I properly interpret this to select sector ranges to
   DD into my test area from each of the 6 disks in the raidz2 set?

   On one hand, the DVA states the block length is 0x9000, and this
   matches the offsets of neighboring blocks.

   On the other hand, compressed "physical" data size is 0x4c00 for
   this block, and ranges 0x4800-0x5000 for other blocks of the file.
   Even multiplied by 1.5 (for raidz2) this is about 0x7000 and way
   smaller than 0x9000. For uncompressed files I think I saw entries
   like "size=20000L/30000P", so I'm not sure even my multiplication
   by 1.5x above is valid, and the discrepancy between DVA size and
   interval, and "physical" allocation size reaches about 2x.

   So... again... how many sectors from each disk should I fetch for
   my research of this one block?

2) Do I understand correctly that for the offset definition, sectors
   in a top-level VDEV (which is all of my pool) are numbered in rows
   per-component disk? Like this:
         0  1  2  3  4  5
         6  7  8  9  10 11...

   That is, "offset % setsize = disknum"?

   If true, does such numbering scheme apply all over the TLVDEV,
   so as for my block on a 6-disk raidz2 disk set - its sectors
   start at (roughly rounded) "offset_from_DVA / 6" on each disk,
   right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of
   the first disk holding anything from this block would contain the
   raid-algo1 permutations of the four data sectors, sectors of
   the second disk contain the raid-algo2 for those 4 sectors,
   and the remaining 4 disks contain the data sectors?
   The redundancy algos should in fact cover other redundancy disks
   too (in order to sustain loss of any 2 disks), correct?

   Is it, in particular, true that the redundancy-protected stripes
   involve a single sector from each disk times the length of the
   block's portion on each disk (not that some, say, 32kb from one
   disk are wholly the redundancy for 4*32kb data from other disks)?

   I think this is what I hope to catch - if certain non-overlapping
   sectors got broken on each disk, but ZFS compares larger ranges
   to recover data - these are two work on very different data.

4) Where are the redundancy algorithms specified? Is there any simple
   tool that would recombine a given algo-N redundancy sector with
   some other 4 sectors from a 6-sector stripe in order to try and
   recalculate the sixth sector's contents? (Perhaps part of some
   unit tests?)

5) Is there any magic to the checksum algorithms? I.e. if I pass
   some 128KB block's logical (userdata) contents to the command-line
   "sha256" or "openssl sha256" - should I get the same checksum as
   ZFS provides and uses?

6) What exactly does a checksum apply to - the 128Kb userdata block
   or a 15-20Kb (lzjb-)compressed portion of data? I am sure it's the
   latter, but ask just in case I don't miss anything... :)

As I said, in the end I hope to have from-disk and guessed userdata
sectors, a gazillion or so for given logical offsets inside a 128K
userdata block, which I would then recombine and hash with sha256
to see if any combination yields the value saved in block pointer
and ZFS missed something, or if I don't get any such combo and ZFS
does what it should exhaustively and correctly, indeed ;)

Thanks a lot in advance for any info, ideas, insights,
and just for reading this long post to the end ;)
//Jim Klimov

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

[zfs-discuss] Digging in the bowels of ZFS

Reply via email to