On 2012-12-03 18:23, Jim Klimov wrote:
On 2012-12-02 05:42, Jim Klimov wrote:
So... here are some applied questions:

Well, I am ready to reply to a few of my own questions now :)

Continuing the desecration of my deceased files' resting grounds...

2) Do I understand correctly that, for the offset definition, sectors
    in a top-level VDEV (which is all of my pool) are numbered in rows
    across the component disks? Like this:
          0  1  2  3  4  5
          6  7  8  9  10 11...

    That is, "offset % setsize = disknum"?

    If true, does such a numbering scheme apply all over the TLVDEV,
    so that for my block on a 6-disk raidz2 set its sectors would
    start at (roughly) "offset_from_DVA / 6" on each disk, right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of
    the first disk holding anything from this block would contain the
    raid-algo1 parity over the four data sectors, the sectors of
    the second disk the raid-algo2 parity over those 4 sectors,
    and the remaining 4 disks the data sectors themselves?

My understanding was correct. For posterity: in the example set up
earlier I had an uncompressed 128KB block residing at the address
DVA[0]=<0:590002c1000:30000>. Counting in my disks' 4KB sectors,
this is 0x590002c1000/0x1000 = 0x590002C1, i.e. a logical offset of
1493172929 sectors into TLVDEV number 0 (the only one in this pool).

Given that this TLVDEV is a 6-disk raidz2 set, my expected offset
on each component drive is 1493172929/6 = 248862154 with a remainder
of 5, counted from after the ZFS header (two labels plus the boot
reservation, amounting to 4MB = 1024 4KB sectors). So this block's
allocation covers 8 4KB sectors per disk, starting at 248862154+1024
on disk 5 and at 248862155+1024 on disks 0,1,2,3,4.
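
For reference, here is the same arithmetic as a quick shell sketch
(ksh/bash arithmetic; the 1024-sector skip for the labels plus boot
reservation and the hard-coded 6 disks are just my pool's parameters
from above):

# OFF=0x590002c1000
# SECT=$(( OFF / 0x1000 ))
# echo "logical sector $SECT = $(( SECT / 6 )) per disk, remainder $(( SECT % 6 ))"
# echo "dd skip for disks 0-4: $(( SECT / 6 + 1 + 1024 ))"
# echo "dd skip for disk 5:    $(( SECT / 6 + 1024 ))"

This prints 1493172929, 248862154, 5, 248863179 and 248863178,
matching the skip= values used in the dd commands below.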

As my further tests showed, the sector-columns (not rows, as I had
expected after reading the docs) from disks 1,2,3,4 do recombine into
the original userdata (the sha256 checksum matches), so disks 5 and 0
should hold the two parities - however those are calculated:

# for D in 1 2 3 4; do dd bs=4096 count=8 conv=noerror,sync \
  if=/dev/dsk/c7t${D}d0s0 of=b1d${D}.img skip=248863179; done

# for D in 1 2 3 4; do for R in 0 1 2 3 4 5 6 7; do \
  dd if=/pool/test3/b1d${D}.img bs=4096 skip=$R count=1; \
  done; done > /tmp/d

Note that the latter can be greatly simplified to a "cat", which
works to the same effect and is faster:
# cat /pool/test3/b1d?.img > /tmp/d
However, I left the "difficult" notation in place to use in later
experiments.
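
For completeness, the checksum match mentioned above can be verified
along these lines - a sketch assuming the Solaris digest(1) utility;
the value to compare against is the cksum= that zdb printed for this
block earlier:

# digest -a sha256 /tmp/d

Since the four 32KB column images add up to exactly the 128KB logical
block, no trimming or padding is needed before checksumming.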

That is, the original 128KB block was cut into 4 pieces (one per
data drive in the 6-disk raidz2 set), and each 32KB strip was stored
contiguously on a separate drive. Nice descriptive pictures in some
presentations had suggested to me that the original block is stored
sector by sector, rotating onto the next disk - a set of 4 data
sectors plus 2 parity sectors in my case forming a single stripe for
RAID purposes. That would directly imply that such incomplete
"stripes", at the ends of files or for whole small files, would still
carry the two parity sectors plus just a handful of data sectors.

Reality differs.

For undersized allocations, i.e. of compressed data, it is possible
to see P-sizes whose 4KB-sector count is not divisible by 4 (the
number of data disks); however, some sectors apparently do get wasted,
because the A-size in the DVA is divisible by 6*4KB. With the columnar
allocation across the disks it is easier to see why full stripes have
to be used:

p1 p2 d1 d2 d3 d4
.  ,  1  5  9  13
.  ,  2  6  10 14
.  ,  3  7  11 x
.  ,  4  8  12 x

In this illustration a 14-sector-long block is saved, with "x"
marking the empty leftover sectors, which we can't really save on
(as we could with the other, row-by-row allocation - which in turn
is likely less efficient in terms of CPU and IOs).
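
A back-of-the-envelope check of this picture, using my whole-rows
reading of the layout rather than the actual allocator code (the
variable names are just for illustration):

# D=14 COLS=4 P=2
# ROWS=$(( (D + COLS - 1) / COLS ))
# echo "$(( ROWS * (COLS + P) )) sectors allocated for $D data sectors"

This prints 24, i.e. an A-size of 96KB for a P-size of 56KB in this
example - consistent with the A-sizes divisible by 6*4KB noted above.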

The metadata blocks do at least have A-sizes of 0x3000 (2 parity +
1 data), which on 4KB-sectored disks is still quite a lot for these
miniature data objects - but not as sad as 6*4KB would have been ;)

It also seems that the instinctive desire to have raidzN sets of
4*M+N disks (i.e. 6-disk raidz2, 11-disk raidz3, etc.), which was
discussed over and over on this list a couple of years ago, may
still be valid while typical block sizes are powers of two...
even though gurus said that this should not matter much.
For IOPS - maybe not. For wasted space - likely...
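
To put a rough number on the wasted-space guess (same simplistic
whole-rows model as above; a 128KB block is 32 4KB data sectors,
compared here across 4 versus 5 data columns under raidz2):

# for COLS in 4 5; do ROWS=$(( (32 + COLS - 1) / COLS )); \
  echo "$COLS data disks: $(( ROWS * COLS - 32 )) padding, $(( ROWS * 2 )) parity sectors"; \
  done

With 4 data disks there is no padding at all (8 full rows), while
with 5 data disks 3 sectors per such block go to padding - at least
if the whole-row assumption holds.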


I'm almost ready to go and test Q2 and Q3; however, the questions
regarding usable tools (and "what data should be fed into such
tools?") are still on the table.

> Some OLD questions remain raised, just in case anyone answers them.

>> 3b) The redundancy algos should in fact cover other redundancy disks
>>     too (in order to sustain loss of any 2 disks), correct? (...)
>>
>> 4) Where are the redundancy algorithms specified? Is there any simple
>>     tool that would recombine a given algo-N redundancy sector with
>>     some other 4 sectors from a 6-sector stripe in order to try and
>>     recalculate the sixth sector's contents? (Perhaps part of some
>>     unit tests?)

> 7) Is there a command-line tool to do lzjb compressions and
> decompressions (in the same blocky manner as would be applicable
> to ZFS compression)?
>
> I've also tried to gzip-compress the original 128KB file, but
> none of the compressed results (with varying gzip levels) yielded
> a checksum matching the ZFS block's.
> Zero-padding to 10240 bytes (psize=0x2800) did not help.
>
>
> 8) When should the decompression stop - as soon as it has extracted
> the logical-size number of bytes (i.e. 0x20000)?
>
>
> 9) Physical sizes magically seem to always be in whole 512b units...
> I doubt that the compressed data would always end at such a boundary.
>
> How many bytes should be covered by a checksum?
> Are the 512b blocks involved zero-padded at ends (on disk and/or RAM)?
>


Thanks a lot in advance for any info, ideas, insights,
and just for reading this long post to the end ;)
//Jim Klimov

Ditto =)
//Jim