For those who have work to do and can't be bothered to read the detailed
context, please do scroll down to the marked Applied question about
the possible project to implement a better on-disk layout of blocks.
Busy experts' opinions are highly regarded here. Thanks ;) //Jim
CONTEXT AND SPECULATION
Well, now that I've mostly completed building my tool to locate,
extract from disk, and verify the sectors belonging to any particular
block, I can state with certainty: data sector numbering is columnar,
as depicted in my recent mails (quoted below), not row-based as I had
believed earlier - which would have been more compact to store.
Columns do make a certain sense, but they also lead to more wasted
space than would otherwise be possible - and I'm not sure that row-based
allocation would really be slower to write or read, especially since
the HDD cache would coalesce requests to neighboring sectors, whether
they form a contiguous quarter of my block's physical data or a series
of every fourth sector of it. Rows would likely be more complex to code
and comprehend, and might even require more CPU cycles to account for
sizes properly (IF today we just quickly allocate columns of the same
size - I skimmed over vdev_raidz.c, but did not look into this detail).
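To make the two schemes concrete, here is a toy sketch (my own Python
illustration - the function names are mine, parity rotation is ignored,
and this is not the actual vdev_raidz.c logic):

```python
# Toy model of the two data-sector numbering schemes discussed above:
# 4 data disks, 14 data sectors; parity columns and rotation ignored.
# All names are illustrative, not taken from vdev_raidz.c.

def columnar_layout(nsectors, ndisks):
    """Columnar: fill each disk's column top-to-bottom before moving on.
    Returns a dict disk -> list of logical sector numbers (1-based)."""
    rows = -(-nsectors // ndisks)          # ceil division: column height
    layout = {d: [] for d in range(ndisks)}
    for s in range(nsectors):
        layout[s // rows].append(s + 1)
    return layout

def row_layout(nsectors, ndisks):
    """Row-major: sector i goes to disk i % ndisks (round-robin)."""
    layout = {d: [] for d in range(ndisks)}
    for s in range(nsectors):
        layout[s % ndisks].append(s + 1)
    return layout

# Columnar gives disk0=[1,2,3,4] ... disk3=[13,14] (matches the diagram
# below); row-major gives disk0=[1,5,9,13], disk1=[2,6,10,14], etc.
print(columnar_layout(14, 4))
print(row_layout(14, 4))
```

With row-major numbering the last, partial row simply stops early, which
is why it could be cut off more precisely than whole columns.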
Saving 1-2 sectors from allocations that are some 10-30 sectors long
altogether is, IMHO, a percentage of savings worth worrying and
bothering about, especially with the compression-related paradigm of
"our CPUs are slackers with nothing to do". ZFS overhead on 4K-sectored
disks is pretty "expensive" already, so I see little need to feed it
extra desserts too ;)
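The back-of-the-envelope arithmetic behind that "worthy percentage"
claim (a trivial sketch, numbers taken from the ranges above):

```python
# Percent of an allocation reclaimed by saving 1-2 sectors out of
# allocations some 10-30 sectors long (the figures quoted above).

def savings_pct(saved, total):
    """Percentage of `total` sectors reclaimed by saving `saved`."""
    return 100.0 * saved / total

for total in (10, 20, 30):
    print(total, [round(savings_pct(s, total), 1) for s in (1, 2)])
# Best case 2/10 = 20.0%, worst case 1/30 = ~3.3% - comparable to what
# lightweight compression buys on many datasets.
```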
If one were to implement a different sector allocator (rows with a more
precise cutoff vs. columns as they are today) and expose it as a ZFS
property settable by users (or testing developers), would it make
sense to call it a "compression" mode (in current terms) and use a bit
from that field? Or would the GRID bits be more properly used for this?
I am not sure that feature flags are the proper mechanism for this,
except to protect from import and interpretation of such "fixed"
datasets and pools on incompatible (older) implementations - the
allocation layout is likely going to be an attribute applied to each
block at write time and noted in the blkptr_t, like the checksums and
compression, but it would only apply to raidzN.
AFAIK, the contents of userdata sectors and their ordering don't even
matter to the ZFS layers until decompression - parities and checksums
just apply to the prepared bulk data...
On 2012-12-06 02:08, Jim Klimov wrote:
On 2012-12-05 05:52, Jim Klimov wrote:
For undersized allocations, i.e. of compressed data, it is possible
to see P-sizes not divisible by 4 (disks) in 4KB sectors; however,
some sectors do apparently get wasted, because the A-size in the DVA
is divisible by 6*4KB. With columnar allocation across the disks, it
is easier to see why full stripes have to be used:
p1 p2 d1 d2 d3 d4
. , 1 5 9 13
. , 2 6 10 14
. , 3 7 11 x
. , 4 8 12 x
In this illustration a 14-sector-long block is saved, with "x" marking
the empty leftovers, which we can't really save here (as we could with
the other, row-based allocation - which is likely less efficient for
CPU and I/Os).
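The waste shown by the "x" cells can be computed with a tiny sketch
(a toy model matching my diagram above, not the real allocator code):

```python
# Data sectors allocated but unused when each disk gets a full
# equal-height column, as in the p1/p2/d1..d4 diagram above.

def column_waste(nsectors, ndisks):
    """Toy model: sectors left empty ("x") in equal-height columns."""
    rows = -(-nsectors // ndisks)      # ceil: column height in sectors
    return rows * ndisks - nsectors    # allocated minus used

print(column_waste(14, 4))             # prints 2: the two "x" cells
```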
Getting more and more puzzled by this... I have seen DVA values
matching both theories now...
Interestingly, all the allocations I looked over involved a number
of sectors divisible by 3, rounding to half of my 6-disk RAID set -
is it merely a coincidence, or some means of balancing I/Os?
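One guess, hedged since I only skimmed vdev_raidz.c: if the allocator
rounds the total allocation up to a multiple of nparity+1 sectors (so
that leftover gaps on a metaslab remain usable by a minimal parity+data
unit), then on raidz2 every A-size would come out divisible by 3. A
sketch of that reading (illustrative Python, not the actual C code):

```python
# My reading of the raidz A-size computation, as a toy Python model:
# data sectors + one parity sector per (partial) row, rounded up to a
# multiple of nparity+1. Unverified against the real vdev_raidz.c.

def raidz_asize(psize_sectors, ndisks, nparity):
    """A-size in sectors for a block of psize_sectors of data."""
    ndata = ndisks - nparity
    # one set of parity sectors per (possibly partial) row of data
    asize = psize_sectors + nparity * ((psize_sectors + ndata - 1) // ndata)
    # round up to a multiple of nparity+1 so leftover gaps stay usable
    return (asize + nparity) // (nparity + 1) * (nparity + 1)

print(raidz_asize(14, 6, 2))           # prints 24
```

For the 14-sector example above this gives 14 data + 2*4 parity = 22
sectors, rounded up to 24 - divisible by both 3 and 6, matching the
observed A-sizes.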
I have not yet researched where exactly the "unused" sectors are
allocated - "vertically" on the last strip, as in my yesterday's
depiction quoted above, or "horizontally" across several disks -
but now that I know about this, it really bothers me as wasted
space with no apparent gain. I mean, the raidz code does tricks
to ensure that parities are located on different disks, so that in
normal conditions the userdata sector reads land on all disks
in a uniform manner. Why forfeit that natural "rotation" just
because P-sizes are smaller than a multiple of the number of
data disks?
In short: can someone explain the rationale - why are allocations
done the way they are now? And should this be discussed as a bug,
or rationalized as a feature?
zfs-discuss mailing list