2011-10-08 7:25, Daniel Carosone wrote:
> On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
>> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
>>
>>> What is going on? Is there really that much metadata overhead?  How
>>> many metadata blocks are needed for each 8k vol block, and are they
>>> each really only holding 512 bytes of metadata in a 4k allocation?
>>> Can they not be packed appropriately for the ashift?
>> Doesn't matter how small metadata compresses, the minimum size you can write
>> is 4KB.
> This isn't about whether the metadata compresses, this is about
> whether ZFS is smart enough to use all the space in a 4k block for
> metadata, rather than assuming it can fit at best 512 bytes,
> regardless of ashift.  By packing, I meant packing them full rather
> than leaving them mostly empty and wasted (or anything to do with
> compression).

Compression or packing alone won't cut it, I think. At least, that's
why I abandoned my first suggested solution in that bug tracker and
proposed another. Basically, my first idea was borrowed from the ATM
protocol, where fixed-size (small) "cells" make up "frames", which are
whole units sent over the wire with a common header, checksum, etc.
Likewise, I proposed that 4KB on-disk blocks (ashift=12) be regarded
as being made up of 8 (or more) 512-byte "cells", each holding a
portion of metadata. A major downside of such a solution is that it
would introduce on-disk incompatibility with other implementations of
ZFS, in terms of how the data is laid out and interpreted by code.
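To make the cell idea concrete, here is a rough sketch in Python (all
names and sizes are made up for illustration; this is not actual ZFS
code or on-disk format):

# Sketch of the "cells in frames" idea: a 4 KB physical block
# (ashift=12) carved into eight 512-byte cells, each holding one
# small metadata record. All names and sizes are hypothetical.

SECTOR = 512                        # logical cell size
FRAME = 4096                        # physical block size at ashift=12
CELLS_PER_FRAME = FRAME // SECTOR   # 8 cells per 4 KB frame

def pack_frame(records):
    """Pack up to 8 metadata records (each <= 512 bytes) into one
    4 KB frame, padding each record out to its 512-byte cell."""
    assert len(records) <= CELLS_PER_FRAME
    frame = bytearray(FRAME)
    for i, rec in enumerate(records):
        assert len(rec) <= SECTOR
        frame[i * SECTOR : i * SECTOR + len(rec)] = rec
    return bytes(frame)

def unpack_cell(frame, index):
    """Return the raw 512-byte cell at the given index."""
    return frame[index * SECTOR : (index + 1) * SECTOR]

# One 4 KB write carries 8 metadata records instead of wasting
# 7/8 of the block on a single 512-byte record.
frame = pack_frame([b'meta-record-%d' % i for i in range(8)])
print(len(frame), unpack_cell(frame, 3)[:13])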

Thus I proposed the second idea: a code-only solution to optimize
performance (force user-configured minimal data block sizes and
physical alignments), where metadata blocks would remain 512 bytes
because the pool is formally ashift=9, and the on-disk data would
stay compatible with other pools and OSes that support ZFS.
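Roughly, the policy I have in mind could look like this (a
hypothetical sketch; the names and placement logic are invented, and
real ZFS allocation is far more involved):

# Sketch of the second, code-only idea: the pool stays formally
# ashift=9 (512-byte blocks, compatible on-disk format), but the
# allocator is steered to give *data* blocks 4 KB-aligned offsets
# and 4 KB-multiple sizes, so a 4 KB-sector drive never has to do
# a read-modify-write for data. Names and policy are hypothetical.

ASHIFT = 9                     # formal pool block size: 512 bytes
MIN_DATA_BLOCK = 4096          # user-configured floor for data blocks

def place_block(offset, size, is_metadata):
    """Return an (offset, size) placement for a block.

    Metadata keeps the native 512-byte granularity; data is padded
    to the 4 KB minimum and pushed to the next 4 KB boundary."""
    if is_metadata:
        return offset, size    # 512-byte granularity is fine
    # round the offset up to a 4 KB boundary
    aligned = (offset + MIN_DATA_BLOCK - 1) & ~(MIN_DATA_BLOCK - 1)
    # round the size up to a multiple of 4 KB
    padded = (size + MIN_DATA_BLOCK - 1) & ~(MIN_DATA_BLOCK - 1)
    return aligned, padded

print(place_block(1536, 300, is_metadata=True))   # (1536, 300)
print(place_block(1536, 300, is_metadata=False))  # (4096, 4096)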

As far as I understand, each 512-byte block (on an ashift=9 pool) was
already too big for a single "quantum" of metadata (which apparently
ranges around 200-300 bytes, according to "zdb -DD").

For performance reasons, at least, each metadata block is addressed
as an individual block in the ZFS tree of blocks (roughly: rooted at
the uberblock, branching at metadata blocks, and leafing at data
blocks). Upon every change of data (and TXG sync), the whole branch
of metadata blocks leading up to the uberblock has to be updated,
and these blocks are written anew into empty (not-yet-assigned)
space on the pool, since ZFS COW never overwrites live data.
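A toy model of that COW behavior, just to show the shape of the
argument (not ZFS code): changing one data leaf forces fresh copies
of every metadata block on the path up to the root, while untouched
subtrees stay shared:

# Toy copy-on-write tree: changing one data leaf allocates a new
# leaf *and* new copies of every metadata (indirect) block on the
# path up to the uberblock. Purely illustrative, not ZFS code.

class Node:
    def __init__(self, children=None, data=None):
        self.children = children or []
        self.data = data

def cow_update(root, path, new_data):
    """Return a new root with the leaf at `path` replaced.

    Every node along the path is copied; untouched subtrees are
    shared with the old tree (live data is never overwritten)."""
    if not path:
        return Node(data=new_data)       # new leaf block
    index = path[0]
    kids = list(root.children)           # copy this metadata block
    kids[index] = cow_update(kids[index], path[1:], new_data)
    return Node(children=kids)

old = Node(children=[Node(children=[Node(data=b'a'), Node(data=b'b')]),
                     Node(children=[Node(data=b'c')])])
new = cow_update(old, [0, 1], b'B')
# One leaf changed, so two interior blocks were rewritten as well;
# the [1] subtree is shared between the old and new trees.
print(new.children[1] is old.children[1])    # True
print(new.children[0] is old.children[0])    # False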

On the one hand, it does not seem like a problem to coalesce writes
of 8 metadata blocks into 4KB portions, in code only, so that new
4KB-sector-aware ZFS code would perform well on newer HDDs and waste
less space than it does now. On the other hand, I do not know how
long the tree branches are; perhaps any change of a data block
already produces enough changed metadata to fill up a 4KB block, or
several, or a large portion of one, so in practice the coalescer
would rarely have to wait for a chance to batch several metadata
blocks together. Either way, I think my second solution is viable.
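Such coalescing might be sketched like this (entirely hypothetical
buffering logic; ZFS's real vdev/ZIO layers differ, and the
contiguous-allocation assumption is mine):

# Sketch of coalescing 512-byte metadata writes into 4 KB physical
# I/Os, so an ashift=9 pool performs well on a 4 KB-sector drive.
# Entirely hypothetical; ZFS's real vdev/ZIO layers differ.

SECTOR = 512
PHYS = 4096
PER_IO = PHYS // SECTOR        # 8 metadata blocks per physical write

class Coalescer:
    def __init__(self, device_write):
        self.pending = []          # (logical_block_no, 512B payload)
        self.device_write = device_write

    def write_meta(self, blkno, payload):
        assert len(payload) == SECTOR
        self.pending.append((blkno, payload))
        if len(self.pending) == PER_IO:
            self.flush()

    def flush(self):
        """Emit buffered metadata as one aligned 4 KB write.

        Assumes the allocator handed out 8 consecutive 512-byte
        blocks starting on a 4 KB boundary (COW makes fresh,
        contiguous allocations plausible)."""
        if not self.pending:
            return
        base = self.pending[0][0]
        buf = b''.join(p for _, p in self.pending).ljust(PHYS, b'\0')
        self.device_write(base * SECTOR, buf)
        self.pending = []

writes = []
c = Coalescer(lambda off, buf: writes.append((off, len(buf))))
for n in range(8):
    c.write_meta(n, bytes(SECTOR))
print(writes)                      # [(0, 4096)] - one 4 KB I/O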

//Jim

