Hello,
by way of introduction, I'm one of the developers of Lustre
(www.lustre.org) at CFS, and we are porting Lustre to use ZFS (well,
technically just the DMU) for its back-end storage.  We currently
use a modified ext3/4 filesystem for the back-end storage (both data and
metadata) fairly successfully (single filesystems of up to 2PB with up
to 500 back-end ext3 file stores and getting 50GB/s aggregate throughput
in some installations).

Lustre is a fairly heavy user of extended attributes on the metadata target
(MDT) to record virtual file->object mappings, and we'll also begin using
EAs more heavily on the object store (OST) in the near future (reverse
object->file mappings for example).

One of the performance improvements we developed early on with ext3 is
moving the EA into the inode to avoid seeking and full block writes for
small amounts of EA data.  The same could also be done to improve small
file performance (though we didn't implement that).  For ext3 this meant
increasing the inode size from 128 bytes to a format-time constant size of
256 - 4096 bytes (chosen based on the default Lustre EA size for that fs).

My understanding from brief conversations with some of the ZFS developers
is that there are already some plans to enlarge the dnode because
the dnode bonus buffer is getting close to being full for ZFS.  Are there
any details of this plan that I could read, or has it been discussed before?
Because the terms are so general, I wasn't able to find anything by searching.
I wanted to get the ball rolling on the large dnode discussion (which
you may have already had internally, I don't know), and start a fast EA
discussion in a separate thread.



One of the important design decisions made with the ext3 "large inode" space
(beyond the end of the regular inode) was that there was a marker in each
inode which records how much of that space was used for "fixed" fields
(e.g. nanosecond timestamps, creation time, inode version) at the time the
inode was last written.  The space beyond "i_extra_isize" is used for
extended attribute storage.  If an inode is modified and the kernel code
wants to store additional "fixed" fields in the inode it will push the EAs
out to external blocks to make room if there isn't enough in-inode space.
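
To make the space accounting above concrete, here is a minimal sketch in C.
It is not the ext3 code: the helper names are made up for illustration, and
it ignores any per-inode EA header overhead.  Only the 128-byte base inode
size and the role of i_extra_isize come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the original ext3 inode; everything past it is "large inode" space. */
#define BASE_INODE_SIZE 128u

/*
 * Bytes left for in-inode EAs once the base inode and the "fixed" extra
 * fields (extra_isize bytes) are accounted for.  Hypothetical helper.
 */
static uint32_t ea_space(uint32_t inode_size, uint16_t extra_isize)
{
    assert(inode_size >= BASE_INODE_SIZE + extra_isize);
    return inode_size - BASE_INODE_SIZE - extra_isize;
}

/*
 * True if growing the fixed area to new_extra_isize would leave less room
 * than the EAs currently stored in the inode need, i.e. the EAs would have
 * to be pushed out to an external block.  Hypothetical helper.
 */
static bool must_push_eas_out(uint32_t inode_size, uint16_t new_extra_isize,
                              uint32_t ea_bytes_in_use)
{
    return ea_bytes_in_use > ea_space(inode_size, new_extra_isize);
}
```

With a 256-byte inode and 32 bytes of fixed fields, ea_space(256, 32) leaves
96 bytes for EAs.  If those 96 bytes are all in use and the kernel wants 40
bytes of fixed fields, must_push_eas_out(256, 40, 96) is true.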

By having i_extra_isize stored in each inode (actually the first 16-bit
field in large inodes) we are at liberty to add new fields to the inode
itself without having to do a scan/update operation on existing inodes
(definitely desirable for ZFS also) and we don't have to waste a lot
of "reserved" space for potential future expansion or for fields at the
end that are not being used (e.g. inode version is only useful for NFSv4
and Lustre).  None of the "extra" fields are critical to correct operation
by definition, since the code has existed until now without them...
Conversely, we don't force EAs to start at a fixed offset and then use
inefficient EA wrapping for small 32- or 64-bit fields.
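
As a rough sketch of why no scan/update pass is needed: code that reads an
"extra" field can first check whether that field lies within the fixed area
recorded in the inode, and fall back to a default if it does not.  The struct
layout, field names, and macro below are invented for illustration and do not
match the real on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BASE_SIZE 128u

/* Hypothetical large inode: an opaque 128-byte base plus extra fixed fields. */
struct large_inode {
    uint8_t  base[BASE_SIZE];  /* original inode, contents not modelled here */
    uint16_t extra_isize;      /* bytes of fixed fields valid past the base */
    uint16_t pad;
    uint32_t ctime_nsec;
    uint32_t mtime_nsec;
    uint32_t crtime;
    uint64_t version;
};

static_assert(offsetof(struct large_inode, extra_isize) == BASE_SIZE,
              "extra_isize must be the first field past the base inode");

/* True if 'field' ends within the fixed area recorded in this inode. */
#define FIELD_PRESENT(ino, field) \
    (offsetof(struct large_inode, field) + sizeof((ino)->field) <= \
     BASE_SIZE + (ino)->extra_isize)

/* An inode written before 'version' existed simply reports a default of 0. */
static uint64_t inode_version(const struct large_inode *ino)
{
    return FIELD_PRESENT(ino, version) ? ino->version : 0;
}
```

An inode written by older code has a smaller extra_isize, so newer fields
read as absent and take their defaults until the inode is next rewritten
with a larger fixed area.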

We also _discussed_ storing ext3 small file data in an EA on an
opportunistic basis along with more extent data (a la XFS).  Are there
plans to allow the dn_blkptr[] array to grow on a per-dnode basis to
avoid spilling out to an external block for files that are smaller and/or
have little/no EA data?  Alternatively, it would be interesting to store
file data in the (enlarged) dn_blkptr[] array for small files to avoid
fragmenting the free space within the dnode.


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

