Andreas,

We have explored increasing the dnode size in the past and found
that a larger dnode has a significant negative performance impact
on the ZPL (at least with our current caching and read-ahead
policies).  So we have no plans to increase its size generically
anytime soon.

However, given that the ZPL isn't the only consumer of datasets,
and that Lustre may benefit from a larger dnode size, it may be
worth investigating the possibility of supporting multiple dnode
sizes within a single pool (this is currently not supported).

Also, note that dnodes already distinguish "fixed" DMU-specific
data from "variable" application-defined data (the bonus area).
So even in the current code, Lustre can use the 320 bytes of
bonus space however it wants.
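
For concreteness, here is a rough sketch of how a DMU consumer can
claim the bonus area today.  dmu_object_alloc() and dmu_bonus_hold()
are the existing interfaces; the lu_*() names and LU_BONUS_LEN are
hypothetical, for illustration only:

#include <sys/zfs_context.h>
#include <sys/dmu.h>

#define	LU_BONUS_LEN	320	/* DN_MAX_BONUSLEN for a 512-byte dnode */

/*
 * Allocate an object whose bonus buffer holds opaque consumer data.
 * Caller has done dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT) and
 * dmu_tx_assign().
 */
static uint64_t
lu_object_create(objset_t *os, dmu_tx_t *tx)
{
	return (dmu_object_alloc(os, DMU_OT_PLAIN_FILE_CONTENTS,
	    0 /* default blocksize */,
	    DMU_OT_UINT64_OTHER /* a real consumer would define its own */,
	    LU_BONUS_LEN, tx));
}

/*
 * Copy up to 320 bytes of consumer data (e.g. Lustre EAs) into the
 * bonus buffer.  Caller has done dmu_tx_hold_bonus(tx, object).
 */
static int
lu_bonus_write(objset_t *os, uint64_t object, const void *buf,
    size_t len, dmu_tx_t *tx)
{
	dmu_buf_t *db;
	int err;

	if ((err = dmu_bonus_hold(os, object, FTAG, &db)) != 0)
		return (err);
	ASSERT3U(len, <=, db->db_size);
	dmu_buf_will_dirty(db, tx);
	bcopy(buf, db->db_data, len);
	dmu_buf_rele(db, FTAG);
	return (0);
}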

-Mark

Andreas Dilger wrote:
> Hello,
> as a brief introduction, I'm one of the developers of Lustre
> (www.lustre.org) at CFS, and we are porting Lustre over to use ZFS (well,
> technically just the DMU) for back-end storage of Lustre.  We currently
> use a modified ext3/4 filesystem for the back-end storage (both data and
> metadata) fairly successfully (single filesystems of up to 2PB with up
> to 500 back-end ext3 file stores, achieving 50GB/s aggregate throughput
> in some installations).
> 
> Lustre is a fairly heavy user of extended attributes on the metadata target
> (MDT) to record virtual file->object mappings, and we'll also begin using
> EAs more heavily on the object store (OST) in the near future (reverse
> object->file mappings for example).
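> 
> (For concreteness, the file->object mapping EA looks roughly like
> the following; an abbreviated, from-memory sketch of Lustre's
> lov_mds_md, with one lmm_objects[] entry per stripe:)
> 
> #include <linux/types.h>
> 
> struct lov_ost_data {			/* one OST object of the file */
> 	__u64 l_object_id;		/* object id on that OST */
> 	__u32 l_ost_idx;		/* index of the OST holding it */
> };
> 
> struct lov_mds_md {			/* stored as an EA on the MDT inode */
> 	__u32 lmm_magic;		/* EA format magic/version */
> 	__u32 lmm_pattern;		/* striping pattern (e.g. RAID0) */
> 	__u64 lmm_object_id;		/* id of this file on the MDS */
> 	__u32 lmm_stripe_size;		/* bytes per stripe */
> 	__u32 lmm_stripe_count;		/* number of OST objects */
> 	struct lov_ost_data lmm_objects[0];
> };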
> 
> One of the performance improvements we developed early on with ext3 is
> moving the EA into the inode to avoid seeking and full block writes for
> small amounts of EA data.  The same could also be done to improve small
> file performance (though we didn't implement that).  For ext3 this meant
> increasing the inode size from 128 bytes to a format-time constant size of
> 256 to 4096 bytes (chosen based on the default Lustre EA size for that fs).
> 
> My understanding from brief conversations with some of the ZFS developers
> is that there are already some plans to enlarge the dnode, because
> the dnode bonus buffer is getting close to being full for ZFS.  Are there
> any details of this plan that I could read, or has it been discussed before?
> Due to the generality of the terms I wasn't able to find anything by searching.
> I wanted to get the ball rolling on the large dnode discussion (which
> you may have already had internally, I don't know), and start a fast EA
> discussion in a separate thread.
> 
> 
> 
> One of the important design decisions made with the ext3 "large inode" space
> (beyond the end of the regular inode) was to put a marker in each
> inode which records how much of that space was used for "fixed" fields
> (e.g. nanosecond timestamps, creation time, inode version) at the time the
> inode was last written.  The space beyond "i_extra_isize" is used for
> extended attribute storage.  If an inode is modified and the kernel code
> wants to store additional "fixed" fields in the inode it will push the EAs
> out to external blocks to make room if there isn't enough in-inode space.
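> 
> (To make that concrete, the large-inode layout is roughly the
> following; a sketch with field names loosely following the ext3
> sources, assuming a 256-byte on-disk inode:)
> 
> #include <linux/types.h>
> 
> #define EXT3_GOOD_OLD_INODE_SIZE 128
> 
> struct ext3_inode_large {
> 	__u8  i_classic[EXT3_GOOD_OLD_INODE_SIZE]; /* original 128-byte inode */
> 	__u16 i_extra_isize;	/* bytes of "fixed" fields in use past 128 */
> 	__u16 i_pad1;
> 	__u32 i_ctime_extra;	/* e.g. nanosecond timestamps */
> 	/* ... more fixed fields, up to offset 128 + i_extra_isize ... */
> };
> 
> /*
>  * In-inode EA storage runs from (128 + i_extra_isize) to the end of
>  * the inode, and is recognized by the EA magic number stored at that
>  * offset.
>  */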
> 
> By having i_extra_isize stored in each inode (actually the first 16-bit
> field in large inodes) we are at liberty to add new fields to the inode
> itself without having to do a scan/update operation on existing inodes
> (definitely desirable for ZFS also) and we don't have to waste a lot
> of "reserved" space for potential future expansion or for fields at the
> end that are not being used (e.g. inode version is only useful for NFSv4
> and Lustre).  None of the "extra" fields are critical to correct operation
> by definition, since the code has existed until now without them...
> Conversely, we don't force EAs to start at a fixed offset and then use
> inefficient EA wrapping for small 32- or 64-bit fields.
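> 
> (The test for whether a given "extra" field is actually present in
> an inode is then just an offset comparison; a sketch of the
> FITS_IN_INODE-style check from the ext3/4 sources, using the struct
> from the sketch above:)
> 
> #include <stddef.h>
> 
> /*
>  * A field past byte 128 is valid only if it lies inside the region
>  * covered by i_extra_isize when the inode was last written.
>  */
> #define FIXED_FIELD_FITS(raw, field)				\
> 	(offsetof(struct ext3_inode_large, field) +		\
> 	 sizeof((raw)->field) <=				\
> 	 EXT3_GOOD_OLD_INODE_SIZE + (raw)->i_extra_isize)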
> 
> We also _discussed_ storing ext3 small file data in an EA on an
> opportunistic basis along with more extent data (à la XFS).  Are there
> plans to allow the dn_blkptr[] array to grow on a per-dnode basis to
> avoid spilling out to an external block for files that are smaller and/or
> have little/no EA data?  Alternately, it would be interesting to store
> file data in the (enlarged) dn_blkptr[] array for small files to avoid
> fragmenting the free space within the dnode.
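> 
> (For reference, my reading of the current dnode.h is that blkptrs
> and bonus space already share the tail of the fixed 512-byte dnode;
> abbreviated here:)
> 
> #define	DNODE_SIZE	512			/* 1 << DNODE_SHIFT */
> #define	DNODE_CORE_SIZE	64			/* fixed DMU header */
> #define	SPA_BLKPTRSHIFT	7			/* sizeof (blkptr_t) == 128 */
> 
> /* At most 3 blkptrs fit in the 448-byte tail... */
> #define	DN_MAX_NBLKPTR	((DNODE_SIZE - DNODE_CORE_SIZE) >> SPA_BLKPTRSHIFT)
> 
> /* ...and with the minimum of one blkptr, 320 bytes remain as bonus. */
> #define	DN_MAX_BONUSLEN	\
> 	(DNODE_SIZE - DNODE_CORE_SIZE - (1 << SPA_BLKPTRSHIFT))
> 
> Growing either side past that 448-byte tail is what would require a
> larger dnode.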
> 
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code
