The "Fast extended attributes" item is of great interest to us in the Mac
OS X camp.  Historically, most files have 32 bytes of "Finder Info",
which we are currently storing as an EA.  Fast access to this info
would be a great gain for us.  We are also seeing more and more EAs
used in Mac OS X 10.5 (many with small data), so we would be interested
in some sort of generic fast EAs (i.e. embedded), or at least fast access
to their names.

-Don


On Sep 15, 2007, at 4:19 PM, Andreas Dilger wrote:

> On Sep 13, 2007  17:48 -0700, Bill Moore wrote:
>> I think there are a couple of issues here.  The first one is to allow
>> each dataset to have its own dnode size.  While conceptually not all
>> that hard, it would take some re-jiggering of the code to make most of
>> the #defines turn into per-dataset variables.  But it should be pretty
>> straightforward, and probably not a bad idea in general.
>
> Agreed.
>
>> The other issue is a little more sticky.  My understanding is that
>> Lustre-on-DMU plans to use the same data structures as the ZPL.  That
>> way, you can mount the Lustre metadata or object stores as a regular
>> filesystem.  Given this, the question is what changes, if any, should
>> be made to the ZPL to accommodate.  Allowing the ZPL to deal with
>> non-512-byte dnodes is probably not that bad.  The question is whether
>> or not the ZPL should be made to understand the extended attributes
>> (or whatever) that are stored in the rest of the bonus buffer.
>
> There are a couple of approaches I can propose, but since I'm only at
> the level of ZFS code newbie I can't weigh how easy or hard they would
> be to implement.  This is really just at the brainstorming stage for
> many of them, and we may want to split details into separate threads.
>
> typedef struct dnode_phys {
>       uint8_t dn_type;
>       uint8_t dn_indblkshift;
>       uint8_t dn_nlevels;             /* = 3 */
>       uint8_t dn_nblkptr;             /* = 3 */
>       uint8_t dn_bonustype;
>       uint8_t dn_checksum;
>       uint8_t dn_compress;
>       uint8_t dn_pad[1];
>       uint16_t dn_datablkszsec;
>       uint16_t dn_bonuslen;
>       uint8_t dn_pad2[4];
>       uint64_t dn_maxblkid;
>       uint64_t dn_secphys;
>       uint64_t dn_pad3[4];
>       blkptr_t dn_blkptr[3];          /* dn_nblkptr entries */
>       uint8_t dn_bonus[BONUSLEN];
> } dnode_phys_t;
>
> typedef struct znode_phys {
>       uint64_t zp_atime[2];
>       uint64_t zp_mtime[2];
>       uint64_t zp_ctime[2];
>       uint64_t zp_crtime[2];
>       uint64_t zp_gen;
>       uint64_t zp_mode;
>       uint64_t zp_size;
>       uint64_t zp_parent;
>       uint64_t zp_links;
>       uint64_t zp_xattr;
>       uint64_t zp_rdev;
>       uint64_t zp_flags;
>       uint64_t zp_uid;
>       uint64_t zp_gid;
>       uint64_t zp_pad[4];
>       zfs_znode_acl_t zp_acl;
> } znode_phys_t;
>
> There are several issues that I think should be addressed with a single
> design, since they are closely related:
> 0) versioning of the filesystem
> 1) variable dnode_phys_t size (per dataset, to start with at least)
> 2) fast small files (per dnode)
> 3) variable znode_phys_t size (per dnode)
> 4) fast extended attributes (per dnode)
>
> Lustre doesn't really care about (3) per se, and not very much about (2)
> right now, but we may as well address it at the same time as the others.
>
> Versioning of the filesystem
> ============================
> 0.a If we are changing the on-disk layout we have to pay attention to
>   on-disk compatibility and ensure older ZFS code does not fail badly.
>   I don't think it is possible to make all of the changes being
>   proposed here in a way that is compatible with existing code so we
>   need to version the changes in some manner.
>
> 0.b The ext2/3/4 format has a very clever (IMHO) versioning mechanism that
>   is superior to just incrementing a version number and forcing all
>   implementations to support every previous version's features.  See
>   http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224
>   for a detailed description of how the features work.  The gist is
>   that instead of the "version" being an incrementing digit it is
>   instead a bitmask of features.
>
> 0.c It would be possible to modify ZFS to use ext2-like feature flags.
>   We would have to special-case the bits 0x00000001 and 0x00000002
>   that represent the different features of ZFS_VERSION_3 currently.
>   All new features would still increment the "version number" (which
>   would become the "INCOMPAT" version field) so old code would still
>   refuse to mount it, but instead of being sequential versions we now
>   get power-of-two jumps in the version number.  It is no longer required
>   that ZFS support a strict superset of all changes that the Lustre ZFS
>   code implements immediately, and it is possible to develop and support
>   these changes in parallel, and land them in a safe, piecewise manner
>   (or never, as sometimes happens with features that die off).
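>
> A minimal sketch of the mount-time check such a scheme implies; the flag
> names and values here are hypothetical (ZFS today only carries a scalar
> version number), chosen just to illustrate the bitmask idea:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature masks: the two low bits stand in for the features
 * currently implied by ZFS_VERSION_3, the next bit for a new, independently
 * landed incompatible feature. */
#define ZFS_INCOMPAT_V3_A       0x00000001
#define ZFS_INCOMPAT_V3_B       0x00000002
#define ZFS_INCOMPAT_BIGDNODE   0x00000004      /* e.g. variable dnode size */

#define ZFS_INCOMPAT_SUPPORTED \
        (ZFS_INCOMPAT_V3_A | ZFS_INCOMPAT_V3_B | ZFS_INCOMPAT_BIGDNODE)

/* Refuse to mount a dataset carrying any incompat feature we don't know.
 * Unlike a sequential version, unrelated features can be developed and
 * supported in parallel without one implementation being a superset. */
static bool zfs_can_mount(uint32_t incompat)
{
        return ((incompat & ~(uint32_t)ZFS_INCOMPAT_SUPPORTED) == 0);
}
```

> Old code sees an unknown bit and refuses the mount; code that never
> implemented an intermediate feature can still mount datasets that don't
> use it.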
>
> Variable dnode_phys_t size
> ==========================
> 1.a) I think everyone agrees that for a per-dataset fixed value this is
>   "just" a matter of changing all the code in a mechanical fashion.
>   I'll just ignore the issue of being able to increase this in an
>   existing dataset for now.
>
> 1.b) My understanding is that dn_bonuslen covers ALL of the ZPL-accessible
>   data (i.e. it is a layering violation to try and access anything beyond
>   db_bonuslen, and in fact the buffer may not contain any valid data there,
>   or conceivably might even segfault).  That means any data used by ZPL (and
>   by extension Lustre, which wants to maintain format compatibility)
>   needs to live inside dn_bonuslen.
>
> 1.c) With a larger dnode, it is possible to have more elements in dn_blkptr[]
>   on a per-dnode basis.  I have no feeling for the relative performance
>   gains of storing 5 or 12 blocks in the dnode, but it can't hurt, I think.
>   Avoiding a seek for files < 10*128kB is still good.  It seems that
>   dnode_allocate() already takes this into account based on bonuslen at
>   the time of dnode creation.
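>
> As a back-of-the-envelope check (assuming the classic 64-byte fixed dnode
> header and 128-byte blkptr_t; both constants are taken as givens here),
> the number of block pointers that fit for a given dnode size and bonus
> length would be:

```c
#include <stdint.h>

#define DNODE_CORE_SIZE 64      /* fixed dnode_phys_t header bytes */
#define SPA_BLKPTRSHIFT 7       /* sizeof(blkptr_t) == 128 */

/* How many blkptr_t entries fit between the fixed dnode header and the
 * bonus buffer, for a given total dnode size and bonus length. */
static uint8_t dn_nblkptr_for(uint32_t dnode_size, uint16_t bonuslen)
{
        return ((uint8_t)((dnode_size - DNODE_CORE_SIZE - bonuslen)
            >> SPA_BLKPTRSHIFT));
}
```

> With a 320-byte bonus buffer, a 512-byte dnode keeps 1 blkptr and a
> 1024-byte dnode keeps 5, covering direct file data up to 5*128kB without
> an extra seek.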
>
> 1.d) It currently doesn't seem possible to change dn_bonuslen on an existing
>   object (dnode_reallocate() will truncate all the file data in that case?),
>   so we'd need some mechanism to push data blocks into an external blkptr
>   in this case (hopefully not impossible given that the pointer to the
>   bonus buffer might change?).
>
> 1.e) For a Lustre metadata server (which never stores file data) it
>   may even be useful to allow dn_nblkptr = 0 to reclaim the 128-byte
>   blkptr for EAs.  That is a relatively minor improvement and it seems
>   the DMU would currently not be very happy with that.
>
> Fast small files
> ================
> 2.a This means storing small files within the dnode itself.  Since
>   (AFAICS) the ZPL code is correctly layered atop the DMU, it has no
>   idea how or where the data for a file is actually stored.  This
>   leaves the possibility of storing small file data within the dn_blkptr[]
>   array, which at 128 bytes/blkptr is fairly significant (larger than
>   the shrinking symlink space), especially if we have a larger dnode which
>   may have a bunch of free space in it.  For a 1024-byte dnode+znode
>   we would have 760 bytes of contiguous space, and that covers 1/3
>   of the files in my /etc, /bin, /lib, /usr/bin, /usr/lib, and /var.
>
> 2.b The DMU of course assumes the dn_blkptr contents are valid (after
>   verifying the checksums) so we'd need a mechanism (dn_flag, dn_type,
>   dn_compress, dn_datablkszsec?) that indicated whether this was
>   "packed inline" data or blkptr_t data.  At first glance I like
>   "dn_compress" the best, but there would still have to be some special
>   casing to avoid handling the "blkptr" in the normal way.
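>
> A sketch of what that special case might look like; DN_FLAG_INLINE_DATA
> and its placement in dn_compress are assumptions for illustration, not
> existing DMU code:

```c
#include <stdbool.h>
#include <stdint.h>

#define SPA_BLKPTRSHIFT     7    /* sizeof(blkptr_t) == 128 */
#define DN_FLAG_INLINE_DATA 0x80 /* hypothetical marker, e.g. in dn_compress */

/* Bytes of file data the dn_blkptr[] area could hold when it is reused
 * for "packed inline" contents instead of block pointers. */
static uint32_t dn_inline_capacity(uint8_t dn_nblkptr)
{
        return ((uint32_t)dn_nblkptr << SPA_BLKPTRSHIFT);
}

/* The DMU would check this before interpreting the blkptr array, and
 * copy the bytes out directly for inline dnodes. */
static bool dn_is_inline(uint8_t dn_compress)
{
        return ((dn_compress & DN_FLAG_INLINE_DATA) != 0);
}
```

> For the classic 3-entry blkptr array that is 384 inline bytes, on top of
> whatever bonus-buffer space is free.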
>
> Variable znode_phys_t size
> ==========================
> 3.a) I initially thought that we don't have to store any extra
>   information to have a variable znode_phys_t size, because dn_bonuslen
>   holds this information.  However, for symlinks ZFS checks essentially
>   "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a
>   fast or slow symlink.  That implies that if sizeof(znode_phys_t) changes,
>   old symlinks on disk will be accessed incorrectly if we don't have
>   some extra information about the size of znode_phys_t in each dnode.
>
> 3.b) We can call this "zp_extra_znsize".  If we declare the current
>   znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of
>   extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.
>
> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
>   There is lots of unused space in some of the 64-bit fields, but I
>   don't know how you feel about hacks for this.  Possibilities include
>   some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc.
>   It probably only needs to be 8 bits or so (it seems unlikely you will
>   more than double the number of fixed fields in struct znode_phys_t).
>
> 3.d) We might consider some symlink-specific mechanism to indicate
>   fast/slow symlinks (e.g. a flag) instead of depending on sizes,
>   which I always found fragile in ext3 as well, and was the source of
>   several bugs.
>
> 3.e) We may instead consider (2.a) for symlinks at that point, since there
>   is no reason to fear writing 60-byte files anymore (same performance,
>   different (larger!) location for symlink data).
>
> 3.f) When ZFS code is accessing new fields declared in znode_phys_t it has
>   to verify whether they are beyond dn_bonuslen and zp_extra_znsize to
>   know if those fields are actually valid on disk.
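>
> Putting (3.b) and (3.f) together, the validity check might look like the
> following; ZNODE_PHYS_V0_SIZE is a stand-in value for
> sizeof(znode_phys_v0_t), assumed here for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZNODE_PHYS_V0_SIZE 264  /* stand-in for sizeof(znode_phys_v0_t) */

/* A field ending at byte offset field_end within znode_phys_t is valid on
 * disk only if it falls inside both the written znode area (base size plus
 * zp_extra_znsize) and the bonus buffer (dn_bonuslen). */
static bool zp_field_valid(size_t field_end, uint16_t zp_extra_znsize,
    uint16_t dn_bonuslen)
{
        if (field_end <= ZNODE_PHYS_V0_SIZE)
                return (true);  /* v0 fields are always present */
        return (field_end <= (size_t)ZNODE_PHYS_V0_SIZE + zp_extra_znsize &&
            field_end <= dn_bonuslen);
}
```

> New code reading an old dnode sees zp_extra_znsize == 0 and falls back to
> in-memory defaults for the missing fields.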
>
> Finally,
>
> Fast extended attributes
> ========================
> 4.a) Unfortunately, due to (1.b), I don't think we can just store the
>   EA in the dnode after the bonus buffer.
>
> 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed.
>   At that point (symlinks possibly excepted, depending on whether 3.e
>   is used) the EA space would be:
>
>   (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)
>
>   For existing symlinks we'd have to also reduce this by zp_size.
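>
> As arithmetic, with the same stand-in for sizeof(znode_phys_v0_t) as in
> (3.f) above (the constant's value is an assumption):

```c
#include <stdint.h>

#define ZNODE_PHYS_V0_SIZE 264  /* stand-in for sizeof(znode_phys_v0_t) */

/* Bonus-buffer bytes left over for fast EAs.  symlink_size (zp_size) is
 * nonzero only for an existing fast symlink whose target shares the
 * bonus buffer. */
static int32_t zp_fast_ea_space(uint16_t dn_bonuslen, uint16_t zp_extra_znsize,
    uint64_t symlink_size)
{
        int32_t space = (int32_t)dn_bonuslen - ZNODE_PHYS_V0_SIZE -
            (int32_t)zp_extra_znsize - (int32_t)symlink_size;

        return (space > 0 ? space : 0);
}
```

> So today's 320-byte maximum bonus leaves only a few dozen bytes, while a
> larger dnode's bonus buffer leaves hundreds.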
>
> 4.c) It would be best to have some kind of ZAP to store the fast EA data.
>   Ideally it is a very simple kind of ZAP (single buffer), but the
>   microzap format is too restrictive with only a 64-bit value.
>   One of the other Lustre desires is to store additional information in
>   each directory entry (in addition to the object number) like file type
>   and a remote server identifier, and having a single ZAP type that is
>   useful for small entries would be good.  Is it possible to go straight
>   to a zap_leaf_phys_t without having a corresponding zap_phys_t first?
>   If yes, then this would be quite useful; otherwise a fat ZAP is too fat
>   to be useful for storing fast EA data and the extended directory info.
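>
> If a single-buffer ZAP turns out not to be possible, even a trivially
> packed name/value layout would serve for fast EAs; the format below is
> purely hypothetical, sketched only to show how little structure is needed:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical single-buffer EA layout: repeated
 * [u8 namelen][u8 vallen][name bytes][value bytes], terminated by
 * namelen == 0.  Returns the value length, or -1 if the name is absent
 * or an entry would overrun the buffer. */
static int ea_find(const uint8_t *buf, size_t buflen, const char *name,
    const uint8_t **valp)
{
        size_t off = 0;

        while (off + 2 <= buflen && buf[off] != 0) {
                uint8_t nlen = buf[off];
                uint8_t vlen = buf[off + 1];

                if (off + 2 + nlen + vlen > buflen)
                        break;  /* corrupt entry */
                if (nlen == strlen(name) &&
                    memcmp(buf + off + 2, name, nlen) == 0) {
                        *valp = buf + off + 2 + nlen;
                        return (vlen);
                }
                off += 2 + (size_t)nlen + vlen;
        }
        return (-1);
}
```

> A single linear scan is fine at these sizes; the appeal of a real ZAP
> type is sharing one format with the extended directory-entry use case.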
>
>
> Apologies for the long email, but I think all of these issues are related
> and best addressed with a single design even if they are implemented in
> a piecemeal fashion.  None of these features are blockers for Lustre
> implementation atop ZFS/DMU, but nobody wants the performance to be bad.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code

