The "Fast extended attributes" item is of great interest to us in the
Mac OS X camp. Historically, most files have 32 bytes of "Finder Info",
which we are currently storing as an EA. Fast access to this info
would be a great gain for us. We are also seeing more and more EAs
used in Mac OS X 10.5 (many with small data), so we would be interested
in some sort of generic fast EAs (i.e. embedded) or at least fast
access to their names.
-Don
On Sep 15, 2007, at 4:19 PM, Andreas Dilger wrote:
> On Sep 13, 2007 17:48 -0700, Bill Moore wrote:
>> I think there are a couple of issues here. The first one is to allow
>> each dataset to have its own dnode size. While conceptually not all
>> that hard, it would take some re-jiggering of the code to make most of
>> the #defines turn into per-dataset variables. But it should be pretty
>> straightforward, and probably not a bad idea in general.
>
> Agreed.
>
>> The other issue is a little more sticky. My understanding is that
>> Lustre-on-DMU plans to use the same data structures as the ZPL. That
>> way, you can mount the Lustre metadata or object stores as a regular
>> filesystem. Given this, the question is what changes, if any, should
>> be made to the ZPL to accommodate this. Allowing the ZPL to deal with
>> non-512-byte dnodes is probably not that bad. The question is whether
>> or not the ZPL should be made to understand the extended attributes
>> (or whatever) that are stored in the rest of the bonus buffer.
>
> There are a couple of approaches I can propose, but since I'm only at
> the level of ZFS code newbie I can't weigh in on how easy/hard they
> would be to implement. This is really just at the brainstorming stage
> for many of them, and we may want to split details into separate
> threads.
>
> typedef struct dnode_phys {
>         uint8_t dn_type;
>         uint8_t dn_indblkshift;
>         uint8_t dn_nlevels;             /* currently 3 */
>         uint8_t dn_nblkptr;             /* currently 3 */
>         uint8_t dn_bonustype;
>         uint8_t dn_checksum;
>         uint8_t dn_compress;
>         uint8_t dn_pad[1];
>         uint16_t dn_datablkszsec;
>         uint16_t dn_bonuslen;
>         uint8_t dn_pad2[4];
>         uint64_t dn_maxblkid;
>         uint64_t dn_secphys;
>         uint64_t dn_pad3[4];
>         blkptr_t dn_blkptr[dn_nblkptr]; /* pseudo-C: dn_nblkptr entries */
>         uint8_t dn_bonus[BONUSLEN];
> } dnode_phys_t;
>
> typedef struct znode_phys {
>         uint64_t zp_atime[2];
>         uint64_t zp_mtime[2];
>         uint64_t zp_ctime[2];
>         uint64_t zp_crtime[2];
>         uint64_t zp_gen;
>         uint64_t zp_mode;
>         uint64_t zp_size;
>         uint64_t zp_parent;
>         uint64_t zp_links;
>         uint64_t zp_xattr;
>         uint64_t zp_rdev;
>         uint64_t zp_flags;
>         uint64_t zp_uid;
>         uint64_t zp_gid;
>         uint64_t zp_pad[4];
>         zfs_znode_acl_t zp_acl;
> } znode_phys_t;
>
> There are several issues that I think should be addressed with a
> single design, since they are closely related:
> 0) versioning of the filesystem
> 1) variable dnode_phys_t size (per dataset, to start with at least)
> 2) fast small files (per dnode)
> 3) variable znode_phys_t size (per dnode)
> 4) fast extended attributes (per dnode)
>
> Lustre doesn't really care about (3) per se, and not very much about
> (2) right now, but we may as well address it at the same time as the
> others.
>
> Versioning of the filesystem
> ============================
> 0.a If we are changing the on-disk layout we have to pay attention to
> on-disk compatibility and ensure older ZFS code does not fail badly.
> I don't think it is possible to make all of the changes being
> proposed here in a way that is compatible with existing code so we
> need to version the changes in some manner.
>
> 0.b The ext2/3/4 format has a versioning mechanism that is, IMHO, very
> clever and superior to just incrementing a version number and forcing
> all implementations to support every previous version's features. See
> http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224
> for a detailed description of how the features work. The gist is
> that instead of the "version" being an incrementing digit it is
> instead a bitmask of features.
>
> 0.c It would be possible to modify ZFS to use ext2-like feature flags.
> We would have to special-case the bits 0x00000001 and 0x00000002
> that represent the different features of ZFS_VERSION_3 currently.
> All new features would still increment the "version number" (which
> would become the "INCOMPAT" version field) so old code would still
> refuse to mount it, but instead of being sequential versions we now
> get power-of-two jumps in the version number. It is no longer
> required that ZFS support a strict superset of all changes that the
> Lustre ZFS code implements immediately, and it is possible to develop
> and support these changes in parallel, and land them in a safe,
> piecewise manner (or never, as sometimes happens with features that
> die off).
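The ext2-style scheme described in (0.c) could be sketched roughly as
below. The three-mask COMPAT/RO_COMPAT/INCOMPAT split is borrowed from
ext2/3/4; every flag name and the `zfs_feature_set` type are
assumptions invented for this sketch, not existing ZFS code:

```c
#include <stdint.h>

/* Hypothetical feature bits (not real ZFS flags). */
#define ZFS_FEATURE_INCOMPAT_LARGE_DNODE  0x00000001ULL
#define ZFS_FEATURE_INCOMPAT_FAST_EA      0x00000002ULL

/* Features this (hypothetical) implementation understands. */
#define ZFS_FEATURES_SUPP_COMPAT    0ULL
#define ZFS_FEATURES_SUPP_RO_COMPAT 0ULL
#define ZFS_FEATURES_SUPP_INCOMPAT  ZFS_FEATURE_INCOMPAT_LARGE_DNODE

struct zfs_feature_set {
	uint64_t compat;    /* safe to read and write even if unknown */
	uint64_t ro_compat; /* safe to read, but not write, if unknown */
	uint64_t incompat;  /* refuse to mount if any bit is unknown */
};

/* Return 0 if mountable, -1 if only read-only, -2 if unmountable. */
static inline int
zfs_check_features(const struct zfs_feature_set *fs, int readonly)
{
	if (fs->incompat & ~ZFS_FEATURES_SUPP_INCOMPAT)
		return (-2);
	if (!readonly && (fs->ro_compat & ~ZFS_FEATURES_SUPP_RO_COMPAT))
		return (-1);
	/* Unknown COMPAT bits are always harmless. */
	return (0);
}
```

The point of the split is that old code can keep mounting datasets
whose new features only set COMPAT or (read-only) RO_COMPAT bits, and
only genuinely incompatible changes force a refusal.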
>
> Variable dnode_phys_t size
> ==========================
> 1.a) I think everyone agrees that for a per-dataset fixed value this
> is "just" a matter of changing all the code in a mechanical fashion.
> I'll just ignore the issue of being able to increase this in an
> existing dataset for now.
>
> 1.b) My understanding is that dn_bonuslen covers ALL of the
> ZPL-accessible data (i.e. it is a layering violation to try to access
> anything beyond dn_bonuslen, and in fact the buffer may not even
> contain any valid data there, or conceivably could even segfault).
> That means any data used by the ZPL (and by extension Lustre, which
> wants to maintain format compatibility) needs to live inside
> dn_bonuslen.
>
> 1.c) With a larger dnode, it is possible to have more elements in
> dn_blkptr[] on a per-dnode basis. I have no feeling for the relative
> performance gains of storing 5 or 12 blkptrs in the dnode, but I
> don't think it can hurt. Avoiding a seek for files < 10*128kB is
> still good. It seems dnode_allocate() already takes this into
> account, based on bonuslen at the time of dnode creation.
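The arithmetic behind (1.c) can be made concrete. The 64-byte fixed
dnode header and the 128-byte blkptr_t match the on-disk sizes quoted
elsewhere in this thread; `DN_CORE_SIZE` and the helper name are
invented for this sketch:

```c
#include <stddef.h>

#define DN_CORE_SIZE	64	/* fixed dnode_phys_t fields */
#define BLKPTR_SIZE	128	/* sizeof(blkptr_t) on disk */

/*
 * How many block pointers fit in the variable tail of a dnode of
 * dnode_size bytes, once bonuslen bytes of bonus buffer are reserved.
 */
static inline int
dn_nblkptr_for(size_t dnode_size, size_t bonuslen)
{
	size_t tail = dnode_size - DN_CORE_SIZE - bonuslen;

	return ((int)(tail / BLKPTR_SIZE));
}
```

For example, a 512-byte dnode with the maximal 320-byte bonus buffer
leaves room for a single blkptr, while a 1024-byte dnode with the same
bonus buffer would hold five, matching the "5 or 12 blocks" figures
above.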
>
> 1.d) It currently doesn't seem possible to change dn_bonuslen on an
> existing object (dnode_reallocate() will truncate all the file data
> in that case?), so we'd need some mechanism to push data blocks into
> an external blkptr in this case (hopefully not impossible, given that
> the pointer to the bonus buffer might change?).
>
> 1.e) For a Lustre metadata server (which never stores file data) it
> may even be useful to allow dn_nblkptr = 0 to reclaim the 128-byte
> blkptr for EAs. That is a relatively minor improvement and it seems
> the DMU would currently not be very happy with that.
>
> Fast small files
> ================
> 2.a This means storing small files within the dnode itself. Since
> (AFAICS) the ZPL code is correctly layered atop the DMU, it has no
> idea how or where the data for a file is actually stored. This
> leaves the possibility of storing small file data within the
> dn_blkptr[] array, which at 128 bytes/blkptr is fairly significant
> (larger than the shrinking symlink space), especially if we have a
> larger dnode which may have a bunch of free space in it. For a
> 1024-byte dnode+znode we would have 760 bytes of contiguous space,
> and that covers 1/3 of the files in my /etc, /bin, /lib, /usr/bin,
> /usr/lib, and /var.
>
> 2.b The DMU of course assumes the dn_blkptr contents are valid (after
> verifying the checksums), so we'd need a mechanism (dn_flag, dn_type,
> dn_compress, dn_datablkszsec?) to indicate whether this is "packed
> inline" data or blkptr_t data. At first glance I like dn_compress
> the best, but there would still have to be some special casing to
> avoid handling the "blkptr" in the normal way.
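One way the tagging in (2.b) could look, reduced to its essentials.
`DN_FLAG_INLINE_DATA`, the `dn_flags` field, and the cut-down header
struct are all assumptions for illustration; as noted above, the bit
could just as well live in dn_compress or dn_type:

```c
#include <stddef.h>
#include <stdint.h>

#define DN_FLAG_INLINE_DATA	0x01	/* hypothetical flag bit */
#define BLKPTR_SIZE		128	/* sizeof(blkptr_t) on disk */

/* Reduced stand-in for the relevant dnode_phys_t fields. */
struct dn_hdr {
	uint8_t dn_flags;
	uint8_t dn_nblkptr;
};

/* Does dn_blkptr[] hold packed file data instead of block pointers? */
static inline int
dn_is_inline(const struct dn_hdr *dn)
{
	return ((dn->dn_flags & DN_FLAG_INLINE_DATA) != 0);
}

/* Bytes of inline file data the blkptr region could hold. */
static inline size_t
dn_inline_capacity(const struct dn_hdr *dn)
{
	return ((size_t)dn->dn_nblkptr * BLKPTR_SIZE);
}
```

Every read and write path that currently dereferences dn_blkptr[]
would have to test dn_is_inline() first, which is the special casing
the paragraph above worries about.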
>
> Variable znode_phys_t size
> ==========================
> 3.a) I initially thought that we don't have to store any extra
> information to have a variable znode_phys_t size, because dn_bonuslen
> holds this information. However, for symlinks ZFS essentially checks
> "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if a symlink is
> fast or slow. That implies that if sizeof(znode_phys_t) changes, old
> symlinks on disk will be accessed incorrectly unless we have some
> extra information about the size of znode_phys_t in each dnode.
>
> 3.b) We can call this "zp_extra_znsize". If we declare the current
> znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount
> of extra space beyond sizeof(znode_phys_v0_t), so 0 for current
> filesystems.
>
> 3.c) zp_extra_znsize would need to be stored in znode_phys_t
> somewhere. There is lots of unused space in some of the 64-bit
> fields, but I don't know how you feel about hacks for this.
> Possibilities include some bits in zp_flags, zp_pad, high bits in
> zp_*time nanoseconds, etc. It probably only needs to be 8 bits or so
> (it seems unlikely you will more than double the number of fixed
> fields in struct znode_phys_t).
>
> 3.d) We might consider some symlink-specific mechanism to indicate
> fast/slow symlinks (e.g. a flag) instead of depending on sizes, which
> I always found fragile in ext3 as well, and which was the source of
> several bugs there.
>
> 3.e) We may instead consider (2.a) for symlinks at that point, since
> there is no reason to fear writing 60-byte files anymore (same
> performance, different (larger!) location for the symlink data).
>
> 3.f) When ZFS code is accessing new fields declared in znode_phys_t,
> it has to check them against dn_bonuslen and zp_extra_znsize to know
> whether those fields are actually valid on disk.
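The check in (3.f) amounts to a bounds test against both limits. A
minimal sketch, where `ZNODE_V0_SIZE` stands in for
sizeof(znode_phys_v0_t) and the 264-byte value is an assumption for
illustration only:

```c
#include <stddef.h>
#include <stdint.h>

#define ZNODE_V0_SIZE	264	/* assumed sizeof(znode_phys_v0_t) */

/*
 * A field appended after the v0 znode is present on disk only if it
 * ends within the bonus buffer AND within the znode size recorded via
 * zp_extra_znsize (both names from the proposal above).
 */
static inline int
zp_field_present(uint16_t bonuslen, uint16_t extra_znsize,
    size_t field_off, size_t field_size)
{
	size_t end = field_off + field_size;

	return (end <= (size_t)bonuslen &&
	    end <= ZNODE_V0_SIZE + (size_t)extra_znsize);
}
```

Old code reading a new dnode simply never looks past ZNODE_V0_SIZE,
and new code reading an old dnode finds zp_extra_znsize == 0 and
treats the appended fields as absent.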
>
> Finally,
>
> Fast extended attributes
> ========================
> 4.a) Unfortunately, due to (1.b), I don't think we can just store the
> EA in the dnode after the bonus buffer.
>
> 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be
> addressed. At that point (symlinks possibly excepted, depending on
> whether 3.e is used) the EA space would be:
>
> (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)
>
> For existing symlinks we'd also have to reduce this by zp_size.
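The space formula from (4.b), including the symlink adjustment,
written out as code. As before, `ZNODE_V0_SIZE` is an assumed
stand-in for sizeof(znode_phys_v0_t):

```c
#include <stddef.h>
#include <stdint.h>

#define ZNODE_V0_SIZE	264	/* assumed sizeof(znode_phys_v0_t) */

/*
 * Bytes of bonus buffer left over for fast EAs.  For a fast symlink
 * that keeps its target in the bonus buffer, pass zp_size as
 * symlink_len; otherwise pass 0.
 */
static inline size_t
zp_fast_ea_space(uint16_t bonuslen, uint16_t extra_znsize,
    uint64_t symlink_len)
{
	size_t used = ZNODE_V0_SIZE + extra_znsize + (size_t)symlink_len;

	return (((size_t)bonuslen > used) ? (size_t)bonuslen - used : 0);
}
```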
>
> 4.c) It would be best to have some kind of ZAP to store the fast EA
> data. Ideally it is a very simple kind of ZAP (single buffer), but
> the microzap format is too restrictive with only a 64-bit value.
> One of the other Lustre desires is to store additional information in
> each directory entry (in addition to the object number), like file
> type and a remote server identifier, and having a single ZAP type
> that is useful for small entries would be good. Is it possible to go
> straight to a zap_leaf_phys_t without having a corresponding
> zap_phys_t first? If yes, then this would be quite useful; otherwise
> a fat ZAP is too fat to be useful for storing fast EA data and the
> extended directory info.
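To make the "very simple kind of ZAP (single buffer)" idea concrete,
here is one purely hypothetical packed layout: length-prefixed
name/value pairs laid end to end in the spare bonus space. This is
not an existing ZAP format, and the question above about starting from
a bare zap_leaf_phys_t remains open:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk header for one packed EA entry. */
struct fast_ea_ent {
	uint8_t  fe_namelen;	/* name length, without NUL */
	uint16_t fe_valuelen;	/* value length in bytes */
	/* name bytes, then value bytes, follow immediately */
} __attribute__((packed));

/* Total bytes one entry consumes in the buffer. */
static inline size_t
fast_ea_entsize(uint8_t namelen, uint16_t valuelen)
{
	return (sizeof(struct fast_ea_ent) + namelen + valuelen);
}
```

A linear scan of such entries is trivially cheap at these sizes, which
is the whole appeal over a fat ZAP for a few hundred bytes of EA data.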
>
>
> Apologies for the long email, but I think all of these issues are
> related and best addressed with a single design, even if they are
> implemented in a piecemeal fashion. None of these features are
> blockers for a Lustre implementation atop ZFS/DMU, but nobody wants
> the performance to be bad.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code