The "Fast extended attributes" item is of great interest to us in the Mac OS X camp. Historically, most files have 32 bytes of "Finder Info", which we are currently storing as an EA. Fast access to this info would be a great gain for us. We are also seeing more and more EAs used in Mac OS X 10.5 (many with small data), so we would be interested in some sort of generic fast EAs (i.e. embedded), or at least fast access to their names.
-Don

On Sep 15, 2007, at 4:19 PM, Andreas Dilger wrote:

> On Sep 13, 2007 17:48 -0700, Bill Moore wrote:
>> I think there are a couple of issues here. The first one is to allow
>> each dataset to have its own dnode size. While conceptually not all
>> that hard, it would take some re-jiggering of the code to make most of
>> the #defines turn into per-dataset variables. But it should be pretty
>> straightforward, and probably not a bad idea in general.
>
> Agreed.
>
>> The other issue is a little more sticky. My understanding is that
>> Lustre-on-DMU plans to use the same data structures as the ZPL. That
>> way, you can mount the Lustre metadata or object stores as a regular
>> filesystem. Given this, the question is what changes, if any, should
>> be made to the ZPL to accommodate that. Allowing the ZPL to deal with
>> non-512-byte dnodes is probably not that bad. The question is whether
>> or not the ZPL should be made to understand the extended attributes
>> (or whatever) that are stored in the rest of the bonus buffer.
>
> There are a couple of approaches I can propose, but since I'm only at
> the level of ZFS code newbie I can't weigh how easy/hard they would
> be to implement. This is really just at the brainstorming stage for
> many of them, and we may want to split details into separate threads.
>
> typedef struct dnode_phys {
>         uint8_t         dn_type;
>         uint8_t         dn_indblkshift;
>         uint8_t         dn_nlevels = 3;
>         uint8_t         dn_nblkptr = 3;
>         uint8_t         dn_bonustype;
>         uint8_t         dn_checksum;
>         uint8_t         dn_compress;
>         uint8_t         dn_pad[1];
>         uint16_t        dn_datablkszsec;
>         uint16_t        dn_bonuslen;
>         uint8_t         dn_pad2[4];
>         uint64_t        dn_maxblkid;
>         uint64_t        dn_secphys;
>         uint64_t        dn_pad3[4];
>         blkptr_t        dn_blkptr[dn_nblkptr];
>         uint8_t         dn_bonus[BONUSLEN];
> } dnode_phys_t;
>
> typedef struct znode_phys {
>         uint64_t        zp_atime[2];
>         uint64_t        zp_mtime[2];
>         uint64_t        zp_ctime[2];
>         uint64_t        zp_crtime[2];
>         uint64_t        zp_gen;
>         uint64_t        zp_mode;
>         uint64_t        zp_size;
>         uint64_t        zp_parent;
>         uint64_t        zp_links;
>         uint64_t        zp_xattr;
>         uint64_t        zp_rdev;
>         uint64_t        zp_flags;
>         uint64_t        zp_uid;
>         uint64_t        zp_gid;
>         uint64_t        zp_pad[4];
>         zfs_znode_acl_t zp_acl;
> } znode_phys_t;
>
> There are several issues that I think should be addressed with a single
> design, since they are closely related:
> 0) versioning of the filesystem
> 1) variable dnode_phys_t size (per dataset, to start with at least)
> 2) fast small files (per dnode)
> 3) variable znode_phys_t size (per dnode)
> 4) fast extended attributes (per dnode)
>
> Lustre doesn't really care about (3) per se, and not very much about (2)
> right now, but we may as well address it at the same time as the others.
>
> Versioning of the filesystem
> ============================
> 0.a If we are changing the on-disk layout we have to pay attention to
>     on-disk compatibility and ensure older ZFS code does not fail badly.
>     I don't think it is possible to make all of the changes being
>     proposed here in a way that is compatible with existing code, so we
>     need to version the changes in some manner.
>
> 0.b The ext2/3/4 format has a very clever (IMHO) versioning mechanism
>     that is superior to just incrementing a version number and forcing
>     all implementations to support every previous version's features.
>     See
>     http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224
>     for a detailed description of how the features work. The gist is
>     that instead of the "version" being an incrementing digit, it is
>     instead a bitmask of features.
>
> 0.c It would be possible to modify ZFS to use ext2-like feature flags.
>     We would have to special-case the bits 0x00000001 and 0x00000002
>     that represent the different features of ZFS_VERSION_3 currently.
>     All new features would still increment the "version number" (which
>     would become the "INCOMPAT" version field), so old code would still
>     refuse to mount it, but instead of sequential versions we now get
>     power-of-two jumps in the version number. It is no longer required
>     that ZFS immediately support a strict superset of all changes that
>     the Lustre ZFS code implements, and it is possible to develop and
>     support these changes in parallel, and land them in a safe,
>     piecewise manner (or never, as sometimes happens with features that
>     die off).
>
> Variable dnode_phys_t size
> ==========================
> 1.a) I think everyone agrees that for a per-dataset fixed value this is
>      "just" a matter of changing all the code in a mechanical fashion.
>      I'll ignore the issue of being able to increase this in an
>      existing dataset for now.
>
> 1.b) My understanding is that dn_bonuslen covers ALL of the
>      ZPL-accessible data (i.e. it is a layering violation to try to
>      access anything beyond dn_bonuslen; in fact the buffer may not
>      even contain any valid data beyond that, or conceivably might even
>      segfault). That means any data used by the ZPL (and by extension
>      Lustre, which wants to maintain format compatibility) needs to
>      live inside dn_bonuslen.
>
> 1.c) With a larger dnode, it is possible to have more elements in
>      dn_blkptr[] on a per-dnode basis. I have no feeling for the
>      relative performance gains of storing 5 or 12 blkptrs in the
>      dnode, but it can't hurt, I think.
>      Avoiding a seek for files < 10*128kB is still good. It seems that
>      dnode_allocate() already takes this into account, based on
>      bonuslen at the time of dnode creation.
>
> 1.d) It currently doesn't seem possible to change dn_bonuslen on an
>      existing object (dnode_reallocate() will truncate all the file
>      data in that case?), so we'd need some mechanism to push data
>      blocks into an external blkptr in this case (hopefully not
>      impossible, given that the pointer to the bonus buffer might
>      change?).
>
> 1.e) For a Lustre metadata server (which never stores file data) it
>      may even be useful to allow dn_nblkptr = 0 to reclaim the 128-byte
>      blkptr for EAs. That is a relatively minor improvement, and it
>      seems the DMU would currently not be very happy with that.
>
> Fast small files
> ================
> 2.a This means storing small files within the dnode itself. Since
>     (AFAICS) the ZPL code is correctly layered atop the DMU, it has no
>     idea how or where the data for a file is actually stored. This
>     leaves the possibility of storing small file data within the
>     dn_blkptr[] array, which at 128 bytes/blkptr is fairly significant
>     (larger than the shrinking symlink space), especially if we have a
>     larger dnode which may have a bunch of free space in it. For a
>     1024-byte dnode+znode we would have 760 bytes of contiguous space,
>     and that covers 1/3 of the files in my /etc, /bin, /lib, /usr/bin,
>     /usr/lib, and /var.
>
> 2.b The DMU of course assumes the dn_blkptr contents are valid (after
>     verifying the checksums), so we'd need a mechanism (dn_flag,
>     dn_type, dn_compress, dn_datablkszsec?) that indicates whether this
>     is "packed inline" data or blkptr_t data. At first glance I like
>     "dn_compress" the best, but there would still have to be some
>     special casing to avoid handling the "blkptr" in the normal way.
>
> Variable znode_phys_t size
> ==========================
> 3.a) I initially thought that we don't have to store any extra
>      information to have a variable znode_phys_t size, because
>      dn_bonuslen holds this information. However, for symlinks ZFS
>      checks essentially "zp_size + sizeof(znode_phys_t) < dn_bonuslen"
>      to see if it is a fast or slow symlink. That implies that if
>      sizeof(znode_phys_t) changes, old symlinks on disk will be
>      accessed incorrectly unless we have some extra information about
>      the size of znode_phys_t in each dnode.
>
> 3.b) We can call this "zp_extra_znsize". If we declare the current
>      znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the
>      amount of extra space beyond sizeof(znode_phys_v0_t), so 0 for
>      current filesystems.
>
> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
>      There is lots of unused space in some of the 64-bit fields, but I
>      don't know how you feel about hacks for this. Possibilities
>      include some bits in zp_flags, zp_pad, high bits in the zp_*time
>      nanoseconds, etc. It probably only needs to be 8 bits or so (it
>      seems unlikely you will more than double the number of fixed
>      fields in struct znode_phys_t).
>
> 3.d) We might consider some symlink-specific mechanism to indicate
>      fast/slow symlinks (e.g. a flag) instead of depending on sizes,
>      which I always found fragile in ext3 as well, and which was the
>      source of several bugs.
>
> 3.e) We may instead consider (2.a) for symlinks at that point, since
>      there is no reason to fear writing 60-byte files anymore (same
>      performance, different (larger!) location for symlink data).
>
> 3.f) When ZFS code is accessing new fields declared in znode_phys_t, it
>      has to check whether they are beyond dn_bonuslen and
>      zp_extra_znsize to know if those fields are actually valid on
>      disk.
>
> Finally,
>
> Fast extended attributes
> ========================
> 4.a) Unfortunately, due to (1.b), I don't think we can just store the
>      EA in the dnode after the bonus buffer.
>
> 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be
>      addressed. At that point (symlinks possibly excepted, depending on
>      whether 3.e is used) the EA space would be:
>
>      (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)
>
>      For existing symlinks we'd also have to reduce this by zp_size.
>
> 4.c) It would be best to have some kind of ZAP to store the fast EA
>      data. Ideally it is a very simple kind of ZAP (single buffer), but
>      the microzap format is too restrictive, with only a 64-bit value.
>      One of the other Lustre desires is to store additional information
>      in each directory entry (in addition to the object number), like
>      file type and a remote server identifier, and having a single ZAP
>      type that is useful for small entries would be good. Is it
>      possible to go straight to a zap_leaf_phys_t without having a
>      corresponding zap_phys_t first? If yes, then this would be quite
>      useful; otherwise a fat ZAP is too fat to be useful for storing
>      fast EA data and the extended directory info.
>
> Apologies for the long email, but I think all of these issues are
> related and best addressed with a single design, even if they are
> implemented in a piecemeal fashion. None of these features are blockers
> for a Lustre implementation atop ZFS/DMU, but nobody wants the
> performance to be bad.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code