I suggest that we get together soon for a "dnode summit", if you will, in which we put our various plans on the whiteboard and attempt to do the global optimization. I suspect that Lustre and pNFS, for example, have very similar needs -- it would be great to make them identical.
The dnode is a truly core data structure -- we should do everything we can to keep it free of #ifdefs and conditional logic.

Andreas, where are you based? When's your next trip to CA?

Jeff

On Mon, Sep 17, 2007 at 02:16:17PM -0600, Andreas Dilger wrote:
> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote:
> > While not entirely the same thing, we will soon have a VFS feature
> > registration mechanism in Nevada. Basically, a file system registers
> > what features it supports. Initially this will be things such as "case
> > insensitivity", "acl on create", "extended vattr_t".
>
> It's hard for me to comment on this without more information. I just
> suggested the ext3 mechanism because what I see so far (many features
> being tied to ZFS_VERSION_3, and checks for version >= ZFS_VERSION_3)
> means that it is really hard to do parallel development of features
> and to ensure that the code is actually safe to access the filesystem.
>
> For example, if we start developing large dnode + fast EA code, we might
> want to ship it sooner than it can go into a Solaris release. We want
> to make sure that no Solaris code tries to mount such a filesystem, or
> it will assert (I think), so we would have to version the fs as v4.
>
> However, maybe Solaris needs some other changes that would require a v4
> that does not include large dnode + fast EA support (for whatever reason),
> so now we have two incompatible codebases that support "v4"...
>
> Do you have a pointer to the upcoming versioning mechanism?
>
> > > 3.a) I initially thought that we don't have to store any extra
> > > information to have a variable znode_phys_t size, because dn_bonuslen
> > > holds this information. However, for symlinks ZFS essentially checks
> > > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see whether it is
> > > a fast or slow symlink.
> > > That implies that if sizeof(znode_phys_t) changes,
> > > old symlinks on disk will be accessed incorrectly if we don't have
> > > some extra information about the size of znode_phys_t in each dnode.
> >
> > There is an existing bug to create symlinks with their own object type.
>
> I don't think that will help unless there is an extra mechanism to detect
> whether the symlink is fast or slow, instead of just using the dn_bonuslen.
> Is it possible to store XATTR data on symlinks in Solaris?
>
> > > 3.b) We can call this "zp_extra_znsize". If we declare the current
> > > znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount
> > > of extra space beyond sizeof(znode_phys_v0_t), so 0 for current
> > > filesystems.
> >
> > This would also require creating a new DMU_OT_ZNODE2 or something
> > similarly named.
>
> Sure. Is it possible to change the DMU_OT type on an existing object?
>
> > > 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
> > > There is lots of unused space in some of the 64-bit fields, but I
> > > don't know how you feel about hacks for this. Possibilities include
> > > some bits in zp_flags, zp_pad, high bits in the zp_*time nanoseconds,
> > > etc. It probably only needs to be 8 bits or so (it seems unlikely you
> > > will more than double the number of fixed fields in struct
> > > znode_phys_t).
> >
> > The zp_flags field is off limits. It is going to be used for storing
> > additional file attributes such as immutable, nounlink, ...
>
> Ah, OK. I was wondering about that also, but it isn't in the top 10
> priorities yet.
>
> > I don't want to see us overload other fields. We already have several
> > pad fields within the znode that could be used.
>
> OK, I wasn't sure about what is spoken for already. Is it ZFS policy to
> always have 64-bit member fields? Some of the fields (e.g. nanoseconds)
> don't really make sense as 64-bit values, and it would probably be a
> waste to have a 64-bit value for zp_extra_znsize.
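[For readers following along: the fast/slow symlink problem and the proposed zp_extra_znsize fix can be sketched roughly as below. All struct and field names here are illustrative stand-ins, not the actual ZFS definitions, and the real znode_phys_t has many more fields than shown.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the current fixed-size znode. */
typedef struct znode_phys_v0 {
	uint64_t zp_size;	/* symlink target length; other fields elided */
} znode_phys_v0_t;

/*
 * Today a symlink is judged "fast" (target stored inline in the bonus
 * buffer) purely from dn_bonuslen, per the check quoted above.
 */
static int
is_fast_symlink_v0(uint64_t zp_size, size_t dn_bonuslen)
{
	return (zp_size + sizeof(znode_phys_v0_t) < dn_bonuslen);
}

/*
 * With a variable-sized znode, the check must account for how much
 * fixed space *this* znode actually uses (zp_extra_znsize, 0 for old
 * filesystems); otherwise growing sizeof(znode_phys_t) would make old
 * on-disk symlinks be misjudged.
 */
static int
is_fast_symlink_ext(uint64_t zp_size, size_t dn_bonuslen,
    uint16_t zp_extra_znsize)
{
	size_t fixed = sizeof(znode_phys_v0_t) + zp_extra_znsize;

	return (zp_size + fixed < dn_bonuslen);
}
```

With zp_extra_znsize == 0 the extended check degenerates to the v0 check, which is what keeps existing filesystems readable.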
> > > 4.c) It would be best to have some kind of ZAP to store the fast EA
> > > data. Ideally it is a very simple kind of ZAP (a single buffer), but
> > > the microzap format is too restrictive with only a 64-bit value.
> > > One of the other Lustre desires is to store additional information in
> > > each directory entry (in addition to the object number), like file
> > > type and a remote server identifier, and having a single ZAP type
> > > that is useful for small entries would be good. Is it possible to go
> > > straight to a zap_leaf_phys_t without having a corresponding
> > > zap_phys_t first? If yes, then this would be quite useful; otherwise
> > > a fat ZAP is too fat to be useful for storing fast EA data and the
> > > extended directory info.
> >
> > Can you provide a list of what attributes you want to store in the znode
> > and what their sizes are? Do you expect ZFS to do anything special with
> > these attributes? Should these attributes be exposed to applications?
>
> The main one is the Lustre logical object volume (LOV) extended attribute
> data. This ranges from (commonly) 64 bytes to as much as 4096 bytes (or
> possibly larger once on ZFS). This HAS to be accessed to do anything with
> the znode, even stat currently, since the size of a file is distributed
> over potentially many servers, so avoiding overhead here is critical.
>
> In addition to that, there will be similar smallish attributes stored
> with each znode, like back-pointers from the storage znodes to the
> metadata znode. These are on the order of 64 bytes as well.
>
> > Usually, we only embed attributes in the znode if the file system has
> > some sort of semantics associated with them.
>
> The issue, I think, is that this data is only useful for Lustre, so
> reserving dedicated space for it in a znode is no good. Also, the LOV
> XATTR might be very large, so any dedicated space would be wasted.
> Having a generic, fast XATTR store in the znode would help a variety
> of applications.
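[As an aside: the kind of single-buffer encoding being asked for here -- simpler than a microzap but allowing values larger than 64 bits -- could look something like the packed record layout below. This is purely a hypothetical sketch, not an existing ZFS or Lustre format; all names are invented.]

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical flat "fast EA" encoding for spare bonus-buffer space:
 * a packed sequence of (header, name, value) records.  Small EAs such
 * as a ~64-byte Lustre LOV attribute fit easily; anything that does
 * not fit would fall back to the regular XATTR mechanism.
 */
typedef struct fastea_hdr {
	uint8_t  fe_namelen;	/* bytes of name following the header */
	uint16_t fe_valuelen;	/* bytes of value following the name */
} fastea_hdr_t;

/* Append one record; returns bytes consumed, or 0 if it doesn't fit. */
static size_t
fastea_append(uint8_t *buf, size_t avail, const char *name,
    const void *val, uint16_t vallen)
{
	size_t namelen = strlen(name);
	size_t need = sizeof(fastea_hdr_t) + namelen + vallen;
	fastea_hdr_t hdr;

	if (namelen > UINT8_MAX || need > avail)
		return (0);	/* caller overflows to a normal XATTR */

	hdr.fe_namelen = (uint8_t)namelen;
	hdr.fe_valuelen = vallen;
	memcpy(buf, &hdr, sizeof (hdr));
	memcpy(buf + sizeof (hdr), name, namelen);
	memcpy(buf + sizeof (hdr) + namelen, val, vallen);
	return (need);
}
```

The point of the sketch is only that a single linear buffer suffices for the small-entry case, without the hashing and leaf/pointer machinery of a fat ZAP.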
> > One of the original plans, from several years ago, was to create a
> > zp_zap field in the znode that would be used for storing additional
> > file attributes. We never actually did that, and the field was turned
> > into one of the pad fields in the znode.
>
> Maybe "file attributes" is the wrong term. These are really XATTRs in
> the ZFS sense, so I'll refer to them as such in the future.
>
> > If the attribute will be needed for every file then it should probably
> > be in the znode, but if it is an optional attribute or too big then
> > maybe it should be in some sort of overflow object.
>
> This is what I'm proposing. Small XATTRs would live in the znode, and
> large ones would be stored using the normal ZFS XATTR mechanism (which
> is infinitely flexible). Since the Lustre LOV XATTR data is created when
> the znode is first allocated, it will always get first crack at the
> fast XATTR space, which is fine since it is right up with the znode
> data in importance.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code
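[The small-in-znode / large-in-overflow placement policy Andreas proposes reduces to a one-line decision. The sketch below is hypothetical -- the names and the idea of a fixed free-space budget are assumptions for illustration; nothing here is actual ZFS code.]

```c
#include <stddef.h>

/*
 * Hypothetical placement policy: an XATTR goes into the znode's fast-EA
 * space if it fits in the space still free there, otherwise into the
 * normal (overflow) ZFS XATTR directory.  Because the Lustre LOV EA is
 * written at znode creation, it is first in line for the fast space.
 */
typedef enum { XATTR_FAST, XATTR_OVERFLOW } xattr_where_t;

static xattr_where_t
xattr_place(size_t value_len, size_t fast_space_free)
{
	return (value_len <= fast_space_free ? XATTR_FAST : XATTR_OVERFLOW);
}
```

So a common 64-byte LOV EA lands in the znode, while a 4096-byte striping layout overflows to the regular XATTR path.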