I suggest that we get together soon for a "dnode summit", if you will,
in which we put our various plans on the whiteboard and attempt to do
the global optimization.  I suspect that Lustre and pNFS, for example,
have very similar needs -- it would be great to make them identical.

The dnode is a truly core data structure -- we should do everything
we can to keep it free of #ifdefs and conditional logic.

Andreas, where are you based?  When's your next trip to CA?

Jeff

On Mon, Sep 17, 2007 at 02:16:17PM -0600, Andreas Dilger wrote:
> On Sep 17, 2007  08:31 -0600, Mark Shellenbaum wrote:
> > While not entirely the same thing, we will soon have a VFS feature 
> > registration mechanism in Nevada.  Basically, a file system registers 
> > what features it supports.  Initially this will be things such as "case 
> > insensitivity", "acl on create", "extended vattr_t".
> 
> It's hard for me to comment on this without more information.  I just
> suggested the ext3 mechanism because what I see so far (many features
> being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3)
> means that it is really hard to do parallel development of features and
> to ensure that a given codebase is actually safe to access the filesystem.
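> 
> For reference, what I mean by "the ext3 mechanism" boils down to three
> feature masks in the superblock that are checked at mount time; roughly
> like this (just a sketch -- the ZFS_FEATURE_*_SUPP names are invented):
> 
> 	/* sketch only; the *_SUPP masks are invented names */
> 	if (sb_feature_incompat & ~ZFS_FEATURE_INCOMPAT_SUPP)
> 		return (ENOTSUP);	/* don't touch the fs at all */
> 	if ((sb_feature_ro_compat & ~ZFS_FEATURE_RO_COMPAT_SUPP) && !readonly)
> 		return (EROFS);		/* mount read-only at most */
> 	/* unknown "compat" bits are always safe to ignore */
> 
> Each feature gets its own bit, so two features developed in parallel
> never collide the way two different meanings of "v4" would.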
> 
> For example, if we start developing large dnode + fast EA code we might 
> want to ship that out sooner than it can go into a Solaris release.  We
> need to make sure that no stock Solaris code tries to mount such a
> filesystem (or it will assert, I think), so we would have to version the
> fs as v4.
> 
> However, maybe Solaris needs some other changes that would require a v4
> that does not include large dnode + fast EA support (for whatever reason)
> so now we have two incompatible codebases that both claim to support "v4"...
> 
> Do you have a pointer to the upcoming versioning mechanism?
> 
> > >3.a) I initially thought that we don't have to store any extra
> > >   information to have a variable znode_phys_t size, because dn_bonuslen
> > >   holds this information.  However, for symlinks ZFS checks essentially
> > >   "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a
> > >   fast or slow symlink.  That implies that if sizeof(znode_phys_t)
> > >   changes, old symlinks on disk will be accessed incorrectly unless we
> > >   keep some extra information about the size of znode_phys_t in each
> > >   dnode.
> > >
> > 
> > There is an existing bug to create symlinks with their own object type.
> 
> I don't think that will help unless there is an extra mechanism to detect
> whether the symlink is fast or slow, instead of just using the dn_bonuslen.
> Is it possible to store XATTR data on symlinks in Solaris?
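> 
> For reference, the check I mean is roughly the following (a sketch from
> memory, not the verbatim zfs_readlink() code; the slow-path helper name
> is just a placeholder):
> 
> 	bufsz = (size_t)zp->z_phys->zp_size;
> 	if (bufsz + sizeof (znode_phys_t) <= zp->z_dbuf->db_size) {
> 		/* fast symlink: the target sits right after the znode in
> 		 * the bonus buffer, so just copy it out from there */
> 		error = uiomove(zp->z_phys + 1,
> 		    MIN(bufsz, uio->uio_resid), UIO_READ, uio);
> 	} else {
> 		/* slow symlink: the target lives in the object's data */
> 		error = read_symlink_from_data(zp, uio);  /* placeholder */
> 	}
> 
> Once sizeof (znode_phys_t) can change, that comparison no longer means
> the same thing for symlinks written with the old size.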
> 
> > >3.b)  We can call this "zp_extra_znsize".  If we declare the current
> > >   znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount
> > >   of extra space beyond sizeof(znode_phys_v0_t), so 0 for current
> > >   filesystems.
> > 
> > This would also require creating a new DMU_OT_ZNODE2 or something 
> > similarly named.
> 
> Sure.  Is it possible to change the DMU_OT type on an existing object?
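> 
> To make 3.b concrete, the bonus-buffer layout I have in mind is roughly
> the following (just an illustration, not a final design):
> 
> 	typedef struct znode_phys znode_phys_v0_t;  /* today's fixed layout */
> 
> 	/*
> 	 *  +-----------------+-----------------------+------------------+
> 	 *  | znode_phys_v0_t | zp_extra_znsize bytes | fast EAs /       |
> 	 *  | (fixed fields)  | (future fixed fields) | symlink target   |
> 	 *  +-----------------+-----------------------+------------------+
> 	 *
> 	 * zp_extra_znsize == 0 on all current filesystems, so existing
> 	 * znodes are read exactly as they are today.
> 	 */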
> 
> > >3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
> > >   There is lots of unused space in some of the 64-bit fields, but I
> > >   don't know how you feel about hacks for this.  Possibilities include
> > >   some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc.
> > >   It probably only needs to be 8 bits or so (it seems unlikely you will
> > >   more than double the number of fixed fields in struct znode_phys_t).
> > >
> > 
> > The zp_flags field is off limits.  It is going to be used for storing 
> > additional file attributes such as immutable, nounlink,...
> 
> Ah, OK.  I was wondering about that also, but it isn't in the top 10
> priorities yet.
> 
> > I don't want to see us overload other fields.  We already have several 
> > pad fields within the znode that could be used.
> 
> OK, I wasn't sure about what is spoken for already.  Is it ZFS policy to
> always have 64-bit member fields?  Some of the fields (e.g. nanoseconds)
> don't really make sense as 64-bit values, and it would probably be a
> waste to have a 64-bit value for zp_extra_znsize.
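> 
> For instance, one of the existing 64-bit zp_pad slots could be carved up,
> something like this (purely illustrative, names invented):
> 
> 	uint16_t  zp_extra_znsize;   /* fixed bytes beyond znode_phys_v0_t */
> 	uint16_t  zp_fastea_size;    /* invented: bytes of fast EA data */
> 	uint32_t  zp_pad_remaining;  /* still spare */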
> 
> > >4.c) It would be best to have some kind of ZAP to store the fast EA data.
> > >   Ideally it would be a very simple kind of ZAP (single buffer), but the
> > >   microzap format is too restrictive, allowing only a 64-bit value.
> > >   One of the other Lustre desires is to store additional information in
> > >   each directory entry (in addition to the object number), like file type
> > >   and a remote server identifier, so having a single ZAP type that is
> > >   useful for small entries would be good.  Is it possible to go straight
> > >   to a zap_leaf_phys_t without having a corresponding zap_phys_t first?
> > >   If so, this would be quite useful; otherwise a fat ZAP is too fat
> > >   to be useful for storing fast EA data and the extended directory info.
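> > >   (For reference, the microzap entry format is fixed at a single 64-bit
> > >   value plus a short name -- roughly, from zap_impl.h as I recall:)
> > >
> > >	typedef struct mzap_ent_phys {
> > >		uint64_t mze_value;		/* the only payload: 64 bits */
> > >		uint32_t mze_cd;		/* collision differentiator */
> > >		uint16_t mze_pad;
> > >		char mze_name[MZAP_NAME_LEN];	/* 64-byte entry in total */
> > >	} mzap_ent_phys_t;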
> > 
> > Can you provide a list of what attributes you want to store in the znode 
> > and what their sizes are?  Do you expect ZFS to do anything special with 
> > these attributes?  Should these attributes be exposed to applications?
> 
> The main one is the Lustre logical object volume (LOV) extended attribute
> data.  This ranges from (commonly) 64 bytes to as much as 4096 bytes (or
> possibly larger once on ZFS).  This HAS to be accessed to do anything with
> the znode, even a stat() currently, since the size of a file is distributed
> over potentially many servers, so avoiding overhead here is critical.
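> 
> For concreteness, the common case is the v1 LOV EA, which looks roughly
> like this (1.6-era layout, from memory):
> 
> 	struct lov_ost_data_v1 {	/* one entry per stripe */
> 		__u64 l_object_id;	/* object id on that OST */
> 		__u64 l_object_gr;	/* object group */
> 		__u32 l_ost_gen;	/* OST generation number */
> 		__u32 l_ost_idx;	/* OST index within the LOV */
> 	};				/* 24 bytes */
> 
> 	struct lov_mds_md_v1 {
> 		__u32 lmm_magic;
> 		__u32 lmm_pattern;	/* striping pattern */
> 		__u64 lmm_object_id;
> 		__u64 lmm_object_gr;
> 		__u32 lmm_stripe_size;	/* stripe size in bytes */
> 		__u32 lmm_stripe_count;
> 		struct lov_ost_data_v1 lmm_objects[0];
> 	};				/* 32 bytes + 24 per stripe */
> 
> so a single-stripe file needs ~56 bytes and a widely-striped file runs to
> several KB.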
> 
> In addition to that, there will be similar smallish attributes stored with
> each znode like back-pointers from the storage znodes to the metadata znode.
> These are on the order of 64 bytes as well.
> 
> > Usually, we only embed attributes in the znode if the file system has 
> > some sort of semantics associated with them.
> 
> The issue, I think, is that this data is only useful for Lustre, so
> reserving dedicated space for it in a znode is no good.  Also, the LOV
> XATTR might be very large, so any fixed amount of dedicated space would
> either be too small or mostly wasted.  Having generic, fast XATTR storage
> in the znode would help a variety of applications.
> 
> > One of the original plans, from several years ago, was to create a
> > zp_zap field in the znode that would be used for storing additional
> > file attributes.  We never actually did that, and the field was turned
> > into one of the pad fields in the znode.
> 
> Maybe "file attributes" is the wrong term.  These are really XATTRs in the
> ZFS sense, so I'll refer to them as such in the future.
> 
> > If the attribute will be needed for every file then it should probably
> > be in the znode, but if it is an optional attribute or is too big then
> > maybe it should be in some sort of overflow object.
> 
> This is what I'm proposing.  Small XATTRs would live in the znode, and
> large ones would be stored using the normal ZFS XATTR mechanism (which
> is infinitely flexible).  Since the Lustre LOV XATTR data is created when
> the znode is first allocated, it will always get first crack at the fast
> XATTR space, which is fine since it is right up with the znode data in
> importance.
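> 
> In pseudo-code, the intended set-XATTR policy is simply the following
> (a sketch only; the function names are invented):
> 
> 	if (size <= fast_ea_space_available(zp))
> 		error = fast_ea_set(zp, name, value, size);   /* in-znode */
> 	else
> 		error = xattr_dir_set(zp, name, value, size); /* existing
> 							XATTR directory */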
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code
