Andreas Dilger wrote:
> On Sep 17, 2007  08:31 -0600, Mark Shellenbaum wrote:
>> While not entirely the same thing, we will soon have a VFS feature 
>> registration mechanism in Nevada.  Basically, a file system registers 
>> what features it supports.  Initially this will be things such as "case 
>> insensitivity", "acl on create", and "extended vattr_t".
> 
> It's hard for me to comment on this without more information.  I just
> suggested the ext3 mechanism because what I see so far (many features
> being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3)
> means that it is really hard to do parallel development of features and
> to ensure that the code is actually safe to access the filesystem.
> 

ZFS actually has 3 different version numbers.  Anything prefixed with 
ZFS_ is actually the SPA version.  The ZPL also has a version associated 
with it and will have ZPL_ as its prefix.  Within each file is a unique 
ACL version.  Most of the version changing has happened at the SPA 
level, but soon the ZPL version will be changing to support some 
additional attributes and other things for SMB.

> For example, if we start developing large dnode + fast EA code we might 
> want to ship that out sooner than it can go into a Solaris release.  We
> want to make sure that no Solaris code tries to mount such a filesystem
> or it will assert (I think), so we would have to version the fs as v4.
> 
> However, maybe Solaris needs some other changes that would require a v4
> that does not include large dnode + fast EA support (for whatever reason)
> so now we have 2 incompatible codebases that support "v4"...
> 
> Do you have a pointer to the upcoming versioning mechanism?
> 

Sure, take a look at:

http://www.opensolaris.org/os/community/arc/caselog/2007/315/
http://www.opensolaris.org/os/community/arc/caselog/2007/444/

These describe more than just the feature registration though.

>>> 3.a) I initially thought that we don't have to store any extra
>>>   information to have a variable znode_phys_t size, because dn_bonuslen
>>>   holds this information.  However, for symlinks ZFS checks essentially
>>>   "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a
>>>   fast or slow symlink.  That implies if sizeof(znode_phys_t) changes
>>>   old symlinks on disk will be accessed incorrectly if we don't have
>>>   some extra information about the size of znode_phys_t in each dnode.
>>>
>> There is an existing bug to create symlinks with their own object type.
> 
> I don't think that will help unless there is an extra mechanism to detect
> whether the symlink is fast or slow, instead of just using the dn_bonuslen.
> Is it possible to store XATTR data on symlinks in Solaris?
> 
>>> 3.b)  We can call this "zp_extra_znsize".  If we declare the current
>>>   znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of
>>>   extra space beyond sizeof(znode_phys_v0_t), so 0 for current 
>>>   filesystems.
>> This would also require creating a new DMU_OT_ZNODE2 or something 
>> similarly named.
> 
> Sure.  Is it possible to change the DMU_OT type on an existing object?
> 

Not that I know of.  You would just allocate new files with the new type.

>>> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
>>>   There is lots of unused space in some of the 64-bit fields, but I
>>>   don't know how you feel about hacks for this.  Possibilities include
>>>   some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc.
>>>   It probably only needs to be 8 bytes or so (seems unlikely you will
>>>   more than double the number of fixed fields in struct znode_phys_t).
>>>
>> The zp_flags field is off limits.  It is going to be used for storing 
>> additional file attributes such as immutable, nounlink,...
> 
> Ah, OK.  I was wondering about that also, but it isn't in the top 10
> priorities yet.
> 
>> I don't want to see us overload other fields.  We already have several 
>> pad fields within the znode that could be used.
> 
> OK, I wasn't sure about what is spoken for already.  Is it ZFS policy to
> always have 64-bit member fields?  Some of the fields (e.g. nanoseconds)
> don't really make sense as 64-bit values, and it would probably be a
> waste to have a 64-bit value for zp_extra_znsize.

Not an official policy, but we do typically use 64-bit values.

> 
>>> 4.c) It would be best to have some kind of ZAP to store the fast EA data.
>>>   Ideally it is a very simple kind of ZAP (single buffer), but the
>>>   microzap format is too restrictive with only a 64-bit value.
>>>   One of the other Lustre desires is to store additional information in
>>>   each directory entry (in addition to the object number) like file type
>>>   and a remote server identifier, and having a single ZAP type that is
>>>   useful for small entries would be good.  Is it possible to go straight
>>>   to a zap_leaf_phys_t without having a corresponding zap_phys_t first?
>>>   If yes, then this would be quite useful, otherwise a fat ZAP is too fat
>>>   to be useful for storing fast EA data and the extended directory info.
>> Can you provide a list of what attributes you want to store in the znode 
>> and what their sizes are?  Do you expect ZFS to do anything special with 
>> these attributes?  Should these attributes be exposed to applications?
> 
> The main one is the Lustre logical object volume (LOV) extended attribute
> data.  This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or
> possibly larger once on ZFS).  This HAS to be accessed to do anything with
> the znode, even stat currently, since the size of a file is distributed
> over potentially many servers, so avoiding overhead here is critical.
> 
> In addition to that, there will be similar smallish attributes stored with
> each znode like back-pointers from the storage znodes to the metadata znode.
> These are on the order of 64 bytes as well.
> 
>> Usually, we only embed attributes in the znode if the file system has 
>> some sort of semantics associated with them.
> 
> The issue I think is that this data is only useful for Lustre, so reserving
> dedicated space for it in a znode is no good.  Also, the LOV XATTR might be
> very large, so any dedicated space would be wasted.  Having a generic and
> fast XATTR storage in the znode would help a variety of applications.
> 

How does Lustre retrieve the data?  Do you expect the data to be 
preserved via backup utilities?

>> One of the original plans, from several years ago, was to create a 
>> zp_zap field in the znode that would be used for storing additional 
>> file attributes.  We never actually did that, and the field was turned 
>> into one of the pad fields in the znode.
> 
> Maybe "file attributes" is the wrong term.  These are really XATTRs in the
> ZFS sense, so I'll refer to them as such in the future.
> 

Yep, when you say EAs I was assuming small name/value pairs, not the 
Solaris-based XATTR model.

>> If the attribute will be needed for every file then it should probably 
>> be in the znode, but if it is an optional attribute or too big then 
>> maybe it should be in some sort of overflow object.
> 
> This is what I'm proposing.  For small XATTRs they would live in the znode,
> and large ones would be stored using the normal ZFS XATTR mechanism (which
> is infinitely flexible).  Since the Lustre LOV XATTR data is created when
> the znode is first allocated, it will always get first crack at using the
> fast XATTR space, which is fine since it is right up with the znode data in
> importance.
> 

How will you be setting the attributes when the object is created?  Do 
you have a kernel module that would be calling VOP_CREATE()?  The reason 
I ask is that with the ARC cases I listed earlier, you will be able to 
set additional attributes atomically at the time the file is created.


   -Mark
