Mark Shellenbaum wrote:
> Andreas Dilger wrote:
>> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote:
>>> While not entirely the same thing, we will soon have a VFS feature
>>> registration mechanism in Nevada.  Basically, a file system registers
>>> what features it supports.  Initially this will be things such as "case
>>> insensitivity", "acl on create", "extended vattr_t".
>> It's hard for me to comment on this without more information.  I just
>> suggested the ext3 mechanism because what I see so far (many features
>> being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3)
>> means that it is really hard to do parallel development of features and
>> ensure that the code is actually safe to access the filesystem.
>>
>
> ZFS actually has 3 different version numbers.  Anything with ZFS_ is
> actually the spa version.  The ZPL also has a version associated with it
> and will have ZPL_ as its prefix.  Within each file is a unique ACL
> version.  Most of the version changing has happened at the spa level,
> but soon the ZPL version will be changing to support some additional
> attributes and other things for SMB.
>
>> For example, if we start developing large dnode + fast EA code we might
>> want to ship that out sooner than it can go into a Solaris release.  We
>> want to make sure that no Solaris code tries to mount such a filesystem
>> or it will assert (I think), so we would have to version the fs as v4.
>>
>> However, maybe Solaris needs some other changes that would require a v4
>> that does not include large dnode + fast EA support (for whatever
>> reason), so now we have 2 incompatible codebases that support "v4"...
>>
>> Do you have a pointer to the upcoming versioning mechanism?
>>
>
> Sure, take a look at:
>
> http://www.opensolaris.org/os/community/arc/caselog/2007/315/
> http://www.opensolaris.org/os/community/arc/caselog/2007/444/
>
Forgot to list the feature registration one.

http://www.opensolaris.org/os/community/arc/caselog/2007/227/mail

> These describe more than just the feature registration though.
>
>>>> 3.a) I initially thought that we don't have to store any extra
>>>> information to have a variable znode_phys_t size, because dn_bonuslen
>>>> holds this information.  However, for symlinks ZFS checks essentially
>>>> "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a
>>>> fast or slow symlink.  That implies that if sizeof(znode_phys_t)
>>>> changes, old symlinks on disk will be accessed incorrectly if we don't
>>>> have some extra information about the size of znode_phys_t in each
>>>> dnode.
>>>>
>>> There is an existing bug to create symlinks with their own object type.
>> I don't think that will help unless there is an extra mechanism to detect
>> whether the symlink is fast or slow, instead of just using the dn_bonuslen.
>> Is it possible to store XATTR data on symlinks in Solaris?
>>
>>>> 3.b) We can call this "zp_extra_znsize".  If we declare the current
>>>> znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount of
>>>> extra space beyond sizeof(znode_phys_v0_t), so 0 for current
>>>> filesystems.
>>> This would also require creating a new DMU_OT_ZNODE2 or something
>>> similarly named.
>> Sure.  Is it possible to change the DMU_OT type on an existing object?
>>
>
> Not that I know of.  You would just allocate new files with the new type.
>
>>>> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
>>>> There is lots of unused space in some of the 64-bit fields, but I
>>>> don't know how you feel about hacks for this.  Possibilities include
>>>> some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc.
>>>> It probably only needs to be 8 bytes or so (it seems unlikely you will
>>>> more than double the number of fixed fields in struct znode_phys_t).
>>>>
>>> The zp_flags field is off limits.
>>> It is going to be used for storing additional
>>> file attributes such as immutable, nounlink, ...
>> Ah, OK.  I was wondering about that also, but it isn't in the top 10
>> priorities yet.
>>
>>> I don't want to see us overload other fields.  We already have several
>>> pad fields within the znode that could be used.
>> OK, I wasn't sure about what is spoken for already.  Is it ZFS policy to
>> always have 64-bit member fields?  Some of the fields (e.g. nanoseconds)
>> don't really make sense as 64-bit values, and it would probably be a
>> waste to have a 64-bit value for zp_extra_znsize.
>
> Not an official policy, but we do typically use 64-bit values.
>
>>>> 4.c) It would be best to have some kind of ZAP to store the fast EA
>>>> data.  Ideally it is a very simple kind of ZAP (single buffer), but the
>>>> microzap format is too restrictive with only a 64-bit value.
>>>> One of the other Lustre desires is to store additional information in
>>>> each directory entry (in addition to the object number), like file type
>>>> and a remote server identifier, and having a single ZAP type that is
>>>> useful for small entries would be good.  Is it possible to go straight
>>>> to a zap_leaf_phys_t without having a corresponding zap_phys_t first?
>>>> If yes, then this would be quite useful; otherwise a fat ZAP is too fat
>>>> to be useful for storing fast EA data and the extended directory info.
>>> Can you provide a list of what attributes you want to store in the znode
>>> and what their sizes are?  Do you expect ZFS to do anything special with
>>> these attributes?  Should these attributes be exposed to applications?
>> The main one is the Lustre logical object volume (LOV) extended attribute
>> data.  This ranges from (commonly) 64 bytes to as much as 4096 bytes (or
>> possibly larger once on ZFS).
>> This HAS to be accessed to do anything with
>> the znode, even stat currently, since the size of a file is distributed
>> over potentially many servers, so avoiding overhead here is critical.
>>
>> In addition to that, there will be similar smallish attributes stored
>> with each znode, like back-pointers from the storage znodes to the
>> metadata znode.  These are on the order of 64 bytes as well.
>>
>>> Usually, we only embed attributes in the znode if the file system has
>>> some sort of semantics associated with them.
>> The issue I think is that this data is only useful for Lustre, so
>> reserving dedicated space for it in a znode is no good.  Also, the LOV
>> XATTR might be very large, so any dedicated space would be wasted.
>> Having a generic and fast XATTR storage in the znode would help a
>> variety of applications.
>>
>
> How does Lustre retrieve the data?  Do you expect the data to be
> preserved via backup utilities?
>
>>> One of the original plans, from several years ago, was to create a
>>> zp_zap field in the znode that would be used for storing additional
>>> file attributes.  We never actually did that, and the field was turned
>>> into one of the pad fields in the znode.
>> Maybe "file attributes" is the wrong term.  These are really XATTRs in
>> the ZFS sense, so I'll refer to them as such in the future.
>>
>
> Yep, when you say EAs I was assuming small name/value pairs, not the
> Solaris-based XATTR model.
>
>>> If the attribute will be needed for every file then it should probably
>>> be in the znode, but if it is an optional attribute or too big then
>>> maybe it should be in some sort of overflow object.
>> This is what I'm proposing.  For small XATTRs they would live in the
>> znode, and large ones would be stored using the normal ZFS XATTR
>> mechanism (which is infinitely flexible).
>> Since the Lustre LOV XATTR data is created when
>> the znode is first allocated, it will always get first crack at using
>> the fast XATTR space, which is fine since it is right up with the znode
>> data in importance.
>>
>
> How will you be setting the attributes when the object is created?  Do
> you have a kernel module that would be calling VOP_CREATE()?  The reason
> I ask is that with the ARC cases I listed earlier, you will be able to
> set additional attributes atomically at the time the file is created.
>
>
> -Mark
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code