Mark Shellenbaum wrote:
> Andreas Dilger wrote:
>> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote:
>>> While not entirely the same thing, we will soon have a VFS feature
>>> registration mechanism in Nevada.  Basically, a file system registers
>>> what features it supports.  Initially this will be things such as "case
>>> insensitivity", "acl on create", "extended vattr_t".
>> It's hard for me to comment on this without more information.  I just
>> suggested the ext3 mechanism because what I see so far (many features
>> being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3)
>> means that it is really hard to do parallel development of features and
>> ensure that the code is actually safe to access the filesystem.
>>
>
> ZFS actually has 3 different version numbers.  Anything with ZFS_ is
> actually the spa version.  The ZPL also has a version associated with it
> and will have ZPL_ as its prefix.  Within each file is a unique ACL
> version.  Most of the version changing has happened at the spa level,
> but soon the ZPL version will be changing to support some additional
> attributes and other things for SMB.
>
>> For example, if we start developing large dnode + fast EA code we might
>> want to ship that out sooner than it can go into a Solaris release.  We
>> want to make sure that no Solaris code tries to mount such a filesystem
>> or it will assert (I think), so we would have to version the fs as v4.
>>
>> However, maybe Solaris needs some other changes that would require a v4
>> that does not include large dnode + fast EA support (for whatever
>> reason), so now we have 2 incompatible codebases that support "v4"...
>>
>> Do you have a pointer to the upcoming versioning mechanism?
>>
>
> Sure, take a look at:
>
> http://www.opensolaris.org/os/community/arc/caselog/2007/315/
> http://www.opensolaris.org/os/community/arc/caselog/2007/444/
>
Forgot to list the feature registration one.

http://www.opensolaris.org/os/community/arc/caselog/2007/227/mail

> These describe more than just the feature registration though.
>
>>>> 3.a) I initially thought that we don't have to store any extra
>>>> information to have a variable znode_phys_t size, because dn_bonuslen
>>>> holds this information.  However, for symlinks ZFS checks essentially
>>>> "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a
>>>> fast or slow symlink.  That implies that if sizeof(znode_phys_t)
>>>> changes, old symlinks on disk will be accessed incorrectly if we don't
>>>> have some extra information about the size of znode_phys_t in each
>>>> dnode.
>>>>
>>> There is an existing bug to create symlinks with their own object type.
>> I don't think that will help unless there is an extra mechanism to detect
>> whether the symlink is fast or slow, instead of just using the dn_bonuslen.
>> Is it possible to store XATTR data on symlinks in Solaris?
>>
>>>> 3.b) We can call this "zp_extra_znsize".  If we declare the current
>>>> znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount of
>>>> extra space beyond sizeof(znode_phys_v0_t), so 0 for current
>>>> filesystems.
>>> This would also require creating a new DMU_OT_ZNODE2 or something
>>> similarly named.
>> Sure.  Is it possible to change the DMU_OT type on an existing object?
>>
>
> Not that I know of.  You would just allocate new files with the new type.
>
>>>> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
>>>> There is lots of unused space in some of the 64-bit fields, but I
>>>> don't know how you feel about hacks for this.  Possibilities include
>>>> some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc.
>>>> It probably only needs to be 8 bytes or so (it seems unlikely you will
>>>> more than double the number of fixed fields in struct znode_phys_t).
>>>>
>>> The zp_flags field is off limits.
>>> It is going to be used for storing additional
>>> file attributes such as immutable, nounlink, ...
>> Ah, OK.  I was wondering about that also, but it isn't in the top 10
>> priorities yet.
>>
>>> I don't want to see us overload other fields.  We already have several
>>> pad fields within the znode that could be used.
>> OK, I wasn't sure about what is spoken for already.  Is it ZFS policy to
>> always have 64-bit member fields?  Some of the fields (e.g. nanoseconds)
>> don't really make sense as 64-bit values, and it would probably be a
>> waste to have a 64-bit value for zp_extra_znsize.
>
> Not an official policy, but we do typically use 64-bit values.
>
>>>> 4.c) It would be best to have some kind of ZAP to store the fast EA
>>>> data.  Ideally it is a very simple kind of ZAP (single buffer), but the
>>>> microzap format is too restrictive with only a 64-bit value.
>>>> One of the other Lustre desires is to store additional information in
>>>> each directory entry (in addition to the object number), like file type
>>>> and a remote server identifier, and having a single ZAP type that is
>>>> useful for small entries would be good.  Is it possible to go straight
>>>> to a zap_leaf_phys_t without having a corresponding zap_phys_t first?
>>>> If yes, then this would be quite useful; otherwise a fat ZAP is too fat
>>>> to be useful for storing fast EA data and the extended directory info.
>>> Can you provide a list of what attributes you want to store in the znode
>>> and what their sizes are?  Do you expect ZFS to do anything special with
>>> these attributes?  Should these attributes be exposed to applications?
>> The main one is the Lustre logical object volume (LOV) extended attribute
>> data.  This ranges from (commonly) 64 bytes to as much as 4096 bytes (or
>> possibly larger once on ZFS).
>> This HAS to be accessed to do anything with
>> the znode, even stat currently, since the size of a file is distributed
>> over potentially many servers, so avoiding overhead here is critical.
>>
>> In addition to that, there will be similar smallish attributes stored
>> with each znode, like back-pointers from the storage znodes to the
>> metadata znode.  These are on the order of 64 bytes as well.
>>
>>> Usually, we only embed attributes in the znode if the file system has
>>> some sort of semantics associated with them.
>> The issue I think is that this data is only useful for Lustre, so
>> reserving dedicated space for it in a znode is no good.  Also, the LOV
>> XATTR might be very large, so any dedicated space would be wasted.
>> Having a generic and fast XATTR storage in the znode would help a
>> variety of applications.
>>
>
> How does Lustre retrieve the data?  Do you expect the data to be
> preserved via backup utilities?
>
>>> One of the original plans, from several years ago, was to create a
>>> zp_zap field in the znode that would be used for storing additional
>>> file attributes.  We never actually did that, and the field was turned
>>> into one of the pad fields in the znode.
>> Maybe "file attributes" is the wrong term.  These are really XATTRs in
>> the ZFS sense, so I'll refer to them as such in the future.
>>
>
> Yep, when you say EAs I was assuming small name/value pairs, not the
> Solaris-based XATTR model.
>
>>> If the attribute will be needed for every file then it should probably
>>> be in the znode, but if it is an optional attribute or too big then
>>> maybe it should be in some sort of overflow object.
>> This is what I'm proposing.  For small XATTRs they would live in the
>> znode, and large ones would be stored using the normal ZFS XATTR
>> mechanism (which is infinitely flexible).
>> Since the Lustre LOV XATTR data is created when
>> the znode is first allocated, it will always get first crack at using
>> the fast XATTR space, which is fine since it is right up with the znode
>> data in importance.
>>
>
> How will you be setting the attributes when the object is created?  Do
> you have a kernel module that would be calling VOP_CREATE()?  The reason
> I ask is that with the ARC cases I listed earlier, you will be able to
> set additional attributes atomically at the time the file is created.
>
>
> -Mark
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code