Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 11, 2007 16:04 -0400, J. Bruce Fields wrote: > A 32-bit i_version could in theory wrap pretty quickly, couldn't it? > That's not a problem in itself--the problem would only arise if two > subsequent client queries of the change attribute happened a multiple of > 2^32 i_version increments apart. > > This is more likely than the previous scenario, but still very unlikely. > I would have guessed that even in situations with a very high rate of > updates and a low rate of client revalidations, the chance of two > revalidations happening exactly 2^32 updates apart would still be no > more than 1 in 2^32. (Could odd characteristics of the workloads (like > updates that tend to happen in power-of-2 groups?) make it any more > likely?) > > I'd be happier if ext4 at least allowed the possibility of 64 bits in > the future. And there's always the chance someone would find a use for > an i_version that was nondecreasing, even if nfs didn't care. This would indeed be the case for ext3 filesystems updated to ext4. They will only be able to store the low 32 bits of the version, which is itself normally enough for NFSv4 because it only uses the inequality check. Having the full 64 bits available eliminates the risk of collisions, and given that the spec mandates a 64-bit version I'm sure someone will take full advantage of it in NFS at some point. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 09:28:06AM -0500, Dave Kleikamp wrote: > On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: > > It just occurred to me: > > > > If i_version is 64bit, then knfsd would need to be careful when > > reading it on a 32bit host. What are the locking rules? > > How does knfsd use i_version? I would think that if all it was doing > was to compare (i_version == previous_version) That's correct. (Though it's the client that's doing the comparison, actually--the server is just reporting the value.) > then locking wouldn't really matter. Well, theoretically, > previous_version could be 0x1, and i_version could be > 0x1, knfsd checks the high word, then ext4 updates i_version > to 0x2, then knfsd checks the low word, detecting no change. > How likely is this? The choice of upper word in your example is arbitrary, but other than that I believe your example is essentially the only one. So this would only happen when *both* - the read of the new value of the low word happens precisely 2^32 i_version updates after the word was read on the client's previous cache revalidation, and - the value of i_version itself is close enough to a 32-bit boundary that wraparound can happen between the reads of the high and low words. > (I don't understand why i_version even needs to be 64 bits in the > first place.) A 32-bit i_version could in theory wrap pretty quickly, couldn't it? That's not a problem in itself--the problem would only arise if two subsequent client queries of the change attribute happened a multiple of 2^32 i_version increments apart. This is more likely than the previous scenario, but still very unlikely. I would have guessed that even in situations with a very high rate of updates and a low rate of client revalidations, the chance of two revalidations happening exactly 2^32 updates apart would still be no more than 1 in 2^32. (Could odd characteristics of the workloads (like updates that tend to happen in power-of-2 groups?) make it any more likely?) I'd be happier if ext4 at least allowed the possibility of 64 bits in the future. And there's always the chance someone would find a use for an i_version that was nondecreasing, even if nfs didn't care. > > Presumably it is only updated under i_mutex protection, but having to > > get i_mutex to read it would seem a little heavy handed. > > How does knfsd protect itself from the inode changing after i_version is > checked? Is any locking being done otherwise? If the client always requests the change attribute before reading, and the i_version is always updated after data is modified, I think we're OK. Admittedly this is a little subtle. --b. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 01:21:55PM +1000, Neil Brown wrote: > And just by-the-way, the server doesn't really have the option of not > sending the attribute. If i_version isn't defined, it has to fake > something using mtime, and hope that is good enough. ctime, actually--the change attribute is also supposed to be updated on attribute updates. > Alternately we could mandate that i_version is always kept up-to-date > and if a filesystem doesn't have anything to load from storage, it > just sets it to the current time in nanoseconds. > > That would mean that a client would need to flush it's cache whenever > the inode fell out of cache on the server, but I don't think we can > reliably do better than that. > > I think I like that approach. > > So my vote is to increment i_version in common code every time any > change is made to the file, and alloc_inode should initialise it to > current time, which might be changed by the filesystem before it calls > unlock_new_inode. So the client would be invalidating its cache more often than necessary, rather than failing to invalidate it when it should. I agree that that's probably the better tradeoff, although I wish I had a better idea of the downside. I don't know, for example, whether users might see unpleasant results if every client has to reread its cached data on a reboot. The currently proposed change--just providing a model change attribute implementation for ext4 and leaving other filesystems untouched--is a more conservative step. So I'm inclined to just do this ext4 thing first, and then look into further change attribute experiments next time around --b. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, Jul 10, 2007 at 08:19:16PM -0400, Mingming Cao wrote: > On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote: > > And how does the NFS server know that the filesystem implements i_version? > > Will a zero-value of i_version have special significance, telling the > > server to not send this attribute, perhaps? > > Bruce raised up this question a few days back when he reviewed this > patch, I think the solution is add a superblock flag for fs support > inode versioning, probably at VFS layer? Sounds fine. As long as it's something that's standard across filesystems, then I can just do something like if (sb->s_flags & MS_CHANGEATTR) /* return i_version to client as change attribute */ else /* return ctime to client as change attribute */ --b. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: > It just occurred to me: > > If i_version is 64bit, then knfsd would need to be careful when > reading it on a 32bit host. What are the locking rules? How does knfsd use i_version? I would think that if all it was doing was to compare (i_version == previous_version), then locking wouldn't really matter. Well, theoretically, previous_version could be 0x1, and i_version could be 0x1, knfsd checks the high word, then ext4 updates i_version to 0x2, then knfsd checks the low word, detecting no change. How likely is this? (I don't understand why i_version even needs to be 64 bits in the first place.) > Presumably it is only updated under i_mutex protection, but having to > get i_mutex to read it would seem a little heavy handed. How does knfsd protect itself from the inode changing after i_version is checked? Is any locking being done otherwise? > Should it use a seqlock like i_size? > Could we use the same seqlock that i_size uses, or would we need a > separate one? > > NeilBrown -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 10, 2007 23:34 -0400, Trond Myklebust wrote: > On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote: > > So my vote is to increment i_version in common code every time any > > change is made to the file, and alloc_inode should initialise it to > > current time, which might be changed by the filesystem before it calls > > unlock_new_inode. > > ... but doesn't lustre want to control its i_version... so maybe not :-( > > If lustre wants to be exportable via pNFS (as Peter Braam has suggested > it should), then it had better be able to return a change attribute that > is compatible with the NFSv4.1 spec... The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre version can be used to compare the updates of two inodes. It is set to be the Lustre transaction number (which is sequentially incremented on a per filesystem basis), so that we can use the per-inode version to do replay of client operations even if they have been disconnected for a long time, which is why we want to be able to control the actual value. We don't want the version to be updated for e.g. file defragmentation or other similar internal-only changes which need ext4_mark_inode_dirty(). We had a patch to disable ext4 inode versioning by a flag the superblock, but we dropped it at the last minute because it needed some updates and we didn't want to wait on that for submitting these changes upstream. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: > > On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > David Chinneer pointed that we need to journal the version number > > > updates together with the operations that causes the change of the inode > > > version number, in order to survive server crashes so clients won't see > > > the counter go backwards. > > > > > > So increment i_version in fs code is probably the place to ensure the > > > inode version changes are stored to disk. It's seems update the ext4 > > > inode version in every ext4_mark_inode_dirty() is the easiest way. > > > > That still makes us dependent upon _something_ changing the inode. For > > overwrites the only something is mtime. > > > > If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and > > I don't think we do) then I guess we'll need new code in or around > > file_update_time() to do this. > > do you mean mark inode dirty all the times in file_update_time()? Not > sure about the overhead for ext3/4. > hm, I hadn't thought about it in any detail. Maybe something like --- a/fs/inode.c~a +++ a/fs/inode.c @@ -1238,6 +1238,11 @@ void file_update_time(struct file *file) sync_it = 1; } + if (IS_I_VERSION_64(inode)) { + inode_inc_iversion(inode); /* Takes i_lock on 32-bit */ + sync_it = 1; + } + if (sync_it) mark_inode_dirty_sync(inode); } _ ? - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: > On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > David Chinneer pointed that we need to journal the version number > > updates together with the operations that causes the change of the inode > > version number, in order to survive server crashes so clients won't see > > the counter go backwards. > > > > So increment i_version in fs code is probably the place to ensure the > > inode version changes are stored to disk. It's seems update the ext4 > > inode version in every ext4_mark_inode_dirty() is the easiest way. > > That still makes us dependent upon _something_ changing the inode. For > overwrites the only something is mtime. > > If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and > I don't think we do) then I guess we'll need new code in or around > file_update_time() to do this. do you mean mark inode dirty all the times in file_update_time()? Not sure about the overhead for ext3/4. Mingming - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 21:22 -0700, Andrew Morton wrote: > On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote: > > > On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: > > > > > On Sun, 01 Jul 2007 03:37:04 -0400 > > > > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > This patch converts the 32-bit i_version in the generic inode to a > > > > > > 64-bit > > > > > > i_version field. > > > > > > > > > > > > > > > > That's obvious from the patch. But what was the reason for making > > > > > this > > > > > (unrelated to ext4) change? > > > > > > > > > > > > > The need is came from NFSv4 > > > > > > > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: > > > > > Hi, > > > > > > > > > > This is an update of the i_version patch. > > > > > The i_version field is a 64bit counter that is set on every inode > > > > > creation and that is incremented every time the inode data is modified > > > > > (similarly to the "ctime" time-stamp). > > > > > The aim is to fulfill a NFSv4 requirement for rfc3530: > > > > > "5.5. Mandatory Attributes - Definitions > > > > > Name # DataType Access Description > > > > > ___ > > > > > change3 uint64 READ A value created > > > > > by the > > > > > server that the client can use to determine if file > > > > > data, directory contents or attributes of the object > > > > > have been modified. The servermay return the object's > > > > > time_metadata attribute for this attribute's value but > > > > > only if the filesystem object can not be updated more > > > > > frequently than the resolution of time_metadata. > > > > > " > > > > > > > > > > > > > > Please update the changelog for this. > > > > > > > > > > > > > Is above description clear to you? > > > > > > > > > > Yes, thanks. It doesn't actually tell us why we want to implement > > > this attribute and it doesn't tell us what the implications of failing > > > to do so are, but I guess we can take that on trust from the NFS guys. > > > > > > But I suspect the ext4 implementation doesn't actually do this. afaict we > > > won't update i_version for file overwrites (especially if s_time_gran can > > > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What > > > would be the implications of this? > > > > > > > In the case of overwrite (file date updated), I assume the ctime/mtime > > is being updated and the inode is being dirtied, so the version number > > is being updated. > > > > vfs_write()->.. > > ->__generic_file_aio_write_nolock() > > ->file_update_time() > > ->mark_inode_dirty_sync() > > ->__mark_inode_dirty(I_DIRTY_SYNC) > > ->ext4_dirty_inode() > > ->ext4_mark_inode_dirty() > > That assumes an mtime update for every write(). OK, so two writes in a > single nanosecond won't be happening. But in that case why is this code: > > static inline struct timespec ext4_current_time(struct inode *inode) > { > return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ? > current_fs_time(inode->i_sb) : CURRENT_TIME_SEC; > } > > checking (s_time_gran < NSEC_PER_SEC) ?? > Ext4 can still load/read ext3 fs (which by default with 128 bytes old inode size, means doens't have support nanosecond timestamps), so it's not always gurantee nanosecond timestamps granularity.(it depends on the size of the inode (>128 bytes), by default, a fresh ext4 increase inode size to 256 bytes to have the room to store nanoseond timestamps, inode versioning etc) > Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS > server implementation: if we were to later decrease s_time_gran (as we > might do, for performance reasons), the NFS server implementation starts > reporting incorrect information. > :( that is a problem... > > > And how does the NFS server know that the filesystem implements > > > i_version? > > > Will a zero-value of i_version have special significance, telling the > > > server to not send this attribute, perhaps? > > > > Bruce raised up this question a few days back when he reviewed this > > patch, I think the solution is add a superblock flag for fs support > > inode versioning, probably at VFS layer? > > That would work. > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMA
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 11 Jul 2007 15:05:27 +1000 Neil Brown <[EMAIL PROTECTED]> wrote: > > It just occurred to me: > > If i_version is 64bit, then knfsd would need to be careful when > reading it on a 32bit host. What are the locking rules? > > Presumably it is only updated under i_mutex protection, but having to > get i_mutex to read it would seem a little heavy handed. > > Should it use a seqlock like i_size? > Could we use the same seqlock that i_size uses, or would we need a > separate one? > seqlocks are a bit of a pain to use (we've had plenty of deadlocks on the i_size one). We could reuse inode.i_lock for this modification. Its mandate is "general purpose innermost lock to protect stuff in this inode". - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > David Chinneer pointed that we need to journal the version number > updates together with the operations that causes the change of the inode > version number, in order to survive server crashes so clients won't see > the counter go backwards. > > So increment i_version in fs code is probably the place to ensure the > inode version changes are stored to disk. It's seems update the ext4 > inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote: > On Tuesday July 10, [EMAIL PROTECTED] wrote: > > > > Yes, thanks. It doesn't actually tell us why we want to implement > > this attribute and it doesn't tell us what the implications of failing > > to do so are, but I guess we can take that on trust from the NFS guys. > > You would like to think so, but remember NFSv4 was designed by a > committee :-) > > The 'change' number is used for cache consistency, and as the spec > makes very strong statements about the 'change' number, it is very > hard (or impossible) to implement a server correctly without storing a > change number in stable storage (just one of my grips about V4). > > > > > But I suspect the ext4 implementation doesn't actually do this. afaict we > > won't update i_version for file overwrites (especially if s_time_gran can > > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What > > would be the implications of this? > > The first part sounds like a bug - i_version should really be updated > by every call to ->commit_write (if that is still what it is called). > > The MAP_SHARED thing is less obvious. I guess every time we notice > that the page might have been changed, we need to increment i_version. > > > > > And how does the NFS server know that the filesystem implements i_version? > > Will a zero-value of i_version have special significance, telling the > > server to not send this attribute, perhaps? > > That is a very important question. Zero probably makes sense, but > what ever it is needs to be agreed and documented. > And just by-the-way, the server doesn't really have the option of not > sending the attribute. If i_version isn't defined, it has to fake > something using mtime, and hope that is good enough. > > Alternately we could mandate that i_version is always kept up-to-date > and if a filesystem doesn't have anything to load from storage, it > just sets it to the current time in nanoseconds. > > That would mean that a client would need to flush it's cache whenever > the inode fell out of cache on the server, but I don't think we can > reliably do better than that. > > I think I like that approach. > > So my vote is to increment i_version in common code every time any > change is made to the file, David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. > and alloc_inode should initialise it to > current time, which might be changed by the filesystem before it calls > unlock_new_inode. > ... but doesn't lustre want to control its i_version... so maybe not :-( > > NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
It just occurred to me: If i_version is 64bit, then knfsd would need to be careful when reading it on a 32bit host. What are the locking rules? Presumably it is only updated under i_mutex protection, but having to get i_mutex to read it would seem a little heavy handed. Should it use a seqlock like i_size? Could we use the same seqlock that i_size uses, or would we need a separate one? NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote: > > On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: > > > > On Sun, 01 Jul 2007 03:37:04 -0400 > > > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > > > This patch converts the 32-bit i_version in the generic inode to a > > > > > 64-bit > > > > > i_version field. > > > > > > > > > > > > > That's obvious from the patch. But what was the reason for making this > > > > (unrelated to ext4) change? > > > > > > > > > > The need is came from NFSv4 > > > > > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: > > > > Hi, > > > > > > > > This is an update of the i_version patch. > > > > The i_version field is a 64bit counter that is set on every inode > > > > creation and that is incremented every time the inode data is modified > > > > (similarly to the "ctime" time-stamp). > > > > The aim is to fulfill a NFSv4 requirement for rfc3530: > > > > "5.5. Mandatory Attributes - Definitions > > > > Name# DataType Access Description > > > > ___ > > > > change 3 uint64 READ A value created by the > > > > server that the client can use to determine if file > > > > data, directory contents or attributes of the object > > > > have been modified. The servermay return the object's > > > > time_metadata attribute for this attribute's value but > > > > only if the filesystem object can not be updated more > > > > frequently than the resolution of time_metadata. > > > > " > > > > > > > > > > > Please update the changelog for this. > > > > > > > > > > Is above description clear to you? > > > > > > > Yes, thanks. It doesn't actually tell us why we want to implement > > this attribute and it doesn't tell us what the implications of failing > > to do so are, but I guess we can take that on trust from the NFS guys. > > > > But I suspect the ext4 implementation doesn't actually do this. afaict we > > won't update i_version for file overwrites (especially if s_time_gran can > > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What > > would be the implications of this? > > > > In the case of overwrite (file date updated), I assume the ctime/mtime > is being updated and the inode is being dirtied, so the version number > is being updated. > > vfs_write()->.. > ->__generic_file_aio_write_nolock() > ->file_update_time() > ->mark_inode_dirty_sync() > ->__mark_inode_dirty(I_DIRTY_SYNC) > ->ext4_dirty_inode() > ->ext4_mark_inode_dirty() That assumes an mtime update for every write(). OK, so two writes in a single nanosecond won't be happening. But in that case why is this code: static inline struct timespec ext4_current_time(struct inode *inode) { return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ? current_fs_time(inode->i_sb) : CURRENT_TIME_SEC; } checking (s_time_gran < NSEC_PER_SEC) ?? Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS server implementation: if we were to later decrease s_time_gran (as we might do, for performance reasons), the NFS server implementation starts reporting incorrect information. > > And how does the NFS server know that the filesystem implements i_version? > > Will a zero-value of i_version have special significance, telling the > > server to not send this attribute, perhaps? > > Bruce raised up this question a few days back when he reviewed this > patch, I think the solution is add a superblock flag for fs support > inode versioning, probably at VFS layer? That would work. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote: > On Tuesday July 10, [EMAIL PROTECTED] wrote: > > > > Yes, thanks. It doesn't actually tell us why we want to implement > > this attribute and it doesn't tell us what the implications of failing > > to do so are, but I guess we can take that on trust from the NFS guys. > > You would like to think so, but remember NFSv4 was designed by a > committee :-) > > The 'change' number is used for cache consistency, and as the spec > makes very strong statements about the 'change' number, it is very > hard (or impossible) to implement a server correctly without storing a > change number in stable storage (just one of my grips about V4). Well... It reflects a requirement that was just as present in the caching models that we use for NFSv2/v3, but that we glossed over for other reasons. The real difference here is that v2/v3 caching model assumes that you have sufficient resolution in the ctime/mtime to allow clients to detect any changes to the file/directory contents, whereas NFSv4 allows you to use an arbitrary variable (that may be the ctime, if it has sufficient resolution) for the same purposes. > > But I suspect the ext4 implementation doesn't actually do this. afaict we > > won't update i_version for file overwrites (especially if s_time_gran can > > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What > > would be the implications of this? > > The first part sounds like a bug - i_version should really be updated > by every call to ->commit_write (if that is still what it is called). > > The MAP_SHARED thing is less obvious. I guess every time we notice > that the page might have been changed, we need to increment i_version. You need to increment is any time that you detect remotely visible changes. IOW: any change that posix mandates should result in a ctime update, must also result in an update of i_version. The only difference is that i_version must not suffer from the time resolution issues that ctime does. > > And how does the NFS server know that the filesystem implements i_version? > > Will a zero-value of i_version have special significance, telling the > > server to not send this attribute, perhaps? > > That is a very important question. Zero probably makes sense, but > what ever it is needs to be agreed and documented. > And just by-the-way, the server doesn't really have the option of not > sending the attribute. If i_version isn't defined, it has to fake > something using mtime, and hope that is good enough. > > Alternately we could mandate that i_version is always kept up-to-date > and if a filesystem doesn't have anything to load from storage, it > just sets it to the current time in nanoseconds. > > That would mean that a client would need to flush it's cache whenever > the inode fell out of cache on the server, but I don't think we can > reliably do better than that. > > I think I like that approach. > > So my vote is to increment i_version in common code every time any > change is made to the file, and alloc_inode should initialise it to > current time, which might be changed by the filesystem before it calls > unlock_new_inode. > ... but doesn't lustre want to control its i_version... so maybe not :-( If lustre wants to be exportable via pNFS (as Peter Braam has suggested it should), then it had better be able to return a change attribute that is compatible with the NFSv4.1 spec... Trond - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tuesday July 10, [EMAIL PROTECTED] wrote: > > Yes, thanks. It doesn't actually tell us why we want to implement > this attribute and it doesn't tell us what the implications of failing > to do so are, but I guess we can take that on trust from the NFS guys. You would like to think so, but remember NFSv4 was designed by a committee :-) The 'change' number is used for cache consistency, and as the spec makes very strong statements about the 'change' number, it is very hard (or impossible) to implement a server correctly without storing a change number in stable storage (just one of my grips about V4). > > But I suspect the ext4 implementation doesn't actually do this. afaict we > won't update i_version for file overwrites (especially if s_time_gran can > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What > would be the implications of this? The first part sounds like a bug - i_version should really be updated by every call to ->commit_write (if that is still what it is called). The MAP_SHARED thing is less obvious. I guess every time we notice that the page might have been changed, we need to increment i_version. > > And how does the NFS server know that the filesystem implements i_version? > Will a zero-value of i_version have special significance, telling the > server to not send this attribute, perhaps? That is a very important question. Zero probably makes sense, but what ever it is needs to be agreed and documented. And just by-the-way, the server doesn't really have the option of not sending the attribute. If i_version isn't defined, it has to fake something using mtime, and hope that is good enough. Alternately we could mandate that i_version is always kept up-to-date and if a filesystem doesn't have anything to load from storage, it just sets it to the current time in nanoseconds. That would mean that a client would need to flush it's cache whenever the inode fell out of cache on the server, but I don't think we can reliably do better than that. I think I like that approach. So my vote is to increment i_version in common code every time any change is made to the file, and alloc_inode should initialise it to current time, which might be changed by the filesystem before it calls unlock_new_inode. ... but doesn't lustre want to control its i_version... so maybe not :-( NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote: > On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: > > > On Sun, 01 Jul 2007 03:37:04 -0400 > > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > This patch converts the 32-bit i_version in the generic inode to a > > > > 64-bit > > > > i_version field. > > > > > > > > > > That's obvious from the patch. But what was the reason for making this > > > (unrelated to ext4) change? > > > > > > > The need is came from NFSv4 > > > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: > > > Hi, > > > > > > This is an update of the i_version patch. > > > The i_version field is a 64bit counter that is set on every inode > > > creation and that is incremented every time the inode data is modified > > > (similarly to the "ctime" time-stamp). > > > The aim is to fulfill a NFSv4 requirement for rfc3530: > > > "5.5. Mandatory Attributes - Definitions > > > Name # DataType Access Description > > > ___ > > > change3 uint64 READ A value created by the > > > server that the client can use to determine if file > > > data, directory contents or attributes of the object > > > have been modified. The servermay return the object's > > > time_metadata attribute for this attribute's value but > > > only if the filesystem object can not be updated more > > > frequently than the resolution of time_metadata. > > > " > > > > > > > > Please update the changelog for this. > > > > > > > Is above description clear to you? > > > > Yes, thanks. It doesn't actually tell us why we want to implement > this attribute and it doesn't tell us what the implications of failing > to do so are, but I guess we can take that on trust from the NFS guys. > > But I suspect the ext4 implementation doesn't actually do this. afaict we > won't update i_version for file overwrites (especially if s_time_gran can > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What > would be the implications of this? > In the case of overwrite (file date updated), I assume the ctime/mtime is being updated and the inode is being dirtied, so the version number is being updated. vfs_write()->.. ->__generic_file_aio_write_nolock() ->file_update_time() ->mark_inode_dirty_sync() ->__mark_inode_dirty(I_DIRTY_SYNC) ->ext4_dirty_inode() ->ext4_mark_inode_dirty() > And how does the NFS server know that the filesystem implements i_version? > Will a zero-value of i_version have special significance, telling the > server to not send this attribute, perhaps? Bruce raised up this question a few days back when he reviewed this patch, I think the solution is add a superblock flag for fs support inode versioning, probably at VFS layer? Mingming - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: > > On Sun, 01 Jul 2007 03:37:04 -0400 > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > This patch converts the 32-bit i_version in the generic inode to a 64-bit > > > i_version field. > > > > > > > That's obvious from the patch. But what was the reason for making this > > (unrelated to ext4) change? > > > > The need is came from NFSv4 > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: > > Hi, > > > > This is an update of the i_version patch. > > The i_version field is a 64bit counter that is set on every inode > > creation and that is incremented every time the inode data is modified > > (similarly to the "ctime" time-stamp). > > The aim is to fulfill a NFSv4 requirement for rfc3530: > > "5.5. Mandatory Attributes - Definitions > > Name# DataType Access Description > > ___ > > change 3 uint64 READ A value created by the > > server that the client can use to determine if file > > data, directory contents or attributes of the object > > have been modified. The servermay return the object's > > time_metadata attribute for this attribute's value but > > only if the filesystem object can not be updated more > > frequently than the resolution of time_metadata. > > " > > > > > Please update the changelog for this. > > > > Is above description clear to you? > Yes, thanks. It doesn't actually tell us why we want to implement this attribute and it doesn't tell us what the implications of failing to do so are, but I guess we can take that on trust from the NFS guys. But I suspect the ext4 implementation doesn't actually do this. afaict we won't update i_version for file overwrites (especially if s_time_gran can indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What would be the implications of this? And how does the NFS server know that the filesystem implements i_version? Will a zero-value of i_version have special significance, telling the server to not send this attribute, perhaps? - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: > On Sun, 01 Jul 2007 03:37:04 -0400 > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > This patch converts the 32-bit i_version in the generic inode to a 64-bit > > i_version field. > > > > That's obvious from the patch. But what was the reason for making this > (unrelated to ext4) change? > The need is came from NFSv4 On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: > Hi, > > This is an update of the i_version patch. > The i_version field is a 64bit counter that is set on every inode > creation and that is incremented every time the inode data is modified > (similarly to the "ctime" time-stamp). > The aim is to fulfill a NFSv4 requirement for rfc3530: > "5.5. Mandatory Attributes - Definitions > Name # DataType Access Description > ___ > change3 uint64 READ A value created by the > server that the client can use to determine if file > data, directory contents or attributes of the object > have been modified. The servermay return the object's > time_metadata attribute for this attribute's value but > only if the filesystem object can not be updated more > frequently than the resolution of time_metadata. > " > > Please update the changelog for this. > Is above description clear to you? > > Index: linux-2.6.21/include/linux/fs.h > > === > > --- linux-2.6.21.orig/include/linux/fs.h > > +++ linux-2.6.21/include/linux/fs.h > > @@ -549,7 +549,7 @@ struct inode { > > uid_t i_uid; > > gid_t i_gid; > > dev_t i_rdev; > > - unsigned long i_version; > > + u64 i_version; > > loff_t i_size; > > #ifdef __NEED_I_SIZE_ORDERED > > seqcount_t i_size_seqcount; > - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Sun, 01 Jul 2007 03:37:04 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > This patch converts the 32-bit i_version in the generic inode to a 64-bit > i_version field. > That's obvious from the patch. But what was the reason for making this (unrelated to ext4) change? Please update the changelog for this. > Index: linux-2.6.21/include/linux/fs.h > === > --- linux-2.6.21.orig/include/linux/fs.h > +++ linux-2.6.21/include/linux/fs.h > @@ -549,7 +549,7 @@ struct inode { > uid_t i_uid; > gid_t i_gid; > dev_t i_rdev; > - unsigned long i_version; > + u64 i_version; > loff_t i_size; > #ifdef __NEED_I_SIZE_ORDERED > seqcount_t i_size_seqcount; - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Fri, 2007-07-06 at 16:53 -0600, Andreas Dilger wrote: > On Jul 06, 2007 09:51 -0400, J. Bruce Fields wrote: > > The use of a mount option means the change attribute could be > > inconsistent across mounts. If we really need this, wouldn't it make > > more sense for it to be a persistent feature of the filesystem, set at > > mkfs time? > > Yes, having it stored into the superblock in s_flags is probably a good > idea. Kalpak, do you think you could get a patch that adds e.g. > EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs). > Per our ext4 interlock discussion today, I have dropped the ext4-no-inode version-mount-option patch from ext4 patch queue. Thanks, Mingming - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 06, 2007 09:51 -0400, J. Bruce Fields wrote: > The use of a mount option means the change attribute could be > inconsistent across mounts. If we really need this, wouldn't it make > more sense for it to be a persistent feature of the filesystem, set at > mkfs time? Yes, having it stored into the superblock in s_flags is probably a good idea. Kalpak, do you think you could get a patch that adds e.g. EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, Jul 03, 2007 at 05:32:00PM -0600, Andreas Dilger wrote: > On Jul 03, 2007 18:15 -0400, J. Bruce Fields wrote: > > How will nfsd tell whether it can really on a given filesystem's > > i_version, or whether it should fall back on ctime? > > Good question. Well, we don't need anything particularly complicated--just a one-bit flag on the superblock would be enough. > > So what's the motivation for the "noversion" mount option? > > Lustre needs to be able to control the version number directly (version > number needs to be ordered between all inodes, is set by Lustre to be a > transaction number). Instead of trying to incorporate this unused code > into ext4 we just turn off the ext4 version code and let Lustre control > this directly. It may even be that NFSv4 will need to control the version > numbers itself... I can't think of any reason we would need to in the near future, but maybe I'm insufficiently creative. The use of a mount option means the change attribute could be inconsistent across mounts. If we really need this, wouldn't it make more sense for it to be a persistent feature of the filesystem, set at mkfs time? --b. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 03, 2007 18:15 -0400, J. Bruce Fields wrote: > How will nfsd tell whether it can really on a given filesystem's > i_version, or whether it should fall back on ctime? Good question. > > As to performance concerns that raise before the inode version counter > > (at least for ext4) is done inside ext4_mark_inode_dirty), so there is > > no extra IO work to store this counter to disk. > > So what's the motivation for the "noversion" mount option? Lustre needs to be able to control the version number directly (version number needs to be ordered between all inodes, is set by Lustre to be a transaction number). Instead of trying to incorporate this unused code into ext4 we just turn off the ext4 version code and let Lustre control this directly. It may even be that NFSv4 will need to control the version numbers itself... Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Mon, Jul 02, 2007 at 10:58:33AM -0400, Mingming Cao wrote: > Trond or Bruce, can you please review these patch series and ack if you > agrees? Thanks, looks like what we need! How will nfsd tell whether it can really on a given filesystem's i_version, or whether it should fall back on ctime? > As to performance concerns that raise before the inode version counter > (at least for ext4) is done inside ext4_mark_inode_dirty), so there is > no extra IO work to store this counter to disk. So what's the motivation for the "noversion" mount option? --b. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 03, 2007 10:24 -0400, Trond Myklebust wrote: > It looks OK to me, but you might want to strip out the now redundant > i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename(). Agreed, and I thought we discussed that already on the ext4 list. > I also have some questions about how this will affect the readdir code: > unless I missed something, the filp->f_version is still unsigned long, > so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir() > no longer make sense. I don't see them as any worse than existing checks. For 32-bit systems we only ever had a 32-bit in-memory version anyway so using only the low 32 bits of i_version in f_version is no more racy than in the past. For 64-bit systems using the full on-disk i_version is possible. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Mon, 2007-07-02 at 10:58 -0400, Mingming Cao wrote: > Trond or Bruce, can you please review these patch series and ack if you > agrees? Thanks. > > As to performance concerns that raise before the inode version counter > (at least for ext4) is done inside ext4_mark_inode_dirty), so there is > no extra IO work to store this counter to disk. Hi Mingming, It looks OK to me, but you might want to strip out the now redundant i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename(). I also have some questions about how this will affect the readdir code: unless I missed something, the filp->f_version is still unsigned long, so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir() no longer make sense. Cheers Trond - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
Trond or Bruce, can you please review these patch series and ack if you agrees? Thanks. As to performance concerns that raise before the inode version counter (at least for ext4) is done inside ext4_mark_inode_dirty), so there is no extra IO work to store this counter to disk. Mingming On Sun, 2007-07-01 at 03:37 -0400, Mingming Cao wrote: > This patch converts the 32-bit i_version in the generic inode to a 64-bit > i_version field. > > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> > Signed-off-by: Jean Noel Cordenner <[EMAIL PROTECTED]> > Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]> > > Index: linux-2.6.21/include/linux/fs.h > === > --- linux-2.6.21.orig/include/linux/fs.h > +++ linux-2.6.21/include/linux/fs.h > @@ -549,7 +549,7 @@ struct inode { > uid_t i_uid; > gid_t i_gid; > dev_t i_rdev; > - unsigned long i_version; > + u64 i_version; > loff_t i_size; > #ifdef __NEED_I_SIZE_ORDERED > seqcount_t i_size_seqcount; > > > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
This patch converts the 32-bit i_version in the generic inode to a 64-bit i_version field. Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> Signed-off-by: Jean Noel Cordenner <[EMAIL PROTECTED]> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]> Index: linux-2.6.21/include/linux/fs.h === --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -549,7 +549,7 @@ struct inode { uid_t i_uid; gid_t i_gid; dev_t i_rdev; - unsigned long i_version; + u64 i_version; loff_t i_size; #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html