Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, Jan 8, 2018 at 10:39 PM, Jeff Laytonwrote: >> > > Got it, that's helpful. Does this patch help (on top of the others) ? > > 8<-- > > SQUASH: nfs: compare raw iversion counter since that's what's > being stored > Did not apply cleanly (in include/linux/iversion.h) but it was easy to adjust. However still no improvements. Applied on top of others. Best regards, Krzysztof -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, 2018-01-08 at 21:17 +0100, Krzysztof Kozlowski wrote: > On Mon, Jan 08, 2018 at 02:15:29PM -0500, Jeff Layton wrote: > > On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote: > > > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote: > > > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote: > > > > > > (...) > > > > > > > > > Ok, thanks. If you're seeing hangs then that might imply that we > > > > > > have > > > > > > some sort of excessive looping going on in the cmpxchg loops. > > > > > > > > > > > > Could you apply the patch below and let me know if it causes either > > > > > > of > > > > > > the warnings to pop? That might at least point us in the right > > > > > > direction: > > > > > > > > > > No new warnings with attached patch (except existing already lockdep: > > > > > "INFO: trying to register non-static key."). > > > > > > > > > > > > > Yeah, I saw that in the original logs and it looks unrelated (and > > > > harmless). > > > > > > > > > Systemd timeouts on mounting /home but after entering rescue shell > > > > > there > > > > > is no problem running mount /home: > > > > > Give root password for maintenance > > > > > (or press Control-D to continue): > > > > > root@odroidxu3:~# mount /home > > > > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with > > > > > ordered data mode. Opts: (null) > > > > > > > > > > > > > Ok, thanks for testing it. So I guess we can probably rule out excessive > > > > looping in those functions as the issue. > > > > > > > > To make sure I understand the problem: When systemd tries to do the > > > > initial mount of /home (which is an ext4 filesystem), it hangs. But once > > > > it drops to the shell, it works, if you do the mount by hand. > > > > > > > > Is that correct? > > > > > > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial > > > console). > > > > > > > If so, then is it possible to trigger sysrq commands during the hanging > > > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to > > > > determine what it's blocking on? > > > > (trimming the output) > > > > Thanks. I don't really see anything obvious in that info, > > unfortunately. What we really need to do is find the systemd task > > performing the mount, and see what it's doing. > > It's systemd 236.0-2 coming from Arch Linux for ARM. All packages are > updated. > > > We do have one questionable bug in the NFS changes though. Does this > > patch help at all? > > Patches do not change anything (I tried "SQUASH: nfs: fix i_version > increment when adding a request" and "SQUASH: ext4: use raw API for > xattr inode refcounts"). > > I tried again regular SDcard-root boot and it succeeded. Only nfsroot > fails. > > Best regards, > Krzysztof > Got it, that's helpful. Does this patch help (on top of the others) ? 8<-- SQUASH: nfs: compare raw iversion counter since that's what's being stored Signed-off-by: Jeff Layton--- fs/nfs/inode.c | 6 +++--- include/linux/iversion.h | 13 + 2 files changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 0b85cca1184b..93552c482992 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -1289,7 +1289,7 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE) && (fattr->valid & NFS_ATTR_FATTR_CHANGE) - && !inode_cmp_iversion(inode, fattr->pre_change_attr)) { + && !inode_cmp_iversion_raw(inode, fattr->pre_change_attr)) { inode_set_iversion_raw(inode, fattr->change_attr); if (S_ISDIR(inode->i_mode)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA); @@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat if (!nfs_file_has_buffered_writers(nfsi)) { /* Verify a few of the more important attributes */ - if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion(inode, fattr->change_attr)) + if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion_raw(inode, fattr->change_attr)) invalid |= NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE; if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(>i_mtime, >mtime)) @@ -1778,7 +1778,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr) /* More cache consistency checks */ if (fattr->valid & NFS_ATTR_FATTR_CHANGE) { - if (inode_cmp_iversion(inode, fattr->change_attr)) { + if (inode_cmp_iversion_raw(inode, fattr->change_attr)) { dprintk("NFS: change_attr change on server for file %s/%ld\n",
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, Jan 08, 2018 at 02:15:29PM -0500, Jeff Layton wrote: > On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote: > > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote: > > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote: > > > > (...) > > > > > > > Ok, thanks. If you're seeing hangs then that might imply that we have > > > > > some sort of excessive looping going on in the cmpxchg loops. > > > > > > > > > > Could you apply the patch below and let me know if it causes either of > > > > > the warnings to pop? That might at least point us in the right > > > > > direction: > > > > > > > > No new warnings with attached patch (except existing already lockdep: > > > > "INFO: trying to register non-static key."). > > > > > > > > > > Yeah, I saw that in the original logs and it looks unrelated (and > > > harmless). > > > > > > > Systemd timeouts on mounting /home but after entering rescue shell there > > > > is no problem running mount /home: > > > > Give root password for maintenance > > > > (or press Control-D to continue): > > > > root@odroidxu3:~# mount /home > > > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with > > > > ordered data mode. Opts: (null) > > > > > > > > > > Ok, thanks for testing it. So I guess we can probably rule out excessive > > > looping in those functions as the issue. > > > > > > To make sure I understand the problem: When systemd tries to do the > > > initial mount of /home (which is an ext4 filesystem), it hangs. But once > > > it drops to the shell, it works, if you do the mount by hand. > > > > > > Is that correct? > > > > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial > > console). > > > > > If so, then is it possible to trigger sysrq commands during the hanging > > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to > > > determine what it's blocking on? > > (trimming the output) > > Thanks. I don't really see anything obvious in that info, > unfortunately. What we really need to do is find the systemd task > performing the mount, and see what it's doing. It's systemd 236.0-2 coming from Arch Linux for ARM. All packages are updated. > We do have one questionable bug in the NFS changes though. Does this > patch help at all? Patches do not change anything (I tried "SQUASH: nfs: fix i_version increment when adding a request" and "SQUASH: ext4: use raw API for xattr inode refcounts"). I tried again regular SDcard-root boot and it succeeded. Only nfsroot fails. Best regards, Krzysztof > > ---8<- > > SQUASH: nfs: fix i_version increment when adding a request > > NFS treats this value as an opaque value with no flag, so we must > increment it as such instead of using inode_inc_iversion. > > Signed-off-by: Jeff Layton> --- > fs/nfs/write.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index a03fbac1f88c..48837b6250e9 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, > struct nfs_page *req) > spin_lock(>private_lock); > if (!nfs_have_writebacks(inode) && > NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE)) > - inode_inc_iversion(inode); > + atomic64_inc(>i_version); > if (likely(!PageSwapCache(req->wb_page))) { > set_bit(PG_MAPPED, >wb_flags); > SetPagePrivate(req->wb_page); > -- > 2.14.3 > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, 2018-01-08 at 14:15 -0500, Jeff Layton wrote: > On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote: > > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote: > > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote: > > > > (...) > > > > > > > Ok, thanks. If you're seeing hangs then that might imply that we have > > > > > some sort of excessive looping going on in the cmpxchg loops. > > > > > > > > > > Could you apply the patch below and let me know if it causes either of > > > > > the warnings to pop? That might at least point us in the right > > > > > direction: > > > > > > > > No new warnings with attached patch (except existing already lockdep: > > > > "INFO: trying to register non-static key."). > > > > > > > > > > Yeah, I saw that in the original logs and it looks unrelated (and > > > harmless). > > > > > > > Systemd timeouts on mounting /home but after entering rescue shell there > > > > is no problem running mount /home: > > > > Give root password for maintenance > > > > (or press Control-D to continue): > > > > root@odroidxu3:~# mount /home > > > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with > > > > ordered data mode. Opts: (null) > > > > > > > > > > Ok, thanks for testing it. So I guess we can probably rule out excessive > > > looping in those functions as the issue. > > > > > > To make sure I understand the problem: When systemd tries to do the > > > initial mount of /home (which is an ext4 filesystem), it hangs. But once > > > it drops to the shell, it works, if you do the mount by hand. > > > > > > Is that correct? > > > > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial > > console). > > > > > If so, then is it possible to trigger sysrq commands during the hanging > > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to > > > determine what it's blocking on? > > (trimming the output) > > Thanks. I don't really see anything obvious in that info, > unfortunately. What we really need to do is find the systemd task > performing the mount, and see what it's doing. > > We do have one questionable bug in the NFS changes though. Does this > patch help at all? > > ---8<- > > SQUASH: nfs: fix i_version increment when adding a request > > NFS treats this value as an opaque value with no flag, so we must > increment it as such instead of using inode_inc_iversion. > > Signed-off-by: Jeff Layton> --- > fs/nfs/write.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index a03fbac1f88c..48837b6250e9 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, > struct nfs_page *req) > spin_lock(>private_lock); > if (!nfs_have_writebacks(inode) && > NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE)) > - inode_inc_iversion(inode); > + atomic64_inc(>i_version); > if (likely(!PageSwapCache(req->wb_page))) { > set_bit(PG_MAPPED, >wb_flags); > SetPagePrivate(req->wb_page); Sorry for the slow dribble of patches. I found a bug in the ext4 code too. Might want to try this ext4 patch as well: --8<- SQUASH: ext4: use raw API for xattr inode refcounts This could distort the values otherwise. (Side Q: Why do we do this across the ctime and iversion like this?) Signed-off-by: Jeff Layton --- fs/ext4/xattr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index ba6fd5439aa4..63656dbafdc4 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -295,13 +295,13 @@ ext4_xattr_inode_hash(struct ext4_sb_info *sbi, const void *buffer, size_t size) static u64 ext4_xattr_inode_get_ref(struct inode *ea_inode) { return ((u64)ea_inode->i_ctime.tv_sec << 32) | - (u32) inode_peek_iversion(ea_inode); + (u32) inode_peek_iversion_raw(ea_inode); } static void ext4_xattr_inode_set_ref(struct inode *ea_inode, u64 ref_count) { ea_inode->i_ctime.tv_sec = (u32)(ref_count >> 32); - inode_set_iversion(ea_inode, ref_count & 0x); + inode_set_iversion_raw(ea_inode, ref_count & 0x); } static u32 ext4_xattr_inode_get_hash(struct inode *ea_inode) -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote: > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote: > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote: > > (...) > > > > > Ok, thanks. If you're seeing hangs then that might imply that we have > > > > some sort of excessive looping going on in the cmpxchg loops. > > > > > > > > Could you apply the patch below and let me know if it causes either of > > > > the warnings to pop? That might at least point us in the right > > > > direction: > > > > > > No new warnings with attached patch (except existing already lockdep: > > > "INFO: trying to register non-static key."). > > > > > > > Yeah, I saw that in the original logs and it looks unrelated (and > > harmless). > > > > > Systemd timeouts on mounting /home but after entering rescue shell there > > > is no problem running mount /home: > > > Give root password for maintenance > > > (or press Control-D to continue): > > > root@odroidxu3:~# mount /home > > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered > > > data mode. Opts: (null) > > > > > > > Ok, thanks for testing it. So I guess we can probably rule out excessive > > looping in those functions as the issue. > > > > To make sure I understand the problem: When systemd tries to do the > > initial mount of /home (which is an ext4 filesystem), it hangs. But once > > it drops to the shell, it works, if you do the mount by hand. > > > > Is that correct? > > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial > console). > > > If so, then is it possible to trigger sysrq commands during the hanging > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to > > determine what it's blocking on? (trimming the output) Thanks. I don't really see anything obvious in that info, unfortunately. What we really need to do is find the systemd task performing the mount, and see what it's doing. We do have one questionable bug in the NFS changes though. Does this patch help at all? ---8<- SQUASH: nfs: fix i_version increment when adding a request NFS treats this value as an opaque value with no flag, so we must increment it as such instead of using inode_inc_iversion. Signed-off-by: Jeff Layton--- fs/nfs/write.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index a03fbac1f88c..48837b6250e9 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req) spin_lock(>private_lock); if (!nfs_have_writebacks(inode) && NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE)) - inode_inc_iversion(inode); + atomic64_inc(>i_version); if (likely(!PageSwapCache(req->wb_page))) { set_bit(PG_MAPPED, >wb_flags); SetPagePrivate(req->wb_page); -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote: > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote: (...) > > > Ok, thanks. If you're seeing hangs then that might imply that we have > > > some sort of excessive looping going on in the cmpxchg loops. > > > > > > Could you apply the patch below and let me know if it causes either of > > > the warnings to pop? That might at least point us in the right > > > direction: > > > > No new warnings with attached patch (except existing already lockdep: > > "INFO: trying to register non-static key."). > > > > Yeah, I saw that in the original logs and it looks unrelated (and > harmless). > > > Systemd timeouts on mounting /home but after entering rescue shell there > > is no problem running mount /home: > > Give root password for maintenance > > (or press Control-D to continue): > > root@odroidxu3:~# mount /home > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered > > data mode. Opts: (null) > > > > Ok, thanks for testing it. So I guess we can probably rule out excessive > looping in those functions as the issue. > > To make sure I understand the problem: When systemd tries to do the > initial mount of /home (which is an ext4 filesystem), it hangs. But once > it drops to the shell, it works, if you do the mount by hand. > > Is that correct? Yes, although it also timeouts on setting up /dev/ttySAC2 (serial console). > If so, then is it possible to trigger sysrq commands during the hanging > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to > determine what it's blocking on? Yes, I have sysrq. Blocked state (w): ## [ *** ] (2 of 4) A start job is running for…v-ttySAC2.device (1min / 1min 30s) *** break sent *** [*** ] (2 of 4) A start job is running for…v-ttySAC2.device (1min / 1min 30s)[ 110.962917] sysrq: SysRq : Show Blocked State [ 110.965931] taskPC stack pid father [ 110.971360] lvm D0 219 1 0x [ 110.976649] [] (__schedule) from [] (schedule+0x4c/0xac) [ 110.983643] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xf0) [ 110.991621] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x80/0xb4) [ 111.25] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x8c/0x94) [ 111.008603] [] (out_of_line_wait_on_bit) from [] (__rpc_execute+0x19c/0x274) [ 111.017370] [] (__rpc_execute) from [] (rpc_run_task+0x134/0x14c) [ 111.025164] [] (rpc_run_task) from [] (nfs4_call_sync_sequence+0x58/0x78) [ 111.033648] [] (nfs4_call_sync_sequence) from [] (nfs4_proc_access+0xf4/0x12c) [ 111.042586] [] (nfs4_proc_access) from [] (nfs_do_access+0x230/0x3e8) [ 111.050716] [] (nfs_do_access) from [] (nfs_permission+0x2a0/0x2c0) [ 111.058694] [] (nfs_permission) from [] (link_path_walk+0x6c/0x508) [ 111.066658] [] (link_path_walk) from [] (path_lookupat+0x8c/0x1ec) [ 111.074541] [] (path_lookupat) from [] (path_openat+0x770/0x964) [ 111.082253] [] (path_openat) from [] (do_filp_open+0x60/0xc4) [ 111.089711] [] (do_filp_open) from [] (do_sys_open+0x114/0x1c4) [ 111.097339] [] (do_sys_open) from [] (ret_fast_syscall+0x0/0x28) [ 111.105036] udevadm D0 259 1 0x [ 111.110482] [] (__schedule) from [] (schedule+0x4c/0xac) [ 111.117508] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xf0) [ 111.125494] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x80/0xb4) [ 111.133903] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x8c/0x94) [ 111.142482] [] (out_of_line_wait_on_bit) from [] (__rpc_execute+0x19c/0x274) [ 111.151236] [] (__rpc_execute) from [] (rpc_run_task+0x134/0x14c) [ 111.159028] [] (rpc_run_task) from [] (nfs4_call_sync_sequence+0x58/0x78) [ 111.167523] [] (nfs4_call_sync_sequence) from [] (nfs4_proc_getattr+0xbc/0xe4) [ 111.176452] [] (nfs4_proc_getattr) from [] (__nfs_revalidate_inode+0x8c/0x130) [ 111.185378] [] (__nfs_revalidate_inode) from [] (nfs_permission+0x16c/0x2c0) [ 111.194132] [] (nfs_permission) from [] (link_path_walk+0x6c/0x508) [ 111.202096] [] (link_path_walk) from [] (path_lookupat+0x8c/0x1ec) [ 111.209980] [] (path_lookupat) from [] (filename_lookup+0x8c/0xe8) [ 111.217865] [] (filename_lookup) from [] (SyS_faccessat+0x9c/0x208) [ 111.225839] [] (SyS_faccessat) from [] (ret_fast_syscall+0x0/0x28) [*** ] (3 of 4) A start job is running for…l-home.device (1min 2s / 1min 30s) ## Another blocked (after few seconds): [ 125.364311] sysrq: SysRq : Show Blocked State [ 125.367292] taskPC stack pid father [ 125.372688] udevadm D0 259 1 0x [ 125.377998] [] (__schedule) from [] (schedule+0x4c/0xac) [ 125.384990] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xf0) [ 125.392961] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x80/0xb4) [ 125.401365] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x8c/0x94) [ 125.409944] []
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote: > On Mon, Jan 08, 2018 at 08:29:24AM -0500, Jeff Layton wrote: > > On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote: > > > On Mon, Jan 8, 2018 at 1:56 PM, Jeff Laytonwrote: > > > > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote: > > > > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton > > > > > wrote: > > > > > > From: Jeff Layton > > > > > > > > > > > > Since i_version is mostly treated as an opaque value, we can > > > > > > exploit that > > > > > > fact to avoid incrementing it when no one is watching. With that > > > > > > change, > > > > > > we can avoid incrementing the counter on writes, unless someone has > > > > > > queried for it since it was last incremented. If the a/c/mtime don't > > > > > > change, and the i_version hasn't changed, then there's no need to > > > > > > dirty > > > > > > the inode metadata on a write. > > > > > > > > > > > > Convert the i_version counter to an atomic64_t, and use the lowest > > > > > > order > > > > > > bit to hold a flag that will tell whether anyone has queried the > > > > > > value > > > > > > since it was last incremented. > > > > > > > > > > > > When we go to maybe increment it, we fetch the value and check the > > > > > > flag > > > > > > bit. If it's clear then we don't need to do anything if the update > > > > > > isn't being forced. > > > > > > > > > > > > If we do need to update, then we increment the counter by 2, and > > > > > > clear > > > > > > the flag bit, and then use a CAS op to swap it into place. If that > > > > > > works, we return true. If it doesn't then do it again with the value > > > > > > that we fetch from the CAS operation. > > > > > > > > > > > > On the query side, if the flag is already set, then we just shift > > > > > > the > > > > > > value down by 1 bit and return it. Otherwise, we set the flag in our > > > > > > on-stack value and again use cmpxchg to swap it into place if it > > > > > > hasn't > > > > > > changed. If it has, then we use the value from the cmpxchg as the > > > > > > new > > > > > > "old" value and try again. > > > > > > > > > > > > This method allows us to avoid incrementing the counter on writes > > > > > > (and > > > > > > dirtying the metadata) under typical workloads. We only need to > > > > > > increment > > > > > > if it has been queried since it was last changed. > > > > > > > > > > > > Signed-off-by: Jeff Layton > > > > > > --- > > > > > > include/linux/fs.h | 2 +- > > > > > > include/linux/iversion.h | 208 > > > > > > ++- > > > > > > 2 files changed, 154 insertions(+), 56 deletions(-) > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot. > > > > > This commit popped up through bisect (log at the end). Systemd > > > > > timeouts on some device-specific services, including mounting ext4 > > > > > /home: > > > > > > > > > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / > > > > > no limit) > > > > > [ TIME ] Timed out waiting for device dev-ttySAC2.device. > > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on > > > > > ttySAC2. > > > > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device > > > > > dev-disk-by\x2dlabel-home.device. > > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home. > > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems. > > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on > > > > > /dev/disk/by-label/home. > > > > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress > > > > > polling (1min 53s / no limit) > > > > > > > > > > Kernel command line: > > > > > console=tty1 console=ttySAC2,115200n8 > > > > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none > > > > > nfsrootdebug root=/dev/nfs > > > > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw > > > > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend > > > > > > > > > > /home is /dev/mmcblk1p2: > > > > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2 > > > > > tune2fs 1.43.7 (16-Oct-2017) > > > > > Filesystem volume name: home > > > > > Last mounted on: /home > > > > > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed > > > > > Filesystem magic number: 0xEF53 > > > > > Filesystem revision #:1 (dynamic) > > > > > Filesystem features: has_journal ext_attr resize_inode dir_index > > > > > filetype needs_recovery extent flex_bg sparse_super large_file > > > > > uninit_bg dir_nlink extra_isize > > > > > Filesystem flags: signed_directory_hash > > > > > Default mount options:user_xattr acl > > > > > Filesystem state: clean > > > > > Errors behavior: Continue > > > > > Filesystem OS type: Linux > > > > >
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, Jan 08, 2018 at 08:29:24AM -0500, Jeff Layton wrote: > On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote: > > On Mon, Jan 8, 2018 at 1:56 PM, Jeff Laytonwrote: > > > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote: > > > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton wrote: > > > > > From: Jeff Layton > > > > > > > > > > Since i_version is mostly treated as an opaque value, we can exploit > > > > > that > > > > > fact to avoid incrementing it when no one is watching. With that > > > > > change, > > > > > we can avoid incrementing the counter on writes, unless someone has > > > > > queried for it since it was last incremented. If the a/c/mtime don't > > > > > change, and the i_version hasn't changed, then there's no need to > > > > > dirty > > > > > the inode metadata on a write. > > > > > > > > > > Convert the i_version counter to an atomic64_t, and use the lowest > > > > > order > > > > > bit to hold a flag that will tell whether anyone has queried the value > > > > > since it was last incremented. > > > > > > > > > > When we go to maybe increment it, we fetch the value and check the > > > > > flag > > > > > bit. If it's clear then we don't need to do anything if the update > > > > > isn't being forced. > > > > > > > > > > If we do need to update, then we increment the counter by 2, and clear > > > > > the flag bit, and then use a CAS op to swap it into place. If that > > > > > works, we return true. If it doesn't then do it again with the value > > > > > that we fetch from the CAS operation. > > > > > > > > > > On the query side, if the flag is already set, then we just shift the > > > > > value down by 1 bit and return it. Otherwise, we set the flag in our > > > > > on-stack value and again use cmpxchg to swap it into place if it > > > > > hasn't > > > > > changed. If it has, then we use the value from the cmpxchg as the new > > > > > "old" value and try again. > > > > > > > > > > This method allows us to avoid incrementing the counter on writes (and > > > > > dirtying the metadata) under typical workloads. We only need to > > > > > increment > > > > > if it has been queried since it was last changed. > > > > > > > > > > Signed-off-by: Jeff Layton > > > > > --- > > > > > include/linux/fs.h | 2 +- > > > > > include/linux/iversion.h | 208 > > > > > ++- > > > > > 2 files changed, 154 insertions(+), 56 deletions(-) > > > > > > > > > > > > > Hi, > > > > > > > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot. > > > > This commit popped up through bisect (log at the end). Systemd > > > > timeouts on some device-specific services, including mounting ext4 > > > > /home: > > > > > > > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no > > > > limit) > > > > [ TIME ] Timed out waiting for device dev-ttySAC2.device. > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2. > > > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device > > > > dev-disk-by\x2dlabel-home.device. > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home. > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems. > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on > > > > /dev/disk/by-label/home. > > > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress > > > > polling (1min 53s / no limit) > > > > > > > > Kernel command line: > > > > console=tty1 console=ttySAC2,115200n8 > > > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none > > > > nfsrootdebug root=/dev/nfs > > > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw > > > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend > > > > > > > > /home is /dev/mmcblk1p2: > > > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2 > > > > tune2fs 1.43.7 (16-Oct-2017) > > > > Filesystem volume name: home > > > > Last mounted on: /home > > > > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed > > > > Filesystem magic number: 0xEF53 > > > > Filesystem revision #:1 (dynamic) > > > > Filesystem features: has_journal ext_attr resize_inode dir_index > > > > filetype needs_recovery extent flex_bg sparse_super large_file > > > > uninit_bg dir_nlink extra_isize > > > > Filesystem flags: signed_directory_hash > > > > Default mount options:user_xattr acl > > > > Filesystem state: clean > > > > Errors behavior: Continue > > > > Filesystem OS type: Linux > > > > Inode count: 1430800 > > > > Block count: 5717760 > > > > Reserved block count: 285888 > > > > Free blocks: 5467576 > > > > Free inodes: 1428301 > > > > First block: 0 > > > > Block size: 4096 > > > > Fragment size:4096 > > > > Reserved GDT
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, 2018-01-08 at 05:30 -0800, Matthew Wilcox wrote: > On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote: > > + cur = inode_peek_iversion_raw(inode); > > + for (;;) { > > + /* If flag is clear then we needn't do anything */ > > + if (!force && !(cur & I_VERSION_QUERIED)) > > + return false; > > + /* Since lowest bit is flag, add 2 to avoid it */ > > + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT; > > Isn't this an extraordinarily complicated way of spelling: > > new = cur + 1; > > We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding > two is going to be the same as adding 1 ... right? > It would be, but if "force" is true, then I_VERSION_QUERIED may not be set. -- Jeff Layton-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote: > On Mon, Jan 8, 2018 at 1:56 PM, Jeff Laytonwrote: > > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote: > > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton wrote: > > > > From: Jeff Layton > > > > > > > > Since i_version is mostly treated as an opaque value, we can exploit > > > > that > > > > fact to avoid incrementing it when no one is watching. With that change, > > > > we can avoid incrementing the counter on writes, unless someone has > > > > queried for it since it was last incremented. If the a/c/mtime don't > > > > change, and the i_version hasn't changed, then there's no need to dirty > > > > the inode metadata on a write. > > > > > > > > Convert the i_version counter to an atomic64_t, and use the lowest order > > > > bit to hold a flag that will tell whether anyone has queried the value > > > > since it was last incremented. > > > > > > > > When we go to maybe increment it, we fetch the value and check the flag > > > > bit. If it's clear then we don't need to do anything if the update > > > > isn't being forced. > > > > > > > > If we do need to update, then we increment the counter by 2, and clear > > > > the flag bit, and then use a CAS op to swap it into place. If that > > > > works, we return true. If it doesn't then do it again with the value > > > > that we fetch from the CAS operation. > > > > > > > > On the query side, if the flag is already set, then we just shift the > > > > value down by 1 bit and return it. Otherwise, we set the flag in our > > > > on-stack value and again use cmpxchg to swap it into place if it hasn't > > > > changed. If it has, then we use the value from the cmpxchg as the new > > > > "old" value and try again. > > > > > > > > This method allows us to avoid incrementing the counter on writes (and > > > > dirtying the metadata) under typical workloads. We only need to > > > > increment > > > > if it has been queried since it was last changed. > > > > > > > > Signed-off-by: Jeff Layton > > > > --- > > > > include/linux/fs.h | 2 +- > > > > include/linux/iversion.h | 208 > > > > ++- > > > > 2 files changed, 154 insertions(+), 56 deletions(-) > > > > > > > > > > Hi, > > > > > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot. > > > This commit popped up through bisect (log at the end). Systemd > > > timeouts on some device-specific services, including mounting ext4 > > > /home: > > > > > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no > > > limit) > > > [ TIME ] Timed out waiting for device dev-ttySAC2.device. > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2. > > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device > > > dev-disk-by\x2dlabel-home.device. > > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home. > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems. > > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on > > > /dev/disk/by-label/home. > > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress > > > polling (1min 53s / no limit) > > > > > > Kernel command line: > > > console=tty1 console=ttySAC2,115200n8 > > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none > > > nfsrootdebug root=/dev/nfs > > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw > > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend > > > > > > /home is /dev/mmcblk1p2: > > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2 > > > tune2fs 1.43.7 (16-Oct-2017) > > > Filesystem volume name: home > > > Last mounted on: /home > > > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed > > > Filesystem magic number: 0xEF53 > > > Filesystem revision #:1 (dynamic) > > > Filesystem features: has_journal ext_attr resize_inode dir_index > > > filetype needs_recovery extent flex_bg sparse_super large_file > > > uninit_bg dir_nlink extra_isize > > > Filesystem flags: signed_directory_hash > > > Default mount options:user_xattr acl > > > Filesystem state: clean > > > Errors behavior: Continue > > > Filesystem OS type: Linux > > > Inode count: 1430800 > > > Block count: 5717760 > > > Reserved block count: 285888 > > > Free blocks: 5467576 > > > Free inodes: 1428301 > > > First block: 0 > > > Block size: 4096 > > > Fragment size:4096 > > > Reserved GDT blocks: 1022 > > > Blocks per group: 32768 > > > Fragments per group: 32768 > > > Inodes per group: 8176 > > > Inode blocks per group: 511 > > > Flex block group size:16 > > > Filesystem created: Thu May 21 12:17:05 2015 > > > Last mount time: Thu Dec 21 13:31:26 2017
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote: > + cur = inode_peek_iversion_raw(inode); > + for (;;) { > + /* If flag is clear then we needn't do anything */ > + if (!force && !(cur & I_VERSION_QUERIED)) > + return false; > + /* Since lowest bit is flag, add 2 to avoid it */ > + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT; Isn't this an extraordinarily complicated way of spelling: new = cur + 1; We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding two is going to be the same as adding 1 ... right? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Mon, Jan 8, 2018 at 1:56 PM, Jeff Laytonwrote: > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote: >> On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton wrote: >> > From: Jeff Layton >> > >> > Since i_version is mostly treated as an opaque value, we can exploit that >> > fact to avoid incrementing it when no one is watching. With that change, >> > we can avoid incrementing the counter on writes, unless someone has >> > queried for it since it was last incremented. If the a/c/mtime don't >> > change, and the i_version hasn't changed, then there's no need to dirty >> > the inode metadata on a write. >> > >> > Convert the i_version counter to an atomic64_t, and use the lowest order >> > bit to hold a flag that will tell whether anyone has queried the value >> > since it was last incremented. >> > >> > When we go to maybe increment it, we fetch the value and check the flag >> > bit. If it's clear then we don't need to do anything if the update >> > isn't being forced. >> > >> > If we do need to update, then we increment the counter by 2, and clear >> > the flag bit, and then use a CAS op to swap it into place. If that >> > works, we return true. If it doesn't then do it again with the value >> > that we fetch from the CAS operation. >> > >> > On the query side, if the flag is already set, then we just shift the >> > value down by 1 bit and return it. Otherwise, we set the flag in our >> > on-stack value and again use cmpxchg to swap it into place if it hasn't >> > changed. If it has, then we use the value from the cmpxchg as the new >> > "old" value and try again. >> > >> > This method allows us to avoid incrementing the counter on writes (and >> > dirtying the metadata) under typical workloads. We only need to increment >> > if it has been queried since it was last changed. >> > >> > Signed-off-by: Jeff Layton >> > --- >> > include/linux/fs.h | 2 +- >> > include/linux/iversion.h | 208 >> > ++- >> > 2 files changed, 154 insertions(+), 56 deletions(-) >> > >> >> Hi, >> >> On recent linux-next my ARM/Exynos boards fail to boot over nfsroot. >> This commit popped up through bisect (log at the end). Systemd >> timeouts on some device-specific services, including mounting ext4 >> /home: >> >> [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no >> limit) >> [ TIME ] Timed out waiting for device dev-ttySAC2.device. >> Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2. >> Jan 07 13:29:38 [ TIME ] Timed out waiting for device >> dev-disk-by\x2dlabel-home.device. >> Jan 07 13:29:38 [DEPEND] Dependency failed for /home. >> Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems. >> Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on >> /dev/disk/by-label/home. >> Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress >> polling (1min 53s / no limit) >> >> Kernel command line: >> console=tty1 console=ttySAC2,115200n8 >> ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none >> nfsrootdebug root=/dev/nfs >> nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw >> smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend >> >> /home is /dev/mmcblk1p2: >> kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2 >> tune2fs 1.43.7 (16-Oct-2017) >> Filesystem volume name: home >> Last mounted on: /home >> Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed >> Filesystem magic number: 0xEF53 >> Filesystem revision #:1 (dynamic) >> Filesystem features: has_journal ext_attr resize_inode dir_index >> filetype needs_recovery extent flex_bg sparse_super large_file >> uninit_bg dir_nlink extra_isize >> Filesystem flags: signed_directory_hash >> Default mount options:user_xattr acl >> Filesystem state: clean >> Errors behavior: Continue >> Filesystem OS type: Linux >> Inode count: 1430800 >> Block count: 5717760 >> Reserved block count: 285888 >> Free blocks: 5467576 >> Free inodes: 1428301 >> First block: 0 >> Block size: 4096 >> Fragment size:4096 >> Reserved GDT blocks: 1022 >> Blocks per group: 32768 >> Fragments per group: 32768 >> Inodes per group: 8176 >> Inode blocks per group: 511 >> Flex block group size:16 >> Filesystem created: Thu May 21 12:17:05 2015 >> Last mount time: Thu Dec 21 13:31:26 2017 >> Last write time: Thu Dec 21 13:31:26 2017 >> Mount count: 1 >> Maximum mount count: -1 >> Last checked: Thu Dec 21 13:31:25 2017 >> Check interval: 0 () >> Lifetime writes: 126 GB >> Reserved blocks uid: 0 (user root) >> Reserved blocks gid: 0 (group root) >> First inode: 11 >> Inode size: 256 >>
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote: > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Laytonwrote: > > From: Jeff Layton > > > > Since i_version is mostly treated as an opaque value, we can exploit that > > fact to avoid incrementing it when no one is watching. With that change, > > we can avoid incrementing the counter on writes, unless someone has > > queried for it since it was last incremented. If the a/c/mtime don't > > change, and the i_version hasn't changed, then there's no need to dirty > > the inode metadata on a write. > > > > Convert the i_version counter to an atomic64_t, and use the lowest order > > bit to hold a flag that will tell whether anyone has queried the value > > since it was last incremented. > > > > When we go to maybe increment it, we fetch the value and check the flag > > bit. If it's clear then we don't need to do anything if the update > > isn't being forced. > > > > If we do need to update, then we increment the counter by 2, and clear > > the flag bit, and then use a CAS op to swap it into place. If that > > works, we return true. If it doesn't then do it again with the value > > that we fetch from the CAS operation. > > > > On the query side, if the flag is already set, then we just shift the > > value down by 1 bit and return it. Otherwise, we set the flag in our > > on-stack value and again use cmpxchg to swap it into place if it hasn't > > changed. If it has, then we use the value from the cmpxchg as the new > > "old" value and try again. > > > > This method allows us to avoid incrementing the counter on writes (and > > dirtying the metadata) under typical workloads. We only need to increment > > if it has been queried since it was last changed. > > > > Signed-off-by: Jeff Layton > > --- > > include/linux/fs.h | 2 +- > > include/linux/iversion.h | 208 > > ++- > > 2 files changed, 154 insertions(+), 56 deletions(-) > > > > Hi, > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot. > This commit popped up through bisect (log at the end). Systemd > timeouts on some device-specific services, including mounting ext4 > /home: > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit) > [ TIME ] Timed out waiting for device dev-ttySAC2.device. > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2. > Jan 07 13:29:38 [ TIME ] Timed out waiting for device > dev-disk-by\x2dlabel-home.device. > Jan 07 13:29:38 [DEPEND] Dependency failed for /home. > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems. > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on > /dev/disk/by-label/home. > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress > polling (1min 53s / no limit) > > Kernel command line: > console=tty1 console=ttySAC2,115200n8 > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none > nfsrootdebug root=/dev/nfs > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend > > /home is /dev/mmcblk1p2: > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2 > tune2fs 1.43.7 (16-Oct-2017) > Filesystem volume name: home > Last mounted on: /home > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed > Filesystem magic number: 0xEF53 > Filesystem revision #:1 (dynamic) > Filesystem features: has_journal ext_attr resize_inode dir_index > filetype needs_recovery extent flex_bg sparse_super large_file > uninit_bg dir_nlink extra_isize > Filesystem flags: signed_directory_hash > Default mount options:user_xattr acl > Filesystem state: clean > Errors behavior: Continue > Filesystem OS type: Linux > Inode count: 1430800 > Block count: 5717760 > Reserved block count: 285888 > Free blocks: 5467576 > Free inodes: 1428301 > First block: 0 > Block size: 4096 > Fragment size:4096 > Reserved GDT blocks: 1022 > Blocks per group: 32768 > Fragments per group: 32768 > Inodes per group: 8176 > Inode blocks per group: 511 > Flex block group size:16 > Filesystem created: Thu May 21 12:17:05 2015 > Last mount time: Thu Dec 21 13:31:26 2017 > Last write time: Thu Dec 21 13:31:26 2017 > Mount count: 1 > Maximum mount count: -1 > Last checked: Thu Dec 21 13:31:25 2017 > Check interval: 0 () > Lifetime writes: 126 GB > Reserved blocks uid: 0 (user root) > Reserved blocks gid: 0 (group root) > First inode: 11 > Inode size: 256 > Required extra isize: 28 > Desired extra isize: 28 > Journal inode:8 > Default directory hash: half_md4 > Directory Hash Seed:
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Fri, Dec 22, 2017 at 1:05 PM, Jeff Laytonwrote: > From: Jeff Layton > > Since i_version is mostly treated as an opaque value, we can exploit that > fact to avoid incrementing it when no one is watching. With that change, > we can avoid incrementing the counter on writes, unless someone has > queried for it since it was last incremented. If the a/c/mtime don't > change, and the i_version hasn't changed, then there's no need to dirty > the inode metadata on a write. > > Convert the i_version counter to an atomic64_t, and use the lowest order > bit to hold a flag that will tell whether anyone has queried the value > since it was last incremented. > > When we go to maybe increment it, we fetch the value and check the flag > bit. If it's clear then we don't need to do anything if the update > isn't being forced. > > If we do need to update, then we increment the counter by 2, and clear > the flag bit, and then use a CAS op to swap it into place. If that > works, we return true. If it doesn't then do it again with the value > that we fetch from the CAS operation. > > On the query side, if the flag is already set, then we just shift the > value down by 1 bit and return it. Otherwise, we set the flag in our > on-stack value and again use cmpxchg to swap it into place if it hasn't > changed. If it has, then we use the value from the cmpxchg as the new > "old" value and try again. > > This method allows us to avoid incrementing the counter on writes (and > dirtying the metadata) under typical workloads. We only need to increment > if it has been queried since it was last changed. > > Signed-off-by: Jeff Layton > --- > include/linux/fs.h | 2 +- > include/linux/iversion.h | 208 > ++- > 2 files changed, 154 insertions(+), 56 deletions(-) > Hi, On recent linux-next my ARM/Exynos boards fail to boot over nfsroot. This commit popped up through bisect (log at the end). Systemd timeouts on some device-specific services, including mounting ext4 /home: [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit) [ TIME ] Timed out waiting for device dev-ttySAC2.device. Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2. Jan 07 13:29:38 [ TIME ] Timed out waiting for device dev-disk-by\x2dlabel-home.device. Jan 07 13:29:38 [DEPEND] Dependency failed for /home. Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems. Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on /dev/disk/by-label/home. Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress polling (1min 53s / no limit) Kernel command line: console=tty1 console=ttySAC2,115200n8 ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none nfsrootdebug root=/dev/nfs nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend /home is /dev/mmcblk1p2: kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2 tune2fs 1.43.7 (16-Oct-2017) Filesystem volume name: home Last mounted on: /home Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options:user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 1430800 Block count: 5717760 Reserved block count: 285888 Free blocks: 5467576 Free inodes: 1428301 First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 1022 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8176 Inode blocks per group: 511 Flex block group size:16 Filesystem created: Thu May 21 12:17:05 2015 Last mount time: Thu Dec 21 13:31:26 2017 Last write time: Thu Dec 21 13:31:26 2017 Mount count: 1 Maximum mount count: -1 Last checked: Thu Dec 21 13:31:25 2017 Check interval: 0 () Lifetime writes: 126 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode:8 Default directory hash: half_md4 Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03 Journal backup: inode blocks Full dmesg log: http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0 The regular boot from rootfs on SD card also fails - but without any serial console logs (just "Starting kernel...") so the real cause is unknown. Any hints?
Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently
On Fri 22-12-17 07:05:56, Jeff Layton wrote: > From: Jeff Layton> > Since i_version is mostly treated as an opaque value, we can exploit that > fact to avoid incrementing it when no one is watching. With that change, > we can avoid incrementing the counter on writes, unless someone has > queried for it since it was last incremented. If the a/c/mtime don't > change, and the i_version hasn't changed, then there's no need to dirty > the inode metadata on a write. > > Convert the i_version counter to an atomic64_t, and use the lowest order > bit to hold a flag that will tell whether anyone has queried the value > since it was last incremented. > > When we go to maybe increment it, we fetch the value and check the flag > bit. If it's clear then we don't need to do anything if the update > isn't being forced. > > If we do need to update, then we increment the counter by 2, and clear > the flag bit, and then use a CAS op to swap it into place. If that > works, we return true. If it doesn't then do it again with the value > that we fetch from the CAS operation. > > On the query side, if the flag is already set, then we just shift the > value down by 1 bit and return it. Otherwise, we set the flag in our > on-stack value and again use cmpxchg to swap it into place if it hasn't > changed. If it has, then we use the value from the cmpxchg as the new > "old" value and try again. > > This method allows us to avoid incrementing the counter on writes (and > dirtying the metadata) under typical workloads. We only need to increment > if it has been queried since it was last changed. > > Signed-off-by: Jeff Layton Looks good to me. You can add: Reviewed-by: Jan Kara Honza > --- > include/linux/fs.h | 2 +- > include/linux/iversion.h | 208 > ++- > 2 files changed, 154 insertions(+), 56 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 76382c24e9d0..6804d075933e 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -639,7 +639,7 @@ struct inode { > struct hlist_head i_dentry; > struct rcu_head i_rcu; > }; > - u64 i_version; > + atomic64_t i_version; > atomic_ti_count; > atomic_ti_dio_count; > atomic_ti_writecount; > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > index e08c634779df..cef242e54489 100644 > --- a/include/linux/iversion.h > +++ b/include/linux/iversion.h > @@ -5,6 +5,8 @@ > #include > > /* > + * The inode->i_version field: > + * --- > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > * appear different to observers if there was a change to the inode's data or > @@ -27,86 +29,171 @@ > * i_version on namespace changes in directories (mkdir, rmdir, unlink, > etc.). > * We consider these sorts of filesystems to have a kernel-managed i_version. > * > + * This implementation uses the low bit in the i_version field as a flag to > + * track when the value has been queried. If it has not been queried since it > + * was last incremented, we can skip the increment in most cases. > + * > + * In the event that we're updating the ctime, we will usually go ahead and > + * bump the i_version anyway. Since that has to go to stable storage in some > + * fashion, we might as well increment it as well. > + * > + * With this implementation, the value should always appear to observers to > + * increase over time if the file has changed. It's recommended to use > + * inode_cmp_iversion() helper to compare values. > + * > * Note that some filesystems (e.g. NFS and AFS) just use the field to store > - * a server-provided value (for the most part). For that reason, those > + * a server-provided value for the most part. For that reason, those > * filesystems do not set SB_I_VERSION. These filesystems are considered to > * have a self-managed i_version. > + * > + * Persistently storing the i_version > + * -- > + * Queries of the i_version field are not gated on them hitting the backing > + * store. It's always possible that the host could crash after allowing > + * a query of the value but before it has made it to disk. > + * > + * To mitigate this problem, filesystems should always use > + * inode_set_iversion_queried when loading an existing inode from disk. This > + * ensures that the next attempted inode increment will result in the value > + * changing. > + * > + * Storing the value to disk therefore does not count as a query, so those > + * filesystems should use inode_peek_iversion to grab the value to be stored. > + * There is no need to flag the value as