Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-09 Thread Krzysztof Kozlowski
On Mon, Jan 8, 2018 at 10:39 PM, Jeff Layton  wrote:
>>
>
> Got it, that's helpful. Does this patch help (on top of the others) ?
>
> 8<--
>
> SQUASH: nfs: compare raw iversion counter since that's what's
>  being stored
>

Did not apply cleanly (in include/linux/iversion.h) but it was easy to
adjust. However still no improvements. Applied on top of others.

Best regards,
Krzysztof
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Jeff Layton
On Mon, 2018-01-08 at 21:17 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 08, 2018 at 02:15:29PM -0500, Jeff Layton wrote:
> > On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> > > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> > > 
> > > (...)
> > > 
> > > > > > Ok, thanks. If you're seeing hangs then that might imply that we 
> > > > > > have
> > > > > > some sort of excessive looping going on in the cmpxchg loops.
> > > > > > 
> > > > > > Could you apply the patch below and let me know if it causes either 
> > > > > > of
> > > > > > the warnings to pop? That might at least point us in the right
> > > > > > direction:
> > > > > 
> > > > > No new warnings with attached patch (except existing already lockdep:
> > > > > "INFO: trying to register non-static key.").
> > > > > 
> > > > 
> > > > Yeah, I saw that in the original logs and it looks unrelated (and
> > > > harmless).
> > > > 
> > > > > Systemd timeouts on mounting /home but after entering rescue shell 
> > > > > there
> > > > > is no problem running mount /home:
> > > > >   Give root password for maintenance
> > > > >   (or press Control-D to continue): 
> > > > >   root@odroidxu3:~# mount /home
> > > > >   [  220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with 
> > > > > ordered data mode. Opts: (null)
> > > > > 
> > > > 
> > > > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > > > looping in those functions as the issue.
> > > > 
> > > > To make sure I understand the problem: When systemd tries to do the
> > > > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > > > it drops to the shell, it works, if you do the mount by hand.
> > > > 
> > > > Is that correct?
> > > 
> > > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> > > console).
> > > 
> > > > If so, then is it possible to trigger sysrq commands during the hanging
> > > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > > > determine what it's blocking on?
> > 
> > (trimming the output)
> > 
> > Thanks. I don't really see anything obvious in that info,
> > unfortunately. What we really need to do is find the systemd task
> > performing the mount, and see what it's doing.
> 
> It's systemd 236.0-2 coming from Arch Linux for ARM. All packages are
> updated.
> 
> > We do have one questionable bug in the NFS changes though. Does this
> > patch help at all?
> 
> Patches do not change anything (I tried "SQUASH: nfs: fix i_version
> increment when adding a request" and "SQUASH: ext4: use raw API for
> xattr inode refcounts").
> 
> I tried again regular SDcard-root boot and it succeeded. Only nfsroot
> fails.
> 
> Best regards,
> Krzysztof
> 

Got it, that's helpful. Does this patch help (on top of the others) ?

8<--

SQUASH: nfs: compare raw iversion counter since that's what's
 being stored

Signed-off-by: Jeff Layton 
---
 fs/nfs/inode.c   |  6 +++---
 include/linux/iversion.h | 13 +
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 0b85cca1184b..93552c482992 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1289,7 +1289,7 @@ static unsigned long nfs_wcc_update_inode(struct inode 
*inode, struct nfs_fattr
 
if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   && !inode_cmp_iversion(inode, fattr->pre_change_attr)) {
+   && !inode_cmp_iversion_raw(inode, 
fattr->pre_change_attr)) {
inode_set_iversion_raw(inode, fattr->change_attr);
if (S_ISDIR(inode->i_mode))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
@@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode 
*inode, struct nfs_fattr *fat
 
if (!nfs_file_has_buffered_writers(nfsi)) {
/* Verify a few of the more important attributes */
-   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode_cmp_iversion(inode, fattr->change_attr))
+   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode_cmp_iversion_raw(inode, fattr->change_attr))
invalid |= NFS_INO_INVALID_ATTR | 
NFS_INO_REVAL_PAGECACHE;
 
if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && 
!timespec_equal(>i_mtime, >mtime))
@@ -1778,7 +1778,7 @@ static int nfs_update_inode(struct inode *inode, struct 
nfs_fattr *fattr)
 
/* More cache consistency checks */
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
-   if (inode_cmp_iversion(inode, fattr->change_attr)) {
+   if (inode_cmp_iversion_raw(inode, fattr->change_attr)) {
dprintk("NFS: change_attr change on server for file 
%s/%ld\n",
  

Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Krzysztof Kozlowski
On Mon, Jan 08, 2018 at 02:15:29PM -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> > 
> > (...)
> > 
> > > > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > > > some sort of excessive looping going on in the cmpxchg loops.
> > > > > 
> > > > > Could you apply the patch below and let me know if it causes either of
> > > > > the warnings to pop? That might at least point us in the right
> > > > > direction:
> > > > 
> > > > No new warnings with attached patch (except existing already lockdep:
> > > > "INFO: trying to register non-static key.").
> > > > 
> > > 
> > > Yeah, I saw that in the original logs and it looks unrelated (and
> > > harmless).
> > > 
> > > > Systemd timeouts on mounting /home but after entering rescue shell there
> > > > is no problem running mount /home:
> > > > Give root password for maintenance
> > > > (or press Control-D to continue): 
> > > > root@odroidxu3:~# mount /home
> > > > [  220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with 
> > > > ordered data mode. Opts: (null)
> > > > 
> > > 
> > > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > > looping in those functions as the issue.
> > > 
> > > To make sure I understand the problem: When systemd tries to do the
> > > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > > it drops to the shell, it works, if you do the mount by hand.
> > > 
> > > Is that correct?
> > 
> > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> > console).
> > 
> > > If so, then is it possible to trigger sysrq commands during the hanging
> > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > > determine what it's blocking on?
> 
> (trimming the output)
> 
> Thanks. I don't really see anything obvious in that info,
> unfortunately. What we really need to do is find the systemd task
> performing the mount, and see what it's doing.

It's systemd 236.0-2 coming from Arch Linux for ARM. All packages are
updated.

> We do have one questionable bug in the NFS changes though. Does this
> patch help at all?

Patches do not change anything (I tried "SQUASH: nfs: fix i_version
increment when adding a request" and "SQUASH: ext4: use raw API for
xattr inode refcounts").

I tried again regular SDcard-root boot and it succeeded. Only nfsroot
fails.

Best regards,
Krzysztof


> 
> ---8<-
> 
> SQUASH: nfs: fix i_version increment when adding a request
> 
> NFS treats this value as an opaque value with no flag, so we must
> increment it as such instead of using inode_inc_iversion.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/nfs/write.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a03fbac1f88c..48837b6250e9 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, 
> struct nfs_page *req)
>   spin_lock(>private_lock);
>   if (!nfs_have_writebacks(inode) &&
>   NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
> - inode_inc_iversion(inode);
> + atomic64_inc(>i_version);
>   if (likely(!PageSwapCache(req->wb_page))) {
>   set_bit(PG_MAPPED, >wb_flags);
>   SetPagePrivate(req->wb_page);
> -- 
> 2.14.3
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Jeff Layton
On Mon, 2018-01-08 at 14:15 -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> > 
> > (...)
> > 
> > > > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > > > some sort of excessive looping going on in the cmpxchg loops.
> > > > > 
> > > > > Could you apply the patch below and let me know if it causes either of
> > > > > the warnings to pop? That might at least point us in the right
> > > > > direction:
> > > > 
> > > > No new warnings with attached patch (except existing already lockdep:
> > > > "INFO: trying to register non-static key.").
> > > > 
> > > 
> > > Yeah, I saw that in the original logs and it looks unrelated (and
> > > harmless).
> > > 
> > > > Systemd timeouts on mounting /home but after entering rescue shell there
> > > > is no problem running mount /home:
> > > > Give root password for maintenance
> > > > (or press Control-D to continue): 
> > > > root@odroidxu3:~# mount /home
> > > > [  220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with 
> > > > ordered data mode. Opts: (null)
> > > > 
> > > 
> > > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > > looping in those functions as the issue.
> > > 
> > > To make sure I understand the problem: When systemd tries to do the
> > > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > > it drops to the shell, it works, if you do the mount by hand.
> > > 
> > > Is that correct?
> > 
> > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> > console).
> > 
> > > If so, then is it possible to trigger sysrq commands during the hanging
> > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > > determine what it's blocking on?
> 
> (trimming the output)
> 
> Thanks. I don't really see anything obvious in that info,
> unfortunately. What we really need to do is find the systemd task
> performing the mount, and see what it's doing.
> 
> We do have one questionable bug in the NFS changes though. Does this
> patch help at all?
> 
> ---8<-
> 
> SQUASH: nfs: fix i_version increment when adding a request
> 
> NFS treats this value as an opaque value with no flag, so we must
> increment it as such instead of using inode_inc_iversion.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/nfs/write.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a03fbac1f88c..48837b6250e9 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, 
> struct nfs_page *req)
>   spin_lock(>private_lock);
>   if (!nfs_have_writebacks(inode) &&
>   NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
> - inode_inc_iversion(inode);
> + atomic64_inc(>i_version);
>   if (likely(!PageSwapCache(req->wb_page))) {
>   set_bit(PG_MAPPED, >wb_flags);
>   SetPagePrivate(req->wb_page);

Sorry for the slow dribble of patches. I found a bug in the ext4 code
too. Might want to try this ext4 patch as well:

--8<-

SQUASH: ext4: use raw API for xattr inode refcounts

This could distort the values otherwise.

(Side Q: Why do we do this across the ctime and iversion like this?)

Signed-off-by: Jeff Layton 
---
 fs/ext4/xattr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index ba6fd5439aa4..63656dbafdc4 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -295,13 +295,13 @@ ext4_xattr_inode_hash(struct ext4_sb_info *sbi, const 
void *buffer, size_t size)
 static u64 ext4_xattr_inode_get_ref(struct inode *ea_inode)
 {
return ((u64)ea_inode->i_ctime.tv_sec << 32) |
-   (u32) inode_peek_iversion(ea_inode);
+   (u32) inode_peek_iversion_raw(ea_inode);
 }
 
 static void ext4_xattr_inode_set_ref(struct inode *ea_inode, u64 ref_count)
 {
ea_inode->i_ctime.tv_sec = (u32)(ref_count >> 32);
-   inode_set_iversion(ea_inode, ref_count & 0x);
+   inode_set_iversion_raw(ea_inode, ref_count & 0x);
 }
 
 static u32 ext4_xattr_inode_get_hash(struct inode *ea_inode)
-- 
2.14.3



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Jeff Layton
On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> 
> (...)
> 
> > > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > > some sort of excessive looping going on in the cmpxchg loops.
> > > > 
> > > > Could you apply the patch below and let me know if it causes either of
> > > > the warnings to pop? That might at least point us in the right
> > > > direction:
> > > 
> > > No new warnings with attached patch (except existing already lockdep:
> > > "INFO: trying to register non-static key.").
> > > 
> > 
> > Yeah, I saw that in the original logs and it looks unrelated (and
> > harmless).
> > 
> > > Systemd timeouts on mounting /home but after entering rescue shell there
> > > is no problem running mount /home:
> > >   Give root password for maintenance
> > >   (or press Control-D to continue): 
> > >   root@odroidxu3:~# mount /home
> > >   [  220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered 
> > > data mode. Opts: (null)
> > > 
> > 
> > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > looping in those functions as the issue.
> > 
> > To make sure I understand the problem: When systemd tries to do the
> > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > it drops to the shell, it works, if you do the mount by hand.
> > 
> > Is that correct?
> 
> Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> console).
> 
> > If so, then is it possible to trigger sysrq commands during the hanging
> > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > determine what it's blocking on?

(trimming the output)

Thanks. I don't really see anything obvious in that info,
unfortunately. What we really need to do is find the systemd task
performing the mount, and see what it's doing.

We do have one questionable bug in the NFS changes though. Does this
patch help at all?

---8<-

SQUASH: nfs: fix i_version increment when adding a request

NFS treats this value as an opaque value with no flag, so we must
increment it as such instead of using inode_inc_iversion.

Signed-off-by: Jeff Layton 
---
 fs/nfs/write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a03fbac1f88c..48837b6250e9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, 
struct nfs_page *req)
spin_lock(>private_lock);
if (!nfs_have_writebacks(inode) &&
NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
-   inode_inc_iversion(inode);
+   atomic64_inc(>i_version);
if (likely(!PageSwapCache(req->wb_page))) {
set_bit(PG_MAPPED, >wb_flags);
SetPagePrivate(req->wb_page);
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Krzysztof Kozlowski
On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:

(...)

> > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > some sort of excessive looping going on in the cmpxchg loops.
> > > 
> > > Could you apply the patch below and let me know if it causes either of
> > > the warnings to pop? That might at least point us in the right
> > > direction:
> > 
> > No new warnings with attached patch (except existing already lockdep:
> > "INFO: trying to register non-static key.").
> > 
> 
> Yeah, I saw that in the original logs and it looks unrelated (and
> harmless).
> 
> > Systemd timeouts on mounting /home but after entering rescue shell there
> > is no problem running mount /home:
> > Give root password for maintenance
> > (or press Control-D to continue): 
> > root@odroidxu3:~# mount /home
> > [  220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered 
> > data mode. Opts: (null)
> > 
> 
> Ok, thanks for testing it. So I guess we can probably rule out excessive
> looping in those functions as the issue.
> 
> To make sure I understand the problem: When systemd tries to do the
> initial mount of /home (which is an ext4 filesystem), it hangs. But once
> it drops to the shell, it works, if you do the mount by hand.
> 
> Is that correct?

Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
console).

> If so, then is it possible to trigger sysrq commands during the hanging
> mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> determine what it's blocking on?

Yes, I have sysrq. Blocked state (w):
##
[ ***  ] (2 of 4) A start job is running for…v-ttySAC2.device (1min / 1min 30s)
*** break sent ***
[***   ] (2 of 4) A start job is running for…v-ttySAC2.device (1min / 1min 
30s)[  110.962917] sysrq: SysRq : Show Blocked State
[  110.965931]   taskPC stack   pid father
[  110.971360] lvm D0   219  1 0x
[  110.976649] [] (__schedule) from [] (schedule+0x4c/0xac)
[  110.983643] [] (schedule) from [] 
(rpc_wait_bit_killable+0x2c/0xf0)
[  110.991621] [] (rpc_wait_bit_killable) from [] 
(__wait_on_bit+0x80/0xb4)
[  111.25] [] (__wait_on_bit) from [] 
(out_of_line_wait_on_bit+0x8c/0x94)
[  111.008603] [] (out_of_line_wait_on_bit) from [] 
(__rpc_execute+0x19c/0x274)
[  111.017370] [] (__rpc_execute) from [] 
(rpc_run_task+0x134/0x14c)
[  111.025164] [] (rpc_run_task) from [] 
(nfs4_call_sync_sequence+0x58/0x78)
[  111.033648] [] (nfs4_call_sync_sequence) from [] 
(nfs4_proc_access+0xf4/0x12c)
[  111.042586] [] (nfs4_proc_access) from [] 
(nfs_do_access+0x230/0x3e8)
[  111.050716] [] (nfs_do_access) from [] 
(nfs_permission+0x2a0/0x2c0)
[  111.058694] [] (nfs_permission) from [] 
(link_path_walk+0x6c/0x508)
[  111.066658] [] (link_path_walk) from [] 
(path_lookupat+0x8c/0x1ec)
[  111.074541] [] (path_lookupat) from [] 
(path_openat+0x770/0x964)
[  111.082253] [] (path_openat) from [] 
(do_filp_open+0x60/0xc4)
[  111.089711] [] (do_filp_open) from [] 
(do_sys_open+0x114/0x1c4)
[  111.097339] [] (do_sys_open) from [] 
(ret_fast_syscall+0x0/0x28)
[  111.105036] udevadm D0   259  1 0x
[  111.110482] [] (__schedule) from [] (schedule+0x4c/0xac)
[  111.117508] [] (schedule) from [] 
(rpc_wait_bit_killable+0x2c/0xf0)
[  111.125494] [] (rpc_wait_bit_killable) from [] 
(__wait_on_bit+0x80/0xb4)
[  111.133903] [] (__wait_on_bit) from [] 
(out_of_line_wait_on_bit+0x8c/0x94)
[  111.142482] [] (out_of_line_wait_on_bit) from [] 
(__rpc_execute+0x19c/0x274)
[  111.151236] [] (__rpc_execute) from [] 
(rpc_run_task+0x134/0x14c)
[  111.159028] [] (rpc_run_task) from [] 
(nfs4_call_sync_sequence+0x58/0x78)
[  111.167523] [] (nfs4_call_sync_sequence) from [] 
(nfs4_proc_getattr+0xbc/0xe4)
[  111.176452] [] (nfs4_proc_getattr) from [] 
(__nfs_revalidate_inode+0x8c/0x130)
[  111.185378] [] (__nfs_revalidate_inode) from [] 
(nfs_permission+0x16c/0x2c0)
[  111.194132] [] (nfs_permission) from [] 
(link_path_walk+0x6c/0x508)
[  111.202096] [] (link_path_walk) from [] 
(path_lookupat+0x8c/0x1ec)
[  111.209980] [] (path_lookupat) from [] 
(filename_lookup+0x8c/0xe8)
[  111.217865] [] (filename_lookup) from [] 
(SyS_faccessat+0x9c/0x208)
[  111.225839] [] (SyS_faccessat) from [] 
(ret_fast_syscall+0x0/0x28)
[***   ] (3 of 4) A start job is running for…l-home.device (1min 2s / 1min 30s)
##

Another blocked (after few seconds):
[  125.364311] sysrq: SysRq : Show Blocked State
[  125.367292]   taskPC stack   pid father
[  125.372688] udevadm D0   259  1 0x
[  125.377998] [] (__schedule) from [] (schedule+0x4c/0xac)
[  125.384990] [] (schedule) from [] 
(rpc_wait_bit_killable+0x2c/0xf0)
[  125.392961] [] (rpc_wait_bit_killable) from [] 
(__wait_on_bit+0x80/0xb4)
[  125.401365] [] (__wait_on_bit) from [] 
(out_of_line_wait_on_bit+0x8c/0x94)
[  125.409944] [] 

Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Jeff Layton
On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 08, 2018 at 08:29:24AM -0500, Jeff Layton wrote:
> > On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote:
> > > On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton  wrote:
> > > > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> > > > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton  
> > > > > wrote:
> > > > > > From: Jeff Layton 
> > > > > > 
> > > > > > Since i_version is mostly treated as an opaque value, we can 
> > > > > > exploit that
> > > > > > fact to avoid incrementing it when no one is watching. With that 
> > > > > > change,
> > > > > > we can avoid incrementing the counter on writes, unless someone has
> > > > > > queried for it since it was last incremented. If the a/c/mtime don't
> > > > > > change, and the i_version hasn't changed, then there's no need to 
> > > > > > dirty
> > > > > > the inode metadata on a write.
> > > > > > 
> > > > > > Convert the i_version counter to an atomic64_t, and use the lowest 
> > > > > > order
> > > > > > bit to hold a flag that will tell whether anyone has queried the 
> > > > > > value
> > > > > > since it was last incremented.
> > > > > > 
> > > > > > When we go to maybe increment it, we fetch the value and check the 
> > > > > > flag
> > > > > > bit.  If it's clear then we don't need to do anything if the update
> > > > > > isn't being forced.
> > > > > > 
> > > > > > If we do need to update, then we increment the counter by 2, and 
> > > > > > clear
> > > > > > the flag bit, and then use a CAS op to swap it into place. If that
> > > > > > works, we return true. If it doesn't then do it again with the value
> > > > > > that we fetch from the CAS operation.
> > > > > > 
> > > > > > On the query side, if the flag is already set, then we just shift 
> > > > > > the
> > > > > > value down by 1 bit and return it. Otherwise, we set the flag in our
> > > > > > on-stack value and again use cmpxchg to swap it into place if it 
> > > > > > hasn't
> > > > > > changed. If it has, then we use the value from the cmpxchg as the 
> > > > > > new
> > > > > > "old" value and try again.
> > > > > > 
> > > > > > This method allows us to avoid incrementing the counter on writes 
> > > > > > (and
> > > > > > dirtying the metadata) under typical workloads. We only need to 
> > > > > > increment
> > > > > > if it has been queried since it was last changed.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton 
> > > > > > ---
> > > > > >  include/linux/fs.h   |   2 +-
> > > > > >  include/linux/iversion.h | 208 
> > > > > > ++-
> > > > > >  2 files changed, 154 insertions(+), 56 deletions(-)
> > > > > > 
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> > > > > This commit popped up through bisect (log at the end). Systemd
> > > > > timeouts on some device-specific services, including mounting ext4
> > > > > /home:
> > > > > 
> > > > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / 
> > > > > no limit)
> > > > > [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on 
> > > > > ttySAC2.
> > > > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> > > > > dev-disk-by\x2dlabel-home.device.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> > > > > /dev/disk/by-label/home.
> > > > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> > > > > polling (1min 53s / no limit)
> > > > > 
> > > > > Kernel command line:
> > > > > console=tty1 console=ttySAC2,115200n8
> > > > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> > > > > nfsrootdebug root=/dev/nfs
> > > > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> > > > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> > > > > 
> > > > > /home is /dev/mmcblk1p2:
> > > > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> > > > > tune2fs 1.43.7 (16-Oct-2017)
> > > > > Filesystem volume name:   home
> > > > > Last mounted on:  /home
> > > > > Filesystem UUID:  3f9dbeba-2738-45d3-807e-c1b2e21128ed
> > > > > Filesystem magic number:  0xEF53
> > > > > Filesystem revision #:1 (dynamic)
> > > > > Filesystem features:  has_journal ext_attr resize_inode dir_index
> > > > > filetype needs_recovery extent flex_bg sparse_super large_file
> > > > > uninit_bg dir_nlink extra_isize
> > > > > Filesystem flags: signed_directory_hash
> > > > > Default mount options:user_xattr acl
> > > > > Filesystem state: clean
> > > > > Errors behavior:  Continue
> > > > > Filesystem OS type:   Linux
> > > > > 

Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Krzysztof Kozlowski
On Mon, Jan 08, 2018 at 08:29:24AM -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote:
> > On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton  wrote:
> > > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> > > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton  wrote:
> > > > > From: Jeff Layton 
> > > > > 
> > > > > Since i_version is mostly treated as an opaque value, we can exploit 
> > > > > that
> > > > > fact to avoid incrementing it when no one is watching. With that 
> > > > > change,
> > > > > we can avoid incrementing the counter on writes, unless someone has
> > > > > queried for it since it was last incremented. If the a/c/mtime don't
> > > > > change, and the i_version hasn't changed, then there's no need to 
> > > > > dirty
> > > > > the inode metadata on a write.
> > > > > 
> > > > > Convert the i_version counter to an atomic64_t, and use the lowest 
> > > > > order
> > > > > bit to hold a flag that will tell whether anyone has queried the value
> > > > > since it was last incremented.
> > > > > 
> > > > > When we go to maybe increment it, we fetch the value and check the 
> > > > > flag
> > > > > bit.  If it's clear then we don't need to do anything if the update
> > > > > isn't being forced.
> > > > > 
> > > > > If we do need to update, then we increment the counter by 2, and clear
> > > > > the flag bit, and then use a CAS op to swap it into place. If that
> > > > > works, we return true. If it doesn't then do it again with the value
> > > > > that we fetch from the CAS operation.
> > > > > 
> > > > > On the query side, if the flag is already set, then we just shift the
> > > > > value down by 1 bit and return it. Otherwise, we set the flag in our
> > > > > on-stack value and again use cmpxchg to swap it into place if it 
> > > > > hasn't
> > > > > changed. If it has, then we use the value from the cmpxchg as the new
> > > > > "old" value and try again.
> > > > > 
> > > > > This method allows us to avoid incrementing the counter on writes (and
> > > > > dirtying the metadata) under typical workloads. We only need to 
> > > > > increment
> > > > > if it has been queried since it was last changed.
> > > > > 
> > > > > Signed-off-by: Jeff Layton 
> > > > > ---
> > > > >  include/linux/fs.h   |   2 +-
> > > > >  include/linux/iversion.h | 208 
> > > > > ++-
> > > > >  2 files changed, 154 insertions(+), 56 deletions(-)
> > > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> > > > This commit popped up through bisect (log at the end). Systemd
> > > > timeouts on some device-specific services, including mounting ext4
> > > > /home:
> > > > 
> > > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no 
> > > > limit)
> > > > [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> > > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> > > > dev-disk-by\x2dlabel-home.device.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> > > > /dev/disk/by-label/home.
> > > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> > > > polling (1min 53s / no limit)
> > > > 
> > > > Kernel command line:
> > > > console=tty1 console=ttySAC2,115200n8
> > > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> > > > nfsrootdebug root=/dev/nfs
> > > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> > > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> > > > 
> > > > /home is /dev/mmcblk1p2:
> > > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> > > > tune2fs 1.43.7 (16-Oct-2017)
> > > > Filesystem volume name:   home
> > > > Last mounted on:  /home
> > > > Filesystem UUID:  3f9dbeba-2738-45d3-807e-c1b2e21128ed
> > > > Filesystem magic number:  0xEF53
> > > > Filesystem revision #:1 (dynamic)
> > > > Filesystem features:  has_journal ext_attr resize_inode dir_index
> > > > filetype needs_recovery extent flex_bg sparse_super large_file
> > > > uninit_bg dir_nlink extra_isize
> > > > Filesystem flags: signed_directory_hash
> > > > Default mount options:user_xattr acl
> > > > Filesystem state: clean
> > > > Errors behavior:  Continue
> > > > Filesystem OS type:   Linux
> > > > Inode count:  1430800
> > > > Block count:  5717760
> > > > Reserved block count: 285888
> > > > Free blocks:  5467576
> > > > Free inodes:  1428301
> > > > First block:  0
> > > > Block size:   4096
> > > > Fragment size:4096
> > > > Reserved GDT 

Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Jeff Layton
On Mon, 2018-01-08 at 05:30 -0800, Matthew Wilcox wrote:
> On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote:
> > +   cur = inode_peek_iversion_raw(inode);
> > +   for (;;) {
> > +   /* If flag is clear then we needn't do anything */
> > +   if (!force && !(cur & I_VERSION_QUERIED))
> > +   return false;
> > +   /* Since lowest bit is flag, add 2 to avoid it */
> > +   new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> 
> Isn't this an extraordinarily complicated way of spelling:
> 
>   new = cur + 1;
> 
> We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding
> two is going to be the same as adding 1 ... right?
> 

It would be, but if "force" is true, then I_VERSION_QUERIED may not be
set.
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Jeff Layton
On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton  wrote:
> > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton  wrote:
> > > > From: Jeff Layton 
> > > > 
> > > > Since i_version is mostly treated as an opaque value, we can exploit 
> > > > that
> > > > fact to avoid incrementing it when no one is watching. With that change,
> > > > we can avoid incrementing the counter on writes, unless someone has
> > > > queried for it since it was last incremented. If the a/c/mtime don't
> > > > change, and the i_version hasn't changed, then there's no need to dirty
> > > > the inode metadata on a write.
> > > > 
> > > > Convert the i_version counter to an atomic64_t, and use the lowest order
> > > > bit to hold a flag that will tell whether anyone has queried the value
> > > > since it was last incremented.
> > > > 
> > > > When we go to maybe increment it, we fetch the value and check the flag
> > > > bit.  If it's clear then we don't need to do anything if the update
> > > > isn't being forced.
> > > > 
> > > > If we do need to update, then we increment the counter by 2, and clear
> > > > the flag bit, and then use a CAS op to swap it into place. If that
> > > > works, we return true. If it doesn't then do it again with the value
> > > > that we fetch from the CAS operation.
> > > > 
> > > > On the query side, if the flag is already set, then we just shift the
> > > > value down by 1 bit and return it. Otherwise, we set the flag in our
> > > > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > > > changed. If it has, then we use the value from the cmpxchg as the new
> > > > "old" value and try again.
> > > > 
> > > > This method allows us to avoid incrementing the counter on writes (and
> > > > dirtying the metadata) under typical workloads. We only need to 
> > > > increment
> > > > if it has been queried since it was last changed.
> > > > 
> > > > Signed-off-by: Jeff Layton 
> > > > ---
> > > >  include/linux/fs.h   |   2 +-
> > > >  include/linux/iversion.h | 208 
> > > > ++-
> > > >  2 files changed, 154 insertions(+), 56 deletions(-)
> > > > 
> > > 
> > > Hi,
> > > 
> > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> > > This commit popped up through bisect (log at the end). Systemd
> > > timeouts on some device-specific services, including mounting ext4
> > > /home:
> > > 
> > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no 
> > > limit)
> > > [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> > > dev-disk-by\x2dlabel-home.device.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> > > /dev/disk/by-label/home.
> > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> > > polling (1min 53s / no limit)
> > > 
> > > Kernel command line:
> > > console=tty1 console=ttySAC2,115200n8
> > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> > > nfsrootdebug root=/dev/nfs
> > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> > > 
> > > /home is /dev/mmcblk1p2:
> > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> > > tune2fs 1.43.7 (16-Oct-2017)
> > > Filesystem volume name:   home
> > > Last mounted on:  /home
> > > Filesystem UUID:  3f9dbeba-2738-45d3-807e-c1b2e21128ed
> > > Filesystem magic number:  0xEF53
> > > Filesystem revision #:1 (dynamic)
> > > Filesystem features:  has_journal ext_attr resize_inode dir_index
> > > filetype needs_recovery extent flex_bg sparse_super large_file
> > > uninit_bg dir_nlink extra_isize
> > > Filesystem flags: signed_directory_hash
> > > Default mount options:user_xattr acl
> > > Filesystem state: clean
> > > Errors behavior:  Continue
> > > Filesystem OS type:   Linux
> > > Inode count:  1430800
> > > Block count:  5717760
> > > Reserved block count: 285888
> > > Free blocks:  5467576
> > > Free inodes:  1428301
> > > First block:  0
> > > Block size:   4096
> > > Fragment size:4096
> > > Reserved GDT blocks:  1022
> > > Blocks per group: 32768
> > > Fragments per group:  32768
> > > Inodes per group: 8176
> > > Inode blocks per group:   511
> > > Flex block group size:16
> > > Filesystem created:   Thu May 21 12:17:05 2015
> > > Last mount time:  Thu Dec 21 13:31:26 2017

Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Matthew Wilcox
On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote:
> + cur = inode_peek_iversion_raw(inode);
> + for (;;) {
> + /* If flag is clear then we needn't do anything */
> + if (!force && !(cur & I_VERSION_QUERIED))
> + return false;
> + /* Since lowest bit is flag, add 2 to avoid it */
> + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;

Isn't this an extraordinarily complicated way of spelling:

new = cur + 1;

We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding
two is going to be the same as adding 1 ... right?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Krzysztof Kozlowski
On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton  wrote:
> On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
>> On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton  wrote:
>> > From: Jeff Layton 
>> >
>> > Since i_version is mostly treated as an opaque value, we can exploit that
>> > fact to avoid incrementing it when no one is watching. With that change,
>> > we can avoid incrementing the counter on writes, unless someone has
>> > queried for it since it was last incremented. If the a/c/mtime don't
>> > change, and the i_version hasn't changed, then there's no need to dirty
>> > the inode metadata on a write.
>> >
>> > Convert the i_version counter to an atomic64_t, and use the lowest order
>> > bit to hold a flag that will tell whether anyone has queried the value
>> > since it was last incremented.
>> >
>> > When we go to maybe increment it, we fetch the value and check the flag
>> > bit.  If it's clear then we don't need to do anything if the update
>> > isn't being forced.
>> >
>> > If we do need to update, then we increment the counter by 2, and clear
>> > the flag bit, and then use a CAS op to swap it into place. If that
>> > works, we return true. If it doesn't then do it again with the value
>> > that we fetch from the CAS operation.
>> >
>> > On the query side, if the flag is already set, then we just shift the
>> > value down by 1 bit and return it. Otherwise, we set the flag in our
>> > on-stack value and again use cmpxchg to swap it into place if it hasn't
>> > changed. If it has, then we use the value from the cmpxchg as the new
>> > "old" value and try again.
>> >
>> > This method allows us to avoid incrementing the counter on writes (and
>> > dirtying the metadata) under typical workloads. We only need to increment
>> > if it has been queried since it was last changed.
>> >
>> > Signed-off-by: Jeff Layton 
>> > ---
>> >  include/linux/fs.h   |   2 +-
>> >  include/linux/iversion.h | 208 
>> > ++-
>> >  2 files changed, 154 insertions(+), 56 deletions(-)
>> >
>>
>> Hi,
>>
>> On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
>> This commit popped up through bisect (log at the end). Systemd
>> timeouts on some device-specific services, including mounting ext4
>> /home:
>>
>> [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no 
>> limit)
>> [ TIME ] Timed out waiting for device dev-ttySAC2.device.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
>> Jan 07 13:29:38 [ TIME ] Timed out waiting for device
>> dev-disk-by\x2dlabel-home.device.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
>> /dev/disk/by-label/home.
>> Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
>> polling (1min 53s / no limit)
>>
>> Kernel command line:
>> console=tty1 console=ttySAC2,115200n8
>> ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
>> nfsrootdebug root=/dev/nfs
>> nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
>> smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
>>
>> /home is /dev/mmcblk1p2:
>> kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
>> tune2fs 1.43.7 (16-Oct-2017)
>> Filesystem volume name:   home
>> Last mounted on:  /home
>> Filesystem UUID:  3f9dbeba-2738-45d3-807e-c1b2e21128ed
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:1 (dynamic)
>> Filesystem features:  has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery extent flex_bg sparse_super large_file
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags: signed_directory_hash
>> Default mount options:user_xattr acl
>> Filesystem state: clean
>> Errors behavior:  Continue
>> Filesystem OS type:   Linux
>> Inode count:  1430800
>> Block count:  5717760
>> Reserved block count: 285888
>> Free blocks:  5467576
>> Free inodes:  1428301
>> First block:  0
>> Block size:   4096
>> Fragment size:4096
>> Reserved GDT blocks:  1022
>> Blocks per group: 32768
>> Fragments per group:  32768
>> Inodes per group: 8176
>> Inode blocks per group:   511
>> Flex block group size:16
>> Filesystem created:   Thu May 21 12:17:05 2015
>> Last mount time:  Thu Dec 21 13:31:26 2017
>> Last write time:  Thu Dec 21 13:31:26 2017
>> Mount count:  1
>> Maximum mount count:  -1
>> Last checked: Thu Dec 21 13:31:25 2017
>> Check interval:   0 ()
>> Lifetime writes:  126 GB
>> Reserved blocks uid:  0 (user root)
>> Reserved blocks gid:  0 (group root)
>> First inode:  11
>> Inode size:   256
>> 

Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-08 Thread Jeff Layton
On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton  wrote:
> > From: Jeff Layton 
> > 
> > Since i_version is mostly treated as an opaque value, we can exploit that
> > fact to avoid incrementing it when no one is watching. With that change,
> > we can avoid incrementing the counter on writes, unless someone has
> > queried for it since it was last incremented. If the a/c/mtime don't
> > change, and the i_version hasn't changed, then there's no need to dirty
> > the inode metadata on a write.
> > 
> > Convert the i_version counter to an atomic64_t, and use the lowest order
> > bit to hold a flag that will tell whether anyone has queried the value
> > since it was last incremented.
> > 
> > When we go to maybe increment it, we fetch the value and check the flag
> > bit.  If it's clear then we don't need to do anything if the update
> > isn't being forced.
> > 
> > If we do need to update, then we increment the counter by 2, and clear
> > the flag bit, and then use a CAS op to swap it into place. If that
> > works, we return true. If it doesn't then do it again with the value
> > that we fetch from the CAS operation.
> > 
> > On the query side, if the flag is already set, then we just shift the
> > value down by 1 bit and return it. Otherwise, we set the flag in our
> > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > changed. If it has, then we use the value from the cmpxchg as the new
> > "old" value and try again.
> > 
> > This method allows us to avoid incrementing the counter on writes (and
> > dirtying the metadata) under typical workloads. We only need to increment
> > if it has been queried since it was last changed.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/fs.h   |   2 +-
> >  include/linux/iversion.h | 208 
> > ++-
> >  2 files changed, 154 insertions(+), 56 deletions(-)
> > 
> 
> Hi,
> 
> On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> This commit popped up through bisect (log at the end). Systemd
> timeouts on some device-specific services, including mounting ext4
> /home:
> 
> [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
> [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> dev-disk-by\x2dlabel-home.device.
> Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> /dev/disk/by-label/home.
> Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> polling (1min 53s / no limit)
> 
> Kernel command line:
> console=tty1 console=ttySAC2,115200n8
> ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> nfsrootdebug root=/dev/nfs
> nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> 
> /home is /dev/mmcblk1p2:
> kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> tune2fs 1.43.7 (16-Oct-2017)
> Filesystem volume name:   home
> Last mounted on:  /home
> Filesystem UUID:  3f9dbeba-2738-45d3-807e-c1b2e21128ed
> Filesystem magic number:  0xEF53
> Filesystem revision #:1 (dynamic)
> Filesystem features:  has_journal ext_attr resize_inode dir_index
> filetype needs_recovery extent flex_bg sparse_super large_file
> uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options:user_xattr acl
> Filesystem state: clean
> Errors behavior:  Continue
> Filesystem OS type:   Linux
> Inode count:  1430800
> Block count:  5717760
> Reserved block count: 285888
> Free blocks:  5467576
> Free inodes:  1428301
> First block:  0
> Block size:   4096
> Fragment size:4096
> Reserved GDT blocks:  1022
> Blocks per group: 32768
> Fragments per group:  32768
> Inodes per group: 8176
> Inode blocks per group:   511
> Flex block group size:16
> Filesystem created:   Thu May 21 12:17:05 2015
> Last mount time:  Thu Dec 21 13:31:26 2017
> Last write time:  Thu Dec 21 13:31:26 2017
> Mount count:  1
> Maximum mount count:  -1
> Last checked: Thu Dec 21 13:31:25 2017
> Check interval:   0 ()
> Lifetime writes:  126 GB
> Reserved blocks uid:  0 (user root)
> Reserved blocks gid:  0 (group root)
> First inode:  11
> Inode size:   256
> Required extra isize: 28
> Desired extra isize:  28
> Journal inode:8
> Default directory hash:   half_md4
> Directory Hash Seed:  

Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-07 Thread Krzysztof Kozlowski
On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton  wrote:
> From: Jeff Layton 
>
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
>
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
>
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
>
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
>
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
>
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
>
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/fs.h   |   2 +-
>  include/linux/iversion.h | 208 
> ++-
>  2 files changed, 154 insertions(+), 56 deletions(-)
>

Hi,

On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
This commit popped up through bisect (log at the end). Systemd
timeouts on some device-specific services, including mounting ext4
/home:

[ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
[ TIME ] Timed out waiting for device dev-ttySAC2.device.
Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
Jan 07 13:29:38 [ TIME ] Timed out waiting for device
dev-disk-by\x2dlabel-home.device.
Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
/dev/disk/by-label/home.
Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
polling (1min 53s / no limit)

Kernel command line:
console=tty1 console=ttySAC2,115200n8
ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
nfsrootdebug root=/dev/nfs
nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend

/home is /dev/mmcblk1p2:
kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
tune2fs 1.43.7 (16-Oct-2017)
Filesystem volume name:   home
Last mounted on:  /home
Filesystem UUID:  3f9dbeba-2738-45d3-807e-c1b2e21128ed
Filesystem magic number:  0xEF53
Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent flex_bg sparse_super large_file
uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options:user_xattr acl
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
Inode count:  1430800
Block count:  5717760
Reserved block count: 285888
Free blocks:  5467576
Free inodes:  1428301
First block:  0
Block size:   4096
Fragment size:4096
Reserved GDT blocks:  1022
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 8176
Inode blocks per group:   511
Flex block group size:16
Filesystem created:   Thu May 21 12:17:05 2015
Last mount time:  Thu Dec 21 13:31:26 2017
Last write time:  Thu Dec 21 13:31:26 2017
Mount count:  1
Maximum mount count:  -1
Last checked: Thu Dec 21 13:31:25 2017
Check interval:   0 ()
Lifetime writes:  126 GB
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   256
Required extra isize: 28
Desired extra isize:  28
Journal inode:8
Default directory hash:   half_md4
Directory Hash Seed:  42e17e23-86b2-4356-ad63-78aa51651d03
Journal backup:   inode blocks


Full dmesg log:
http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0

The regular boot from rootfs on SD card also fails - but without any
serial console logs (just "Starting kernel...") so the real cause is
unknown.

Any hints?


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-02 Thread Jan Kara
On Fri 22-12-17 07:05:56, Jeff Layton wrote:
> From: Jeff Layton 
> 
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
> 
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
> 
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
> 
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
> 
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
> 
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/fs.h   |   2 +-
>  include/linux/iversion.h | 208 
> ++-
>  2 files changed, 154 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 76382c24e9d0..6804d075933e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -639,7 +639,7 @@ struct inode {
>   struct hlist_head   i_dentry;
>   struct rcu_head i_rcu;
>   };
> - u64 i_version;
> + atomic64_t  i_version;
>   atomic_ti_count;
>   atomic_ti_dio_count;
>   atomic_ti_writecount;
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e08c634779df..cef242e54489 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -5,6 +5,8 @@
>  #include 
>  
>  /*
> + * The inode->i_version field:
> + * ---
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
>   * appear different to observers if there was a change to the inode's data or
> @@ -27,86 +29,171 @@
>   * i_version on namespace changes in directories (mkdir, rmdir, unlink, 
> etc.).
>   * We consider these sorts of filesystems to have a kernel-managed i_version.
>   *
> + * This implementation uses the low bit in the i_version field as a flag to
> + * track when the value has been queried. If it has not been queried since it
> + * was last incremented, we can skip the increment in most cases.
> + *
> + * In the event that we're updating the ctime, we will usually go ahead and
> + * bump the i_version anyway. Since that has to go to stable storage in some
> + * fashion, we might as well increment it as well.
> + *
> + * With this implementation, the value should always appear to observers to
> + * increase over time if the file has changed. It's recommended to use
> + * inode_cmp_iversion() helper to compare values.
> + *
>   * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> - * a server-provided value (for the most part). For that reason, those
> + * a server-provided value for the most part. For that reason, those
>   * filesystems do not set SB_I_VERSION. These filesystems are considered to
>   * have a self-managed i_version.
> + *
> + * Persistently storing the i_version
> + * --
> + * Queries of the i_version field are not gated on them hitting the backing
> + * store. It's always possible that the host could crash after allowing
> + * a query of the value but before it has made it to disk.
> + *
> + * To mitigate this problem, filesystems should always use
> + * inode_set_iversion_queried when loading an existing inode from disk. This
> + * ensures that the next attempted inode increment will result in the value
> + * changing.
> + *
> + * Storing the value to disk therefore does not count as a query, so those
> + * filesystems should use inode_peek_iversion to grab the value to be stored.
> + * There is no need to flag the value as