[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server
*** This bug is a duplicate of bug 2009325 *** https://bugs.launchpad.net/bugs/2009325 ** This bug has been marked a duplicate of bug 2009325 NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74 -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws-5.4 in Ubuntu. https://bugs.launchpad.net/bugs/2012020 Title: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server Status in linux package in Ubuntu: Confirmed Status in linux-aws package in Ubuntu: Confirmed Status in linux-aws-5.4 package in Ubuntu: Confirmed Status in linux-signed-aws-5.4 package in Ubuntu: Confirmed Bug description: We are observing a massive uptick in metadata requests to an NFSv4.1 (AWS EFS) network drive that is linked to a one or more web servers. This started happening about a week ago across a number of stacks on AWS. This issue ended up exhausting all the burst credits we had accrued and caused a production outage. Diagnostics... `nfsiostat` shows 50-600 ops/s on the servers where it is happening. `nfsstat -m` shows these mount flags: `Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y` `nfstrace --mode=live --verbose=2` shows that the same operations are happening repeatedly on 2-5 filehandles; ``` CALL [ operations: 4 tag: void minor version: 1 [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 cache this: 0 ] [ PUTFH(22) [ object: 839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08 ] [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] REPLY [ operations: 4 status: OK tag: void [ SEQUENCE(53) [ status: OK session: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest slotid: 63 target highest slotid: 63 status flags: 0 ] [ PUTFH(22) [ status: OK ] [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE access: READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] ``` `lsof -N` shows nothing using files from the NFS drive. Unmounting and remounting the disk stops the issue temporarily I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet of the output is below; ``` Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: permission(0:55/1440391441734181492), mask=0x1, res=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: nfs_lookup_revalidate_done(/Components) is valid Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> nfs41_call_sync_prepare data->seq_server 8a198bb4 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot used_slots= highest_used=4294967295 max_slots=64 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 cache_this=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: change attribute=2153 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file size=71680 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: fsid=(0x0/0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: fs_locations done, error = 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file mode=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: rdev=(0x0:0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: space used=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: atime=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_metadata: ctime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: mtime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_attr_mounted_on_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel:
[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server
** Also affects: linux-signed-aws-5.4 (Ubuntu) Importance: Undecided Status: New ** Changed in: linux-signed-aws-5.4 (Ubuntu) Status: New => Confirmed -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws-5.4 in Ubuntu. https://bugs.launchpad.net/bugs/2012020 Title: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server Status in linux package in Ubuntu: Confirmed Status in linux-aws package in Ubuntu: Confirmed Status in linux-aws-5.4 package in Ubuntu: Confirmed Status in linux-signed-aws-5.4 package in Ubuntu: Confirmed Bug description: We are observing a massive uptick in metadata requests to an NFSv4.1 (AWS EFS) network drive that is linked to a one or more web servers. This started happening about a week ago across a number of stacks on AWS. This issue ended up exhausting all the burst credits we had accrued and caused a production outage. Diagnostics... `nfsiostat` shows 50-600 ops/s on the servers where it is happening. `nfsstat -m` shows these mount flags: `Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y` `nfstrace --mode=live --verbose=2` shows that the same operations are happening repeatedly on 2-5 filehandles; ``` CALL [ operations: 4 tag: void minor version: 1 [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 cache this: 0 ] [ PUTFH(22) [ object: 839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08 ] [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] REPLY [ operations: 4 status: OK tag: void [ SEQUENCE(53) [ status: OK session: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest slotid: 63 target highest slotid: 63 status flags: 0 ] [ PUTFH(22) [ status: OK ] [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE access: READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] ``` `lsof -N` shows nothing using files from the NFS drive. Unmounting and remounting the disk stops the issue temporarily I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet of the output is below; ``` Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: permission(0:55/1440391441734181492), mask=0x1, res=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: nfs_lookup_revalidate_done(/Components) is valid Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> nfs41_call_sync_prepare data->seq_server 8a198bb4 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot used_slots= highest_used=4294967295 max_slots=64 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 cache_this=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: change attribute=2153 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file size=71680 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: fsid=(0x0/0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: fs_locations done, error = 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file mode=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: rdev=(0x0:0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: space used=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: atime=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_metadata: ctime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: mtime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_attr_mounted_on_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: xdr
[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server
** Changed in: linux-aws (Ubuntu) Status: New => Confirmed ** Changed in: linux-aws-5.4 (Ubuntu) Status: New => Confirmed -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws-5.4 in Ubuntu. https://bugs.launchpad.net/bugs/2012020 Title: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server Status in linux package in Ubuntu: Confirmed Status in linux-aws package in Ubuntu: Confirmed Status in linux-aws-5.4 package in Ubuntu: Confirmed Bug description: We are observing a massive uptick in metadata requests to an NFSv4.1 (AWS EFS) network drive that is linked to a one or more web servers. This started happening about a week ago across a number of stacks on AWS. This issue ended up exhausting all the burst credits we had accrued and caused a production outage. Diagnostics... `nfsiostat` shows 50-600 ops/s on the servers where it is happening. `nfsstat -m` shows these mount flags: `Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y` `nfstrace --mode=live --verbose=2` shows that the same operations are happening repeatedly on 2-5 filehandles; ``` CALL [ operations: 4 tag: void minor version: 1 [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 cache this: 0 ] [ PUTFH(22) [ object: 839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08 ] [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] REPLY [ operations: 4 status: OK tag: void [ SEQUENCE(53) [ status: OK session: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest slotid: 63 target highest slotid: 63 status flags: 0 ] [ PUTFH(22) [ status: OK ] [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE access: READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] ``` `lsof -N` shows nothing using files from the NFS drive. Unmounting and remounting the disk stops the issue temporarily I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet of the output is below; ``` Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: permission(0:55/1440391441734181492), mask=0x1, res=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: nfs_lookup_revalidate_done(/Components) is valid Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> nfs41_call_sync_prepare data->seq_server 8a198bb4 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot used_slots= highest_used=4294967295 max_slots=64 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 cache_this=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: change attribute=2153 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file size=71680 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: fsid=(0x0/0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: fs_locations done, error = 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file mode=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: rdev=(0x0:0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: space used=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: atime=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_metadata: ctime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: mtime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_attr_mounted_on_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: xdr returned 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213852] decode_getfattr_generic: xd
[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server
** Also affects: linux-aws-5.4 (Ubuntu) Importance: Undecided Status: New ** Also affects: linux-aws (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws-5.4 in Ubuntu. https://bugs.launchpad.net/bugs/2012020 Title: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server Status in linux package in Ubuntu: Confirmed Status in linux-aws package in Ubuntu: New Status in linux-aws-5.4 package in Ubuntu: New Bug description: We are observing a massive uptick in metadata requests to an NFSv4.1 (AWS EFS) network drive that is linked to a one or more web servers. This started happening about a week ago across a number of stacks on AWS. This issue ended up exhausting all the burst credits we had accrued and caused a production outage. Diagnostics... `nfsiostat` shows 50-600 ops/s on the servers where it is happening. `nfsstat -m` shows these mount flags: `Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y` `nfstrace --mode=live --verbose=2` shows that the same operations are happening repeatedly on 2-5 filehandles; ``` CALL [ operations: 4 tag: void minor version: 1 [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 cache this: 0 ] [ PUTFH(22) [ object: 839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08 ] [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] REPLY [ operations: 4 status: OK tag: void [ SEQUENCE(53) [ status: OK session: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest slotid: 63 target highest slotid: 63 status flags: 0 ] [ PUTFH(22) [ status: OK ] [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE access: READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] ``` `lsof -N` shows nothing using files from the NFS drive. Unmounting and remounting the disk stops the issue temporarily I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet of the output is below; ``` Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: permission(0:55/1440391441734181492), mask=0x1, res=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: nfs_lookup_revalidate_done(/Components) is valid Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> nfs41_call_sync_prepare data->seq_server 8a198bb4 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot used_slots= highest_used=4294967295 max_slots=64 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 cache_this=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: change attribute=2153 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file size=71680 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: fsid=(0x0/0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: fs_locations done, error = 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file mode=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: rdev=(0x0:0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: space used=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: atime=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_metadata: ctime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: mtime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_attr_mounted_on_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: xdr returned 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213852] decode_getfa
[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server
** Description changed: We are observing a massive uptick in metadata requests to an NFSv4.1 (AWS EFS) network drive that is linked to a one or more web servers. This started happening about a week ago across a number of stacks on AWS. This issue ended up exhausting all the burst credits we had accrued and caused a production outage. Diagnostics... `nfsiostat` shows 50-600 ops/s on the servers where it is happening. `nfsstat -m` shows these mount flags: `Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y` `nfstrace --mode=live --verbose=2` shows that the same operations are happening repeatedly on 2-5 filehandles; ``` CALL [ operations: 4 tag: void minor version: 1 [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 cache this: 0 ] [ PUTFH(22) [ object: 839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08 ] [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] REPLY [ operations: 4 status: OK tag: void [ SEQUENCE(53) [ status: OK session: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest slotid: 63 target highest slotid: 63 status flags: 0 ] [ PUTFH(22) [ status: OK ] [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE access: READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] ``` `lsof -N` shows nothing using files from the NFS drive. Unmounting and remounting the disk stops the issue temporarily I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet of the output is below; ``` Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: permission(0:55/1440391441734181492), mask=0x1, res=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: nfs_lookup_revalidate_done(/Components) is valid Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> nfs41_call_sync_prepare data->seq_server 8a198bb4 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot used_slots= highest_used=4294967295 max_slots=64 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 cache_this=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: change attribute=2153 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file size=71680 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: fsid=(0x0/0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: fs_locations done, error = 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file mode=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: rdev=(0x0:0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: space used=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: atime=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_metadata: ctime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: mtime=1677729393 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_attr_mounted_on_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: xdr returned 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213852] decode_getfattr_generic: xdr returned 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213853] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=64 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213854] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213855] nfs4_free_slot: slotid 1 highest_used_slotid 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213855] nfs41_sequence_process: Error 0 free the slot Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213856] nfs4_free_slot: slotid 0 highest_us
[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server
** Tags added: apport-collected bionic ec2-images ** Description changed: We are observing a massive uptick in metadata requests to an NFSv4.1 (AWS EFS) network drive that is linked to a one or more web servers. This started happening about a week ago across a number of stacks on AWS. This issue ended up exhausting all the burst credits we had accrued and caused a production outage. Diagnostics... `nfsiostat` shows 50-600 ops/s on the servers where it is happening. `nfsstat -m` shows these mount flags: `Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y` `nfstrace --mode=live --verbose=2` shows that the same operations are happening repeatedly on 2-5 filehandles; ``` CALL [ operations: 4 tag: void minor version: 1 [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 cache this: 0 ] [ PUTFH(22) [ object: 839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08 ] [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] REPLY [ operations: 4 status: OK tag: void [ SEQUENCE(53) [ status: OK session: 0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest slotid: 63 target highest slotid: 63 status flags: 0 ] [ PUTFH(22) [ status: OK ] [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE access: READ LOOKUP MODIFY EXTEND DELETE ] [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ] ``` `lsof -N` shows nothing using files from the NFS drive. I tried changing some of the mount options to reduce the amount of metadata requests going through to the NFS server and found that `lazytime` did help - observations are as below; Server 1: checked mount options - using relatime checked nfs stats - 460.982 ops/s updated mount options unmounted and mounted the NFS drive checked nfs stats - 1.400 ops/s rebooted checked nfs stats - 1.316 ops/s Server 2: checked mount options - using relatime checked nfs stats - 390.998 ops/s did not updated mount options unmounted and mounted the NFS drive checked nfs stats - 1.750 ops/s rebooted checked nfs stats - 531.932 ops/s Conclusion from this is that; - Unmounting and remounting the disk stops the issue temporarily - `lazytime` option seems to prevent the issue from happening - Issue may be boot related I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet of the output is below; ``` Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: permission(0:55/1440391441734181492), mask=0x1, res=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: nfs_lookup_revalidate_done(/Components) is valid Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> nfs41_call_sync_prepare data->seq_server 8a198bb4 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot used_slots= highest_used=4294967295 max_slots=64 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 cache_this=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: change attribute=2153 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file size=71680 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: fsid=(0x0/0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: fileid=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: fs_locations done, error = 0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file mode=00 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: rdev=(0x0:0x0) Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: space used=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: atime=0 Mar 17 10:20:41 ip-10-1-2-84 kernel: [