[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server

2023-03-22 Thread AzT3K
*** This bug is a duplicate of bug 2009325 ***
https://bugs.launchpad.net/bugs/2009325

** This bug has been marked a duplicate of bug 2009325
   NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws-5.4 in Ubuntu.
https://bugs.launchpad.net/bugs/2012020

Title:
  Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per
  second per server

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  Confirmed
Status in linux-aws-5.4 package in Ubuntu:
  Confirmed
Status in linux-signed-aws-5.4 package in Ubuntu:
  Confirmed

Bug description:
  We are observing a massive uptick in metadata requests to an NFSv4.1
  (AWS EFS) network drive that is linked to a one or more web servers.
  This started happening about a week ago across a number of stacks on
  AWS. This issue ended up exhausting all the burst credits we had
  accrued and caused a production outage.

  Diagnostics...

  `nfsiostat` shows 50-600 ops/s on the servers where it is happening.

  `nfsstat -m` shows these mount flags: `Flags:
  
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y`

  `nfstrace --mode=live --verbose=2` shows that the same operations are
  happening repeatedly on 2-5 filehandles;

  ```
     CALL  [ operations: 4 tag: void minor version: 1
     [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 
sequenceid: 0xfc12c slotid: 2 cache this: 0 ]
     [ PUTFH(22) [ object: 
839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08
 ]
     [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY 
]  ]
     REPLY [  operations: 4 status: OK tag: void
     [ SEQUENCE(53) [ status: OK session: 
0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest 
slotid: 63 target highest slotid: 63 status flags: 0 ]
     [ PUTFH(22) [ status: OK ]
     [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE  
access: READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA 
TIME_MODIFY ]  ]
  ```

  `lsof -N` shows nothing using files from the NFS drive.

  Unmounting and remounting the disk stops the issue temporarily

  I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet
  of the output is below;

  ```
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: 
nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: 
permission(0:55/1440391441734181492), mask=0x1, res=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: 
nfs_lookup_revalidate_done(/Components) is valid
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> 
nfs41_call_sync_prepare data->seq_server 8a198bb4
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot 
used_slots= highest_used=4294967295 max_slots=64
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot 
used_slots=0001 highest_used=0 slotid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: 
sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 
cache_this=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: 
change attribute=2153
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file 
size=71680
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: 
fsid=(0x0/0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: 
fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: 
fs_locations done, error = 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file 
mode=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: 
rdev=(0x0:0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: 
space used=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: 
atime=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] 
decode_attr_time_metadata: ctime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: 
mtime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] 
decode_attr_mounted_on_fileid: fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel:

[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server

2023-03-21 Thread AzT3K
** Also affects: linux-signed-aws-5.4 (Ubuntu)
   Importance: Undecided
   Status: New

** Changed in: linux-signed-aws-5.4 (Ubuntu)
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws-5.4 in Ubuntu.
https://bugs.launchpad.net/bugs/2012020

Title:
  Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per
  second per server

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  Confirmed
Status in linux-aws-5.4 package in Ubuntu:
  Confirmed
Status in linux-signed-aws-5.4 package in Ubuntu:
  Confirmed

Bug description:
  We are observing a massive uptick in metadata requests to an NFSv4.1
  (AWS EFS) network drive that is linked to a one or more web servers.
  This started happening about a week ago across a number of stacks on
  AWS. This issue ended up exhausting all the burst credits we had
  accrued and caused a production outage.

  Diagnostics...

  `nfsiostat` shows 50-600 ops/s on the servers where it is happening.

  `nfsstat -m` shows these mount flags: `Flags:
  
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y`

  `nfstrace --mode=live --verbose=2` shows that the same operations are
  happening repeatedly on 2-5 filehandles;

  ```
     CALL  [ operations: 4 tag: void minor version: 1
     [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 
sequenceid: 0xfc12c slotid: 2 cache this: 0 ]
     [ PUTFH(22) [ object: 
839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08
 ]
     [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY 
]  ]
     REPLY [  operations: 4 status: OK tag: void
     [ SEQUENCE(53) [ status: OK session: 
0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest 
slotid: 63 target highest slotid: 63 status flags: 0 ]
     [ PUTFH(22) [ status: OK ]
     [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE  
access: READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA 
TIME_MODIFY ]  ]
  ```

  `lsof -N` shows nothing using files from the NFS drive.

  Unmounting and remounting the disk stops the issue temporarily

  I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet
  of the output is below;

  ```
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: 
nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: 
permission(0:55/1440391441734181492), mask=0x1, res=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: 
nfs_lookup_revalidate_done(/Components) is valid
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> 
nfs41_call_sync_prepare data->seq_server 8a198bb4
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot 
used_slots= highest_used=4294967295 max_slots=64
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot 
used_slots=0001 highest_used=0 slotid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: 
sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 
cache_this=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: 
change attribute=2153
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file 
size=71680
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: 
fsid=(0x0/0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: 
fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: 
fs_locations done, error = 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file 
mode=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: 
rdev=(0x0:0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: 
space used=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: 
atime=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] 
decode_attr_time_metadata: ctime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: 
mtime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] 
decode_attr_mounted_on_fileid: fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: 
xdr

[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server

2023-03-21 Thread AzT3K
** Changed in: linux-aws (Ubuntu)
   Status: New => Confirmed

** Changed in: linux-aws-5.4 (Ubuntu)
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws-5.4 in Ubuntu.
https://bugs.launchpad.net/bugs/2012020

Title:
  Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per
  second per server

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  Confirmed
Status in linux-aws-5.4 package in Ubuntu:
  Confirmed

Bug description:
  We are observing a massive uptick in metadata requests to an NFSv4.1
  (AWS EFS) network drive that is linked to a one or more web servers.
  This started happening about a week ago across a number of stacks on
  AWS. This issue ended up exhausting all the burst credits we had
  accrued and caused a production outage.

  Diagnostics...

  `nfsiostat` shows 50-600 ops/s on the servers where it is happening.

  `nfsstat -m` shows these mount flags: `Flags:
  
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y`

  `nfstrace --mode=live --verbose=2` shows that the same operations are
  happening repeatedly on 2-5 filehandles;

  ```
     CALL  [ operations: 4 tag: void minor version: 1
     [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 
sequenceid: 0xfc12c slotid: 2 cache this: 0 ]
     [ PUTFH(22) [ object: 
839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08
 ]
     [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY 
]  ]
     REPLY [  operations: 4 status: OK tag: void
     [ SEQUENCE(53) [ status: OK session: 
0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest 
slotid: 63 target highest slotid: 63 status flags: 0 ]
     [ PUTFH(22) [ status: OK ]
     [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE  
access: READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA 
TIME_MODIFY ]  ]
  ```

  `lsof -N` shows nothing using files from the NFS drive.

  Unmounting and remounting the disk stops the issue temporarily

  I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet
  of the output is below;

  ```
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: 
nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: 
permission(0:55/1440391441734181492), mask=0x1, res=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: 
nfs_lookup_revalidate_done(/Components) is valid
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> 
nfs41_call_sync_prepare data->seq_server 8a198bb4
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot 
used_slots= highest_used=4294967295 max_slots=64
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot 
used_slots=0001 highest_used=0 slotid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: 
sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 
cache_this=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: 
change attribute=2153
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file 
size=71680
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: 
fsid=(0x0/0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: 
fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: 
fs_locations done, error = 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file 
mode=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: 
rdev=(0x0:0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: 
space used=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: 
atime=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] 
decode_attr_time_metadata: ctime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: 
mtime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] 
decode_attr_mounted_on_fileid: fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: 
xdr returned 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213852] decode_getfattr_generic: 
xd

[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server

2023-03-20 Thread AzT3K
** Also affects: linux-aws-5.4 (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: linux-aws (Ubuntu)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws-5.4 in Ubuntu.
https://bugs.launchpad.net/bugs/2012020

Title:
  Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per
  second per server

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  New
Status in linux-aws-5.4 package in Ubuntu:
  New

Bug description:
  We are observing a massive uptick in metadata requests to an NFSv4.1
  (AWS EFS) network drive that is linked to a one or more web servers.
  This started happening about a week ago across a number of stacks on
  AWS. This issue ended up exhausting all the burst credits we had
  accrued and caused a production outage.

  Diagnostics...

  `nfsiostat` shows 50-600 ops/s on the servers where it is happening.

  `nfsstat -m` shows these mount flags: `Flags:
  
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y`

  `nfstrace --mode=live --verbose=2` shows that the same operations are
  happening repeatedly on 2-5 filehandles;

  ```
     CALL  [ operations: 4 tag: void minor version: 1
     [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 
sequenceid: 0xfc12c slotid: 2 cache this: 0 ]
     [ PUTFH(22) [ object: 
839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08
 ]
     [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY 
]  ]
     REPLY [  operations: 4 status: OK tag: void
     [ SEQUENCE(53) [ status: OK session: 
0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest 
slotid: 63 target highest slotid: 63 status flags: 0 ]
     [ PUTFH(22) [ status: OK ]
     [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE  
access: READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA 
TIME_MODIFY ]  ]
  ```

  `lsof -N` shows nothing using files from the NFS drive.

  Unmounting and remounting the disk stops the issue temporarily

  I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet
  of the output is below;

  ```
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: 
nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: 
permission(0:55/1440391441734181492), mask=0x1, res=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: 
nfs_lookup_revalidate_done(/Components) is valid
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> 
nfs41_call_sync_prepare data->seq_server 8a198bb4
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot 
used_slots= highest_used=4294967295 max_slots=64
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot 
used_slots=0001 highest_used=0 slotid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: 
sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 
cache_this=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: 
change attribute=2153
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file 
size=71680
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: 
fsid=(0x0/0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: 
fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: 
fs_locations done, error = 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file 
mode=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: 
rdev=(0x0:0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: 
space used=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: 
atime=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] 
decode_attr_time_metadata: ctime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: 
mtime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] 
decode_attr_mounted_on_fileid: fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: 
xdr returned 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213852] decode_getfa

[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server

2023-03-19 Thread AzT3K
** Description changed:

  We are observing a massive uptick in metadata requests to an NFSv4.1
  (AWS EFS) network drive that is linked to a one or more web servers.
  This started happening about a week ago across a number of stacks on
  AWS. This issue ended up exhausting all the burst credits we had accrued
  and caused a production outage.
  
  Diagnostics...
  
  `nfsiostat` shows 50-600 ops/s on the servers where it is happening.
  
  `nfsstat -m` shows these mount flags: `Flags:
  
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y`
  
  `nfstrace --mode=live --verbose=2` shows that the same operations are
  happening repeatedly on 2-5 filehandles;
  
  ```
     CALL  [ operations: 4 tag: void minor version: 1
     [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 
sequenceid: 0xfc12c slotid: 2 cache this: 0 ]
     [ PUTFH(22) [ object: 
839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08
 ]
     [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY 
]  ]
     REPLY [  operations: 4 status: OK tag: void
     [ SEQUENCE(53) [ status: OK session: 
0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest 
slotid: 63 target highest slotid: 63 status flags: 0 ]
     [ PUTFH(22) [ status: OK ]
     [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE  
access: READ LOOKUP MODIFY EXTEND DELETE  ]
     [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA 
TIME_MODIFY ]  ]
  ```
  
  `lsof -N` shows nothing using files from the NFS drive.
  
  Unmounting and remounting the disk stops the issue temporarily
  
  I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet
  of the output is below;
  
  ```
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: 
nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: 
permission(0:55/1440391441734181492), mask=0x1, res=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: 
nfs_lookup_revalidate_done(/Components) is valid
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> 
nfs41_call_sync_prepare data->seq_server 8a198bb4
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot 
used_slots= highest_used=4294967295 max_slots=64
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot 
used_slots=0001 highest_used=0 slotid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: 
sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 
cache_this=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: 
change attribute=2153
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file 
size=71680
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: 
fsid=(0x0/0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: 
fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: 
fs_locations done, error = 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file 
mode=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: 
rdev=(0x0:0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: 
space used=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: 
atime=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] 
decode_attr_time_metadata: ctime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_attr_time_modify: 
mtime=1677729393
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] 
decode_attr_mounted_on_fileid: fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_getfattr_attrs: 
xdr returned 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213852] decode_getfattr_generic: 
xdr returned 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213853] --> nfs4_alloc_slot 
used_slots=0001 highest_used=0 max_slots=64
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213854] <-- nfs4_alloc_slot 
used_slots=0003 highest_used=1 slotid=1
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213855] nfs4_free_slot: slotid 1 
highest_used_slotid 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213855] nfs41_sequence_process: 
Error 0 free the slot
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213856] nfs4_free_slot: slotid 0 
highest_us

[Kernel-packages] [Bug 2012020] Re: Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server

2023-03-16 Thread AzT3K
** Tags added: apport-collected bionic ec2-images

** Description changed:

  We are observing a massive uptick in metadata requests to an NFSv4.1
  (AWS EFS) network drive that is linked to a one or more web servers.
  This started happening about a week ago across a number of stacks on
  AWS. This issue ended up exhausting all the burst credits we had accrued
  and caused a production outage.
  
  Diagnostics...
  
  `nfsiostat` shows 50-600 ops/s on the servers where it is happening.
  
  `nfsstat -m` shows these mount flags: `Flags:
  
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,fsc,local_lock=none,addr=y.y.y.y`
  
  `nfstrace --mode=live --verbose=2` shows that the same operations are
  happening repeatedly on 2-5 filehandles;
  
  ```
 CALL  [ operations: 4 tag: void minor version: 1
 [ SEQUENCE(53) [ sessionid: 0xeab1201a93b66961 
sequenceid: 0xfc12c slotid: 2 cache this: 0 ]
 [ PUTFH(22) [ object: 
839c9918169aed49bd6f96adab5438bc47b1460162fe4a739cd8c98405868b2b5de303ed1f4ecfa82c26f5ca1db4cce8e63aad38fe72a397bdae51d8a7e93eb118ad8c736b30323165273b9c805db85936cf3be626b6ba4165ecf9755a54fdd174a535e217ffee4fa0166feea6e86becf9fa16280fd877b3545e07fe03aede08
 ]
 [ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE  ]
 [ GETATTR(9) [ mask: 0x1830 CHANGE SIZE TIME_METADATA TIME_MODIFY 
]  ]
 REPLY [  operations: 4 status: OK tag: void
 [ SEQUENCE(53) [ status: OK session: 
0xeab1201a93b66961 sequenceid: 0xfc12c slotid: 2 highest 
slotid: 63 target highest slotid: 63 status flags: 0 ]
 [ PUTFH(22) [ status: OK ]
 [ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE  
access: READ LOOKUP MODIFY EXTEND DELETE  ]
 [ GETATTR(9) [ status: OK mask: 0x1830 CHANGE SIZE TIME_METADATA 
TIME_MODIFY ]  ]
  ```
  
  `lsof -N` shows nothing using files from the NFS drive.
  
  I tried changing some of the mount options to reduce the amount of
  metadata requests going through to the NFS server and found that
  `lazytime` did help - observations are as below;
  
  Server 1:
  checked mount options - using relatime
  checked nfs stats - 460.982 ops/s
  updated mount options
  unmounted and mounted the NFS drive
  checked nfs stats - 1.400 ops/s
  rebooted
  checked nfs stats - 1.316 ops/s
  Server 2:
  checked mount options - using relatime
  checked nfs stats - 390.998 ops/s
  did not updated mount options
  unmounted and mounted the NFS drive
  checked nfs stats - 1.750 ops/s
  rebooted
  checked nfs stats - 531.932 ops/s
  
  Conclusion from this is that;
  
  - Unmounting and remounting the disk stops the issue temporarily
  - `lazytime` option seems to prevent the issue from happening
  - Issue may be boot related
  
  I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet
  of the output is below;
  
  ```
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: 
nfs_update_inode(0:55/1440391441734181492 fh_crc=0xc29dd48a ct=2 info=0x26040)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: 
permission(0:55/1440391441734181492), mask=0x1, res=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: 
nfs_lookup_revalidate_done(/Components) is valid
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> 
nfs41_call_sync_prepare data->seq_server 8a198bb4
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot 
used_slots= highest_used=4294967295 max_slots=64
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot 
used_slots=0001 highest_used=0 slotid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: 
sessionid=1354075206:678036945:0:16777216 seqid=238661 slotid=0 max_slotid=0 
cache_this=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: 
change attribute=2153
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file 
size=71680
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: 
fsid=(0x0/0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: 
fileid=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_fs_locations: 
fs_locations done, error = 0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file 
mode=00
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: 
rdev=(0x0:0x0)
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_space_used: 
space used=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_attr_time_access: 
atime=0
  Mar 17 10:20:41 ip-10-1-2-84 kernel: [