** Description changed:

- Hi,
+ NFSv4 client stuck during state recovery on Ubuntu 22.04 (kernel 5.15)
- We are seeing an NFSv4.0 client hang on Linux kernel 5.15 (Ubuntu
- 22.04).
+ 1. Environment
- The issue starts when the server returns NFS4ERR_EXPIRED. The client
- then enters recovery, but reclaim never completes.
+ - Client OS: Ubuntu 22.04
+ - Kernel version: 5.15.0-113-generic
+ - NFS protocol: NFSv4.0
- The state manager thread is stuck with the following stack:
+ Mount options: 10.59.62.51:/ on /nfs type nfs4
+ (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.59.254.244,local_lock=none,addr=10.59.62.51)
+
+ ------------------------------------------------------------------------
+
+ 2. Problem Description
+
+ We observed two types of abnormal behavior related to NFSv4 client
+ state recovery.
+
+ ------------------------------------------------------------------------
+
+ Case 1: No recovery after NFS4ERR_STALE_CLIENTID
+
+ Expected behavior (per RFC 7530):
+ - Client should re-establish its client identity via SETCLIENTID / SETCLIENTID_CONFIRM
+ - Client should reclaim state (open/lock)
+
+ Actual behavior:
+ - Client keeps retrying normal requests
+ - No recovery process is triggered
+ - No SETCLIENTID observed
+
+ ------------------------------------------------------------------------
+
+ Case 2: Client stuck during reclaim after lease expiration
+
+ Scenario:
+
+ 1. Client stops sending RENEW due to a network issue
+ 2. Server considers the lease expired
+ 3. After network recovery:
+    - Client sends RENEW
+    - Server responds with NFS4ERR_EXPIRED
+ 4. Client starts recovery:
+    - SETCLIENTID succeeds
+    - SETCLIENTID_CONFIRM succeeds
+ 5. Client enters the reclaim phase (with OPEN rpc reclaim=false)
+
+ The client gets stuck during the reclaim phase.
+
+ Stack trace:
+ [<0>] rpc_wait_bit_killable
+ [<0>] __rpc_wait_for_completion_task
+ [<0>] nfs4_run_open_task
+ [<0>] nfs4_open_recover_helper
+ [<0>] nfs4_open_recover
+ [<0>] nfs4_do_open_expired
+ [<0>] nfs40_open_expired
+ [<0>] __nfs4_reclaim_open_state
+ [<0>] nfs4_reclaim_open_state
+ [<0>] nfs4_do_reclaim
+ [<0>] nfs4_state_manager
+
+ ------------------------------------------------------------------------
+
+ 3. Reproduction Steps
+
+ 1. Mount the NFS filesystem (see above)
+ 2. Run the workload scripts:
+    - create_and_open.sh
+    - flock_test.py
+ 3. Restart the NFS server during the workload to cause the client lease to expire
+ 4. The issue reproduces reliably
+
+ ------------------------------------------------------------------------
+
+ Additional stack traces
+
+ create_and_open.sh
  ```
- rpc_wait_bit_killable
- __rpc_wait_for_completion_task
- nfs4_run_open_task
- nfs4_open_recover_helper
- nfs4_open_recover
- nfs4_do_open_expired
- nfs40_open_expired
- __nfs4_reclaim_open_state
- nfs4_reclaim_open_state
- nfs4_do_reclaim
- nfs4_state_manager
+ [<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
+ [<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
+ [<0>] nfs4_do_close+0x2d7/0x380 [nfsv4]
+ [<0>] __nfs4_close.constprop.0+0x11f/0x1f0 [nfsv4]
+ [<0>] nfs4_close_sync+0x13/0x20 [nfsv4]
+ [<0>] nfs4_close_context+0x35/0x60 [nfsv4]
+ [<0>] __put_nfs_open_context+0xc7/0x150 [nfs]
+ [<0>] nfs_file_clear_open_context+0x4c/0x60 [nfs]
+ [<0>] nfs_file_release+0x3e/0x50 [nfs]
+ [<0>] __fput+0x9c/0x280
+ [<0>] ____fput+0xe/0x20
+ [<0>] task_work_run+0x6a/0xb0
+ [<0>] exit_to_user_mode_loop+0x157/0x160
+ [<0>] exit_to_user_mode_prepare+0xa0/0xb0
+ [<0>] syscall_exit_to_user_mode+0x27/0x50
+ [<0>] do_syscall_64+0x63/0xb0
+ [<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
  ```
- Meanwhile:
- - The server repeatedly returns NFS4ERR_EXPIRED
- - The client does not successfully reclaim state
- - IO continues and repeatedly fails
- RPC stats show:
- - ~30M calls
- - very low retransmissions (94)
+
+ ------------------------------------------------------------------------
- This suggests the issue is unlikely to be caused by network loss or
- server unresponsiveness.
+
+ flock_test.py
+ ```
+ [<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
+ [<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
+ [<0>] _nfs4_do_setlk+0x290/0x410 [nfsv4]
+ [<0>] nfs4_proc_setlk+0x78/0x160 [nfsv4]
+ [<0>] nfs4_retry_setlk+0x1dd/0x250 [nfsv4]
+ [<0>] nfs4_proc_lock+0x9d/0x1b0 [nfsv4]
+ [<0>] do_setlk+0x64/0x100 [nfs]
+ [<0>] nfs_lock+0xb3/0x180 [nfs]
+ [<0>] do_lock_file_wait+0x4f/0x120
+ [<0>] fcntl_setlk+0x127/0x2e0
+ [<0>] do_fcntl+0x4ce/0x5a0
+ [<0>] __x64_sys_fcntl+0xa9/0xd0
+ [<0>] x64_sys_call+0x1f5c/0x1fa0
+ [<0>] do_syscall_64+0x56/0xb0
+ [<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
+ ```
- Additionally, we have verified that:
- - Network connectivity is stable
- - The NFS server is operating normally (no restart or failover observed)
- Importantly:
- - We do observe that RENEW/SEQUENCE-related traffic is being sent from the client
- - However, the client still ends up with an expired lease (NFS4ERR_EXPIRED)
+
+ ------------------------------------------------------------------------
- This raises the question whether the lease renewal is not being properly
- processed or completed on the client side.
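The flock_test.py script from the reproduction steps is attached to the bug rather than quoted in the report. Judging from the fcntl_setlk frames in its stack trace, it is presumably a POSIX-lock loop along the following lines. This is a hypothetical sketch only: the file path, function name, and iteration counts are illustrative assumptions, not taken from the actual attachment.

```python
# Hypothetical reconstruction of a flock_test.py-style workload.
# Despite the script's name, the trace shows fcntl(F_SETLK*) locks:
# fcntl.lockf() goes through fcntl_setlk -> nfs_lock -> nfs4_proc_setlk
# on an NFSv4 mount, matching the stack above.
import fcntl
import os
import tempfile
import time

def lock_loop(path, iterations=100, hold=0.01):
    """Repeatedly take and drop an exclusive POSIX byte-range lock on path."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for _ in range(iterations):
            fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until granted (LOCK on the wire)
            time.sleep(hold)                # hold the lock briefly
            fcntl.lockf(fd, fcntl.LOCK_UN)  # release (LOCKU on the wire)
    finally:
        os.close(fd)

if __name__ == "__main__":
    # In the real reproducer this would target a file on the NFS mount,
    # e.g. /nfs/lock_target; a local temp file lets the sketch run anywhere.
    tmp_fd, target = tempfile.mkstemp()
    os.close(tmp_fd)
    lock_loop(target)
    os.unlink(target)
```

Run two instances against the same file on the mount to also exercise lock contention while the server is restarted.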
- Given that we are using NFSv4.1 (where lease renewal is implicit via
- SEQUENCE), we would like to understand:
+ [10.59.62.51-man]
+ ```
+ [<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
+ [<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
+ [<0>] nfs4_run_open_task+0x152/0x1e0 [nfsv4]
+ [<0>] nfs4_open_recover_helper+0x155/0x210 [nfsv4]
+ [<0>] nfs4_open_recover+0x22/0xd0 [nfsv4]
+ [<0>] nfs4_do_open_reclaim+0x128/0x220 [nfsv4]
+ [<0>] nfs4_open_reclaim+0x42/0xa0 [nfsv4]
+ [<0>] __nfs4_reclaim_open_state+0x25/0x110 [nfsv4]
+ [<0>] nfs4_reclaim_open_state+0xd1/0x2c0 [nfsv4]
+ [<0>] nfs4_do_reclaim+0x12f/0x230 [nfsv4]
+ [<0>] nfs4_state_manager+0x6d9/0x870 [nfsv4]
+ [<0>] nfs4_run_state_manager+0xa8/0x1a0 [nfsv4]
+ [<0>] kthread+0x127/0x150
+ [<0>] ret_from_fork+0x1f/0x30
+ ```
- 1. Under what conditions could the client still hit NFS4ERR_EXPIRED despite ongoing renew/SEQUENCE activity and a healthy server/network?
- 2. Is it possible that RPC completion, session slot handling, or sequence handling issues could prevent the lease from being effectively renewed?
- 3. Could this be a known issue in the NFSv4.1 recovery or session handling path in 5.15?
+
+ ------------------------------------------------------------------------
- It appears the client is stuck in the OPEN reclaim path waiting for RPC
- completion, and recovery cannot make forward progress.
+
+ 4. Kernel Version Comparison
- Are there known fixes or patches in newer kernels (e.g., 5.19 or 6.x)
- that address this class of issue?
+
+ Affected:
+ Ubuntu 22.04    5.15.0-113-generic
- Any pointers or suggestions would be greatly appreciated.
+
+ Not affected:
+ Ubuntu 20.04    5.4.0-48-generic
+ Ubuntu 22.04    6.8.0-60-generic
+ Ubuntu 24.04    6.8.0-31-generic
+ CentOS 7.9      4.19.188-10.el7.ucloud.x86_64
+ CentOS 7.9      3.10.0-1062.9.1.el7.x86_64
+ CentOS 8.3      4.18.0-240.1.1.el8_3.x86_64
- Thanks
+
+ ------------------------------------------------------------------------
+
+ 5. Questions
+
+ 1. Is it expected that no recovery is triggered after
+    NFS4ERR_STALE_CLIENTID?
+ 2. During reclaim, should OPEN be sent with reclaim=true?
+ 3. Could reclaim=false cause the reclaim to fail?
+ 4. Why is the client stuck in rpc_wait_bit_killable?
+ 5. Is this a known issue in kernel 5.15?
+ 6. Are there any related patches or fixes?
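The attached create_and_open.sh is likewise not quoted in the report. A rough Python equivalent of that kind of OPEN/CLOSE churn might look like the sketch below; the behavior is inferred from the script's name and the nfs4_do_close trace, so directory layout, file names, and counts are assumptions rather than the attachment's contents.

```python
# Hypothetical sketch of a create_and_open.sh-style workload: churn
# NFSv4 OPEN/CLOSE state by creating, reopening, and removing files.
# Releasing the last open context is what drives nfs_file_release ->
# nfs4_do_close, the path shown in the create_and_open.sh trace above.
import os
import tempfile

def create_and_open(directory, count=100):
    for i in range(count):
        path = os.path.join(directory, f"probe_{i}.tmp")
        with open(path, "w") as f:   # OPEN(CREATE) + WRITE on the wire
            f.write("x" * 128)
        with open(path) as f:        # plain OPEN + READ
            f.read()                 # CLOSE when the context is released
        os.unlink(path)              # REMOVE

if __name__ == "__main__":
    # In the real reproducer this would point at the NFS mount (e.g. /nfs);
    # a local temp directory lets the sketch run anywhere.
    create_and_open(tempfile.mkdtemp())
```

Restarting the NFS server while such a loop runs leaves open state to reclaim, which is what triggers the hang described in Case 2.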
** Attachment added: "create_and_open.sh"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146310/+attachment/5956856/+files/create_and_open.sh

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146310

Title:
  NFSv4 client hang in OPEN reclaim path waiting for RPC completion

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146310/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
