** Attachment added: "flock_test.py" https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146310/+attachment/5956857/+files/flock_test.py
** Description changed:

NFSv4 client stuck during state recovery on Ubuntu 22.04 (kernel 5.15)

1. Environment

- Client OS: Ubuntu 22.04
- Kernel version: 5.15.0-113-generic
- NFS protocol: NFSv4.0

Mount options:

10.59.62.51:/ on /nfs type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.59.254.244,local_lock=none,addr=10.59.62.51)

------------------------------------------------------------------------

2. Problem Description

We observed two types of abnormal behavior related to NFSv4 client state recovery.

------------------------------------------------------------------------

Case 1: No recovery after NFS4ERR_STALE_CLIENTID

Expected behavior (per RFC 7530):

- Client should re-establish client identity via SETCLIENTID / SETCLIENTID_CONFIRM
- Client should reclaim state (open/lock)

Actual behavior:

- Client keeps retrying normal requests
- No recovery process is triggered
- No SETCLIENTID observed

------------------------------------------------------------------------

Case 2: Client stuck during reclaim after lease expiration

Scenario:

1. Client stops sending RENEW due to a network issue
2. Server considers the lease expired
3. After network recovery:
   - Client sends RENEW
   - Server responds with NFS4ERR_EXPIRED
4. Client starts recovery:
   - SETCLIENTID succeeds
   - SETCLIENTID_CONFIRM succeeds
5. Client enters reclaim phase (with OPEN RPC reclaim=false)

Client gets stuck during the reclaim phase.

Stack trace:

   [<0>] rpc_wait_bit_killable
   [<0>] __rpc_wait_for_completion_task
   [<0>] nfs4_run_open_task
   [<0>] nfs4_open_recover_helper
   [<0>] nfs4_open_recover
   [<0>] nfs4_do_open_expired
   [<0>] nfs40_open_expired
   [<0>] __nfs4_reclaim_open_state
   [<0>] nfs4_reclaim_open_state
   [<0>] nfs4_do_reclaim
   [<0>] nfs4_state_manager

------------------------------------------------------------------------

3. Reproduction Steps

1. Mount NFS filesystem (see above)
2. Run workload scripts (attached to this bug):
   - create_and_open.sh
   - flock_test.py
3. Restart the NFS server during the workload to cause the client lease to expire
4. Issue reproduces reliably

------------------------------------------------------------------------

Additional stack traces

create_and_open.sh

```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] nfs4_do_close+0x2d7/0x380 [nfsv4]
[<0>] __nfs4_close.constprop.0+0x11f/0x1f0 [nfsv4]
[<0>] nfs4_close_sync+0x13/0x20 [nfsv4]
[<0>] nfs4_close_context+0x35/0x60 [nfsv4]
[<0>] __put_nfs_open_context+0xc7/0x150 [nfs]
[<0>] nfs_file_clear_open_context+0x4c/0x60 [nfs]
[<0>] nfs_file_release+0x3e/0x50 [nfs]
[<0>] __fput+0x9c/0x280
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x6a/0xb0
[<0>] exit_to_user_mode_loop+0x157/0x160
[<0>] exit_to_user_mode_prepare+0xa0/0xb0
[<0>] syscall_exit_to_user_mode+0x27/0x50
[<0>] do_syscall_64+0x63/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
```

------------------------------------------------------------------------

flock_test.py

```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] _nfs4_do_setlk+0x290/0x410 [nfsv4]
[<0>] nfs4_proc_setlk+0x78/0x160 [nfsv4]
[<0>] nfs4_retry_setlk+0x1dd/0x250 [nfsv4]
[<0>] nfs4_proc_lock+0x9d/0x1b0 [nfsv4]
[<0>] do_setlk+0x64/0x100 [nfs]
[<0>] nfs_lock+0xb3/0x180 [nfs]
[<0>] do_lock_file_wait+0x4f/0x120
[<0>] fcntl_setlk+0x127/0x2e0
[<0>] do_fcntl+0x4ce/0x5a0
[<0>] __x64_sys_fcntl+0xa9/0xd0
[<0>] x64_sys_call+0x1f5c/0x1fa0
[<0>] do_syscall_64+0x56/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
```

------------------------------------------------------------------------

[10.59.62.51-man]

```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] nfs4_run_open_task+0x152/0x1e0 [nfsv4]
[<0>] nfs4_open_recover_helper+0x155/0x210 [nfsv4]
[<0>] nfs4_open_recover+0x22/0xd0 [nfsv4]
[<0>] nfs4_do_open_reclaim+0x128/0x220 [nfsv4]
[<0>] nfs4_open_reclaim+0x42/0xa0 [nfsv4]
[<0>] __nfs4_reclaim_open_state+0x25/0x110 [nfsv4]
[<0>] nfs4_reclaim_open_state+0xd1/0x2c0 [nfsv4]
[<0>] nfs4_do_reclaim+0x12f/0x230 [nfsv4]
[<0>] nfs4_state_manager+0x6d9/0x870 [nfsv4]
[<0>] nfs4_run_state_manager+0xa8/0x1a0 [nfsv4]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
```

------------------------------------------------------------------------

4. Kernel Version Comparison

Affected:

Ubuntu 22.04   5.15.0-113-generic

Not affected:

Ubuntu 20.04   5.4.0-48-generic
Ubuntu 22.04   6.8.0-60-generic
Ubuntu 24.04   6.8.0-31-generic
CentOS 7.9     4.19.188-10.el7.ucloud.x86_64
CentOS 7.9     3.10.0-1062.9.1.el7.x86_64
CentOS 8.3     4.18.0-240.1.1.el8_3.x86_64

------------------------------------------------------------------------

5. Questions

1. Is it expected that no recovery is triggered after NFS4ERR_STALE_CLIENTID?
2. During reclaim, should OPEN be sent with reclaim=true?
3. Could reclaim=false cause reclaim failure?
4. Why is the client stuck in rpc_wait_bit_killable?
5. Is this a known issue in kernel 5.15?
6. Are there any related patches or fixes?
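The three stack traces above all end in the same two sunrpc wait frames and differ only further down. To make that comparison mechanical, here is a minimal sketch of a parser for the `[<0>] symbol+0xOFF/0xSIZE [module]` frame format shown in the traces. The names `Frame`, `parse_stack` and `shared_leading_frames` are hypothetical helpers written for this illustration; they are not part of the attached scripts or of any kernel tooling.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    symbol: str
    offset: Optional[int]   # offset into the function, None if not printed
    size: Optional[int]     # function size, None if not printed
    module: Optional[str]   # kernel module, None for built-in symbols


# Matches frames such as:
#   [<0>] nfs4_do_close+0x2d7/0x380 [nfsv4]
#   [<0>] __fput+0x9c/0x280
#   [<0>] nfs4_state_manager
_FRAME_RE = re.compile(
    r"\[<0>\]\s+(?P<symbol>[\w.]+)"
    r"(?:\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+))?"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_stack(text: str) -> List[Frame]:
    """Parse every frame found in text, in order (innermost frame first)."""
    frames = []
    for m in _FRAME_RE.finditer(text):
        offset = int(m.group("offset"), 16) if m.group("offset") else None
        size = int(m.group("size"), 16) if m.group("size") else None
        frames.append(Frame(m.group("symbol"), offset, size, m.group("module")))
    return frames


def shared_leading_frames(a: List[Frame], b: List[Frame]) -> List[str]:
    """Return the symbols that two traces have in common, starting from the
    innermost frame and stopping at the first difference."""
    shared = []
    for fa, fb in zip(a, b):
        if fa.symbol != fb.symbol:
            break
        shared.append(fa.symbol)
    return shared
```

Feeding the create_and_open.sh and flock_test.py traces to `parse_stack` and then to `shared_leading_frames` would list the frames both tasks are blocked in; the first frame after that shared prefix shows which NFS operation each task was waiting on.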
--
https://bugs.launchpad.net/bugs/2146310

Title:
  NFSv4 client hang in OPEN reclaim path waiting for RPC completion
