Bug#837907: stat() hangs on a particular file
* Salvatore Bonaccorso [2021-05-09 22:37:02 +0200]: > Control: tags -1 + moreinfo > > Do you still can reproduce the issue with a recent kernel? No, I haven't seen it in a while (and never with 4.9 or 4.19 either; nor with 5.10 but that's statistically less significant).
Bug#837907: stat() hangs on a particular file
Control: tags -1 + moreinfo On Thu, Sep 15, 2016 at 01:12:42PM +0200, Sergio Gelato wrote: > Package: linux-image-3.16.0-4-amd64 > Version: 3.16.36-1+deb8u1 > > One of our systems is suddenly unable to stat() a particular file > (cookies.sqlite-wal in a user's Firefox profile). Any attempt to > do so hangs in the system call, as shown by strace. The file resides > on an NFSv4 share (sec=krb5p). Other files in the same directory on > the same share remain accessible. The affected file is normally accessible > on the NFS server and from other NFS clients running the same kernel. > > The user has reported a similar incident yesterday on some directories > on a different NFS share (also sec=krb5p, but hosted on a different > server). He rebooted to clear up the problem. > > I'd like advice on how to troubleshoot this effectively. I've tried > rpcdebug -m {nfs,rpc,nlm} -s all but didn't see any smoking gun; maybe > some information cached by the kernel is suppressing NFS activity > associated with the stat() calls. The log entries I do see say > NFS: nfs_lookup_revalidate(cookies.sqlite-wal) is valid > > Some related kernel traces from this system's logs (in chronological order, > with an intervening reboot; the first trace is associated with yesterday's > incident, the second trace is 2-3 minutes newer than the timestamp on > cookies.sqlite-wal): > > [97483.663949] INFO: task ls:23767 blocked for more than 120 seconds. > [97483.663951] Tainted: PW O 3.16.0-4-amd64 #1 > [97483.663952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [97483.663954] ls D 8800afd3b808 0 23767 1 > 0x0004 > [97483.663957] 8800afd3b3b0 0086 00012f40 > 88039a057fd8 > [97483.663960] 00012f40 8800afd3b3b0 88041ea137f0 > 88041edcd128 > [97483.663962] 0002 8113eb50 88039a057d60 > 88039a057e40 > [97483.663965] Call Trace: > [97483.663968] [] ? wait_on_page_read+0x60/0x60 > [97483.663971] [] ? io_schedule+0x99/0x120 > [97483.663974] [] ? sleep_on_page+0xa/0x10 > [97483.663977] [] ? __wait_on_bit+0x5c/0x90 > [97483.663980] [] ? wait_on_page_bit+0xc6/0xd0 > [97483.663983] [] ? autoremove_wake_function+0x30/0x30 > [97483.663986] [] ? pagevec_lookup_tag+0x1d/0x30 > [97483.663989] [] ? filemap_fdatawait_range+0xd0/0x160 > [97483.663993] [] ? filemap_write_and_wait+0x36/0x50 > [97483.664002] [] ? nfs_getattr+0x108/0x220 [nfs] > [97483.664005] [] ? vfs_fstatat+0x57/0x90 > [97483.664009] [] ? SYSC_newlstat+0x1d/0x40 > [97483.664013] [] ? system_call_fast_compare_end+0x10/0x15 > > [ 9724.415533] INFO: task mozStorage #5:2748 blocked for more than 120 > seconds. > [ 9724.415537] Tainted: PW O 3.16.0-4-amd64 #1 > [ 9724.415538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [ 9724.415539] mozStorage #5 D 8803ee153a88 0 2748 2323 > 0x > [ 9724.415542] 8803ee153630 0082 00012f40 > 8803ee2cffd8 > [ 9724.415544] 00012f40 8803ee153630 88041ea537f0 > 88041edaf7a0 > [ 9724.415545] 0002 a0669800 8803ee2cfc30 > 88040b1152c0 > [ 9724.415547] Call Trace: > [ 9724.415557] [] ? nfs_pageio_doio+0x50/0x50 [nfs] > [ 9724.415560] [] ? io_schedule+0x99/0x120 > [ 9724.415563] [] ? nfs_wait_bit_uninterruptible+0xa/0x10 > [nfs] > [ 9724.415566] [] ? __wait_on_bit+0x5c/0x90 > [ 9724.415568] [] ? internal_add_timer+0x2a/0x70 > [ 9724.415571] [] ? nfs_pageio_doio+0x50/0x50 [nfs] > [ 9724.415573] [] ? out_of_line_wait_on_bit+0x77/0x90 > [ 9724.415575] [] ? autoremove_wake_function+0x30/0x30 > [ 9724.415578] [] ? nfs_updatepage+0x15e/0x830 [nfs] > [ 9724.415582] [] ? nfs_write_end+0x57/0x320 [nfs] > [ 9724.415585] [] ? > iov_iter_copy_from_user_atomic+0x75/0x190 > [ 9724.415588] [] ? generic_perform_write+0x11b/0x1c0 > [ 9724.415590] [] ? __generic_file_write_iter+0x158/0x340 > [ 9724.415592] [] ? generic_file_write_iter+0x39/0xa0 > [ 9724.415595] [] ? nfs_file_write+0x83/0x1a0 [nfs] > [ 9724.415598] [] ? new_sync_write+0x74/0xa0 > [ 9724.415600] [] ? vfs_write+0xb2/0x1f0 > [ 9724.415601] [] ? SyS_write+0x42/0xa0 > [ 9724.415603] [] ? SyS_lseek+0x43/0xa0 > [ 9724.415605] [] ? system_call_fast_compare_end+0x10/0x15 Do you still can reproduce the issue with a recent kernel? Regards, Salvatore
Bug#837907: stat() hangs on a particular file
This has now been seen several times on two different clients (so it probably isn't a hardware issue such as bad RAM; the second client has ECC). The backtraces vary slightly but always involve NFS. I haven't found a way to recover without a reboot. I'm now upgrading these systems to the kernel from jessie-backports; we'll see if the problem still occurs with the newer kernel. I'm wondering if this might be a regression in 3.16.36 since the problem appeared recently. I haven't seen it on Ubuntu's 3.13 kernels (of which we have many instances in production) either. Bisecting is difficult since I still don't have a sure-fire trigger for the bug.
Bug#837907: stat() hangs on a particular file
Package: linux-image-3.16.0-4-amd64 Version: 3.16.36-1+deb8u1 One of our systems is suddenly unable to stat() a particular file (cookies.sqlite-wal in a user's Firefox profile). Any attempt to do so hangs in the system call, as shown by strace. The file resides on an NFSv4 share (sec=krb5p). Other files in the same directory on the same share remain accessible. The affected file is normally accessible on the NFS server and from other NFS clients running the same kernel. The user has reported a similar incident yesterday on some directories on a different NFS share (also sec=krb5p, but hosted on a different server). He rebooted to clear up the problem. I'd like advice on how to troubleshoot this effectively. I've tried rpcdebug -m {nfs,rpc,nlm} -s all but didn't see any smoking gun; maybe some information cached by the kernel is suppressing NFS activity associated with the stat() calls. The log entries I do see say NFS: nfs_lookup_revalidate(cookies.sqlite-wal) is valid Some related kernel traces from this system's logs (in chronological order, with an intervening reboot; the first trace is associated with yesterday's incident, the second trace is 2-3 minutes newer than the timestamp on cookies.sqlite-wal): [97483.663949] INFO: task ls:23767 blocked for more than 120 seconds. [97483.663951] Tainted: PW O 3.16.0-4-amd64 #1 [97483.663952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [97483.663954] ls D 8800afd3b808 0 23767 1 0x0004 [97483.663957] 8800afd3b3b0 0086 00012f40 88039a057fd8 [97483.663960] 00012f40 8800afd3b3b0 88041ea137f0 88041edcd128 [97483.663962] 0002 8113eb50 88039a057d60 88039a057e40 [97483.663965] Call Trace: [97483.663968] [] ? wait_on_page_read+0x60/0x60 [97483.663971] [] ? io_schedule+0x99/0x120 [97483.663974] [] ? sleep_on_page+0xa/0x10 [97483.663977] [] ? __wait_on_bit+0x5c/0x90 [97483.663980] [] ? wait_on_page_bit+0xc6/0xd0 [97483.663983] [] ? autoremove_wake_function+0x30/0x30 [97483.663986] [] ? pagevec_lookup_tag+0x1d/0x30 [97483.663989] [] ? filemap_fdatawait_range+0xd0/0x160 [97483.663993] [] ? filemap_write_and_wait+0x36/0x50 [97483.664002] [] ? nfs_getattr+0x108/0x220 [nfs] [97483.664005] [] ? vfs_fstatat+0x57/0x90 [97483.664009] [] ? SYSC_newlstat+0x1d/0x40 [97483.664013] [] ? system_call_fast_compare_end+0x10/0x15 [ 9724.415533] INFO: task mozStorage #5:2748 blocked for more than 120 seconds. [ 9724.415537] Tainted: PW O 3.16.0-4-amd64 #1 [ 9724.415538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9724.415539] mozStorage #5 D 8803ee153a88 0 2748 2323 0x [ 9724.415542] 8803ee153630 0082 00012f40 8803ee2cffd8 [ 9724.415544] 00012f40 8803ee153630 88041ea537f0 88041edaf7a0 [ 9724.415545] 0002 a0669800 8803ee2cfc30 88040b1152c0 [ 9724.415547] Call Trace: [ 9724.415557] [] ? nfs_pageio_doio+0x50/0x50 [nfs] [ 9724.415560] [] ? io_schedule+0x99/0x120 [ 9724.415563] [] ? nfs_wait_bit_uninterruptible+0xa/0x10 [nfs] [ 9724.415566] [] ? __wait_on_bit+0x5c/0x90 [ 9724.415568] [] ? internal_add_timer+0x2a/0x70 [ 9724.415571] [] ? nfs_pageio_doio+0x50/0x50 [nfs] [ 9724.415573] [] ? out_of_line_wait_on_bit+0x77/0x90 [ 9724.415575] [] ? autoremove_wake_function+0x30/0x30 [ 9724.415578] [] ? nfs_updatepage+0x15e/0x830 [nfs] [ 9724.415582] [] ? nfs_write_end+0x57/0x320 [nfs] [ 9724.415585] [] ? iov_iter_copy_from_user_atomic+0x75/0x190 [ 9724.415588] [] ? generic_perform_write+0x11b/0x1c0 [ 9724.415590] [] ? __generic_file_write_iter+0x158/0x340 [ 9724.415592] [] ? generic_file_write_iter+0x39/0xa0 [ 9724.415595] [] ? nfs_file_write+0x83/0x1a0 [nfs] [ 9724.415598] [] ? new_sync_write+0x74/0xa0 [ 9724.415600] [] ? vfs_write+0xb2/0x1f0 [ 9724.415601] [] ? SyS_write+0x42/0xa0 [ 9724.415603] [] ? SyS_lseek+0x43/0xa0 [ 9724.415605] [] ? system_call_fast_compare_end+0x10/0x15