Bug#837907: stat() hangs on a particular file

2021-05-10 Thread Sergio Gelato
* Salvatore Bonaccorso [2021-05-09 22:37:02 +0200]:
> Control: tags -1 + moreinfo
> 
> Do you still can reproduce the issue with a recent kernel?

No, I haven't seen it in a while (and never with 4.9 or 4.19 either; nor with
5.10 but that's statistically less significant).



Bug#837907: stat() hangs on a particular file

2021-05-09 Thread Salvatore Bonaccorso
Control: tags -1 + moreinfo

On Thu, Sep 15, 2016 at 01:12:42PM +0200, Sergio Gelato wrote:
> Package: linux-image-3.16.0-4-amd64
> Version: 3.16.36-1+deb8u1
> 
> One of our systems is suddenly unable to stat() a particular file
> (cookies.sqlite-wal in a user's Firefox profile). Any attempt to
> do so hangs in the system call, as shown by strace. The file resides
> on an NFSv4 share (sec=krb5p). Other files in the same directory on
> the same share remain accessible. The affected file is normally accessible
> on the NFS server and from other NFS clients running the same kernel.
> 
> The user has reported a similar incident yesterday on some directories
> on a different NFS share (also sec=krb5p, but hosted on a different
> server). He rebooted to clear up the problem.
> 
> I'd like advice on how to troubleshoot this effectively. I've tried
> rpcdebug -m {nfs,rpc,nlm} -s all but didn't see any smoking gun; maybe
> some information cached by the kernel is suppressing NFS activity
> associated with the stat() calls. The log entries I do see say
> NFS: nfs_lookup_revalidate(cookies.sqlite-wal) is valid
> 
> Some related kernel traces from this system's logs (in chronological order,
> with an intervening reboot; the first trace is associated with yesterday's
> incident, the second trace is 2-3 minutes newer than the timestamp on
> cookies.sqlite-wal):
> 
> [97483.663949] INFO: task ls:23767 blocked for more than 120 seconds.
> [97483.663951]   Tainted: PW  O  3.16.0-4-amd64 #1
> [97483.663952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [97483.663954] ls  D 8800afd3b808 0 23767  1 
> 0x0004
> [97483.663957]  8800afd3b3b0 0086 00012f40 
> 88039a057fd8
> [97483.663960]  00012f40 8800afd3b3b0 88041ea137f0 
> 88041edcd128
> [97483.663962]  0002 8113eb50 88039a057d60 
> 88039a057e40
> [97483.663965] Call Trace:
> [97483.663968]  [] ? wait_on_page_read+0x60/0x60
> [97483.663971]  [] ? io_schedule+0x99/0x120
> [97483.663974]  [] ? sleep_on_page+0xa/0x10
> [97483.663977]  [] ? __wait_on_bit+0x5c/0x90
> [97483.663980]  [] ? wait_on_page_bit+0xc6/0xd0
> [97483.663983]  [] ? autoremove_wake_function+0x30/0x30
> [97483.663986]  [] ? pagevec_lookup_tag+0x1d/0x30
> [97483.663989]  [] ? filemap_fdatawait_range+0xd0/0x160
> [97483.663993]  [] ? filemap_write_and_wait+0x36/0x50
> [97483.664002]  [] ? nfs_getattr+0x108/0x220 [nfs]
> [97483.664005]  [] ? vfs_fstatat+0x57/0x90
> [97483.664009]  [] ? SYSC_newlstat+0x1d/0x40
> [97483.664013]  [] ? system_call_fast_compare_end+0x10/0x15
> 
> [ 9724.415533] INFO: task mozStorage #5:2748 blocked for more than 120 
> seconds.
> [ 9724.415537]   Tainted: PW  O 3.16.0-4-amd64 #1
> [ 9724.415538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [ 9724.415539] mozStorage #5   D 8803ee153a88 0  2748   2323 
> 0x
> [ 9724.415542]  8803ee153630 0082 00012f40 
> 8803ee2cffd8
> [ 9724.415544]  00012f40 8803ee153630 88041ea537f0 
> 88041edaf7a0
> [ 9724.415545]  0002 a0669800 8803ee2cfc30 
> 88040b1152c0
> [ 9724.415547] Call Trace:
> [ 9724.415557]  [] ? nfs_pageio_doio+0x50/0x50 [nfs]
> [ 9724.415560]  [] ? io_schedule+0x99/0x120
> [ 9724.415563]  [] ? nfs_wait_bit_uninterruptible+0xa/0x10 
> [nfs]
> [ 9724.415566]  [] ? __wait_on_bit+0x5c/0x90
> [ 9724.415568]  [] ? internal_add_timer+0x2a/0x70
> [ 9724.415571]  [] ? nfs_pageio_doio+0x50/0x50 [nfs]
> [ 9724.415573]  [] ? out_of_line_wait_on_bit+0x77/0x90
> [ 9724.415575]  [] ? autoremove_wake_function+0x30/0x30
> [ 9724.415578]  [] ? nfs_updatepage+0x15e/0x830 [nfs]
> [ 9724.415582]  [] ? nfs_write_end+0x57/0x320 [nfs]
> [ 9724.415585]  [] ? 
> iov_iter_copy_from_user_atomic+0x75/0x190
> [ 9724.415588]  [] ? generic_perform_write+0x11b/0x1c0
> [ 9724.415590]  [] ? __generic_file_write_iter+0x158/0x340
> [ 9724.415592]  [] ? generic_file_write_iter+0x39/0xa0
> [ 9724.415595]  [] ? nfs_file_write+0x83/0x1a0 [nfs]
> [ 9724.415598]  [] ? new_sync_write+0x74/0xa0
> [ 9724.415600]  [] ? vfs_write+0xb2/0x1f0
> [ 9724.415601]  [] ? SyS_write+0x42/0xa0
> [ 9724.415603]  [] ? SyS_lseek+0x43/0xa0
> [ 9724.415605]  [] ? system_call_fast_compare_end+0x10/0x15

Do you still can reproduce the issue with a recent kernel?

Regards,
Salvatore



Bug#837907: stat() hangs on a particular file

2016-10-11 Thread Sergio Gelato
This has now been seen several times on two different clients (so it
probably isn't a hardware issue such as bad RAM; the second client has
ECC). The backtraces vary slightly but always involve NFS. I haven't found
a way to recover without a reboot.

I'm now upgrading these systems to the kernel from jessie-backports; we'll
see if the problem still occurs with the newer kernel.

I'm wondering if this might be a regression in 3.16.36 since the problem
appeared recently. I haven't seen it on Ubuntu's 3.13 kernels (of which
we have many instances in production) either.

Bisecting is difficult since I still don't have a sure-fire trigger for the bug.



Bug#837907: stat() hangs on a particular file

2016-09-15 Thread Sergio Gelato
Package: linux-image-3.16.0-4-amd64
Version: 3.16.36-1+deb8u1

One of our systems is suddenly unable to stat() a particular file
(cookies.sqlite-wal in a user's Firefox profile). Any attempt to
do so hangs in the system call, as shown by strace. The file resides
on an NFSv4 share (sec=krb5p). Other files in the same directory on
the same share remain accessible. The affected file is normally accessible
on the NFS server and from other NFS clients running the same kernel.

The user has reported a similar incident yesterday on some directories
on a different NFS share (also sec=krb5p, but hosted on a different
server). He rebooted to clear up the problem.

I'd like advice on how to troubleshoot this effectively. I've tried
rpcdebug -m {nfs,rpc,nlm} -s all but didn't see any smoking gun; maybe
some information cached by the kernel is suppressing NFS activity
associated with the stat() calls. The log entries I do see say
NFS: nfs_lookup_revalidate(cookies.sqlite-wal) is valid

Some related kernel traces from this system's logs (in chronological order,
with an intervening reboot; the first trace is associated with yesterday's
incident, the second trace is 2-3 minutes newer than the timestamp on
cookies.sqlite-wal):

[97483.663949] INFO: task ls:23767 blocked for more than 120 seconds.
[97483.663951]   Tainted: PW  O  3.16.0-4-amd64 #1
[97483.663952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[97483.663954] ls  D 8800afd3b808 0 23767  1 0x0004
[97483.663957]  8800afd3b3b0 0086 00012f40 
88039a057fd8
[97483.663960]  00012f40 8800afd3b3b0 88041ea137f0 
88041edcd128
[97483.663962]  0002 8113eb50 88039a057d60 
88039a057e40
[97483.663965] Call Trace:
[97483.663968]  [] ? wait_on_page_read+0x60/0x60
[97483.663971]  [] ? io_schedule+0x99/0x120
[97483.663974]  [] ? sleep_on_page+0xa/0x10
[97483.663977]  [] ? __wait_on_bit+0x5c/0x90
[97483.663980]  [] ? wait_on_page_bit+0xc6/0xd0
[97483.663983]  [] ? autoremove_wake_function+0x30/0x30
[97483.663986]  [] ? pagevec_lookup_tag+0x1d/0x30
[97483.663989]  [] ? filemap_fdatawait_range+0xd0/0x160
[97483.663993]  [] ? filemap_write_and_wait+0x36/0x50
[97483.664002]  [] ? nfs_getattr+0x108/0x220 [nfs]
[97483.664005]  [] ? vfs_fstatat+0x57/0x90
[97483.664009]  [] ? SYSC_newlstat+0x1d/0x40
[97483.664013]  [] ? system_call_fast_compare_end+0x10/0x15

[ 9724.415533] INFO: task mozStorage #5:2748 blocked for more than 120 seconds.
[ 9724.415537]   Tainted: PW  O 3.16.0-4-amd64 #1
[ 9724.415538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 9724.415539] mozStorage #5   D 8803ee153a88 0  2748   2323 0x
[ 9724.415542]  8803ee153630 0082 00012f40 
8803ee2cffd8
[ 9724.415544]  00012f40 8803ee153630 88041ea537f0 
88041edaf7a0
[ 9724.415545]  0002 a0669800 8803ee2cfc30 
88040b1152c0
[ 9724.415547] Call Trace:
[ 9724.415557]  [] ? nfs_pageio_doio+0x50/0x50 [nfs]
[ 9724.415560]  [] ? io_schedule+0x99/0x120
[ 9724.415563]  [] ? nfs_wait_bit_uninterruptible+0xa/0x10 
[nfs]
[ 9724.415566]  [] ? __wait_on_bit+0x5c/0x90
[ 9724.415568]  [] ? internal_add_timer+0x2a/0x70
[ 9724.415571]  [] ? nfs_pageio_doio+0x50/0x50 [nfs]
[ 9724.415573]  [] ? out_of_line_wait_on_bit+0x77/0x90
[ 9724.415575]  [] ? autoremove_wake_function+0x30/0x30
[ 9724.415578]  [] ? nfs_updatepage+0x15e/0x830 [nfs]
[ 9724.415582]  [] ? nfs_write_end+0x57/0x320 [nfs]
[ 9724.415585]  [] ? iov_iter_copy_from_user_atomic+0x75/0x190
[ 9724.415588]  [] ? generic_perform_write+0x11b/0x1c0
[ 9724.415590]  [] ? __generic_file_write_iter+0x158/0x340
[ 9724.415592]  [] ? generic_file_write_iter+0x39/0xa0
[ 9724.415595]  [] ? nfs_file_write+0x83/0x1a0 [nfs]
[ 9724.415598]  [] ? new_sync_write+0x74/0xa0
[ 9724.415600]  [] ? vfs_write+0xb2/0x1f0
[ 9724.415601]  [] ? SyS_write+0x42/0xa0
[ 9724.415603]  [] ? SyS_lseek+0x43/0xa0
[ 9724.415605]  [] ? system_call_fast_compare_end+0x10/0x15