Public bug reported:

We're running a server at AWS which collects data from machines over
CIFS.  This involves a lot of mounting and unmounting of CIFS shares
(about 100 targets with 2 shares each, with a delay of 10 in between).
The targets might sometimes become unavailable when they are turned off
for the weekend or rebooted.

The server doing this has to be rebooted every few hours because CIFS
connections start to hang and don't recover. The usual symptom is:

Jul 24 10:12:59 connector kernel: [ 7765.705409] CIFS: Attempting to mount //172.22.2.112/Meldung
Jul 24 10:13:01 connector kernel: [ 7767.689258] CIFS: Attempting to mount //172.22.2.112/Wartung
Jul 24 10:13:06 connector kernel: [ 7772.758283] CIFS: Attempting to mount //172.30.113.108/Meldung
Jul 24 10:13:06 connector kernel: [ 7773.300475] CIFS: Attempting to mount //172.30.113.108/Wartung
Jul 24 10:13:09 connector kernel: [ 7776.364516] CIFS: Attempting to mount //172.30.99.55/Meldung
Jul 24 10:13:11 connector kernel: [ 7777.978731] CIFS: Attempting to mount //172.30.99.55/Wartung
[...]
Jul 24 10:16:13 connector kernel: [ 7960.390529] CIFS VFS: \\172.30.113.108 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:15 connector kernel: [ 7962.468649] CIFS VFS: \\172.30.93.171 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:18 connector kernel: [ 7964.999037] CIFS VFS: \\172.30.99.55 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:31 connector kernel: [ 7977.798821] INFO: task cifsd:26252 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.803730]       Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 24 10:16:31 connector kernel: [ 7977.808526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 10:16:31 connector kernel: [ 7977.820291] cifsd           D    0 26252      2 0x80004000
Jul 24 10:16:31 connector kernel: [ 7977.820298] Call Trace:
Jul 24 10:16:31 connector kernel: [ 7977.820307]  __schedule+0x2e3/0x740
Jul 24 10:16:31 connector kernel: [ 7977.820310]  ? __switch_to_asm+0x40/0x70
Jul 24 10:16:31 connector kernel: [ 7977.820313]  ? __switch_to_asm+0x34/0x70
Jul 24 10:16:31 connector kernel: [ 7977.820315]  schedule+0x42/0xb0
Jul 24 10:16:31 connector kernel: [ 7977.820318]  rwsem_down_read_slowpath+0x16c/0x4a0
Jul 24 10:16:31 connector kernel: [ 7977.820321]  down_read+0x85/0xa0
Jul 24 10:16:31 connector kernel: [ 7977.820324]  iterate_supers_type+0x70/0xf0
Jul 24 10:16:31 connector kernel: [ 7977.820411]  ? cifs_set_cifscreds.isra.0+0x800/0x800 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820429]  cifs_reconnect+0x8a/0xdc0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820433]  ? vprintk_func+0x4c/0xbc
Jul 24 10:16:31 connector kernel: [ 7977.820449]  cifs_readv_from_socket+0x17a/0x260 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820465]  cifs_read_from_socket+0x4c/0x70 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820482]  ? allocate_buffers+0x43/0x130 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820497]  cifs_demultiplex_thread+0xe1/0xcc0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820500]  kthread+0x104/0x140
Jul 24 10:16:31 connector kernel: [ 7977.820516]  ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820518]  ? kthread_park+0x90/0x90
Jul 24 10:16:31 connector kernel: [ 7977.820520]  ret_from_fork+0x22/0x40
Jul 24 10:16:31 connector kernel: [ 7977.820524] INFO: task cifsd:26328 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.827503]       Not tainted 5.4.0-1020-aws #20-Ubuntu


That is, cifsd gets stuck fetching credentials for the reconnect. I'm
attaching the full syslog with stack traces from all hung cifsd tasks
(I don't see where the deadlock is there).
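
For anyone going through the attached syslog, here is a minimal sketch of
how the hung-task reports could be pulled out of it. It assumes each kernel
message sits on a single line, as in the original syslog file, and the
function name hung_tasks is only illustrative. The sample lines are taken
from the excerpt above.

```python
import re

# Matches the kernel's hung-task report, for example:
# "... INFO: task cifsd:26252 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def hung_tasks(lines):
    """Return (task_name, pid, seconds) for every hung-task report found."""
    found = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            name, pid, seconds = match.groups()
            found.append((name, int(pid), int(seconds)))
    return found


sample = [
    "Jul 24 10:16:31 connector kernel: [ 7977.798821] INFO: task cifsd:26252 blocked for more than 120 seconds.",
    "Jul 24 10:16:31 connector kernel: [ 7977.820298] Call Trace:",
    "Jul 24 10:16:31 connector kernel: [ 7977.820524] INFO: task cifsd:26328 blocked for more than 120 seconds.",
]

for name, pid, seconds in hung_tasks(sample):
    print(f"{name} (pid {pid}) blocked for more than {seconds} seconds")
```

Running the same function over the whole file gives the list of cifsd PIDs
whose stack traces are worth comparing.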

The mounting/unmounting is done in a privileged Docker container. If we
restart that, we usually run into an Oops:

Jul 25 07:43:29 connector kernel: [64677.164367] Oops: 0000 [#1] SMP NOPTI
Jul 25 07:43:29 connector kernel: [64677.164370] CPU: 0 PID: 265452 Comm: cifsd Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 25 07:43:29 connector kernel: [64677.164370] Hardware name: Amazon EC2 t3a.large/, BIOS 1.0 10/16/2017
Jul 25 07:43:29 connector kernel: [64677.164400] RIP: 0010:cifs_reconnect+0x9be/0xdc0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.164403] Code: e8 bb 43 0c d5 66 90 48 8b 45 c0 48 8d 55 c0 4c 8d 6d b8 48 39 c2 74 62 49 be 00 01 00 00 00 00 ad de 48 8b 45 c0 4c 8d 78 f8 <48> 8b 00 48 8d 58 f8 4d 39 ef 74 3d 49 8b 57 10 48 89 50 08 48 89
Jul 25 07:43:29 connector kernel: [64677.218175] RSP: 0018:ffffbf25c0b27cf8 EFLAGS: 00010286
Jul 25 07:43:29 connector kernel: [64677.222539] RAX: 0000000000000000 RBX: ffff9cdef66f0800 RCX: ffffffff95cd8510
Jul 25 07:43:29 connector kernel: [64677.227607] RDX: ffffbf25c0b27d30 RSI: ffffbf25c0b27d18 RDI: ffffffffc0aeec18
Jul 25 07:43:29 connector kernel: [64677.232638] RBP: ffffbf25c0b27d70 R08: 0000000000000180 R09: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.237666] R10: ffff9cdf32a173c8 R11: 0000000000000000 R12: 00000000fffffffe
Jul 25 07:43:29 connector kernel: [64677.242789] R13: ffffbf25c0b27d28 R14: dead000000000100 R15: fffffffffffffff8
Jul 25 07:43:29 connector kernel: [64677.247874] FS:  0000000000000000(0000) GS:ffff9cdf32a00000(0000) knlGS:0000000000000000
Jul 25 07:43:29 connector kernel: [64677.254956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 07:43:29 connector kernel: [64677.259348] CR2: 0000000000000000 CR3: 00000001cddce000 CR4: 00000000003406f0
Jul 25 07:43:29 connector kernel: [64677.264439] Call Trace:
Jul 25 07:43:29 connector kernel: [64677.267345]  ? vprintk_func+0x4c/0xbc
Jul 25 07:43:29 connector kernel: [64677.270720]  cifs_readv_from_socket+0x17a/0x260 [cifs]
Jul 25 07:43:29 connector kernel: [64677.274889]  cifs_read_from_socket+0x4c/0x70 [cifs]
Jul 25 07:43:29 connector kernel: [64677.278914]  ? cifs_add_credits+0x56/0x60 [cifs]
Jul 25 07:43:29 connector kernel: [64677.282722]  ? allocate_buffers+0x6d/0x130 [cifs]
Jul 25 07:43:29 connector kernel: [64677.286453]  cifs_demultiplex_thread+0xe1/0xcc0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.290566]  kthread+0x104/0x140
Jul 25 07:43:29 connector kernel: [64677.293969]  ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.298096]  ? kthread_park+0x90/0x90
Jul 25 07:43:29 connector kernel: [64677.301535]  ret_from_fork+0x22/0x40
Jul 25 07:43:29 connector kernel: [64677.304799] Modules linked in: md4 nls_utf8 cifs libarc4 libdes rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy iptable_mangle xt_mark xt_u32 xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena serio_raw parport_pc parport sch_fq_codel drm i2c_core sunrpc ip_tables x_tables autofs4
Jul 25 07:43:29 connector kernel: [64677.387761] CR2: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.391027] ---[ end trace b498d70d7111f607 ]---
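
A rough reading of the register dump above, offered as a sketch rather than
a diagnosis. The bytes at the fault, "<48> 8b 00", decode as a load from the
address in RAX, and the preceding "4c 8d 78 f8" computes R15 as RAX minus 8.
The snippet below only checks that the dumped values agree with that. The
LIST_POISON1 value assumes the usual x86_64 setting of
CONFIG_ILLEGAL_POINTER_VALUE (0xdead000000000000), and the function name
describe_oops is illustrative.

```python
MASK64 = (1 << 64) - 1

# Assumption: LIST_POISON1 = 0x100 + CONFIG_ILLEGAL_POINTER_VALUE,
# with the usual x86_64 value 0xdead000000000000 for the latter.
LIST_POISON1 = 0xDEAD000000000000 + 0x100


def describe_oops(rax, r14, r15, cr2):
    """Check the relations between the dumped registers."""
    return {
        # Load through RAX with RAX == 0 faults at address 0 (CR2).
        "null_deref": rax == 0 and cr2 == 0,
        # R15 was computed as RAX - 8 (64-bit wraparound).
        "r15_is_rax_minus_8": (rax - 8) & MASK64 == r15,
        # R14 holds the list-poison constant.
        "r14_is_list_poison1": r14 == LIST_POISON1,
    }


# Values copied from the Oops above.
result = describe_oops(
    rax=0x0000000000000000,
    r14=0xDEAD000000000100,
    r15=0xFFFFFFFFFFFFFFF8,
    cr2=0x0000000000000000,
)
for check, ok in result.items():
    print(check, ok)
```

All three relations hold for the values in this Oops, which would be
consistent with a linked-list walk in cifs_reconnect following a NULL
pointer. That interpretation is an inference from the dump, not something
confirmed against the source.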


The mount options used are:
ro,relatime,vers=1.0,cache=strict,username=xxx,domain=xxx,uid=0,noforceuid,gid=0,noforcegid,addr=172.30.2.138,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=61440,wsize=65536,bsize=1048576,echo_interval=60,actimeo=1
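
To relate these options to the log, a small sketch that splits the option
string and compares echo_interval with the "has not responded in 180
seconds" messages. The factor of three is an assumption drawn from the
numbers in this report (180 = 3 * 60); it has not been checked against the
kernel source here. The function name parse_mount_options is illustrative.

```python
MOUNT_OPTIONS = (
    "ro,relatime,vers=1.0,cache=strict,username=xxx,domain=xxx,uid=0,"
    "noforceuid,gid=0,noforcegid,addr=172.30.2.138,file_mode=0755,"
    "dir_mode=0755,soft,nounix,serverino,mapposix,rsize=61440,wsize=65536,"
    "bsize=1048576,echo_interval=60,actimeo=1"
)

# Assumption: the client gives up on a server after this many echo intervals.
ASSUMED_ECHO_FACTOR = 3


def parse_mount_options(options):
    """Split a comma-separated option string into a dict.

    Options without a value (for example "soft") map to True.
    """
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


opts = parse_mount_options(MOUNT_OPTIONS)
unresponsive_after = ASSUMED_ECHO_FACTOR * int(opts["echo_interval"])
print("SMB version:", opts["vers"])
print("soft mount:", opts.get("soft", False))
print("expected unresponsive timeout (s):", unresponsive_after)
```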

The attached log files also contain some CIFS debug messages generated with:
  echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control
  echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
  echo 1 > /proc/fs/cifs/cifsFYI

Is there any way of trying a newer kernel?
https://github.com/torvalds/linux/commits/master/fs/cifs suggests some
of the problems (at least the Oops) might have been fixed.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-1020-aws 5.4.0-1020.20
ProcVersionSignature: Ubuntu 5.4.0-1020.20-aws 5.4.44
Uname: Linux 5.4.0-1020-aws x86_64
ApportVersion: 2.20.11-0ubuntu27.4
Architecture: amd64
CasperMD5CheckResult: skip
Date: Sat Jul 25 11:55:47 2020
Ec2AMI: ami-07d14b5d47292e022
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: eu-central-1a
Ec2InstanceType: t3a.large
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/usr/bin/zsh
SourcePackage: linux-aws
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-aws (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug ec2-images focal

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1888936

Title:
  cifsd deadlocks / CIFS related Oopses

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1888936/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
