Public bug reported:

---Problem Description---
Seeing occasional rcu_sched detected stalls on 14.04 LTS with kernel 3.19.0-58. 
The system is running docker containers, and has the NVIDIA GPU driver loaded. 
We've seen about 4 stalls in the last month, all with the 3.19.0-58 kernel, and 
with the NVIDIA 352.93 and 361.49 drivers.
  
---uname output---
Linux dldev1 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:01 UTC 
2016 ppc64le ppc64le ppc64le GNU/Linux
 
---Additional Hardware Info---
2 x NVIDIA K80 GPU adapter:
$ lspci | grep NV
0002:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0002:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) 

 
Machine Type = 8247-42L 
 
---System Hang---
 Usual symptom is that the system is unresponsive except maybe for ping and 
writing the stall-detection messages to the console. Login/getty isn't 
available either via ssh nor on the console. System must be power cycled to 
recover.
 
Attached is the kernel log from a stall detection on May 18th. The detection 
first occurs at: May 18 15:17:55.

The system is later rebooted and those messages indicate the kernel
(3.19.0-58) and NVIDIA driver version (352.93) that were active at the
time.

We've suffered 3 or 4 stalls since, all with the same kernel, but some
with a newer NVIDIA driver (361.49).

Unfortunately, information about the newer stalls wasn't preserved in
the various log files (and we're not capturing the console constantly),
so we don't have detailed data for those.

We'd welcome any suggestions for how to collect additional data for
these occurrences.

I can't say for sure that we haven't seen the stalls on other systems,
but they're occuring fairly frequently on this system, and it's unusual
in that it's running both Docker and NVIDIA GPU driver. So maybe aufs or
the NVIDIA driver are somehow involved.

>From the kern.log,

The Call trace points to some kind of deadlock in aufs -

May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
May 18 15:17:55 dldev1 kernel: [713670.798628] cc1             R  running task  
      0 99183  99173 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798633] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798643] [c000000fa64673a0] 
[c0000000000cf004] wake_up_worker+0x44/0x60 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798671] [c000000fa6467570] 
[c000000fa64675d0] 0xc000000fa64675d0
May 18 15:17:55 dldev1 kernel: [713670.798676] [c000000fa64675d0] 
[c000000000a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798679] [c000000fa64677f0] 
[c000000fa6467850] 0xc000000fa6467850
May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
May 18 15:17:55 dldev1 kernel: [713670.798684] cc1             D 
00001000005d9410     0 99427  99405 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798688] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798691] [c0000017efdd3460] 
[c0000017efdd34a0] 0xc0000017efdd34a0 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798695] [c0000017efdd3630] 
[c0000017efdd3690] 0xc0000017efdd3690
May 18 15:17:55 dldev1 kernel: [713670.798698] [c0000017efdd3690] 
[c000000000a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798702] [c0000017efdd38b0] 
[c000000000a1f128] rwsem_down_write_failed+0x288/0x400
May 18 15:17:55 dldev1 kernel: [713670.798706] [c0000017efdd3940] 
[c000000000a1e538] down_write+0x88/0x90
May 18 15:17:55 dldev1 kernel: [713670.798716] [c0000017efdd3970] 
[d00000001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798724] [c0000017efdd39a0] 
[d00000001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798733] [c0000017efdd39e0] 
[d00000001ead8208] aufs_d_revalidate+0x98/0x7a0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798737] [c0000017efdd3aa0] 
[c0000000002c88f8] lookup_fast+0x368/0x3b0
May 18 15:17:55 dldev1 kernel: [713670.798740] [c0000017efdd3b10] 
[c0000000002cb620] path_lookupat+0x180/0x970
May 18 15:17:55 dldev1 kernel: [713670.798743] [c0000017efdd3be0] 
[c0000000002cbe68] filename_lookup+0x58/0x140
May 18 15:17:55 dldev1 kernel: [713670.798746] [c0000017efdd3c30] 
[c0000000002cde04] user_path_at_empty+0x84/0xe0
May 18 15:17:55 dldev1 kernel: [713670.798749] [c0000017efdd3d20] 
[c0000000002be744] vfs_fstatat+0x84/0x140
May 18 15:17:55 dldev1 kernel: [713670.798753] [c0000017efdd3d80] 
[c0000000002bee14] SyS_newlstat+0x34/0x60
May 18 15:17:55 dldev1 kernel: [713670.798757] [c0000017efdd3e30] 
[c000000000009258] system_call+0x38/0xd0

Other info from the log -

May 18 16:00:00 dldev1 kernel: [   11.802963] nvidia 0002:03:00.0: enabling 
device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [   11.803114] nvidia 0002:04:00.0: enabling 
device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [   11.803221] nvidia 0006:03:00.0: enabling 
device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [   11.803349] nvidia 0006:04:00.0: enabling 
device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [   11.803598] [drm] Initialized nvidia-drm 
0.0.0 20150116 for 0002:03:00.0 on minor 0
May 18 16:00:00 dldev1 kernel: [   11.803654] [drm] Initialized nvidia-drm 
0.0.0 20150116 for 0002:04:00.0 on minor 1
May 18 16:00:00 dldev1 kernel: [   11.803700] [drm] Initialized nvidia-drm 
0.0.0 20150116 for 0006:03:00.0 on minor 2
May 18 16:00:00 dldev1 kernel: [   11.803742] [drm] Initialized nvidia-drm 
0.0.0 20150116 for 0006:04:00.0 on minor 3
May 18 16:00:00 dldev1 kernel: [   11.803749] NVRM: loading NVIDIA UNIX ppc64le 
Kernel Module  352.93  Tue Apr  5 17:31:42 PDT 2016
....

May 18 16:00:01 dldev1 kernel: [  278.416678] aufs 3.x-rcN-20150105
May 18 16:00:01 dldev1 kernel: [  278.952335] bridge: automatic filtering via 
arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter 
if you need this.
May 18 16:00:01 dldev1 kernel: [  278.953351] Bridge firewalling registered
May 18 16:00:01 dldev1 kernel: [  278.957164] nf_conntrack version 0.5.0 (16384 
buckets, 65536 max)
May 18 16:00:01 dldev1 kernel: [  278.997765] ip_tables: (C) 2000-2006 
Netfilter Core Team
May 18 16:00:01 dldev1 kernel: [  279.013814] nvidia 0002:03:00.0: Using 64-bit 
DMA iommu bypass
May 18 16:00:03 dldev1 kernel: [  280.480596] IPv6: ADDRCONF(NETDEV_UP): 
docker0: link is not ready
May 18 16:00:04 dldev1 kernel: [  282.035907] nvidia 0002:04:00.0: Using 64-bit 
DMA iommu bypass
May 18 16:00:07 dldev1 kernel: [  285.065144] nvidia 0006:03:00.0: Using 64-bit 
DMA iommu bypass
May 18 16:00:10 dldev1 kernel: [  288.150339] nvidia 0006:04:00.0: Using 64-bit 
DMA iommu bypass
...
May 18 16:01:55 dldev1 kernel: [  393.265785] aufs 
au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the 
permission bits on the lower branch
May 18 16:01:56 dldev1 kernel: [  393.328552] aufs 
au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the 
permission bits on the lower branch
May 18 16:01:56 dldev1 kernel: [  393.366023] device veth0beb24c entered 
promiscuous mode
May 18 16:01:56 dldev1 kernel: [  393.367228] IPv6: ADDRCONF(NETDEV_UP): 
veth0beb24c: link is not ready
May 18 16:01:56 dldev1 kernel: [  393.367235] docker0: port 1(veth0beb24c) 
entered forwarding state
May 18 16:01:56 dldev1 kernel: [  393.367314] docker0: port 1(veth0beb24c) 
entered forwarding state
May 18 16:01:56 dldev1 kernel: [  393.367825] docker0: port 1(veth0beb24c) 
entered disabled state
May 18 16:01:56 dldev1 kernel: [  393.526038] docker0: port 1(veth0beb24c) 
entered disabled state
May 18 16:01:56 dldev1 kernel: [  393.530143] device veth0beb24c left 
promiscuous mode
May 18 16:01:56 dldev1 kernel: [  393.530148] docker0: port 1(veth0beb24c) 
entered disabled state

> Hi Breno,
> 
> As per Ubuntu support, all new fixes will target 16.04 and Canonical will
> not automatically fix 14.04. If one needs a fix in 14.04, it has to go
> through as a special request to Canonical. So should we check if this
> problem happens on Ubuntu 16.04?

That is correct. Ubuntu 14.04 fixes will be SRU. I.e, they need to be
fixed in 16.10 first, and then backport to 16.04 an 14.04.

Kernel 4.4 for Ubuntu 16.04.5 is already in proposed. Does it contain
the bug also? I understand that kernel 3.19 is already EOL.

The requested docker info; we're using the docker packages provided by
Canonical:

$ dpkg -l | egrep 'docker|aufs'
ii  aufs-tools                             1:3.2+20130722-1.1                   
      ppc64el      Tools to manage aufs filesystems
ii  docker.io                              1.9.1~git20151210-18bfacb            
      ppc64el      Linux container runtime

$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) 
[ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
 Git commit:   18bfacb
 Built:        Thu Dec 17 19:19:02 UTC 2015
 OS/Arch:      linux/ppc64le

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) 
[ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
 Git commit:   18bfacb
 Built:        Thu Dec 17 19:19:02 UTC 2015
 OS/Arch:      linux/ppc64le


$ docker info
Containers: 8
Images: 27
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 43
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.0-58-generic
Operating System: Ubuntu 14.04.4 LTS
CPUs: 192
Total Memory: 127.5 GiB
Name: dldev1
ID: NKEF:PLNM:HHXN:3DAD:3CLU:Z2GP:AZLR:MZRO:ADYA:5GOV:YZRN:DWOS
WARNING: No swap limit support

I had said: "we're using the docker packages provided by Canonical".

That's not true.  The 'docker.io' package we're using is from:

deb http://ftp.unicamp.br/pub/ppc64el/ubuntu/14_04/docker-ppc64el/
trusty main

The aufs.ko does come from Canonical:

$ dpkg -S aufs.ko
linux-image-extra-3.19.0-58-generic: 
/lib/modules/3.19.0-58-generic/kernel/ubuntu/aufs/aufs.ko

Another occurrence this afternoon, again while building a software
project in a Docker container.

Yesterday was building 'Caffe', which is mostly C or C++. Today was
building 'TensorFlow' which includes a lot of Java.

Still seeing this hang on Ubuntu 16.04 with 4.4.0-38 kernel.

Similar scenario--building TensorFlow in a container often causes the
problem. I assume because the build is creating lots of filesystem
artifacts.

Seems that running 'sync' every few seconds during the build makes the
problem less likely to strike.

Will attach kernel log from an instance on 16.04 / 4.4.0-38 from Oct
7th.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-142751 severity-high 
targetmilestone-inin16042

** Tags added: architecture-ppc64le bugnameltc-142751 severity-high
targetmilestone-inin16042

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1633223

Title:
  rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and
  docker

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1633223/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to