[Kernel-packages] [Bug 1633223] Re: rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

2019-07-24 Thread Brad Figg
** Tags added: cscc

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1633223

Title:
  rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and
  docker

Status in linux package in Ubuntu:
  Invalid

Bug description:
  ---Problem Description---
  Seeing occasional rcu_sched detected stalls on 14.04 LTS with kernel
  3.19.0-58. The system is running docker containers, and has the NVIDIA GPU
  driver loaded. We've seen about 4 stalls in the last month, all with the
  3.19.0-58 kernel, and with the NVIDIA 352.93 and 361.49 drivers.

  ---uname output---
  Linux dldev1 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:01 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Additional Hardware Info---
  2 x NVIDIA K80 GPU adapter:
  $ lspci | grep NV
  0002:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0002:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0006:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0006:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) 

   
  Machine Type = 8247-42L 
   
  ---System Hang---
  The usual symptom is that the system is unresponsive, except perhaps for
  ping and for writing the stall-detection messages to the console. Login/getty
  isn't available either via ssh or on the console. The system must be power
  cycled to recover.
   
  Attached is the kernel log from a stall detection on May 18th. The detection
  first occurs at: May 18 15:17:55.

  The system is later rebooted and those messages indicate the kernel
  (3.19.0-58) and NVIDIA driver version (352.93) that were active at the
  time.

  We've suffered 3 or 4 stalls since, all with the same kernel, but some
  with a newer NVIDIA driver (361.49).

  Unfortunately, information about the newer stalls wasn't preserved in
  the various log files (and we're not capturing the console
  constantly), so we don't have detailed data for those.

  We'd welcome any suggestions for how to collect additional data for
  these occurrences.

  I can't say for sure that we haven't seen the stalls on other systems,
  but they're occurring fairly frequently on this system, and it's
  unusual in that it's running both Docker and the NVIDIA GPU driver. So
  maybe aufs or the NVIDIA driver is somehow involved.

  From the kern.log,

  The Call trace points to some kind of deadlock in aufs -

  May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
  May 18 15:17:55 dldev1 kernel: [713670.798628] cc1 R  running task 0 99183  99173 0x00040004
  May 18 15:17:55 dldev1 kernel: [713670.798633] Call Trace:
  May 18 15:17:55 dldev1 kernel: [713670.798643] [c00fa64673a0] [c00cf004] wake_up_worker+0x44/0x60 (unreliable)
  May 18 15:17:55 dldev1 kernel: [713670.798671] [c00fa6467570] [c00fa64675d0] 0xc00fa64675d0
  May 18 15:17:55 dldev1 kernel: [713670.798676] [c00fa64675d0] [c0a1b050] __schedule+0x370/0x900
  May 18 15:17:55 dldev1 kernel: [713670.798679] [c00fa64677f0] [c00fa6467850] 0xc00fa6467850
  May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
  May 18 15:17:55 dldev1 kernel: [713670.798684] cc1 D 105d9410 0 99427  99405 0x00040004
  May 18 15:17:55 dldev1 kernel: [713670.798688] Call Trace:
  May 18 15:17:55 dldev1 kernel: [713670.798691] [c017efdd3460] [c017efdd34a0] 0xc017efdd34a0 (unreliable)
  May 18 15:17:55 dldev1 kernel: [713670.798695] [c017efdd3630] [c017efdd3690] 0xc017efdd3690
  May 18 15:17:55 dldev1 kernel: [713670.798698] [c017efdd3690] [c0a1b050] __schedule+0x370/0x900
  May 18 15:17:55 dldev1 kernel: [713670.798702] [c017efdd38b0] [c0a1f128] rwsem_down_write_failed+0x288/0x400
  May 18 15:17:55 dldev1 kernel: [713670.798706] [c017efdd3940] [c0a1e538] down_write+0x88/0x90
  May 18 15:17:55 dldev1 kernel: [713670.798716] [c017efdd3970] [d0001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798724] [c017efdd39a0] [d0001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798733] [c017efdd39e0] [d0001ead8208] aufs_d_revalidate+0x98/0x7a0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798737] [c017efdd3aa0] [c02c88f8] lookup_fast+0x368/0x3b0
  May 18 15:17:55 dldev1 kernel: [713670.798740] [c017efdd3b10] [c02cb620] path_lookupat+0x180/0x970
  May 18 15:17:55 dldev1 kernel: [713670.798743] [c017efdd3be0] [c02cbe68] filename_lookup+0x58/0x140
  May 18 15:17:55 dldev1 kernel: [713670.798746] [c017efdd3c30] [c02cde04] user_path_at_empty+0x84/0xe0
  May 18 15:17:55 dldev1 kernel: [713670.798749] [c017efdd3d20] [c02be744] vfs_fstatat+0x84/0x140
  May 18 15:17:55 dldev1 kernel: 
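
  When reading a longer kern.log, it can help to group the stack frames by
  the "Task dump for CPU N:" header and list which CPUs have frames inside a
  given module. The following is only a rough sketch, not a tool used in this
  bug: the names SAMPLE, frames_by_cpu and cpus_in_module are made up for
  illustration, the sample text is a few frames copied from the trace above,
  and the regular expressions assume each frame sits on a single log line in
  the format shown here.

```python
import re

# A few lines copied from the trace in this report (one frame per line).
SAMPLE = """\
May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
May 18 15:17:55 dldev1 kernel: [713670.798643] [c00fa64673a0] [c00cf004] wake_up_worker+0x44/0x60 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798676] [c00fa64675d0] [c0a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
May 18 15:17:55 dldev1 kernel: [713670.798706] [c017efdd3940] [c0a1e538] down_write+0x88/0x90
May 18 15:17:55 dldev1 kernel: [713670.798716] [c017efdd3970] [d0001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798724] [c017efdd39a0] [d0001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
"""

# Header that starts each per-CPU task dump.
CPU_RE = re.compile(r"Task dump for CPU (\d+):")

# A symbolized frame: "[stack] [addr] symbol+0xOFF/0xLEN", optionally
# followed by "[module]". Frames that are only a raw address are skipped.
FRAME_RE = re.compile(
    r"\[[0-9a-f]+\]\s+\[[0-9a-f]+\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def frames_by_cpu(log_text):
    """Return {cpu: [(symbol, module_or_None), ...]} for each task dump."""
    dumps = {}
    current = None
    for line in log_text.splitlines():
        header = CPU_RE.search(line)
        if header:
            current = int(header.group(1))
            dumps[current] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            dumps[current].append((frame.group(1), frame.group(2)))
    return dumps


def cpus_in_module(dumps, module):
    """Return the sorted CPU numbers whose dump has a frame in `module`."""
    return sorted(
        cpu for cpu, frames in dumps.items()
        if any(mod == module for _, mod in frames)
    )


if __name__ == "__main__":
    print(cpus_in_module(frames_by_cpu(SAMPLE), "aufs"))
```

  Run against the sample, this reports CPU 75 as the one with aufs frames,
  which matches the task shown blocked in down_write above. It says nothing
  about which task holds the lock.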

[Kernel-packages] [Bug 1633223] Re: rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

2017-03-21 Thread Manoj Iyer
** Changed in: linux (Ubuntu)
   Status: Incomplete => Invalid

** Changed in: linux (Ubuntu)
 Assignee: Taco Screen team (taco-screen-team) => (unassigned)


[Kernel-packages] [Bug 1633223] Re: rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

2017-02-27 Thread Andrew Cloke
Revisiting this bug: is this issue still occurring?

** Changed in: linux (Ubuntu)
   Status: New => Incomplete


[Kernel-packages] [Bug 1633223] Re: rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

2016-10-14 Thread Felix
On 16.04, the version of aufs looks correct, so maybe it's a different bug.
Did you also try with Docker 1.12 on 16.04?
Looking at the other bug, there is a Docker container that can help diagnose
whether you are encountering the aufs issue:
docker run -it --rm akihirosuda/test18180


[Kernel-packages] [Bug 1633223] Re: rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

2016-10-13 Thread Felix
Did you see #1533043? Check your version of aufs (`modinfo aufs`)
