[Kernel-packages] [Bug 1633223] host 'dlfire5' hang from morning of Oct 18th
** Attachment added: "host 'dlfire5' hang from morning of Oct 18th"
   https://bugs.launchpad.net/bugs/1633223/+attachment/4844766/+files/dlfire5-oct18-kern.log

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1633223

Title:
  rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

Status in linux package in Ubuntu:
  Invalid

Bug description:
  ---Problem Description---
  Seeing occasional rcu_sched detected stalls on 14.04 LTS with kernel
  3.19.0-58. The system is running docker containers, and has the NVIDIA
  GPU driver loaded. We've seen about 4 stalls in the last month, all with
  the 3.19.0-58 kernel, and with the NVIDIA 352.93 and 361.49 drivers.

  ---uname output---
  Linux dldev1 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:01 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

  ---Additional Hardware Info---
  2 x NVIDIA K80 GPU adapter:

  $ lspci | grep NV
  0002:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0002:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0006:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0006:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

  Machine Type = 8247-42L

  ---System Hang---
  The usual symptom is that the system is unresponsive, except possibly
  for ping and for writing the stall-detection messages to the console.
  Login/getty isn't available either via ssh or on the console. The
  system must be power cycled to recover.

  Attached is the kernel log from a stall detection on May 18th. The
  detection first occurs at: May 18 15:17:55. The system is later
  rebooted, and those messages indicate the kernel (3.19.0-58) and NVIDIA
  driver version (352.93) that were active at the time.

  We've suffered 3 or 4 stalls since, all with the same kernel, but some
  with a newer NVIDIA driver (361.49). Unfortunately, information about
  the newer stalls wasn't preserved in the various log files (and we're
  not capturing the console constantly), so we don't have detailed data
  for those. We'd welcome any suggestions for how to collect additional
  data for these occurrences.

  I can't say for sure that we haven't seen the stalls on other systems,
  but they're occurring fairly frequently on this system, and it's unusual
  in that it's running both Docker and the NVIDIA GPU driver. So maybe
  aufs or the NVIDIA driver is somehow involved.

  From the kern.log, the call trace points to some kind of deadlock in
  aufs:

  May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
  May 18 15:17:55 dldev1 kernel: [713670.798628] cc1 R running task 0 99183 99173 0x00040004
  May 18 15:17:55 dldev1 kernel: [713670.798633] Call Trace:
  May 18 15:17:55 dldev1 kernel: [713670.798643] [c00fa64673a0] [c00cf004] wake_up_worker+0x44/0x60 (unreliable)
  May 18 15:17:55 dldev1 kernel: [713670.798671] [c00fa6467570] [c00fa64675d0] 0xc00fa64675d0
  May 18 15:17:55 dldev1 kernel: [713670.798676] [c00fa64675d0] [c0a1b050] __schedule+0x370/0x900
  May 18 15:17:55 dldev1 kernel: [713670.798679] [c00fa64677f0] [c00fa6467850] 0xc00fa6467850
  May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
  May 18 15:17:55 dldev1 kernel: [713670.798684] cc1 D 105d9410 0 99427 99405 0x00040004
  May 18 15:17:55 dldev1 kernel: [713670.798688] Call Trace:
  May 18 15:17:55 dldev1 kernel: [713670.798691] [c017efdd3460] [c017efdd34a0] 0xc017efdd34a0 (unreliable)
  May 18 15:17:55 dldev1 kernel: [713670.798695] [c017efdd3630] [c017efdd3690] 0xc017efdd3690
  May 18 15:17:55 dldev1 kernel: [713670.798698] [c017efdd3690] [c0a1b050] __schedule+0x370/0x900
  May 18 15:17:55 dldev1 kernel: [713670.798702] [c017efdd38b0] [c0a1f128] rwsem_down_write_failed+0x288/0x400
  May 18 15:17:55 dldev1 kernel: [713670.798706] [c017efdd3940] [c0a1e538] down_write+0x88/0x90
  May 18 15:17:55 dldev1 kernel: [713670.798716] [c017efdd3970] [d0001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798724] [c017efdd39a0] [d0001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798733] [c017efdd39e0] [d0001ead8208] aufs_d_revalidate+0x98/0x7a0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798737] [c017efdd3aa0] [c02c88f8] lookup_fast+0x368/0x3b0
  May 18 15:17:55 dldev1 kernel: [713670.798740] [c017efdd3b10] [c02cb620] path_lookupat+0x180/0x970
  May 18 15:17:55 dldev1 kernel: [713670.798743] [c017efdd3be0] [c02cbe68] filename_lookup+0x58/0x140
  May 18 15:17:55 dldev1 kernel: [713670.798746] [c017efdd3c30] [c02cde04] user_path_at_empty+0x84/0xe0
  May 18 15:17
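
As a rough aid for reading task dumps like the one quoted above, here is a minimal Python sketch that groups stack frames by CPU and reports which CPUs have frames inside a given kernel module. It assumes the syslog line layout shown in the excerpt (`... kernel: [timestamp] message`). The names `parse_task_dumps`, `cpus_blocked_in`, and `SAMPLE` are hypothetical and are not part of any existing tool.

```python
import re

# Strip the syslog prefix, e.g. "May 18 15:17:55 dldev1 kernel: [713670.798716] "
PREFIX = re.compile(r"^.*?kernel: \[\s*\d+\.\d+\]\s*(.*)$")
TASK_DUMP = re.compile(r"^Task dump for CPU (\d+):")
# Stack frame: "[sp] [pc] symbol+0xoff/0xsize [module]"; the module tag is optional.
FRAME = re.compile(
    r"^\[[0-9a-f]+\] \[[0-9a-f]+\] (\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)

# A few lines copied from the excerpt above, used only for illustration.
SAMPLE = """\
May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
May 18 15:17:55 dldev1 kernel: [713670.798643] [c00fa64673a0] [c00cf004] wake_up_worker+0x44/0x60 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798676] [c00fa64675d0] [c0a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
May 18 15:17:55 dldev1 kernel: [713670.798695] [c017efdd3630] [c017efdd3690] 0xc017efdd3690
May 18 15:17:55 dldev1 kernel: [713670.798706] [c017efdd3940] [c0a1e538] down_write+0x88/0x90
May 18 15:17:55 dldev1 kernel: [713670.798716] [c017efdd3970] [d0001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798724] [c017efdd39a0] [d0001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
""".splitlines()


def parse_task_dumps(lines):
    """Return {cpu: [(symbol, module_or_None), ...]} for each task dump."""
    dumps = {}
    cpu = None
    for line in lines:
        prefixed = PREFIX.match(line)
        if not prefixed:
            continue  # not a kernel log line in the assumed format
        msg = prefixed.group(1)
        header = TASK_DUMP.match(msg)
        if header:
            cpu = int(header.group(1))
            dumps[cpu] = []
            continue
        frame = FRAME.match(msg)
        # Frames without a resolved symbol (bare addresses) are skipped.
        if frame and cpu is not None:
            dumps[cpu].append((frame.group(1), frame.group(2)))
    return dumps


def cpus_blocked_in(dumps, module):
    """Return the sorted CPUs whose dumped stack has a frame in `module`."""
    return sorted(
        cpu for cpu, frames in dumps.items()
        if any(mod == module for _, mod in frames)
    )


if __name__ == "__main__":
    print(cpus_blocked_in(parse_task_dumps(SAMPLE), "aufs"))
```

Run against a full kern.log, the same two functions would list every dumped CPU with a frame in the named module. This only narrows down where to look and does not by itself establish a deadlock.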
[Kernel-packages] [Bug 1633223] host 'dlfire5' hang from morning of Oct 18th (2nd hang)
** Attachment added: "host 'dlfire5' hang from morning of Oct 18th (2nd hang)"
   https://bugs.launchpad.net/bugs/1633223/+attachment/4844767/+files/dlfire5-oct18b-kern.log

Status in linux package in Ubuntu:
  Invalid
[Kernel-packages] [Bug 1633223] host 'dlfire5' hang from morning of Oct 18th (2nd hang)
--- Comment (attachment only) From ha...@us.ibm.com 2016-10-18 12:55 EDT---

** Attachment added: "host 'dlfire5' hang from morning of Oct 18th (2nd hang)"
   https://bugs.launchpad.net/bugs/1633223/+attachment/4763344/+files/dlfire5-oct18b-kern.log

Status in linux package in Ubuntu:
  New
[Kernel-packages] [Bug 1633223] host 'dlfire5' hang from morning of Oct 18th
--- Comment (attachment only) From ha...@us.ibm.com 2016-10-18 11:04 EDT---

** Attachment added: "host 'dlfire5' hang from morning of Oct 18th"
   https://bugs.launchpad.net/bugs/1633223/+attachment/4763299/+files/dlfire5-oct18-kern.log

Status in linux package in Ubuntu:
  New