Re: [elrepo] hard lockups on CPUs with elrepo kernel 3.10.103 on CentOS 6
On Thu, Oct 6, 2016 at 3:27 PM, Akemi Yagi wrote:
> On Thu, Oct 6, 2016 at 2:22 PM, Grigory Shamov wrote:
>> An update:
>>
>> It looks like the same issue was observed in Red Hat 7 kernels, also based
>> on 3.10. It pertains to a perf_event_overflow error with an increased
>> kernel.watchdog_thresh:
>>
>> https://access.redhat.com/solutions/1354963
>>
>> ```
>> * Red Hat Enterprise Linux (RHEL) 7
>> * seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
>> * the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value
>>   than the default
>> * Docker
>> ```
>>
>> They report the panic under Docker; we see it under a normal application
>> workload (but HPC applications are long-running and use a lot of memory,
>> so they can be somewhat similar to a heavily used container).
>>
>> The Red Hat solution basically suggests updating to their later kernel.
>> What would one do with the ELRepo one?
>
> I'd like to track down the patch(es) Red Hat applied to fix the issue.
> It is possible that, while kernel-lt does not have the patch,
> kernel-ml may have it. At any rate, the patch must be identified to
> find that out.

I now suspect the following patch was the one:

https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/9809b18fcf6b8d8ec4d3643677345907e6b50eca

It first appeared in kernel 3.12. Red Hat backported it to their 7.1/7.2 kernels.

Akemi

___
elrepo mailing list
elrepo@lists.elrepo.org
http://lists.elrepo.org/mailman/listinfo/elrepo
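[Editor's note: not part of the original exchange, but as a sketch of how one
might confirm whether a given kernel tree carries that suspected fix commit.
The `./linux-stable` clone path is a hypothetical assumption; the commit hash
is the one from the message above.]

```shell
# Sketch: check whether a kernel source tree already contains the
# suspected watchdog fix. Assumes a local clone at ./linux-stable
# (hypothetical path; adjust to your checkout).
if [ -d linux-stable/.git ]; then
    # List the release tags that contain the commit; per the message
    # above, v3.12 and later should appear.
    git -C linux-stable tag --contains 9809b18fcf6b8d8ec4d3643677345907e6b50eca | head -5
else
    echo "no linux-stable clone here; skipping"
fi

# For a distro kernel, one can instead inspect the source RPM's
# patch list for a backport:
#   rpm -qpl kernel-3.10.0-*.src.rpm | grep -i watchdog
```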
Re: [elrepo] hard lockups on CPUs with elrepo kernel 3.10.103 on CentOS 6
An update:

It looks like the same issue was observed in Red Hat 7 kernels, also based on
3.10. It pertains to a perf_event_overflow error with an increased
kernel.watchdog_thresh:

https://access.redhat.com/solutions/1354963

```
* Red Hat Enterprise Linux (RHEL) 7
* seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
* the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value
  than the default
* Docker
```

They report the panic under Docker; we see it under a normal application
workload (but HPC applications are long-running and use a lot of memory, so
they can be somewhat similar to a heavily used container).

The Red Hat solution basically suggests updating to their later kernel.
What would one do with the ELRepo one?

--
Grigory Shamov
Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625

On 2016-10-06, 11:31 AM, "elrepo-boun...@lists.elrepo.org on behalf of Grigory Shamov" wrote:

>Hi All,
>
>We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines of our
>HPC cluster. The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem
>chips, SSE4.2). We first tested that the kernel works with our driver
>stack, were satisfied, and went to production.
>
>It turned out, though, that under production load, from time to time, some
>of the nodes (a few of them, seemingly at random) panic on nmi_watchdog
>hard lockups (and from time to time emit complaints about soft lockups),
>with various messages like this:
>
>"""
>Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
>... Call trace follows; mentions watchdog_overflow_callback ...
>Shutting down cpus with NMI
>drm_kms_helper: panic occurred, switching back to text console
>"""
>
>Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it
>defaults to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
>That made things worse: the test node quickly panicked with a call
>trace mentioning "perf_event_overflow".
>
>Is there anything we can do about these errors, and what would be the
>possible reason for them? Could anyone suggest a fix? Thank you very much
>in advance.
>
>--
>Grigory Shamov
>
>Westgrid/ComputeCanada Site Lead
>University of Manitoba
>E2-588 EITC Building,
>(204) 474-9625
[elrepo] hard lockups on CPUs with elrepo kernel 3.10.103 on CentOS 6
Hi All,

We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines of our HPC
cluster. The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem
chips, SSE4.2). We first tested that the kernel works with our driver stack,
were satisfied, and went to production.

It turned out, though, that under production load, from time to time, some of
the nodes (a few of them, seemingly at random) panic on nmi_watchdog hard
lockups (and from time to time emit complaints about soft lockups), with
various messages like this:

"""
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
... Call trace follows; mentions watchdog_overflow_callback ...
Shutting down cpus with NMI
drm_kms_helper: panic occurred, switching back to text console
"""

Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it defaults to
10, while on the CentOS 6 2.6.32 kernel it used to be 60. That made things
worse: the test node quickly panicked with a call trace mentioning
"perf_event_overflow".

Is there anything we can do about these errors, and what would be the possible
reason for them? Could anyone suggest a fix? Thank you very much in advance.

--
Grigory Shamov
Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625
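[Editor's note: not part of the original message, but for reference, the
watchdog settings discussed above can be inspected, and as a stopgap the NMI
watchdog disabled, via sysctl. A minimal sketch, assuming a 3.10-era Linux
kernel; the value 30 below is purely illustrative:]

```shell
# Inspect the lockup-detector settings (sysctl names as used in the
# 3.10 kernel). Reads work as any user:
cat /proc/sys/kernel/watchdog_thresh 2>/dev/null   # defaults to 10 on 3.10
cat /proc/sys/kernel/nmi_watchdog 2>/dev/null      # 1 = hard lockup detector on

# Writes need root. Raising the threshold is exactly what exposed the
# perf_event_overflow panic on unpatched 3.10 kernels:
#   sysctl -w kernel.watchdog_thresh=30
# Until a fixed kernel is installed, a stopgap is to disable the NMI
# (hard lockup) watchdog instead:
#   sysctl -w kernel.nmi_watchdog=0
# To persist across reboots, add the line to /etc/sysctl.conf:
#   kernel.nmi_watchdog = 0
```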
[elrepo] skylake, nvidia, bumblebee
Hello,

As mentioned a few weeks ago, I am trying to get C7 fully functional on a new
Acer Aspire E5-575G-55DE laptop, which has an Intel i5-6200U with integrated
HD Graphics 520 as well as an nvidia GeForce GTX 950M.

Thanks to advice from this list, I got the Intel graphics to work with the
standard EL7 kernel by adding i915.preliminary_hw_support=1 to the kernel
options. I then followed the instructions from
http://elrepo.org/tiki/bumblebee to install the nvidia kmod and driver, as
well as bumblebee. After rebooting I get an "Oh no! Something has gone wrong"
screen.

I managed to fix this by

sudo mv /usr/lib64/xorg/modules/extensions/nvidia/ /usr/lib64/xorg/

and modifying XorgModulePath accordingly in bumblebee.conf. Everything now
works: I can boot in X mode (using the Intel gpu) and start applications
using the nvidia gpu with optirun.

Comparing the Xorg.0.log files between the initial (broken) install and after
my fix, the main difference is (removing timestamps for readability):

BROKEN:
(II) LoadModule: "glx"
(II) Loading /usr/lib64/xorg/modules/extensions/nvidia/libglx.so
(EE) Failed to load /usr/lib64/xorg/modules/extensions/nvidia/libglx.so:
     libnvidia-tls.so.367.44: cannot open shared object file: No such file or
     directory
(II) UnloadModule: "glx"
(II) Unloading glx
(EE) Failed to load module "glx" (loader failed, 7)

FIXED:
(II) LoadModule: "glx"
(II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
(II) Module glx: vendor="X.Org Foundation"
     compiled for 1.17.2, module version = 1.0.0
     ABI class: X.Org Server Extension, version 9.0
(==) AIGLX enabled

So it seems that when the nvidia libglx is in
/usr/lib64/xorg/modules/extensions/nvidia/libglx.so, X tries to load that
instead of the correct /usr/lib64/xorg/modules/extensions/libglx.so. This is
surprising because in both cases I have:

(==) ModulePath set to "/usr/lib64/xorg/modules"

The elrepo instructions worked for me on another laptop with intel+nvidia, so
I don't understand.
I had to fiddle with a bunch of things on this laptop, both for this nvidia
problem and to try to get the wifi working, so it's possible I screwed up
something. But I've double-checked everything I could think of without
finding the culprit.

So, does anyone have any ideas why X is picking up the wrong libglx.so here?
I'm willing to check any files and test things, I just don't know where to
look or what to try now.

Thanks,
Nicolas
Re: [elrepo] skylake, nvidia, bumblebee
On 10/06/2016 04:15 PM, Nicolas Thierry-Mieg wrote:
> [... full message quoted above trimmed ...]
>
> So it seems that when the nvidia libglx is in
> /usr/lib64/xorg/modules/extensions/nvidia/libglx.so, X tries to load that
> instead of the correct /usr/lib64/xorg/modules/extensions/libglx.so.

I might be wrong, but I think that you did not fix it correctly.
nvidia comes with its own libglx, and /etc/Xorg.d/*nvidia.conf properly tells
Xorg to load it from /usr/lib64/xorg/modules/extensions/nvidia/. The real
issue is that the module cannot find libnvidia-tls.so.367.44.
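[Editor's note: not part of the original reply, but a diagnostic sketch for
chasing the missing library named in the log above. The library filename and
version (367.44) come from the quoted Xorg.0.log; the commands are standard
glibc/binutils tooling.]

```shell
# Check whether the dynamic linker's cache knows about the nvidia TLS
# library that libglx.so failed to resolve:
ldconfig -p | grep nvidia-tls || echo "libnvidia-tls not in ld.so cache"

# If the file exists on disk but is absent from the cache, its directory
# is probably missing from /etc/ld.so.conf.d/. Locate it first:
find /usr/lib64 -name 'libnvidia-tls.so*' 2>/dev/null

# ldd on the nvidia libglx shows every unresolved dependency at once
# (path from the log above; needs the file to be present):
#   ldd /usr/lib64/xorg/modules/extensions/nvidia/libglx.so | grep 'not found'
# After adding the directory to an ld.so.conf.d snippet, refresh the
# cache as root:
#   ldconfig
```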