Re: [elrepo] hard lockups on CPU's with elrepo kernel 3.10.103 on CentOS 6

2016-10-06 Thread Akemi Yagi
On Thu, Oct 6, 2016 at 3:27 PM, Akemi Yagi  wrote:
> On Thu, Oct 6, 2016 at 2:22 PM, Grigory Shamov
>  wrote:
>> An update:
>>
>> Looks like the same issue was observed in Red Hat 7 kernels, also based on
>> 3.10. It pertains to the perf_event_overflow error with an increased
>> kernel.watchdog_thresh:
>>
>> https://access.redhat.com/solutions/1354963
>>
>> ```
>> * Red Hat Enterprise Linux (RHEL) 7
>> * seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
>> * the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value
>> than the default
>> * Docker
>> ```
>>
>> They report the panic with Docker; we see it with a normal application
>> workload (but HPC applications are long-running and use a lot of memory,
>> so they can be somewhat similar to a heavily used container).
>>
>> The Red Hat solution basically suggests updating to their later kernel.
>> What would one do with the ELRepo one?
>
> I'd like to track down the patch(es) Red Hat applied to fix the issue.
> It is possible that, while kernel-lt does not have the patch,
> kernel-ml may have it. At any rate the patch must be identified to
> find that out.

I now suspect the following patch was the one:

https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/9809b18fcf6b8d8ec4d3643677345907e6b50eca

It first appeared in kernel 3.12. RH backported it to 7.1/7.2 kernels.
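
For anyone who wants to flag nodes whose kernel predates that upstream release, here is a minimal Python sketch. It only compares the upstream version prefix against 3.12. The helper names are made up for illustration, and the release string in the usage note is just an example. A distribution kernel (such as the RHEL 7.1/7.2 ones mentioned above) can carry the fix as a backport while still reporting 3.10, so treat the result as a hint, not as proof.

```python
import re

# Upstream release in which the patch first appeared, per the message above.
FIRST_FIXED = (3, 12)


def upstream_version(release):
    """Extract (major, minor) from a kernel release string such as
    '3.10.103-1.el6.elrepo.x86_64' (example string, not from the thread)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def predates_fix(release):
    """True if the upstream base version is older than 3.12.

    Hypothetical helper: it cannot see vendor backports.
    """
    return upstream_version(release) < FIRST_FIXED
```

On a live system the string could come from `platform.release()`; a 3.10.x release string makes `predates_fix` return True.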

Akemi
___
elrepo mailing list
elrepo@lists.elrepo.org
http://lists.elrepo.org/mailman/listinfo/elrepo


Re: [elrepo] hard lockups on CPU's with elrepo kernel 3.10.103 on CentOS 6

2016-10-06 Thread Grigory Shamov
An update:

Looks like the same issue was observed in Red Hat 7 kernels, also based on
3.10. It pertains to the perf_event_overflow error with an increased
kernel.watchdog_thresh:


https://access.redhat.com/solutions/1354963

```
* Red Hat Enterprise Linux (RHEL) 7
* seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
* the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value
than the default
* Docker
```

They report the panic with Docker; we see it with a normal application
workload (but HPC applications are long-running and use a lot of memory,
so they can be somewhat similar to a heavily used container).

The Red Hat solution basically suggests updating to their later kernel.
What would one do with the ELRepo one?



-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625





On 2016-10-06, 11:31 AM, "elrepo-boun...@lists.elrepo.org on behalf of
Grigory Shamov"  wrote:

>Hi All,
>
>We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines in our
>HPC cluster.
>The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem chips,
>SSE4.2).
>We first tested whether the kernel works with our driver stack, were
>satisfied, and went to production.
>
>It turned out, though, that under production load, from time to time, on
>some of the nodes (a few of them, seemingly at random), the kernel panics
>on nmi_watchdog hard lockups (and from time to time complains about soft
>lockups), emitting various messages like this:
>
>"""
>Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> ... Call trace follows; mentions watchdog_overflow_callback ...
>Shutting down cpus with NMI
>drm_kms_helper: panic occurred, switching back to text console
>"""
>
>Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it
>is set to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
>That made things worse: the test node quickly had a kernel panic with a
>call trace mentioning "perf_event_overflow".
>
>Is there anything we can do about these errors, and what would be the
>possible reason for them? Could anyone suggest a fix? Thank you very much
>in advance.  
>
>
>-- 
>Grigory Shamov
>
>Westgrid/ComputeCanada Site Lead
>University of Manitoba
>E2-588 EITC Building,
>(204) 474-9625
>
>
>
>
>


[elrepo] hard lockups on CPU's with elrepo kernel 3.10.103 on CentOS 6

2016-10-06 Thread Grigory Shamov
Hi All,

We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines in our
HPC cluster.
The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem chips,
SSE4.2).
We first tested whether the kernel works with our driver stack, were
satisfied, and went to production.

It turned out, though, that under production load, from time to time, on
some of the nodes (a few of them, seemingly at random), the kernel panics
on nmi_watchdog hard lockups (and from time to time complains about soft
lockups), emitting various messages like this:

"""
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
 ... Call trace follows; mentions watchdog_overflow_callback ...
Shutting down cpus with NMI
drm_kms_helper: panic occurred, switching back to text console
"""

Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it
is set to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
That made things worse: the test node quickly had a kernel panic with a
call trace mentioning "perf_event_overflow".
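
To see what a node is actually using before and after such a change, a small Python sketch like the following could be run on each machine. It reads /proc/sys/kernel/watchdog_thresh and derives the soft-lockup threshold as twice that value, which is how the kernel's sysctl documentation describes the relationship. The helper names are illustrative only and do not come from any existing tool.

```python
DEFAULT_PATH = "/proc/sys/kernel/watchdog_thresh"


def read_watchdog_thresh(path=DEFAULT_PATH):
    """Return the hard-lockup threshold in seconds as an int."""
    with open(path) as handle:
        return int(handle.read().strip())


def lockup_thresholds(thresh):
    """Map watchdog_thresh to (hard, soft) lockup thresholds in seconds.

    The soft-lockup threshold is 2 * watchdog_thresh according to the
    kernel's sysctl documentation.
    """
    return thresh, 2 * thresh
```

With the value of 10 mentioned above, `lockup_thresholds(10)` gives a hard-lockup threshold of 10 s and a soft-lockup threshold of 20 s. On a node, `lockup_thresholds(read_watchdog_thresh())` reports the live setting.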

Is there anything we can do about these errors, and what would be the
possible reason for them? Could anyone suggest a fix? Thank you very much
in advance.  


-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625







[elrepo] skylake, nvidia, bumblebee

2016-10-06 Thread Nicolas Thierry-Mieg

Hello,

as mentioned a few weeks ago I am trying to get C7 fully functional on a 
new Acer Aspire E5-575G-55DE laptop, which has an intel i5-6200U with 
integrated HD Graphics 520 as well as a nvidia GeForce GTX 950M.


Thanks to advice from this list I got the intel graphics to work with 
the standard EL7 kernel by adding i915.preliminary_hw_support=1 to the 
kernel options.
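
To confirm that the option really reached the running kernel, the contents of /proc/cmdline can be checked. A rough Python sketch, with hypothetical helper names and no handling of quoted values that contain spaces:

```python
def cmdline_options(cmdline):
    """Split a kernel command line into a dict of option -> value.

    Options without '=' map to None. Only the first '=' separates the
    name from the value, so 'root=UUID=abc' maps 'root' to 'UUID=abc'.
    """
    options = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def has_preliminary_hw_support(cmdline):
    """True if i915.preliminary_hw_support=1 is on the command line."""
    return cmdline_options(cmdline).get("i915.preliminary_hw_support") == "1"
```

On the laptop this would be called as `has_preliminary_hw_support(open("/proc/cmdline").read())`.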


I then followed the instructions from http://elrepo.org/tiki/bumblebee
to install the nvidia kmod and driver, as well as bumblebee. After
rebooting I get an "Oh no! Something has gone wrong" screen.


I managed to fix this by
sudo mv /usr/lib64/xorg/modules/extensions/nvidia/ /usr/lib64/xorg/
and modifying XorgModulePath accordingly in bumblebee.conf .

Everything now works: I can now boot in X mode (using the intel gpu) and 
start applications using the nvidia gpu with optirun.


Comparing the Xorg.0.log files between initial (broken) install and 
after my fix, the main difference is (removing timestamps for readability):


BROKEN:
(II) LoadModule: "glx"
(II) Loading /usr/lib64/xorg/modules/extensions/nvidia/libglx.so
(EE) Failed to load /usr/lib64/xorg/modules/extensions/nvidia/libglx.so: libnvidia-tls.so.367.44: cannot open shared object file: No such file or directory

(II) UnloadModule: "glx"
(II) Unloading glx
(EE) Failed to load module "glx" (loader failed, 7)

FIXED:
(II) LoadModule: "glx"
(II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
(II) Module glx: vendor="X.Org Foundation"
compiled for 1.17.2, module version = 1.0.0
ABI class: X.Org Server Extension, version 9.0
(==) AIGLX enabled


So it seems that when the nvidia libglx is in
/usr/lib64/xorg/modules/extensions/nvidia/libglx.so , X tries to load that
instead of the correct /usr/lib64/xorg/modules/extensions/libglx.so .


This is surprising because in both cases I have:
(==) ModulePath set to "/usr/lib64/xorg/modules"
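
A comparison like the one above can be done without eyeballing the two logs by pulling out only the "Loading" paths and the (EE) lines. The following Python sketch does that. The function name and regular expressions are assumptions for illustration, and it expects each log entry on a single line, as in a raw Xorg.0.log.

```python
import re

LOADING = re.compile(r"^\(II\) Loading (\S+)")
ERROR = re.compile(r"^\(EE\) (.*)")
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def summarize_xorg_log(lines):
    """Return (loaded_paths, errors) found in Xorg log lines.

    A leading '[   12.345]' timestamp is stripped first, so raw and
    timestamp-free logs are handled alike.
    """
    loaded, errors = [], []
    for raw in lines:
        line = TIMESTAMP.sub("", raw.rstrip("\n"))
        match = LOADING.match(line)
        if match:
            loaded.append(match.group(1))
            continue
        match = ERROR.match(line)
        if match:
            errors.append(match.group(1))
    return loaded, errors
```

Running it over both versions of /var/log/Xorg.0.log and comparing the two `loaded` lists should show directly which libglx.so each X session picked up.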


The elrepo instructions worked for me on another laptop with 
intel+nvidia, so I don't understand. I had to fiddle with a bunch of 
things on this laptop, both for this nvidia problem and to try to get 
the wifi working, so it's possible I screwed up something. But I've 
double-checked everything I could think of without finding the culprit.
So, does anyone have any ideas why X is picking up the wrong libglx.so 
here? I'm willing to check any files and test things, I just don't know 
where to look or what to try now.


Thanks,
Nicolas


Re: [elrepo] skylake, nvidia, bumblebee

2016-10-06 Thread Manuel Wolfshant

On 10/06/2016 04:15 PM, Nicolas Thierry-Mieg wrote:

> Hello,
>
> as mentioned a few weeks ago I am trying to get C7 fully functional on
> a new Acer Aspire E5-575G-55DE laptop, which has an intel i5-6200U
> with integrated HD Graphics 520 as well as a nvidia GeForce GTX 950M.
>
> Thanks to advice from this list I got the intel graphics to work with
> the standard EL7 kernel by adding i915.preliminary_hw_support=1 to the
> kernel options.
>
> I then followed the instructions from http://elrepo.org/tiki/bumblebee
> to install the nvidia kmod and driver, as well as bumblebee. After
> rebooting I get an "Oh no! Something has gone wrong" screen.
>
> I managed to fix this by
> sudo mv /usr/lib64/xorg/modules/extensions/nvidia/ /usr/lib64/xorg/
> and modifying XorgModulePath accordingly in bumblebee.conf .
>
> Everything now works: I can now boot in X mode (using the intel gpu)
> and start applications using the nvidia gpu with optirun.
>
> Comparing the Xorg.0.log files between initial (broken) install and
> after my fix, the main difference is (removing timestamps for
> readability):
>
> BROKEN:
> (II) LoadModule: "glx"
> (II) Loading /usr/lib64/xorg/modules/extensions/nvidia/libglx.so
> (EE) Failed to load /usr/lib64/xorg/modules/extensions/nvidia/libglx.so: libnvidia-tls.so.367.44: cannot open shared object file: No such file or directory
>
> (II) UnloadModule: "glx"
> (II) Unloading glx
> (EE) Failed to load module "glx" (loader failed, 7)
>
> FIXED:
> (II) LoadModule: "glx"
> (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
> (II) Module glx: vendor="X.Org Foundation"
> compiled for 1.17.2, module version = 1.0.0
> ABI class: X.Org Server Extension, version 9.0
> (==) AIGLX enabled
>
> So it seems that when the nvidia libglx is in
> /usr/lib64/xorg/modules/extensions/nvidia/libglx.so , X tries to load that
> instead of the correct /usr/lib64/xorg/modules/extensions/libglx.so .



I might be wrong but I think that you did not fix it correctly.
nvidia comes with its own libglx and /etc/Xorg.d/*nvidia.conf properly 
tells Xorg to load it from /usr/lib64/xorg/modules/extensions/nvidia/. 
The real issue is that the module cannot find libnvidia-tls.so.367.44
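
One way to confirm which dependencies the module fails to resolve is to run ldd on it and look for "not found" entries. Below is a minimal Python sketch of that check. The helper names are hypothetical, and the parser only assumes the usual `name => not found` layout of ldd output.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


def check_module(path):
    """Run ldd on a shared object and list its unresolved dependencies."""
    result = subprocess.run(
        ["ldd", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return missing_libraries(result.stdout)
```

Calling `check_module("/usr/lib64/xorg/modules/extensions/nvidia/libglx.so")` with the module in its original location should list libnvidia-tls.so.367.44 if the dynamic linker cannot find it. The next step would then be to locate that file and check whether its directory is in the linker's search path.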

