cgroups OOM handler causing lockups?

Whitney Sorenson Tue, 01 Jul 2014 09:13:29 -0700

We've been running a few clusters on Amazon EC2 with Mesos 0.18.0 on the
new-generation C3 machines (generally c3.8xlarge) and have been experiencing
frequent system reboots.


Because of this issue (
http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3CCAJRB3TEj%2Bx4VRYicJM7aj7avcjr6QeXR8BmSUehrc6_tV62DLw%40mail.gmail.com%3E)
we have been trying the 3.10.25-1.el6.elrepo.x86_64 kernel on some machines
(the rest of the cluster runs 2.6.32-431.el6.x86_64). Both sets of machines
seem equally likely to reboot, although the 3.10 machines do not come back
up unaided.

It seems that the kernel runs into problems in the OOM handler, and we see
traces such as:

[378328.089052] BUG: soft lockup - CPU#17 stuck for 22s! [java7:23300]
(https://gist.github.com/wsorenson/d2a12f1892b43aa28936)
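
To count how often this happens and which processes are involved, the
soft-lockup lines can be pulled out of a saved kernel log. This is only a
minimal sketch: it assumes the log has already been captured as text (for
example from dmesg or the serial console) and that lines follow the same
format as the one above. The function name parse_soft_lockups is
hypothetical, not part of any Mesos or kernel tooling.

```python
import re

# Matches kernel log lines of the form shown above, e.g.:
# [378328.089052] BUG: soft lockup - CPU#17 stuck for 22s! [java7:23300]
# The greedy comm group allows task names that themselves contain a colon.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(log_text):
    """Return one dict per soft-lockup line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue  # not a soft-lockup line
        events.append({
            "timestamp": float(match.group("ts")),
            "cpu": int(match.group("cpu")),
            "stuck_seconds": int(match.group("secs")),
            "command": match.group("comm"),
            "pid": int(match.group("pid")),
        })
    return events
```

Grouping the resulting events by command or by CPU may help show whether the
lockups are tied to particular tasks.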

Is this possibly related to https://issues.apache.org/jira/browse/MESOS-662 ?

Is there any guidance on how to debug this further, or is this a known issue
with certain Mesos versions? Some sleuthing indicates that a patch for the
above may have been added, removed
<https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=f90fe7641ea8f7066a6a1171a24ddaa8dc30e789>,
and added again
<https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=326aa493fb445302137d538d456712249504d251>
later.

Amazon has suggested using 3.10.45-1.el6.elrepo.x86_64, since they report
that it includes some cgroup deadlock fixes. We are testing this as soon as
possible.

Thanks,

-Whitney
