Hey Whitney,

I'll let Ian Downes comment on the specific patches you linked, but at a
high level the bug in MESOS-662 was caused by Mesos trying to handle OOM
situations in user space instead of letting the kernel handle them. We have
since changed the behavior to let the kernel handle the OOM. You can confirm
this by checking the "memory.oom_control" file in the cgroup of your
container (it should say 'oom_kill_disable 0').
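
If you want to script that check across slaves, here is a minimal sketch.
It assumes the cgroup v1 memory controller, where the file is named
"memory.oom_control" and holds "key value" lines. The helper names and the
example path in the comment are made up for illustration; the real location
depends on where your cgroup hierarchy is mounted and how the slave is
configured.

```python
from pathlib import Path


def parse_oom_control(text):
    """Parse the 'key value' lines of a cgroup v1 memory.oom_control file."""
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines, ignore anything else.
        if len(parts) == 2 and parts[1].isdigit():
            settings[parts[0]] = int(parts[1])
    return settings


def kernel_handles_oom(text):
    """True when the kernel OOM killer is enabled (oom_kill_disable is 0)."""
    return parse_oom_control(text).get("oom_kill_disable") == 0


def check_container(cgroup_dir):
    """Read memory.oom_control from a container's cgroup directory.

    cgroup_dir is an assumption, for example something like
    /sys/fs/cgroup/memory/mesos/<container-id> on your hosts.
    """
    text = (Path(cgroup_dir) / "memory.oom_control").read_text()
    return kernel_handles_oom(text)
```

Calling check_container() on each container's cgroup directory should
return True if the kernel is the one handling the OOM.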


On Tue, Jul 1, 2014 at 9:12 AM, Whitney Sorenson <[email protected]>
wrote:

> We've been running a few clusters on Amazon EC2 with mesos 0.18.0 on the
> new generation C3 machines (generally c3.8xl) and have been experiencing
> frequent system reboots.
>
> Due to this issue (
> http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3CCAJRB3TEj%2Bx4VRYicJM7aj7avcjr6QeXR8BmSUehrc6_tV62DLw%40mail.gmail.com%3E)
> we have been experimenting with some 3.10.25-1.el6.elrepo.x86_64 kernel
> machines (the rest of the cluster is 2.6.32-431.el6.x86_64). Both sets of
> machines seem equally likely to experience reboots, although the 3.10
> machines do not come back unaided.
>
> It seems that the kernel runs into problems in the OOM handler, and we see
> traces such as:
>
> [378328.089052] BUG: soft lockup - CPU#17 stuck for 22s! [java7:23300]
> (https://gist.github.com/wsorenson/d2a12f1892b43aa28936)
>
> Is this possibly related to
> https://issues.apache.org/jira/browse/MESOS-662 ?
>
> Any guidance on how to debug further or if this is a known issue with
> certain mesos versions? Some sleuthing indicates that a patch for the above
> may have been added, removed
> <https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=f90fe7641ea8f7066a6a1171a24ddaa8dc30e789>,
> and added again
> <https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=326aa493fb445302137d538d456712249504d251>
> later.
>
> Amazon has suggested using 3.10.45-1.el6.elrepo.x86_64 since they report
> some cgroup deadlock fixes. We are testing this out asap.
>
> Thanks,
>
> -Whitney
>
