Thanks for clearing up the question about those patches.

I can confirm:

cat /cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0
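
We can also spot-check each container's cgroup with something like this
(a sketch; assuming the Mesos hierarchy sits under /cgroup/memory/mesos,
which may differ on our hosts):

# print oom_control for every container cgroup under the Mesos hierarchy
for f in /cgroup/memory/mesos/*/memory.oom_control; do
  echo "== $f"; cat "$f"
done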

We can try to reproduce outside of Mesos and see if we have similar issues.
Thankfully, we are not using EBS.
-Whitney



On Tue, Jul 1, 2014 at 1:36 PM, Ian Downes <[email protected]> wrote:

> Hi Whitney,
>
> As Vinod said, 0.18.0 will ensure the kernel is set to handle OOM
> conditions. The patches you linked are refactors that should not have
> changed the behavior since 0.18.0. Could you please double check that
> /sys/fs/cgroup/memory/memory.oom_control has "oom_kill_disable 0"?
>
> Can you attempt to reproduce this outside of Mesos by running a
> process inside a manually created memory cgroup? Something like the dd
> command in the thread you linked should trigger the OOM handler to run
> and probably kill processes. Or perhaps run your java process with a
> much lower memory limit.
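>
> A rough sketch of what I mean (untested; the mount point, cgroup name,
> and 64MB limit are placeholders to adapt to your hosts):
>
> # create a memory cgroup with a small limit (cgroup v1)
> mkdir /sys/fs/cgroup/memory/oomtest
> echo $((64 * 1024 * 1024)) > /sys/fs/cgroup/memory/oomtest/memory.limit_in_bytes
> # move this shell into the cgroup, then generate page cache with dd;
> # the cached pages are charged to the cgroup and should trip the OOM handler
> echo $$ > /sys/fs/cgroup/memory/oomtest/tasks
> dd if=/dev/zero of=/tmp/oomtest.out bs=1M count=1024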
>
> From the trace you provided, I see mention of an ext4 write, so it looks
> like the OOM handler is indeed trying to flush dirty pages to disk.
> Are you running this on EBS? If so, there could be timing issues here
> that the kernel isn't handling well; can you test using just the
> ephemeral disk(s)?
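>
> To see which device currently backs the slave work directory (a sketch;
> /tmp/mesos is assumed as the default work_dir and yours may differ):
>
> df -h /tmp/mesos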
>
> Please do let us know how this goes!
>
> Ian
>
> On Tue, Jul 1, 2014 at 10:17 AM, Vinod Kone <[email protected]> wrote:
> > Hey Whitney,
> >
> > I'll let Ian Downes comment on the specific patches you linked, but at a
> > high level the bug in MESOS-662 was due to Mesos trying to handle OOM
> > situations in user space instead of letting the kernel handle them. We
> > have since changed the behavior to let the kernel handle the OOM. You
> > can confirm this by checking the "oom_control" file in the cgroup of
> > your container (it should say 'oom_kill_disable 0').
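> >
> > For example (illustrative; the exact container cgroup path will differ,
> > and <container-id> is a placeholder):
> >
> > cat /sys/fs/cgroup/memory/mesos/<container-id>/memory.oom_control
> > # expect: oom_kill_disable 0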
> >
> >
> > On Tue, Jul 1, 2014 at 9:12 AM, Whitney Sorenson <[email protected]>
> > wrote:
> >>
> >> We've been running a few clusters on Amazon EC2 with mesos 0.18.0 on the
> >> new generation C3 machines (generally c3.8xl) and have been experiencing
> >> frequent system reboots.
> >>
> >> Due to this issue
> >> (http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3CCAJRB3TEj%2Bx4VRYicJM7aj7avcjr6QeXR8BmSUehrc6_tV62DLw%40mail.gmail.com%3E)
> >> we have been experimenting with some 3.10.25-1.el6.elrepo.x86_64 kernel
> >> machines (the rest of the cluster is 2.6.32-431.el6.x86_64). Both sets
> >> of machines seem equally likely to experience reboots, although the 3.10
> >> machines do not come back unaided.
> >>
> >> It seems that the kernel runs into problems in the OOM handler, and we
> >> see traces such as:
> >>
> >> [378328.089052] BUG: soft lockup - CPU#17 stuck for 22s! [java7:23300]
> >> (https://gist.github.com/wsorenson/d2a12f1892b43aa28936)
> >>
> >> Is this possibly related to
> >> https://issues.apache.org/jira/browse/MESOS-662 ?
> >>
> >> Any guidance on how to debug this further, or is this a known issue with
> >> certain Mesos versions? Some sleuthing indicates that a patch for the
> >> above may have been added, removed, and added again later.
> >>
> >> Amazon has suggested using 3.10.45-1.el6.elrepo.x86_64 since they report
> >> it includes some cgroup deadlock fixes. We are testing this out ASAP.
> >>
> >> Thanks,
> >>
> >> -Whitney
> >
> >
>
