Thanks for clearing that up about the patches. I can confirm:
cat /cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0

We can try to reproduce outside of Mesos and see if we have similar
issues. Thankfully, we are not using EBS.

-Whitney

On Tue, Jul 1, 2014 at 1:36 PM, Ian Downes <[email protected]> wrote:
> Hi Whitney,
>
> As Vinod said, 0.18.0 will ensure the kernel is set to handle OOM
> conditions. The patches you linked are refactors that should not have
> changed the behavior since 0.18.0. Could you please double-check that
> /sys/fs/cgroup/memory/memory.oom_control has "oom_kill_disable 0"?
>
> Can you attempt to reproduce this outside of Mesos by running a
> process inside a manually created memory cgroup? Something like the dd
> command in the thread you linked should trigger the OOM handler to run
> and probably kill processes. Or, perhaps run your java process with a
> much lower memory limit.
>
> From the trace you provided I see mention of an ext4 write, so it looks
> like the OOM handler is indeed trying to flush dirty pages to disk.
> Are you running this on EBS? If so, there could be timing issues here
> that the kernel isn't handling well; can you test using just the
> ephemeral disk(s)?
>
> Please do let us know how this goes!
>
> Ian
>
> On Tue, Jul 1, 2014 at 10:17 AM, Vinod Kone <[email protected]> wrote:
> > Hey Whitney,
> >
> > I'll let Ian Downes comment on the specific patches you linked, but at a
> > high level the bug in MESOS-662 was due to Mesos trying to handle OOM
> > situations in user space instead of letting the kernel handle it. We have
> > since then changed the behavior to let the kernel handle the OOM. You can
> > confirm this by checking the "oom_control" file in the cgroup of your
> > container (it should say 'oom_kill_disable 0').
> >
> >
> > On Tue, Jul 1, 2014 at 9:12 AM, Whitney Sorenson <[email protected]>
> > wrote:
> >>
> >> We've been running a few clusters on Amazon EC2 with Mesos 0.18.0 on the
> >> new generation C3 machines (generally c3.8xl) and have been experiencing
> >> frequent system reboots.
> >>
> >> Due to this issue
> >> (http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3CCAJRB3TEj%2Bx4VRYicJM7aj7avcjr6QeXR8BmSUehrc6_tV62DLw%40mail.gmail.com%3E)
> >> we have been experimenting with some 3.10.25-1.el6.elrepo.x86_64 kernel
> >> machines (the rest of the cluster is 2.6.32-431.el6.x86_64). Both sets of
> >> machines seem equally likely to experience reboots, although the 3.10
> >> machines do not come back unaided.
> >>
> >> It seems that the kernel runs into problems in the OOM handler, and we
> >> see traces such as:
> >>
> >> [378328.089052] BUG: soft lockup - CPU#17 stuck for 22s! [java7:23300]
> >> (https://gist.github.com/wsorenson/d2a12f1892b43aa28936)
> >>
> >> Is this possibly related to
> >> https://issues.apache.org/jira/browse/MESOS-662 ?
> >>
> >> Any guidance on how to debug further, or on whether this is a known issue
> >> with certain Mesos versions? Some sleuthing indicates that a patch for the
> >> above may have been added, removed, and added again later.
> >>
> >> Amazon has suggested using 3.10.45-1.el6.elrepo.x86_64 since they report
> >> some cgroup deadlock fixes. We are testing this out asap.
> >>
> >> Thanks,
> >>
> >> -Whitney
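
For reference, a minimal sketch of the out-of-Mesos reproduction Ian
describes, assuming a cgroup v1 memory hierarchy mounted at
/cgroup/memory as in the output above; the group name "oomtest" and the
64 MB limit are illustrative, not anything from the thread:

  # create a throwaway memory cgroup (the name is arbitrary)
  mkdir /cgroup/memory/oomtest

  # set a deliberately small hard limit, e.g. 64 MB
  echo $((64 * 1024 * 1024)) > /cgroup/memory/oomtest/memory.limit_in_bytes

  # confirm the kernel OOM killer is enabled for this group
  cat /cgroup/memory/oomtest/memory.oom_control   # expect "oom_kill_disable 0"

  # move the current shell into the group, then allocate past the limit;
  # dd holds its block buffer in memory, so a 128 MB block size should
  # exceed the 64 MB limit and get dd OOM-killed
  echo $$ > /cgroup/memory/oomtest/tasks
  dd if=/dev/zero of=/dev/null bs=128M count=1

On a healthy kernel, dd should simply be killed and an oom-killer entry
should appear in dmesg; a soft lockup like the one in the gist would
instead point at a kernel-side problem rather than Mesos.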

