Re: cgroups OOM handler causing lockups?

2014-07-25 Thread Whitney Sorenson
Following up on this issue. After setting up a test cluster running many parallel VMs and cgroups OOMs we were able to isolate at least one current issue: https://bugzilla.kernel.org/show_bug.cgi?id=80881 We've also noticed a separate deadlock issue occurring in 3.10 which does not appear to be p

Re: cgroups OOM handler causing lockups?

2014-07-01 Thread Whitney Sorenson
Thanks for clearing up about those patches. I can confirm:

cat /cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0

We can try to reproduce outside of Mesos and see if we have similar issues. Thankfully, we are not using EBS.

-Whitney

On Tue, Jul 1, 2014 at 1:36 PM, Ian Downes wr
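The check quoted above can also be scripted. The following is a minimal sketch, assuming the cgroup v1 `memory.oom_control` file contains whitespace-separated `name value` pairs as shown in the message; the function names and the default path are illustrative only and do not come from Mesos.

```python
def parse_oom_control(text):
    """Parse 'name value' lines from a memory.oom_control file into a dict."""
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name <integer>" lines; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            settings[parts[0]] = int(parts[1])
    return settings


def kernel_handles_oom(settings):
    """True when the kernel OOM killer is enabled (oom_kill_disable is 0)."""
    return settings.get("oom_kill_disable") == 0


def read_oom_control(path="/cgroup/memory/memory.oom_control"):
    """Read and parse the file; the default path is an assumption taken
    from the message above and differs between systems."""
    with open(path) as handle:
        return parse_oom_control(handle.read())


if __name__ == "__main__":
    # Sample text copied from the output quoted in the message.
    sample = "oom_kill_disable 0\nunder_oom 0\n"
    parsed = parse_oom_control(sample)
    print(parsed, kernel_handles_oom(parsed))
```

On a live host, `read_oom_control()` would replace the sample string, with the path adjusted to wherever the memory cgroup hierarchy is mounted.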

Re: cgroups OOM handler causing lockups?

2014-07-01 Thread Ian Downes
Hi Whitney, As Vinod said, 0.18.0 will ensure the kernel is set to handle OOM conditions. The patches you linked are refactors that should not have changed the behavior since 0.18.0. Could you please double check that /sys/fs/cgroup/memory/memory.oom_control has "oom_kill_disable 0"? Can you attempt

Re: cgroups OOM handler causing lockups?

2014-07-01 Thread Vinod Kone
Hey Whitney, I'll let Ian Downes comment on the specific patches you linked, but at a high level the bug in MESOS-662 was due to Mesos trying to handle OOM situations in user space instead of letting the kernel handle them. We have since changed the behavior to let the kernel handle the OOM. You can co

cgroups OOM handler causing lockups?

2014-07-01 Thread Whitney Sorenson
We've been running a few clusters on Amazon EC2 with Mesos 0.18.0 on the new-generation C3 machines (generally c3.8xl) and have been experiencing frequent system reboots. Due to this issue ( http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3CCAJRB3TEj%2Bx4VRYicJM7aj7avcjr6QeXR8BmSU