Hello Thomas,

Your impression is mostly correct: the kernel will *try* to reclaim memory by writing out dirty pages before killing processes in a cgroup, but if it's unable to reclaim sufficient pages within some interval (I don't recall this off-hand) then it will start killing things.

We observed this on a 3.4 kernel where we could overwhelm the disk subsystem and trigger an OOM. Just how quickly this happens depends on how fast you're writing compared to how fast your disk subsystem can write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when contained in a memory cgroup will fill the cache quickly, reach its limit and get oom'ed. We were not able to reproduce this under 3.10 and 3.11 kernels. Which kernel are you using?

Example under 3.4:

    [idownes@hostname tmp]$ cat /proc/self/cgroup
    6:perf_event:/
    4:memory:/test
    3:freezer:/
    2:cpuacct:/
    1:cpu:/
    [idownes@hostname tmp]$ cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
    134217728
    [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
    Killed
    [idownes@hostname tmp]$ ls -lah lotsazeros
    -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros

You can also look in /proc/vmstat at nr_dirty to see how many dirty pages there are (system wide). If you wrote at a rate sustainable by your disk subsystem then you would see a sawtooth pattern _/|_/| ... (use something like watch) as the cgroup approached its limit and the kernel flushed dirty pages to bring it down.

This might be an interesting read:
http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

Hope this helps! Please do let us know if you're seeing this on a kernel >= 3.10; otherwise it's likely this is a kernel issue rather than something with Mesos.

Thanks,
Ian

On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <[email protected]> wrote:
> Hello,
>
> We're running Mesos 0.18.0 with cgroups isolation, and have run into
> situations where lots of file I/O causes tasks to be killed due to exceeding
> memory limits. Here's an example:
> https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>
> We were under the impression that if cache was using a lot of memory it
> would be reclaimed *before* the OOM process decides to kill the task. Is
> this accurate? We also found MESOS-762 while trying to diagnose -- could
> this be a regression?
>
> Thanks,
> Tom
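
P.S. In case it's useful, here is a rough sketch for watching the nr_dirty counter in /proc/vmstat while a write-heavy task runs. The function names (nr_dirty, watch_nr_dirty) and the optional file argument are made up for illustration, and the one-second interval is an arbitrary choice. Note this is the system-wide count, not a per-cgroup one.

```shell
#!/bin/sh
# Print the nr_dirty value (dirty pages, system wide) from a vmstat-style file.
# Hypothetical helper: reads /proc/vmstat unless another file is given.
nr_dirty() {
    # Each line of /proc/vmstat is "<name> <value>"; match the name exactly
    # so that nr_dirty_threshold and similar counters are not picked up.
    awk '$1 == "nr_dirty" { print $2 }' "${1:-/proc/vmstat}"
}

# Hypothetical helper: print a timestamped sample once per second until
# interrupted with Ctrl-C.
watch_nr_dirty() {
    while :; do
        printf '%s nr_dirty=%s\n' "$(date +%T)" "$(nr_dirty)"
        sleep 1
    done
}
```

Source the file and run watch_nr_dirty in one terminal while the dd runs in another. If the write rate is sustainable, the count should rise and fall repeatedly; if it only climbs until the task is killed, the disk is not keeping up.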

