Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32 kernel.
I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got some
weird results. I initially gave the task 256 MB, and it never exceeded the
memory allocation (I killed the task manually after 5 minutes when the file
hit 50 GB). Then I noticed your example was 128 MB, so I resized and tried
again. It exceeded memory <https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82>
almost immediately. The next (replacement) task our framework started ran
successfully and never exceeded memory. I watched nr_dirty and it fluctuated
between 10000 and 14000 while the task was running. The slave host is a
c3.xlarge in EC2, if it makes a difference.

As Mesos users, we'd like an isolation strategy that isn't affected by cache
this much -- it makes it harder for us to appropriately size things. Is it
possible through Mesos or cgroups itself to make the page cache not count
towards the total memory consumption? If the answer is no, do you think it'd
be worth looking at using Docker for isolation instead?

-Tom

On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <[email protected]> wrote:

> Hello Thomas,
>
> Your impression is mostly correct: the kernel will *try* to reclaim
> memory by writing out dirty pages before killing processes in a cgroup,
> but if it's unable to reclaim sufficient pages within some interval (I
> don't recall this off-hand) then it will start killing things.
>
> We observed this on a 3.4 kernel where we could overwhelm the disk
> subsystem and trigger an oom. Just how quickly this happens depends on
> how fast you're writing compared to how fast your disk subsystem can
> write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
> contained in a memory cgroup will fill the cache quickly, reach its
> limit and get oom'ed. We were not able to reproduce this under 3.10
> and 3.11 kernels. Which kernel are you using?
>
> Example: under 3.4:
>
> [idownes@hostname tmp]$ cat /proc/self/cgroup
> 6:perf_event:/
> 4:memory:/test
> 3:freezer:/
> 2:cpuacct:/
> 1:cpu:/
> [idownes@hostname tmp]$ cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
> 134217728
> [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
> Killed
> [idownes@hostname tmp]$ ls -lah lotsazeros
> -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
>
> You can also look in /proc/vmstat at nr_dirty to see how many dirty
> pages there are (system-wide). If you wrote at a rate sustainable by
> your disk subsystem then you would see a sawtooth pattern _/|_/| ...
> (use something like watch) as the cgroup approached its limit and the
> kernel flushed dirty pages to bring it down.
>
> This might be an interesting read:
> http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>
> Hope this helps! Please do let us know if you're seeing this on a
> kernel >= 3.10; otherwise it's likely this is a kernel issue rather
> than something with Mesos.
>
> Thanks,
> Ian
>
>
> On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <[email protected]> wrote:
> > Hello,
> >
> > We're running Mesos 0.18.0 with cgroups isolation, and have run into
> > situations where lots of file I/O causes tasks to be killed due to
> > exceeding memory limits. Here's an example:
> > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
> >
> > We were under the impression that if cache was using a lot of memory it
> > would be reclaimed *before* the OOM process decides to kill the task. Is
> > this accurate? We also found MESOS-762 while trying to diagnose -- could
> > this be a regression?
> >
> > Thanks,
> > Tom
>
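P.S. In case it's useful to anyone else chasing this: one way to see how much
of a cgroup's charge is page cache versus anonymous memory is to read
memory.stat for the cgroup. A quick sketch against the
/sys/fs/cgroup/memory/test path from Ian's example (a real Mesos container's
cgroup path will differ):

    $ grep -E '^(cache|rss|mapped_file) ' /sys/fs/cgroup/memory/test/memory.stat
    # "cache" is page cache charged to the cgroup and "rss" is anonymous memory;
    # if cache dwarfs rss just before the OOM kill, it's the cache pushing the
    # container over its limit rather than the task's own allocations.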

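For anyone who wants to watch the dirty-page sawtooth Ian described, something
like this is enough (plain watch + grep over /proc/vmstat, nothing
Mesos-specific; the interval and pattern are just what seemed reasonable):

    $ watch -n1 "grep -E '^(nr_dirty|nr_writeback) ' /proc/vmstat"
    # system-wide counts of dirty and under-writeback pages, refreshed every second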

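And for reference, the knobs the lonesysadmin article talks about are the
global vm.dirty_* sysctls, which control how much dirty data the kernel lets
pile up before forcing writeback. The values here are purely illustrative, not
a recommendation, and they are system-wide on these kernels rather than
per-cgroup:

    $ sysctl vm.dirty_background_ratio vm.dirty_ratio    # show current settings
    $ sudo sysctl -w vm.dirty_background_ratio=5         # start background writeback at 5% of memory
    $ sudo sysctl -w vm.dirty_ratio=10                    # block writers once dirty pages reach 10% of memory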