Eric pointed out that I had a typo in the instance type -- it's a c3.8xlarge (containing SSDs, which could make a difference here).
On Wed, Jun 18, 2014 at 10:36 AM, Thomas Petr <[email protected]> wrote:

> Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
> kernel.
>
> I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
> some weird results. I initially gave the task 256 MB, and it never
> exceeded the memory allocation (I killed the task manually after 5
> minutes when the file hit 50 GB). Then I noticed your example was 128
> MB, so I resized and tried again. It exceeded memory
> <https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82> almost immediately.
> The next (replacement) task our framework started ran successfully and
> never exceeded memory. I watched nr_dirty and it fluctuated between
> 10000 and 14000 while the task was running. The slave host is a
> c3.xlarge in EC2, if it makes a difference.
>
> As Mesos users, we'd like an isolation strategy that isn't affected by
> cache this much -- it makes it harder for us to appropriately size
> things. Is it possible through Mesos or cgroups itself to make the page
> cache not count towards the total memory consumption? If the answer is
> no, do you think it'd be worth looking at using Docker for isolation
> instead?
>
> -Tom
>
> On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <[email protected]> wrote:
>
>> Hello Thomas,
>>
>> Your impression is mostly correct: the kernel will *try* to reclaim
>> memory by writing out dirty pages before killing processes in a cgroup,
>> but if it's unable to reclaim sufficient pages within some interval (I
>> don't recall the exact value off-hand) then it will start killing
>> things.
>>
>> We observed this on a 3.4 kernel where we could overwhelm the disk
>> subsystem and trigger an oom. Just how quickly this happens depends on
>> how fast you're writing compared to how fast your disk subsystem can
>> write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
>> contained in a memory cgroup will fill the cache quickly, reach its
>> limit and get oom'ed. We were not able to reproduce this under 3.10
>> and 3.11 kernels. Which kernel are you using?
>>
>> Example: under 3.4:
>>
>> [idownes@hostname tmp]$ cat /proc/self/cgroup
>> 6:perf_event:/
>> 4:memory:/test
>> 3:freezer:/
>> 2:cpuacct:/
>> 1:cpu:/
>> [idownes@hostname tmp]$ cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
>> 134217728
>> [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
>> Killed
>> [idownes@hostname tmp]$ ls -lah lotsazeros
>> -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
>>
>> You can also look in /proc/vmstat at nr_dirty to see how many dirty
>> pages there are (system wide). If you wrote at a rate sustainable by
>> your disk subsystem then you would see a sawtooth pattern _/|_/| ...
>> (use something like watch) as the cgroup approached its limit and the
>> kernel flushed dirty pages to bring it down.
>>
>> This might be an interesting read:
>> http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>>
>> Hope this helps! Please do let us know if you're seeing this on a
>> kernel >= 3.10; otherwise it's likely this is a kernel issue rather
>> than something with Mesos.
>>
>> Thanks,
>> Ian
>>
>> On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <[email protected]> wrote:
>> > Hello,
>> >
>> > We're running Mesos 0.18.0 with cgroups isolation, and have run into
>> > situations where lots of file I/O causes tasks to be killed due to
>> > exceeding memory limits. Here's an example:
>> > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>> >
>> > We were under the impression that if cache was using a lot of memory
>> > it would be reclaimed *before* the OOM killer decides to kill the
>> > task. Is this accurate? We also found MESOS-762 while trying to
>> > diagnose -- could this be a regression?
>> >
>> > Thanks,
>> > Tom
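
For anyone reproducing this on their own hosts: Ian's "use something like watch" suggestion can be as simple as the one-liner below. This is just a sketch -- the 1-second interval and the extra nr_writeback counter are illustrative, the counters are system wide, and the values are in pages (typically 4 KB each). You should see the sawtooth he describes while dd runs inside the cgroup.

    # Poll the system-wide dirty/writeback page counters once a second.
    watch -n 1 'egrep "nr_dirty|nr_writeback" /proc/vmstat'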

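On Tom's question about page cache counting against the limit: with the cgroups memory controller, memory.usage_in_bytes does include page cache, but memory.stat breaks the charge down into cache vs. rss, so you can at least see how much of a task's footprint is reclaimable cache. A sketch, assuming the task landed in the same /test memory cgroup used in Ian's example:

    # Total charge against the limit (anonymous memory + page cache).
    cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
    # Breakdown of that charge; "cache" is the reclaimable page-cache part.
    grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/test/memory.stat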

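For reference, the knobs discussed in the article Ian linked are the vm.dirty_* sysctls: lowering them makes the kernel start writeback sooner, so dirty pages can't pile up as far before a cgroup hits its limit. Note these are system-wide thresholds, not per-cgroup, and the values below are purely illustrative -- tune them against your own disks and workload.

    # Start background writeback at 5% of RAM and block writers at 10%
    # (many distros default to 10% / 20%).
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10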