Awesome response! inline below -
----- Original Message -----
> From: "Sharma Podila" <[email protected]>
> To: [email protected]
> Cc: "Ian Downes" <[email protected]>, "Eric Abbott" <[email protected]>
> Sent: Thursday, June 19, 2014 11:54:34 AM
> Subject: Re: cgroups memory isolation
>
> Purely from a user expectation point of view, I am wondering if such an
> "abuse" (overuse?) of I/O bandwidth/rate should translate into I/O
> bandwidth getting throttled for the job instead of it manifesting into an
> OOM that results in a job kill. Such I/O overuse translating into memory
> overuse seems like an implementation detail (for lack of a better phrase)
> of the OS that uses caching. It's not like the job asked for its memory to
> be used up for I/O caching :-)

In cgroups, you could optionally specify the memory limit as soft, vs. hard (OOM).

> I do see that this isn't Mesos specific, but rather a containerization
> artifact that is inevitable in a shared resource environment.
>
> That said, specifying memory size for jobs is not trivial in a shared
> resource environment. Conservative safe margins do help prevent OOMs, but
> they also come with the side effect of fragmenting resources and reducing
> utilization. In some cases, they can cause job starvation to some extent,
> if most available memory is allocated to the conservative buffering for
> every job.

Yup, unless you develop tuning models / hunting algorithms. You need some
level of global visibility & history.

> Another approach that could help, if feasible, is to have containers with
> elastic boundaries (different from over-subscription) that manage things
> such that the sum of actual usage of all containers is <= system
> resources. This helps when not all jobs have peak use of resources
> simultaneously.

You "could" use soft limits & resize; I like to call it the "push-over"
policy. If the limits are not enforced, what prevents abusive users in the
absence of global visibility?

IMHO, having soft cgroup memory limits as an option seems like the right
play given the environment. Thoughts?
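To make the soft vs. hard distinction concrete, here is a rough sketch
against the cgroup v1 memory controller (the cgroup name "test" and the
sizes below are just examples, not a recommendation):

  # create a throwaway memory cgroup
  mkdir /sys/fs/cgroup/memory/test

  # hard limit: if reclaim can't keep the cgroup under this, the OOM killer fires
  echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/test/memory.limit_in_bytes

  # soft limit: only reclaimed against under memory pressure; never OOMs by itself
  echo $((128 * 1024 * 1024)) > /sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes

  # move the current shell into the cgroup and run the workload
  echo $$ > /sys/fs/cgroup/memory/test/tasks
  dd if=/dev/zero of=lotsazeros bs=1M

The point being: with only the soft limit set, the page cache that dd
builds up gets reclaimed when the system needs memory, instead of the task
getting killed.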
> On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair <[email protected]> wrote:
>
> > FWIW - There is a classic grid mantra that applies here. Test your
> > workflow on an upper bound, then over-provision to be safe.
> >
> > Mesos is no different than SGE, PBS, LSF, Condor, etc.
> >
> > Also, there is no hunting algorithm for "jobs"; that would have to live
> > outside of Mesos itself, on some batch system built atop.
> >
> > Cheers,
> > Tim
> >
> > > From: "Thomas Petr" <[email protected]>
> > > To: "Ian Downes" <[email protected]>
> > > Cc: [email protected], "Eric Abbott" <[email protected]>
> > > Sent: Wednesday, June 18, 2014 9:36:51 AM
> > > Subject: Re: cgroups memory isolation
> > >
> > > Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
> > > kernel.
> > >
> > > I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
> > > some weird results. I initially gave the task 256 MB, and it never
> > > exceeded the memory allocation (I killed the task manually after 5
> > > minutes when the file hit 50 GB). Then I noticed your example was
> > > 128 MB, so I resized and tried again. It exceeded memory almost
> > > immediately. The next (replacement) task our framework started ran
> > > successfully and never exceeded memory. I watched nr_dirty and it
> > > fluctuated between 10,000 and 14,000 while the task was running. The
> > > slave host is a c3.xlarge in EC2, if it makes a difference.
> > >
> > > As Mesos users, we'd like an isolation strategy that isn't affected by
> > > cache this much -- it makes it harder for us to appropriately size
> > > things. Is it possible through Mesos or cgroups itself to make the
> > > page cache not count towards the total memory consumption? If the
> > > answer is no, do you think it'd be worth looking at using Docker for
> > > isolation instead?
> > >
> > > - Tom
> > >
> > > On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <[email protected]> wrote:
> > >
> > > > Hello Thomas,
> > > >
> > > > Your impression is mostly correct: the kernel will *try* to reclaim
> > > > memory by writing out dirty pages before killing processes in a
> > > > cgroup, but if it's unable to reclaim sufficient pages within some
> > > > interval (I don't recall this off-hand) then it will start killing
> > > > things.
> > > >
> > > > We observed this on a 3.4 kernel where we could overwhelm the disk
> > > > subsystem and trigger an OOM. Just how quickly this happens depends
> > > > on how fast you're writing compared to how fast your disk subsystem
> > > > can write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M"
> > > > when contained in a memory cgroup will fill the cache quickly, reach
> > > > its limit and get OOM'ed. We were not able to reproduce this under
> > > > 3.10 and 3.11 kernels. Which kernel are you using?
> > > >
> > > > Example: under 3.4:
> > > >
> > > > [idownes@hostname tmp]$ cat /proc/self/cgroup
> > > > 6:perf_event:/
> > > > 4:memory:/test
> > > > 3:freezer:/
> > > > 2:cpuacct:/
> > > > 1:cpu:/
> > > > [idownes@hostname tmp]$ cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
> > > > 134217728
> > > > [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
> > > > Killed
> > > > [idownes@hostname tmp]$ ls -lah lotsazeros
> > > > -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
> > > >
> > > > You can also look in /proc/vmstat at nr_dirty to see how many dirty
> > > > pages there are (system wide). If you wrote at a rate sustainable by
> > > > your disk subsystem then you would see a sawtooth pattern _/|_/| ...
> > > > (use something like watch) as the cgroup approached its limit and
> > > > the kernel flushed dirty pages to bring it down.
> > > >
> > > > This might be an interesting read:
> > > > http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
> > > >
> > > > Hope this helps! Please do let us know if you're seeing this on a
> > > > kernel >= 3.10, otherwise it's likely this is a kernel issue rather
> > > > than something with Mesos.
> > > >
> > > > Thanks,
> > > > Ian
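(Side note on Ian's nr_dirty suggestion above -- a rough sketch of what to
watch, with the cgroup path and sysctl values below purely illustrative:)

  # system-wide dirty/writeback pages; this is where you'd see the sawtooth
  watch -n1 "grep -E 'nr_dirty |nr_writeback ' /proc/vmstat"

  # per-cgroup breakdown: how much of the charge is page cache vs. anonymous memory
  grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/test/memory.stat

  # the knobs from the article Ian linked; the values here are only examples
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10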
> > > > On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > We're running Mesos 0.18.0 with cgroups isolation, and have run
> > > > > into situations where lots of file I/O causes tasks to be killed
> > > > > due to exceeding memory limits. Here's an example:
> > > > > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
> > > > >
> > > > > We were under the impression that if cache was using a lot of
> > > > > memory it would be reclaimed *before* the OOM killer decides to
> > > > > kill the task. Is this accurate? We also found MESOS-762 while
> > > > > trying to diagnose -- could this be a regression?
> > > > >
> > > > > Thanks,
> > > > > Tom
> >
> > --
> > Cheers,
> > Tim
> > Freedom, Features, Friends, First -> Fedora
> > https://fedoraproject.org/wiki/SIGs/bigdata

--
Cheers,
Tim
Freedom, Features, Friends, First -> Fedora
https://fedoraproject.org/wiki/SIGs/bigdata

