Awesome response!

Replies inline below --

----- Original Message -----

> From: "Sharma Podila" <[email protected]>
> To: [email protected]
> Cc: "Ian Downes" <[email protected]>, "Eric Abbott" <[email protected]>
> Sent: Thursday, June 19, 2014 11:54:34 AM
> Subject: Re: cgroups memory isolation

> Purely from a user expectation point of view, I am wondering if such an
> "abuse" (overuse?) of I/O bandwidth/rate should translate into I/O bandwidth
> getting throttled for the job instead of it manifesting into an OOM that
> results in a job kill. Such I/O overuse translating into memory overuse
> seems like an implementation detail (for lack of a better phrase) of the OS
> that uses caching. It's not like the job asked for its memory to be used
> up for I/O caching :-)

In cgroups, you can optionally specify the memory limit as soft rather than 
hard (where exceeding a hard limit triggers the cgroup OOM killer). 
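For reference, a minimal sketch of the two knobs in the cgroup v1 memory
controller (assumes the controller is mounted at /sys/fs/cgroup/memory; the
"test" group name and sizes are just examples, and the writes need root):

```shell
# create an example cgroup and set both limits (cgroup v1; run as root)
mkdir -p /sys/fs/cgroup/memory/test
# soft limit: a reclaim target the kernel enforces only under memory pressure
echo $((128*1024*1024)) > /sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes
# hard limit: exceeding this (and failing reclaim) triggers the cgroup OOM killer
echo $((256*1024*1024)) > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
```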

> I do see that this isn't Mesos specific, but, rather a containerization
> artifact that is inevitable in a shared resource environment.

> That said, specifying memory size for jobs is not trivial in a shared
> resource environment. Conservative safe margins do help prevent OOMs, but,
> they also come with the side effect of fragmenting resources and reducing
> utilization. In some cases, they can cause job starvation to some extent, if
> most available memory is allocated to the conservative buffering for every
> job.

Yup, unless you develop tuning models / hunting algorithms. You need some level 
of global visibility & history. 
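As a toy illustration (the sample values and the 20% headroom rule are made
up), sizing from history could be as simple as taking the observed peak RSS
and padding it:

```shell
# hypothetical sizing rule: limit = historical peak RSS + 20% headroom
# (RSS samples in MB are invented for illustration)
peak=$(printf '%s\n' 180 205 190 198 | sort -n | tail -n 1)
limit=$(( peak * 120 / 100 ))
echo "suggested limit: ${limit} MB"
```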

> Another approach that could help, if feasible, is to have containers with
> elastic boundaries (different from over-subscription) that manage things
> such that sum of actual usage of all containers is <= system resources. This
> helps when not all jobs have peak use of resources simultaneously.

You "could" use soft limits & resize, I like to call it the "push-over" policy. 
If the limits are not enforced, what prevents abusive users in absence of 
global visibility? 
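A sketch of what "resize" could mean in practice, assuming cgroup v1 and a
hypothetical job42 group (root required): the hard-limit file is writable at
runtime, so an external agent with global visibility could grow and shrink it.

```shell
# grow the cap while the host has slack memory (group name is hypothetical)
echo $((768*1024*1024)) > /sys/fs/cgroup/memory/job42/memory.limit_in_bytes
# shrink it back under pressure; the write fails if the kernel cannot
# reclaim the group's usage down below the new limit
echo $((384*1024*1024)) > /sys/fs/cgroup/memory/job42/memory.limit_in_bytes
```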

IMHO, having soft cgroup memory limits as an option seems to be the right 
play given the environment. 

Thoughts? 

> On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair <[email protected]> wrote:
>
> > FWIW - There is classic grid mantra that applies here. Test your workflow
> > on an upper bound, then over provision to be safe.
> >
> > Mesos is no different than SGE, PBS, LSF, Condor, etc.
> >
> > Also, there is no hunting algo for "jobs"; that would have to live
> > outside of Mesos itself, on some batch system built atop.
> >
> > Cheers,
> > Tim
> >
> > > From: "Thomas Petr" <[email protected]>
> > > To: "Ian Downes" <[email protected]>
> > > Cc: [email protected], "Eric Abbott" <[email protected]>
> > > Sent: Wednesday, June 18, 2014 9:36:51 AM
> > > Subject: Re: cgroups memory isolation
> > >
> > > Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
> > > kernel.
> > >
> > > I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
> > > some weird results. I initially gave the task 256 MB, and it never
> > > exceeded the memory allocation (I killed the task manually after 5
> > > minutes when the file hit 50 GB). Then I noticed your example was
> > > 128 MB, so I resized and tried again. It exceeded memory almost
> > > immediately. The next (replacement) task our framework started ran
> > > successfully and never exceeded memory. I watched nr_dirty and it
> > > fluctuated between 10000 and 14000 while the task was running. The
> > > slave host is a c3.xlarge in EC2, if it makes a difference.
> > >
> > > As Mesos users, we'd like an isolation strategy that isn't affected by
> > > cache this much -- it makes it harder for us to appropriately size
> > > things. Is it possible through Mesos or cgroups itself to make the
> > > page cache not count towards the total memory consumption? If the
> > > answer is no, do you think it'd be worth looking at using Docker for
> > > isolation instead?
> > >
> > > - Tom
> > >
> > > On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <[email protected]> wrote:
> > >
> > > > Hello Thomas,
> > > >
> > > > Your impression is mostly correct: the kernel will *try* to reclaim
> > > > memory by writing out dirty pages before killing processes in a
> > > > cgroup, but if it's unable to reclaim sufficient pages within some
> > > > interval (I don't recall this off-hand) then it will start killing
> > > > things.
> > > >
> > > > We observed this on a 3.4 kernel where we could overwhelm the disk
> > > > subsystem and trigger an oom. Just how quickly this happens depends
> > > > on how fast you're writing compared to how fast your disk subsystem
> > > > can write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M"
> > > > when contained in a memory cgroup will fill the cache quickly, reach
> > > > its limit and get oom'ed. We were not able to reproduce this under
> > > > 3.10 and 3.11 kernels. Which kernel are you using?
> > > >
> > > > Example: under 3.4:
> > > >
> > > > [idownes@hostname tmp]$ cat /proc/self/cgroup
> > > > 6:perf_event:/
> > > > 4:memory:/test
> > > > 3:freezer:/
> > > > 2:cpuacct:/
> > > > 1:cpu:/
> > > > [idownes@hostname tmp]$ cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes # 128 MB
> > > > 134217728
> > > > [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
> > > > Killed
> > > > [idownes@hostname tmp]$ ls -lah lotsazeros
> > > > -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
> > > >
> > > > You can also look in /proc/vmstat at nr_dirty to see how many dirty
> > > > pages there are (system wide). If you wrote at a rate sustainable by
> > > > your disk subsystem then you would see a sawtooth pattern _/|_/| ...
> > > > (use something like watch) as the cgroup approached its limit and
> > > > the kernel flushed dirty pages to bring it down.
> > > >
> > > > This might be an interesting read:
> > > > http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
> > > >
> > > > Hope this helps! Please do let us know if you're seeing this on a
> > > > kernel >= 3.10, otherwise it's likely this is a kernel issue rather
> > > > than something with Mesos.
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > > On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > We're running Mesos 0.18.0 with cgroups isolation, and have run
> > > > > into situations where lots of file I/O causes tasks to be killed
> > > > > due to exceeding memory limits. Here's an example:
> > > > > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
> > > > >
> > > > > We were under the impression that if cache was using a lot of
> > > > > memory it would be reclaimed *before* the OOM process decides to
> > > > > kill the task. Is this accurate? We also found MESOS-762 while
> > > > > trying to diagnose -- could this be a regression?
> > > > >
> > > > > Thanks,
> > > > > Tom

-- 
Cheers, 
Tim 
Freedom, Features, Friends, First -> Fedora 
https://fedoraproject.org/wiki/SIGs/bigdata 
