Purely from a user-expectation point of view, I am wondering whether such
"abuse" (overuse?) of I/O bandwidth should translate into the job's I/O
getting throttled, rather than manifesting as an OOM that kills the job.
I/O overuse turning into memory overuse seems like an implementation detail
(for lack of a better phrase) of an OS that uses caching. It's not as if the
job asked for its memory to be used up for I/O caching :-)
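
For what it's worth, cgroups does expose per-device I/O throttling through the
blkio controller, so in principle a containerizer could cap write bandwidth
instead of letting dirty pages pile up against the memory limit. A rough
sketch, assuming cgroup v1 mounted at /sys/fs/cgroup and a device with
major:minor 8:0 (the group name, device, and rate are made up for
illustration; as far as I know Mesos does not wire this up today):

  mkdir /sys/fs/cgroup/blkio/mytask
  # cap writes to ~10 MB/s on device 8:0
  echo "8:0 10485760" > /sys/fs/cgroup/blkio/mytask/blkio.throttle.write_bps_device
  echo $TASK_PID > /sys/fs/cgroup/blkio/mytask/cgroup.procs

The catch is that on these older kernels blkio throttling only really applies
to direct/synchronous I/O; buffered writeback largely bypasses it, which is
part of why the memory cgroup ends up absorbing the pressure instead.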

I do see that this isn't Mesos-specific, but rather a containerization
artifact that is inevitable in a shared-resource environment.

That said, specifying a memory size for jobs is not trivial in a shared-resource
environment. Conservative safety margins do help prevent OOMs, but they also
fragment resources and reduce utilization. In some cases they can even starve
jobs, if most available memory is tied up in that conservative buffer for
every job.
Another approach that could help, if feasible, is containers with elastic
boundaries (different from over-subscription), managed so that the sum of the
containers' actual usage stays <= total system resources. This helps when jobs
don't all hit their peak resource usage simultaneously.
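
One existing kernel mechanism that comes close to this is the memory cgroup's
soft limit: the hard limit (memory.limit_in_bytes) remains the kill threshold,
while the soft limit (memory.soft_limit_in_bytes) is what the kernel reclaims
toward when the whole machine is under memory pressure. A minimal sketch,
assuming cgroup v1 mounted at /sys/fs/cgroup (the group name and sizes are
made up for illustration):

  # allow bursting to 1 GB, but reclaim toward 256 MB under host memory pressure
  echo $((1024*1024*1024)) > /sys/fs/cgroup/memory/mytask/memory.limit_in_bytes
  echo $((256*1024*1024))  > /sys/fs/cgroup/memory/mytask/memory.soft_limit_in_bytes

Whether a scheduler could rely on this safely depends on how uncorrelated the
peak usage of the co-located jobs actually is.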


On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair <[email protected]> wrote:

> FWIW - there is a classic grid mantra that applies here: test your
> workflow at an upper bound, then over-provision to be safe.
>
> Mesos is no different than SGE, PBS, LSF, Condor, etc.
>
> Also, there is no hunting algorithm for "jobs"; that would have to live
> outside of Mesos itself, in some batch system built on top.
>
> Cheers,
> Tim
>
> ------------------------------
>
> *From: *"Thomas Petr" <[email protected]>
> *To: *"Ian Downes" <[email protected]>
> *Cc: *[email protected], "Eric Abbott" <[email protected]>
> *Sent: *Wednesday, June 18, 2014 9:36:51 AM
> *Subject: *Re: cgroups memory isolation
>
>
> Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
> kernel.
>
> I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
> some weird results. I initially gave the task 256 MB, and it never exceeded
> the memory allocation (I killed the task manually after 5 minutes when the
> file hit 50 GB). Then I noticed your example was 128 MB, so I resized and
> tried again. It exceeded memory
> <https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82> almost
> immediately. The next (replacement) task our framework started ran
> successfully and never exceeded memory. I watched nr_dirty and it
> fluctuated between 10,000 and 14,000 while the task was running. The slave
> host is a c3.xlarge in EC2, if it makes a difference.
>
> As Mesos users, we'd like an isolation strategy that isn't affected this
> much by the page cache -- it makes it harder for us to size things
> appropriately.
> Is it possible through Mesos or cgroups itself to make the page cache not
> count towards the total memory consumption? If the answer is no, do you
> think it'd be worth looking at using Docker for isolation instead?
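>
> For reference, the cgroup's memory.stat file breaks the charge down into
> cache vs. rss, so you can at least see how much of the reported usage is
> really just page cache. Something like this (the container cgroup path here
> is just illustrative):
>
>   grep -E '^(cache|rss|total_cache|total_rss) ' \
>     /sys/fs/cgroup/memory/mesos/<container-id>/memory.stat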
>
> -Tom
>
>
> On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <[email protected]> wrote:
>
>> Hello Thomas,
>>
>> Your impression is mostly correct: the kernel will *try* to reclaim
>> memory by writing out dirty pages before killing processes in a cgroup,
>> but if it's unable to reclaim sufficient pages within some interval (I
>> don't recall the exact value off-hand) then it will start killing things.
>>
>> We observed this on a 3.4 kernel, where we could overwhelm the disk
>> subsystem and trigger an OOM. Just how quickly this happens depends on
>> how fast you're writing compared to how fast your disk subsystem can
>> write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" run inside
>> a memory cgroup will fill the cache quickly, hit its limit, and get
>> OOM-killed. We were not able to reproduce this under 3.10 and 3.11
>> kernels. Which kernel are you using?
>>
>> Example: under 3.4:
>>
>> [idownes@hostname tmp]$ cat /proc/self/cgroup
>> 6:perf_event:/
>> 4:memory:/test
>> 3:freezer:/
>> 2:cpuacct:/
>> 1:cpu:/
>> [idownes@hostname tmp]$ cat
>> /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
>> 134217728
>> [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
>> Killed
>> [idownes@hostname tmp]$ ls -lah lotsazeros
>> -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
>>
>>
>> You can also look in /proc/vmstat at nr_dirty to see how many dirty
>> pages there are (system wide). If you wrote at a rate sustainable by
>> your disk subsystem then you would see a sawtooth pattern _/|_/| ...
>> (use something like watch) as the cgroup approached its limit and the
>> kernel flushed dirty pages to bring it down.
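>>
>> For example, something along these lines (just an illustration) will show
>> the counters updating once a second while dd runs:
>>
>>   watch -n1 'grep -E "nr_dirty|nr_writeback" /proc/vmstat'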
>>
>> This might be an interesting read:
>>
>> http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
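>>
>> (For context: vm.dirty_background_ratio controls when background writeback
>> starts, and vm.dirty_ratio controls when writers themselves get blocked to
>> flush. Purely to illustrate the knobs, not a value recommendation:
>>
>>   sysctl vm.dirty_background_ratio vm.dirty_ratio
>>   sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10
>>
>> Note these are system-wide settings, not per-cgroup.)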
>>
>> Hope this helps! Please do let us know if you're seeing this on a
>> kernel >= 3.10; otherwise it's likely a kernel issue rather than
>> something with Mesos.
>>
>> Thanks,
>> Ian
>>
>>
>> On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <[email protected]> wrote:
>> > Hello,
>> >
>> > We're running Mesos 0.18.0 with cgroups isolation, and have run into
>> > situations where lots of file I/O causes tasks to be killed due to
>> > exceeding memory limits. Here's an example:
>> > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>> >
>> > We were under the impression that if cache was using a lot of memory it
>> > would be reclaimed *before* the OOM killer decides to kill the task. Is
>> > this accurate? We also found MESOS-762 while trying to diagnose -- could
>> > this be a regression?
>> >
>> > Thanks,
>> > Tom
>>
>
>
>
>
> --
> Cheers,
> Tim
> Freedom, Features, Friends, First -> Fedora
> https://fedoraproject.org/wiki/SIGs/bigdata
>
