Hello one and all,

From my research, the most significant point in favor of Mesos
is using "containers" in lieu of a VM configuration [1].
I'd be curious to hear any points that illuminate this
issue. I guess the main point is that for Mesos to "be all it can be",
we're talking about "containers" on "bare metal"?



Also, "kernelshark" is available in debian and most major linux OS distros. It can be useful to track down all sorts of problems; ymmv.

curiously
James

[1] http://openstacksv.com/2014/09/02/make-no-small-plans/



On 09/26/14 10:45, Stephan Erb wrote:
@Tomas: I am currently only running a single slave in a VM. It uses the
isolator and the logs are clean.
@Tom: Thanks for the interesting hint! I will look into it.

Best Regards,
Stephan

On Fr 26 Sep 2014 16:53:22 CEST, Tom Arnfeld wrote:
I'm not sure if this is at all related to the issue you're seeing, but we
ran into this fun issue (or at least this seems to be the cause),
helpfully documented in this blog article:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html


TL;DR: the OOM killer got into an infinite loop, causing the CPU to
spin out of control on our VMs.

More details in this commit message to the OOM killer from earlier this
year:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0c740d0afc3bff0a097ad03a1c8df92757516f5c
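
If you want to check whether a given kernel predates that fix, one
option (assuming a local clone of the upstream tree) is:

    uname -r                                                      # running kernel, e.g. 3.2.60-1+deb7u3
    git tag --contains 0c740d0afc3bff0a097ad03a1c8df92757516f5c   # upstream tags that include the fix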


Hope this helps somewhat...

On 26 September 2014 14:15, Tomas Barton <barton.to...@gmail.com> wrote:

    Just to make sure, all slaves are running with:

    --isolation='cgroups/cpu,cgroups/mem'

    Is there something suspicious in mesos slave logs?
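
    For example, something along these lines on one of the slaves (the
    log location is an assumption; adjust it to your --log_dir):

        ps aux | grep mesos-slave | grep -o "isolation=[^ ]*"          # flag the slave actually started with
        grep -iE "oom|memory" /var/log/mesos/mesos-slave.INFO | tail   # recent memory-related log lines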

    On 26 September 2014 13:20, Stephan Erb
    <stephan....@blue-yonder.com> wrote:

        Hi everyone,

        I am having issues with the cgroups isolation of Mesos. It
        seems like tasks are prevented from allocating more memory
        than their limit. However, they are never killed.

          * My scheduled task allocates memory in a tight loop.
            According to 'ps', once it exceeds its memory limit it is
            not killed, but ends up in state D ("uninterruptible
            sleep (usually IO)").
          * The task is still considered running by Mesos.
          * There is no indication of an OOM in dmesg.
          * There is neither an OOM notice nor any other output
            related to the task in the slave log.
          * According to htop, the system load is increased, with a
            significant portion of CPU time spent within the kernel.
            Commonly the load is so high that all ZooKeeper
            connections time out.
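
        For reference, this is roughly how I have been poking at a
        stuck task (container id and pid are placeholders; the cgroup
        path depends on the slave's --cgroups_hierarchy/--cgroups_root):

            CG=/sys/fs/cgroup/memory/mesos/<container-id>   # per-container cgroup created by the slave
            cat $CG/memory.limit_in_bytes $CG/memory.usage_in_bytes $CG/memory.failcnt
            cat /proc/<pid>/stack                           # where a D-state task is blocked in the kernel
            dmesg | tail -n 50                              # double-check for OOM killer output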

        I am running Aurora and Mesos 0.20.1 with the cgroups
        isolation on Debian 7 (kernel 3.2.60-1+deb7u3).

        Sorry for the somewhat unspecific error description. Still,
        does anyone have an idea what might be wrong here?

        Thanks and Best Regards,
        Stephan





