OOM not always detected by Mesos Slave

Whitney Sorenson Thu, 28 Aug 2014 05:52:51 -0700

Recently, I've seen at least one case where a process inside a task,
running in a cgroup, exceeded its memory limit and was killed directly.
The executor recognized that the process had been killed and sent a
TASK_FAILED. However, it seems far more common for the executor process
itself to be destroyed, after which the mesos slave (I'm making some
assumptions here about how it all works) sends a TASK_FAILED that includes
information about the memory usage.


Is there something we can do to make this behavior more consistent?

Alternatively, can we provide some functionality to hook into so we don't
need to duplicate the work of the mesos slave in order to provide the same
information in the TASK_FAILED message? I think users would like to know
definitively that the task OOM'd, whereas in the case where the underlying
task is killed it may take a lot of digging to find the underlying cause if
you aren't looking for it.

-Whitney

Here are relevant lines from messages in case something else is amiss:

Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067321] Task in /mesos/2dda5398-6aa6-49bb-8904-37548eae837e killed as a result of limit of /mesos/2dda5398-6aa6-49bb-8904-37548eae837e
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067334] memory: usage 917420kB, limit 917504kB, failcnt 106672
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.066947] java7 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
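
To illustrate the kind of duplicated work I mean, here is a minimal sketch of
how an executor might scan kernel messages like the ones above to decide
whether its container hit the cgroup memory limit. This is not how the mesos
slave does it; the function name find_cgroup_oom and the regular expressions
are hypothetical, written only against the message format shown here, and the
sketch assumes each kernel message sits on a single line, as it would in the
raw syslog file.

```python
import re

# Hypothetical patterns, derived only from the kernel messages quoted above.
KILLED_RE = re.compile(
    r"Task in (?P<cgroup>\S+) killed as a result of limit of (?P<limit_cgroup>\S+)"
)
USAGE_RE = re.compile(
    r"memory: usage (?P<usage_kb>\d+)kB, limit (?P<limit_kb>\d+)kB, "
    r"failcnt (?P<failcnt>\d+)"
)


def find_cgroup_oom(lines, container_id):
    """Return OOM details for container_id, or None if no kill was logged.

    Assumes one kernel message per line and that the memory usage line
    follows the "killed as a result of limit" line for the same event.
    """
    result = None
    for line in lines:
        killed = KILLED_RE.search(line)
        if killed and killed.group("limit_cgroup").endswith("/" + container_id):
            # The cgroup whose limit was hit belongs to our container.
            result = {"cgroup": killed.group("limit_cgroup")}
            continue
        usage = USAGE_RE.search(line)
        if usage and result is not None and "usage_kb" not in result:
            # Attach the first usage line seen after the kill message.
            result.update({k: int(v) for k, v in usage.groupdict().items()})
    return result
```

An executor could call this with the lines of the kernel log and its own
container ID, and attach the returned usage and limit to the TASK_FAILED
message when the result is not None. Having the slave expose this directly
would still be preferable to every executor parsing logs on its own.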
