Created: [MESOS-2105] Reliably report OOM even if the executor exits normally
https://issues.apache.org/jira/browse/MESOS-2105

On Thu, Nov 13, 2014 at 12:07 PM, Whitney Sorenson <[email protected]> wrote:

Yeah, I think so. Ultimately, what my users and I are looking for is consistency in the reporting of TASK_FAILED when an OOM is involved. If any OOM happens, I'd rather the entire process tree always be taken out and that it be reliably reported as such.

On Thu, Nov 13, 2014 at 1:03 PM, Ian Downes <[email protected]> wrote:

In reply to your original issue:

It is possible to influence the kernel OOM killer in its decision on which process to kill to free memory. An OOM score is computed for each process; it depends on age (tends to kill the shortest-living) and usage (tends to kill the larger memory users), i.e., it generally favors killing something other than the executor. This score could be adjusted to more strongly prefer not killing the executor by setting an OOM adjustment. See https://issues.apache.org/jira/browse/MESOS-416, which discusses this setting for the master and slave.

We could then check for an OOM, even if the executor exits 0, and report accordingly. Does that address your original question?

Ian
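For reference, the OOM adjustment Ian mentions is exposed through /proc/<pid>/oom_score_adj, which accepts values from -1000 (never kill) to 1000; lower values make the kernel's OOM killer less likely to pick that process. Below is a minimal sketch, illustrative only and not the actual Mesos change, of an executor biasing the killer away from itself at startup. Note that lowering the score below its current value requires CAP_SYS_RESOURCE or root.

    // Illustrative sketch only; not Mesos code.
    // Lower this process's OOM score so the kernel OOM killer prefers
    // to kill the task's child processes rather than the executor.
    #include <fstream>

    int main() {
      // -1000 would exempt the process entirely; -500 merely biases the
      // killer away from it. Lowering the value needs CAP_SYS_RESOURCE.
      std::ofstream adj("/proc/self/oom_score_adj");
      if (!adj.is_open()) {
        return 1;  // e.g. /proc not mounted, or not running on Linux.
      }
      adj << -500 << std::flush;
      return adj.good() ? 0 : 1;
    }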
On Thu, Nov 13, 2014 at 5:29 AM, Whitney Sorenson <[email protected]> wrote:

I found no such file in this case.

On Wed, Nov 12, 2014 at 8:53 PM, Benjamin Mahler <[email protected]> wrote:

I find the OOM logging from the kernel in /var/log/kern.log.

On Wed, Nov 12, 2014 at 2:51 PM, Whitney Sorenson <[email protected]> wrote:

I missed the call-to-action here regarding adding logs. I have some logs from a recent occurrence (this seems to happen quite frequently).

However, in this case I can't find a corresponding message anywhere on the system that refers to a kernel OOM. (Is there a place to check besides /var/log/messages or /var/log/dmesg?)

One problem we have with sizing for JVM-based tasks is appropriately estimating max thread counts.

https://gist.github.com/wsorenson/d2e49b96e84af86c9492

On Fri, Sep 12, 2014 at 9:12 PM, Benjamin Mahler <[email protected]> wrote:

+Ian

Sorry for the delay. When your cgroup OOMs, a few things will occur:

(1) The kernel will notify mesos-slave about the OOM event.
(2) The kernel's OOM killer will pick a process in your cgroup to kill.
(3) Once notified, mesos-slave will begin destroying the cgroup.
(4) Once the executor terminates, any tasks that were non-terminal on the executor will have status updates sent with the OOM message.

This does not all happen atomically, so it is possible that the kernel kills your task process and your executor sends a status update before the slave completes the destruction of the cgroup.

Userspace OOM handling is supported, and we tried using it in the past, but it is not reliable:

https://issues.apache.org/jira/browse/MESOS-662
http://lwn.net/Articles/317814/
http://lwn.net/Articles/552789/
http://lwn.net/Articles/590960/
http://lwn.net/Articles/591990/

Since you have the luxury of avoiding the OOM killer (JVM flags with padding), I would recommend leveraging that for now.

Do you have the logs for your issue? My guess is that it took time for us to destroy the cgroup (possibly due to freezer issues), and so there was plenty of time for your executor to send the status update to the slave.

On Sat, Sep 6, 2014 at 6:56 AM, Whitney Sorenson <[email protected]> wrote:

We already pad the JVM and make room for our executor, and we try to get users to give the correct allowances.

However, to be fair, your answer to my question about how Mesos is handling OOMs is to suggest we avoid them. I think we're always going to experience some cgroup OOMs, and we'd be better off if we had a consistent way of handling them.

On Fri, Sep 5, 2014 at 3:20 PM, Tomas Barton <[email protected]> wrote:

There is some overhead for the JVM itself, which should be added to the total memory usage of the task. So you can't give the task the same amount of memory as you pass to java's -Xmx parameter.

On 2 September 2014 20:43, Benjamin Mahler <[email protected]> wrote:

It looks like you're using the JVM; can you set all of your JVM flags to limit the memory consumption? This would favor an OutOfMemoryError instead of OOMing the cgroup.

On Thu, Aug 28, 2014 at 5:51 AM, Whitney Sorenson <[email protected]> wrote:

Recently, I've seen at least one case where a process inside of a task inside of a cgroup exceeded memory limits and the process was killed directly. The executor recognized that the process was killed and sent a TASK_FAILED. However, it seems far more common to see the executor process itself destroyed and the mesos-slave (I'm making some assumptions here about how it all works) send a TASK_FAILED which includes information about the memory usage.

Is there something we can do to make this behavior more consistent?

Alternatively, can we provide some functionality to hook into so we don't need to duplicate the work of the mesos-slave in order to provide the same information in the TASK_FAILED message? I think users would like to know definitively that the task OOM'd, whereas in the case where the underlying task is killed, it may take a lot of digging to find the underlying cause if you aren't looking for it.

-Whitney

Here are the relevant lines from /var/log/messages in case something else is amiss:

Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067321] Task in /mesos/2dda5398-6aa6-49bb-8904-37548eae837e killed as a result of limit of /mesos/2dda5398-6aa6-49bb-8904-37548eae837e
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067334] memory: usage 917420kB, limit 917504kB, failcnt 106672
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.066947] java7 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
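Ian's point about checking for an OOM even when the executor exits 0 lines up with the counters the kernel already keeps on the cgroup, including the failcnt that appears in the log lines above. A rough sketch of such a post-exit check follows, assuming cgroup v1 mounted at /sys/fs/cgroup/memory and a hypothetical cgroup path; failcnt only records how often usage hit the limit, so the oom_kill counter that newer kernels expose in memory.oom_control is the stronger signal.

    // Illustrative sketch only: after the executor exits, inspect the
    // task's memory cgroup to decide whether an OOM occurred, so it can
    // be reported even on a clean exit code. The cgroup path layout is
    // hypothetical and depends on how the slave's cgroups are configured.
    #include <fstream>
    #include <iostream>
    #include <string>

    bool oomOccurred(const std::string& cgroup) {
      const std::string base = "/sys/fs/cgroup/memory" + cgroup;

      // memory.failcnt counts every time usage hit the limit. It is a
      // weak signal (reclaim can resolve it without a kill) but cheap.
      long failcnt = 0;
      std::ifstream failFile(base + "/memory.failcnt");
      failFile >> failcnt;

      // Newer kernels also report an oom_kill counter in
      // memory.oom_control, which is the definitive signal.
      long oomKills = 0;
      std::ifstream control(base + "/memory.oom_control");
      std::string key;
      long value;
      while (control >> key >> value) {
        if (key == "oom_kill") {
          oomKills = value;
        }
      }

      return oomKills > 0 || failcnt > 0;
    }

    int main() {
      std::cout << std::boolalpha
                << oomOccurred("/mesos/2dda5398-6aa6-49bb-8904-37548eae837e")
                << std::endl;
      return 0;
    }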

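Benjamin's step (1), the kernel notifying mesos-slave about the OOM event, is the cgroup v1 memory controller's eventfd-based notification. The stripped-down sketch below shows the mechanism itself, not the slave's actual code: register an eventfd against memory.oom_control through cgroup.event_control, then block on a read that completes when the cgroup OOMs.

    // Illustrative sketch of cgroup v1 OOM notification (not Mesos code).
    // Usage: pass the memory cgroup directory of a running container,
    // e.g. /sys/fs/cgroup/memory/mesos/<container-id>.
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(int argc, char** argv) {
      if (argc != 2) {
        fprintf(stderr, "usage: %s <memory cgroup directory>\n", argv[0]);
        return 1;
      }

      char path[512];
      int efd = eventfd(0, 0);

      snprintf(path, sizeof(path), "%s/memory.oom_control", argv[1]);
      int ofd = open(path, O_RDONLY);

      snprintf(path, sizeof(path), "%s/cgroup.event_control", argv[1]);
      int cfd = open(path, O_WRONLY);

      if (efd < 0 || ofd < 0 || cfd < 0) {
        perror("setup");
        return 1;
      }

      // Registration format is "<eventfd> <fd of memory.oom_control>".
      char reg[64];
      int n = snprintf(reg, sizeof(reg), "%d %d", efd, ofd);
      if (write(cfd, reg, n) < 0) {
        perror("register");
        return 1;
      }

      uint64_t count = 0;
      read(efd, &count, sizeof(count));  // Blocks until the cgroup OOMs.
      printf("OOM event observed (count=%llu)\n", (unsigned long long)count);
      return 0;
    }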

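Finally, on the sizing question that runs through the thread (Tomas's JVM overhead and Whitney's thread-count estimates): the heap set by -Xmx is only part of what the cgroup accounts for; thread stacks (-Xss per thread), metaspace/permgen, direct buffers, and the executor itself all count against the same limit. A back-of-the-envelope sketch with purely hypothetical numbers:

    // Back-of-the-envelope cgroup sizing for a JVM task; all numbers
    // are hypothetical placeholders, not recommendations.
    #include <iostream>

    int main() {
      const long MB = 1024 * 1024;

      long heap        = 512 * MB;  // -Xmx512m
      long threadStack = 1 * MB;    // -Xss1m (a common 64-bit default)
      long maxThreads  = 200;       // hard to estimate, per Whitney's point
      long metaspace   = 64 * MB;   // class metadata / permgen
      long direct      = 32 * MB;   // NIO direct buffers, JVM bookkeeping
      long executor    = 128 * MB;  // the executor shares the same cgroup

      long needed =
          heap + maxThreads * threadStack + metaspace + direct + executor;

      std::cout << "cgroup memory needed ~= " << needed / MB << " MB"
                << std::endl;
      // ~936 MB with these numbers, already above the 917504 kB (896 MB)
      // limit shown in the kernel log above, even though -Xmx is only 512 MB.
      return 0;
    }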