Created: [MESOS-2105] Reliably report OOM even if the executor exits normally
https://issues.apache.org/jira/browse/MESOS-2105

On Thu, Nov 13, 2014 at 12:07 PM, Whitney Sorenson <[email protected]> wrote:

Yeah, I think so. Ultimately, what my users and I are looking for is consistency in the reporting of TASK_FAILED when an OOM is involved. If any OOM happens, I'd rather the entire process tree always be taken out and that it be reliably reported as such.

On Thu, Nov 13, 2014 at 1:03 PM, Ian Downes <[email protected]> wrote:

In reply to your original issue:

It is possible to influence the kernel OOM killer in its decision on which process to kill to free memory. An OOM score is computed for each process; it depends on age (tends to kill the shortest-living) and usage (tends to kill the larger memory users), i.e., it generally favors killing something other than the executor. This score could be adjusted to more strongly prefer not killing the executor by setting an OOM adjustment. See https://issues.apache.org/jira/browse/MESOS-416, which discusses this setting for the master and slave.

We could then check for an OOM, even if the executor exits 0, and report accordingly. Does that address your original question?

Ian
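For reference, the OOM adjustment Ian mentions is exposed through /proc/<pid>/oom_score_adj, which accepts values from -1000 (never kill) to 1000; lower values make the kernel's OOM killer less likely to pick that process. Below is a minimal sketch, illustrative only and not the actual Mesos change, of an executor biasing the killer away from itself at startup. Note that lowering the score below its current value requires CAP_SYS_RESOURCE or root.

    // Illustrative sketch only; not Mesos code.
    // Lower this process's OOM score so the kernel OOM killer prefers
    // to kill the task's child processes rather than the executor.
    #include <fstream>

    int main() {
      // -1000 would exempt the process entirely; -500 merely biases the
      // killer away from it. Lowering the value needs CAP_SYS_RESOURCE.
      std::ofstream adj("/proc/self/oom_score_adj");
      if (!adj.is_open()) {
        return 1;  // e.g. /proc not mounted, or not running on Linux.
      }
      adj << -500 << std::flush;
      return adj.good() ? 0 : 1;
    }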
On Thu, Nov 13, 2014 at 5:29 AM, Whitney Sorenson <[email protected]> wrote:

I found no such file in this case.

On Wed, Nov 12, 2014 at 8:53 PM, Benjamin Mahler <[email protected]> wrote:

I find the OOM logging from the kernel in /var/log/kern.log.

On Wed, Nov 12, 2014 at 2:51 PM, Whitney Sorenson <[email protected]> wrote:

I missed the call-to-action here regarding adding logs. I have some logs from a recent occurrence (this seems to happen quite frequently).

However, in this case I can't find a corresponding message anywhere on the system that refers to a kernel OOM. (Is there a place to check besides /var/log/messages or /var/log/dmesg?)

One problem we have with sizing for JVM-based tasks is appropriately estimating max thread counts.

https://gist.github.com/wsorenson/d2e49b96e84af86c9492

On Fri, Sep 12, 2014 at 9:12 PM, Benjamin Mahler <[email protected]> wrote:

+Ian

Sorry for the delay. When your cgroup OOMs, a few things will occur:

(1) The kernel will notify mesos-slave about the OOM event.
(2) The kernel's OOM killer will pick a process in your cgroup to kill.
(3) Once notified, mesos-slave will begin destroying the cgroup.
(4) Once the executor terminates, any tasks that were non-terminal on the executor will have status updates sent with the OOM message.

This does not all happen atomically, so it is possible that the kernel kills your task process and your executor sends a status update before the slave completes the destruction of the cgroup.

Userspace OOM handling is supported, and we tried using it in the past, but it is not reliable:

https://issues.apache.org/jira/browse/MESOS-662
http://lwn.net/Articles/317814/
http://lwn.net/Articles/552789/
http://lwn.net/Articles/590960/
http://lwn.net/Articles/591990/

Since you have the luxury of avoiding the OOM killer (JVM flags with padding), I would recommend leveraging that for now.

Do you have the logs for your issue? My guess is that it took time for us to destroy the cgroup (possibly due to freezer issues), and so there was plenty of time for your executor to send the status update to the slave.

On Sat, Sep 6, 2014 at 6:56 AM, Whitney Sorenson <[email protected]> wrote:

We already pad the JVM and make room for our executor, and we try to get users to give the correct allowances.

However, to be fair, your answer to my question about how Mesos is handling OOMs is to suggest we avoid them. I think we're always going to experience some cgroup OOMs, and we'd be better off if we had a consistent way of handling them.

On Fri, Sep 5, 2014 at 3:20 PM, Tomas Barton <[email protected]> wrote:

There is some overhead for the JVM itself, which should be added to the total memory usage of the task. So you can't give the task the same amount of memory as you pass to java's -Xmx parameter.

On 2 September 2014 20:43, Benjamin Mahler <[email protected]> wrote:

It looks like you're using the JVM; can you set all of your JVM flags to limit the memory consumption? This would favor an OutOfMemoryError instead of OOMing the cgroup.

On Thu, Aug 28, 2014 at 5:51 AM, Whitney Sorenson <[email protected]> wrote:

Recently, I've seen at least one case where a process inside of a task inside of a cgroup exceeded memory limits and the process was killed directly. The executor recognized that the process was killed and sent a TASK_FAILED. However, it seems far more common to see the executor process itself destroyed and the mesos-slave (I'm making some assumptions here about how it all works) send a TASK_FAILED which includes information about the memory usage.

Is there something we can do to make this behavior more consistent?

Alternatively, can we provide some functionality to hook into so we don't need to duplicate the work of the mesos-slave in order to provide the same information in the TASK_FAILED message? I think users would like to know definitively that the task OOM'd, whereas in the case where the underlying task is killed, it may take a lot of digging to find the underlying cause if you aren't looking for it.

-Whitney

Here are the relevant lines from /var/log/messages in case something else is amiss:

Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067321] Task in /mesos/2dda5398-6aa6-49bb-8904-37548eae837e killed as a result of limit of /mesos/2dda5398-6aa6-49bb-8904-37548eae837e
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067334] memory: usage 917420kB, limit 917504kB, failcnt 106672
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.066947] java7 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
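Ian's point about checking for an OOM even when the executor exits 0 lines up with the counters the kernel already keeps on the cgroup, including the failcnt that appears in the log lines above. A rough sketch of such a post-exit check follows, assuming cgroup v1 mounted at /sys/fs/cgroup/memory and a hypothetical cgroup path; failcnt only records how often usage hit the limit, so the oom_kill counter that newer kernels expose in memory.oom_control is the stronger signal.

    // Illustrative sketch only: after the executor exits, inspect the
    // task's memory cgroup to decide whether an OOM occurred, so it can
    // be reported even on a clean exit code. The cgroup path layout is
    // hypothetical and depends on how the slave's cgroups are configured.
    #include <fstream>
    #include <iostream>
    #include <string>

    bool oomOccurred(const std::string& cgroup) {
      const std::string base = "/sys/fs/cgroup/memory" + cgroup;

      // memory.failcnt counts every time usage hit the limit. It is a
      // weak signal (reclaim can resolve it without a kill) but cheap.
      long failcnt = 0;
      std::ifstream failFile(base + "/memory.failcnt");
      failFile >> failcnt;

      // Newer kernels also report an oom_kill counter in
      // memory.oom_control, which is the definitive signal.
      long oomKills = 0;
      std::ifstream control(base + "/memory.oom_control");
      std::string key;
      long value;
      while (control >> key >> value) {
        if (key == "oom_kill") {
          oomKills = value;
        }
      }

      return oomKills > 0 || failcnt > 0;
    }

    int main() {
      std::cout << std::boolalpha
                << oomOccurred("/mesos/2dda5398-6aa6-49bb-8904-37548eae837e")
                << std::endl;
      return 0;
    }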

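Benjamin's step (1), the kernel notifying mesos-slave about the OOM event, is the cgroup v1 memory controller's eventfd-based notification. The stripped-down sketch below shows the mechanism itself, not the slave's actual code: register an eventfd against memory.oom_control through cgroup.event_control, then block on a read that completes when the cgroup OOMs.

    // Illustrative sketch of cgroup v1 OOM notification (not Mesos code).
    // Usage: pass the memory cgroup directory of a running container,
    // e.g. /sys/fs/cgroup/memory/mesos/<container-id>.
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(int argc, char** argv) {
      if (argc != 2) {
        fprintf(stderr, "usage: %s <memory cgroup directory>\n", argv[0]);
        return 1;
      }

      char path[512];
      int efd = eventfd(0, 0);

      snprintf(path, sizeof(path), "%s/memory.oom_control", argv[1]);
      int ofd = open(path, O_RDONLY);

      snprintf(path, sizeof(path), "%s/cgroup.event_control", argv[1]);
      int cfd = open(path, O_WRONLY);

      if (efd < 0 || ofd < 0 || cfd < 0) {
        perror("setup");
        return 1;
      }

      // Registration format is "<eventfd> <fd of memory.oom_control>".
      char reg[64];
      int n = snprintf(reg, sizeof(reg), "%d %d", efd, ofd);
      if (write(cfd, reg, n) < 0) {
        perror("register");
        return 1;
      }

      uint64_t count = 0;
      read(efd, &count, sizeof(count));  // Blocks until the cgroup OOMs.
      printf("OOM event observed (count=%llu)\n", (unsigned long long)count);
      return 0;
    }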

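Finally, on the sizing question that runs through the thread (Tomas's JVM overhead and Whitney's thread-count estimates): the heap set by -Xmx is only part of what the cgroup accounts for; thread stacks (-Xss per thread), metaspace/permgen, direct buffers, and the executor itself all count against the same limit. A back-of-the-envelope sketch with purely hypothetical numbers:

    // Back-of-the-envelope cgroup sizing for a JVM task; all numbers
    // are hypothetical placeholders, not recommendations.
    #include <iostream>

    int main() {
      const long MB = 1024 * 1024;

      long heap        = 512 * MB;  // -Xmx512m
      long threadStack = 1 * MB;    // -Xss1m (a common 64-bit default)
      long maxThreads  = 200;       // hard to estimate, per Whitney's point
      long metaspace   = 64 * MB;   // class metadata / permgen
      long direct      = 32 * MB;   // NIO direct buffers, JVM bookkeeping
      long executor    = 128 * MB;  // the executor shares the same cgroup

      long needed =
          heap + maxThreads * threadStack + metaspace + direct + executor;

      std::cout << "cgroup memory needed ~= " << needed / MB << " MB"
                << std::endl;
      // ~936 MB with these numbers, already above the 917504 kB (896 MB)
      // limit shown in the kernel log above, even though -Xmx is only 512 MB.
      return 0;
    }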