I found no such file in this case.
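
(For anyone hitting the same dead end: below is a minimal Python sketch of one way to scan the usual locations for the kernel's oom-killer lines. The file paths are the ones mentioned in this thread plus the dmesg ring buffer; they vary by distro, and none of this is from the thread itself.)

    import re
    import subprocess

    # Minimal sketch: look for kernel oom-killer lines in the places mentioned
    # in this thread (plus the dmesg ring buffer). Paths vary by distro;
    # nothing here is Mesos-specific.
    CANDIDATES = ["/var/log/kern.log", "/var/log/messages", "/var/log/syslog"]
    PATTERN = re.compile(r"oom-killer|killed as a result of limit", re.IGNORECASE)

    def scan(lines, source):
        for line in lines:
            if PATTERN.search(line):
                print(f"{source}: {line.rstrip()}")

    for path in CANDIDATES:
        try:
            with open(path, errors="replace") as f:
                scan(f, path)
        except OSError:
            pass  # the file may simply not exist on this distro

    # The ring buffer may still hold the event even after syslog rotation.
    try:
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        scan(out.splitlines(), "dmesg")
    except OSError:
        pass  # dmesg may be missing or restricted
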
On Wed, Nov 12, 2014 at 8:53 PM, Benjamin Mahler <[email protected]> wrote:

> I find the OOM logging from the kernel in /var/log/kern.log.
>
> On Wed, Nov 12, 2014 at 2:51 PM, Whitney Sorenson <[email protected]> wrote:
>
>> I missed the call-to-action here regarding adding logs. I have some logs
>> from a recent occurrence (this seems to happen quite frequently).
>>
>> However, in this case I can't find a corresponding message anywhere on
>> the system that refers to a kernel OOM (is there a place to check besides
>> /var/log/messages or /var/log/dmesg?).
>>
>> One problem we have with sizing for JVM-based tasks is appropriately
>> estimating max thread counts.
>>
>> https://gist.github.com/wsorenson/d2e49b96e84af86c9492
>>
>> On Fri, Sep 12, 2014 at 9:12 PM, Benjamin Mahler <[email protected]> wrote:
>>
>>> +Ian
>>>
>>> Sorry for the delay. When your cgroup OOMs, a few things will occur:
>>>
>>> (1) The kernel will notify mesos-slave about the OOM event.
>>> (2) The kernel's OOM killer will pick a process in your cgroup to kill.
>>> (3) Once notified, mesos-slave will begin destroying the cgroup.
>>> (4) Once the executor terminates, any tasks that were non-terminal on the
>>> executor will have status updates sent with the OOM message.
>>>
>>> This does not all happen atomically, so it is possible that the kernel
>>> kills your task process and your executor sends a status update before
>>> the slave completes the destruction of the cgroup.
>>>
>>> Userspace OOM handling is supported, and we tried using it in the past,
>>> but it is not reliable:
>>>
>>> https://issues.apache.org/jira/browse/MESOS-662
>>> http://lwn.net/Articles/317814/
>>> http://lwn.net/Articles/552789/
>>> http://lwn.net/Articles/590960/
>>> http://lwn.net/Articles/591990/
>>>
>>> Since you have the luxury of avoiding the OOM killer (JVM flags w/
>>> padding), I would recommend leveraging that for now.
>>>
>>> Do you have the logs for your issue? My guess is that it took time for us
>>> to destroy the cgroup (possibly due to freezer issues), and so there was
>>> plenty of time for your executor to send the status update to the slave.
>>>
>>> On Sat, Sep 6, 2014 at 6:56 AM, Whitney Sorenson <[email protected]> wrote:
>>>
>>>> We already pad the JVM and make room for our executor, and we try to
>>>> get users to give the correct allowances.
>>>>
>>>> However, to be fair, your answer to my question about how Mesos is
>>>> handling OOMs is to suggest we avoid them. I think we're always going to
>>>> experience some cgroup OOMs, and we'd be better off if we had a
>>>> consistent way of handling them.
>>>>
>>>> On Fri, Sep 5, 2014 at 3:20 PM, Tomas Barton <[email protected]> wrote:
>>>>
>>>>> There is some overhead for the JVM itself, which should be added to the
>>>>> task's total memory usage. So you can't give the task the same amount
>>>>> of memory as you pass to java via the -Xmx parameter.
>>>>>
>>>>> On 2 September 2014 20:43, Benjamin Mahler <[email protected]> wrote:
>>>>>
>>>>>> Looks like you're using the JVM; can you set all of your JVM flags to
>>>>>> limit the memory consumption? This would favor an OutOfMemoryError
>>>>>> instead of OOMing the cgroup.
>>>>>>
>>>>>> On Thu, Aug 28, 2014 at 5:51 AM, Whitney Sorenson <[email protected]> wrote:
>>>>>>
>>>>>>> Recently, I've seen at least one case where a process inside of a
>>>>>>> task inside of a cgroup exceeded memory limits and the process was
>>>>>>> killed directly.
>>>>>>> The executor recognized the process was killed and sent a
>>>>>>> TASK_FAILED. However, it seems far more common to see the executor
>>>>>>> process itself destroyed, with the mesos slave (I'm making some
>>>>>>> assumptions here about how it all works) sending a TASK_FAILED that
>>>>>>> includes information about the memory usage.
>>>>>>>
>>>>>>> Is there something we can do to make this behavior more consistent?
>>>>>>>
>>>>>>> Alternatively, can we provide some functionality to hook into so we
>>>>>>> don't need to duplicate the work of the mesos slave in order to
>>>>>>> provide the same information in the TASK_FAILED message? I think users
>>>>>>> would like to know definitively that the task OOM'd, whereas in the
>>>>>>> case where the underlying task process is killed directly, it may take
>>>>>>> a lot of digging to find the underlying cause if you aren't looking
>>>>>>> for it.
>>>>>>>
>>>>>>> -Whitney
>>>>>>>
>>>>>>> Here are relevant lines from messages in case something else is
>>>>>>> amiss:
>>>>>>>
>>>>>>> Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067321] Task in
>>>>>>> /mesos/2dda5398-6aa6-49bb-8904-37548eae837e killed as a result of
>>>>>>> limit of /mesos/2dda5398-6aa6-49bb-8904-37548eae837e
>>>>>>> Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067334] memory:
>>>>>>> usage 917420kB, limit 917504kB, failcnt 106672
>>>>>>> Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.066947] java7
>>>>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
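
Since the practical advice in the thread above is to pad the JVM so that its own limits trip before the cgroup's does, here is a rough Python sketch of that sizing arithmetic. All of the numbers and the 1 MB default stack size are illustrative assumptions, not values from the thread (other than noting that the 917504 kB limit in the log is exactly 896 MiB):

    # Rough sizing sketch for a JVM task under a Mesos memory limit: keep
    # heap + thread stacks + JVM/native overhead + executor below the cgroup
    # limit, so the JVM fails with OutOfMemoryError (or a thread-creation
    # error) before the kernel OOM killer fires. All numbers are assumptions.
    task_mem_limit_mb   = 896   # cgroup limit granted by Mesos (917504 kB)
    executor_padding_mb = 128   # room for the executor process itself
    max_threads         = 256   # the hard part: estimate the peak thread count
    thread_stack_mb     = 1     # -Xss; HotSpot's 64-bit default is typically 1 MB
    jvm_overhead_mb     = 128   # metaspace, code cache, GC structures, native buffers

    heap_mb = (task_mem_limit_mb - executor_padding_mb
               - max_threads * thread_stack_mb - jvm_overhead_mb)
    assert heap_mb > 0, "limit too small for this thread-count/overhead estimate"

    print(f"java -Xss{thread_stack_mb}m -Xmx{heap_mb}m -Xms{heap_mb}m ...")

The hard part, as Whitney notes above, is max_threads; undercounting it shifts the failure mode from a clean OutOfMemoryError back to a cgroup OOM.
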
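For what step (1) in Ben's list refers to: under cgroup v1, the slave learns about the OOM through an eventfd registered against the cgroup's memory.oom_control file. A minimal sketch of that kernel mechanism follows; it is not Mesos's actual code, and the cgroup path and the Python 3.10+ os.eventfd call are assumptions:

    import os

    # Minimal sketch of the cgroup-v1 OOM notification mechanism; not Mesos's
    # code. Assumes Linux with the v1 memory controller mounted at the path
    # below, sufficient privileges, and Python 3.10+ for os.eventfd. The
    # cgroup name is taken from the log lines above.
    cgroup = "/sys/fs/cgroup/memory/mesos/2dda5398-6aa6-49bb-8904-37548eae837e"

    efd = os.eventfd(0)  # counter fd the kernel bumps when the OOM event fires
    oom_fd = os.open(os.path.join(cgroup, "memory.oom_control"), os.O_RDONLY)

    # Register the pair "<eventfd> <memory.oom_control fd>" with the cgroup.
    with open(os.path.join(cgroup, "cgroup.event_control"), "w") as f:
        f.write(f"{efd} {oom_fd}")

    os.read(efd, 8)  # blocks until the cgroup hits its limit and the OOM killer runs
    print("OOM event in", cgroup)

Because this notification, the kernel's OOM kill, and the cgroup destruction are separate events, the non-atomic behavior Ben describes above falls out naturally.
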

