Hey Harry,
As Vinod said, the mesos-slave/agent will issue a status update about the
OOM condition, and this will be received by the framework's scheduler.
In the storm-mesos framework we just log these messages (see below), but you
might consider exposing them directly to the app owners:
Received status update:
{"task_id":"TASK_ID","slave_id":"20150806-001422-1801655306-5050-22041-S65","state":"TASK_FAILED",
"message":"Memory limit exceeded: Requested: 2200MB Maximum Used: 2200MB\n\nMEMORY STATISTICS: \n
cache 20480\nrss 1811943424\nmapped_file 0\npgpgin 8777434\npgpgout 8805691\nswap 96878592\n
inactive_anon 644186112\nactive_anon 1357594624\ninactive_file 20480\nactive_file 0\nunevictable 0\n
hierarchical_memory_limit 2306867200\nhierarchical_memsw_limit 9223372036854775807\n
total_cache 20480\ntotal_rss 1811943424\ntotal_mapped_file 0\ntotal_pgpgin 8777434\ntotal_pgpgout 8805691\n
total_swap 96878592\ntotal_inactive_anon 644186112\ntotal_active_anon 1355497472\n
total_inactive_file 20480\ntotal_active_file 0\ntotal_unevictable 0"}
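
If you want to act on this in the scheduler instead of scraping slave logs:
besides the message above, newer Mesos versions also set a machine-readable
reason on the status update. Roughly something like the following (just a
sketch against the Java bindings; the class name and notifyOwner() are made
up for illustration, this is not code from storm-mesos):

import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

/**
 * Hypothetical helper, invoked from a framework scheduler's
 * statusUpdate() callback.
 */
public class OomTaskReporter {

    /** Forward memory-limit failures to whoever owns the task. */
    public void handle(TaskStatus status) {
        if (status.getState() != TaskState.TASK_FAILED) {
            return;
        }

        // Newer Mesos versions set a machine-readable reason for cgroup OOM
        // kills; the message check is a fallback for older agents.
        boolean oomKilled =
            (status.hasReason()
                && status.getReason() == TaskStatus.Reason.REASON_CONTAINER_LIMITATION_MEMORY)
            || status.getMessage().startsWith("Memory limit exceeded");

        if (oomKilled) {
            // The message carries the full cgroup memory statistics shown
            // above, so pushing it to the app owner (mail, chat, a status
            // page, ...) is usually enough for them to act on.
            notifyOwner(status.getTaskId().getValue(), status.getMessage());
        }
    }

    private void notifyOwner(String taskId, String details) {
        // Placeholder: hook this into whatever alerting you already run.
        System.err.printf("Task %s exceeded its memory limit:%n%s%n", taskId, details);
    }
}
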
- Erik
On Fri, Feb 12, 2016 at 10:24 AM, Harry Metske <[email protected]>
wrote:
> David,
>
> That's exactly the scenario I am afraid of: developers specifying way too
> large memory requirements just to make sure their tasks don't get killed.
> Any suggestions on how to report this *why* to the developers? As far as I
> know, the only place you find the reason is in the logfile of the slave;
> the UI only tells you that the task failed, not why.
>
> (We could of course put some logfile monitoring in place to pick up these
> messages, but if there are better ways, we are always interested.)
>
> kind regards,
> Harry
>
>
> On 12 February 2016 at 15:08, David J. Palaitis <
> [email protected]> wrote:
>
>> In larger deployments, with many applications, you may not always be able
>> to expect good memory practices from app developers. We've found that
>> reporting *why* a job was killed, with details of container utilization, is
>> an effective way of helping app developers get better at memory management.
>>
>> The alternative, just having jobs die, incentivizes bad behavior. For
>> example, a hurried job owner may simply double the executor's memory,
>> trading slack for stability.
>>
>> On Fri, Feb 12, 2016 at 6:36 AM Harry Metske <[email protected]>
>> wrote:
>>
>>> We don't want to use Docker (yet) in this environment, so the
>>> DockerContainerizer is not an option.
>>> After thinking about it a bit longer, I tend to agree with Kamil and will
>>> let the problem be handled differently.
>>>
>>> Thanks for the amazing fast responses!
>>>
>>> kind regards,
>>> Harry
>>>
>>>
>>> On 12 February 2016 at 12:28, Kamil Chmielewski <[email protected]>
>>> wrote:
>>>
>>>> On Fri, Feb 12, 2016 at 6:12 PM, Harry Metske <[email protected]>
>>>> wrote:
>>>>
>>>>> Is there a specific reason why the slave does not first send a TERM
>>>>> signal, and if that does not help after a certain timeout, send a KILL
>>>>> signal?
>>>>> That would give us a chance to clean up consul registrations (and do
>>>>> other cleanup).
>>>>
>>>> First of all, it's wrong to try to handle the memory limit in your
>>>> app; things like this are outside of its scope. Your app can be lost
>>>> because of many different system or hardware failures that you just can't
>>>> catch. You need to let it crash and design your architecture with this in
>>>> mind.
>>>> Secondly, the Mesos SIGKILL is consistent with the Linux OOM killer, and
>>>> it does the right thing:
>>>> https://github.com/torvalds/linux/blob/4e5448a31d73d0e944b7adb9049438a09bc332cb/mm/oom_kill.c#L586
>>>>
>>>> Best regards,
>>>> Kamil
>>>>
>>>
>>>
>