[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049223#comment-14049223
 ] 

Anubhav Dhoot commented on YARN-2175:
-------------------------------------

I should clarify: the AM can kill this container manually, but each AM would 
have to implement its own logic to detect when localization is taking too long 
and kill the container. Updating the description.
We can make this much simpler for administrators and AM writers by providing 
an automatic way to mitigate it. The NodeManager already knows the state of 
each container. Instead of a back and forth between the AM and the NM, it will 
be easier to let the NM handle this. We can start with a configurable timeout 
with a reasonable default. In the future we can add the ability for the AM to 
override it in the container request.
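
To make the idea concrete, here is a rough sketch of what NM-side tracking 
might look like. It is not taken from any patch: the class name, the property 
name yarn.nodemanager.localization.timeout-ms, and the 10 minute default are 
all hypothetical, and a plain Map stands in for the real Configuration object. 
The NM would record when a container enters localization and periodically ask 
which containers have exceeded the timeout so they can be killed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: records when containers start localizing and reports those past a timeout. */
class LocalizationTimeoutTracker {
    // Hypothetical property name and default; assumptions for illustration, not existing YARN settings.
    static final String TIMEOUT_KEY = "yarn.nodemanager.localization.timeout-ms";
    static final long DEFAULT_TIMEOUT_MS = 10 * 60 * 1000L;

    private final long timeoutMs;
    // containerId -> time (ms) at which localization started
    private final Map<String, Long> localizingSince = new ConcurrentHashMap<>();

    LocalizationTimeoutTracker(Map<String, String> conf) {
        String value = conf.get(TIMEOUT_KEY);
        this.timeoutMs = (value == null) ? DEFAULT_TIMEOUT_MS : Long.parseLong(value);
    }

    /** Called when a container enters the localizing state. */
    void localizationStarted(String containerId, long nowMs) {
        localizingSince.put(containerId, nowMs);
    }

    /** Called when localization completes or the container is cleaned up. */
    void localizationFinished(String containerId) {
        localizingSince.remove(containerId);
    }

    /** Returns and stops tracking the containers that have been localizing for longer than the timeout. */
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : localizingSince.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                // Remove only if the start time is unchanged, so a restarted entry is not dropped.
                if (localizingSince.remove(e.getKey(), e.getValue())) {
                    expired.add(e.getKey());
                }
            }
        }
        return expired;
    }
}
```

A periodic monitor thread in the NM could call expire(System.currentTimeMillis()) 
and kill each returned container. Passing the current time in as a parameter 
keeps the sketch easy to test without a real clock.
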
Lemme know what you guys think.

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-2175
>                 URL: https://issues.apache.org/jira/browse/YARN-2175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it is stuck in these 
> states. These delays may have nothing to do with the task itself and could be 
> an issue within the platform.
> Ideally there should be configurable time limits for the various container 
> states within the NodeManager. The RM does not care about most of these; they 
> matter only between the AM and the NM. We can start by making these global 
> configurable defaults, and in the future we can make it fancier by letting 
> the AM override them in the start container request. 
> This JIRA will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
