[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037337#comment-14037337
 ] 

Jason Lowe commented on YARN-2175:
----------------------------------

I also wonder if there's been a regression, since at least in 0.23 containers 
that are localizing can be killed by the ApplicationMaster.  The MR AM does 
this when mapreduce.task.timeout triggers a kill of a task due to lack of 
progress.  The MR AM kills the container and that, in turn, causes the 
localizer to die because the NM tells the localizer to DIE during its next 
heartbeat.

However, if the localizer gets stuck and stops heartbeating, and the NM loses 
track of it because of the container kill, then it seems like we could leak a 
hung localizer process.
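
To make the two cases above concrete, here is a minimal sketch of the
heartbeat exchange. The class, its method names, and the explicit clock
arguments are all hypothetical and do not mirror the real NodeManager code.
It assumes the NM keeps tracking a localizer after its container is killed,
so that a localizer that never heartbeats again can still be found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not mirror the real NodeManager classes.
final class LocalizerTrackerSketch {
    enum Action { LIVE, DIE }

    private static final class Entry {
        volatile boolean killRequested;
        volatile long lastHeartbeatMillis;
        Entry(long nowMillis) { lastHeartbeatMillis = nowMillis; }
    }

    // Localizers the NM still tracks, keyed by a hypothetical localizer id.
    private final Map<String, Entry> localizers = new ConcurrentHashMap<>();

    void register(String id, long nowMillis) {
        localizers.put(id, new Entry(nowMillis));
    }

    // Container kill: flag the localizer but keep tracking it until it is gone.
    void containerKilled(String id) {
        Entry e = localizers.get(id);
        if (e != null) {
            e.killRequested = true;
        }
    }

    // Reply to a heartbeat. A localizer only learns it should die when it
    // heartbeats, so a hung localizer never receives DIE.
    Action heartbeat(String id, long nowMillis) {
        Entry e = localizers.get(id);
        if (e == null || e.killRequested) {
            localizers.remove(id);
            return Action.DIE;
        }
        e.lastHeartbeatMillis = nowMillis;
        return Action.LIVE;
    }

    // Localizers silent for longer than maxSilenceMillis: candidates for a
    // forced kill of the process instead of waiting for the next heartbeat.
    List<String> findHung(long nowMillis, long maxSilenceMillis) {
        List<String> hung = new ArrayList<>();
        for (Map.Entry<String, Entry> me : localizers.entrySet()) {
            if (nowMillis - me.getValue().lastHeartbeatMillis > maxSilenceMillis) {
                hung.add(me.getKey());
            }
        }
        return hung;
    }
}
```

If the entry were dropped at container-kill time instead, findHung() would
have nothing to report, which is the leak scenario described above.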

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-2175
>                 URL: https://issues.apache.org/jira/browse/YARN-2175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no way to kill a task if it's stuck in these states. These 
> may have nothing to do with the task itself and could be an issue within the 
> platform. 
> Ideally there should be configurable limits on how long a container can 
> spend in each state within the NodeManager. The RM does not care about most 
> of these; they matter only between the AM and the NM. We can start by making 
> these global configurable defaults, and in the future we can make it fancier 
> by letting the AM override them in the start container request.
> This JIRA will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.
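
The limit proposed in the description could look roughly like the sketch
below: a global default timeout, an optional per-request override from the AM,
and cancellation of the localization work once the limit expires. Everything
here is an assumption for illustration. The class, the default value, and the
override parameter are hypothetical, not existing YARN properties or APIs.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a localization time limit; not NodeManager code.
final class LocalizationTimeoutSketch {
    // Hypothetical global default; not an existing YARN property.
    static final long DEFAULT_TIMEOUT_MILLIS = 600_000L;

    // Use the AM-supplied override when present and positive, else the default.
    static long effectiveTimeout(Long amOverrideMillis) {
        if (amOverrideMillis != null && amOverrideMillis > 0) {
            return amOverrideMillis;
        }
        return DEFAULT_TIMEOUT_MILLIS;
    }

    // Returns true only if the localization finished successfully in time.
    static boolean runWithTimeout(Runnable localization, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(localization);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck localization
            return false;
        } catch (ExecutionException e) {
            return false; // the localization itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Interrupting a thread only helps if the work responds to interruption. A
localizer running as a separate process would need the process itself to be
killed, which this sketch does not attempt.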



--
This message was sent by Atlassian JIRA
(v6.2#6252)
