[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050533#comment-14050533
 ] 

Anubhav Dhoot commented on YARN-2175:
-------------------------------------

We have seen this happen when the source file system had issues. Some jobs 
would intermittently take a long time to fail, then succeed on rerun because 
the jars were placed in a new distributed cache location. Without this 
timeout we have no lever to mitigate underlying HDFS/hardware issues in 
production until the root cause is identified and fixed. 
Also, compared with mapreduce.task.timeout, this is narrowly focused on a 
specific operation: localization. I would expect this timeout to default to 
a large value in production (say 30 min) and to be used only as a mitigation 
when an issue occurs in production.
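
To make the idea concrete, here is a minimal sketch of bounding a localization 
step with a timeout. It is not the NodeManager's actual code and not the 
patch for this issue: the class name, the method name, and the 30-minute 
default constant are assumptions made for illustration, and the localization 
work is stood in for by a plain Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: none of these names come from YARN itself.
public class LocalizationTimeoutSketch {

    // Assumed default, mirroring the "say 30 min" suggestion in the comment.
    static final long DEFAULT_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(30);

    /**
     * Runs the given localization work on a separate thread and waits at most
     * timeoutMs for it. Returns true if the work finished in time, false if it
     * was cancelled because the timeout expired. A failure inside the work
     * itself surfaces as an ExecutionException.
     */
    static boolean runWithTimeout(Callable<Void> localization, long timeoutMs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> future = executor.submit(localization);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupt the stuck work so the container can fail fast
            // instead of hanging on a bad source file system.
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would pass DEFAULT_TIMEOUT_MS in normal operation and a smaller 
value only while mitigating an incident. Note that cancellation relies on 
thread interruption, so work that ignores interrupts would not actually stop.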

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-2175
>                 URL: https://issues.apache.org/jira/browse/YARN-2175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it is stuck in these 
> states. These may have nothing to do with the task itself and could be an 
> issue within the platform.
> Ideally there should be configurable limits on the time spent in various 
> states within the NodeManager. The RM does not care about most of these; 
> they matter only between the AM and the NM. We can start by making these 
> global configurable defaults, and in the future we can make it fancier by 
> letting the AM override them in the start container request. 
> This jira will be used to limit localization time, and we can open others 
> if we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
