Vinod Kumar Vavilapalli commented on YARN-2175:

That is a reasonable proposal, but I'd like to see if there are any other bugs 
that are causing this to happen. Have we seen this in practice? If so, what is 
the underlying reason? Too big a resource? The source file-system is down? Or 
the NM has a bug? We should try to fix each individual problem with its own 
solution before we put in a band-aid. The band-aid may still be useful for any 
issues that we cannot address directly.

Contrast this with mapreduce.task.timeout. Arguably the config helped users 
time out their jobs, but in my experience it kept us from focusing on fixing 
point bugs that stayed hidden in the framework for a long time; it kind of 
hides the issues. It is still useful for those unmanageable and unsolvable 
bugs, but I'd rather first fix the point problems and then put in the band-aid.

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> -----------------------------------------------------------------------------------
>                 Key: YARN-2175
>                 URL: https://issues.apache.org/jira/browse/YARN-2175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it's stuck in these 
> states. These may have nothing to do with the task itself and could be an 
> issue within the platform.
> Ideally there should be configurable limits on the time spent in various 
> states within the NodeManager. The RM does not care about most of these, 
> and it's only between the AM and the NM. We can start by making these global 
> configurable defaults, and in future we can make it fancier by letting the AM 
> override them in the start container request. 
> This jira will be used to limit localization time and we can open others if 
> we feel we need to limit other operations.
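
For illustration only, the kind of global, configurable limit proposed in the
description could be sketched in plain Java as below. This is not NodeManager
code and does not use any YARN API: the class name, the runWithTimeout helper,
and the default value are all hypothetical, and the "localization step" is
just an arbitrary Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LocalizationTimeoutSketch {

    // Hypothetical global default (10 minutes); not an existing YARN setting.
    public static final long DEFAULT_LOCALIZATION_TIMEOUT_MS = 600_000L;

    /**
     * Runs a localization-like step on a worker thread and gives up after
     * timeoutMs. Returns the step's result, or throws TimeoutException if
     * the step did not finish in time.
     */
    public static <T> T runWithTimeout(Callable<T> step, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(step);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the stuck step so it does not linger.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A step that finishes quickly returns its result.
        System.out.println(runWithTimeout(() -> "localized", 1_000L));

        // A step that hangs is cut off once the limit is reached.
        try {
            runWithTimeout(() -> { Thread.sleep(5_000L); return "never"; }, 100L);
        } catch (TimeoutException e) {
            System.out.println("localization timed out");
        }
    }
}
```

Note that a limit like this only bounds how long a stuck step is waited on; it
does not say why the step was stuck, which is the concern raised in the
comment above.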
