[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159791#comment-15159791
 ] 

Arun Suresh commented on YARN-3998:
-----------------------------------

I was spending some time thinking about this.

Would it make sense to add a time dimension? Instead of just specifying a retry 
count, it might be better to specify something like:
"Kill container if it restarts X times in Y seconds".

This way, we can distinguish between transient and steady-state errors. Most 
unrecoverable errors are transient (something that pops up during startup, 
etc.): they cause the container to restart multiple times in quick succession, 
so the container should be permanently killed. A steady-state restart (e.g. a 
container hosting a web server going down due to a 500 error caused by some 
unusual user request) should not count toward the retry limit.
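
As a rough illustration of the "X times in Y seconds" rule, here is a minimal 
sliding-window sketch. The class name RestartWindowPolicy, its method, and its 
parameters are hypothetical and are not part of YARN or of the attached 
patches; it only assumes that the caller supplies a timestamp for each restart.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window restart policy; not an existing YARN class. */
class RestartWindowPolicy {
  private final int maxRestarts;    // X: restarts tolerated inside one window
  private final long windowMillis;  // Y: window length, in milliseconds
  private final Deque<Long> restartTimes = new ArrayDeque<>();

  RestartWindowPolicy(int maxRestarts, long windowMillis) {
    if (maxRestarts < 1 || windowMillis <= 0) {
      throw new IllegalArgumentException("maxRestarts >= 1 and windowMillis > 0 required");
    }
    this.maxRestarts = maxRestarts;
    this.windowMillis = windowMillis;
  }

  /**
   * Records a restart that happened at nowMillis.
   * Returns true if this is the X-th restart inside the last Y milliseconds,
   * i.e. the container should be killed instead of re-launched.
   */
  synchronized boolean recordRestart(long nowMillis) {
    // Forget restarts that have fallen out of the window.
    while (!restartTimes.isEmpty()
        && nowMillis - restartTimes.peekFirst() >= windowMillis) {
      restartTimes.pollFirst();
    }
    restartTimes.addLast(nowMillis);
    return restartTimes.size() >= maxRestarts;
  }
}
```

With X = 3 and Y = 10 seconds, three restarts one second apart trip the 
policy on the third restart, while restarts spaced a minute apart never do, 
because each older restart ages out of the window before the next one arrives.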

It might also make sense to extend this into a full-fledged container restart 
*Strategy*, similar to the supervisors found in systems written in Erlang 
(http://erlang.org/doc/design_principles/sup_princ.html) and in Scala's Akka 
(http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html).

Thoughts?

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM 
> launches containers, it could specify the value. The NM would then re-launch 
> the container up to 'retry-times' times when it fails to run (e.g. its exit 
> code is not 0). This saves a lot of time: it avoids container localization, 
> the RM does not need to re-schedule the container, and local files in the 
> container's working directory are left for re-use. (If the container has 
> downloaded some big files, it does not need to re-download them when running 
> again.) 
> We find this useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
