[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159891#comment-15159891
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
-----------------------------------------------

[~vvasudev], [~hex108]
bq. Unification with AM restart policies
My point was mainly about creating and reusing a common _policy-framework_, even 
if the actual policies cannot be entirely reused. We should seriously consider 
this instead of creating ad hoc APIs for custom hard-coded policies.

bq. We should also handle restarting of any containers that may have 
crashed during the NM reboot.
bq. The current version of the patch doesn't work with LCE because the NM can't 
clean up the launch container script and the tokens (they're owned by the user 
who submitted the job/nobody).
bq. Yes, it will be better to store the log dir and work dir if we aim to make 
it more accurate. I was thinking to make minimal changes for this feature.
I'm okay with creating separate JIRAs under YARN-3998 if you both prefer that, 
but we should treat some of the above as blockers for releasing this feature. 
Given that, does it make sense to work on this in a branch?

[~asuresh]
bq. Would it make sense to add a time dimension? Instead of just specifying a 
retry count, it might be better to specify something like: "Kill container if 
it restarts X times in Y seconds".
As I was saying, this is similar to AM restarts, where we support a sliding 
window of restarts so as to forget very old failures instead of 
accumulating them forever. This was added in YARN-611.
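
To make the sliding-window idea concrete, here is a rough sketch of "give up if 
the container fails more than X times within Y seconds". It is only an 
illustration, not the YARN-611 code or anything from the patch: the class name, 
the method name, and both parameters are hypothetical, and it assumes the caller 
passes in the failure time in seconds.

```python
from collections import deque


class SlidingWindowRetryPolicy:
    """Hypothetical policy: relaunch unless more than max_retries failures
    have happened within the last window_seconds."""

    def __init__(self, max_retries, window_seconds):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of failures still in the window

    def should_retry(self, now):
        """Record a failure at time `now`; return True if a relaunch is allowed."""
        # Forget failures that have slid out of the window.
        cutoff = now - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        self._failures.append(now)
        return len(self._failures) <= self.max_retries


# Usage: at most 2 relaunches per 10 seconds.
policy = SlidingWindowRetryPolicy(max_retries=2, window_seconds=10)
decisions = [policy.should_retry(t) for t in (0, 1, 2)]
# The third failure inside the same window is the one that is not retried.
```

A plain retry count is the special case where the window never expires, so the 
same policy object could cover both behaviours.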

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM 
> launches containers, it can specify the value. The NM will then re-launch the 
> container up to 'retry-times' times when it fails to run (e.g. the exit code 
> is not 0). This saves a lot of time: it avoids container localization, the RM 
> does not need to re-schedule the container, and local files in the container's 
> working directory are left for re-use. (If the container has downloaded some 
> big files, it does not need to re-download them when running again.) 
> We find this useful in systems like Storm.



