[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

Jun Gong (JIRA) Thu, 14 Jan 2016 23:45:08 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101383#comment-15101383
 ]


Jun Gong commented on YARN-3998:
--------------------------------

[~vvasudev] Thanks for the detailed review and suggestions! I have just 
attached a new patch that addresses the problems above.

{quote}
In your implementation, the relaunched container will go through the 
launchContainer call which will try to setup the container launch environment 
again(like creating the local and log dirs, creating tokens, etc). Won't this 
lead to FileAlreadyExistsException being thrown as part of the launchContainer 
call? In addition, this also means that on a node with more than one local dir, 
different attempts could get allocated to different local dirs. I wonder if 
it's better to move the retry logic into the launchContainer function instead 
of adding a new state transition?
{quote}
The reasons for adding a new state transition are as follows:
1. During the retry interval the container is not actually running, so it 
seems more reasonable to keep it in the LOCALIZED state.
2. For NM restart, it is not enough to add retry logic only to 
*ContainerLaunch#call()*. When the NM restarts it calls 
*RecoveredContainerLaunch#call*, so we would need retry logic there as well; 
otherwise the container might exit with a failure and never be retried. Adding 
a state transition makes the logic clearer and avoids duplicated code.
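
As a rough illustration of the two points above, here is a minimal sketch of a 
retry-aware state transition. It is not the NodeManager's actual container 
state machine: the class name RetryStateSketch, its methods, and the reduced 
set of states are assumptions made only for this example.

```java
// Hypothetical, simplified model of the idea; NOT the real NM ContainerImpl.
final class RetryStateSketch {
    enum State { LOCALIZED, RUNNING, EXITED_WITH_SUCCESS, EXITED_WITH_FAILURE }

    private State state = State.LOCALIZED;
    private int remainingRetries;

    RetryStateSketch(int retryTimes) {
        this.remainingRetries = retryTimes;
    }

    State getState() {
        return state;
    }

    // Called for a fresh launch and for a relaunch alike.
    void launch() {
        if (state != State.LOCALIZED) {
            throw new IllegalStateException("cannot launch from " + state);
        }
        state = State.RUNNING;
    }

    // Single place that decides whether to retry. Any launcher (normal or
    // recovered after a restart) only has to report the exit code here.
    void onExit(int exitCode) {
        if (state != State.RUNNING) {
            throw new IllegalStateException("cannot exit from " + state);
        }
        if (exitCode == 0) {
            state = State.EXITED_WITH_SUCCESS;
        } else if (remainingRetries > 0) {
            remainingRetries--;
            state = State.LOCALIZED; // not running while waiting to relaunch
        } else {
            state = State.EXITED_WITH_FAILURE;
        }
    }
}
```

Because the retry decision lives in the transition rather than in a launcher, 
neither launch path in this sketch would need its own copy of the retry loop.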

To avoid FileAlreadyExistsException, I added some code 
(*cleanupContainerFilesForRelaunch*) to clean up the files left by the 
previous attempt (the token file and the launch script). We also need to clean 
up the previous PID file, because the NM tries to read the PID from this file 
when it restarts.
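
The cleanup step can be pictured roughly as follows. This is only a sketch 
using java.nio.file, not the code in the patch: the class and method names are 
made up for illustration, and the caller is assumed to pass in the paths of 
the token file, launch script and PID file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper; names are assumptions, not taken from the patch.
final class RelaunchCleanupSketch {

    // Deletes each stale file if it is present and returns how many were
    // actually removed. Files that are already missing are skipped silently,
    // so the call is safe to repeat before every relaunch.
    static int cleanupForRelaunch(List<Path> staleFiles) throws IOException {
        int deleted = 0;
        for (Path file : staleFiles) {
            if (Files.deleteIfExists(file)) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

Using deleteIfExists rather than delete means a missing file (for example a 
PID file that was never written) does not abort the relaunch.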

To reuse the same container working directory and log directory, we would 
need to record these paths and store them in the NMStateStore for the NM 
restart case. Following [~vvasudev]'s suggestion, we use a simple heuristic 
instead: if a good work directory that contains the container tokens file 
already exists, use that directory; otherwise use a new one. That way we don't 
need to worry about storing the directories in the state store. However, the 
log directory has no file equivalent to the tokens file, so we use the 
'stdout' file as the marker there. We assume that 'stdout' exists in most 
containers' log directories.
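
The heuristic above amounts to a marker-file lookup, sketched below. The class 
and method names are hypothetical, and the sketch assumes the candidate list 
has already been filtered down to good (healthy) directories for this 
container.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper illustrating the marker-file heuristic.
final class DirReuseSketch {

    // Returns the first candidate directory that contains the marker file
    // (e.g. the tokens file for work dirs, 'stdout' for log dirs).
    static Optional<Path> findDirWithMarker(List<Path> candidateDirs,
                                            String markerFileName) {
        for (Path dir : candidateDirs) {
            if (Files.isRegularFile(dir.resolve(markerFileName))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    // Reuses a previous attempt's directory when one is found; otherwise
    // falls back to the freshly allocated directory supplied by the caller.
    static Path chooseDir(List<Path> candidateDirs, String markerFileName,
                          Path freshDir) {
        return findDirWithMarker(candidateDirs, markerFileName).orElse(freshDir);
    }
}
```

The same lookup would serve both cases by changing only the marker name. A 
container that never wrote 'stdout' would simply fall through to a new log 
directory.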

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM 
> launches containers, it can specify the value. The NM will then re-launch 
> the container up to 'retry-times' times when it fails to run (e.g. the exit 
> code is not 0). 
> This saves a lot of time: it avoids container localization, and the RM does 
> not need to re-schedule the container. Local files in the container's 
> working directory are also left for re-use. (If a container has downloaded 
> some big files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
