[
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101383#comment-15101383
]
Jun Gong commented on YARN-3998:
--------------------------------
[~vvasudev] Thanks for the detailed review and suggestions! I just attached a
new patch to address the problems above.
{quote}
In your implementation, the relaunched container will go through the
launchContainer call which will try to setup the container launch environment
again(like creating the local and log dirs, creating tokens, etc). Won't this
lead to FileAlreadyExistsException being thrown as part of the launchContainer
call? In addition, this also means that on a node with more than one local dir,
different attempts could get allocated to different local dirs. I wonder if
it's better to move the retry logic into the launchContainer function instead
of adding a new state transition?
{quote}
The reasons for adding a new state transition are as follows:
1. During the retry interval the container is not actually running, so it seems
more reasonable to keep it in the LOCALIZED state.
2. For NM restart, it is not enough to add retry logic only to
*ContainerLaunch#call()*. When the NM restarts, it calls
*RecoveredContainerLaunch#call*, so we would need to add retry logic there as
well; otherwise the container might exit with failure without being retried.
Adding a state transition keeps the logic clearer and avoids duplicated code.
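To illustrate the state-transition approach, here is a minimal sketch, not the code in the patch. The class, enum, and method names are hypothetical and the model is far simpler than the NM's real container state machine; it only shows a failed container returning to LOCALIZED while retries remain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; this is NOT the NodeManager's real ContainerImpl.
class RelaunchSketch {
    enum State { LOCALIZED, RUNNING, EXITED_WITH_SUCCESS, EXITED_WITH_FAILURE }

    static final class Container {
        State state = State.LOCALIZED;
        int remainingRetries;                       // assumed to come from 'retry-times'
        final List<State> history = new ArrayList<>();

        Container(int retryTimes) {
            this.remainingRetries = retryTimes;
            history.add(state);
        }

        private void moveTo(State next) {
            state = next;
            history.add(next);
        }

        // LOCALIZED -> RUNNING; used for the first launch and for every relaunch.
        void launch() {
            if (state != State.LOCALIZED) {
                throw new IllegalStateException("cannot launch from " + state);
            }
            moveTo(State.RUNNING);
        }

        // RUNNING -> EXITED_WITH_SUCCESS, back to LOCALIZED (retry), or EXITED_WITH_FAILURE.
        void onExit(int exitCode) {
            if (state != State.RUNNING) {
                throw new IllegalStateException("cannot exit from " + state);
            }
            if (exitCode == 0) {
                moveTo(State.EXITED_WITH_SUCCESS);
            } else if (remainingRetries > 0) {
                remainingRetries--;
                moveTo(State.LOCALIZED);            // wait here during the retry interval
            } else {
                moveTo(State.EXITED_WITH_FAILURE);
            }
        }
    }
}
```

Because the retry decision sits in the transition rather than in the launch call, a normal launch and a recovered launch would share it in this model.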
To avoid FileAlreadyExistsException, I added some code
(*cleanupContainerFilesForRelaunch*) to clean up the files (token file and
launch script). We also need to clean up the previous PID file, because the NM
tries to get the PID from this file when it restarts.
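A rough sketch of this kind of cleanup, using plain java.nio rather than the Hadoop file-system APIs, might look like the following. The class name, method name, and file names are assumptions for illustration only and are not taken from the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the patch's cleanupContainerFilesForRelaunch may differ.
class RelaunchCleanupSketch {
    // Assumed file names, used here only for illustration.
    static final String TOKENS_FILE = "container_tokens";
    static final String LAUNCH_SCRIPT = "launch_container.sh";

    // Deletes the files that a relaunch would recreate and returns how many
    // were actually removed. Everything else in workDir is left untouched,
    // so files the container already downloaded can be reused.
    static int cleanupForRelaunch(Path workDir, Path pidFile) throws IOException {
        Path[] stale = {
            workDir.resolve(TOKENS_FILE),
            workDir.resolve(LAUNCH_SCRIPT),
            pidFile
        };
        int deleted = 0;
        for (Path p : stale) {
            if (Files.deleteIfExists(p)) {   // no exception if the file is already gone
                deleted++;
            }
        }
        return deleted;
    }
}
```

Using deleteIfExists keeps the cleanup idempotent, so calling it again after a partially completed cleanup is harmless.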
To reuse the same container working directory and log directory, we would need
to record these paths and store them in the NMStateStore for the NM restart
case. Following [~vvasudev]'s suggestion, we instead use a simple heuristic: if
a good work directory with the container tokens file already exists, use that
directory; otherwise use a new one. That way we don’t need to worry about
storing the directories in the state store. However, the log directory has no
file like the tokens file, so we use the 'stdout' file as the marker instead.
We assume that 'stdout' exists in most containers' log directories.
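The heuristic can be sketched roughly as below. This is an illustration with plain java.nio and hypothetical names, not the patch's code; it treats a directory as "good" if it is writable and contains the marker file (the tokens file for the work directory, 'stdout' for the log directory).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper illustrating the marker-file heuristic.
class DirReuseSketch {
    // Returns the first candidate directory that is writable and already
    // contains markerFile; if none qualifies, returns the fallback (a new
    // directory chosen by the caller).
    static Path chooseDir(List<Path> candidateDirs, String markerFile, Path fallback) {
        for (Path dir : candidateDirs) {
            boolean hasMarker = Files.isRegularFile(dir.resolve(markerFile));
            if (hasMarker && Files.isWritable(dir)) {
                return dir;                 // reuse the directory of the previous attempt
            }
        }
        return fallback;                    // no previous attempt found
    }
}
```

A caller would pass the container's candidate directory under each local dir with the tokens file name as the marker, and the candidate directories under each log dir with "stdout" as the marker.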
> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch,
> YARN-3998.03.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM
> launches containers, it can specify the value. The NM will then re-launch the
> container 'retry-times' times when it fails to run (e.g. exit code is not 0).
> This saves a lot of time: it avoids container localization, and the RM does
> not need to re-schedule the container. Local files in the container's working
> directory are also left for re-use. (If the container has downloaded some big
> files, it does not need to re-download them when running again.)
> We find this useful in systems like Storm.