[
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127732#comment-15127732
]
Varun Vasudev commented on YARN-3998:
-------------------------------------
My apologies for the slow response [~hex108]. Thank you for your patience.
Comments on the latest patch
1) The patch doesn't apply cleanly on trunk anymore
2) In ContainerLaunchContext.java, change "Retry strategy when container fails
to run.” to "Retry strategy when container exits with failure."
3) In yarn_protos.proto and ContainerRetryPolicy.java, rename ALWAYS_RETRY to
RETRY_ON_ALL_ERRORS
4) In ApplicationMaster.java and Client.java, rename "retry_policy" to
"container_retry_policy", "error_codes" to "container_retry_error_codes",
"max_retries" to "container_max_retries", and "retry_interval" to
"container_retry_interval"
5) In ContainerImpl.java, change
{code}
+ LOG.info("Relaunch Container " + container.getContainerId()
+ + ". Remain retry attempts : " + container.remainingRetryAttempts
+ + ". Retry interval is "
+ + container.containerRetryContext.getRetryInterval() + "ms");
+ container.isRelaunch = true;
{code}
to
{code}
+ LOG.info("Relaunching container " + container.getContainerId()
+ + ". Remaining retry attempts(after relaunch) : " +
container.remainingRetryAttempts
+ + ". Interval between retries is "
+ + container.containerRetryContext.getRetryInterval() + "ms");
+ container.isRelaunch = true;
{code}
6) In ContainerImpl.java, can we avoid creating a new thread if the interval is
0?
{code}
+ new Thread() {
+ @Override
+ public void run() {
+ try {
+ Thread.sleep(container.containerRetryContext.getRetryInterval());
+ container.sendLaunchEvent();
+ } catch (InterruptedException e) {
+ return;
+ }
+ }
+ }.start();
{code}
7) In ContainerLaunch.java, change "If container is in relaunch" to "If
container is being relaunched"
8) In ContainerLaunch.java, rename container1 to targetContainer
9) In ContainerLaunch.java, in getContainerWorkDir, we should use
dirsHandler.getLocalPathForWrite instead of dirsHandler.getLocalPathForRead. As
a result,
{code}
+ public Path getLocalPathForRead(String pathStr) throws IOException {
+ return getPathToRead(pathStr, getLocalDirsForRead());
+ }
{code}
is not required
10) In ContainerLaunch.java, in getContainerLogDir, instead of matching stdout
exactly, can we use FileSystem.globStatus and match *out instead?
11) In the current version of the patch, the diagnosticInfo String is unbounded
and there is no indication as to which attempt the diagnosticInfo came from.
Can you update the patch to
- put some seperator(like "Diagnostic message from attempt:") in the
diagnostic message
- limit the size of the diagnostic message to some upper bound
> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch,
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM
> launches containers, it could specify the value. Then NM will re-launch the
> container 'retry-times' times when it fails to run(e.g.exit code is not 0).
> It will save a lot of time. It avoids container localization. RM does not
> need to re-schedule the container. And local files in container's working
> directory will be left for re-use.(If container have downloaded some big
> files, it does not need to re-download them when running again.)
> We find it is useful in systems like Storm.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)