[
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225716#comment-15225716
]
Varun Vasudev commented on YARN-3998:
-------------------------------------
Thanks for the patch [~hex108]
# {code}+ List<String> containerLocalDirs = new
ArrayList<>(localDirs.size());{code}
containerLocalDirs is created but never populated
# In ContainerRelaunch#call, there's a lot of duplicate code copied from
ContainerLaunch#call. Can you please refactor the code in ContainerLaunch#call
so that it can be re-used?
>From my analysis, these pieces can be moved into their own functions are
>called from ContainerLaunch#call and ContainerRelaunch#call
{code}
+ // CONTAINER_KILLED_ON_REQUEST should not be missed if the container
+ // is already at KILLING
+ if (container.getContainerState() == ContainerState.KILLING) {
+ dispatcher.getEventHandler().handle(
+ new ContainerExitEvent(containerID,
+ ContainerEventType.CONTAINER_KILLED_ON_REQUEST,
+ Shell.WINDOWS ?
ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode() :
+ ContainerExecutor.ExitCode.TERMINATED.getExitCode(),
+ "Container terminated before launch."));
+ return 0;
+ }
{code}
This can be moved into it's own function that returns a boolean that we can use
to get the same behaviour
{code}
+ localResources = container.getLocalizedResources();
+ if (localResources == null) {
+ throw RPCUtil.getRemoteException(
+ "Unable to get local resources when Container " + containerID +
+ " is at " + container.getContainerState());
+ }
{code}
This can be moved into it's own function that returns
{code}localResources{code} if it's not null and throws an exception otherwise.
{code}
+ List<String> logDirs = dirsHandler.getLogDirs();
+
+ List<String> containerLogDirs = new ArrayList<>();
+ String relativeContainerLogDir = ContainerLaunch
+ .getRelativeContainerLogDir(appIdStr, containerIdStr);
+ for(String logDir : logDirs) {
+ containerLogDirs.add(logDir + Path.SEPARATOR +
relativeContainerLogDir);
+ }
{code}
can be moved into it's own function that returns the populated List
{code}
+ StringBuilder diagnosticInfo =
+ new StringBuilder("Container exited with a non-zero exit code ");
+ diagnosticInfo.append(ret);
+ diagnosticInfo.append(". ");
+ if (ret == ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode()
+ || ret == ContainerExecutor.ExitCode.TERMINATED.getExitCode()) {
+ // If the process was killed, Send container_cleanedup_after_kill and
+ // just break out of this method.
+ dispatcher.getEventHandler().handle(
+ new ContainerExitEvent(containerID,
+ ContainerEventType.CONTAINER_KILLED_ON_REQUEST, ret,
+ diagnosticInfo.toString()));
+ return ret;
+ }
{code}
can be moved it's own function.
I suspect we can move
{code}
+ // LaunchContainer is a blocking call. We are here almost means the
+ // container is launched, so send out the event.
+ dispatcher.getEventHandler().handle(new ContainerEvent(
+ containerID,
+ ContainerEventType.CONTAINER_LAUNCHED));
+ context.getNMStateStore().storeContainerLaunched(containerID);
+
+ // Check if the container is signalled to be killed.
+ if (!shouldLaunchContainer.compareAndSet(false, true)) {
+ LOG.info("Container " + containerIdStr + " not launched as "
+ + "cleanup already called");
+ ret = ContainerExecutor.ExitCode.TERMINATED.getExitCode();
+ } else {
+ exec.activateContainer(containerID, pidFilePath);
+ ret = exec.launchContainer(new ContainerStartContext.Builder()
+ .setContainer(container)
+ .setLocalizedResources(localResources)
+ .setNmPrivateContainerScriptPath(nmPrivateContainerScriptPath)
+ .setNmPrivateTokensPath(nmPrivateTokensPath)
+ .setUser(container.getUser())
+ .setAppId(appIdStr)
+ .setContainerWorkDir(containerWorkDir)
+ .setLocalDirs(localDirs)
+ .setLogDirs(logDirs)
+ .setContainerLocalDirs(containerLocalDirs)
+ .setContainerLogDirs(containerLogDirs)
+ .build());
+ }
{code}
into its own function as well.
> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch,
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch,
> YARN-3998.06.patch, YARN-3998.07.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM
> launches containers, it could specify the value. Then NM will re-launch the
> container 'retry-times' times when it fails to run(e.g.exit code is not 0).
> It will save a lot of time. It avoids container localization. RM does not
> need to re-schedule the container. And local files in container's working
> directory will be left for re-use.(If container have downloaded some big
> files, it does not need to re-download them when running again.)
> We find it is useful in systems like Storm.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)