[
https://issues.apache.org/jira/browse/YARN-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhijie Shen updated YARN-906:
-----------------------------
Attachment: YARN-906.1.patch
One solution is not to cancel ContainerLaunch.call(), which seems to be safe.
Assume that the thread running ContainerLaunch.call() (scheduled by the
ExecutorService on LAUNCH_CONTAINER) and the thread running
ContainerLaunch.cleanupContainer() (executed on CLEANUP_CONTAINER as a result of
stopContainer()) are racing. In call():
{code}
// Check if the container is signalled to be killed.
if (!shouldLaunchContainer.compareAndSet(false, true)) {
  LOG.info("Container " + containerIdStr + " not launched as "
      + "cleanup already called");
  ret = ExitCode.TERMINATED.getExitCode();
}
{code}
And in cleanupContainer():
{code}
// launch flag will be set to true if process already launched
boolean alreadyLaunched = !shouldLaunchContainer.compareAndSet(false, true);
if (!alreadyLaunched) {
  LOG.info("Container " + containerIdStr + " not launched."
      + " No cleanup needed to be done");
  return;
}
{code}
Both threads check shouldLaunchContainer. If call() checks first, it will
go on to launch the container, and either CONTAINER_EXITED_WITH_SUCCESS or
CONTAINER_EXITED_WITH_FAILURE will be emitted. Whichever event it is, the
container will eventually move on to DONE. On the other side, cleanupContainer()
will not return early, and will clean up the container.
If cleanupContainer() checks first, it will return early without cleaning up the
container. On the other side, call() will result in
CONTAINER_KILLED_ON_REQUEST, which also moves the container towards DONE
(see the previous comments).
I still need to run local tests to confirm the above analysis.
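As a quick sanity check of the reasoning, the two orderings can be modelled in isolation. This is only a sketch under stated assumptions: the class LaunchCleanupRace and its methods tryLaunch(), cleanupNeeded() and describe() are hypothetical stand-ins, not NodeManager code, and the two checks run sequentially in a fixed order rather than on real threads. It only shows that a single compareAndSet(false, true) on a shared AtomicBoolean lets exactly one side win.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the shared flag; not the real ContainerLaunch class.
class LaunchCleanupRace {
    // Stand-in for ContainerLaunch.shouldLaunchContainer.
    private final AtomicBoolean shouldLaunchContainer = new AtomicBoolean(false);

    /** Models the check in call(): true if the launch may proceed. */
    boolean tryLaunch() {
        return shouldLaunchContainer.compareAndSet(false, true);
    }

    /** Models the check in cleanupContainer(): true if cleanup must proceed. */
    boolean cleanupNeeded() {
        // The CAS fails only if call() has already flipped the flag.
        boolean alreadyLaunched = !shouldLaunchContainer.compareAndSet(false, true);
        return alreadyLaunched;
    }

    /** Runs the two checks in the given order on a fresh flag. */
    static String describe(boolean callFirst) {
        LaunchCleanupRace race = new LaunchCleanupRace();
        boolean launched;
        boolean cleaned;
        if (callFirst) {
            launched = race.tryLaunch();
            cleaned = race.cleanupNeeded();
        } else {
            cleaned = race.cleanupNeeded();
            launched = race.tryLaunch();
        }
        return "launched=" + launched + ", cleanup=" + cleaned;
    }

    public static void main(String[] args) {
        System.out.println("call() first:             " + describe(true));
        System.out.println("cleanupContainer() first: " + describe(false));
    }
}
```

In this model, when call() goes first the container is launched and then cleaned up; when cleanupContainer() goes first neither the launch nor the cleanup happens. Both outcomes match the cases described above.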
> TestNMClient.testNMClientNoCleanupOnStop fails occasionally
> -----------------------------------------------------------
>
> Key: YARN-906
> URL: https://issues.apache.org/jira/browse/YARN-906
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zhijie Shen
> Assignee: Zhijie Shen
> Attachments: YARN-906.1.patch
>
>
> See
> https://builds.apache.org/job/PreCommit-YARN-Build/1435//testReport/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClientNoCleanupOnStop/