[
https://issues.apache.org/jira/browse/YARN-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710475#comment-13710475
]
Zhijie Shen commented on YARN-906:
----------------------------------
After further investigation, I found that the container wasn't even able to enter
the stage of cleaning up container resources, because CONTAINER_KILLED_ON_REQUEST
was never received. This event should be emitted in ContainerLaunch.call().
However, the execution of this method was not logged (it was logged in my local
log of a successful test run). Lacking CONTAINER_KILLED_ON_REQUEST, the
container was stuck at KILLING.
In detail, as mentioned in the previous comment, the container was stopped,
so it moved from LOCALIZED to KILLING, KillTransition was executed, and
CLEANUP_CONTAINER was handled by ContainersLauncher. Here's the relevant piece
of code:
{code}
if (rContainer != null
    && !rContainer.isDone()) {
  // Cancel the future so that it won't be launched
  // if it isn't already.
  rContainer.cancel(false);
}
{code}
This tries to cancel the execution of ContainerLaunch.call(), which was
scheduled when handling LAUNCH_CONTAINER. If ContainerLaunch.call() has not
started yet, it is canceled here and will never run. Therefore, the following
code in ContainerLaunch.call() will not be executed.
{code}
if (ret == ExitCode.FORCE_KILLED.getExitCode()
    || ret == ExitCode.TERMINATED.getExitCode()) {
  // If the process was killed, Send container_cleanedup_after_kill and
  // just break out of this method.
  dispatcher.getEventHandler().handle(
      new ContainerExitEvent(containerID,
          ContainerEventType.CONTAINER_KILLED_ON_REQUEST, ret,
          "Container exited with a non-zero exit code " + ret));
  return ret;
}
{code}
The container will then never receive CONTAINER_KILLED_ON_REQUEST to trigger
the next transition.
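The cancellation behavior described above can be reproduced outside the
NodeManager with plain java.util.concurrent. The following is only a minimal
sketch, not the actual ContainersLauncher code: the class name
CancelBeforeStart, the method callRanAfterCancel() and the eventEmitted flag
(standing in for the CONTAINER_KILLED_ON_REQUEST event) are all hypothetical
names chosen for illustration. It assumes the launch task is still queued
when cancel(false) is called.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelBeforeStart {

  // Returns true if the body of the queued task ran despite cancel(false).
  public static boolean callRanAfterCancel() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch blockerStarted = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    // Hypothetical stand-in for "CONTAINER_KILLED_ON_REQUEST was emitted".
    AtomicBoolean eventEmitted = new AtomicBoolean(false);
    try {
      // Occupy the only worker thread so the next task stays queued.
      pool.submit(() -> {
        blockerStarted.countDown();
        release.await();
        return null;
      });
      blockerStarted.await();

      // Stand-in for the scheduled ContainerLaunch.call().
      Future<Integer> launch = pool.submit(() -> {
        eventEmitted.set(true);
        return 0;
      });

      // Same guard as in the cleanup path: cancel if not done yet.
      if (!launch.isDone()) {
        launch.cancel(false);
      }

      // Let the worker proceed; the cancelled task is skipped.
      release.countDown();
      pool.shutdown();
      pool.awaitTermination(5, TimeUnit.SECONDS);
      return eventEmitted.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("call() ran: " + callRanAfterCancel());
  }
}
```

Because the task is cancelled before a worker picks it up, its body is
skipped entirely, so nothing inside it (here, setting eventEmitted) can
happen afterwards.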
I'll work on a patch to fix the problem.
> TestNMClient.testNMClientNoCleanupOnStop fails occasionally
> -----------------------------------------------------------
>
> Key: YARN-906
> URL: https://issues.apache.org/jira/browse/YARN-906
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zhijie Shen
> Assignee: Zhijie Shen
>
> See
> https://builds.apache.org/job/PreCommit-YARN-Build/1435//testReport/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClientNoCleanupOnStop/