[ 
https://issues.apache.org/jira/browse/YARN-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710475#comment-13710475
 ] 

Zhijie Shen commented on YARN-906:
----------------------------------

After further investigation, I found that the container was never able to enter 
the stage of cleaning up container resources, because CONTAINER_KILLED_ON_REQUEST 
was not received. This event should be emitted in ContainerLaunch.call(). 
However, the execution of this method was not logged (it was logged in my local 
log of a successful test run). Lacking CONTAINER_KILLED_ON_REQUEST, the 
container was stuck at KILLING.

In detail, as mentioned in the previous comment, the container was stopped, 
so it moved from LOCALIZED to KILLING, KillTransition was executed, and 
CLEANUP_CONTAINER was handled by ContainersLauncher. Here's the relevant code:
{code}
        if (rContainer != null 
            && !rContainer.isDone()) {
          // Cancel the future so that it won't be launched 
          // if it isn't already.
          rContainer.cancel(false);
        }
{code}
This tries to cancel the execution of ContainerLaunch.call(), which was 
scheduled when handling LAUNCH_CONTAINER. If ContainerLaunch.call() has not 
yet started, it is canceled here. Therefore, the following code in 
ContainerLaunch.call() will not be executed.
{code}
    if (ret == ExitCode.FORCE_KILLED.getExitCode()
        || ret == ExitCode.TERMINATED.getExitCode()) {
      // If the process was killed, Send container_cleanedup_after_kill and
      // just break out of this method.
      dispatcher.getEventHandler().handle(
            new ContainerExitEvent(containerID,
                ContainerEventType.CONTAINER_KILLED_ON_REQUEST, ret,
                "Container exited with a non-zero exit code " + ret));
      return ret;
    }
{code}
The container will then never receive CONTAINER_KILLED_ON_REQUEST to trigger 
the next transition.
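
The cancellation behavior described above can be reproduced outside of YARN 
with plain java.util.concurrent. This is only a minimal sketch, assuming a 
single-threaded executor stands in for the launcher's thread pool; the class 
and method names are hypothetical and are not NodeManager code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the launcher; not actual YARN code.
class CancelBeforeStartDemo {

    /** Returns true if the queued task's body ran despite cancel(false). */
    static boolean callRanAfterCancel() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean ran = new AtomicBoolean(false);
        try {
            // Occupy the only worker thread so the next task stays queued.
            pool.submit(() -> { release.await(); return null; });

            // Plays the role of ContainerLaunch.call(): the body is the only
            // place where the "killed" event would be emitted.
            Future<Integer> launch = pool.submit(() -> { ran.set(true); return 0; });

            // Same pattern as the cleanup path: cancel if not already done.
            if (!launch.isDone()) {
                launch.cancel(false);
            }

            // Let the worker proceed; the cancelled task is skipped.
            release.countDown();
        } finally {
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return ran.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("call() ran after cancel: " + callRanAfterCancel());
    }
}
```

Because the task is cancelled while still queued, its body never runs, so 
anything it would have emitted (here, setting the flag) never happens. That 
matches the missing log lines for ContainerLaunch.call() in the failed run.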

I'll work on a patch to fix the problem.
                
> TestNMClient.testNMClientNoCleanupOnStop fails occasionally
> -----------------------------------------------------------
>
>                 Key: YARN-906
>                 URL: https://issues.apache.org/jira/browse/YARN-906
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>
> See 
> https://builds.apache.org/job/PreCommit-YARN-Build/1435//testReport/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClientNoCleanupOnStop/
