[ 
https://issues.apache.org/jira/browse/YARN-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-906:
-----------------------------

    Attachment: YARN-906.1.patch

One solution is not to cancel ContainerLaunch.call(), which appears to be 
safe. Assume that the thread running ContainerLaunch.call() (scheduled by the 
ExecutorService on LAUNCH_CONTAINER) and the thread running 
ContainerLaunch.cleanupContainer() (executed on CLEANUP_CONTAINER because of 
stopContainer()) are racing. In call():
{code}
      // Check if the container is signalled to be killed.
      if (!shouldLaunchContainer.compareAndSet(false, true)) {
        LOG.info("Container " + containerIdStr + " not launched as "
            + "cleanup already called");
        ret = ExitCode.TERMINATED.getExitCode();
      }
{code}
And in cleanupContainer():
{code}
    // launch flag will be set to true if process already launched
    boolean alreadyLaunched = !shouldLaunchContainer.compareAndSet(false, true);
    if (!alreadyLaunched) {
      LOG.info("Container " + containerIdStr + " not launched."
          + " No cleanup needed to be done");
      return;
    }
{code}
Both threads check shouldLaunchContainer. If call() gets there first, its 
compareAndSet succeeds and it goes on to launch the container. Either 
CONTAINER_EXITED_WITH_SUCCESS or CONTAINER_EXITED_WITH_FAILURE will be 
emitted, and whichever event it is, the container eventually moves on to DONE. 
On the other side, cleanupContainer() will not return early, and will clean up 
the container.

If cleanupContainer() gets there first, it returns early and does not clean up 
the container. On the other side, call() skips the launch and results in 
CONTAINER_KILLED_ON_REQUEST, which also moves the container towards DONE 
(see the previous comments).
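
The two orderings above can be reproduced in isolation with a small model of 
the shared flag. This is only a sketch, not the NodeManager code: the class 
LaunchFlagModel, its Outcome enum and the race() helper are hypothetical names 
introduced for illustration, and only the compareAndSet checks are taken from 
the snippets above. Event dispatch and the actual launch are not modelled.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for ContainerLaunch; only the shared flag is modelled. */
class LaunchFlagModel {
    /** Illustrative outcomes; these are not YARN event names. */
    enum Outcome { LAUNCHED, NOT_LAUNCHED_KILLED, CLEANED_UP, CLEANUP_SKIPPED }

    private final AtomicBoolean shouldLaunchContainer = new AtomicBoolean(false);

    /** Mirrors the check in call(): winning the CAS means the launch proceeds. */
    Outcome call() {
        if (!shouldLaunchContainer.compareAndSet(false, true)) {
            return Outcome.NOT_LAUNCHED_KILLED; // cleanup flipped the flag first
        }
        return Outcome.LAUNCHED;
    }

    /** Mirrors the check in cleanupContainer(): winning the CAS means nothing was launched. */
    Outcome cleanupContainer() {
        boolean alreadyLaunched = !shouldLaunchContainer.compareAndSet(false, true);
        return alreadyLaunched ? Outcome.CLEANED_UP : Outcome.CLEANUP_SKIPPED;
    }

    /** Races both methods on separate threads; returns {call outcome, cleanup outcome}. */
    static Outcome[] race() {
        LaunchFlagModel model = new LaunchFlagModel();
        Outcome[] result = new Outcome[2];
        CountDownLatch start = new CountDownLatch(1);
        Thread launcher = new Thread(() -> { awaitStart(start); result[0] = model.call(); });
        Thread cleaner = new Thread(() -> { awaitStart(start); result[1] = model.cleanupContainer(); });
        launcher.start();
        cleaner.start();
        start.countDown(); // release both threads at roughly the same time
        try {
            launcher.join(); // join() makes the writes to result visible here
            cleaner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for race", e);
        }
        return result;
    }

    private static void awaitStart(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because exactly one compareAndSet(false, true) can succeed, race() should only 
ever return the pair (LAUNCHED, CLEANED_UP) or the pair (NOT_LAUNCHED_KILLED, 
CLEANUP_SKIPPED), never a mix of the two.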

I still need to run local tests to confirm the above analysis.
                
> TestNMClient.testNMClientNoCleanupOnStop fails occasionally
> -----------------------------------------------------------
>
>                 Key: YARN-906
>                 URL: https://issues.apache.org/jira/browse/YARN-906
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: YARN-906.1.patch
>
>
> See 
> https://builds.apache.org/job/PreCommit-YARN-Build/1435//testReport/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClientNoCleanupOnStop/

