[
https://issues.apache.org/jira/browse/YARN-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455474#comment-16455474
]
Eric Badger commented on YARN-8209:
-----------------------------------
After looking at this failure a little more closely, I think that the fix is
going to require more than simply fixing the NPE that showed up in the Docker
deletion task. The NPE was caused by the following lines:
{noformat}
String commandFile =
    dockerClient.writeCommandToTempFile(dockerCommand,
        nmContext.getContainers().get(ContainerId.fromString(containerId)),
        nmContext);
{noformat}
Because the container is dead, the nmContext has no bookkeeping on it. So when
the code looks up the container by its containerId, it finds nothing and
returns null. Then, when we enter {{writeCommandToTempFile()}}, it immediately
tries to get the containerId from the (null) container and NPEs. This is an
easily fixable problem because we don't need to get the container from the
nmContext at all. {{writeCommandToTempFile()}} only uses the containerId, so we
can just pass that into the method instead of the container (a sketch of that
call-site change is below). I figured this would fix the issue and be a minor
patch update. However, that is not the case.
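Here is a rough sketch of the shape of that change (not a committed patch); the
overload of {{writeCommandToTempFile()}} that takes a {{ContainerId}} instead of
a {{Container}} is hypothetical until a patch actually adds it:
{code:java}
// Current call site: NPEs because the nmContext lookup returns null for a dead container.
//   String commandFile = dockerClient.writeCommandToTempFile(dockerCommand,
//       nmContext.getContainers().get(ContainerId.fromString(containerId)),
//       nmContext);

// Proposed shape (hypothetical overload): pass the id itself, which is all the method uses.
String commandFile = dockerClient.writeCommandToTempFile(dockerCommand,
    ContainerId.fromString(containerId), nmContext);
{code}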
After fixing the NPE, I tested the code again and found that some Docker
containers were still not being cleaned up after the debug delay, because the
.cmd file could not be written. The .cmd file is what we pass to the
container-executor to invoke all docker commands, including {{docker rm}}.
{noformat}
2018-04-26 19:28:17,631 WARN [DeletionService #0] docker.DockerClient
(DockerClient.java:writeCommandToTempFile(151)) - Unable to write docker
command to
/tmp/hadoop-local/nmPrivate/application_1524758461670_0003/container_1524758461670_0003_01_000003
{noformat}
This is where it gets a little trickier. The container's directory inside
nmPrivate is being cleaned up before the docker .cmd file is written, but not
for every container. It appears to be a race between the deletion task for the
nmPrivate directory and the {{DockerContainerDeletionTask}}.
{noformat:title=ContainerImpl.java:1530}
if (DockerLinuxContainerRuntime.isDockerContainerRequested(
    container.getLaunchContext().getEnvironment())) {
  removeDockerContainer(container);
}
if (clCleanupRequired) {
  container.dispatcher.getEventHandler().handle(
      new ContainersLauncherEvent(container,
          ContainersLauncherEventType.CLEANUP_CONTAINER));
}
container.cleanup();
{noformat}
{{removeDockerContainer(container);}} handles the creation of the
{{DockerContainerDeletionTask}}, while {{container.cleanup();}} creates the
{{FileDeletionTask}} that removes the container's nmPrivate directory. Both get
scheduled to run after {{delete.debug-delay-sec}} seconds, but they are added
to thread pools with no ordering guarantee between them. Sometimes the
{{DockerContainerDeletionTask}} wins the race (most of the time in my testing),
and sometimes the {{FileDeletionTask}} wins. In the latter case the docker .cmd
file fails to be written, so the docker container is never removed and sticks
around in the Exited state until it is cleaned up manually (a toy illustration
of the scheduling is below).
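To make the lack of ordering concrete, here is a small, self-contained
illustration (not NM code) of the situation: two tasks handed to a scheduled
pool with the same delay, where nothing guarantees which one runs or finishes
first:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DeletionRaceSketch {
  public static void main(String[] args) throws Exception {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
    // Stand-in for the DockerContainerDeletionTask: needs the nmPrivate dir to write the .cmd file.
    pool.schedule(() -> System.out.println("docker rm via .cmd file"), 5, TimeUnit.SECONDS);
    // Stand-in for the FileDeletionTask: deletes that same nmPrivate directory.
    pool.schedule(() -> System.out.println("delete nmPrivate container dir"), 5, TimeUnit.SECONDS);
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
  }
}
{code}
Whichever stand-in runs first is effectively arbitrary, which matches the
intermittent cleanup failures I saw.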
Because of all of this, *I believe that the correct course of action is to
revert YARN-8064 to unblock the impending 3.1 release*. That way, if the
release managers want to go ahead with a new release, they can do so without
this blocker. However, we should try to get a fix in quickly so that YARN-8064
can go back in, since [~vinodkv] mentioned that he would like it in the next
release.
As for a fix, my proposed solution would be to entirely remove the .cmd file
requirement when invoking the container-executor for docker commands. That way
we could call the container-executor directly from the
{{DockerContainerDeletionTask}} without writing a file anywhere. Since we don't
need to write a file, we can't race with the removal of the container's
nmPrivate directory (rough sketch below).
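A rough sketch of that direction follows. This is not an actual
container-executor interface; the {{--remove-docker-container}} flag and the
binary path shown here are placeholders for whatever the real invocation ends
up being:
{code:java}
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class DirectDockerRemoveSketch {
  // Placeholder path; the real path would come from the NM configuration.
  private static final String CONTAINER_EXECUTOR = "/path/to/container-executor";

  public static void removeDockerContainer(String dockerContainerId)
      throws IOException, InterruptedException {
    // Pass the small docker command straight to the binary as argv; no temp .cmd file
    // is written, so there is nothing to race with the FileDeletionTask.
    List<String> argv = Arrays.asList(
        CONTAINER_EXECUTOR, "--remove-docker-container", dockerContainerId);
    Process p = new ProcessBuilder(argv).inheritIO().start();
    int rc = p.waitFor();
    if (rc != 0) {
      throw new IOException("container-executor exited with " + rc);
    }
  }
}
{code}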
Additionally, I think this is a good change overall, independent of fixing this
removal issue. It is unnecessary to write a file to disk just to run
{{docker rm <container>}}, and if the disk is pegged that write could be very
slow and unreliable; if we don't have to write to disk, we shouldn't. For
something like the {{docker run}} command, a .cmd file makes sense because of
the potentially massive number of volume mount arguments. But for smaller
commands, I think we should just pass them as separate arguments to the
container-executor. This also makes the code path within the container-executor
significantly smaller, since it wouldn't have to parse a .cmd file. Overall, I
think it gives us many wins.
cc [[email protected]], [~jlowe], [~eyang], [~Jim_Brennan]
> NPE in DeletionService
> ----------------------
>
> Key: YARN-8209
> URL: https://issues.apache.org/jira/browse/YARN-8209
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Chandni Singh
> Assignee: Eric Badger
> Priority: Major
>
> {code:java}
> 2018-04-25 23:38:41,039 WARN concurrent.ExecutorHelper (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in thread DeletionService #1:
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerClient.writeCommandToTempFile(DockerClient.java:109)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:85)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:192)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:128)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:935)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748){code}