Eric Badger commented on YARN-7189:

Attaching a first patch to fix this issue. There is a race in the removal of 
the docker container: the pid may no longer be valid (no such process) while 
the docker container is still in the running state. To handle that, this patch 
adds an exponential backoff to the removal. It retries for up to 5 iterations 
with increasing sleep times and gives up after the last one. 
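
For readers who want the shape of the retry logic, here is a minimal sketch of 
removal with exponential backoff. It is not the code from the attached patch. 
The names remove_with_backoff, remove_fn, stub_remove, and MAX_REMOVE_ATTEMPTS 
are hypothetical, and the doubling factor is an assumption; the comment above 
only says that the sleep times increase over 5 iterations.

```c
#include <stdio.h>
#include <unistd.h>

/* Assumption: mirrors the "5 iterations" mentioned in the comment. */
#define MAX_REMOVE_ATTEMPTS 5

/* Hypothetical callback type: returns 0 once the container has been removed. */
typedef int (*remove_fn)(const char *container_name);

/*
 * Call try_remove up to MAX_REMOVE_ATTEMPTS times, sleeping between failed
 * attempts and doubling the sleep each time (doubling is an assumption).
 * Returns 0 on success, -1 if every attempt failed.
 */
static int remove_with_backoff(const char *container_name,
                               remove_fn try_remove,
                               unsigned int initial_sleep_secs) {
  unsigned int delay = initial_sleep_secs;
  for (int attempt = 1; attempt <= MAX_REMOVE_ATTEMPTS; attempt++) {
    if (try_remove(container_name) == 0) {
      return 0;
    }
    fprintf(stderr, "Removal of %s failed (attempt %d of %d)\n",
            container_name, attempt, MAX_REMOVE_ATTEMPTS);
    if (attempt < MAX_REMOVE_ATTEMPTS) {
      sleep(delay);   /* no sleep after the final attempt; just give up */
      delay *= 2;
    }
  }
  return -1;
}

/* Stub standing in for a real removal call; fails a configurable number
 * of times before succeeding, which imitates the race described above. */
static int stub_failures_left = 0;
static int stub_calls = 0;

static int stub_remove(const char *container_name) {
  (void)container_name;
  stub_calls++;
  if (stub_failures_left > 0) {
    stub_failures_left--;
    return -1;
  }
  return 0;
}
```

With stub_failures_left set to 2, remove_with_backoff("some_container", 
stub_remove, 1) would fail twice, sleep 1 and then 2 seconds, and succeed on 
the third attempt. In a real executor the callback would run the actual 
removal command and check its result.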

> Container-executor doesn't remove Docker containers that error out early
> ------------------------------------------------------------------------
>                 Key: YARN-7189
>                 URL: https://issues.apache.org/jira/browse/YARN-7189
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>    Affects Versions: 2.9.0, 2.8.3, 3.0.1
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>         Attachments: YARN-7189-b3.0.001.patch
> Once the docker run command is executed, the docker container is created 
> unless the return code is 125, which means that the run command itself failed 
> (https://docs.docker.com/engine/reference/run/#exit-status). Any error that 
> happens after the docker run must therefore remove the container during cleanup.
> {noformat:title=container-executor.c:launch_docker_container_as_user}
>   snprintf(docker_command_with_binary, command_size, "%s %s", docker_binary, docker_command);
>   fprintf(LOGFILE, "Launching docker container...\n");
>   FILE* start_docker = popen(docker_command_with_binary, "r");
> {noformat}
> This is fixed by YARN-5366, which changes how we remove containers. However, 
> that was committed into 3.1.0, so 2.8, 2.9, and 3.0 are all affected.

This message was sent by Atlassian JIRA

