[
https://issues.apache.org/jira/browse/YARN-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558720#comment-16558720
]
Eric Yang edited comment on YARN-8587 at 7/26/18 6:30 PM:
----------------------------------------------------------
This bug occurs because docker run in detached mode reports exit code 0 even
when the process inside the container fails to run. For a brief period, the
node manager reports the container as RUNNING and only later fails it. One
possible solution is to make container-executor's non-entry-point mode behave
more like entry-point mode: run docker run in the foreground, and have the
parent process retry docker inspect a set number of times to obtain the PID.
This removes the possible false-positive report of RUNNING state. The
synthetic timeout approach may kill the container prematurely if it takes more
than 30 seconds (or the configured value) to start its first process, or may
wait longer than necessary for a container that is failing. Do we want to make
non-entry-point mode work like entry-point mode to prevent the false positive,
or are we OK with the current state?
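
The retry idea could look roughly like the sketch below. It is written in
Python for brevity and is not the container-executor code; the function names,
the retry count, and the delay are assumptions made for illustration. The only
real interface it relies on is docker inspect with a --format template of
{{.State.Pid}}, which prints 0 while the container has no running process.

```python
import subprocess
import time
from typing import Callable, Optional


def docker_inspect_pid(container_id: str) -> int:
    """Return the PID reported by `docker inspect`, or 0 if none is available."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Pid}}", container_id],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return 0  # the daemon does not know this container (yet)
    try:
        return int(result.stdout.strip())
    except ValueError:
        return 0  # unexpected output; treat as "no PID"


def wait_for_pid(
    container_id: str,
    inspect: Callable[[str], int] = docker_inspect_pid,
    retries: int = 10,            # hypothetical default, not a YARN setting
    delay_seconds: float = 1.0,   # hypothetical default, not a YARN setting
) -> Optional[int]:
    """Poll `inspect` up to `retries` times; return the PID, or None on failure."""
    for attempt in range(retries):
        pid = inspect(container_id)
        if pid > 0:
            return pid
        if attempt < retries - 1:
            time.sleep(delay_seconds)  # back off before the next inspect
    return None
```

With this shape, the caller would report RUNNING only after wait_for_pid
returns a PID, and treat None as a launch failure.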
> Delays are noticed to launch docker container
> ---------------------------------------------
>
> Key: YARN-8587
> URL: https://issues.apache.org/jira/browse/YARN-8587
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.1
> Reporter: Yesha Vora
> Priority: Major
>
> Launch a dshell application and wait for it to reach the RUNNING state.
> {code:java}
> yarn jar /xx/hadoop-yarn-applications-distributedshell-*.jar -shell_command
> "sleep 300" -num_containers 1 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker
> -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=httpd:0.1 -shell_env
> YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true -jar
> /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell-xx.jar
> {code}
> Find the container allocation, then run the docker inspect command for the
> docker containers launched by the app.
> Sometimes the container is allocated to the NM but the docker PID is not up.
> {code:java}
> Command ssh -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
> xxx "sudo su - -c \"docker ps -a | grep
> container_e02_1531189225093_0003_01_000002\" root" failed after 0 retries
> {code}