[ 
https://issues.apache.org/jira/browse/YARN-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558720#comment-16558720
 ] 

Eric Yang edited comment on YARN-8587 at 7/26/18 6:30 PM:
----------------------------------------------------------

This bug results from docker run in detached mode reporting exit code 0 even 
though the process inside the container fails to run.  For a brief period, 
the node manager reports the container as RUNNING, then fails it later.  
One possible solution is to make container-executor's non-entry-point mode 
behave more like entry-point mode: run docker run in the foreground and have 
the parent process retry docker inspect a set number of times to obtain the 
PID.  This removes the possible false-positive reporting of the RUNNING 
state.  The synthetic timeout approach may kill a container prematurely if 
the container takes more than 30 seconds (or the configured value) to start 
its first process, or wait longer than necessary for a failing container.  
Do we want to make non-entry-point mode work like entry-point mode to 
prevent the false positive, or are we OK with the current state?
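
To make the retry idea concrete, here is a minimal sketch of a parent 
process polling docker inspect for the container PID.  It is written in 
Python for brevity; the real container-executor is C, and this is not its 
code.  The function names, the retry count, and the delay are hypothetical 
choices for illustration.  The only docker call used is 
docker inspect --format '{{.State.Pid}}', which prints 0 while the container 
has no running process.

```python
import subprocess
import time
from typing import Callable, Optional


def docker_inspect_pid(container: str) -> int:
    """Return the PID docker reports for the container, or 0 if none yet."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Pid}}", container],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return 0  # docker does not know this container (yet)
    try:
        return int(result.stdout.strip())
    except ValueError:
        return 0  # unexpected output; treat as "no PID"


def wait_for_pid(
    container: str,
    retries: int = 10,           # assumed value, not a YARN default
    delay_seconds: float = 1.0,  # assumed value, not a YARN default
    inspect: Callable[[str], int] = docker_inspect_pid,
) -> Optional[int]:
    """Poll until a non-zero PID appears; return None once retries run out."""
    for attempt in range(retries):
        pid = inspect(container)
        if pid > 0:
            return pid
        if attempt < retries - 1:
            time.sleep(delay_seconds)
    return None
```

With a loop like this, the caller would report RUNNING only after 
wait_for_pid returns a PID, and could fail the container right away when it 
returns None instead of waiting for a fixed timeout.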


was (Author: eyang):
This bug results from docker run in detached mode reporting exit code 0 even 
though the process inside the container fails to run.  For a brief period, 
the node manager reports the container as RUNNING, then fails it later.  
One possible solution is to make container-executor's non-entry-point mode 
behave more like entry-point mode: run docker run in the foreground and have 
the parent process retry docker inspect a set number of times to obtain the 
PID.  This removes the possible false-positive reporting of the RUNNING 
state.  The synthetic timeout approach may kill a container prematurely if 
the container takes more than 30 seconds (or the configured value) to start 
its first process, or wait longer than necessary for a failing container.

> Delays are noticed to launch docker container
> ---------------------------------------------
>
>                 Key: YARN-8587
>                 URL: https://issues.apache.org/jira/browse/YARN-8587
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Yesha Vora
>            Priority: Major
>
> Launch a dshell application and wait for it to reach the RUNNING state.
> {code:java}
> yarn  jar /xx/hadoop-yarn-applications-distributedshell-*.jar  -shell_command 
> "sleep 300" -num_containers 1 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker 
> -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=httpd:0.1 -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true -jar 
> /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell-xx.jar
> {code}
> Find the container allocation, then run docker inspect for the docker 
> containers launched by the app.
> Sometimes the container is allocated to the NM but the docker PID is not up.
> {code:java}
> Command ssh -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null 
> xxx "sudo su - -c \"docker ps  -a | grep 
> container_e02_1531189225093_0003_01_000002\" root" failed after 0 retries 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
