[jira] [Created] (YARN-7278) LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.

zhengchenyu (JIRA) Sat, 30 Sep 2017 03:01:46 -0700

zhengchenyu created YARN-7278:
---------------------------------

             Summary: LinuxContainer in docker mode will be failed when 
nodemanager restart, because timeout for docker is too slow.
                 Key: YARN-7278
                 URL: https://issues.apache.org/jira/browse/YARN-7278
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.7.1
         Environment: CentOS
            Reporter: zhengchenyu
             Fix For: 2.9.0



In our cluster, NodeManager recovery is turned on, and we use LinuxContainer 
in Docker mode.
A container may fail when the NodeManager restarts. The log is below:

{code}
[2017-09-29T15:47:14.433+08:00] [INFO] containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java 472) [Container Monitor] : Memory usage of ProcessTree 120523 for container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical memory used; -1B of 31 GB virtual memory used
[2017-09-29T15:47:15.219+08:00] [ERROR] containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java 93) [ContainersLauncher #1] : Unable to recover container container_1506600355508_0023_01_000004
java.io.IOException: Timeout while waiting for exit code from container_1506600355508_0023_01_000004
[2017-09-29T15:47:15.220+08:00] [INFO] containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) [AsyncDispatcher event handler] : Container container_1506600355508_0023_01_000004 transitioned from RUNNING to EXITED_WITH_FAILURE
[2017-09-29T15:47:15.221+08:00] [INFO] containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java 440) [AsyncDispatcher event handler] : Cleaning up container container_1506600355508_0023_01_000004
{code}

I guess the process had finished, but 2 seconds later (the timeout is held in 
the variable msecLeft) the *.pid.exitcode file still had not been created. 
After I changed the variable to 20000 ms, the container was recovered 
successfully when the NodeManager restarted.
So I think 2 seconds is too short for the Docker container to complete its 
work.
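
To make the timing issue concrete, here is a minimal sketch of a wait loop 
with a configurable timeout. It is not the NodeManager's actual recovery 
code: the class ExitCodeWaiter, the method waitForExitCode, and its 
parameters are hypothetical names chosen for illustration, and the sketch 
assumes the exit code file contains a single integer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not the NodeManager's recovery code.
final class ExitCodeWaiter {
    private ExitCodeWaiter() {}

    /**
     * Polls until exitCodeFile exists and is non-empty, then parses its
     * content as an integer. Throws IOException if that does not happen
     * within timeoutMs milliseconds.
     */
    static int waitForExitCode(Path exitCodeFile, long timeoutMs, long pollMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (Files.exists(exitCodeFile) && Files.size(exitCodeFile) > 0) {
                String text = new String(
                        Files.readAllBytes(exitCodeFile), StandardCharsets.UTF_8);
                return Integer.parseInt(text.trim());
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IOException(
                        "Timeout while waiting for exit code from " + exitCodeFile);
            }
            Thread.sleep(pollMs);
        }
    }
}
```

With a 2000 ms timeout, any cleanup step that takes longer than 2 seconds 
before the file is written ends in the IOException branch. With 20000 ms the 
same sequence would have time to finish.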

In Docker mode of LinuxContainer, the NM monitors the real process, which is 
launched by the "docker run" command. Then the "docker wait" command waits 
for the exit code, and "docker rm" deletes the Docker container. Lastly, 
container-executor writes the exit code. So if any docker command is slow 
enough, the exit code file is not written in time and the NM gives up on the 
container. In fact, docker rm is always slow. 

I think the exit code of docker rm has nothing to do with the real task, so 
we could write "*.pid.exitcode" before running docker rm. Alternatively, the 
NM could monitor the docker wait process instead of the real task.
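
The first proposal can be sketched as follows. This is only an illustration 
of the ordering: the real container-executor is a native binary, and the 
class DockerCompletionOrder and its three callbacks (standing in for 
"docker wait", writing *.pid.exitcode, and "docker rm") are hypothetical.

```java
import java.util.function.IntConsumer;
import java.util.function.IntSupplier;

// Illustration only: the callbacks stand in for "docker wait", writing
// *.pid.exitcode, and "docker rm". All names here are hypothetical.
final class DockerCompletionOrder {
    private DockerCompletionOrder() {}

    static int finish(IntSupplier dockerWait,
                      IntConsumer writeExitCode,
                      Runnable dockerRm) {
        // Blocks until the container's process exits.
        int exitCode = dockerWait.getAsInt();
        // Publish the result before any cleanup starts.
        writeExitCode.accept(exitCode);
        try {
            // Cleanup may be slow or may fail.
            dockerRm.run();
        } catch (RuntimeException e) {
            // The outcome of cleanup does not change the task's exit code.
            System.err.println("docker rm failed: " + e.getMessage());
        }
        return exitCode;
    }
}
```

Because the exit code is written before cleanup, a slow or failing docker rm 
no longer delays the file that the NM is waiting for.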



