[
https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shane Kumpf resolved YARN-7278.
-------------------------------
Resolution: Duplicate
Fix Version/s: (was: 2.9.1)
> LinuxContainer in docker mode will be failed when nodemanager restart,
> because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>
> Key: YARN-7278
> URL: https://issues.apache.org/jira/browse/YARN-7278
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.8.0
> Environment: CentOS
> Reporter: zhengchenyu
> Priority: Major
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> In our cluster, nodemanager recovery is turned on, and we use LinuxContainer
> in docker mode.
> A container may fail when the nodemanager restarts; the exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO]
> containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java
> 472) [Container Monitor] : Memory usage of ProcessTree 120523 for
> container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical
> memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR]
> containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java
> 93) [ContainersLauncher #1] : Unable to recover container
> container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from
> container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO]
> containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142)
> [AsyncDispatcher event handler] : Container
> container_1506600355508_0023_01_000004 transitioned from RUNNING to
> EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO]
> containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java
> 440) [AsyncDispatcher event handler] : Cleaning up container
> container_1506600355508_0023_01_000004
> {code}
> I guess the process had finished, but 2 seconds later (the timeout held in the
> variable msecLeft) the *.pid.exitcode file still had not been created. After I
> changed the variable to 20000 ms, the container recovered successfully when
> the nodemanager restarted.
> So I think the timeout is too short for a docker container to complete its
> work.
> In docker mode of LinuxContainer, the nm monitors the real task, which is
> launched by the "docker run" command. Then the "docker wait" command waits for
> the exit code, and "docker rm" deletes the docker container. Lastly,
> container-executor writes the exit code. So if some docker command is slow
> enough, the nm fails to monitor the container. In fact, docker rm is always
> slow.
> I think the exit code of docker rm has nothing to do with the real task, so I
> think we could move the write of "*.pid.exitcode" before the docker rm
> command, or monitor the docker wait process instead of the real task.
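
The timing problem described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the NodeManager or container-executor code: the class `ExitCodeWait`, its methods, and the file name `container.pid.exitcode` are hypothetical, a sleeping thread stands in for the slow "docker rm" step, and the durations are scaled down from the 2 s and 20000 ms mentioned in the report so the example runs quickly.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; none of these names come from Hadoop.
class ExitCodeWait {

    /** Polls for the exit code file until it has content or timeoutMs elapses. */
    static int waitForExitCode(Path exitCodeFile, long timeoutMs, long pollMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (Files.exists(exitCodeFile)) {
                String text = new String(Files.readAllBytes(exitCodeFile),
                        StandardCharsets.UTF_8).trim();
                if (!text.isEmpty()) {
                    return Integer.parseInt(text);
                }
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IOException(
                        "Timeout while waiting for exit code from " + exitCodeFile);
            }
            Thread.sleep(pollMs);
        }
    }

    /**
     * Simulates the steps after the task exits: a slow cleanup (standing in
     * for "docker rm") and the write of the exit code file, in either order.
     */
    static Thread simulateExit(Path exitCodeFile, long cleanupMs, boolean writeBeforeCleanup) {
        Thread t = new Thread(() -> {
            try {
                byte[] code = "0".getBytes(StandardCharsets.UTF_8);
                if (writeBeforeCleanup) {
                    Files.write(exitCodeFile, code);
                }
                Thread.sleep(cleanupMs); // the slow cleanup step
                if (!writeBeforeCleanup) {
                    Files.write(exitCodeFile, code);
                }
            } catch (IOException | InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Returns true if the exit code was read before the timeout expired. */
    static boolean recovers(long cleanupMs, long timeoutMs, boolean writeBeforeCleanup) {
        try {
            Path dir = Files.createTempDirectory("exitcode-demo");
            Path file = dir.resolve("container.pid.exitcode");
            Thread worker = simulateExit(file, cleanupMs, writeBeforeCleanup);
            boolean ok;
            try {
                waitForExitCode(file, timeoutMs, 10);
                ok = true;
            } catch (IOException timeout) {
                ok = false;
            }
            worker.join(); // let the simulated cleanup finish before returning
            return ok;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("short timeout, write after cleanup:  " + recovers(600, 200, false));
        System.out.println("long timeout, write after cleanup:   " + recovers(600, 2000, false));
        System.out.println("short timeout, write before cleanup: " + recovers(600, 200, true));
    }
}
```

The three calls in `main` correspond to the failing case in the report, the workaround of raising the timeout, and the proposed reordering in which the exit code is written before the slow cleanup.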
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)