[
https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197011#comment-16197011
]
Shane Kumpf commented on YARN-7278:
-----------------------------------
Writing the pid out to a file also suffers from this issue. If a container is
immediately requested to stop, the pid file is not yet available, and thus the
container can leak. I'll also note that with Docker's live restore feature we
need to eliminate the use of {{docker wait}} completely as this breaks when
Docker is restarted. Improving container lifecycle management is a high
priority that I expect to be able to revisit soon now that YARN-6623 is
wrapping up. See YARN-5366, YARN-5818, and YARN-6305 for additional items I've
noticed on this topic.
> LinuxContainer in docker mode will be failed when nodemanager restart,
> because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>
> Key: YARN-7278
> URL: https://issues.apache.org/jira/browse/YARN-7278
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.8.0
> Environment: CentOS
> Reporter: zhengchenyu
> Fix For: 2.9.0
>
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> In our cluster, nodemanager recovery is turned on, and we use LinuxContainer
> in docker mode.
> Containers may fail when the nodemanager restarts. The exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO]
> containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java
> 472) [Container Monitor] : Memory usage of ProcessTree 120523 for
> container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical
> memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR]
> containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java
> 93) [ContainersLauncher #1] : Unable to recover container
> container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from
> container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO]
> containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142)
> [AsyncDispatcher event handler] : Container
> container_1506600355508_0023_01_000004 transitioned from RUNNING to
> EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO]
> containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java
> 440) [AsyncDispatcher event handler] : Cleaning up container
> container_1506600355508_0023_01_000004
> {code}
> I guess the process is done, but 2 seconds later (the variable is msecLeft)
> the *.pid.exitcode file still hasn't been created. After I changed the
> variable to 20000 ms, the container succeeded when the nodemanager restarted.
> So I think 2 seconds is too short for the docker container to complete its work.
> In docker mode of LinuxContainer, the nm monitors the real task, which is
> launched by the "docker run" command. Then the "docker wait" command waits for
> the exit code, and "docker rm" deletes the docker container. Lastly,
> container-executor writes the exit code. So if any docker command is slow
> enough, the nm will no longer monitor the container. In fact, docker rm is
> always slow.
> I think the exit code of docker rm doesn't matter to the real task, so I
> think we could move the write of "*.pid.exitcode" before the docker rm
> command. Or we could monitor the docker wait process instead of the real
> task.
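
As a rough illustration of the ordering issue described above, here is a small Python simulation. It is not the container-executor or NodeManager code. The function names (write_exit_code, wait_for_exit_code, finish_container, demo) are hypothetical, a sleep stands in for a slow "docker rm", and the timings are scaled down from the 2000 ms and 20000 ms values mentioned in the report. It only shows that a timed wait for the exit-code file can expire when the file is written after a slow cleanup step, and does not expire when the file is written first.

```python
import os
import tempfile
import threading
import time


def write_exit_code(path, exit_code):
    """Write the exit code via a temp file and rename, so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(exit_code))
    os.replace(tmp, path)


def wait_for_exit_code(path, timeout_s, poll_s=0.01):
    """Poll for the exit-code file; return its integer value, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with open(path) as f:
                text = f.read().strip()
            if text:
                return int(text)
        except FileNotFoundError:
            pass  # file not created yet, keep polling
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_s)


def finish_container(path, exit_code, cleanup_s, write_before_cleanup):
    """Simulated teardown; time.sleep(cleanup_s) stands in for a slow `docker rm`."""
    if write_before_cleanup:
        write_exit_code(path, exit_code)
        time.sleep(cleanup_s)
    else:
        time.sleep(cleanup_s)
        write_exit_code(path, exit_code)


def demo(write_before_cleanup, cleanup_s=0.5, timeout_s=0.2):
    """Run the teardown in a thread while the caller waits for the exit code."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "container.pid.exitcode")
        t = threading.Thread(
            target=finish_container,
            args=(path, 0, cleanup_s, write_before_cleanup),
        )
        t.start()
        result = wait_for_exit_code(path, timeout_s)
        t.join()  # let the teardown finish before the directory is removed
        return result


if __name__ == "__main__":
    print("exit code written after cleanup: ", demo(False))
    print("exit code written before cleanup:", demo(True))
```

With the default timings, the wait is expected to time out when the write follows the cleanup and to return the exit code when the write comes first. Raising timeout_s above cleanup_s corresponds to the workaround of increasing the timeout.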
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)