[
https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe resolved YARN-4549.
------------------------------
Resolution: Invalid
Glad to hear the cause was found! Be sure to check that the NM state store is
also stored outside of /tmp, or it too can become a victim of tmpwatch.
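
A quick way to act on this is to check the relevant NM directories lexically before tmpwatch gets to them. The following is only a minimal sketch: the helper names `is_under` and `at_risk` are hypothetical, the example paths are made up, and it assumes the directories of interest are the ones configured by `yarn.nodemanager.local-dirs` and `yarn.nodemanager.recovery.dir`. It does not read yarn-site.xml; the values have to be copied in by hand.

```python
from pathlib import PurePosixPath


def is_under(path, root="/tmp"):
    """Return True if `path` is `root` itself or lies beneath it.

    This is a purely lexical check; symlinks are not resolved.
    """
    p = PurePosixPath(path)
    r = PurePosixPath(root)
    return p == r or r in p.parents


def at_risk(settings, root="/tmp"):
    """Return the sorted names of settings with any directory under `root`.

    `settings` maps a property name to a comma-separated list of directories,
    which is how multi-directory values are written in the NM configuration.
    """
    risky = []
    for name, value in settings.items():
        dirs = [d.strip() for d in value.split(",") if d.strip()]
        if any(is_under(d, root) for d in dirs):
            risky.append(name)
    return sorted(risky)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = {
        "yarn.nodemanager.local-dirs": "/data/1/yarn/local,/data/2/yarn/local",
        "yarn.nodemanager.recovery.dir": "/tmp/hadoop-yarn/yarn-nm-recovery",
    }
    for name in at_risk(example):
        print(name, "has a directory under /tmp")
```

Any property the sketch reports should be pointed at a location that tmpwatch does not clean.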
> Containers stuck in KILLING state
> ---------------------------------
>
> Key: YARN-4549
> URL: https://issues.apache.org/jira/browse/YARN-4549
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: Danil Serdyuchenko
>
> We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the
> container executor, configured with cgroups. We also have NM recovery
> enabled.
> We observe a lot of containers that get stuck in the KILLING state after the
> NM tries to kill them. The containers remain running indefinitely, which
> causes some duplication as new containers are brought up to replace them.
> Looking through the logs, the NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping
> container with container Id: container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user
> IP=10.51.111.243 OPERATION=Stop Container Request
> TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_1448454866800_0023
> CONTAINERID=container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container
> container_1448454866800_0023_01_000005 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container
> container_1448454866800_0023_01_000005
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for
> container_1448454866800_0023_01_000005. Waited for 2000 ms.
> {noformat}
> The PID files for containers in the KILLING state are missing, and a few
> other containers that have been in the RUNNING state for a few weeks are also
> missing them. We weren't able to consistently replicate this and are hoping
> that someone has come across this before.