Danil Serdyuchenko created YARN-4549:
----------------------------------------

Summary: Containers stuck in KILLING state
Key: YARN-4549
URL: https://issues.apache.org/jira/browse/YARN-4549
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Danil Serdyuchenko

We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the container executor, configured with cgroups. NM recovery is also enabled. We observe many containers getting stuck in the KILLING state after the NM tries to kill them. An affected container keeps running indefinitely, which causes duplication because new containers are brought up to replace it. Judging from the logs, the NM cannot get the container PID.

{noformat}
16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping container with container Id: container_1448454866800_0023_01_000005
16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user IP=10.51.111.243 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1448454866800_0023 CONTAINERID=container_1448454866800_0023_01_000005
16/01/05 05:16:44 INFO container.ContainerImpl: Container container_1448454866800_0023_01_000005 transitioned from RUNNING to KILLING
16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container container_1448454866800_0023_01_000005
16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for container_1448454866800_0023_01_000005. Waited for 2000 ms.
{noformat}

The PID files for each container seem to be present on the node. We weren't able to replicate this consistently and are hoping that someone has come across it before.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)