Danil Serdyuchenko created YARN-4549:
----------------------------------------

             Summary: Containers stuck in KILLING state
                 Key: YARN-4549
                 URL: https://issues.apache.org/jira/browse/YARN-4549
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Danil Serdyuchenko


We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the 
container executor, configured with cgroups. NM recovery is also enabled.

We observe a lot of containers that get stuck in the KILLING state after the 
NM tries to kill them. The container process keeps running indefinitely, which 
causes duplication because new containers are brought up to replace the stuck 
ones. Judging by the logs, the NM cannot get the container PID.

{noformat}
16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping 
container with container Id: container_1448454866800_0023_01_000005
16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user 
IP=10.51.111.243        OPERATION=Stop Container Request        
TARGET=ContainerManageImpl      RESULT=SUCCESS  
APPID=application_1448454866800_0023    
CONTAINERID=container_1448454866800_0023_01_000005
16/01/05 05:16:44 INFO container.ContainerImpl: Container 
container_1448454866800_0023_01_000005 transitioned from RUNNING to KILLING
16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container 
container_1448454866800_0023_01_000005
16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for 
container_1448454866800_0023_01_000005. Waited for 2000 ms.
{noformat}
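
For anyone trying to check whether a node is affected, here is a minimal sketch that scans NM log text for containers that entered KILLING, never left it, and also hit the "Could not get pid" message. It is only an illustration: the function name {{find_stuck_containers}} is hypothetical, and it assumes that a container leaving KILLING is logged in the same "transitioned from X to Y" form as the line quoted above.

```python
import re

# Patterns modelled on the NM log messages quoted above.
# LEFT_KILLING is an assumption: it expects the same "transitioned from X to Y"
# wording for the transition out of KILLING.
KILLING = re.compile(r"Container\s+(container_\w+)\s+transitioned from \w+ to KILLING")
LEFT_KILLING = re.compile(r"Container\s+(container_\w+)\s+transitioned from KILLING to \w+")
NO_PID = re.compile(r"Could not get pid for\s+(container_\w+)")


def find_stuck_containers(log_text):
    """Return sorted IDs of containers that entered KILLING, never left it,
    and for which the NM logged 'Could not get pid'."""
    # Collapse all whitespace so that records wrapped across lines
    # (as in the email above) still match.
    text = " ".join(log_text.split())
    entered = set(KILLING.findall(text))
    left = set(LEFT_KILLING.findall(text))
    no_pid = set(NO_PID.findall(text))
    return sorted((entered - left) & no_pid)
```

It could be run against a whole log file with {{find_stuck_containers(open(path).read())}}, where {{path}} points at the NM log on the node.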

The PID files for each container seem to be present on the node. We weren't 
able to reproduce this consistently and are hoping that someone has come 
across this before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
