Danil Serdyuchenko commented on YARN-4549:

We did some more digging and found that a few containers currently in the 
RUNNING state are missing their directories under the {{nmPrivate}} dir. The 
web interface reports that the containers are running on that node, and the 
container processes are there too, but the entire application dir is missing 
under {{nmPrivate}}.
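
For anyone who wants to run the same check on their own nodes, here is a rough sketch. It assumes the NM keeps each PID file at {{<nm-local-dir>/nmPrivate/<application id>/<container id>/<container id>.pid}}. That layout, the container id pattern, and the helper names are assumptions made for illustration and are not confirmed anywhere in this thread.

```python
import re
from pathlib import Path

# Assumed container id shape: container_[e<epoch>_]<clusterTs>_<appSeq>_<attempt>_<num>
CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$")


def app_id_for(container_id: str) -> str:
    """Derive the application id that a container id belongs to."""
    match = CONTAINER_ID.match(container_id)
    if match is None:
        raise ValueError(f"unrecognised container id: {container_id}")
    cluster_ts, app_seq = match.groups()
    return f"application_{cluster_ts}_{app_seq}"


def expected_pid_file(nm_local_dir: str, container_id: str) -> Path:
    """Build the path where the PID file is assumed to live (hypothetical layout)."""
    return (
        Path(nm_local_dir)
        / "nmPrivate"
        / app_id_for(container_id)
        / container_id
        / f"{container_id}.pid"
    )


def missing_pid_files(nm_local_dirs, container_ids):
    """Return the container ids that have no PID file under any of the local dirs."""
    return [
        cid
        for cid in container_ids
        if not any(expected_pid_file(d, cid).is_file() for d in nm_local_dirs)
    ]
```

Feeding it the container ids that the web interface lists as RUNNING on a node, together with that node's configured local dirs, gives the list of containers whose PID file cannot be found.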

[~jlowe] This usually happens to long-running containers. The PID files are 
missing for containers in KILLING state, and for certain RUNNING containers. 
The PID file should be under {{nm-local-dir}}; for us it's: 

> Containers stuck in KILLING state
> ---------------------------------
>                 Key: YARN-4549
>                 URL: https://issues.apache.org/jira/browse/YARN-4549
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
> We are running samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the 
> container-executor with cgroups configuration. Also we have NM recovery 
> enabled.
> We observe a lot of containers that get stuck in the KILLING state after the 
> NM tries to kill them. The containers remain running indefinitely, which 
> causes some duplication as new containers are brought up to replace them. 
> Looking through the logs, the NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping 
> container with container Id: container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user 
> IP=        OPERATION=Stop Container Request        
> TARGET=ContainerManageImpl      RESULT=SUCCESS  
> APPID=application_1448454866800_0023    
> CONTAINERID=container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container 
> container_1448454866800_0023_01_000005 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container 
> container_1448454866800_0023_01_000005
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for 
> container_1448454866800_0023_01_000005. Waited for 2000 ms.
> {noformat}
> The PID files for each container seem to be present on the node. We weren't 
> able to consistently replicate this and are hoping that someone has come 
> across this before.

This message was sent by Atlassian JIRA
