[jira] [Commented] (YARN-4549) Containers stuck in KILLING state

Danil Serdyuchenko (JIRA) Fri, 08 Jan 2016 01:53:31 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088981#comment-15088981
 ]


Danil Serdyuchenko commented on YARN-4549:
------------------------------------------

Yep, looks like we had tmpwatch delete all files older than 10 days. We have 
set the NM local-dirs to be outside of tmp. [~jlowe] thanks for your help. 

> Containers stuck in KILLING state
> ---------------------------------
>
>                 Key: YARN-4549
>                 URL: https://issues.apache.org/jira/browse/YARN-4549
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
>
> We are running samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the 
> container-executor with cgroups configuration. Also we have NM recovery 
> enabled.
> We observe a lot of containers that get stuck in the KIILLING state after the 
> NM tries to kill them. The container remains running indefinitely, this 
> causes some duplication as new containers are brought up to replace them. 
> Looking through the logs NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping 
> container with container Id: container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user 
> IP=10.51.111.243        OPERATION=Stop Container Request        
> TARGET=ContainerManageImpl      RESULT=SUCCESS  
> APPID=application_1448454866800_0023    
> CONTAINERID=container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container 
> container_1448454866800_0023_01_000005 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container 
> container_1448454866800_0023_01_000005
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for 
> container_1448454866800_0023_01_000005. Waited for 2000 ms.
> {noformat}
> The PID files for containers in the KILLING state are missing, and a few 
> other container that have been in the RUNNING state for a few weeks are also 
> missing them.  We waren't able to consistently replicate this and hoping that 
> someone has come across this before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-4549) Containers stuck in KILLING state

Reply via email to