[jira] [Commented] (YARN-4549) Containers stuck in KILLING state

2016-01-08 Thread Danil Serdyuchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088981#comment-15088981
 ] 

Danil Serdyuchenko commented on YARN-4549:
--

Yep, it looks like we had tmpwatch deleting all files older than 10 days. We 
have now set the NM local-dirs to be outside of /tmp. [~jlowe], thanks for your help. 
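
For anyone else who hits this, the fix boiled down to pointing 
{{yarn.nodemanager.local-dirs}} at a location that no tmp cleaner touches. A 
minimal yarn-site.xml sketch (the path below is just an example, not our exact 
layout):

{noformat}
<property>
  <!-- Keep NM local dirs out of /tmp so tmpwatch-style cleaners can't reap
       pid files, nmPrivate dirs, or distcache entries. Example path only. -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/var/hadoop/yarn/nm-local-dir</value>
</property>
{noformat}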

> Containers stuck in KILLING state
> -
>
> Key: YARN-4549
> URL: https://issues.apache.org/jira/browse/YARN-4549
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Danil Serdyuchenko
>
> We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the 
> container executor and cgroups configured. We also have NM recovery enabled.
> We observe a lot of containers that get stuck in the KILLING state after the 
> NM tries to kill them. The containers remain running indefinitely, which 
> causes some duplication as new containers are brought up to replace them. 
> Looking through the logs, the NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping 
> container with container Id: container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user 
> IP=10.51.111.243    OPERATION=Stop Container Request
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1448454866800_0023
> CONTAINERID=container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container 
> container_1448454866800_0023_01_05 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container 
> container_1448454866800_0023_01_05
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for 
> container_1448454866800_0023_01_05. Waited for 2000 ms.
> {noformat}
> The PID files for containers in the KILLING state are missing, and a few 
> other containers that have been in the RUNNING state for a few weeks are also 
> missing them. We weren't able to consistently replicate this and are hoping 
> that someone has come across this before.





[jira] [Commented] (YARN-4549) Containers stuck in KILLING state

2016-01-07 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087548#comment-15087548
 ] 

Jason Lowe commented on YARN-4549:
--

If this only happens to long-running containers, and the pid files are missing 
even for RUNNING containers that have been up a while, then I suspect something 
is coming along at some point and blowing away the pid files because they are 
too old. Is there a tmp cleaner like tmpwatch or some other periodic 
maintenance process that could be cleaning up these "old" files? A while back 
someone reported NM recovery issues because they were storing the NM leveldb 
state store files in /tmp, and a tmp cleaner was periodically deleting some of 
the old leveldb files and corrupting the database.

You could also look in other areas under {{nmPrivate}} and see whether some of 
the distributed cache directories have been removed as well. If that's the 
case, you should see messages like "Resource XXX is missing, localizing it 
again" in the NM logs as it tries to reuse a distcache entry but then discovers 
it is mysteriously missing from the local disk. If whole directories have been 
reaped, including the dist cache entries, then that would strongly point to a 
periodic cleanup process like tmpwatch or something similar.
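
For example, on a stock RHEL/CentOS host the checks might look roughly like 
this (the cron script and NM log locations are assumptions about a typical 
setup, so adjust them to yours):

{noformat}
# Is a tmp cleaner installed and scheduled? tmpwatch normally ships a cron.daily script.
ls /etc/cron.daily/ | grep -i tmp
cat /etc/cron.daily/tmpwatch

# Has the NM already noticed distcache entries vanishing from local disk?
grep "is missing, localizing it again" /var/log/hadoop-yarn/*nodemanager*.log
{noformat}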

> Containers stuck in KILLING state
> -
>
> Key: YARN-4549
> URL: https://issues.apache.org/jira/browse/YARN-4549
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Danil Serdyuchenko
>
> We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the 
> container executor and cgroups configured. We also have NM recovery enabled.
> We observe a lot of containers that get stuck in the KILLING state after the 
> NM tries to kill them. The containers remain running indefinitely, which 
> causes some duplication as new containers are brought up to replace them. 
> Looking through the logs, the NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping 
> container with container Id: container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user 
> IP=10.51.111.243    OPERATION=Stop Container Request
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1448454866800_0023
> CONTAINERID=container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container 
> container_1448454866800_0023_01_05 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container 
> container_1448454866800_0023_01_05
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for 
> container_1448454866800_0023_01_05. Waited for 2000 ms.
> {noformat}
> The PID files for containers in the KILLING state are missing, and a few 
> other containers that have been in the RUNNING state for a few weeks are also 
> missing them. We weren't able to consistently replicate this and are hoping 
> that someone has come across this before.





[jira] [Commented] (YARN-4549) Containers stuck in KILLING state

2016-01-07 Thread Danil Serdyuchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087217#comment-15087217
 ] 

Danil Serdyuchenko commented on YARN-4549:
--

We did some more digging and found that a few containers that are currently in 
the RUNNING state are missing their directories under the {{nmPrivate}} dir. 
The web interface reports that the containers are running on that node and the 
container processes are there too, but the entire application dir under 
{{nmPrivate}} is missing.

[~jlowe] This usually happens to long-running containers. The PID files are 
missing for containers in the KILLING state, and for certain RUNNING containers. 
The pid file should be under {{nm-local-dir}}; for us that is 
{{/tmp/hadoop-ec2-user/nm-local-dir/nmPrivate/<application_id>/<container_id>/<container_id>.pid}}.
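
A quick way to cross-check, in case it helps anyone else (assuming the same 
local-dir layout as above):

{noformat}
# pid files the NM still has on disk...
find /tmp/hadoop-ec2-user/nm-local-dir/nmPrivate -name '*.pid'

# ...versus the container processes actually running on the node
# (the container ID shows up in the launch script path of each process).
ps -ef | grep '[c]ontainer_'
{noformat}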

> Containers stuck in KILLING state
> -
>
> Key: YARN-4549
> URL: https://issues.apache.org/jira/browse/YARN-4549
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Danil Serdyuchenko
>
> We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the 
> container executor and cgroups configured. We also have NM recovery enabled.
> We observe a lot of containers that get stuck in the KILLING state after the 
> NM tries to kill them. The containers remain running indefinitely, which 
> causes some duplication as new containers are brought up to replace them. 
> Looking through the logs, the NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping 
> container with container Id: container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user 
> IP=10.51.111.243    OPERATION=Stop Container Request
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1448454866800_0023
> CONTAINERID=container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container 
> container_1448454866800_0023_01_05 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container 
> container_1448454866800_0023_01_05
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for 
> container_1448454866800_0023_01_05. Waited for 2000 ms.
> {noformat}
> The PID files for each container seem to be present on the node. We weren't 
> able to consistently replicate this and are hoping that someone has come 
> across this before.





[jira] [Commented] (YARN-4549) Containers stuck in KILLING state

2016-01-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085737#comment-15085737
 ] 

Jason Lowe commented on YARN-4549:
--

Did the kill occur shortly after the container was started? I'm wondering if 
the pid file somehow appeared _after_ the attempt to kill. What does 
{{ls -l --full-time}} show for the pid file, and how does that correlate to 
the timestamps in the NM log? Also, just to verify it's in the right place, 
where is the pid file located relative to the yarn local directory root?
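
For instance, something along these lines, with the placeholders filled in for 
wherever your yarn local dir actually lives:

{noformat}
# Compare the pid file's timestamp against the NM log entries above.
ls -l --full-time /path/to/nm-local-dir/nmPrivate/<application_id>/<container_id>/<container_id>.pid
{noformat}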

You mentioned NM recovery is enabled.  Does this only occur on containers that 
were recovered on NM startup or also for containers that are started and killed 
within the same NM session?


> Containers stuck in KILLING state
> -
>
> Key: YARN-4549
> URL: https://issues.apache.org/jira/browse/YARN-4549
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Danil Serdyuchenko
>
> We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the 
> container executor and cgroups configured. We also have NM recovery enabled.
> We observe a lot of containers that get stuck in the KILLING state after the 
> NM tries to kill them. The containers remain running indefinitely, which 
> causes some duplication as new containers are brought up to replace them. 
> Looking through the logs, the NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping 
> container with container Id: container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user 
> IP=10.51.111.243    OPERATION=Stop Container Request
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1448454866800_0023
> CONTAINERID=container_1448454866800_0023_01_05
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container 
> container_1448454866800_0023_01_05 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container 
> container_1448454866800_0023_01_05
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for 
> container_1448454866800_0023_01_05. Waited for 2000 ms.
> {noformat}
> The PID files for each container seem to be present on the node. We weren't 
> able to consistently replicate this and are hoping that someone has come 
> across this before.


