[
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939791#comment-14939791
]
Jason Lowe commented on YARN-4216:
----------------------------------
The problem is that the NM thinks it is being torn down _not_ for a restart and
is trying to clean up. From the NM log:
{noformat}
2015-10-01 14:58:40,688 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15:
SIGTERM
2015-10-01 14:58:40,720 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Successfully
Unregistered the Node localhost:38153 with ResourceManager.
2015-10-01 14:58:40,731 INFO org.mortbay.log: Stopped
[email protected]:8042
2015-10-01 14:58:40,836 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Applications still running : [application_1443685464627_0007]
2015-10-01 14:58:40,836 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Waiting for Applications to be Finished
2015-10-01 14:58:40,837 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
Application application_1443685464627_0007 transitioned from RUNNING to
FINISHING_CONTAINERS_WAIT
2015-10-01 14:58:40,837 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_1443685464627_0007_01_000014 transitioned from RUNNING to
KILLING
2015-10-01 14:58:40,837 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_1443685464627_0007_01_000001 transitioned from RUNNING to
KILLING
{noformat}
For a proper recovery the NM should not be trying to kill containers. Part of
the issue here is having the NM distinguish a shutdown that will be followed by
a restart from one that won't. In the former case it should _not_ kill
containers, since the restart will recover them; in the latter case it _should_
kill containers, since there won't be an NM around later to control them. See
YARN-1362 for more details.
Does this problem occur if you stop the NM with kill -9 before restarting?
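As a point of reference, below is a minimal yarn-site.xml sketch of the NM
recovery settings in play. yarn.nodemanager.recovery.enabled and
yarn.nodemanager.recovery.dir are the standard work-preserving restart knobs;
the supervised property is included only as an assumption of how a
will-be-restarted shutdown could be signalled and may not be available in the
Hadoop version used here, and the state-store path is just an example.
{noformat}
<configuration>
  <!-- Enable work-preserving NM recovery -->
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>

  <!-- Example path: local directory where the NM persists its recovery state -->
  <property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value>/var/lib/hadoop-yarn/nm-recovery</value>
  </property>

  <!-- Assumption (version-dependent): tells the NM it runs under supervision
       and will be restarted, so it should not kill containers on shutdown -->
  <property>
    <name>yarn.nodemanager.recovery.supervised</name>
    <value>true</value>
  </property>
</configuration>
{noformat}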
> Container logs not shown for newly assigned containers after NM recovery
> --------------------------------------------------------------------------
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation, nodemanager
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit pi job with 20 maps
> # Once 5 maps are completed on NM1, stop the NM (yarn daemon stop nodemanager)
> (logs of all completed containers get aggregated to HDFS)
> # Now start the NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during the stop, they are uploaded with the
> name (localhost_38153)
> # On log aggregation after starting the NM, the newly assigned container logs
> get uploaded with the name (localhost_38153.tmp)
> In the history server, the logs are not shown for the new task attempts