[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005159#comment-15005159 ]

Hadoop QA commented on YARN-4216:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 5s | docker + precommit patch detected. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 2m 56s | trunk passed |
| +1 | compile | 0m 51s | trunk passed with JDK v1.8.0_60 |
| +1 | compile | 0m 46s | trunk passed with JDK v1.7.0_79 |
| +1 | checkstyle | 0m 28s | trunk passed |
| +1 | mvnsite | 0m 52s | trunk passed |
| +1 | mvneclipse | 0m 26s | trunk passed |
| -1 | findbugs | 1m 18s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in trunk has 3 extant Findbugs warnings. |
| +1 | javadoc | 0m 47s | trunk passed with JDK v1.8.0_60 |
| +1 | javadoc | 0m 54s | trunk passed with JDK v1.7.0_79 |
| +1 | mvninstall | 0m 50s | the patch passed |
| +1 | compile | 0m 52s | the patch passed with JDK v1.8.0_60 |
| +1 | javac | 0m 52s | the patch passed |
| +1 | compile | 0m 48s | the patch passed with JDK v1.7.0_79 |
| +1 | javac | 0m 48s | the patch passed |
| -1 | checkstyle | 0m 25s | Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 33, now 33). |
| +1 | mvnsite | 0m 53s | the patch passed |
| +1 | mvneclipse | 0m 26s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 2m 29s | the patch passed |
| +1 | javadoc | 0m 45s | the patch passed with JDK v1.8.0_60 |
| +1 | javadoc | 0m 54s | the patch passed with JDK v1.7.0_79 |
| +1 | unit | 1m 49s | hadoop-yarn-common in the patch passed with JDK v1.8.0_60. |
| +1 | unit | 8m 34s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_60. |
| +1 | unit | 2m 2s | hadoop-yarn-common in the patch passed with JDK v1.7.0_79. |
| +1 | unit | 8m 59s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_79. |
| +1 | asflicense | 0m 23s | Patch does not generate ASF License warnings. |
| | | 41m 45s | |

|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-14 |
| JIRA Patch U
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945020#comment-14945020 ]

Jason Lowe commented on YARN-4216:
----------------------------------

If we're decommissioning a node then we're not doing a rolling upgrade of it. Decomm of a node should kill all of the containers on the node, upload the logs, then shut down the node. That's not a rolling upgrade since we lose work. It may be rolling in the sense that we can go through the nodes in a serial fashion, but since work is being lost at each step it's significantly different from the rolling upgrade with work-preserving restart.

What we're talking about here is reinsertion of a previously decomm'd node that ends up running containers for an application that already had logs aggregated, which is slightly different from the JIRA title, which implies work-preserving restart. Having the NM append the new logs would be a reasonable approach to try to avoid log loss, although there's the problem of active readers for the logs. If we're appending then we can end up with partially written logs at the end when readers come along to parse the logs. We'd either have to live with that possibility or have the NM copy the existing logs to the .tmp file before appending the new logs, then atomically replace the previous logs with the new version. Not all filesystems support atomic replace, but HDFS can do it.
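The copy-then-append-then-replace idea in the comment above could be sketched roughly as follows. This is only an illustration on the local filesystem using `java.nio.file`, not the NodeManager's aggregation code and not the HDFS API; the class name `AtomicLogAppendSketch` and the method `appendAtomically` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper; names are illustrative and not taken from Hadoop. */
final class AtomicLogAppendSketch {

    /**
     * Adds newLogs to the aggregated file without letting readers see a
     * partially written file: copy to "<name>.tmp", append there, then rename.
     */
    static void appendAtomically(Path aggregated, byte[] newLogs) throws IOException {
        Path tmp = aggregated.resolveSibling(aggregated.getFileName() + ".tmp");

        // 1. Seed the temporary file with the previously aggregated logs, if any.
        if (Files.exists(aggregated)) {
            Files.copy(aggregated, tmp, StandardCopyOption.REPLACE_EXISTING);
        } else {
            Files.deleteIfExists(tmp);
        }

        // 2. Append the new container logs to the temporary copy only.
        Files.write(tmp, newLogs, StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // 3. Swap the finished file into place in a single rename. Whether an
        //    existing target is replaced by ATOMIC_MOVE is implementation
        //    specific; POSIX rename semantics do replace it.
        Files.move(tmp, aggregated, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("aggregated-logs");
        Path nodeFile = dir.resolve("localhost_38153");
        appendAtomically(nodeFile, "logs before stop\n".getBytes(StandardCharsets.UTF_8));
        appendAtomically(nodeFile, "logs after restart\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(Files.readAllBytes(nodeFile), StandardCharsets.UTF_8));
    }
}
```

A reader that opens the node file at any point sees either the old complete file or the new complete file, at the cost of rewriting the old logs on every upload.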
> Container logs not shown for newly assigned containers after NM recovery
> ------------------------------------------------------------------------
>
>                 Key: YARN-4216
>                 URL: https://issues.apache.org/jira/browse/YARN-4216
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, nodemanager
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit pi job with 20 maps
> # Once 5 maps have completed on NM1, stop the NM (yarn daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop, the file is named (localhost_38153)
> # On log aggregation after starting the NM, the newly assigned container logs get uploaded with the name (localhost_38153.tmp)
> In the history server the logs are not shown for the new task attempts

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944524#comment-14944524 ]

Bibin A Chundatt commented on YARN-4216:
----------------------------------------

{quote}
That is intentional. Decommission + nm restart doesn't make sense to me. Either we are decommissioning a node and don't expect it to return, or we are going to restart it and expect it to return shortly.
{quote}
For a *rolling upgrade* the same scenario can happen *(decommission (logs upload) --> upgrade --> start NM --> new container assignment --> on finish log upload)* and container logs are lost. Appending logs during aggregation could be one solution in this case, right?
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943781#comment-14943781 ]

Jason Lowe commented on YARN-4216:
----------------------------------

The container logs should not be uploaded on NM stop if we are doing recovery. That is intentional. Decommission + nm restart doesn't make sense to me. Either we are decommissioning a node and don't expect it to return, or we are going to restart it and expect it to return shortly. For the former, we want the NM to linger a bit to try to finish log aggregation. For the latter it should not.

If we are decommissioning the node then context.getDecommissioned() in the boolean clause above should be true, which means shouldAbort would be false. That means it should not do the same thing as a shutdown under supervision. My apologies if I'm missing something.
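To make the argument about the boolean clause easier to check, here is a stand-alone restatement of the `shouldAbort` expression quoted from `LogAggregationService#stopAggregators` elsewhere in this thread. The class `StopAggregatorsSketch`, the enum `Action`, and the method `onStop` are hypothetical names used only for this sketch; the three flags stand in for `canRecover()`, `getDecommissioned()`, and the `yarn.nodemanager.recovery.supervised` setting.

```java
/** Stand-alone restatement of the quoted clause; not the NodeManager code itself. */
final class StopAggregatorsSketch {

    enum Action { ABORT, FINISH }

    /** Mirrors: canRecover && !decommissioned && supervised. */
    static Action onStop(boolean canRecover, boolean decommissioned, boolean supervised) {
        boolean shouldAbort = canRecover && !decommissioned && supervised;
        return shouldAbort ? Action.ABORT : Action.FINISH;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        // Print the full truth table so each combination can be read off directly.
        for (boolean canRecover : flags) {
            for (boolean decommissioned : flags) {
                for (boolean supervised : flags) {
                    System.out.printf("canRecover=%-5b decommissioned=%-5b supervised=%-5b -> %s%n",
                        canRecover, decommissioned, supervised,
                        onStop(canRecover, decommissioned, supervised));
                }
            }
        }
    }
}
```

Only one of the eight combinations aborts aggregation: recovery available, node not decommissioned, and supervision enabled. Any decommissioned node falls through to `FINISH`, which matches the point that a decommission is not treated like a supervised shutdown.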
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943574#comment-14943574 ]

Bibin A Chundatt commented on YARN-4216:
----------------------------------------

When yarn.nodemanager.recovery.supervised=true and the nodemanager is stopped, abort aggregation is called:

2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: *Aborting log aggregation for application_1444056058955_0002*

{noformat}
2015-10-05 20:17:20,634 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Aborting log aggregation for application_1444056058955_0002
2015-10-05 20:17:20,634 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Aggregation did not complete for application application_1444056058955_0002
2015-10-05 20:17:20,639 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2015-10-05 20:17:20,664 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-10-05 20:17:20,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2015-10-05 20:17:20,665 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
2015-10-05 20:17:20,671 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2015-10-05 20:17:20,674 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
{noformat}

Container logs are not cleaned up and uploaded to HDFS on stop.

But decommission + nm restart while the application is running should cause the same missing-log scenario, as per {{LogAggregationService#stopAggregators}}:

{code}
    boolean supervised = getConfig().getBoolean(
        YarnConfiguration.NM_RECOVERY_SUPERVISED,
        YarnConfiguration.DEFAULT_NM_RECOVERY_SUPERVISED);
    // if recovery on restart is supported then leave outstanding aggregations
    // to the next restart
    boolean shouldAbort = context.getNMStateStore().canRecover()
        && !context.getDecommissioned() && supervised;
    // politely ask to finish
    for (AppLogAggregator aggregator : appLogAggregators.values()) {
      if (shouldAbort) {
        aggregator.abortLogAggregation();
      } else {
        aggregator.finishLogAggregation();
      }
    }
{code}
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943368#comment-14943368 ]

Jason Lowe commented on YARN-4216:
----------------------------------

Yes, the document should be updated to cover that property. Did you try setting that property to true, and does it solve your issue?
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943124#comment-14943124 ]

Bibin A Chundatt commented on YARN-4216:
----------------------------------------

The [document|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html] doesn't mention *yarn.nodemanager.recovery.supervised*. Should I update the doc?
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943113#comment-14943113 ]

Bibin A Chundatt commented on YARN-4216:
----------------------------------------

{quote}
That's why YARN-1362 was done, so we can explicitly tell the nodemanager whether or not the NM is under supervision and likely to restart.
{quote}
*yarn.nodemanager.recovery.supervised=false* in my current setup. As I understand from the comment above, in this case I am supposed to set *yarn.nodemanager.recovery.supervised* to true to tell the NM that the restart is under supervision.

[~jlowe] so should I close this JIRA?
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939931#comment-14939931 ]

Jason Lowe commented on YARN-4216:
----------------------------------

Not necessarily. A nodemanager could also be shutting down due to an uncaught exception, crash, etc., or an admin could be shutting down a nodemanager without an intention of restarting it. That's why YARN-1362 was done, so we can explicitly tell the nodemanager whether or not the NM is under supervision and likely to restart. If the NM is not under supervision then kill -9 should be used for the restart scenario and yarn --daemon stop nodemanager used for shutting it down.
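The last sentence of the comment above amounts to a small lookup for the unsupervised case. The sketch below only restates that sentence; the class `UnsupervisedStopSketch`, the enum `Intent`, and the `<nm-pid>` placeholder are hypothetical, and the supervised case is deliberately left out because the comment does not spell it out.

```java
/** Hypothetical restatement of the advice for an NM that is NOT under supervision. */
final class UnsupervisedStopSketch {

    enum Intent { RESTART, SHUT_DOWN }

    /** Returns the way to stop an unsupervised NM for the given intent. */
    static String howToStop(Intent intent) {
        switch (intent) {
            case RESTART:
                // Hard kill: the NM gets no chance to clean up, so a restart can recover.
                return "kill -9 <nm-pid>";
            case SHUT_DOWN:
                // Graceful stop: the NM cleans up because it does not expect to return.
                return "yarn --daemon stop nodemanager";
            default:
                throw new IllegalArgumentException("unknown intent: " + intent);
        }
    }

    public static void main(String[] args) {
        for (Intent intent : Intent.values()) {
            System.out.println(intent + " -> " + howToStop(intent));
        }
    }
}
```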
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939924#comment-14939924 ]

Bibin A Chundatt commented on YARN-4216:
----------------------------------------

So {{yarn --daemon stop nodemanager}} should be considered a recoverable case, right, and the logs should be uploaded on stop only when the NM is being decommissioned?
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939839#comment-14939839 ]

Jason Lowe commented on YARN-4216:
----------------------------------

Then I think this is largely a problem of the NM thinking it is being shut down for a non-recovery case.
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939830#comment-14939830 ]

Bibin A Chundatt commented on YARN-4216:
----------------------------------------

[~jlowe] The problem doesn't happen when the NM is killed using kill -9 before restarting.
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939791#comment-14939791 ]

Jason Lowe commented on YARN-4216:
----------------------------------

The problem is the NM is thinking it is being torn down _not_ for a restart and is trying to clean up. From the NM log:

{noformat}
2015-10-01 14:58:40,688 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
2015-10-01 14:58:40,720 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Successfully Unregistered the Node localhost:38153 with ResourceManager.
2015-10-01 14:58:40,731 INFO org.mortbay.log: Stopped SelectChannelConnector@127.0.0.1:8042
2015-10-01 14:58:40,836 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Applications still running : [application_1443685464627_0007]
2015-10-01 14:58:40,836 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Waiting for Applications to be Finished
2015-10-01 14:58:40,837 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1443685464627_0007 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
2015-10-01 14:58:40,837 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1443685464627_0007_01_14 transitioned from RUNNING to KILLING
2015-10-01 14:58:40,837 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1443685464627_0007_01_01 transitioned from RUNNING to KILLING
{noformat}

For a proper recovery the NM should not be trying to kill containers. Part of the issue here is having the NM distinguish a shutdown that will be restarted from a shutdown that won't be restarted. In the former it should _not_ kill containers since the restart will recover them. For the latter it _should_ kill containers since there won't be an NM around later to control them. See YARN-1362 for more details.

Does this problem occur if you stop the NM with kill -9 before restarting?