[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-11-13 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005159#comment-15005159 ]

Hadoop QA commented on YARN-4216:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 5s {color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 56s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 26s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 18s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 52s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 52s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 48s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 25s {color} | {color:red} Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 33, now 33). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 49s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 34s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 2s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 59s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 41m 45s {color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-14 |
| JIRA Patch U

[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-06 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945020#comment-14945020 ]

Jason Lowe commented on YARN-4216:
--

If we're decommissioning a node then we're not doing a rolling upgrade of it. Decommissioning a node should kill all of the containers on the node, upload the logs, then shut down the node. That's not a rolling upgrade, since we lose work. It may be rolling in the sense that we can go through the nodes in serial fashion, but since work is lost at each step it's significantly different from a rolling upgrade with work-preserving restart.

What we're talking about here is reinsertion of a previously decommissioned node that ends up running containers for an application that already had logs aggregated, which is slightly different from what the JIRA title implies (work-preserving restart). Having the NM append the new logs would be a reasonable approach to avoid log loss, although there's the problem of active readers of the logs. If we're appending, readers that come along to parse the logs can see them partially written. We'd either have to live with that possibility or have the NM copy the existing logs to the .tmp file before appending the new logs, then atomically replace the previous logs with the new version. Not all filesystems support atomic replace, but HDFS can do it.
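To make the second option concrete, here is a minimal sketch of the copy-then-replace pattern. The class and method names and the paths are illustrative, and it glosses over the fact that real aggregated logs are TFiles rather than raw byte streams; the only point is the .tmp staging plus atomic rename, which {{FileContext#rename}} with {{Options.Rename.OVERWRITE}} provides on HDFS:

{code}
import java.io.IOException;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class LogReplaceSketch {
  // Copy the previously aggregated log into a .tmp file, append the new
  // containers' logs, then atomically swap the .tmp file into place so
  // readers never observe a partially written file.
  static void appendViaTmp(FileContext fc, Path finalLog, Path newLogs,
      Configuration conf) throws IOException {
    Path tmp = finalLog.suffix(".tmp");
    try (FSDataOutputStream out =
        fc.create(tmp, EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE))) {
      try (FSDataInputStream in = fc.open(finalLog)) {
        IOUtils.copyBytes(in, out, conf, false); // existing logs first
      }
      try (FSDataInputStream in = fc.open(newLogs)) {
        IOUtils.copyBytes(in, out, conf, false); // then the new logs
      }
    }
    // Atomic replace: HDFS supports rename-with-overwrite; not all
    // filesystems do.
    fc.rename(tmp, finalLog, Options.Rename.OVERWRITE);
  }
}
{code}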

> Container logs not shown for newly assigned containers after NM recovery
> -------------------------------------------------------------------------
>
>                 Key: YARN-4216
>                 URL: https://issues.apache.org/jira/browse/YARN-4216
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, nodemanager
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit a pi job with 20 maps
> # Once 5 maps have completed on NM1, stop the NM (yarn daemon stop nodemanager)
> (logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop, the file is named (localhost_38153)
> # On log aggregation after starting the NM, the newly assigned container logs get uploaded with the name (localhost_38153.tmp)
> In the history server, the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944524#comment-14944524 ]

Bibin A Chundatt commented on YARN-4216:


{quote}
That is intentional. Decommission + nm restart doesn't make sense to me. Either we are decommissioning a node and don't expect it to return, or we are going to restart it and expect it to return shortly.
{quote}
For a *rolling upgrade* the same scenario can happen *( decommission (logs upload) --> upgrade --> start NM --> new container assignment --> on finish, log upload )* and container log loss happens. Appending logs during aggregation could be one solution in this case, right?



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943781#comment-14943781 ]

Jason Lowe commented on YARN-4216:
--

The container logs should not be uploaded on NM stop if we are doing recovery. That is intentional. Decommission + nm restart doesn't make sense to me. Either we are decommissioning a node and don't expect it to return, or we are going to restart it and expect it to return shortly. For the former, we want the NM to linger a bit to try to finish log aggregation. For the latter it should not.

If we are decommissioning the node, then context.getDecommissioned() in the boolean clause above should be true, which means shouldAbort would be false. That means it should not do the same thing as a shutdown under supervision. My apologies if I'm missing something.
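Tracing the decommission case through that clause (the full snippet from {{LogAggregationService#stopAggregators}} appears in a comment further down this thread; the values here are only illustrative, to follow the logic):

{code}
// Illustrative values for the decommission case:
boolean canRecover = true;      // context.getNMStateStore().canRecover()
boolean decommissioned = true;  // context.getDecommissioned()
boolean supervised = true;      // yarn.nodemanager.recovery.supervised

boolean shouldAbort = canRecover && !decommissioned && supervised;
// !decommissioned == false, so shouldAbort == false and the NM calls
// finishLogAggregation() rather than abortLogAggregation().
{code}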



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943574#comment-14943574 ]

Bibin A Chundatt commented on YARN-4216:


When yarn.nodemanager.recovery.supervised=true and the nodemanager is stopped, abort aggregation is called:

2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: *Aborting log aggregation for application_1444056058955_0002*
{noformat}
2015-10-05 20:17:20,634 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Aborting log aggregation for application_1444056058955_0002
2015-10-05 20:17:20,634 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Aggregation did not complete for application application_1444056058955_0002
2015-10-05 20:17:20,639 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2015-10-05 20:17:20,664 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-10-05 20:17:20,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2015-10-05 20:17:20,665 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
2015-10-05 20:17:20,671 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2015-10-05 20:17:20,674 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
{noformat}

Container logs are not cleaned up or uploaded to HDFS on stop.

But decommission + NM restart while an application is running should cause the same log-missing scenario, as per {{LogAggregationService#stopAggregators}}:

{code}
boolean supervised = getConfig().getBoolean(
    YarnConfiguration.NM_RECOVERY_SUPERVISED,
    YarnConfiguration.DEFAULT_NM_RECOVERY_SUPERVISED);
// if recovery on restart is supported then leave outstanding aggregations
// to the next restart
boolean shouldAbort = context.getNMStateStore().canRecover()
    && !context.getDecommissioned() && supervised;
// politely ask to finish
for (AppLogAggregator aggregator : appLogAggregators.values()) {
  if (shouldAbort) {
    aggregator.abortLogAggregation();
  } else {
    aggregator.finishLogAggregation();
  }
}
{code}



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943368#comment-14943368 ]

Jason Lowe commented on YARN-4216:
--

Yes, the document should be updated to cover that property.  Did you try 
setting that property to true, and does it solve your issue?



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943124#comment-14943124 ]

Bibin A Chundatt commented on YARN-4216:


The [document|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html] doesn't mention *yarn.nodemanager.recovery.supervised*. Should I update the doc?



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943113#comment-14943113 ]

Bibin A Chundatt commented on YARN-4216:


{quote}
That's why YARN-1362 was done, so we can explicitly tell the nodemanager 
whether or not the NM is under supervision and likely to restart.
{quote}
*yarn.nodemanager.recovery.supervised=false* in my current setup.
In this case, as I understand from the above comment, I am supposed to set *yarn.nodemanager.recovery.supervised* to true to inform the NM that the restart is under supervision.
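For reference, a minimal yarn-site.xml sketch with the flag set (the recovery dir path is illustrative; only the supervised flag is the point here):

{code:xml}
<!-- Minimal sketch; the recovery dir path below is illustrative. -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/hadoop/yarn/nm-recovery</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.supervised</name>
  <value>true</value>
</property>
{code}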
 
[~jlowe], so should I close this JIRA?



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939931#comment-14939931 ]

Jason Lowe commented on YARN-4216:
--

Not necessarily. A nodemanager could also be shutting down due to an uncaught exception, a crash, etc., or an admin could be shutting down a nodemanager with no intention of restarting it. That's why YARN-1362 was done, so we can explicitly tell the nodemanager whether or not it is under supervision and likely to restart. If the NM is not under supervision, then kill -9 should be used for the restart scenario and yarn --daemon stop nodemanager for shutting it down.



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939924#comment-14939924 ]

Bibin A Chundatt commented on YARN-4216:


So {{yarn --daemon stop nodemanager}} should be considered a recoverable case, right? And the logs should only be uploaded on stop when the NM is being decommissioned?



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939839#comment-14939839 ]

Jason Lowe commented on YARN-4216:
--

Then I think this is largely a problem of the NM thinking it is being shut down for a non-recovery case.



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939830#comment-14939830 ]

Bibin A Chundatt commented on YARN-4216:


[~jlowe]
The problem doesn't happen when the NM is killed using kill -9 before restarting.



[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939791#comment-14939791 ]

Jason Lowe commented on YARN-4216:
--

The problem is that the NM thinks it is being torn down _not_ for a restart, and is trying to clean up.  From the NM log:
{noformat}
2015-10-01 14:58:40,688 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
2015-10-01 14:58:40,720 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Successfully Unregistered the Node localhost:38153 with ResourceManager.
2015-10-01 14:58:40,731 INFO org.mortbay.log: Stopped SelectChannelConnector@127.0.0.1:8042
2015-10-01 14:58:40,836 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Applications still running : [application_1443685464627_0007]
2015-10-01 14:58:40,836 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Waiting for Applications to be Finished
2015-10-01 14:58:40,837 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1443685464627_0007 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
2015-10-01 14:58:40,837 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1443685464627_0007_01_14 transitioned from RUNNING to KILLING
2015-10-01 14:58:40,837 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1443685464627_0007_01_01 transitioned from RUNNING to KILLING
{noformat}

For a proper recovery the NM should not be trying to kill containers. Part of the issue here is having the NM distinguish a shutdown that will be followed by a restart from one that won't. In the former it should _not_ kill containers, since the restart will recover them. In the latter it _should_ kill containers, since there won't be an NM around later to control them. See YARN-1362 for more details.

Does this problem occur if you stop the NM with kill -9 before restarting?
