[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268546#comment-14268546
 ] 

Jian He commented on YARN-2997:
---

Ah, I think the problem that container statuses whose applications are stopped 
may be lost on NM resync existed before this patch. Thanks for your clarification. One 
minor comment on {{LinkedHashMap<ContainerId, ContainerStatus>()}}: a regular 
HashMap should be enough instead of a LinkedHashMap?

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, 
 YARN-2997.patch


 We have seen in RM log a lot of
 {quote}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {quote}
 It is caused by NM sending completed containers repeatedly until the app is 
 finished. On the RM side, the container is already released, hence 
 {{getRMContainer}} returns null.
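
The RM-side symptom can be illustrated with a small sketch. The class and method names below ({{LiveContainerRegistry}}, {{allocate}}, {{containerCompleted}}) are hypothetical stand-ins and not the FairScheduler code; the only behaviour taken from the report is that the lookup for an already released container returns null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the scheduler's bookkeeping; not the real FairScheduler.
class LiveContainerRegistry {
  // container ID -> application ID, for containers the RM still tracks
  private final Map<String, String> liveContainers = new HashMap<>();
  int nullCompletions = 0;

  void allocate(String containerId, String appId) {
    liveContainers.put(containerId, appId);
  }

  // Plays the role of getRMContainer: null once the container has been released.
  String getRMContainer(String containerId) {
    return liveContainers.get(containerId);
  }

  // Called for every completed-container status found in an NM heartbeat.
  void containerCompleted(String containerId) {
    if (getRMContainer(containerId) == null) {
      // The branch that would log "Null container completed...".
      nullCompletions++;
      return;
    }
    liveContainers.remove(containerId);
  }
}
```

If the same completed container is reported in three heartbeats, the first report releases it and the other two hit the null branch, which matches the log noise described above.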



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-07 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268670#comment-14268670
 ] 

Chengbing Liu commented on YARN-2997:
-

Yes, a HashMap should be enough. I will upload a new one. Thanks.



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268714#comment-14268714
 ] 

Hadoop QA commented on YARN-2997:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12690687/YARN-2997.5.patch
  against trunk revision ef237bd.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6276//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6276//console

This message is automatically generated.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, 
 YARN-2997.5.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265865#comment-14265865
 ] 

Hadoop QA commented on YARN-2997:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12690286/YARN-2997.4.patch
  against trunk revision 4cd66f7.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6252//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6252//console

This message is automatically generated.



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-06 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267210#comment-14267210
 ] 

Chengbing Liu commented on YARN-2997:
-

On receiving a RESYNC, the NM calls {{getNMContainerStatuses}}, which loops over 
all containers in the NM context, removes those whose app is not in the NM 
context, and finally reports the rest to the RM. The method {{getNMContainerStatuses}} is 
unchanged by this patch. The logic for removing containers from the 
context is also unchanged.

From a different viewpoint, {{pendingCompletedContainers}} contains the 
following:
* completed containers, whose app is stopped, and the container is removed from 
the NM context.
* completed containers, whose app is NOT stopped (which implies their apps are 
in the NM context), and the container is NOT removed from the NM context.

The first kind will not be reported to the RM: they are no longer in the NM context, 
so the loop never visits them.
The second kind will be reported to the RM, since they are in the NM context and 
so are their apps.

Finally, the changes in this patch can be summarized as follows:
* Do not send finished container statuses repeatedly for a running application.
* Send completed container statuses again in case of a lost heartbeat (normal 
heartbeat, not RESYNC).

I hope this will clarify your doubts.
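
The RESYNC behaviour described at the start of this comment can be sketched as follows. This is a simplified model under stated assumptions, not the NodeManager code: containers and applications are plain strings, and {{ResyncReportSketch}} and {{collectForResync}} are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the NM context: container ID -> owning application ID.
class ResyncReportSketch {
  // Returns the container IDs to report on RESYNC and drops the rest from the context.
  static List<String> collectForResync(Map<String, String> containers,
                                       Set<String> appsInContext) {
    List<String> report = new ArrayList<>();
    Iterator<Map.Entry<String, String>> iter = containers.entrySet().iterator();
    while (iter.hasNext()) {
      Map.Entry<String, String> entry = iter.next();
      if (appsInContext.contains(entry.getValue())) {
        // App still known to the NM: the container status is reported.
        report.add(entry.getKey());
      } else {
        // App no longer in the context: remove the container instead of reporting it.
        iter.remove();
      }
    }
    return report;
  }
}
```

In this model a completed container whose app has already been removed never reaches the report, which corresponds to the first kind listed above.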



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-06 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266500#comment-14266500
 ] 

Jian He commented on YARN-2997:
---

Thanks for updating. 
bq. also in RESYNC section to clear the cache so that these outdated container 
statuses will not be reported to the restarted RM.
Could you explain in more detail why this was added? I think that before this patch these 
container statuses would be sent to the restarted RM; after the patch, they won't be.



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-05 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265572#comment-14265572
 ] 

Chengbing Liu commented on YARN-2997:
-

{quote}
I think this is not possible given that we are looping 
this.context.getContainers() which is based on containerId to Container map. Or 
we can just use a list.
{quote}
We are looping over {{context.getContainers()}}, plus any statuses left over from 
the previous heartbeat (in case of a lost heartbeat). If a previously 
completed container has had its status changed somehow, two 
different ContainerStatus objects with the same ID would be reported. That's why I use a map, and 
use {{pendingCompletedContainers.put(containerId, containerStatus)}} instead of 
{{containerStatuses.add(containerStatus)}} directly, in order to prevent such 
duplicates.
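
A minimal sketch of this carry-over idea, with statuses reduced to strings. The class name, {{buildHeartbeat}} and {{onHeartbeatAcknowledged}} are hypothetical, and clearing the map on an acknowledged heartbeat is an assumption of the sketch; where the patch actually clears it is what this thread is discussing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PendingCompletedSketch {
  // Completed statuses not yet acknowledged by the RM, keyed by container ID.
  private final Map<String, String> pendingCompletedContainers = new HashMap<>();

  // Builds the completed-status list for one heartbeat.
  List<String> buildHeartbeat(Map<String, String> completedInContext) {
    for (Map.Entry<String, String> e : completedInContext.entrySet()) {
      // put() replaces an older status for the same ID instead of adding a second one.
      pendingCompletedContainers.put(e.getKey(), e.getValue());
    }
    return new ArrayList<>(pendingCompletedContainers.values());
  }

  // Assumed hook: invoked only when a heartbeat response has been received.
  void onHeartbeatAcknowledged() {
    pendingCompletedContainers.clear();
  }
}
```

If a heartbeat is lost, {{onHeartbeatAcknowledged}} is never called, so the next heartbeat carries the earlier entries again, with at most one status per container ID.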
{quote}
then we should send the pendingCompletedContainers in getNMContainerStatuses 
method too
{quote}
We may not need to change {{getNMContainerStatuses}}, as it will send all 
container statuses in NM context, except the containers whose application is 
not in NM context. I think that will cover all elements in 
{{pendingCompletedContainers}}. And lost heartbeat is not a problem with 
{{getNMContainerStatuses}}.
{quote}
or we can just put it at the last line of 
removeOrTrackCompletedContainersFromContext so as to avoid the newly added 
method. 
{quote}
That's a good idea. I will change this in the next patch. Thanks for your 
advice!

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-05 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265663#comment-14265663
 ] 

Chengbing Liu commented on YARN-2997:
-

[~jianhe] Can we perhaps deal with {{getNMContainerStatuses}} issue in another 
JIRA? This one has not changed anything for RM restart yet. If so, the only 
thing left is the {{pendingCompletedContainers.clear()}} thing. What do you 
think?

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265637#comment-14265637
 ] 

Jian He commented on YARN-2997:
---

bq. except the containers whose application is not in NM context.
I think we should send containers whose application is not in NMContext too for 
recovery.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264975#comment-14264975
 ] 

Jian He commented on YARN-2997:
---

bq. we may run into a situation where we report two different ContainerStatus 
with same ID
I think this is not possible given that we are looping 
{{this.context.getContainers()}} which is based on containerId to Container 
map. Or we can just use a list.
bq. So in my opinion, that could be a potential leak.
I see. Then we should send the pendingCompletedContainers in the 
getNMContainerStatuses method too. Also, {{pendingCompletedContainers.clear()}} 
should be put after {{if (response.getNodeAction() == NodeAction.RESYNC)}}, 
or we can just put it at the last line of 
removeOrTrackCompletedContainersFromContext so as to avoid the newly added 
method. 

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263792#comment-14263792
 ] 

Hadoop QA commented on YARN-2997:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689964/YARN-2997.3.patch
  against trunk revision 947578c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6238//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6238//console

This message is automatically generated.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-03 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263656#comment-14263656
 ] 

Jian He commented on YARN-2997:
---

I think we can simplify the logic in getContainerStatuses as such:
{code}
  if (containerStatus.getState() == ContainerState.COMPLETE) {
    if (!isContainerRecentlyStopped(containerId)) {
      addCompletedContainer(containerId);
      containerStatuses.add(containerStatus);
    }
  } else {
    containerStatuses.add(containerStatus);
  }
{code}
bq. I didn't see an equals method defined in the abstract class
The subclass has the {{equals}} method.
bq.  So I guess we have to keep it
That's a limitation of the test; we should fix the tests.
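
For illustration, the simplified branch above can be exercised in a self-contained form. The method names {{isContainerRecentlyStopped}} and {{addCompletedContainer}} come from the snippet; their bodies here (a plain set with no expiry) and the string-based status representation are assumptions of this sketch, not the NodeManager implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RecentlyStoppedFilterSketch {
  // Assumed bookkeeping: IDs of completed containers that were already reported once.
  private final Set<String> recentlyStopped = new HashSet<>();

  boolean isContainerRecentlyStopped(String containerId) {
    return recentlyStopped.contains(containerId);
  }

  void addCompletedContainer(String containerId) {
    recentlyStopped.add(containerId);
  }

  // Each entry of 'current' is {containerId, state}; returns the IDs to report.
  List<String> getContainerStatuses(List<String[]> current) {
    List<String> containerStatuses = new ArrayList<>();
    for (String[] status : current) {
      String containerId = status[0];
      if ("COMPLETE".equals(status[1])) {
        // A completed container is reported only the first time it is seen.
        if (!isContainerRecentlyStopped(containerId)) {
          addCompletedContainer(containerId);
          containerStatuses.add(containerId);
        }
      } else {
        // Containers in any other state are reported on every heartbeat.
        containerStatuses.add(containerId);
      }
    }
    return containerStatuses;
  }
}
```

Note that in this form a completed status is never resent, even if the heartbeat carrying it is lost.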

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-03 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263747#comment-14263747
 ] 

Chengbing Liu commented on YARN-2997:
-

{quote}
I think we can simplify the logic in getContainerStatuses as such:
{quote}
It seems that if we do not remove the containers whose app is already stopped, 
we will rely on the heartbeat response from RM to remove containers acked by 
AM. If something goes wrong on the AM or RM side, the NM will never remove 
these containers from context. So in my opinion, that could be a potential leak.

{quote}
the sub class has the equal method.
{quote}
Yes, you are right. However, I'm still not sure if it is a good idea to use 
{{Set<ContainerStatus>}} instead of {{Map<ContainerId, ContainerStatus>}} for 
the following reasons:
* {{ContainerId}} is a unique identifier for a container, while 
{{ContainerStatus}} can change over time, even for the same container.
* We want to ensure that no duplicate container status is reported to the RM. 
{{ContainerStatus}} has not only a containerId, but also a container state, exit 
status and diagnostic message, so we may run into a situation where we report two 
different {{ContainerStatus}} objects with the same ID but different states or other fields.
* {{ContainerId}} has an {{equals}} method and is annotated as public and stable, 
while {{ContainerStatus}} has no {{equals}} method and 
{{ContainerStatusPBImpl}} is annotated as private and unstable. It may not be a 
good idea to rely on the implementation of {{ContainerStatus}}.
* {{Set<ContainerStatus>}} is never used in the current code base.
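
The difference between the two collections can be shown with a small sketch. {{StatusSketch}} is a hypothetical type that does not override {{equals}}; it is not the YARN {{ContainerStatus}} class.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical status type without equals()/hashCode(), so it uses identity equality.
class StatusSketch {
  final String containerId;
  final int exitStatus;

  StatusSketch(String containerId, int exitStatus) {
    this.containerId = containerId;
    this.exitStatus = exitStatus;
  }
}

class SetVersusMapSketch {
  // Number of entries kept when statuses are collected in a set.
  static int distinctInSet(StatusSketch... statuses) {
    Set<StatusSketch> set = new HashSet<>();
    for (StatusSketch s : statuses) {
      set.add(s);
    }
    return set.size();
  }

  // Number of entries kept when statuses are keyed by container ID.
  static int distinctInMap(StatusSketch... statuses) {
    Map<String, StatusSketch> map = new HashMap<>();
    for (StatusSketch s : statuses) {
      map.put(s.containerId, s); // a later status replaces an earlier one with the same ID
    }
    return map.size();
  }
}
```

Two statuses for the same container with different exit codes stay separate in the set but collapse to one entry in the map. A field-by-field {{equals}} would not change the set result in that case, since the two objects differ in a field.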

{quote}
that's limitation of the test, we should fix the tests.
{quote}
Yes, I see. I will fix them.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-02 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263236#comment-14263236
 ] 

Jian He commented on YARN-2997:
---

[~chengbing.liu], thanks for your explanation! The patch looks good overall; a few 
comments:
- for simplicity, we can use the addAll method for the for loop.
{code}
for (ContainerStatus containerStatus : pendingCompletedContainers.values()) {
  containerStatuses.add(containerStatus);
}
{code}
- pendingCompletedContainers: maybe use a set instead of a map?
- pendingCompletedContainers.remove(containerId); this line may not be needed, 
given that pendingCompletedContainers.clear() is invoked earlier.
- I found that pendingContainersToRemove potentially has a leak. We should probably 
add the following in the while loop of removeOrTrackCompletedContainersFromContext; 
would you mind fixing this too?
{code}
  if (nmContainer == null) {
    iter.remove();
  }
{code}
- could you add code comments on the modified test cases, so that people can 
reason more easily ? thx
{code}
  if (heartBeatID == 2) {
    Assert.assertEquals(statuses.size(), 4);
  } else {
    Assert.assertEquals(statuses.size(), 2);
  }
{code}
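
Two of the suggestions above, the {{addAll}} replacement and the {{iter.remove()}} guard, can be sketched in isolation. Containers and statuses are plain strings, {{PendingRemovalSketch}} and its method names are hypothetical, and only the null check from the comment is modelled; the real loop in removeOrTrackCompletedContainersFromContext has other branches that are not shown here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PendingRemovalSketch {
  // Drops pending IDs whose container is no longer in the context,
  // so that the pending set cannot keep growing with stale entries.
  static void pruneMissing(Set<String> pendingContainersToRemove,
                           Map<String, String> contextContainers) {
    Iterator<String> iter = pendingContainersToRemove.iterator();
    while (iter.hasNext()) {
      String containerId = iter.next();
      String nmContainer = contextContainers.get(containerId);
      if (nmContainer == null) {
        // Removing through the iterator is safe while iterating.
        iter.remove();
      }
    }
  }

  // addAll replaces the explicit for loop over pendingCompletedContainers.values().
  static List<String> copyStatuses(Map<String, String> pendingCompletedContainers) {
    List<String> containerStatuses = new ArrayList<>();
    containerStatuses.addAll(pendingCompletedContainers.values());
    return containerStatuses;
  }
}
```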

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-02 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263389#comment-14263389
 ] 

Chengbing Liu commented on YARN-2997:
-

{quote}
for simplicity, we can use the addAll method for the for loop. 
{quote}
Yes, I will change this.
{quote}
pendingCompletedContainers, maybe use a set instead of a map?
{quote}
I'm not sure if {{ContainerStatus}} can be compared, because I didn't see an 
{{equals}} method defined in the abstract class {{ContainerStatus}}.
{quote}
pendingCompletedContainers.remove(containerId); this line may be not needed
{quote}
I added this line in method {{removeOrTrackCompletedContainersFromContext}} 
after I discovered the method is called independently in the test 
{{testRemovePreviousCompletedContainersFromContext}}. The test first calls 
{{removeOrTrackCompletedContainersFromContext}} then {{getContainerStatuses}}, 
and expects the container status to be removed from the result. So I guess we 
have to keep it?
{quote}
I found pendingContainersToRemove potentially has a leak,
{quote}
Yes you are right, I will fix this and add comments for modified test cases.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2014-12-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262134#comment-14262134
 ] 

Hadoop QA commented on YARN-2997:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689672/YARN-2997.2.patch
  against trunk revision e7257ac.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6226//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6226//console

This message is automatically generated.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2014-12-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261526#comment-14261526
 ] 

Karthik Kambatla commented on YARN-2997:


With work-preserving restart, the NM is required to notify the RM repeatedly 
in case the RM goes down and loses this information. I propose we ignore the 
later updates, or add code to identify them as duplicates and then ignore them. 

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.patch




[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2014-12-30 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261682#comment-14261682
 ] 

Jian He commented on YARN-2997:
---

bq.  add code to identify them as duplicates and then ignore them.
I think they are already ignored now. Perhaps we should clarify the logging and 
move it to debug level.



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2014-12-30 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261792#comment-14261792
 ] 

Chengbing Liu commented on YARN-2997:
-

[~kasha] [~jianhe] Thanks for your review. I think NM will call 
{{getNMContainerStatuses()}} instead of {{getContainerStatuses()}} upon 
receiving RESYNC from restarted RM. {{getNMContainerStatuses()}} remains 
unchanged and still reports all the completed containers for non-completed 
apps. The uploaded patch will let the normal heartbeat (not after receiving 
RESYNC) send only useful container status information to RM. So I guess the 
work-preserving RM restart is not affected.



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2014-12-30 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261797#comment-14261797
 ] 

Jian He commented on YARN-2997:
---

bq. The uploaded patch will let the normal heartbeat
The intention was to let the NM remove containers from its context only after 
the RM acks that it has received these containers. More context is in YARN-1372.
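
The ack-based removal described here can be sketched roughly as follows. This is an illustration of the idea only, with hypothetical class and method names; it is not the code from YARN-1372.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: a simplified NM-side store of completed containers.
class AckBasedRemovalSketch {

    // Completed containers kept until the RM acknowledges them (id -> status).
    private final Map<String, String> completed = new LinkedHashMap<>();

    // Record a container that has just completed on this node.
    void containerCompleted(String containerId, String status) {
        completed.put(containerId, status);
    }

    // Everything still unacknowledged is a candidate for the next report.
    List<String> pendingReport() {
        return new ArrayList<>(completed.keySet());
    }

    // Remove a container only once the RM has acknowledged receiving it.
    void onAck(List<String> ackedContainerIds) {
        for (String id : ackedContainerIds) {
            completed.remove(id);
        }
    }
}
```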



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2014-12-30 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261825#comment-14261825
 ] 

Chengbing Liu commented on YARN-2997:
-

[~jianhe] The containers are not removed in this patch; they are just not 
reported to RM when all of the following conditions are met:
* The application is not finished
* The container was completed and was already in {{recentlyStoppedContainers}}
* It is a normal heartbeat with RM, not one after an RM restart

Note that the container is not removed from the NM context. In a resync with 
RM, these completed containers will still be reported for work-preserving 
recovery.
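
The three conditions above can be sketched as a single filter. This is a simplified model under stated assumptions: the class, the {{CompletedContainer}} type, and the {{resync}} flag are hypothetical stand-ins, and only the name {{recentlyStoppedContainers}} comes from the comment itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the NM side; not the patch itself.
class HeartbeatFilterSketch {

    // Minimal stand-in for a completed container and its application's state.
    static final class CompletedContainer {
        final String containerId;
        final boolean appFinished;

        CompletedContainer(String containerId, boolean appFinished) {
            this.containerId = containerId;
            this.appFinished = appFinished;
        }
    }

    // IDs of completed containers that have already been reported once.
    private final Set<String> recentlyStoppedContainers = new HashSet<>();

    // Returns the IDs to report. A container is skipped only when all three
    // conditions hold: app not finished, already reported, normal heartbeat.
    List<String> statusesToReport(List<CompletedContainer> completed, boolean resync) {
        List<String> toReport = new ArrayList<>();
        for (CompletedContainer c : completed) {
            boolean alreadyReported = recentlyStoppedContainers.contains(c.containerId);
            if (!c.appFinished && alreadyReported && !resync) {
                continue; // suppressed for this heartbeat; nothing is removed
            }
            toReport.add(c.containerId);
            recentlyStoppedContainers.add(c.containerId);
        }
        return toReport;
    }
}
```

In this sketch a completed container is reported on the first normal heartbeat, suppressed on later ones while its application is still running, and reported again when {{resync}} is true.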



[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2014-12-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14260882#comment-14260882
 ] 

Hadoop QA commented on YARN-2997:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689447/YARN-2997.patch
  against trunk revision 249cc90.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6215//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6215//console

This message is automatically generated.
