[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268714#comment-14268714 ]
Hadoop QA commented on YARN-2997:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690687/YARN-2997.5.patch against trunk revision ef237bd.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6276//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6276//console
This message is automatically generated.
> NM keeps sending finished containers to RM until app is finished
>
> Key: YARN-2997
> URL: https://issues.apache.org/jira/browse/YARN-2997
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Chengbing Liu
> Assignee: Chengbing Liu
> Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch
>
> We have seen in RM log a lot of
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268670#comment-14268670 ]
Chengbing Liu commented on YARN-2997:
Yes, a HashMap should be enough. I will upload a new one. Thanks.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268546#comment-14268546 ]
Jian He commented on YARN-2997:
Ah, I think the problem that container statuses whose applications are stopped may be lost on NM resync already existed before this patch. Thanks for your clarification.
One minor comment: {{LinkedHashMap()}}: a regular HashMap should be enough instead of a LinkedHashMap?
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267210#comment-14267210 ]
Chengbing Liu commented on YARN-2997:
Once it gets a RESYNC, NM calls {{getNMContainerStatuses}}, which loops over all containers in the NM context, removes those whose app is not in the NM context, and finally reports the rest to RM. The method {{getNMContainerStatuses}} is the same before and after this patch. The logic for removing containers from the context is also unchanged.
From a different viewpoint, {{pendingCompletedContainers}} contains the following:
* completed containers whose app is stopped, and which have been removed from the NM context.
* completed containers whose app is NOT stopped (which implies their apps are in the NM context), and which have NOT been removed from the NM context.
The first kind will not be reported to RM: they are no longer in the NM context, so the loop never visits them. The second kind will be reported to RM, since both they and their apps are in the NM context.
Finally, the changes of this patch can be summarized as follows:
* Do not send finished container statuses repeatedly for a running application.
* Send completed container statuses again in case of a lost heartbeat (normal heartbeat, not RESYNC).
I hope this clarifies your doubts.
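The two summary points in the comment above (report a completed container once, but resend it if the heartbeat was lost) can be sketched as follows. This is only a minimal illustration under stated assumptions: the class {{LostHeartbeatSketch}}, its methods, and the use of plain strings for container IDs and statuses are hypothetical and are not the NodeManager's real code. A {{LinkedHashMap}} is used here purely to keep the example's output order predictable.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the heartbeat bookkeeping; not Hadoop code.
final class LostHeartbeatSketch {
    // IDs of completed containers that have already been queued once.
    private final Set<String> recentlyStopped = new HashSet<>();
    // Completed statuses queued for the RM but not yet known to be delivered.
    private final Map<String, String> pendingCompleted = new LinkedHashMap<>();

    // Builds the completed-container part of a normal (non-RESYNC) heartbeat.
    List<String> buildHeartbeat(Map<String, String> completedInContext) {
        for (Map.Entry<String, String> e : completedInContext.entrySet()) {
            // Queue a completed container only the first time it is seen.
            if (recentlyStopped.add(e.getKey())) {
                pendingCompleted.put(e.getKey(), e.getValue());
            }
        }
        // Anything still pending (for example after a lost heartbeat) is sent again.
        return new ArrayList<>(pendingCompleted.keySet());
    }

    // Called only when the RM actually answered the heartbeat.
    void onHeartbeatResponse() {
        pendingCompleted.clear();
    }
}
```

If {{onHeartbeatResponse}} is never reached because the heartbeat failed, the next call to {{buildHeartbeat}} returns the same pending entries again; once a response arrives, the container stays in the context map but is no longer reported.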
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266500#comment-14266500 ]
Jian He commented on YARN-2997:
Thanks for updating.
bq. also in RESYNC section to clear the cache so that these outdated container statuses will not be reported to the restarted RM.
Could you explain more why this is added? I think before this patch these container statuses would be sent to the restarted RM; after the patch, they won't.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265865#comment-14265865 ]
Hadoop QA commented on YARN-2997:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690286/YARN-2997.4.patch against trunk revision 4cd66f7.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6252//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6252//console
This message is automatically generated.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265790#comment-14265790 ]
Jian He commented on YARN-2997:
sure, I will open a jira for it. thx
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265663#comment-14265663 ]
Chengbing Liu commented on YARN-2997:
[~jianhe] Can we perhaps deal with {{getNMContainerStatuses}} issue in another JIRA? This one has not changed anything for RM restart yet. If so, the only thing left is the {{pendingCompletedContainers.clear()}} thing. What do you think?
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265637#comment-14265637 ]
Jian He commented on YARN-2997:
bq. except the containers whose application is not in NM context.
I think we should send containers whose application is not in NMContext too for recovery.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265572#comment-14265572 ]
Chengbing Liu commented on YARN-2997:
{quote}
I think this is not possible given that we are looping this.context.getContainers() which is based on containerId to Container map. Or we can just use a list.
{quote}
We are looping over {{context.getContainers()}}, plus possible remainders from the previous heartbeat (in case of a lost heartbeat). If a previously completed container has had its status changed somehow, two different ContainerStatus objects with the same ID would be reported. That's why I use a map, and call {{pendingCompletedContainers.put(containerId, containerStatus)}} instead of {{containerStatuses.add(containerStatus)}} directly, in order to prevent such duplicates.
{quote}
then we should send the pendingCompletedContainers in getNMContainerStatuses method too
{quote}
We may not need to change {{getNMContainerStatuses}}, as it will send all container statuses in the NM context, except for containers whose application is not in the NM context. I think that will cover all elements in {{pendingCompletedContainers}}. And a lost heartbeat is not a problem for {{getNMContainerStatuses}}.
{quote}
or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method.
{quote}
That's a good idea. I will change this in the next patch. Thanks for your advice!
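The argument above for a map keyed by container ID, rather than a list of statuses, can be illustrated with a small sketch. Everything here is an assumption made for illustration: {{PendingStatusSketch}} and its nested {{Status}} class are hypothetical stand-ins for the real types, and container IDs are plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types; not the real YARN ContainerId/ContainerStatus.
final class PendingStatusSketch {
    static final class Status {
        final String containerId;
        final int exitStatus;

        Status(String containerId, int exitStatus) {
            this.containerId = containerId;
            this.exitStatus = exitStatus;
        }
    }

    // Keyed by container ID, so a later status for the same container
    // replaces the earlier one instead of being reported alongside it.
    private final Map<String, Status> pendingCompletedContainers = new HashMap<>();

    void track(Status status) {
        pendingCompletedContainers.put(status.containerId, status);
    }

    List<Status> toReport() {
        return new ArrayList<>(pendingCompletedContainers.values());
    }
}
```

With a plain list, tracking the same container twice with different exit statuses would produce two entries in the report; with the map, only the most recent status survives.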
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264975#comment-14264975 ]
Jian He commented on YARN-2997:
bq. we may run into a situation where we report two different ContainerStatus with same ID
I think this is not possible, given that we are looping over {{this.context.getContainers()}}, which is a containerId-to-Container map. Or we can just use a list.
bq. So in my opinion, that could be a potential leak.
I see. Then we should send the pendingCompletedContainers in the getNMContainerStatuses method too. Also, {{pendingCompletedContainers.clear()}} should be put after {{if (response.getNodeAction() == NodeAction.RESYNC)}}, or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263792#comment-14263792 ]
Hadoop QA commented on YARN-2997:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689964/YARN-2997.3.patch against trunk revision 947578c.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6238//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6238//console
This message is automatically generated.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263747#comment-14263747 ]
Chengbing Liu commented on YARN-2997:
{quote}
I think we can simplify the logic in getContainerStatuses as such:
{quote}
It seems that if we do not remove the containers whose app is already stopped, we will rely on the heartbeat response from RM to remove containers acked by AM. If something goes wrong on the AM or RM side, the NM will never remove these containers from its context. So in my opinion, that could be a potential leak.
{quote}
the sub class has the equal method.
{quote}
Yes, you are right. However, I'm still not sure it is a good idea to use {{Set}} instead of {{Map}}, for the following reasons:
* {{ContainerId}} is a unique identifier for a container, while {{ContainerStatus}} can change over time, even for the same container.
* We want to ensure that no duplicate container status is reported to RM. {{ContainerStatus}} holds not only the containerId but also the container state, exit status and diagnostic message, so we may run into a situation where we report two different {{ContainerStatus}} objects with the same ID but different states or other fields.
* {{ContainerId}} has an {{equals}} method and is annotated as public and stable, while {{ContainerStatus}} has no {{equals}} method and {{ContainerStatusPBImpl}} is annotated as private and unstable. It may not be a good idea to rely on the implementation of {{ContainerStatus}}.
* The use of {{Set}} never appears in the current code base.
{quote}
that's limitation of the test, we should fix the tests.
{quote}
Yes, I see. I will fix them.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263656#comment-14263656 ]
Jian He commented on YARN-2997:
I think we can simplify the logic in getContainerStatuses as such:
{code}
if (containerStatus.getState() == ContainerState.COMPLETE) {
  if (!isContainerRecentlyStopped(containerId)) {
    addCompletedContainer(containerId);
    containerStatuses.add(containerStatus);
  }
} else {
  containerStatuses.add(containerStatus);
}
{code}
bq. I didn't see an equals method defined in the abstract class
The subclass has the equals method.
bq. So I guess we have to keep it
That's a limitation of the test; we should fix the tests.
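The simplified branch proposed above can be tried out in isolation with the following sketch. It is a self-contained approximation, not the NodeManager code: {{ReportOnceSketch}}, its {{State}} enum and the string container IDs are hypothetical, and the pair {{isContainerRecentlyStopped}} / {{addCompletedContainer}} from the snippet is folded into a single {{Set.add}} call, which returns false when the ID is already present.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the proposed branch; not Hadoop code.
final class ReportOnceSketch {
    enum State { RUNNING, COMPLETE }

    // IDs of completed containers that have already been reported once.
    private final Set<String> recentlyStopped = new HashSet<>();

    // Returns the IDs whose status would go into the next heartbeat.
    List<String> getContainerStatuses(Map<String, State> containers) {
        List<String> toReport = new ArrayList<>();
        for (Map.Entry<String, State> e : containers.entrySet()) {
            if (e.getValue() == State.COMPLETE) {
                // Report a completed container only the first time it is seen.
                if (recentlyStopped.add(e.getKey())) {
                    toReport.add(e.getKey());
                }
            } else {
                // Containers that are not complete are reported on every heartbeat.
                toReport.add(e.getKey());
            }
        }
        return toReport;
    }
}
```

Calling {{getContainerStatuses}} twice on the same map reports the completed container only in the first call, while the running container appears in both.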
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263389#comment-14263389 ]
Chengbing Liu commented on YARN-2997:
{quote}
for simplicity, we can use the addAll method for the for loop.
{quote}
Yes, I will change this.
{quote}
pendingCompletedContainers, maybe use a set instead of a map?
{quote}
I'm not sure if {{ContainerStatus}} can be compared, because I didn't see an {{equals}} method defined in the abstract class {{ContainerStatus}}.
{quote}
pendingCompletedContainers.remove(containerId); this line may be not needed
{quote}
I added this line in method {{removeOrTrackCompletedContainersFromContext}} after I discovered the method is called independently in the test {{testRemovePreviousCompletedContainersFromContext}}. The test first calls {{removeOrTrackCompletedContainersFromContext}} then {{getContainerStatuses}}, and expects the container status to be removed from the result. So I guess we have to keep it?
{quote}
I found pendingContainersToRemove potentially has a leak,
{quote}
Yes you are right, I will fix this and add comments for modified test cases.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263236#comment-14263236 ]
Jian He commented on YARN-2997:
[~chengbing.liu], thanks for your explanation! The patch looks good overall; a few comments:
- For simplicity, we can use the addAll method instead of the for loop.
{code}
for (ContainerStatus containerStatus : pendingCompletedContainers.values()) {
  containerStatuses.add(containerStatus);
}
{code}
- pendingCompletedContainers: maybe use a set instead of a map?
- pendingCompletedContainers.remove(containerId); this line may not be needed, given that pendingCompletedContainers.clear() is invoked earlier.
- I found that pendingContainersToRemove potentially has a leak. We should probably add the following in the while loop of removeOrTrackCompletedContainersFromContext. Would you mind fixing this too?
{code}
if (nmContainer == null) {
  iter.remove();
}
{code}
- Could you add code comments on the modified test cases, so that people can reason about them more easily? Thanks.
{code}
if (heartBeatID == 2) {
  Assert.assertEquals(statuses.size(), 4);
} else {
  Assert.assertEquals(statuses.size(), 2);
}
{code}
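The leak fix suggested above (drop an entry from the pending-removal collection when the container is no longer known to the NM) relies on removing through the iterator while looping. Here is a minimal sketch of that pattern. The class {{PendingRemovalSketch}}, the method {{pruneUnknown}} and the string-based collections are assumptions for illustration and do not mirror the real method's signature.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical helper showing the iterator-removal pattern; not Hadoop code.
final class PendingRemovalSketch {
    // Drops IDs that no longer map to a container in the NM context, so the
    // pending set cannot grow without bound. Returns how many were dropped.
    static int pruneUnknown(Set<String> pendingContainersToRemove,
                            Map<String, String> nmContainers) {
        int dropped = 0;
        Iterator<String> iter = pendingContainersToRemove.iterator();
        while (iter.hasNext()) {
            String containerId = iter.next();
            String nmContainer = nmContainers.get(containerId);
            if (nmContainer == null) {
                // Removing via the iterator avoids ConcurrentModificationException.
                iter.remove();
                dropped++;
            }
        }
        return dropped;
    }
}
```

Removing through {{iter.remove()}} rather than calling {{remove}} on the set itself is what makes it safe to prune entries in the middle of the loop.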
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262134#comment-14262134 ]
Hadoop QA commented on YARN-2997:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689672/YARN-2997.2.patch against trunk revision e7257ac.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6226//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6226//console
This message is automatically generated.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261825#comment-14261825 ]
Chengbing Liu commented on YARN-2997:
[~jianhe] The containers are not removed in this patch; they are just not reported to RM when all of the following conditions are met:
* The application is not finished.
* The container was completed and was already in {{recentlyStoppedContainers}}.
* It is a normal heartbeat with RM, not one after an RM restart.
Note that the container is not removed from the NM context. In a resync with RM, these completed containers will still be reported for work-preserving recovery.
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261797#comment-14261797 ]

Jian He commented on YARN-2997:

bq. The uploaded patch will let the normal heartbeat

The intention was to let the NM remove containers from its context only after the RM acknowledges that it has received them. More context is in YARN-1372.

> NM keeps sending finished containers to RM until app is finished
>
> Key: YARN-2997
> URL: https://issues.apache.org/jira/browse/YARN-2997
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Chengbing Liu
> Attachments: YARN-2997.patch
>
> We have seen in RM log a lot of
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
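The acknowledge-then-remove intention described in the comment above can be sketched as follows. This is a minimal model under stated assumptions, not the YARN-1372 implementation: `AckedRemovalSketch` and all of its method names are hypothetical, and the real heartbeat request and response types are replaced by plain collections of container ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: completed containers stay in the NM context until acked.
class AckedRemovalSketch {
    private final Set<String> pendingCompleted = new LinkedHashSet<>();

    void containerCompleted(String id) {
        pendingCompleted.add(id);
    }

    // Every heartbeat re-sends whatever the RM has not acknowledged yet.
    List<String> buildHeartbeat() {
        return new ArrayList<>(pendingCompleted);
    }

    // The response is assumed to carry the ids the RM confirms it has received.
    void onHeartbeatResponse(Collection<String> acked) {
        pendingCompleted.removeAll(acked);
    }

    public static void main(String[] args) {
        AckedRemovalSketch nm = new AckedRemovalSketch();
        nm.containerCompleted("c1");
        nm.containerCompleted("c2");
        System.out.println(nm.buildHeartbeat());
        nm.onHeartbeatResponse(Arrays.asList("c1"));
        System.out.println(nm.buildHeartbeat());
    }
}
```

The trade-off in this model is that an unacknowledged container is re-sent on every heartbeat, which is the repeated reporting this issue describes.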
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261792#comment-14261792 ]

Chengbing Liu commented on YARN-2997:

[~kasha] [~jianhe] Thanks for your review.

I think NM will call {{getNMContainerStatuses()}} instead of {{getContainerStatuses()}} upon receiving RESYNC from restarted RM. {{getNMContainerStatuses()}} remains unchanged and still reports all the completed containers for non-completed apps.

The uploaded patch will let the normal heartbeat (not after receiving RESYNC) send only useful container status information to RM. So I guess the work-preserving RM restart is not affected.

> NM keeps sending finished containers to RM until app is finished
>
> Key: YARN-2997
> URL: https://issues.apache.org/jira/browse/YARN-2997
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Chengbing Liu
> Attachments: YARN-2997.patch
>
> We have seen in RM log a lot of
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261682#comment-14261682 ]

Jian He commented on YARN-2997:

bq. add code to identify them as duplicates and then ignore.

I think they are already ignored. Perhaps we should clarify the log message and move it to debug level.

> NM keeps sending finished containers to RM until app is finished
>
> Key: YARN-2997
> URL: https://issues.apache.org/jira/browse/YARN-2997
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Chengbing Liu
> Attachments: YARN-2997.patch
>
> We have seen in RM log a lot of
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261526#comment-14261526 ]

Karthik Kambatla commented on YARN-2997:

With work-preserving restart, the NM is required to notify the RM repeatedly in case the RM goes down and loses this information. I propose we either ignore the later updates, or add code to identify them as duplicates and then ignore them.

> NM keeps sending finished containers to RM until app is finished
>
> Key: YARN-2997
> URL: https://issues.apache.org/jira/browse/YARN-2997
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Chengbing Liu
> Attachments: YARN-2997.patch
>
> We have seen in RM log a lot of
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
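The RM-side alternative proposed in the comment above, ignoring a completion report for a container that has already been released, might look roughly like this. It is a sketch under assumptions rather than scheduler code: `DuplicateCompletionSketch`, `liveContainers`, and `ignoredCount` are hypothetical names, and a failed set lookup stands in for {{getRMContainer}} returning null.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of an RM that silently drops duplicate completion reports.
class DuplicateCompletionSketch {
    private final Set<String> liveContainers = new HashSet<>();
    private int ignored = 0;

    void allocate(String id) {
        liveContainers.add(id);
    }

    // Returns true if this report released the container,
    // false if the container was already released and the report was ignored.
    boolean containerCompleted(String id) {
        if (!liveContainers.remove(id)) {
            // A real scheduler could log this case at debug level instead of INFO.
            ignored++;
            return false;
        }
        return true;
    }

    int ignoredCount() {
        return ignored;
    }

    public static void main(String[] args) {
        DuplicateCompletionSketch rm = new DuplicateCompletionSketch();
        rm.allocate("c1");
        System.out.println(rm.containerCompleted("c1")); // first report
        System.out.println(rm.containerCompleted("c1")); // repeated report
        System.out.println(rm.ignoredCount());
    }
}
```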
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260882#comment-14260882 ]

Hadoop QA commented on YARN-2997:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12689447/YARN-2997.patch
against trunk revision 249cc90.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6215//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6215//console

This message is automatically generated.

> NM keeps sending finished containers to RM until app is finished
>
> Key: YARN-2997
> URL: https://issues.apache.org/jira/browse/YARN-2997
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Chengbing Liu
> Attachments: YARN-2997.patch
>
> We have seen in RM log a lot of
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)