[jira] [Commented] (YARN-7214) duplicated container completed To AM
[ https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171402#comment-16171402 ]

rangjiaheng commented on YARN-7214:
-----------------------------------

We found this problem in a long-running Spark Streaming application with a fixed number of containers: after the NM was lost, the NM restarted, and then the RM restarted, one more container than expected was allocated.

> duplicated container completed To AM
> ------------------------------------
>
>                 Key: YARN-7214
>                 URL: https://issues.apache.org/jira/browse/YARN-7214
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1, 3.0.0-alpha3
>         Environment: hadoop 2.7.1 with RM recovery and NM recovery enabled
>            Reporter: zhangshilong
>         Attachments: screenshot-1.png
>
> env: hadoop 2.7.1 with RM recovery and NM recovery enabled
> case: a Spark app (app1) is running at least one container (named c1) on NM1.
> 1. NM1 crashed, and the RM found NM1 expired after 10 minutes.
> 2. The RM removed all containers on NM1 (RMNodeImpl), and app1 received a "c1 completed" message. But the RM could not send c1 (to be removed) to NM1 because NM1 was lost.
> 3. NM1 restarted and registered with the RM (c1 was in the register request), but the RM saw that NM1 had been lost and did not handle the containers from NM1.
> 4. NM1 did not heartbeat with c1 (c1 was not in the heartbeat request), so c1 was never removed from NM1's context.
> 5. The RM restarted and NM1 re-registered with the RM. Now c1 was handled and recovered, and the RM sent a "c1 completed" message to the AM of app1. So app1 received a duplicated c1.
> Once the Spark AM receives one "container completed" from the RM, it allocates one new container.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
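Since a recovered RM can replay a completed-container event, one defensive measure on the application side (a minimal sketch with hypothetical names, not actual Spark or YARN code) is to make the AM's handling idempotent by remembering which container IDs it has already processed:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical AM-side bookkeeping: ignore a container-completed
// notification that was already processed, so a replayed event after
// RM recovery does not trigger an extra container allocation.
public class CompletedContainerTracker {
    private final Set<String> seenCompleted = new HashSet<>();

    /** Returns true only the first time a container ID is reported. */
    public boolean onContainerCompleted(String containerId) {
        // Set.add() returns false if the ID was already present.
        return seenCompleted.add(containerId);
    }
}
```

With this kind of bookkeeping, the AM would request a replacement container only when `onContainerCompleted` returns true, so a replayed c1 event would not allocate an extra container.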
[jira] [Commented] (YARN-7214) duplicated container completed To AM
[ https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171396#comment-16171396 ]

zhangshilong commented on YARN-7214:
------------------------------------

!screenshot-1.png!

Generally:
1. The NM completes one container (c) and sends it to the RM.
2. The RM sends c to the AM, telling the AM that c is completed.
3. The RM sends c to the NM, telling the NM that c can be removed from the NM.

If the RM restarts before step 3, c stays in the NM's context forever. If the RM then restarts again, c is reported to the AM as a duplicated completed container.
[jira] [Commented] (YARN-7214) duplicated container completed To AM
[ https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171391#comment-16171391 ]

rangjiaheng commented on YARN-7214:
-----------------------------------

aa
[jira] [Commented] (YARN-7214) duplicated container completed To AM
[ https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171322#comment-16171322 ]

zhangshilong commented on YARN-7214:
------------------------------------

In my view, containers in recentlyStoppedContainers could be removed from the NMContext as soon as the NM heartbeats normally with the RM.
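The proposal above could be sketched as a cache whose entries are evicted once a normal heartbeat confirms the RM has seen them, instead of only after a fixed retention time. This is a standalone illustration with hypothetical names, not the actual NodeStatusUpdaterImpl API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed behavior: keep recently stopped containers
// until the RM has confirmed them via a successful heartbeat, then
// drop them so they cannot be re-reported after an NM/RM restart.
public class RecentlyStoppedCache {
    private final Map<String, Long> stopped = new HashMap<>();
    private final long retentionMs;

    public RecentlyStoppedCache(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Track a completed container, with a fallback expiry time. */
    public void addCompleted(String containerId, long nowMs) {
        stopped.putIfAbsent(containerId, nowMs + retentionMs);
    }

    /** Called when a heartbeat response confirms the RM saw these IDs. */
    public void onHeartbeatAck(Iterable<String> ackedIds) {
        for (String id : ackedIds) {
            stopped.remove(id);
        }
    }

    public boolean contains(String containerId) {
        return stopped.containsKey(containerId);
    }
}
```

The time-based expiry is kept only as a safety net; in the common case the heartbeat acknowledgment clears the entry long before the retention window elapses.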
[jira] [Commented] (YARN-7214) duplicated container completed To AM
[ https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171270#comment-16171270 ]

zhangshilong commented on YARN-7214:
------------------------------------

3.
{code:java}
public static class AddNodeTransition implements
    SingleArcTransition<RMNodeImpl, RMNodeEvent> {

  @Override
  public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
    // Inform the scheduler
    RMNodeStartedEvent startEvent = (RMNodeStartedEvent) event;
    List<NMContainerStatus> containers = null;

    NodeId nodeId = rmNode.nodeId;
    RMNode previousRMNode =
        rmNode.context.getInactiveRMNodes().remove(nodeId);
    if (previousRMNode != null) {
      rmNode.updateMetricsForRejoinedNode(previousRMNode.getState());
    } else {
      NodeId unknownNodeId =
          NodesListManager.createUnknownNodeId(nodeId.getHost());
      previousRMNode =
          rmNode.context.getInactiveRMNodes().remove(unknownNodeId);
      if (previousRMNode != null) {
        ClusterMetrics.getMetrics().decrDecommisionedNMs();
      }
      // Increment activeNodes explicitly because this is a new node.
      ClusterMetrics.getMetrics().incrNumActiveNodes();
      containers = startEvent.getNMContainerStatuses();
      if (containers != null && !containers.isEmpty()) {
        for (NMContainerStatus container : containers) {
          if (container.getContainerState() == ContainerState.RUNNING
              || container.getContainerState() == ContainerState.SCHEDULED) {
            rmNode.launchedContainers.add(container.getContainerId());
          }
        }
      }
    }

    if (null != startEvent.getRunningApplications()) {
      for (ApplicationId appId : startEvent.getRunningApplications()) {
        handleRunningAppOnNode(rmNode, rmNode.context, appId, rmNode.nodeId);
      }
    }

    rmNode.context.getDispatcher().getEventHandler()
        .handle(new NodeAddedSchedulerEvent(rmNode, containers));
    rmNode.context.getDispatcher().getEventHandler().handle(
        new NodesListManagerEvent(
            NodesListManagerEventType.NODE_USABLE, rmNode));
  }
}
{code}
4. In NodeStatusUpdaterImpl.java, getNMContainerStatuses() is called before register, so the completed container is put into recentlyStoppedContainers.
In the register request, completed containers are sent to the RM.
{code:java}
public void addCompletedContainer(ContainerId containerId) {
  synchronized (recentlyStoppedContainers) {
    removeVeryOldStoppedContainersFromCache();
    if (!recentlyStoppedContainers.containsKey(containerId)) {
      recentlyStoppedContainers.put(containerId,
          System.currentTimeMillis() + durationToTrackStoppedContainers);
    }
  }
}
{code}
On a normal heartbeat, getContainerStatuses() is called. A completed container is not put into containerStatuses because it is already in recentlyStoppedContainers, so it is not sent to the RM.
{code:java}
protected List<ContainerStatus> getContainerStatuses() throws IOException {
  List<ContainerStatus> containerStatuses = new ArrayList<ContainerStatus>();
  for (Container container : this.context.getContainers().values()) {
    ContainerId containerId = container.getContainerId();
    ApplicationId applicationId = containerId.getApplicationAttemptId()
        .getApplicationId();
    org.apache.hadoop.yarn.api.records.ContainerStatus containerStatus =
        container.cloneAndGetContainerStatus();
    if (containerStatus.getState() == ContainerState.COMPLETE) {
      if (isApplicationStopped(applicationId)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(applicationId + " is completing, " + " remove "
              + containerId + " from NM context.");
        }
        context.getContainers().remove(containerId);
        pendingCompletedContainers.put(containerId, containerStatus);
      } else {
        if (!isContainerRecentlyStopped(containerId)) {
          pendingCompletedContainers.put(containerId, containerStatus);
        }
      }
      // Adding to finished containers cache. Cache will keep it around at
      // least for #durationToTrackStoppedContainers duration. In the
      // subsequent call to stop container it will get removed from cache.
      addCompletedContainer(containerId);
    } else {
      containerStatuses.add(containerStatus);
    }
  }
  containerStatuses.addAll(pendingCompletedContainers.values());
  if (LOG.isDebugEnabled()) {
    LOG.debug("Sending out " + containerStatuses.size()
        + " container statuses: " + containerStatuses);
  }
  return containerStatuses;
}
{code}
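The suppression behavior of getContainerStatuses can be illustrated with a standalone sketch (hypothetical names, not the real NM classes): a completed container is reported on the first heartbeat after it finishes, then filtered out of later heartbeats because it sits in the recently-stopped set:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Standalone illustration of the report-once behavior described above:
// the first heartbeat after a container completes includes it, and
// subsequent heartbeats suppress it as "recently stopped".
public class HeartbeatReporter {
    private final Set<String> recentlyStopped = new LinkedHashSet<>();

    /** Returns the completed-container IDs to include in this heartbeat. */
    public List<String> buildReport(List<String> completedNow) {
        List<String> report = new ArrayList<>();
        for (String id : completedNow) {
            if (recentlyStopped.add(id)) { // true only on first sighting
                report.add(id);
            }
        }
        return report;
    }
}
```

This is exactly why step 4 of the scenario occurs: once c1 is in the recently-stopped set, no later heartbeat ever carries it again, so the RM has no chance to confirm and clear it.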