[ https://issues.apache.org/jira/browse/YARN-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713280#comment-13713280 ]
Omkar Vinit Joshi commented on YARN-245:
----------------------------------------

Thanks [~mayank_bansal] for the patch. I agree that checking heartbeat IDs will test this issue. A few comments:

{code}
+ conf.setBoolean(YarnConfiguration.LOG_AGGREGATION_ENABLED, true);
{code}
Why are we doing this?

{code}
+ NodeStatus nodeStatus = request.getNodeStatus();
+ nodeStatus.setResponseId(heartBeatID++);
{code}
Is this required? Can it be removed?

* There is one issue at present with NodeStatusUpdaterImpl.java: if we get such a duplicate heartbeat response, we do not wait but try again immediately. Check the code in the finally {} block, which won't get executed, so the NM will keep pinging the RM until it gets a response with the correct response ID. Should we wait or request again immediately? Thoughts?

{code}
+ Thread.sleep(1000l);
{code}
Can we make it 1000?

* The test will need a timeout. However, I see there are certain tests without one. If you add a timeout, give it a somewhat larger value. :)

{code}
+ if (nodeStatus.getKeepAliveApplications() != null
+     && nodeStatus.getKeepAliveApplications().size() > 0) {
+   for (ApplicationId appId : nodeStatus.getKeepAliveApplications()) {
+     List<Long> list = keepAliveRequests.get(appId);
+     if (list == null) {
+       list = new LinkedList<Long>();
+       keepAliveRequests.put(appId, list);
+     }
+     list.add(System.currentTimeMillis());
+   }
+ }
{code}

{code}
+ if (heartBeatID == 2) {
+   LOG.info("Sending FINISH_APP for application: [" + appId + "]");
+   this.context.getApplications().put(appId, mock(Application.class));
+   nhResponse.addAllApplicationsToCleanup(Collections.singletonList(appId));
+ }
{code}

{code}
+ rt.context.getApplications().remove(rt.appId);
{code}

{code}
+ private Map<ApplicationId, List<Long>> keepAliveRequests =
+     new HashMap<ApplicationId, List<Long>>();
+ private ApplicationId appId = BuilderUtils.newApplicationId(1, 1);
{code}
Do we need this? Can we remove all the application-related code? Since we are now checking only heartbeat IDs, we can remove it. Thoughts?
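The review discusses dropping a heartbeat response whose response ID the NM has already processed, and whether the NM should wait before its next heartbeat. Below is a minimal, self-contained sketch of that idea under stated assumptions, not the patch or NodeStatusUpdaterImpl itself: `HeartbeatLoopSketch`, `Response`, and `ResourceManagerStub` are hypothetical stand-ins, and the RM is simulated by a scripted sequence of responses. The wait sits in a `finally` block so that every iteration sleeps for the heartbeat interval, including iterations that discard a duplicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not the real NodeStatusUpdaterImpl: a heartbeat loop that
// skips responses whose ID it has already processed and always waits before retrying.
public class HeartbeatLoopSketch {

    /** Hypothetical heartbeat response: a response ID plus applications to clean up. */
    static final class Response {
        final int responseId;
        final List<String> appsToCleanup;

        Response(int responseId, List<String> appsToCleanup) {
            this.responseId = responseId;
            this.appsToCleanup = appsToCleanup;
        }
    }

    /** Hypothetical stand-in for the RM's heartbeat call. */
    interface ResourceManagerStub {
        Response heartbeat(int lastResponseId);
    }

    private final ResourceManagerStub rm;
    private final long heartbeatIntervalMs;
    private int lastResponseId = 0; // ID of the last response actually processed
    final List<String> cleanedUpApps = new ArrayList<>();
    int duplicatesDropped = 0;
    int waits = 0;

    HeartbeatLoopSketch(ResourceManagerStub rm, long heartbeatIntervalMs) {
        this.rm = rm;
        this.heartbeatIntervalMs = heartbeatIntervalMs;
    }

    void run(int iterations) throws InterruptedException {
        for (int i = 0; i < iterations; i++) {
            try {
                Response response = rm.heartbeat(lastResponseId);
                if (response.responseId == lastResponseId) {
                    // Resend of a response already handled: do not act on it again.
                    duplicatesDropped++;
                } else {
                    lastResponseId = response.responseId;
                    cleanedUpApps.addAll(response.appsToCleanup);
                }
            } finally {
                // Wait on every iteration, duplicate or not, so a run of
                // duplicates cannot make the loop ping the RM without pausing.
                waits++;
                Thread.sleep(heartbeatIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Scripted RM: the second response is a resend of the first (same ID).
        Iterator<Response> script = Arrays.asList(
                new Response(1, Collections.singletonList("application_1_0001")),
                new Response(1, Collections.singletonList("application_1_0001")),
                new Response(2, Collections.<String>emptyList())).iterator();
        HeartbeatLoopSketch loop = new HeartbeatLoopSketch(last -> script.next(), 10);
        loop.run(3);
        // prints "[application_1_0001] dropped=1"
        System.out.println(loop.cleanedUpApps + " dropped=" + loop.duplicatesDropped);
    }
}
```

Placing the wait in `finally` is one possible answer to the wait-or-retry question above; whether the real NM should instead re-request immediately is left open in the thread.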
> Node Manager can not handle duplicate responses
> -----------------------------------------------
>
>                 Key: YARN-245
>                 URL: https://issues.apache.org/jira/browse/YARN-245
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.0.2-alpha, 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Mayank Bansal
>         Attachments: YARN-245-trunk-1.patch, YARN-245-trunk-2.patch, YARN-245-trunk-3.patch
>
>
> {code:xml}
> 2012-11-25 12:56:11,795 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISH_APPLICATION at FINISHED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>     at java.lang.Thread.run(Thread.java:662)
> 2012-11-25 12:56:11,796 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1353818859056_0004 transitioned from FINISHED to null
> {code}
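The log quoted in the issue shows an application receiving FINISH_APPLICATION after it has already reached FINISHED. The following is a toy reproduction of that failure mode under heavy simplification: `DuplicateFinishSketch` is hypothetical, the application lifecycle is collapsed to two states (the real NM application passes through more), and `IllegalStateException` stands in for YARN's `InvalidStateTransitonException`. Applying the same cleanup list twice, as a resent heartbeat response would, makes the second FINISH_APPLICATION invalid.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified model of NM application state; not YARN code.
public class DuplicateFinishSketch {

    enum AppState { RUNNING, FINISHED }

    private final Map<String, AppState> apps = new HashMap<>();

    void addRunningApp(String appId) {
        apps.put(appId, AppState.RUNNING);
    }

    /** In this toy model, FINISH_APPLICATION is valid only from RUNNING. */
    void finishApplication(String appId) {
        AppState state = apps.get(appId);
        if (state != AppState.RUNNING) {
            throw new IllegalStateException("Invalid event: FINISH_APPLICATION at " + state);
        }
        apps.put(appId, AppState.FINISHED);
    }

    /** Applies each response's cleanup list in order, with no de-duplication. */
    String processResponses(List<List<String>> cleanupLists) {
        try {
            for (List<String> cleanup : cleanupLists) {
                for (String appId : cleanup) {
                    finishApplication(appId);
                }
            }
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        DuplicateFinishSketch nm = new DuplicateFinishSketch();
        nm.addRunningApp("application_1353818859056_0004");
        List<String> cleanup = Collections.singletonList("application_1353818859056_0004");
        // The second list models a resend of the first heartbeat response.
        // prints "Invalid event: FINISH_APPLICATION at FINISHED"
        System.out.println(nm.processResponses(Arrays.asList(cleanup, cleanup)));
    }
}
```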