[ https://issues.apache.org/jira/browse/YARN-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702308#comment-13702308 ]
Omkar Vinit Joshi commented on YARN-245:
----------------------------------------

I don't think this will fix the root cause. Looking at the current transitions, it seems that ApplicationImpl received two FINISH_APPLICATION events when it expects only one in its life cycle. The first event made the transition successfully, but the second, which in this case arrived in the FINISHED state, caused the invalid transition. From the code, it looks like the snippet below sent two events in consecutive loop cycles (node heartbeats), which caused the problem.

[~devaraj.k] Is there any way we can reproduce this? Did you see the error again?

NodeStatusUpdaterImpl.run
{code}
if (appsToCleanup.size() != 0) {
  dispatcher.getEventHandler().handle(
      new CMgrCompletedAppsEvent(appsToCleanup));
}
{code}

[~mayank_bansal] I think we need to fix the NodeStatusUpdaterImpl.run code. At present it doesn't check whether the NM received two identical responses, i.e. the NM sent a heartbeat but didn't get a response from the RM, so it sent the heartbeat again, and the RM in turn sent two identical responses. The side effect is that the NM has already sent the application finished event for the first response, which creates a problem when it tries to send it again on the next identical heartbeat.

{code}
lastHeartBeatID = response.getResponseId();
List<ContainerId> containersToCleanup = response
    .getContainersToCleanup();
if (containersToCleanup.size() != 0) {
  dispatcher.getEventHandler().handle(
      new CMgrCompletedContainersEvent(containersToCleanup,
          CMgrCompletedContainersEvent.Reason.BY_RESOURCEMANAGER));
}
List<ApplicationId> appsToCleanup =
    response.getApplicationsToCleanup();
//Only start tracking for keepAlive on FINISH_APP
trackAppsForKeepAlive(appsToCleanup);
if (appsToCleanup.size() != 0) {
  dispatcher.getEventHandler().handle(
      new CMgrCompletedAppsEvent(appsToCleanup));
}
{code}

I think we can reproduce this by sending the same heartbeat response again, with the application finish event included. Any thoughts?
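To make the suggested check concrete, here is a minimal, self-contained sketch of skipping a duplicated heartbeat response before any cleanup event is dispatched. It is not the actual NodeStatusUpdaterImpl code: the class HeartbeatDedupSketch, the Response type, the processResponse method, the -1 sentinel, and the dispatchedCleanups list are all hypothetical stand-ins, and it assumes that a duplicated response carries the same response ID as the one already processed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeartbeatDedupSketch {

  /** Hypothetical stand-in for the RM heartbeat response. */
  static final class Response {
    final int responseId;
    final List<String> appsToCleanup;

    Response(int responseId, List<String> appsToCleanup) {
      this.responseId = responseId;
      this.appsToCleanup = appsToCleanup;
    }
  }

  // Hypothetical sentinel meaning "no response processed yet".
  private int lastHeartBeatID = -1;

  // Stand-in for dispatcher.getEventHandler().handle(...):
  // records each cleanup list that would have been dispatched.
  final List<List<String>> dispatchedCleanups = new ArrayList<>();

  /**
   * Processes a response unless it repeats the last processed ID
   * (assumption: a duplicated response reuses the same ID).
   * Returns true if the response was processed, false if skipped.
   */
  boolean processResponse(Response response) {
    if (response.responseId == lastHeartBeatID) {
      return false; // duplicate: do not send the finish event again
    }
    lastHeartBeatID = response.responseId;
    if (!response.appsToCleanup.isEmpty()) {
      dispatchedCleanups.add(response.appsToCleanup);
    }
    return true;
  }

  public static void main(String[] args) {
    HeartbeatDedupSketch nm = new HeartbeatDedupSketch();
    Response finished = new Response(
        7, Arrays.asList("application_1353818859056_0004"));

    System.out.println(nm.processResponse(finished)); // first copy
    System.out.println(nm.processResponse(finished)); // identical copy
    System.out.println(nm.dispatchedCleanups.size());
  }
}
```

Replaying the same Response object twice mirrors the reproduction idea above: the first copy triggers the cleanup, and the second is dropped before it can reach the application state machine. Whether the real fix belongs here or in making the FINISHED state tolerate a repeated event is a separate design question.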
> Node Manager gives InvalidStateTransitonException for FINISH_APPLICATION at FINISHED
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-245
>                 URL: https://issues.apache.org/jira/browse/YARN-245
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.0.2-alpha, 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Mayank Bansal
>         Attachments: YARN-245-trunk-1.patch
>
>
> {code:xml}
> 2012-11-25 12:56:11,795 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISH_APPLICATION at FINISHED
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-11-25 12:56:11,796 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1353818859056_0004 transitioned from FINISHED to null
> {code}