[ 
https://issues.apache.org/jira/browse/YARN-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702308#comment-13702308
 ] 

Omkar Vinit Joshi commented on YARN-245:
----------------------------------------

I don't think this will fix the root cause. Looking at the current transitions, 
it seems that ApplicationImpl received two FINISH_APPLICATION events when it 
expects only one in its life cycle. The first event made a successful 
transition, but the second, which in this case arrived in the FINISHED state, 
caused an invalid transition. From the code, it looks like the snippet below 
sent the two events in consecutive loop cycles (node heartbeats), which caused 
the problem.

[~devaraj.k] Is there any way we can reproduce this? Did you see that error 
again?

NodeStatusUpdaterImpl.run
{code}
            if (appsToCleanup.size() != 0) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedAppsEvent(appsToCleanup));
            }
{code}

[~mayank_bansal] I think we need to fix the NodeStatusUpdaterImpl.run code. At 
present it doesn't check whether the NM received two identical responses, i.e. 
the NM sent a heartbeat but didn't get a response from the RM, so it sent the 
heartbeat again, and the RM in turn sent two identical responses. The side 
effect is that the NM already sent the application finished event for the first 
response, which creates a problem if it tries to send it again on the next, 
identical response.

{code}
            lastHeartBeatID = response.getResponseId();
            List<ContainerId> containersToCleanup = response
                .getContainersToCleanup();
            if (containersToCleanup.size() != 0) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedContainersEvent(containersToCleanup, 
                      CMgrCompletedContainersEvent.Reason.BY_RESOURCEMANAGER));
            }
            List<ApplicationId> appsToCleanup =
                response.getApplicationsToCleanup();
            //Only start tracking for keepAlive on FINISH_APP
            trackAppsForKeepAlive(appsToCleanup);
            if (appsToCleanup.size() != 0) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedAppsEvent(appsToCleanup));
            }
{code}
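
A rough sketch of the kind of duplicate-response check discussed above. It 
uses hypothetical stand-in types (HeartbeatDedupSketch, FakeHeartbeatResponse, 
a plain list instead of the dispatcher), not the real YARN classes, and it 
assumes that two identical responses carry the same response ID. It is not an 
actual patch for NodeStatusUpdaterImpl.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: process each heartbeat response ID at most once.
final class HeartbeatDedupSketch {

  // Stand-in for the RM heartbeat response; not the real YARN type.
  static final class FakeHeartbeatResponse {
    final int responseId;
    final List<String> appsToCleanup;

    FakeHeartbeatResponse(int responseId, List<String> appsToCleanup) {
      this.responseId = responseId;
      this.appsToCleanup = appsToCleanup;
    }
  }

  // Assumption: real response IDs are never negative, so -1 means "none yet".
  private int lastHeartBeatID = -1;

  // Records the cleanup events that would have gone to the dispatcher.
  final List<String> dispatched = new ArrayList<>();

  /** Returns true if the response was processed, false if it was a repeat. */
  boolean onResponse(FakeHeartbeatResponse response) {
    if (response.responseId == lastHeartBeatID) {
      // Identical response already handled: do not send the events again.
      return false;
    }
    lastHeartBeatID = response.responseId;
    if (!response.appsToCleanup.isEmpty()) {
      dispatched.addAll(response.appsToCleanup);
    }
    return true;
  }

  public static void main(String[] args) {
    HeartbeatDedupSketch nm = new HeartbeatDedupSketch();
    FakeHeartbeatResponse r =
        new FakeHeartbeatResponse(7, Arrays.asList("app_0004"));
    System.out.println(nm.onResponse(r)); // true
    System.out.println(nm.onResponse(r)); // false, same response ID
    System.out.println(nm.dispatched);    // [app_0004]
  }
}
```

Whether comparing response IDs is the right condition in the real code would 
need to be checked against how the RM assigns them.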

I think we can reproduce this by sending the same heartbeat response again, one 
that includes the application finish event. Any thoughts?
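
To make the reproduction idea concrete, here is a toy model rather than a 
real NodeManager test. ToyApplication and replay are hypothetical names, and 
IllegalStateException stands in for InvalidStateTransitonException. It only 
shows that replaying the same cleanup list twice delivers FINISH_APPLICATION 
to an application that is already FINISHED.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of the duplicate FINISH_APPLICATION scenario.
final class DuplicateFinishRepro {

  enum State { RUNNING, FINISHED }

  // Toy application with a single legal transition: RUNNING -> FINISHED.
  static final class ToyApplication {
    State state = State.RUNNING;

    void handleFinishApplication() {
      if (state != State.RUNNING) {
        throw new IllegalStateException(
            "Invalid event: FINISH_APPLICATION at " + state);
      }
      state = State.FINISHED;
    }
  }

  /** Replays the same cleanup list the given number of times and
   *  returns how many finish events the application rejected. */
  static int replay(ToyApplication app, List<String> appsToCleanup, int times) {
    int rejected = 0;
    for (int i = 0; i < times; i++) {
      // Mirrors the unconditional check in the heartbeat loop.
      if (appsToCleanup.size() != 0) {
        try {
          app.handleFinishApplication();
        } catch (IllegalStateException e) {
          rejected++;
        }
      }
    }
    return rejected;
  }

  public static void main(String[] args) {
    ToyApplication app = new ToyApplication();
    int rejected =
        replay(app, Arrays.asList("application_1353818859056_0004"), 2);
    // Prints: rejected=1 state=FINISHED
    System.out.println("rejected=" + rejected + " state=" + app.state);
  }
}
```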
                
> Node Manager gives InvalidStateTransitonException for FINISH_APPLICATION at 
> FINISHED
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-245
>                 URL: https://issues.apache.org/jira/browse/YARN-245
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.0.2-alpha, 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Mayank Bansal
>         Attachments: YARN-245-trunk-1.patch
>
>
> {code:xml}
> 2012-11-25 12:56:11,795 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> FINISH_APPLICATION at FINISHED
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-11-25 12:56:11,796 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1353818859056_0004 transitioned from FINISHED to null
> {code}
