[ 
https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223136#comment-14223136
 ] 

Rohith commented on YARN-2025:
------------------------------

        I ran into a weird scenario where I hit the NPE in 
{{CapacityScheduler.addApplicationAttempt}} in a different way. I was able 
to get some information from the logs, but not all of it, since the logs had been rolled over.

        The application's final state is FAILED, but the ApplicationAttempt's final state is 
null. It is very strange that we can end up with RMApp->FAILED but 
RMAppAttempt->null.
The log extracted from the RM is below. Because of this scenario, application recovery 
throws an NPE: RMAppAttempt tries to add the attempt to the scheduler, but the application 
details have not been added to the scheduler.
{noformat}
2014-11-24 23:53:32,608 | INFO  | main-EventThread | Recovering app: application_1416805604019_0038 with 1 attempts and final state = FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:700)
2014-11-24 23:53:32,609 | INFO  | main-EventThread | Recovering attempt: appattempt_1416805604019_0038_000001 with final state: null | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:735)
{noformat}

The NPE trace is as follows.
{noformat}
2014-11-24 23:53:32,610 | ERROR | main-EventThread | Failed to load/recover state | org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:527)
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:607)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:941)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:97)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:963)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:931)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:698)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:803)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:95)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:825)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:808)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:681)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:335)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1148)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:523)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:927)
{noformat}
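
For illustration only, here is a minimal sketch of the kind of null guard that would avoid the NPE at the point where the trace starts. It is not the attached patch and not the real scheduler code: {{NullGuardSketch}} and {{App}} are hypothetical stand-ins, and plain strings replace the real application and attempt IDs. It only shows that an attempt arriving for an application the scheduler does not know about (never added, or already removed) can be rejected instead of dereferencing null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheduler; not the real CapacityScheduler.
class NullGuardSketch {

    // Hypothetical stand-in for SchedulerApplication; it only carries the user.
    static final class App {
        private final String user;

        App(String user) {
            this.user = user;
        }

        String getUser() {
            return user;
        }
    }

    private final Map<String, App> applications = new HashMap<>();
    private String lastAttemptUser;

    public void addApplication(String appId, String user) {
        applications.put(appId, new App(user));
    }

    public void doneApplication(String appId) {
        applications.remove(appId);
    }

    // Returns true if the attempt was accepted, false if the application
    // is not known to the scheduler.
    public boolean addApplicationAttempt(String appId) {
        App application = applications.get(appId);
        if (application == null) {
            // Without this check, application.getUser() below would throw
            // a NullPointerException.
            return false;
        }
        lastAttemptUser = application.getUser();
        return true;
    }

    public String getLastAttemptUser() {
        return lastAttemptUser;
    }
}
```

With this guard, calling {{addApplicationAttempt}} for an application that was never added, or after {{doneApplication}}, returns false rather than throwing. Whether the real fix should skip the attempt, log a warning, or reject the event is a separate design question that this sketch does not answer.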

> Possible NPE in schedulers#addApplicationAttempt()
> --------------------------------------------------
>
>                 Key: YARN-2025
>                 URL: https://issues.apache.org/jira/browse/YARN-2025
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Tsuyoshi OZAWA
>            Assignee: Tsuyoshi OZAWA
>         Attachments: YARN-2025.1.patch
>
>
> In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we 
> don't check whether {{application}} is null. This can cause an NPE in the following 
> sequence: addApplication() -> doneApplication() (e.g. AppKilledTransition) 
> -> addApplicationAttempt().
> {code}
>     SchedulerApplication application =
>         applications.get(applicationAttemptId.getApplicationId());
>     String user = application.getUser();
> {code}



