Zhijie Shen commented on YARN-2308:

I investigated the problem: when an app is submitted to a non-existent queue, 
it is rejected by CS (CapacityScheduler). This works fine in a normal 
submission, because addAppAttempt happens only after RMApp enters ACCEPTED, by 
which point addApp has already executed successfully. In recovery mode, 
however, addAppAttempt is triggered regardless of the result of addApp. At that 
moment the app doesn't exist in CS because it has been rejected, while 
addAppAttempt assumes it does exist, which results in an NPE.
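
To make the sequence above concrete, here is a minimal, self-contained sketch. 
It is not the actual CapacityScheduler code or the attached patch: the class 
SchedulerSketch and its methods are hypothetical stand-ins, and the null guard 
is only one possible shape a fix could take.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the scheduler; NOT the real CapacityScheduler API.
class SchedulerSketch {
    private final Set<String> queues;                          // configured queues
    private final Map<String, String> apps = new HashMap<>();  // appId -> queue

    SchedulerSketch(Set<String> queues) {
        this.queues = queues;
    }

    /** Mimics addApp: returns false (app rejected) when the queue is unknown. */
    boolean addApplication(String appId, String queue) {
        if (!queues.contains(queue)) {
            return false; // rejected, so nothing is recorded for this app
        }
        apps.put(appId, queue);
        return true;
    }

    /** Mimics the recovery path: assumes the app exists and dereferences it. */
    int addAttemptUnguarded(String appId) {
        String queue = apps.get(appId); // null if the app was rejected
        return queue.length();          // throws NullPointerException on null
    }

    /** Assumed guard: skip the attempt when the app is unknown to the scheduler. */
    boolean addAttemptGuarded(String appId) {
        String queue = apps.get(appId);
        if (queue == null) {
            return false; // app was rejected earlier; do not add the attempt
        }
        return true;
    }
}
```

With an app whose queue has been removed, addApplication returns false, the 
unguarded call then throws the NPE, and the guarded call just returns false.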

The fix makes sense to me. Some additional comments:

bq. + conf.setBoolean(YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_ENABLED, 

It should be true to imitate the failure case in the description, right? 
According to AttemptRecoveredTransition, if isWorkPreservingRecoveryEnabled = 
true, AppAttemptAddedSchedulerEvent will not be scheduled. However, whether or 
not AppAttemptAddedSchedulerEvent is scheduled, the app should eventually get 
rejected, shouldn't it? What was the test failure when 
isWorkPreservingRecoveryEnabled = false?

> NPE happened when RM restart after CapacityScheduler queue configuration 
> changed 
> ---------------------------------------------------------------------------------
>                 Key: YARN-2308
>                 URL: https://issues.apache.org/jira/browse/YARN-2308
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.6.0
>            Reporter: Wangda Tan
>            Assignee: chang li
>            Priority: Critical
>         Attachments: jira2308.patch, jira2308.patch, jira2308.patch
> I encountered an NPE when the RM restarted:
> {code}
> 2014-07-16 07:22:46,957 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> As a result, the RM fails to restart.
> This is caused by a change in the queue configuration: I removed some queues 
> and added new ones. When the RM restarts, it tries to recover previous 
> applications, and if the queue of any of these applications has been removed, 
> an NPE is raised.
