[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed

Jian He (JIRA) Mon, 08 Sep 2014 14:41:47 -0700

    [ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126156#comment-14126156 ]


Jian He commented on YARN-2308:
-------------------------------

Looked at this again, I think the solution mentioned by [~sunilg] is reasonable:
bq. During RMAppRecoveredTransition in RMAppImpl, maybe we can check whether the 
recovered app's queue (available from the submission context) is still a valid 
queue. If the queue is no longer present, recovery for that app can be failed, 
and we may need to do some more RMApp cleanup. Sounds doable?
We can check whether the queue still exists during recovery. If it does not, 
we return the FAILED state directly and skip adding the attempts. Thoughts?
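
To make the proposal concrete, here is a minimal, self-contained sketch of the check. It is not the actual RMAppImpl code: the class RecoverySketch, the enum RecoveredState, and the method recover are hypothetical stand-ins, and the set of configured queue names is assumed to be available at recovery time.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-ins for illustration only; these are not YARN classes.
class RecoverySketch {
    enum RecoveredState { ACCEPTED, FAILED }

    /**
     * Decide the state of a recovered application.
     * submittedQueue is the queue name taken from the app's submission context;
     * configuredQueues holds the queue names present in the current configuration.
     */
    static RecoveredState recover(String submittedQueue, Set<String> configuredQueues) {
        if (submittedQueue == null || !configuredQueues.contains(submittedQueue)) {
            // The queue was removed from the configuration: fail the app here
            // and do not add its attempts, so the scheduler never sees it.
            return RecoveredState.FAILED;
        }
        // The queue still exists: continue with normal recovery.
        return RecoveredState.ACCEPTED;
    }

    public static void main(String[] args) {
        Set<String> queues = new HashSet<>(Arrays.asList("default", "analytics"));
        System.out.println(recover("default", queues));
        System.out.println(recover("removed-queue", queues));
    }
}
```

The point of the sketch is only the ordering: the queue lookup happens before any attempt is handed to the scheduler, so a missing queue ends in a FAILED app rather than a null dereference later.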


> NPE happened when RM restart after CapacityScheduler queue configuration 
> changed 
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-2308
>                 URL: https://issues.apache.org/jira/browse/YARN-2308
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.6.0
>            Reporter: Wangda Tan
>            Assignee: chang li
>            Priority: Critical
>         Attachments: jira2308.patch, jira2308.patch, jira2308.patch
>
>
> I encountered an NPE when the RM restarted:
> {code}
> 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> As a result, the RM fails to restart.
> This is caused by a queue configuration change: I removed some queues and 
> added new ones. When the RM restarts, it tries to recover historical 
> applications, and if the queue of any of these applications has been removed, 
> an NPE is raised.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
