[
https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095036#comment-14095036
]
Zhijie Shen commented on YARN-2308:
-----------------------------------
Actually, the app is rejected at AppAddedSchedulerEvent, but as I mentioned
above, AppAttemptAddedSchedulerEvent is scheduled regardless of whether the app
has been added to CS. In fact, in recovery mode, RMApp will also enter ACCEPTED
regardless of whether the app has been added.
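To make the sequence concrete, here is a toy model of it, not the actual CapacityScheduler code and not the attached patch. All names in it (RecoveryRaceSketch, ToyScheduler, addAttemptUnguarded, addAttemptGuarded) are hypothetical and exist only for illustration. It assumes the scheduler tracks added apps in a map and that the attempt handler looks the app up without checking the result.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, not the real CapacityScheduler API.
class RecoveryRaceSketch {

    static class ToyScheduler {
        private final Set<String> queues;
        // appId -> queue name, filled only for apps that were accepted
        private final Map<String, String> applications = new HashMap<>();

        ToyScheduler(String... queueNames) {
            this.queues = new HashSet<>(Arrays.asList(queueNames));
        }

        // Stands in for handling APP_ADDED: reject when the queue is gone.
        boolean addApplication(String appId, String queue) {
            if (!queues.contains(queue)) {
                return false; // rejected, nothing is recorded for this app
            }
            applications.put(appId, queue);
            return true;
        }

        // Stands in for handling APP_ATTEMPT_ADDED without any check.
        int addAttemptUnguarded(String appId) {
            String queue = applications.get(appId); // null for a rejected app
            return queue.length(); // NullPointerException when queue is null
        }

        // Same event, but attempts of unknown apps are skipped.
        boolean addAttemptGuarded(String appId) {
            String queue = applications.get(appId);
            if (queue == null) {
                return false; // app was never added, so ignore its attempt
            }
            return true;
        }
    }

    public static void main(String[] args) {
        ToyScheduler cs = new ToyScheduler("default");

        // Recovery replays an app whose queue was removed from the config.
        boolean added = cs.addApplication("app_1", "removedQueue");
        System.out.println("app added: " + added);

        // The attempt event is sent regardless of the result above.
        try {
            cs.addAttemptUnguarded("app_1");
        } catch (NullPointerException e) {
            System.out.println("unguarded attempt handler threw NPE");
        }
        System.out.println("guarded attempt accepted: "
            + cs.addAttemptGuarded("app_1"));
    }
}
```

The guarded variant only shows one possible shape of a narrow fix; whether a check like this belongs in the scheduler or on the RMApp side is exactly the open question here.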
A thorough fix might be to move the recovered app to another state, wait for
the event from CS before moving it to ACCEPTED, and only then recover the
attempts, including scheduling AppAttemptAddedSchedulerEvent. My feeling is
that this is overkill if we only want to fix this single race condition.
Thoughts?
bq. I think set RM_WORK_PRESERVING_RECOVERY_ENABLED=true in test should be
enough for this fix.
RM_WORK_PRESERVING_RECOVERY_ENABLED=true reflects the failure case in the
description, but I'm wondering why the test would fail with
RM_WORK_PRESERVING_RECOVERY_ENABLED=false. The app will be rejected anyway,
won't it?
> NPE happened when RM restart after CapacityScheduler queue configuration
> changed
> ---------------------------------------------------------------------------------
>
> Key: YARN-2308
> URL: https://issues.apache.org/jira/browse/YARN-2308
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager, scheduler
> Affects Versions: 2.6.0
> Reporter: Wangda Tan
> Assignee: chang li
> Priority: Critical
> Attachments: jira2308.patch, jira2308.patch, jira2308.patch
>
>
> I encountered an NPE when the RM restarted:
> {code}
> 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     at java.lang.Thread.run(Thread.java:744)
> {code}
> As a result, the RM fails to restart.
> This was caused by a queue configuration change: I removed some queues and
> added new ones. When the RM restarts, it tries to recover historical
> applications, and if the queue of any of these applications has been removed,
> an NPE is raised.