[
https://issues.apache.org/jira/browse/YARN-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204128#comment-14204128
]
Vinod Kumar Vavilapalli commented on YARN-2834:
-----------------------------------------------
bq. Even in the regular case, RM doesn't fail the app if token renew fails; why
do we need to fail the app if token-renew fails on recovery?
After more discussion with [~jianhe] offline: for things like Timeline tokens,
which are obtained automatically whether or not the app needs them (we should
fix this to be user driven), we can ignore failures. But for HDFS tokens etc.,
ignoring failures is bad because (1) it wastes resources, as AMs will continue
to run and eventually fail, and (2) the app doesn't know what happened when it
eventually fails.
Anyway, the handling of renewal failures is broken today. I am okay with
ignoring renewal failures during recovery in this ticket, but let's file a
blocker for handling them correctly in 2.7.
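
To make the distinction above concrete, here is a minimal sketch of the
decision being discussed. It is only an illustration, not the RM's actual
code: the class RenewalFailurePolicy, the Action enum, and the method
onRenewalFailure are hypothetical names, and the token kind strings are used
purely as examples. The FAIL_APP branch reflects the handling proposed for the
follow-up blocker, not what the RM does today.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the YARN code base. */
final class RenewalFailurePolicy {

    enum Action { IGNORE, FAIL_APP }

    // Assumption: kinds of tokens that are obtained automatically, whether or
    // not the app needs them. Only the Timeline token is listed as an example.
    private static final Set<String> AUTO_OBTAINED_KINDS =
            Collections.singleton("TIMELINE_DELEGATION_TOKEN");

    /** Decides what to do when renewing a token of the given kind fails. */
    static Action onRenewalFailure(String tokenKind, boolean duringRecovery) {
        if (duringRecovery) {
            // Scope of this ticket: never fail an app because of a renewal
            // failure while the RM is recovering.
            return Action.IGNORE;
        }
        if (AUTO_OBTAINED_KINDS.contains(tokenKind)) {
            // The app may not need this token at all, so a failure is harmless.
            return Action.IGNORE;
        }
        // Proposed follow-up behaviour: fail early instead of letting the AM
        // run and fail later without a clear cause.
        return Action.FAIL_APP;
    }

    public static void main(String[] args) {
        System.out.println(onRenewalFailure("TIMELINE_DELEGATION_TOKEN", false));
        System.out.println(onRenewalFailure("HDFS_DELEGATION_TOKEN", false));
        System.out.println(onRenewalFailure("HDFS_DELEGATION_TOKEN", true));
    }
}
```

Keeping the recovery check first means a renewal failure can never fail an app
during RM restart, whatever the token kind, which is the scope agreed for this
ticket.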
> Resource manager crashed with Null Pointer Exception
> ----------------------------------------------------
>
> Key: YARN-2834
> URL: https://issues.apache.org/jira/browse/YARN-2834
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yesha Vora
> Assignee: Jian He
> Priority: Critical
> Attachments: YARN-2834.1.patch
>
>
> Resource manager failed after restart.
> {noformat}
> 2014-11-09 04:12:53,013 INFO capacity.CapacityScheduler (CapacityScheduler.java:initializeQueues(467)) - Initialized root queue root: numChildQueue= 2, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
> 2014-11-09 04:12:53,013 INFO capacity.CapacityScheduler (CapacityScheduler.java:initializeQueueMappings(436)) - Initialized queue mappings, override: false
> 2014-11-09 04:12:53,013 INFO capacity.CapacityScheduler (CapacityScheduler.java:initScheduler(305)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<<memory:256, vCores:1>>, maximumAllocation=<<memory:2048, vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms
> 2014-11-09 04:12:53,015 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state STARTED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:734)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1089)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1041)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1005)
> at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:821)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:101)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:843)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:826)
> at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:701)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:413)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1207)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:590)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1014)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1051)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1047)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1047)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1091)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
> {noformat}