[ https://issues.apache.org/jira/browse/YARN-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782274#comment-13782274 ]
Jian He commented on YARN-1255:
-------------------------------

The RM might have been killed while it was saving the app data (after the app file was created but before the data was written into it). When the RM recovers, it loads an empty file and hits a NullPointerException. I reproduced this locally and saw the same exception stack.

> RM fails to start up with Failed to load/recover state error in a HA setup
> --------------------------------------------------------------------------
>
>                 Key: YARN-1255
>                 URL: https://issues.apache.org/jira/browse/YARN-1255
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.1.1-beta
>            Reporter: Arpit Gupta
>
> {code}
> 2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:parseQueue(408)) - Initialized queue: default:
> capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0,
> vCores:0>usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0,
> numContainers=0
> 2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:parseQueue(408)) - Initialized queue: root:
> numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0,
> usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
> 2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:initializeQueues(306)) - Initialized root queue root:
> numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0,
> usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
> 2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:reinitialize(270)) - Initialized CapacityScheduler
> with calculator=class
> org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator,
> minimumAllocation=<<memory:1024, vCores:1>>, maximumAllocation=<<memory:8192,
> vCores:32>>
> 2013-09-30 09:12:09,240 INFO event.AsyncDispatcher
> (AsyncDispatcher.java:register(157)) - Registering class
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
> 2013-09-30 09:12:09,250 INFO event.AsyncDispatcher
> (AsyncDispatcher.java:register(157)) - Registering class
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType
> for class
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
> 2013-09-30 09:12:09,252 INFO resourcemanager.RMNMInfo
> (RMNMInfo.java:<init>(63)) - Registered RMNMInfo MBean
> 2013-09-30 09:12:09,253 INFO util.HostsFileReader
> (HostsFileReader.java:refresh(84)) - Refreshing hosts (include/exclude) list
> 2013-09-30 09:12:09,278 INFO security.UserGroupInformation
> (UserGroupInformation.java:loginUserFromKeytab(843)) - Login successful for
> user rm/hostname@realm using keytab file
> /etc/security/keytabs/rm.service.keytab
> 2013-09-30 09:12:09,278 INFO security.RMContainerTokenSecretManager
> (RMContainerTokenSecretManager.java:rollMasterKey(103)) - Rolling master-key
> for container-tokens
> 2013-09-30 09:12:09,279 INFO security.AMRMTokenSecretManager
> (AMRMTokenSecretManager.java:rollMasterKey(107)) - Rolling master-key for
> amrm-tokens
> 2013-09-30 09:12:09,281 INFO security.NMTokenSecretManagerInRM
> (NMTokenSecretManagerInRM.java:rollMasterKey(97)) - Rolling master-key for
> nm-tokens
> 2013-09-30 09:12:10,196 INFO recovery.FileSystemRMStateStore
> (FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from
> node: application_1380531989689_0002
> 2013-09-30 09:12:10,217 INFO recovery.FileSystemRMStateStore
> (FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from
> node: application_1380531989689_0003
> 2013-09-30 09:12:10,232 INFO security.RMDelegationTokenSecretManager
> (RMDelegationTokenSecretManager.java:recover(181)) - recovering
> RMDelegationTokenSecretManager.
> 2013-09-30 09:12:10,234 INFO resourcemanager.RMAppManager
> (RMAppManager.java:recover(329)) - Recovering 2 applications
> 2013-09-30 09:12:10,234 ERROR resourcemanager.ResourceManager
> (ResourceManager.java:serviceStart(640)) - Failed to load/recover state
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:332)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:842)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:636)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:855)
> 2013-09-30 09:12:10,236 INFO util.ExitUtil (ExitUtil.java:terminate(124)) -
> Exiting with status 1
> 2013-09-30 09:17:20,144 INFO resourcemanager.ResourceManager
> (StringUtils.java:startupShutdownMessage(601)) - STARTUP_MSG:
> {code}

--
This message was sent by Atlassian JIRA
(v6.1#6144)
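
For illustration only: the failure mode described in the comment at the top of this message (an app file that already exists but is still empty when the RM dies) can be sketched in a few lines of Java. This sketch uses plain java.nio.file on a local filesystem rather than the Hadoop FileSystem API, and every name in it (AppStateFileSketch, unsafeStore, atomicStore, load) is hypothetical. It is not the FileSystemRMStateStore code and not necessarily the fix committed for this issue; it only shows one common way to avoid leaving a half-written file visible (write to a temporary sibling, then rename) and one way to tolerate an empty file on load.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

// Hypothetical sketch; names and structure are assumptions, not Hadoop code.
final class AppStateFileSketch {

    // Create-then-write: the file is visible (and empty) before any data lands.
    // crashAfterCreate simulates the process being killed between the two steps.
    static void unsafeStore(Path file, byte[] data, boolean crashAfterCreate)
            throws IOException {
        Files.createFile(file);
        if (crashAfterCreate) {
            return; // a zero-length file is left behind
        }
        Files.write(file, data);
    }

    // Write-then-rename: the final name only appears once the data is complete.
    // Assumes the temporary file and the target live in the same directory.
    static void atomicStore(Path file, byte[] data) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    // Defensive load: a missing or zero-length file is reported as "no state"
    // instead of being handed to the caller as if it held valid data.
    static Optional<String> load(Path file) throws IOException {
        if (!Files.exists(file) || Files.size(file) == 0) {
            return Optional.empty();
        }
        return Optional.of(
                new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rm-state-sketch");
        byte[] state = "app-state".getBytes(StandardCharsets.UTF_8);

        Path partial = dir.resolve("app_partial");
        unsafeStore(partial, state, true);
        System.out.println("interrupted store has state: "
                + load(partial).isPresent());

        Path complete = dir.resolve("app_complete");
        atomicStore(complete, state);
        System.out.println("atomic store has state: "
                + load(complete).isPresent());
    }
}
```

With the rename approach, a crash before the move leaves only the ".tmp" file, so a loader that looks at final names never sees a partially written entry. The empty-file check in load is a second line of defence for files written by older code.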