[ 
https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782363#comment-13782363
 ] 

Arpit Gupta commented on YARN-1185:
-----------------------------------

Here is the stack trace from the RM when it tries to recover partially written 
data:

{code}
2013-09-30 09:12:09,206 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:parseQueue(408)) - Initialized queue: default: 
capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, 
vCores:0>usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:parseQueue(408)) - Initialized queue: root: 
numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, 
vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:initializeQueues(306)) - Initialized root queue root: 
numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, 
vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:reinitialize(270)) - Initialized CapacityScheduler with 
calculator=class 
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, 
minimumAllocation=<<memory:1024, vCores:1>>, maximumAllocation=<<memory:8192, 
vCores:32>>
2013-09-30 09:12:09,240 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:register(157)) - Registering class 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
2013-09-30 09:12:09,250 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:register(157)) - Registering class 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType 
for class 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
2013-09-30 09:12:09,252 INFO  resourcemanager.RMNMInfo 
(RMNMInfo.java:<init>(63)) - Registered RMNMInfo MBean
2013-09-30 09:12:09,253 INFO  util.HostsFileReader 
(HostsFileReader.java:refresh(84)) - Refreshing hosts (include/exclude) list
2013-09-30 09:12:09,278 INFO  security.UserGroupInformation 
(UserGroupInformation.java:loginUserFromKeytab(843)) - Login successful for 
user rm/hostname@realm using keytab file /etc/security/keytabs/rm.service.keytab
2013-09-30 09:12:09,278 INFO  security.RMContainerTokenSecretManager 
(RMContainerTokenSecretManager.java:rollMasterKey(103)) - Rolling master-key 
for container-tokens
2013-09-30 09:12:09,279 INFO  security.AMRMTokenSecretManager 
(AMRMTokenSecretManager.java:rollMasterKey(107)) - Rolling master-key for 
amrm-tokens
2013-09-30 09:12:09,281 INFO  security.NMTokenSecretManagerInRM 
(NMTokenSecretManagerInRM.java:rollMasterKey(97)) - Rolling master-key for 
nm-tokens
2013-09-30 09:12:10,196 INFO  recovery.FileSystemRMStateStore 
(FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from 
node: application_1380531989689_0002
2013-09-30 09:12:10,217 INFO  recovery.FileSystemRMStateStore 
(FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from 
node: application_1380531989689_0003
2013-09-30 09:12:10,232 INFO  security.RMDelegationTokenSecretManager 
(RMDelegationTokenSecretManager.java:recover(181)) - recovering 
RMDelegationTokenSecretManager.
2013-09-30 09:12:10,234 INFO  resourcemanager.RMAppManager 
(RMAppManager.java:recover(329)) - Recovering 2 applications
2013-09-30 09:12:10,234 ERROR resourcemanager.ResourceManager 
(ResourceManager.java:serviceStart(640)) - Failed to load/recover state
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:332)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:842)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:636)
        at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:855)
2013-09-30 09:12:10,236 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
Exiting with status 1
2013-09-30 09:17:20,144 INFO  resourcemanager.ResourceManager 
(StringUtils.java:startupShutdownMessage(601)) - STARTUP_MSG:
{code}

> FileSystemRMStateStore can leave partial files that prevent subsequent 
> recovery
> -------------------------------------------------------------------------------
>
>                 Key: YARN-1185
>                 URL: https://issues.apache.org/jira/browse/YARN-1185
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>
> FileSystemRMStateStore writes directly to the destination file when storing 
> state. However, if the RM were to crash in the middle of the write, the 
> recovery method could encounter a partially written file and either crash 
> outright during recovery or silently load incomplete state.
> To avoid this, the data should be written to a temporary file and renamed to 
> the destination file afterwards.
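
A minimal sketch of the write-to-temporary-then-rename approach described above. It uses plain java.nio.file rather than the Hadoop FileSystem API so that it is self-contained; the class name AtomicStateWriter, the ".tmp" suffix, and both method names are hypothetical and are not taken from FileSystemRMStateStore or from any patch on this issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not Hadoop code): store state via a temporary file, then rename. */
final class AtomicStateWriter {
    // Assumed marker for in-progress writes; the real store may use a different convention.
    static final String TMP_SUFFIX = ".tmp";

    /** Writes data to a sibling temporary file and renames it onto dest. */
    static void store(Path dest, byte[] data) throws IOException {
        Path tmp = dest.resolveSibling(dest.getFileName() + TMP_SUFFIX);
        // A crash during this write leaves only the temporary file behind;
        // dest is either absent or still holds the previous complete state.
        // Note: this sketch does not force the data to disk before renaming.
        Files.write(tmp, data);
        // Publish the finished file in a single step. Throws
        // AtomicMoveNotSupportedException if the file system cannot do this;
        // whether an existing dest is replaced is implementation specific.
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Recovery can use this to skip (or delete) files left by an interrupted write. */
    static boolean isPartial(Path file) {
        return file.getFileName().toString().endsWith(TMP_SUFFIX);
    }
}
```

With this layout, a recovery pass that ignores every file for which isPartial returns true should only ever see fully written state files, which would avoid the kind of failure shown in the stack trace.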



