[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782363#comment-13782363 ]
Arpit Gupta commented on YARN-1185:
-----------------------------------
Here is the stack trace from the RM when it tries to recover from partially
written data:
{code}
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
(CapacityScheduler.java:parseQueue(408)) - Initialized queue: default:
capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0,
vCores:0>usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
(CapacityScheduler.java:parseQueue(408)) - Initialized queue: root:
numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0,
vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
(CapacityScheduler.java:initializeQueues(306)) - Initialized root queue root:
numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0,
vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler
(CapacityScheduler.java:reinitialize(270)) - Initialized CapacityScheduler with
calculator=class
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator,
minimumAllocation=<<memory:1024, vCores:1>>, maximumAllocation=<<memory:8192,
vCores:32>>
2013-09-30 09:12:09,240 INFO event.AsyncDispatcher
(AsyncDispatcher.java:register(157)) - Registering class
org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
2013-09-30 09:12:09,250 INFO event.AsyncDispatcher
(AsyncDispatcher.java:register(157)) - Registering class
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType
for class
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
2013-09-30 09:12:09,252 INFO resourcemanager.RMNMInfo
(RMNMInfo.java:<init>(63)) - Registered RMNMInfo MBean
2013-09-30 09:12:09,253 INFO util.HostsFileReader
(HostsFileReader.java:refresh(84)) - Refreshing hosts (include/exclude) list
2013-09-30 09:12:09,278 INFO security.UserGroupInformation
(UserGroupInformation.java:loginUserFromKeytab(843)) - Login successful for
user rm/hostname@realm using keytab file /etc/security/keytabs/rm.service.keytab
2013-09-30 09:12:09,278 INFO security.RMContainerTokenSecretManager
(RMContainerTokenSecretManager.java:rollMasterKey(103)) - Rolling master-key
for container-tokens
2013-09-30 09:12:09,279 INFO security.AMRMTokenSecretManager
(AMRMTokenSecretManager.java:rollMasterKey(107)) - Rolling master-key for
amrm-tokens
2013-09-30 09:12:09,281 INFO security.NMTokenSecretManagerInRM
(NMTokenSecretManagerInRM.java:rollMasterKey(97)) - Rolling master-key for
nm-tokens
2013-09-30 09:12:10,196 INFO recovery.FileSystemRMStateStore
(FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from
node: application_1380531989689_0002
2013-09-30 09:12:10,217 INFO recovery.FileSystemRMStateStore
(FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from
node: application_1380531989689_0003
2013-09-30 09:12:10,232 INFO security.RMDelegationTokenSecretManager
(RMDelegationTokenSecretManager.java:recover(181)) - recovering
RMDelegationTokenSecretManager.
2013-09-30 09:12:10,234 INFO resourcemanager.RMAppManager
(RMAppManager.java:recover(329)) - Recovering 2 applications
2013-09-30 09:12:10,234 ERROR resourcemanager.ResourceManager
(ResourceManager.java:serviceStart(640)) - Failed to load/recover state
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:332)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:842)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:636)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:855)
2013-09-30 09:12:10,236 INFO util.ExitUtil (ExitUtil.java:terminate(124)) -
Exiting with status 1
2013-09-30 09:17:20,144 INFO resourcemanager.ResourceManager
(StringUtils.java:startupShutdownMessage(601)) - STARTUP_MSG:
{code}
> FileSystemRMStateStore can leave partial files that prevent subsequent
> recovery
> -------------------------------------------------------------------------------
>
> Key: YARN-1185
> URL: https://issues.apache.org/jira/browse/YARN-1185
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
>
> FileSystemRMStateStore writes directly to the destination file when storing
> state. However, if the RM were to crash in the middle of the write, the
> recovery method could encounter a partially written file and either crash
> outright during recovery or silently load incomplete state.
> To avoid this, the data should be written to a temporary file and renamed to
> the destination file afterwards.
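
For illustration, the write-to-temporary-then-rename approach proposed in the
description could look roughly like the sketch below. This is not the actual
FileSystemRMStateStore code or the eventual patch: it uses java.nio.file on a
local filesystem instead of Hadoop's FileSystem API so that it stays
self-contained, and the class name, method name, and ".tmp" suffix are all
hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Hadoop: a sketch of "write to a temporary
// file, then rename" using java.nio.file on a local filesystem.
final class AtomicStateWriter {

    private AtomicStateWriter() {
    }

    // Writes data to a sibling temporary file first, then moves it over dest.
    // A crash during the write leaves at most a stray "<name>.tmp" file;
    // dest itself is never left partially written.
    static void writeAtomically(Path dest, byte[] data) throws IOException {
        // The ".tmp" suffix is an assumption made for this sketch.
        Path tmp = dest.resolveSibling(dest.getFileName() + ".tmp");

        // Creates the temporary file, or truncates a leftover from an
        // earlier crash, and writes the full payload.
        Files.write(tmp, data);

        // Atomic rename within the same directory. Whether an existing dest
        // is replaced or an IOException is thrown is implementation
        // specific, so callers that overwrite state may need extra handling.
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With a scheme like this, the recovery path would presumably also need to
ignore or clean up leftover temporary files so that they are not loaded as
application state.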