[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma moved HADOOP-11305 to YARN-2862:
----------------------------------------
Key: YARN-2862 (was: HADOOP-11305)
Project: Hadoop YARN (was: Hadoop Common)
> RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
> ---------------------------------------------------------------------------------------
>
> Key: YARN-2862
> URL: https://issues.apache.org/jira/browse/YARN-2862
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ming Ma
>
> This might be a known issue. Given that FileSystemRMStateStore isn't used in HA
> scenarios, it might not be that important, unless there is something we need
> to fix at the RM layer to make it more tolerant of RM state store issues.
> When the RM host was hard shut down, the OS might not have had a chance to
> persist blocks to disk. Some of the stored application data ended up with size
> zero after reboot, and the RM could not recover from that.
> {noformat}
> ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
> total 156
> drwxr-xr-x. 2 x y 4096 Nov 13 16:45 .
> drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
> -rw-r--r--. 1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_000001
> -rw-r--r--. 1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_000001.crc
> -rw-r--r--. 1 x y 0 Nov 13 16:45 application_1412702189634_324351
> -rw-r--r--. 1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc
> {noformat}
> When the RM starts up, it logs the following and fails to recover:
> {noformat}
> 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception:
> java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at java.io.DataInputStream.readFully(DataInputStream.java:169)
>     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
>     at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
> ...
> 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
> {noformat}
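
One way the recovery path could be made more tolerant, as the description suggests, is to skip state files that are empty instead of trying to deserialize them. The following is only a rough sketch of that idea and not the actual FileSystemRMStateStore code: it uses plain java.nio.file rather than Hadoop's FileSystem API, and the class name TolerantStateLoader and the method loadableFiles are hypothetical names made up for this illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows skipping empty state files
// during recovery instead of failing on them.
public class TolerantStateLoader {

    /**
     * Returns the regular, non-empty files in appDir that are worth loading.
     * Checksum side files (*.crc) and zero-length files are left out.
     */
    public static List<Path> loadableFiles(Path appDir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(appDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (!Files.isRegularFile(entry) || name.endsWith(".crc")) {
                    continue; // not a state file
                }
                if (Files.size(entry) == 0) {
                    // Assumption: a zero-length file never reached disk before
                    // the hard shutdown, so warn and skip it.
                    System.err.println("Skipping empty state file: " + entry);
                    continue;
                }
                result.add(entry);
            }
        }
        return result;
    }
}
```

With the directory listing above, such a check would return an empty list for application_1412702189634_324351, so the caller would have to decide whether to drop that application from recovery or fail with a clearer error than a NullPointerException.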
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)