[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212935#comment-14212935 ]

Gera Shegalov commented on YARN-2862:
-------------------------------------

[~jianhe], to add more details: we use 2.4+patches, YARN-1185 is in 2.3.

> RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-2862
>                 URL: https://issues.apache.org/jira/browse/YARN-2862
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ming Ma
>
> This might be a known issue. Given FileSystemRMStateStore isn't used for HA
> scenario, it might not be that important, unless there is something we need
> to fix at RM layer to make it more tolerant to RMStore issue.
> When RM was hard shutdown, OS might not get a chance to persist blocks. Some
> of the stored application data end up with size zero after reboot. And RM
> didn't like that.
> {noformat}
> ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
> total 156
> drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
> drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
> -rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_000001
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_000001.crc
> -rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
> {noformat}
> When RM starts up
> {noformat}
> 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
> ...
> 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
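
For anyone who wants to check whether a local state store shows the same symptom as the directory listing in the description, here is a minimal diagnostic sketch. It is not part of Hadoop and does not touch the RM. It only walks a directory tree and reports zero-length files. The function name `find_empty_state_files` is made up for illustration, and the root path is whatever directory your FileSystemRMStateStore is configured to write to.

```python
import os
import sys


def find_empty_state_files(root):
    """Return a sorted list of zero-length regular files under root.

    Hypothetical helper for illustration only: it reports files such as
    the size-0 application_* and appattempt_* entries (and their .crc
    companions) described in the issue. It does not delete or repair
    anything.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip anything that is not a plain file (e.g. broken symlinks).
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_empty.py <state-store-root>
    for p in find_empty_state_files(sys.argv[1]):
        print(p)
```

Run against the store root while the RM is stopped, this would list the suspect files. What to do with them (move aside, delete, or let the RM skip them) is a separate question and is what this issue is about.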