[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212989#comment-14212989
 ] 

Zhijie Shen commented on YARN-2862:
-----------------------------------

It is likely that the assumption we made in 
[YARN-1776|https://issues.apache.org/jira/browse/YARN-1776?focusedCommentId=13942201&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13942201]
 is not fully correct.

When updating a state file, we (1) write the new file to .new, (2) delete the 
existing one, and (3) rename the .new to the existing file name. If a crash 
happens before (2), we use the .new file to recover the state file when 
loading the state (see FileSystemRMStateStore#checkAndResumeUpdateOperation).
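
For illustration, the update-and-resume sequence above can be sketched as 
follows. This is a minimal sketch, not the actual FileSystemRMStateStore 
code: it uses java.nio.file on a local directory instead of the Hadoop 
FileSystem API, and the class and method names (StateFileUpdateSketch, 
newPathFor, update, resumeUpdate) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StateFileUpdateSketch {

    // Hypothetical helper: the temporary ".new" path next to the state file.
    static Path newPathFor(Path target) {
        return target.resolveSibling(target.getFileName() + ".new");
    }

    // Steps (1)-(3): write .new, delete the existing file, rename .new.
    static void update(Path target, byte[] data) throws IOException {
        Path tmp = newPathFor(target);
        Files.write(tmp, data);        // (1) write the new content to .new
        Files.deleteIfExists(target);  // (2) delete the existing file
        Files.move(tmp, target);       // (3) rename .new to the existing name
    }

    // Recovery at load time: if a .new file was left behind, assume the
    // previous update stopped half way and finish it with that file.
    static void resumeUpdate(Path target) throws IOException {
        Path tmp = newPathFor(target);
        if (Files.exists(tmp)) {
            Files.deleteIfExists(target);
            Files.move(tmp, target);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore");
        Path state = dir.resolve("application_1");
        update(state, "v1".getBytes(StandardCharsets.UTF_8));

        // Simulate a crash after step (1): a complete .new sits beside the old file.
        Files.write(newPathFor(state), "v2".getBytes(StandardCharsets.UTF_8));
        resumeUpdate(state);

        System.out.println(new String(Files.readAllBytes(state), StandardCharsets.UTF_8));
    }
}
```

In this sketch resumeUpdate trusts whatever .new file it finds; it never 
checks whether step (1) actually finished.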

According to the description here, the RM can crash while (1) is in progress 
and leave a corrupted .new file. It seems that we need additional validation 
to check whether the .new file is corrupted, or we can simply ignore it.
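
One possible shape for such a validation is sketched below. It is only an 
illustration under stated assumptions, not a proposed patch: the framing 
(4-byte length, 8-byte CRC32 value, then the payload) is invented for this 
example, and the class and method names are hypothetical. It again uses 
java.nio.file rather than the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class NewFileValidationSketch {

    // Assumed framing: 4-byte payload length, 8-byte CRC32 value, payload.
    static byte[] frame(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(12 + payload.length)
                .putInt(payload.length)
                .putLong(crc.getValue())
                .put(payload)
                .array();
    }

    // True only if the file holds one complete frame whose checksum matches.
    static boolean isIntact(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        if (bytes.length < 12) {
            return false; // also covers zero-length files
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int length = buf.getInt();
        long expected = buf.getLong();
        if (length < 0 || length != buf.remaining()) {
            return false; // truncated or over-long payload
        }
        CRC32 crc = new CRC32();
        crc.update(bytes, 12, length);
        return crc.getValue() == expected;
    }

    // Recovery: promote an intact .new; otherwise drop it and keep the old file.
    static void resumeUpdate(Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".new");
        if (!Files.exists(tmp)) {
            return;
        }
        if (isIntact(tmp)) {
            Files.deleteIfExists(target);
            Files.move(tmp, target);
        } else {
            Files.delete(tmp); // ignore the corrupted .new file
        }
    }
}
```

With this sketch, a .new file that was cut short during step (1) is 
discarded and the previous state file stays in place.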

> RM might not start if the machine was hard shutdown and 
> FileSystemRMStateStore was used
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-2862
>                 URL: https://issues.apache.org/jira/browse/YARN-2862
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ming Ma
>
> This might be a known issue. Given FileSystemRMStateStore isn't used for HA 
> scenario, it might not be that important, unless there is something we need 
> to fix at RM layer to make it more tolerant to RMStore issue.
> When RM was hard shutdown, OS might not get a chance to persist blocks. Some 
> of the stored application data end up with size zero after reboot. And RM 
> didn't like that.
> {noformat}
> ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
> total 156
> drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
> drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
> -rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_000001
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_000001.crc
> -rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
> {noformat}
> When RM starts up
> {noformat}
> 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351.  Ignoring exception:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
> ...
> 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
