[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596837#comment-14596837
 ] 

Ming Ma commented on YARN-2862:
-------------------------------

Thanks, [~rohithsharma] and [~leftnoteasy]. Yes, YARN-3410 will be useful. 
Even with it, admins would still need to look through the RM logs to identify 
the affected apps. Would it be useful to provide a new RM startup option that 
deletes or skips such apps automatically?
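
Until such an option exists, a small script could at least replace the log 
search. The following is a minimal sketch, assuming the local-filesystem 
layout shown in the listing quoted below (one directory per application under 
RMAppRoot, with .crc checksum side files). The function name 
find_empty_app_dirs is hypothetical and not part of Hadoop.

```python
import os


def find_empty_app_dirs(rm_app_root):
    """Return the sorted names of application directories under rm_app_root
    that contain at least one zero-length state file.

    Assumption: state files live directly inside each application directory
    and checksum side files end in ".crc", as in the quoted listing.
    """
    affected = []
    for entry in sorted(os.listdir(rm_app_root)):
        app_dir = os.path.join(rm_app_root, entry)
        if not os.path.isdir(app_dir):
            continue
        for name in os.listdir(app_dir):
            # Ignore checksum side files; only the state files matter here.
            if name.endswith(".crc"):
                continue
            path = os.path.join(app_dir, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                affected.append(entry)
                break
    return affected
```

Run against the RMAppRoot directory from the description, this would be 
expected to report application_1412702189634_324351, since its state files 
have size zero. What to do with the reported directories (delete, move aside, 
or skip) is left to the admin.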

> RM might not start if the machine was hard shutdown and 
> FileSystemRMStateStore was used
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-2862
>                 URL: https://issues.apache.org/jira/browse/YARN-2862
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ming Ma
>
> This might be a known issue. Given that FileSystemRMStateStore isn't used in 
> HA scenarios, it might not be that important, unless there is something we 
> need to fix at the RM layer to make it more tolerant of RM store issues.
> When the RM machine is hard shut down, the OS might not get a chance to 
> persist blocks. Some of the stored application data ends up with size zero 
> after reboot, and the RM then fails to load it on startup.
> {noformat}
> ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
> total 156
> drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
> drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
> -rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_000001
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_000001.crc
> -rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
> {noformat}
> When the RM starts up, it logs:
> {noformat}
> 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351.  Ignoring exception:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
> ...
> 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
