[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596837#comment-14596837 ]

Ming Ma commented on YARN-2862:
-------------------------------

Thanks, [~rohithsharma] and [~leftnoteasy]. Yes, YARN-3410 will be useful. So admins still need to look through RM logs to identify those apps. Will it be useful to provide a new RM startup option to delete or skip such apps automatically?

RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
---------------------------------------------------------------------------------------

Key: YARN-2862
URL: https://issues.apache.org/jira/browse/YARN-2862
Project: Hadoop YARN
Issue Type: Bug
Reporter: Ming Ma

This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue.

When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that.

{noformat}
ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
total 156
drwxr-xr-x. 2 x y 4096 Nov 13 16:45 .
drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
-rw-r--r--. 1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01
-rw-r--r--. 1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc
-rw-r--r--. 1 x y 0 Nov 13 16:45 application_1412702189634_324351
-rw-r--r--. 1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc
{noformat}

When RM starts up

{noformat}
2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception:
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
...
2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
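As an illustration of how the affected apps could be identified without reading RM logs, here is a minimal sketch that scans a state-store directory for zero-length state files. It assumes that "corrupted" means "size zero", as in the listing in the issue description; the class and method names are hypothetical, it uses plain java.nio rather than the Hadoop FileSystem API, and it is not an existing RM option or tool.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Hadoop: lists zero-length files under a state-store root. */
final class EmptyStateFileScanner {

  /**
   * Walks the tree below root (for example .../FSRMStateRoot/RMAppRoot) and returns
   * every regular, non-hidden file whose size is zero, sorted by path.
   * Hidden files such as the .crc checksum side files are skipped.
   */
  static List<Path> findEmptyStateFiles(Path root) {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> !p.getFileName().toString().startsWith("."))
          .filter(EmptyStateFileScanner::isEmpty)
          .sorted()
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns true if the file has length zero. */
  private static boolean isEmpty(Path p) {
    try {
      return Files.size(p) == 0L;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

An admin could run something like this against RMAppRoot while the RM is stopped and then decide per app whether to delete the directory. Whether deleting or skipping is the right default is exactly the open question in the comment above.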
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523707#comment-14523707 ]

Wangda Tan commented on YARN-2862:
----------------------------------

[~rohithsharma], I took a quick look and I think YARN-3410 can solve this problem. Do you agree? If so, please resolve this issue.
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214895#comment-14214895 ]

Ming Ma commented on YARN-2862:
-------------------------------

Thanks, [~jira.shegalov], [~jianhe], [~zjshen]. I am able to repro the issue in trunk:
a) pick an application in FileSystemRMStateStore;
b) truncate its state file to zero size, e.g. cat /dev/null > application_<clusterTimestamp>_<id>;
c) restart RM.

The corrupted .new file might be another issue. There is no .new file in this specific case, where the state file has been written or updated from the RM's point of view. However, it appears the state file hadn't been flushed from the OS to disk before the machine was hard shut down.
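Step b) of the repro can also be done programmatically, for instance from a test. The following is only a sketch of that one step: the class name is hypothetical, and it operates on a local path with java.nio rather than through FileSystemRMStateStore.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical test helper: simulates a state file that lost its data in a hard shutdown. */
final class StateFileTruncator {

  /** Truncates an existing file to zero bytes; fails if the file does not exist. */
  static void truncateToZero(Path stateFile) {
    try (FileChannel channel = FileChannel.open(stateFile, StandardOpenOption.WRITE)) {
      channel.truncate(0L);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

After truncating an application's state file this way and restarting the RM, the recovery path would see a zero-length file, which is the condition shown in the issue description.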
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212640#comment-14212640 ]

Ming Ma commented on YARN-2862:
-------------------------------

Here are some possible ways to fix it.
1) Fix RMAppManager's recoverApplication to ignore any unrecoverable app.
2) Fix RawLocalFileSystem used by FileSystemRMStateStore to force sync data to disk device.
3) Fix FileSystemRMStateStore to skip app with null ApplicationState#context.

Sounds like #3 is the best given the usage scenario of FileSystemRMStateStore. Also, RM should expect each implementation of RMStateStore#loadState to load valid ApplicationState into RMState. Thoughts?
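Option 3 from the comment above could look roughly like the sketch below: while loading, an app whose state file is missing or empty is recorded as skipped instead of being handed to recovery, where it would later cause the NullPointerException. This is a simplified model, not the FileSystemRMStateStore code: the class, the LoadResult type, and the use of raw bytes in place of ApplicationState are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified loader: skips apps whose state file holds no data. */
final class TolerantStateLoader {

  /** Usable app states keyed by app id, plus the ids that were skipped. */
  static final class LoadResult {
    final Map<String, byte[]> apps = new TreeMap<>();
    final List<String> skipped = new ArrayList<>();
  }

  /**
   * Expects the layout from the issue description: rmAppRoot/appId/appId.
   * Apps with a missing or zero-length state file go to skipped, not to apps.
   */
  static LoadResult loadApps(Path rmAppRoot) {
    LoadResult result = new LoadResult();
    try (DirectoryStream<Path> appDirs =
        Files.newDirectoryStream(rmAppRoot, p -> Files.isDirectory(p))) {
      for (Path appDir : appDirs) {
        String appId = appDir.getFileName().toString();
        Path stateFile = appDir.resolve(appId);
        byte[] data =
            Files.isRegularFile(stateFile) ? Files.readAllBytes(stateFile) : new byte[0];
        if (data.length == 0) {
          // Nothing to recover from; remember the id so it can be logged or cleaned up.
          result.skipped.add(appId);
          continue;
        }
        result.apps.put(appId, data);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Collections.sort(result.skipped);
    return result;
  }
}
```

Keeping the skipped ids in the result, rather than silently dropping them, leaves room for the RM to log them or for an admin to clean them up afterwards.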
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212671#comment-14212671 ]

Gera Shegalov commented on YARN-2862:
-------------------------------------

[~mingma], it's potentially already fixed by YARN-2010. We can try it for our scenario.
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212836#comment-14212836 ]

Jian He commented on YARN-2862:
-------------------------------

YARN-2010 may not solve this. YARN-1185 might have fixed this.
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212935#comment-14212935 ]

Gera Shegalov commented on YARN-2862:
-------------------------------------

[~jianhe], to add more details: we use 2.4 plus patches, and YARN-1185 is in 2.3.
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212989#comment-14212989 ]

Zhijie Shen commented on YARN-2862:
-----------------------------------

It is likely that the assumption we made in [YARN-1776|https://issues.apache.org/jira/browse/YARN-1776?focusedCommentId=13942201&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13942201] is not fully correct. When updating a state file, we (1) write the new file to .new, (2) delete the existing one, and (3) rename the .new to the existing file name. If a crash happens before (2), we use .new to recover the state file when loading the state (see FileSystemRMStateStore#checkAndResumeUpdateOperation).

According to the description here, RM can crash while (1) is in progress and leave a corrupted .new file. It seems that we have to do additional validation to check whether the .new file is corrupted, or simply ignore it.
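The three-step update and the extra validation discussed in the comment above can be sketched as follows. This is not FileSystemRMStateStore#checkAndResumeUpdateOperation: the class and method names are hypothetical, it uses java.nio on a local path, and "validation" is reduced to a non-empty check, which is an assumption (a real check would more likely verify a checksum). The explicit force(true) call corresponds to the "force sync data to disk" idea raised earlier in the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of the write-.new / delete / rename update with a guarded resume. */
final class SafeStateFileUpdate {

  /** Name of the temporary file used during an update: the state file name plus ".new". */
  static Path newFileFor(Path file) {
    return file.resolveSibling(file.getFileName() + ".new");
  }

  /** (1) write and sync .new, (2) delete the existing file, (3) rename .new into place. */
  static void update(Path file, byte[] data) {
    Path tmp = newFileFor(file);
    try {
      try (FileChannel channel = FileChannel.open(tmp,
          StandardOpenOption.CREATE,
          StandardOpenOption.TRUNCATE_EXISTING,
          StandardOpenOption.WRITE)) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        while (buffer.hasRemaining()) {
          channel.write(buffer);
        }
        // Ask the OS to push file content and metadata to the storage device.
        channel.force(true);
      }
      Files.deleteIfExists(file);
      Files.move(tmp, file);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** On load: finish an interrupted update only if .new looks usable, else discard it. */
  static void resume(Path file) {
    Path tmp = newFileFor(file);
    try {
      if (!Files.exists(tmp)) {
        return;
      }
      if (Files.size(tmp) == 0L) {
        // Assumed corruption signal: an empty .new is ignored and removed.
        Files.delete(tmp);
        return;
      }
      Files.deleteIfExists(file);
      Files.move(tmp, file);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Note that this sketch does not sync the parent directory, so the rename itself may still be lost in a hard shutdown on some file systems. It also cannot recover an app when the crash happens between (2) and (3) and .new turns out to be empty; in that case the app would still have to be skipped at load time.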