[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2015-06-22 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596837#comment-14596837
 ] 

Ming Ma commented on YARN-2862:
---

Thanks, [~rohithsharma] and [~leftnoteasy]. Yes, YARN-3410 will be useful. 
Even so, admins still need to look through the RM logs to identify the 
affected apps. Would it be useful to provide a new RM startup option to delete 
or skip such apps automatically?

 RM might not start if the machine was hard shutdown and 
 FileSystemRMStateStore was used
 ---

 Key: YARN-2862
 URL: https://issues.apache.org/jira/browse/YARN-2862
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ming Ma

 This might be a known issue. Given that FileSystemRMStateStore isn't used in 
 HA scenarios, it might not be that important, unless there is something we 
 need to fix at the RM layer to make it more tolerant of RMStore issues.
 When the RM machine was hard shut down, the OS might not have had a chance to 
 persist blocks to disk. Some of the stored application data ended up with 
 size zero after the reboot, and the RM could not recover from that.
 {noformat}
 ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
 total 156
 drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
 drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
 -rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_01
 -rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc
 -rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
 -rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
 {noformat}
 When RM starts up
 {noformat}
 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351.  Ignoring exception:
 java.io.EOFException
     at java.io.DataInputStream.readFully(DataInputStream.java:197)
     at java.io.DataInputStream.readFully(DataInputStream.java:169)
     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
     at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
 ...
 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
 java.lang.NullPointerException
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2015-05-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523707#comment-14523707
 ] 

Wangda Tan commented on YARN-2862:
--

[~rohithsharma], I took a quick look and I think YARN-3410 can solve this 
problem. Do you agree? If so, please resolve this issue.



[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-17 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214895#comment-14214895
 ] 

Ming Ma commented on YARN-2862:
---

Thanks, [~jira.shegalov], [~jianhe], [~zjshen].

I am able to repro the issue in trunk: a) pick an application in 
FileSystemRMStateStore; b) run cat /dev/null > application__ to truncate its 
state file to zero size; c) restart RM.

The corrupted .new file might be a separate issue. In this specific case there 
is no .new file: from the RM's point of view, the state file had already been 
written or updated. However, it appears the state file had not been flushed 
from the OS to disk before the machine was hard shut down.
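
To illustrate the gap between "written from the RM's point of view" and "on disk", here is a minimal sketch using only JDK file APIs. It is not the FileSystemRMStateStore or RawLocalFileSystem code; the class and method names (DurableStateWriter, writeDurably, roundTrip) are hypothetical and exist only for this example. The idea is that a write is only safe against a hard shutdown once the OS has been asked to flush it to the device.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class DurableStateWriter {
    private DurableStateWriter() {}

    /** Writes data to file and asks the OS to flush it to the storage device. */
    static void writeDurably(Path file, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Without this call the bytes may sit in the OS page cache only,
            // which is the situation a hard shutdown can turn into a
            // zero-length file after reboot.
            channel.force(true);
        }
    }

    /** Small usage example: write a temp file durably and read it back. */
    static byte[] roundTrip(byte[] data) {
        try {
            Path file = Files.createTempFile("rmstate", ".bin");
            writeDurably(file, data);
            byte[] readBack = Files.readAllBytes(file);
            Files.delete(file);
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading the file back, as roundTrip does, only shows that the write path works. It cannot demonstrate durability, since a read is served from the page cache either way.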



[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212640#comment-14212640
 ] 

Ming Ma commented on YARN-2862:
---

Here are some possible ways to fix it.

1) Fix RMAppManager's recoverApplication to ignore any unrecoverable app.
2) Fix RawLocalFileSystem, which FileSystemRMStateStore uses, to force-sync 
data to the disk device.
3) Fix FileSystemRMStateStore to skip any app with a null 
ApplicationState#context.

#3 sounds like the best option given the usage scenario of 
FileSystemRMStateStore. Also, the RM should be able to expect each 
implementation of RMStateStore#loadState to load only valid ApplicationState 
objects into RMState.

Thoughts?
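
A rough sketch of what option 3 could look like, written against plain JDK APIs rather than the real store. Everything here is an assumption made for illustration: the class TolerantStateLoader, its methods, and the flat one-file-per-app layout are hypothetical, and "empty file" stands in for whatever check would detect a null ApplicationState#context in the actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical stand-in for a state store loader; not the Hadoop code. */
final class TolerantStateLoader {
    private TolerantStateLoader() {}

    /**
     * Loads every regular file under appRoot, keyed by file name.
     * Files with no content are logged and skipped, so the caller never
     * sees an entry it cannot recover from.
     */
    static Map<String, byte[]> loadValidApps(Path appRoot) throws IOException {
        Map<String, byte[]> apps = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(appRoot)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                byte[] data = Files.readAllBytes(file);
                if (data.length == 0) {
                    System.err.println("Skipping app with empty state file: " + file);
                    continue;
                }
                apps.put(file.getFileName().toString(), data);
            }
        }
        return apps;
    }

    /** Usage example: one valid and one zero-length state file. */
    static Set<String> demo() {
        try {
            Path root = Files.createTempDirectory("RMAppRoot");
            Files.write(root.resolve("application_good"), new byte[] {42});
            Files.write(root.resolve("application_empty"), new byte[0]);
            return loadValidApps(root).keySet();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Filtering in the loader, rather than in the recovery code that consumes the result, keeps the contract simple: whatever the loader returns is valid.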



[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212671#comment-14212671
 ] 

Gera Shegalov commented on YARN-2862:
-

[~mingma], It's potentially already fixed by YARN-2010. We can try it for our 
scenario.



[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212836#comment-14212836
 ] 

Jian He commented on YARN-2862:
---

YARN-2010 may not solve this. YARN-1185 might have fixed this.



[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212935#comment-14212935
 ] 

Gera Shegalov commented on YARN-2862:
-

[~jianhe], to add more details: we use 2.4+patches, YARN-1185 is in 2.3.



[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-14 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212989#comment-14212989
 ] 

Zhijie Shen commented on YARN-2862:
---

It is likely that the assumption we made in 
[YARN-1776|https://issues.apache.org/jira/browse/YARN-1776?focusedCommentId=13942201&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13942201]
 is not fully correct.

When updating a state file, we (1) write the new content to a .new file, (2) 
delete the existing file, and (3) rename the .new file to the existing file 
name. If a crash happens before (2), we use the .new file to recover the state 
file when loading the state (see 
FileSystemRMStateStore#checkAndResumeUpdateOperation).

According to the description here, the RM can crash while (1) is in progress 
and leave a corrupted .new file behind. It seems that we have to add 
validation to check whether the .new file is corrupted, or simply ignore it.
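
The three-step update and the proposed validation can be sketched as follows, using plain JDK file APIs. This is not the FileSystemRMStateStore implementation: the class NewFileRecovery and its methods are hypothetical, and treating a zero-length .new file as corrupted is an assumption chosen because it matches the symptom reported in this issue. A real check would probably need something stronger, such as a checksum.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the .new update protocol; not the Hadoop code. */
final class NewFileRecovery {
    static final String NEW_SUFFIX = ".new";

    private NewFileRecovery() {}

    /** Steps (1)-(3): write name.new, delete the old file, rename .new into place. */
    static void updateFile(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + NEW_SUFFIX);
        Files.write(tmp, data);          // (1)
        Files.deleteIfExists(target);    // (2)
        Files.move(tmp, target);         // (3)
    }

    /** On load: finish interrupted updates, but drop .new files that look corrupted. */
    static void resumeUpdates(Path dir) throws IOException {
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + NEW_SUFFIX)) {
            for (Path p : stream) {
                pending.add(p);
            }
        }
        for (Path tmp : pending) {
            String name = tmp.getFileName().toString();
            Path target = tmp.resolveSibling(
                    name.substring(0, name.length() - NEW_SUFFIX.length()));
            if (Files.size(tmp) == 0) {
                // Assumed corruption check: an empty .new file is ignored and
                // removed, so the previous state file (if any) is kept.
                Files.delete(tmp);
            } else {
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    /** Usage example covering a corrupted and a complete leftover .new file. */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("rmstore");
            Path app = dir.resolve("application_1");
            updateFile(app, "v1".getBytes(StandardCharsets.UTF_8));

            // Crash while (1) was in progress: an empty .new file is left behind.
            Files.write(dir.resolve("application_1.new"), new byte[0]);
            resumeUpdates(dir);
            String afterCorrupt = new String(Files.readAllBytes(app), StandardCharsets.UTF_8);

            // Crash after (1) completed: a complete .new file is left behind.
            Files.write(dir.resolve("application_1.new"), "v2".getBytes(StandardCharsets.UTF_8));
            resumeUpdates(dir);
            String afterComplete = new String(Files.readAllBytes(app), StandardCharsets.UTF_8);

            return afterCorrupt + "," + afterComplete;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the demo, the empty .new file is discarded and the old content survives, while the complete .new file replaces it.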
