[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340148#comment-14340148 ]

Hadoop QA commented on YARN-2820:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12701354/YARN-2820.007.patch
against trunk revision 4f75b15.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6780//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6780//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6780//console

This message is automatically generated.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340307#comment-14340307 ]

Tsuyoshi Ozawa commented on YARN-2820:

+1, committing this shortly.

Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, YARN-2820.007.patch

Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following IOException cause the RM to shut down.
{code}
2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01
2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
    at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:744)
{code}
As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal.
Stack trace from TestFSRMStateStore failure:
{code}
2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340304#comment-14340304 ]

Tsuyoshi Ozawa commented on YARN-2820:

[~zxu] My bad. Closeable should be idempotent, so it's OK.
http://docs.oracle.com/javase/7/docs/api/java/lang/AutoCloseable.html
Please ignore the above comment.
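The idempotency point in the comment above can be illustrated with a small sketch. It assumes nothing about Hadoop's own FileSystem.close(); GuardedResource and its counter are hypothetical names used only to show what the java.io.Closeable contract asks for, namely that closing an already closed resource has no effect.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class IdempotentCloseSketch {
    // Hypothetical resource (not a Hadoop class) whose close() is idempotent.
    static final class GuardedResource implements Closeable {
        private final AtomicBoolean closed = new AtomicBoolean(false);
        private final AtomicInteger releases = new AtomicInteger();

        @Override
        public void close() {
            // Only the first call flips the flag and releases the resource;
            // later calls see closed == true and do nothing.
            if (closed.compareAndSet(false, true)) {
                releases.incrementAndGet();
            }
        }

        int releaseCount() {
            return releases.get();
        }
    }

    // Closes the same resource twice and reports how often it was released.
    static int releasesAfterTwoCloses() {
        GuardedResource r = new GuardedResource();
        r.close();
        r.close();
        return r.releaseCount();
    }

    public static void main(String[] args) {
        System.out.println(releasesAfterTwoCloses()); // prints 1
    }
}
```

With a guard like this, retrying close() after a failure cannot release the underlying resource a second time, which is the property the comment relies on.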
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340265#comment-14340265 ]

Tsuyoshi Ozawa commented on YARN-2820:

[~zxu] Thank you for updating. I reconsidered closeInternal(). If we call fs.close() twice or more, it can close another file descriptor unexpectedly, which can lead to unexpected behaviour. We should remove closeWithRetries and call fs.close() directly in closeInternal() to avoid the problem. What do you think?
Thank you for dealing with the iterative reviews.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339910#comment-14339910 ]

Tsuyoshi Ozawa commented on YARN-2820:

[~zxu] The try-with-resources statement was introduced in JDK 7 for instances that implement Closeable (more generally, AutoCloseable):
http://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
{code}
try (ByteArrayInputStream is = new ByteArrayInputStream(childData);
     DataInputStream fsIn = new DataInputStream(is);) {
  // processing something here
}
// closes is and fsIn automatically after the block.
{code}
It's useful since we don't need a finally block with a null check.
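To make the try-with-resources behaviour from the comment above concrete, here is a minimal, self-contained sketch. Tracked is a hypothetical stand-in for the two streams (it is not part of the patch); it only records when it is closed, so the order of events becomes visible.

```java
import java.util.ArrayList;
import java.util.List;

final class TryWithResourcesSketch {
    // Hypothetical resource that logs its own close() call.
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Runs a try-with-resources block and returns the recorded events.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked is = new Tracked("is", log);
             Tracked fsIn = new Tracked("fsIn", log)) {
            log.add("body");
        }
        // No finally block and no null checks are needed: both resources
        // were closed when the block exited, in reverse declaration order.
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [body, close fsIn, close is]
    }
}
```

The resources are closed in the reverse order of their declaration, so a wrapper such as fsIn is closed before the stream it wraps.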
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340023#comment-14340023 ]

Hadoop QA commented on YARN-2820:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12701314/YARN-2820.007.patch
against trunk revision 48c7ee7.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1151 javac compiler warnings (more than the trunk's current 205 warnings).
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 47 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/diffJavadocWarnings.txt for details.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6778//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6778//console

This message is automatically generated.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339954#comment-14339954 ]

zhihai xu commented on YARN-2820:

[~ozawa], Cool, I just learned this new syntax. I uploaded a new patch, YARN-2820.007.patch, which uses the try-with-resources statement. Please review it.
thanks
zhihai
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339774#comment-14339774 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12701267/YARN-2820.006.patch
  against trunk revision 8ca0d95.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6775//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6775//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6775//console

This message is automatically generated.

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338589#comment-14338589 ]

Tsuyoshi Ozawa commented on YARN-2820:
--------------------------------------

[~zxu] thanks for the update! The implementation of FSAction looks good to me. I found the following points to be fixed:

1. In startInternal, fs.mkdirs can be replaced with mkdirsWithRetries:
{code}
fs.mkdirs(rmDTSecretManagerRoot);
fs.mkdirs(rmAppRoot);
fs.mkdirs(amrmTokenSecretManagerRoot);
{code}
2. All readFile() calls should be replaced with readFileWithRetries, as was done with writeFileWithRetries.
3. fs.listStatus() should be replaced with listStatusWithRetries.
4. We can use try-with-resources in storeRMDTMasterKeyState to close fsOut. I know it's not related to this patch, but it's better to fix it here.
{code}
DataOutputStream fsOut = new DataOutputStream(os);
{code}

Do you mind updating the patch again?

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch
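The review points above all come down to routing each file-system call through a shared retry wrapper. The thread names FSAction and the *WithRetries helpers but does not show their code, so the following is only a minimal sketch of the general idea. The class FsRetrySketch, the interface FsOp, and the method runWithRetries are hypothetical names chosen for illustration, not the patch's actual implementation.

```java
import java.io.IOException;

// Hypothetical sketch of a retry wrapper; not the code from the YARN-2820 patch.
final class FsRetrySketch {
    /** One file-system operation that may fail with an IOException. */
    interface FsOp<T> {
        T run() throws IOException;
    }

    private final int numRetries;       // extra attempts allowed after the first failure
    private final long retryIntervalMs; // pause between attempts, in milliseconds

    FsRetrySketch(int numRetries, long retryIntervalMs) {
        this.numRetries = numRetries;
        this.retryIntervalMs = retryIntervalMs;
    }

    /** Runs op, retrying on IOException until the retry budget is used up. */
    <T> T runWithRetries(FsOp<T> op) throws IOException {
        int failures = 0;
        while (true) {
            try {
                return op.run();
            } catch (IOException e) {
                if (failures++ >= numRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                try {
                    Thread.sleep(retryIntervalMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting to retry", ie);
                }
            }
        }
    }
}
```

Under these assumptions, a call such as fs.mkdirs(rmAppRoot) would be written as retry.runWithRetries(() -> fs.mkdirs(rmAppRoot)), and the same wrapper could serve the read, list, and close cases, which avoids repeating the retry loop at each call site.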
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338170#comment-14338170 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12700999/YARN-2820.005.patch
  against trunk revision 71385f9.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:red}-1 findbugs{color}. The patch appears to introduce 6 new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6753//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6753//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6753//console

This message is automatically generated.

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339716#comment-14339716 ]

zhihai xu commented on YARN-2820:
---------------------------------

[~ozawa], thanks for your thorough review; I really appreciate it. I uploaded a new patch, YARN-2820.005.patch, which addresses all your comments. It also puts fsIn.close in try-with-resources in loadRMDTSecretManagerState, similar to fsOut.close in storeRMDTMasterKeyState. Please review it.
Thanks,
zhihai

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch
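The stream handling described above, closing fsOut and fsIn through try-with-resources, can be illustrated with a small sketch. It uses in-memory streams so that it runs without a Hadoop cluster. The class StreamCloseSketch, its methods, and the record layout (an int key id, an int length, then the key bytes) are assumptions made for illustration and are not taken from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch; the record layout and names are illustrative only.
final class StreamCloseSketch {
    /** Writes a key id, the key length, and the key bytes to a byte array. */
    static byte[] writeKey(int keyId, byte[] key) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // fsOut.close() is called automatically when the block exits,
        // whether it completes normally or an exception is thrown.
        try (DataOutputStream fsOut = new DataOutputStream(os)) {
            fsOut.writeInt(keyId);
            fsOut.writeInt(key.length);
            fsOut.write(key);
        }
        return os.toByteArray();
    }

    /** Reads back only the leading key id; fsIn is closed the same way. */
    static int readKeyId(byte[] data) throws IOException {
        try (DataInputStream fsIn =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            return fsIn.readInt();
        }
    }
}
```

Compared with an explicit close in a finally block, this form keeps the close call next to the declaration and preserves the original exception if close itself also fails, since the secondary failure is recorded as a suppressed exception.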
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339592#comment-14339592 ]

zhihai xu commented on YARN-2820:
---------------------------------

That is a good finding. I double-checked all the FS operations in FileSystemRMStateStore. In addition to your findings above, there is one more missing case, in closeInternal:
{code}
fs.close();
{code}
I will upload a new patch shortly to include retries for all these missing cases.

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339703#comment-14339703 ]

Tsuyoshi Ozawa commented on YARN-2820:
--------------------------------------

Good catch! Yes, we should retry there also.

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334748#comment-14334748 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12700337/YARN-2820.004.patch
  against trunk revision b610c68.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6709//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6709//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6709//console

This message is automatically generated.

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335005#comment-14335005 ] Tsuyoshi OZAWA commented on YARN-2820: -- [~zxu] Great job! We are almost there. To avoid repeating the retry code, I think it's better to have an FSAction, like ZKAction in ZKRMStateStore. What do you think? Minor nit: I prefer to have the line break after = for readability.
{code}
+  public static final String FS_RM_STATE_STORE_NUM_RETRIES = RM_PREFIX
+      + "fs.state-store.num-retries";
+  public static final String FS_RM_STATE_STORE_RETRY_INTERVAL_MS = RM_PREFIX
+      + "fs.state-store.retry-interval-ms";
{code}
{code}
  public static final String FS_RM_STATE_STORE_NUM_RETRIES =
      RM_PREFIX + "fs.state-store.num-retries";
  public static final String FS_RM_STATE_STORE_RETRY_INTERVAL_MS =
      RM_PREFIX + "fs.state-store.retry-interval-ms";
{code}
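The FSAction idea suggested in the comment above can be sketched roughly as follows. This is only an illustration of the pattern, not the code from any YARN-2820 patch: the method name runWithRetries and its two parameters are hypothetical, chosen to mirror the num-retries and retry-interval-ms settings mentioned in the thread, and the sketch deliberately avoids any Hadoop types so it can run on its own.

```java
import java.io.IOException;

/**
 * Hypothetical retry wrapper in the spirit of the suggested FSAction.
 * All names other than "FSAction" are assumptions made for illustration.
 */
abstract class FSAction<T> {
  /** The file system operation to attempt; it may fail with an IOException. */
  abstract T run() throws IOException;

  /**
   * Runs the action once, then retries up to numRetries more times,
   * sleeping retryIntervalMs between attempts.
   */
  T runWithRetries(int numRetries, long retryIntervalMs)
      throws IOException, InterruptedException {
    int retry = 0;
    while (true) {
      try {
        return run();
      } catch (IOException e) {
        if (++retry > numRetries) {
          throw e; // retries exhausted: surface the last failure to the caller
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
```

With a wrapper of this shape, each file system call can be written once as an anonymous subclass and the retry loop lives in a single place, which seems to be the duplication the reviewer wanted to avoid.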
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334381#comment-14334381 ] zhihai xu commented on YARN-2820: - [~ozawa], sorry for the delay in updating the patch. Your review was really thorough; thanks for that. I uploaded a new patch, YARN-2820.004.patch, which addresses all your comments. Please review it. Thanks, zhihai
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328862#comment-14328862 ] Tsuyoshi OZAWA commented on YARN-2820: -- [~zxu] Thank you for updating the patch.
1. Should we create *WithRetries methods for deleteFile/renameFile/createFile/getFileStatus too? Note that we should update replaceFile to use renameFileWithRetries instead of calling fs.rename(srcPath, dstPath) directly:
{code}
protected void replaceFile(Path srcPath, Path dstPath) throws Exception {
  if (fs.exists(dstPath)) {
    deleteFile(dstPath);
  } else {
    LOG.info("File doesn't exist. Skip deleting the file " + dstPath);
  }
  fs.rename(srcPath, dstPath);
}
{code}
2. Should we create existsWithRetries and use it instead of fs.exists()?
3. Please move the *WithRetries methods below the following comment:
{code}
// FileSystem related code
{code}
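As a rough illustration of review points 1 and 2 above, the sketch below routes every file system call made by replaceFile through a *WithRetries wrapper. It is an assumption-laden sketch rather than the patch itself: MiniFs is a hypothetical stand-in for the handful of Hadoop FileSystem calls involved (paths are plain strings here), RetryingStore and withRetries are made-up names, and the sleep between attempts is omitted for brevity.

```java
import java.io.IOException;

/** Hypothetical stand-in for the few file system calls replaceFile needs. */
interface MiniFs {
  boolean exists(String path) throws IOException;
  boolean delete(String path) throws IOException;
  boolean rename(String src, String dst) throws IOException;
}

/** Sketch only: replaceFile never talks to the file system without a retry wrapper. */
class RetryingStore {
  /** A single file system operation that may throw IOException. */
  interface Op<T> {
    T call() throws IOException;
  }

  private final MiniFs fs;
  private final int numRetries;

  RetryingStore(MiniFs fs, int numRetries) {
    this.fs = fs;
    this.numRetries = Math.max(0, numRetries); // always attempt at least once
  }

  /** Attempts op once plus up to numRetries retries; rethrows the last failure. */
  private <T> T withRetries(Op<T> op) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= numRetries; attempt++) {
      try {
        return op.call();
      } catch (IOException e) {
        last = e; // remember the failure and try again
      }
    }
    throw last;
  }

  boolean existsWithRetries(String path) throws IOException {
    return this.<Boolean>withRetries(() -> fs.exists(path));
  }

  boolean deleteFileWithRetries(String path) throws IOException {
    return this.<Boolean>withRetries(() -> fs.delete(path));
  }

  boolean renameFileWithRetries(String src, String dst) throws IOException {
    return this.<Boolean>withRetries(() -> fs.rename(src, dst));
  }

  /** Same shape as the replaceFile quoted above, with each direct fs call wrapped. */
  void replaceFile(String src, String dst) throws IOException {
    if (existsWithRetries(dst)) {
      deleteFileWithRetries(dst);
    }
    renameFileWithRetries(src, dst);
  }
}
```

Writing one wrapper per operation keeps call sites short, at the cost of some repetition across the wrappers themselves.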
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328593#comment-14328593 ] Tsuyoshi OZAWA commented on YARN-2820: -- I'll take a look. Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. -- Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 
2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal. Stack trace from TestFSRMStateStore failure: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328576#comment-14328576 ] zhihai xu commented on YARN-2820: - None of these 5 Findbugs warnings are related to my change.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325589#comment-14325589 ] zhihai xu commented on YARN-2820: - [~ozawa], thanks for the review. Your suggestion is good. I uploaded a new patch, YARN-2820.003.patch, which addresses your comment. Please review it. Thanks, zhihai
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325676#comment-14325676 ] Hadoop QA commented on YARN-2820: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699443/YARN-2820.003.patch against trunk revision b6fc1f3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6658//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6658//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6658//console This message is automatically generated. Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. 
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325616#comment-14325616 ]

Hadoop QA commented on YARN-2820:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12699439/YARN-2820.002.patch
against trunk revision b6fc1f3.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6657//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6657//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6657//console

This message is automatically generated.

Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
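For readers following the thread, the idea in the issue title (retrying a state-store write when it fails with an IOException, rather than letting the first failure escalate) can be sketched roughly as below. This is only an illustrative sketch and is not taken from any of the attached patches: the class RetrySketch, the interface IoAction, the method runWithRetries, and the parameters maxRetries and retryIntervalMs are all hypothetical names, and the simulated failure in main stands in for a real file system write.

```java
import java.io.IOException;

public class RetrySketch {

    /** Hypothetical stand-in for a state-store write that may fail with an IOException. */
    @FunctionalInterface
    public interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs the action and retries it when it throws an IOException.
     * At most maxRetries retries are made after the first attempt, with a
     * pause of retryIntervalMs between attempts. When the retries are
     * exhausted, the last IOException is rethrown to the caller.
     */
    public static void runWithRetries(IoAction action, int maxRetries, long retryIntervalMs)
            throws IOException, InterruptedException {
        int retries = 0;
        while (true) {
            try {
                action.run();
                return; // the write succeeded
            } catch (IOException e) {
                if (retries >= maxRetries) {
                    throw e; // give up and let the caller handle the failure
                }
                retries++;
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated write: fails on the first two calls, then succeeds.
        runWithRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IOException("simulated close failure");
            }
        }, 5, 10L);
        System.out.println("attempts made: " + calls[0]);
    }
}
```

In this sketch a failure only reaches the caller after the configured retries are used up, so a transient error like the one in the log above would not immediately surface as a fatal event.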
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325621#comment-14325621 ]

zhihai xu commented on YARN-2820:

I checked the warning messages; none of these 5 Findbugs warnings are related to my change.

Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch

Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following IOException cause the RM to shut down.
{code}
2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01
2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
    at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:744)
{code}
As discussed in YARN-1778, the TestFSRMStateStore failure is also due to an IOException in storeApplicationStateInternal.
Stack trace from the TestFSRMStateStore failure:
{code}
2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
    at