[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341586#comment-14341586 ]

Hudson commented on YARN-2820:
------------------------------

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2068 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2068/])
YARN-2820. Retry in FileSystemRMStateStore when FS's operations fail due to IOException. Contributed by Zhihai Xu. (ozawa: rev 01a1621930df17a745dd37892996c68fca3447d1)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* hadoop-yarn-project/CHANGES.txt

> Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
> -----------------------------------------------------------------------------
>
>                 Key: YARN-2820
>                 URL: https://issues.apache.org/jira/browse/YARN-2820
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.5.0, 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>             Fix For: 2.7.0
>
>         Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, YARN-2820.007.patch
>
>
> Retry in FileSystemRMStateStore for better error recovery when an update/store operation fails due to an IOException.
> When we used FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following IOException cause the RM to shut down.
> {code}
> 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01
> 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
> 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
> 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
>     at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore
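
The retry requested in the issue description can be sketched roughly as follows. This is only an illustration and not the code from the YARN-2820 patch: the class FsRetrySketch, the interface FsAction, the method runWithRetries, and the parameters maxRetries and retryIntervalMs are all hypothetical names, and the fixed sleep between attempts is an assumption.

```java
import java.io.IOException;

class FsRetrySketch {
    /** Hypothetical callback for a file system operation that may throw IOException. */
    @FunctionalInterface
    public interface FsAction<T> {
        T run() throws IOException;
    }

    /**
     * Runs the action and retries it when it throws IOException.
     * At most maxRetries retries are made after the first attempt, with a
     * fixed pause of retryIntervalMs between attempts. The last IOException
     * is rethrown once the retries are exhausted.
     */
    public static <T> T runWithRetries(FsAction<T> action, int maxRetries, long retryIntervalMs)
            throws IOException, InterruptedException {
        int retry = 0;
        while (true) {
            try {
                return action.run();
            } catch (IOException e) {
                if (++retry > maxRetries) {
                    throw e; // give up and let the caller treat it as fatal
                }
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated store operation that fails twice before succeeding.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated transient failure");
            }
            return "stored";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With a wrapper of this kind, a transient failure such as the close() error in the log above would only be escalated after the configured number of retries has been used up, rather than on the first IOException.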
[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341548#comment-14341548 ]

Hudson commented on YARN-2820:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #109 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/109/])
YARN-2820. Retry in FileSystemRMStateStore when FS's operations fail due to IOException. Contributed by Zhihai Xu. (ozawa: rev 01a1621930df17a745dd37892996c68fca3447d1)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341541#comment-14341541 ]

Hudson commented on YARN-2820:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2050 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2050/])
YARN-2820. Retry in FileSystemRMStateStore when FS's operations fail due to IOException. Contributed by Zhihai Xu. (ozawa: rev 01a1621930df17a745dd37892996c68fca3447d1)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341518#comment-14341518 ]

Hudson commented on YARN-2820:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #118 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/118/])
YARN-2820. Retry in FileSystemRMStateStore when FS's operations fail due to IOException. Contributed by Zhihai Xu. (ozawa: rev 01a1621930df17a745dd37892996c68fca3447d1)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341488#comment-14341488 ]

Hudson commented on YARN-2820:
------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk #852 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/852/])
YARN-2820. Retry in FileSystemRMStateStore when FS's operations fail due to IOException. Contributed by Zhihai Xu. (ozawa: rev 01a1621930df17a745dd37892996c68fca3447d1)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341475#comment-14341475 ]

Hudson commented on YARN-2820:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #118 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/118/])
YARN-2820. Retry in FileSystemRMStateStore when FS's operations fail due to IOException. Contributed by Zhihai Xu. (ozawa: rev 01a1621930df17a745dd37892996c68fca3447d1)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340714#comment-14340714 ]

zhihai xu commented on YARN-2820:
---------------------------------

Thanks [~ozawa] for the valuable feedback and for committing the patch! Greatly appreciated.

> Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
> -----------------------------------------------------------------------------
>
>                 Key: YARN-2820
>                 URL: https://issues.apache.org/jira/browse/YARN-2820
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.5.0, 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>             Fix For: 2.7.0
>
>         Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, YARN-2820.007.patch
>
>
> Retry in FileSystemRMStateStore for better error recovery when an update/store operation fails due to an IOException.
> When we used FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following IOException cause the RM to shut down.
> {code}
> 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01
> 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
> 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
> 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
>     at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     at java.lang.Thread.run(Thread.java:744)
> {code}
> As discussed at YARN-1778, the TestFSRMStateStore failure is also due to an IOException in storeApplicationStateInternal.
> Stack trace from the TestFSRMStateStore failure:
> {code}
> 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
>
[jira] [Commented] (YARN-2820) Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340322#comment-14340322 ]

Hudson commented on YARN-2820:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7220 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7220/])
YARN-2820. Retry in FileSystemRMStateStore when FS's operations fail due to IOException. Contributed by Zhihai Xu. (ozawa: rev 01a1621930df17a745dd37892996c68fca3447d1)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java

> Retry in FileSystemRMStateStore when FS's operations fail due to IOException.
> --
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0, 2.6.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Fix For: 2.7.0
>
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, YARN-2820.007.patch
>
> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown.
> {code}
> 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01
> 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
> 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
> 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
> at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java: