subject:"\[jira\] \[Commented\] $YARN\-2820$ Do retry in FileSystemRMStateStore for better error recovery when update\/store failure due to IOException."

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-27 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340148#comment-14340148
 ] 

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12701354/YARN-2820.007.patch
  against trunk revision 4f75b15.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6780//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6780//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6780//console

This message is automatically generated.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, 
 YARN-2820.007.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-27 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340307#comment-14340307
 ] 

Tsuyoshi Ozawa commented on YARN-2820:
--

+1, committing this shortly.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, 
 YARN-2820.007.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-27 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340304#comment-14340304
 ] 

Tsuyoshi Ozawa commented on YARN-2820:
--

[~zxu] My bad. Closeable should be idempotent, so it's OK. 
http://docs.oracle.com/javase/7/docs/api/java/lang/AutoCloseable.html Please 
ignore the above comment.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, 
 YARN-2820.007.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-27 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340265#comment-14340265
 ] 

Tsuyoshi Ozawa commented on YARN-2820:
--

[~zxu] Thank you for updating. I rethink abut closeInternal(). If we call 
fs.close() twice or more, it can close another file descriptor unexpectedly. It 
can lead unexpected behaviours. We should remove closeWithRetries and call 
fs.close() in closeInternal() to avoid the problems. What do you think? Thank 
you for dealing with iterative reviews.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, 
 YARN-2820.007.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-27 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339910#comment-14339910
 ] 

Tsuyoshi Ozawa commented on YARN-2820:
--

[~zxu] try-with-resources statement is a new statement from JDK7 for instances 
which implements Closable:
http://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html

{code}
try (ByteArrayInputStream is = new ByteArrayInputStream(childData);
DataInputStream fsIn = new DataInputStream(is);){
  // processing something here
}
// closes is and fsIn automatically after the block.
{code}

It's useful since we don't need finally block with null check. 

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch, YARN-2820.006.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-27 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340023#comment-14340023
]

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12701314/YARN-2820.007.patch
against trunk revision 48c7ee7.

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.

{color:red}-1 javac{color}. The applied patch generated 1151 javac
compiler warnings (more than the trunk's current 205 warnings).

{color:red}-1 javadoc{color}. The javadoc tool appears to have generated
47 warning messages.
See
https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/diffJavadocWarnings.txt
for details.

{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 5 new
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

The following test timeouts occurred in
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs

Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/6778//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Javac warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6778//console

This message is automatically generated.

Do retry in FileSystemRMStateStore for better error recovery when
update/store failure due to IOException.
--

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch,
YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch,
YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch

Do retry in FileSystemRMStateStore for better error recovery when
update/store failure due to IOException.
When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We
saw the following IOexception cause the RM shutdown.
{code}
2014-10-29 23:49:12,202 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Updating info for attempt: appattempt_1409135750325_109118_01 at:
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01
2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...
2014-10-29 23:49:46,283 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Error updating info for attempt: appattempt_1409135750325_109118_01
java.io.IOException: Unable to close file because the last block does not
have enough number of replicas.
2014-10-29 23:49:46,284 ERROR
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Error storing/updating appAttempt: appattempt_1409135750325_109118_01
2014-10-29 23:49:46,916 FATAL
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-27 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339954#comment-14339954
 ] 

zhihai xu commented on YARN-2820:
-

[~ozawa], Cool, I just learned this new syntax. I uploaded a new patch 
YARN-2820.007.patch which use try-with-resources statement.
Please review it. thanks zhihai

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-26 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339774#comment-14339774
]

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12701267/YARN-2820.006.patch
against trunk revision 8ca0d95.

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 5 new
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/6775//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/6775//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6775//console

This message is automatically generated.

Do retry in FileSystemRMStateStore for better error recovery when
update/store failure due to IOException.
--

at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)

at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-26 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338589#comment-14338589
 ] 

Tsuyoshi Ozawa commented on YARN-2820:
--

[~zxu] thanks for your updating! The implementation of FSAction looks good to 
me. I found following points to be fixed:

1. In startInternal, fs.mkdirs can be replaced with mkdirsWithRetries:
{code}
fs.mkdirs(rmDTSecretManagerRoot);
fs.mkdirs(rmAppRoot);
fs.mkdirs(amrmTokenSecretManagerRoot);
{code}

2. All readFile() should be replaced with readFileWithRetries like 
writeFileWithRetries. 
3. fs.listStatus() should be replaced with listStatusWithRetries.

4. We can use try-with-resources in storeRMDTMasterKeyState to close fsOut. I 
know it's not related to this patch, but it's better to be fixed here.
{code}
DataOutputStream fsOut = new DataOutputStream(os);
{code}

Do you mind updating a patch again?

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-26 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338170#comment-14338170
]

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12700999/YARN-2820.005.patch
against trunk revision 71385f9.

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 6 new
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.

Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/6753//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/6753//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6753//console

This message is automatically generated.

Do retry in FileSystemRMStateStore for better error recovery when
update/store failure due to IOException.
--

at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)

at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-26 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339716#comment-14339716
 ] 

zhihai xu commented on YARN-2820:
-

[~ozawa], thanks for your thorough review, I am really appreciated.
I uploaded a new patch YARN-2820.005.patch, which addressed all your comments, 
It also put fsIn.close in try-with-resources at loadRMDTSecretManagerState, 
which is similar as fsOut.close at storeRMDTMasterKeyState.
please review it, thanks zhihai

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch, YARN-2820.006.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) -

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-26 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339592#comment-14339592
 ] 

zhihai xu commented on YARN-2820:
-

That is good finding, I double-checked all the FS operations in 
FileSystemRMStateStore: With your above finding, there is one more missing:
which is in closeInternal
{code}
fs.close();
{code} 
I will upload a new patch shortly to include retries for all these missing 
cases.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-26 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339703#comment-14339703
 ] 

Tsuyoshi Ozawa commented on YARN-2820:
--

Good catch! Yes, we should retry there also.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, 
 YARN-2820.005.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-24 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334748#comment-14334748
]

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12700337/YARN-2820.004.patch
against trunk revision b610c68.

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 5 new
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.

Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/6709//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/6709//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6709//console

This message is automatically generated.

Do retry in FileSystemRMStateStore for better error recovery when
update/store failure due to IOException.
--

at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)

at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-24 Thread Tsuyoshi OZAWA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335005#comment-14335005
 ] 

Tsuyoshi OZAWA commented on YARN-2820:
--

[~zxu] Great job! We are almost there. To avoid repeating code for retry, I 
think it's better to have FSAction like ZKAction in ZKRMStateStore. What do you 
think?

Minor nits: I prefer to have a line break after = for readability.
{code}
+  public static final String FS_RM_STATE_STORE_NUM_RETRIES = RM_PREFIX
+  + fs.state-store.num-retries;
+  public static final String FS_RM_STATE_STORE_RETRY_INTERVAL_MS = RM_PREFIX
+  + fs.state-store.retry-interval-ms;
{code}

{code}
  public static final String FS_RM_STATE_STORE_NUM_RETRIES = 
  RM_PREFIX + fs.state-store.num-retries;
  public static final String FS_RM_STATE_STORE_RETRY_INTERVAL_MS =
  RM_PREFIX + fs.state-store.retry-interval-ms;
{code}

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-23 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334381#comment-14334381
 ] 

zhihai xu commented on YARN-2820:
-

[~ozawa], Sorry for the delay to update the patch. Your review was really 
thorough. Thanks for that.
I uploaded a new patch YARN-2820.004.patch which addressed all your comments.
Please review it.
thanks
zhihai

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0, 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-20 Thread Tsuyoshi OZAWA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328862#comment-14328862
 ] 

Tsuyoshi OZAWA commented on YARN-2820:
--

[~zxu] Thank you for updating a patch. 

1. Should we create *WithRetries methods for 
deleteFile/renameFile/createFile/getFileStatus too? Note that we should update 
replaceFile to use renameFileWithRetires instead of calling 
fs.rename(srcPath, dstPath) directly:
{code}
  protected void replaceFile(Path srcPath, Path dstPath) throws Exception {
if (fs.exists(dstPath)) {
  deleteFile(dstPath);
} else {
  LOG.info(File doesn't exist. Skip deleting the file  + dstPath);
}
fs.rename(srcPath, dstPath);
  }
{code}

2. Should we create existsWithRetries and use it instead of fs.exists()?

2. Please move *WithRetries methods below the following comment:
{code}

  // FileSystem related code

{code}

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-19 Thread Tsuyoshi OZAWA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328593#comment-14328593
 ] 

Tsuyoshi OZAWA commented on YARN-2820:
--

I'll take a look.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-19 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328576#comment-14328576
 ] 

zhihai xu commented on YARN-2820:
-

All these 5 findbugs are not related to my change.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325589#comment-14325589
 ] 

zhihai xu commented on YARN-2820:
-

[~ozawa], thanks for the review. Your suggestion is good.
I uploaded a new patch YARN-2820.003.patch, which addressed your comment.
please review it.
thanks
zhihai

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325676#comment-14325676
]

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12699443/YARN-2820.003.patch
against trunk revision b6fc1f3.

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 5 new
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.

Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/6658//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/6658//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6658//console

This message is automatically generated.

Do retry in FileSystemRMStateStore for better error recovery when
update/store failure due to IOException.
--

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch,
YARN-2820.002.patch, YARN-2820.003.patch

at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325616#comment-14325616
]

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12699439/YARN-2820.002.patch
against trunk revision b6fc1f3.

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 5 new
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.

Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/6657//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/6657//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6657//console

This message is automatically generated.

Do retry in FileSystemRMStateStore for better error recovery when
update/store failure due to IOException.
--

Key: YARN-2820
URL: https://issues.apache.org/jira/browse/YARN-2820
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2820.000.patch, YARN-2820.001.patch,
YARN-2820.002.patch, YARN-2820.003.patch

at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325621#comment-14325621
 ] 

zhihai xu commented on YARN-2820:
-

I checked the warning message, all these 5 findbugs are not related to my 
change.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We 
 saw the following IOexception cause the RM shutdown.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

23 matches

Site Navigation

Mail list logo

Footer information