[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303866#comment-14303866
 ] 

zhihai xu commented on YARN-1778:
---------------------------------

Hi [~jlowe], thanks for the information. I think relying on the HDFS client's 
retries leaves a lot of corner cases to cover, and it may not be easy to cover 
them all. For example, in YARN-2820 we hit this issue: an HDFS IOException 
after the HDFS client had already retried at dfsClient.namenode.complete, which 
is a low-level (sub-function) retry inside FileSystemRMStateStore#updateFile, 
as shown in the following log.
{code}
2014-10-29 23:49:12,202 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Updating info for attempt: appattempt_1409135750325_109118_000001 at: 
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_000001

2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_000001.new.tmp retrying...

2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_000001.new.tmp retrying...

2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_000001.new.tmp retrying...

2014-10-29 23:49:46,283 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Error updating info for attempt: appattempt_1409135750325_109118_000001
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas.
2014-10-29 23:49:46,284 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Error storing/updating appAttempt: appattempt_1409135750325_109118_000001
2014-10-29 23:49:46,916 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED.
{code}
The HDFS client retry is a low-level retry. It doesn't know how the upper layer 
uses it. IMO, it makes sense to also retry in the upper layer, covering the 
whole operation. This is similar to retrying at different network layers: the 
physical layer, the link layer and the TCP/IP layer.
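
To illustrate what an upper-layer retry of the whole operation could look 
like, here is a minimal sketch. It is not the YARN-1778 patch and not existing 
Hadoop code: the class WholeOperationRetry, the interface IoOperation and the 
parameters maxAttempts and retryIntervalMs are hypothetical names chosen for 
this example. The idea is that the caller wraps the complete operation (for 
instance everything done for one updateFile call) so that a failure anywhere 
inside it, including one that survives the HDFS client's own retries, restarts 
the operation from the beginning.

```java
import java.io.IOException;

/** Hypothetical helper for illustration; not part of Hadoop or of any patch. */
final class WholeOperationRetry {

    /** One complete upper-layer operation, treated as a single retryable unit. */
    interface IoOperation {
        void run() throws IOException;
    }

    /**
     * Runs op up to maxAttempts times, sleeping retryIntervalMs between
     * attempts. Returns the number of attempts used on success and rethrows
     * the last IOException once the attempts are exhausted.
     */
    static int runWithRetry(IoOperation op, int maxAttempts, long retryIntervalMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                op.run();          // the whole operation, not one low-level call
                return attempt;
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e;       // give up and let the caller handle the failure
                }
                Thread.sleep(retryIntervalMs);
            }
        }
    }
}
```

This assumes the wrapped operation is safe to run again from the start, for 
example because it removes any leftover temporary file before rewriting it.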



> TestFSRMStateStore fails on trunk
> ---------------------------------
>
>                 Key: YARN-1778
>                 URL: https://issues.apache.org/jira/browse/YARN-1778
>             Project: Hadoop YARN
>          Issue Type: Test
>            Reporter: Xuan Gong
>            Assignee: zhihai xu
>         Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
