[
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303866#comment-14303866
]
zhihai xu commented on YARN-1778:
---------------------------------
Hi [~jlowe], thanks for the information. I think relying on HDFS client retries
leaves a lot of corner cases to cover, and it may not be easy to cover all of them.
For example, in YARN-2820 we hit this issue: an HDFS IOException was thrown after
the HDFS client retried at dfsClient.namenode.complete, which is a low-level
(sub-function) retry inside FileSystemRMStateStore#updateFile, as the following log shows.
{code}
2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_000001 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_000001
2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_000001.new.tmp retrying...
2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_000001.new.tmp retrying...
2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_000001.new.tmp retrying...
2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_000001
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_000001
2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED.
{code}
The HDFS client retry is a low-level retry. It doesn't know how the upper layer
uses it. IMO, it makes sense to do the retry in the upper layer so that the whole
operation is retried, which is similar to doing retries at different network
layers: at the physical layer, the link layer, and the TCP/IP layer.
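
To illustrate the idea, here is a minimal sketch of an upper-layer retry that re-runs a whole logical operation when it fails with an IOException. It is not the actual FileSystemRMStateStore code or the YARN-1778 patch: the class name UpperLayerRetry, the IoOperation interface, and the maxAttempts and sleepMillis parameters are all hypothetical names chosen for this example.

```java
import java.io.IOException;

// Hypothetical helper, not part of Hadoop: retries a whole logical operation
// at the caller's level, independent of any retries done by lower layers.
final class UpperLayerRetry {

    /** Hypothetical stand-in for the full operation (e.g. write, close, rename). */
    interface IoOperation<T> {
        T run() throws IOException;
    }

    /**
     * Runs op up to maxAttempts times, sleeping sleepMillis between failed
     * attempts. Rethrows the last IOException if every attempt fails.
     */
    static <T> T retry(IoOperation<T> op, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.run();
            } catch (IOException e) {
                last = e;
                // Only wait if another attempt will follow.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation that fails twice before succeeding.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Unable to close file (simulated)");
            }
            return "stored";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In this sketch the operation passed to retry would have to be safe to repeat from the start (for example, by rewriting the temporary file), since the lower layer may have left a partial result behind.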
> TestFSRMStateStore fails on trunk
> ---------------------------------
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
> Issue Type: Test
> Reporter: Xuan Gong
> Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>