[
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308663#comment-14308663
]
zhihai xu commented on YARN-1778:
---------------------------------
[~ozawa], I am not sure what you mean. The number of retries is not hard-coded;
see the following code at
[DFSOutputStream#completeFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L1540]
{code}
int retries = dfsClient.getConf().nBlockWriteLocateFollowingRetry;
{code}
nBlockWriteLocateFollowingRetry is set by the configuration property
"dfs.client.block.write.locateFollowingBlock.retries".
The problem for me is that the retry in DFSOutputStream#completeFile doesn't work.
Based on the log, it retried 5 times over more than 30 seconds and still did not
succeed. The exception "Unable to close file because the last block does not have
enough number of replicas" was then generated from
[FileSystemRMStateStore#writeFile|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java#L583]
and caused RM restart().
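To make the low-layer behaviour concrete, here is a minimal, self-contained sketch of a bounded retry loop of the kind described above. It is not the actual DFSOutputStream code: the class name, the BooleanSupplier standing in for the NameNode "complete" call, and the doubling sleep between attempts are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of a bounded "complete file" retry loop. */
final class BoundedCompleteRetry {

    /**
     * Calls tryComplete until it returns true or the retry budget is used up.
     * The doubling delay is an assumption, not taken from the Hadoop source.
     *
     * @return the number of calls made to tryComplete
     */
    static int completeWithRetries(BooleanSupplier tryComplete, int retries,
                                   long initialDelayMillis)
            throws IOException, InterruptedException {
        long delay = initialDelayMillis;
        int calls = 0;
        while (true) {
            calls++;
            if (tryComplete.getAsBoolean()) {
                return calls;                 // file completed
            }
            if (retries == 0) {
                // Budget exhausted: surface the failure to the caller.
                throw new IOException("Unable to close file because the last block"
                        + " does not have enough number of replicas");
            }
            retries--;
            Thread.sleep(delay);              // wait before asking again
            delay *= 2;                       // assumed exponential backoff
        }
    }

    public static void main(String[] args) throws Exception {
        int[] seen = {0};
        // Simulated NameNode that reports success on the third call.
        int calls = completeWithRetries(() -> ++seen[0] >= 3, 5, 10L);
        System.out.println("completed after " + calls + " calls");
    }
}
```

With a small retry budget, a slow block report can exhaust every attempt, and the caller only sees the IOException.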
My patch will work better because it retries at both the high layer (new code) and
the low layer (old code): it retries in FileSystemRMStateStore#writeFile, and if any
exception happens, it will [overwrite the
file|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java#L581]
and redo everything.
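The high-layer idea can be sketched roughly as follows. This is not the code in YARN-1778.000.patch: RetryingWriter, WriteAttempt, and the fixed sleep between attempts are hypothetical names and choices, and java.nio.file stands in for the Hadoop FileSystem API. The only point illustrated is that each attempt rewrites the whole file, so a failed close can simply be retried from the start.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of retrying a whole write-and-close operation. */
final class RetryingWriter {

    /** One complete attempt: create or overwrite the file, write, close. */
    interface WriteAttempt {
        void run() throws IOException;
    }

    /**
     * Runs the attempt up to maxAttempts times and rethrows the last failure.
     *
     * @return the attempt number that succeeded (1-based)
     */
    static int writeWithRetries(WriteAttempt attempt, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.run();                // overwrite the file and redo everything
                return i;
            } catch (IOException e) {
                last = e;                     // remember the failure, then try again
                if (i < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("rmstate", ".tmp");
        byte[] data = "app-state".getBytes(StandardCharsets.UTF_8);
        int[] calls = {0};
        int used = writeWithRetries(() -> {
            // Files.write truncates an existing file by default, i.e. it overwrites.
            Files.write(file, data);
            if (++calls[0] < 3) {
                throw new IOException("simulated close failure");
            }
        }, 5, 10L);
        System.out.println("succeeded on attempt " + used);
        Files.deleteIfExists(file);
    }
}
```

Because the retry wraps the entire operation, it does not depend on the lower layer recovering within its own retry budget.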
> TestFSRMStateStore fails on trunk
> ---------------------------------
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
> Issue Type: Test
> Reporter: Xuan Gong
> Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>