[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

    [ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308663#comment-14308663
 ]


zhihai xu commented on YARN-1778:
---------------------------------

[~ozawa], I'm not sure what you mean. The number of retries is not hard-coded, 
as shown by the following code in 
[DFSOutputStream#completeFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L1540]
{code}
    int retries = dfsClient.getConf().nBlockWriteLocateFollowingRetry;
{code}
nBlockWriteLocateFollowingRetry is set by the configuration property 
"dfs.client.block.write.locateFollowingBlock.retries".
The problem for me is that the retry in DFSOutputStream#completeFile doesn't work. 
According to the log, it retried 5 times over more than 30 seconds and still 
failed. Then the exception "Unable to close file because the last block does not 
have enough number of replicas", propagated from 
[FileSystemRMStateStore#writeFile|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java#L583],
 caused the RM to restart. 
My patch should work better because it retries at both the high layer (new code) 
and the low layer (old code): it retries in FileSystemRMStateStore#writeFile, and 
if any exception happens, it will [overwrite the 
file|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java#L581]
 and redo everything.
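
To illustrate the high-layer retry described above, here is a minimal, self-contained sketch. It is not the code from the attached patch. The class TwoLayerRetrySketch, the WriteAttempt interface, and the writeWithRetries method are hypothetical names, and the attempt count and sleep interval are made-up values. A "write attempt" stands for the whole create-with-overwrite, write, and close sequence, so a failure in close (after the low-layer retries are exhausted) makes the next attempt start again from scratch.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class TwoLayerRetrySketch {

    /** Hypothetical stand-in for one full attempt: create (overwrite), write, close. */
    interface WriteAttempt {
        void run() throws IOException;
    }

    /**
     * High-layer retry: if the whole attempt fails with an IOException,
     * redo everything, up to maxAttempts times.
     * Returns the number of attempts that were needed.
     */
    static int writeWithRetries(WriteAttempt attempt, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.run();      // low-layer retries would happen inside this call
                return i;
            } catch (IOException e) {
                last = e;           // remember the failure and start over
                if (i < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;                 // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated attempt: the first two calls fail as if close() gave up,
        // the third one succeeds.
        int used = writeWithRetries(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("Unable to close file (simulated)");
            }
        }, 5, 10L);
        System.out.println("succeeded after " + used + " attempts");
    }
}
```

The point of the sketch is that the outer loop does not depend on why the inner operation failed. Any IOException triggers a full redo, which is only safe when each attempt overwrites the file.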

> TestFSRMStateStore fails on trunk
> ---------------------------------
>
>                 Key: YARN-1778
>                 URL: https://issues.apache.org/jira/browse/YARN-1778
>             Project: Hadoop YARN
>          Issue Type: Test
>            Reporter: Xuan Gong
>            Assignee: zhihai xu
>         Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
