[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time.

Allen Wittenauer (JIRA) Thu, 14 May 2015 12:31:21 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544235#comment-14544235
 ]


Allen Wittenauer commented on YARN-3639:
----------------------------------------

OK, just making sure there wasn't some weird thing going on where the problem 
really did exist because they happened to be on the same node.  That'd be very 
very bad. :D

> It takes too long time for RM to recover all apps if the original active RM 
> and NN go down at the same time.
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3639
>                 URL: https://issues.apache.org/jira/browse/YARN-3639
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Xianyin Xin
>         Attachments: YARN-3639-recovery_log_1_app.txt
>
>
> If the active RM and NN go down at the same time, the new RM takes a long 
> time to recover all apps. After analysis, we found that the root cause is the 
> renewal of HDFS tokens during recovery. The HDFS client created by the renewer 
> first tries to connect to the original NN, which times out after 10~20s, and 
> only then tries to connect to the new NN. 
> According to our tests, the entire recovery takes 15*#apps seconds.
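
The arithmetic in the description can be sketched roughly as follows. This is only an illustrative model, not RM or HDFS client code: the class name, method name, and the 1000-app figure are hypothetical, and the 15 s value is simply the midpoint of the 10~20s timeout quoted above. It assumes tokens are renewed one app at a time and that each renewal pays one connect timeout for every unreachable NN it tries before reaching a live one.

```java
import java.util.List;

public class RecoveryCostSketch {

    /**
     * Hypothetical model of the delay described in the issue.
     * Each app's token renewal walks the NameNodes in order and waits
     * connectTimeoutSec for every unreachable one before the first live one.
     * Renewals are assumed to run sequentially, so the cost scales with #apps.
     */
    static long recoverySeconds(int apps, List<Boolean> nnReachableInOrder, long connectTimeoutSec) {
        long perAppSeconds = 0;
        for (boolean reachable : nnReachableInOrder) {
            if (reachable) {
                break; // first live NN found, no further waiting for this app
            }
            perAppSeconds += connectTimeoutSec; // timed out on a dead NN
        }
        return perAppSeconds * apps;
    }

    public static void main(String[] args) {
        // Original NN (down) is tried first, then the new NN (up).
        // 1000 apps and a 15 s timeout are illustrative values only.
        long seconds = recoverySeconds(1000, List.of(false, true), 15);
        System.out.println(seconds + " s");
    }
}
```

Under this model, trying the live NN first (`List.of(true, false)`) yields zero added delay, which is why the order in which the client tries the NNs matters so much here.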



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
