[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543040#comment-14543040
 ] 

Xianyin Xin commented on YARN-3639:
-----------------------------------

Sorry [~aw], I didn't make it clear. "On the same node" means the original 
active RM and NN were running on the same node (node1), while the standby RM 
and NN were running on other nodes. After node1 died, the HDFS token renewer 
would first try to connect to the NN on node1, which was no longer reachable. 
Only after that connection timed out did the renewer try the original standby 
NN.
If the active NN and RM run on different nodes, the problem doesn't exist.
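As an illustration of the scaling described above, here is a minimal sketch of the serial recovery cost. The 15 s per-app figure is the observed failover delay from our test, not a fixed constant; the function name is hypothetical, and the real delay is governed by the IPC connect-timeout and retry settings on the cluster.

```python
# Rough model of the recovery cost: each app's HDFS token renewal
# first contacts the dead NameNode and must wait out the connect
# timeout before failing over to the standby, so the per-app
# timeouts add up serially across all recovering apps.

def estimated_recovery_seconds(num_apps, failover_timeout_s=15):
    """Total RM recovery time when every token renewal pays one
    connect-timeout penalty (observed ~15 s per app in our test)."""
    return num_apps * failover_timeout_s

print(estimated_recovery_seconds(100))  # 100 apps -> 1500 s (25 min)
```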


> It takes too long for the RM to recover all apps if the original active RM 
> and namenode are deployed on the same node.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3639
>                 URL: https://issues.apache.org/jira/browse/YARN-3639
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Xianyin Xin
>         Attachments: YARN-3639-recovery_log_1_app.txt
>
>
> If the node on which the active RM runs dies, and the active namenode is 
> running on that same node, the new RM takes a long time to recover all apps. 
> After analysis, we found the root cause is the renewal of HDFS tokens during 
> the recovery process. The HDFS client created by the renewer first tries to 
> connect to the original namenode, which times out after 10~20 s, and only 
> then tries the new namenode. The entire recovery cost about 15 × #apps 
> seconds in our test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
