[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)

 [ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianyin Xin updated YARN-3639:
--
Assignee: (was: Xianyin Xin)

 It takes too long time for RM to recover all apps if the original active RM 
 and namenode is deployed on the same node.
 --

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin

 If the node running the active RM dies and the active namenode was running on
 the same node, the new RM takes a long time to recover all apps. Our analysis
 found that the root cause is HDFS token renewal during recovery. The HDFS
 client created by the renewer first tries to connect to the original
 namenode, which times out after 10~20s, and only then connects to the new
 namenode. In our test, the entire recovery took about 15*#apps seconds.
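
 To give a feel for how the reported 15*#apps figure scales, here is a rough
 model in Python. It is only a sketch, not Hadoop code: the function name and
 the 15 s default are assumptions taken from the numbers above, and it assumes
 that token renewals run one after another, which the linear cost suggests.

```python
# Back-of-the-envelope model of the recovery delay described above.
# Not Hadoop code; the function and parameter names are hypothetical.

def estimated_recovery_seconds(num_apps: int, connect_timeout_s: float = 15.0) -> float:
    """Estimate total recovery delay, assuming each app's HDFS token renewal
    first waits out one connect timeout against the dead namenode and that
    renewals are processed serially."""
    return num_apps * connect_timeout_s


if __name__ == "__main__":
    for apps in (1, 100, 1000):
        total = estimated_recovery_seconds(apps)
        print(f"{apps:>5} apps -> ~{total:.0f} s ({total / 60:.1f} min)")
```

 Under these assumptions the delay grows linearly with the number of apps, so
 a cluster with many running apps would wait far longer than the single
 10~20s timeout.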



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)

 [ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianyin Xin updated YARN-3639:
--
Attachment: YARN-3639-recovery_log_1_app.txt

Attached the recovery log for one app. The log shows a gap of about 15s
between two log entries printed at 02:39:48,998 and 02:40:04,025. Note that
the TokenRenewer for HIVE tokens is not found, but that is unrelated to the
gap. The gap is caused by HDFS token renewal, for the reason given in the
{{Description}}.
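
The size of the gap can be checked directly from the two timestamps. The
snippet below is a small sketch: the helper name is hypothetical, and it
assumes both entries fall on the same day and use the `HH:MM:SS,mmm` format
quoted above.

```python
from datetime import datetime

# Timestamp layout of the two log entries quoted above: hours, minutes,
# seconds, then a comma and the millisecond part.
LOG_TIME_FORMAT = "%H:%M:%S,%f"


def gap_seconds(start: str, end: str) -> float:
    """Return the time between two same-day log timestamps, in seconds."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


if __name__ == "__main__":
    print(gap_seconds("02:39:48,998", "02:40:04,025"))  # prints 15.027
```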

 It takes too long time for RM to recover all apps if the original active RM 
 and namenode is deployed on the same node.
 --

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin
 Attachments: YARN-3639-recovery_log_1_app.txt


 If the node running the active RM dies and the active namenode was running on
 the same node, the new RM takes a long time to recover all apps. Our analysis
 found that the root cause is HDFS token renewal during recovery. The HDFS
 client created by the renewer first tries to connect to the original
 namenode, which times out after 10~20s, and only then connects to the new
 namenode. In our test, the entire recovery took about 15*#apps seconds.


