[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541697#comment-14541697
 ] 

nijel commented on YARN-3639:
-

Hi [~xinxianyin],
Thanks for reporting this issue.
Can you attach the logs for this issue?

 It takes too long time for RM to recover all apps if the original active RM 
 and namenode is deployed on the same node.
 --

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin

 If the node on which the active RM runs dies and the active namenode is 
 running on the same node, the new RM takes a long time to recover all apps. 
 After analysis, we found that the root cause is the renewal of HDFS tokens 
 during recovery. The HDFS client created by the renewer first tries 
 to connect to the original namenode, which times out after 
 10~20s, and only then tries to connect to the new namenode. The entire 
 recovery cost 15*#apps seconds according to our test.
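
 To illustrate the 15*#apps figure above, here is a minimal back-of-the-envelope sketch. It assumes each app's token renewal pays the dead-namenode timeout once and that renewals run one after another. The class and method names are hypothetical and are not Hadoop APIs; the app count of 1000 is an assumed value, not a number from the test.

```java
// Hypothetical cost model for illustration only; not part of Hadoop.
final class RecoveryCostEstimate {

    /** Total seconds when every app serially waits out the dead-NN timeout. */
    static long serialRecoverySeconds(int apps, long timeoutSecondsPerApp) {
        return (long) apps * timeoutSecondsPerApp;
    }

    public static void main(String[] args) {
        int apps = 1000;            // assumed number of apps to recover
        long timeoutSeconds = 15;   // midpoint of the reported 10~20s timeout
        // prints 15000
        System.out.println(serialRecoverySeconds(apps, timeoutSeconds));
    }
}
```

 Under these assumptions, 1000 apps would take roughly 15000 seconds (over four hours) to recover, which is why the per-app timeout dominates recovery time.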



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543187#comment-14543187
 ] 

Xianyin Xin commented on YARN-3639:
---

Yes, you're right [~aw]. 

 Attachments: YARN-3639-recovery_log_1_app.txt


[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542105#comment-14542105
 ] 

Allen Wittenauer commented on YARN-3639:


bq. on the same node

I don't see what that has to do with:

bq. After analysis, we found the root cause is renewing HDFS tokens in the 
recovering process. The HDFS client created by the renewer would firstly try to 
connect to the original namenode, the result of which is time-out after 10~20s, 
and then the client tries to connect to the new namenode. The entire recovery 
cost 15*#apps seconds according our test.

What happens if the two NNs aren't on the same node as the RM?  Does the 
problem still exist?

(It should also be noted that running the NN and RM on the same node for 
anything other than trivial deployments has never been a recommended configuration.)



[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543040#comment-14543040
 ] 

Xianyin Xin commented on YARN-3639:
---

Sorry [~aw], I didn't make it clear. "On the same node" means the original 
active RM and NN were running on the same node (node1). The standby RM and NN 
were running on other nodes. After node1 died, the HDFS token renewer would 
first try to connect to the NN on node1, but the NN on node1 was not reachable. 
After the connection timed out, the HDFS token renewer would then try to connect to 
the original standby NN.
If the active NN and RM run on different nodes, the problem doesn't exist.




[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543053#comment-14543053
 ] 

Xianyin Xin commented on YARN-3639:
---

Maybe we can fix this problem in the following way: later apps should 
learn from earlier ones, i.e., if one app finds that the original 
NN cannot be reached and then connects to the new NN successfully, later 
apps should be aware of this and avoid repeating the failure. The token renewer 
creates an HDFS client when it tries to renew an HDFS token for an app; maybe 
subsequent apps could reuse that client?
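
The idea above can be sketched as follows. This is a self-contained simulation under stated assumptions, not Hadoop code: the class ActiveNameNodeCache, its methods, and the node names are hypothetical, and the reachability check stands in for a real connection attempt with its timeout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: remembers which namenode last answered. Not a Hadoop API. */
final class ActiveNameNodeCache {
    private final List<String> nameNodes;
    private int preferred = 0;       // index of the NN to try first
    private int failedAttempts = 0;  // each failure stands for one connection timeout

    ActiveNameNodeCache(List<String> nameNodes) {
        if (nameNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one namenode is required");
        }
        this.nameNodes = new ArrayList<>(nameNodes);
    }

    /** Tries NNs starting from the last known good one; returns the NN that answered. */
    synchronized String connect(Predicate<String> reachable) {
        for (int i = 0; i < nameNodes.size(); i++) {
            int idx = (preferred + i) % nameNodes.size();
            String nn = nameNodes.get(idx);
            if (reachable.test(nn)) {
                preferred = idx;     // later callers start here and skip the dead NN
                return nn;
            }
            failedAttempts++;
        }
        throw new IllegalStateException("no namenode reachable");
    }

    synchronized int failedAttempts() {
        return failedAttempts;
    }

    public static void main(String[] args) {
        ActiveNameNodeCache cache =
            new ActiveNameNodeCache(Arrays.asList("node1", "node2"));
        Predicate<String> reachable = nn -> !nn.equals("node1"); // node1 has died
        for (int app = 0; app < 100; app++) {
            cache.connect(reachable); // one shared cache across all recovered apps
        }
        // prints 1: only the first app pays the timeout
        System.out.println(cache.failedAttempts());
    }
}
```

With a fresh client per app, each of the 100 apps in this simulation would hit node1 first and fail once; sharing the remembered state reduces that to a single failed attempt.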



[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543082#comment-14543082
 ] 

Allen Wittenauer commented on YARN-3639:


bq.  On the same node means the original active RM and NN were running on the 
same node(node1)

OK, so yes, this still sounds like an irrelevant detail.  It appears the real 
problem here is that if the NN and the RM both go down at the same time 
(+regardless of location+), the RM doesn't use the prior knowledge of the NN 
being down to stream renewal tokens after the new NN is discovered.
