[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541697#comment-14541697 ]

nijel commented on YARN-3639:
-----------------------------

hi [~xinxianyin]
Thanks for reporting this issue. Can you attach the logs for this issue?

> It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3639
>                 URL: https://issues.apache.org/jira/browse/YARN-3639
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Xianyin Xin
>
> If the node on which the active RM runs dies and the active namenode is running on the same node, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens in the recovery process. The HDFS client created by the renewer first tries to connect to the original namenode, which times out after 10~20s, and only then connects to the new namenode. The entire recovery cost 15s * #apps according to our test.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
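The per-app cost described in the report compounds linearly. A minimal sketch (illustrative Java only, not Hadoop code; the 15 s figure and the app-by-app renewal assumption are taken from the report above):

```java
// Illustrative only: why a fixed failover timeout per app makes RM recovery
// time scale linearly with the number of recovered applications.
public class RecoveryEstimate {
    // Assumed from the report: ~15 s lost per app while the renewer's HDFS
    // client waits out the connect timeout against the dead namenode.
    static final int FAILOVER_SECONDS_PER_APP = 15;

    // Tokens are renewed app by app during recovery, so the timeouts add up.
    static int recoverySeconds(int numApps) {
        return FAILOVER_SECONDS_PER_APP * numApps;
    }

    public static void main(String[] args) {
        // 1000 recovered apps -> 15000 s, i.e. over four hours of recovery.
        System.out.println(recoverySeconds(1000));
    }
}
```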
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543187#comment-14543187 ]

Xianyin Xin commented on YARN-3639:
-----------------------------------

Yes, you're right [~aw].

>         Attachments: YARN-3639-recovery_log_1_app.txt
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542105#comment-14542105 ]

Allen Wittenauer commented on YARN-3639:
----------------------------------------

bq. on the same node

I don't see what that has to do with:

bq. After analysis, we found the root cause is renewing HDFS tokens in the recovering process. The HDFS client created by the renewer would firstly try to connect to the original namenode, the result of which is time-out after 10~20s, and then the client tries to connect to the new namenode. The entire recovery cost 15*#apps seconds according our test.

What happens if the two NNs aren't on the same node as the RM? Does the problem still exist?

(It should also be noted that running the NN and RM on the same node for other than trivial deployments has never been a recommended configuration.)
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543040#comment-14543040 ]

Xianyin Xin commented on YARN-3639:
-----------------------------------

Sorry [~aw], I didn't put it clearly. "On the same node" means the original active RM and NN were running on the same node (node1); the standby RM and NN were running on other nodes. After node1 died, the HDFS token renewer would first try to connect to the NN on node1, which was unreachable. After the connection timed out, the renewer then connected to the original standby NN. If the active NN and RM run on different nodes, the problem doesn't exist.
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543053#comment-14543053 ]

Xianyin Xin commented on YARN-3639:
-----------------------------------

Maybe we can fix this problem in the following way: later apps should learn from the earlier ones, i.e., if one app finds that the original NN cannot be reached and then connects to the new NN successfully, the later apps should be aware of this and avoid repeating the failure. The token renewer creates an HDFS client when it renews an HDFS token for an app; maybe the following apps could reuse that client?
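The client-reuse idea sketched above could look roughly like this. This is a hedged illustration, not the actual DelegationTokenRenewer: NnClient, renewToken, and the factory are hypothetical names standing in for Hadoop's FileSystem/token machinery; the point is only that the slow failover probe is paid at most once per filesystem URI instead of once per app.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hedged sketch: cache the first client that successfully reaches the
// active NN, keyed by filesystem URI, so later apps skip the
// connect-timeout against the dead node.
public class CachedRenewer {
    // Hypothetical stand-in for the renewer's HDFS client.
    interface NnClient { long renewToken(String token); }

    private final Map<String, NnClient> clients = new ConcurrentHashMap<>();
    private final Supplier<NnClient> factory; // performs the slow failover probe

    CachedRenewer(Supplier<NnClient> factory) { this.factory = factory; }

    long renew(String fsUri, String token) {
        // computeIfAbsent pays the failover cost at most once per URI;
        // every later app for the same URI reuses the discovered client.
        NnClient client = clients.computeIfAbsent(fsUri, uri -> factory.get());
        return client.renewToken(token);
    }
}
```

With this shape, recovering N apps against one HDFS namespace performs a single failover probe instead of N, which is exactly the "reuse the client" suggestion in the comment.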
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543082#comment-14543082 ]

Allen Wittenauer commented on YARN-3639:
----------------------------------------

bq. On the same node means the original active RM and NN were running on the same node(node1)

OK, so yes, this still sounds like an irrelevant detail. It appears the real problem here is that if the NN and the RM both go down at the same time (+regardless of location+), the RM doesn't use the prior knowledge of the NN being down to stream renewal tokens after the new NN is discovered.