[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210327#comment-15210327 ]

Jun Gong commented on YARN-3809:
--------------------------------

It seems NMClient also needs the same change.

> Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>             Fix For: 2.7.1
>
>         Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch
>
>
> ApplicationMasterLauncher creates a thread pool of size 10 to handle
> AMLauncherEventType events (LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on them, and one shut
> down for some reason. After the RM found the NM LOST, it cleaned up the AMs
> running on it, so ApplicationMasterLauncher needed to handle these 10+ CLEANUP
> events. ApplicationMasterLauncher's thread pool filled up, and all of its
> threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM
> was down; the default RPC timeout is 15 minutes. This means that for 15 minutes
> ApplicationMasterLauncher could not handle new events such as LAUNCH, and new
> attempts failed to launch because of timeouts.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
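The failure mode in the issue description above can be reproduced in miniature without Hadoop. The following is only a sketch: the class and method names are hypothetical, a `CountDownLatch` that is never released stands in for the hung `stopContainers` call, and nothing here is taken from the actual ApplicationMasterLauncher code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the launcher's fixed-size thread pool.
final class LauncherPoolStarvationDemo {

    /**
     * Submits blockedCleanups tasks that hang (like CLEANUP events waiting on a
     * dead NM), then one LAUNCH task. Returns true if the LAUNCH task started
     * running within waitMillis.
     */
    static boolean launchStartsWhileCleanupsHang(int poolSize, int blockedCleanups, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch nmResponds = new CountDownLatch(1);    // released only during teardown
        CountDownLatch launchStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockedCleanups; i++) {
                pool.execute(() -> {
                    try {
                        nmResponds.await();                   // simulated hung RPC
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.execute(launchStarted::countDown);           // simulated LAUNCH event
            return launchStarted.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            nmResponds.countDown();
            pool.shutdownNow();
        }
    }
}
```

With a pool of 10 and 10 hung tasks, the LAUNCH task stays queued; with only 9 hung tasks, one thread is still free and it runs immediately.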
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210137#comment-15210137 ]

Laxman commented on YARN-3809:
------------------------------

Thanks [~rohithsharma]. Filed a new JIRA, MAPREDUCE-6659. Since I believe it requires MR app master changes, I filed the issue under the MR project.
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209981#comment-15209981 ]

Rohith Sharma K S commented on YARN-3809:
-----------------------------------------

Currently NMProxy waits indefinitely, i.e. 0 is set by default by NMProxy. This is why all your AM-NM calls hang. It can be controlled by configuring "ipc.client.connect.max.retries.on.timeouts". Maybe a new configuration named *yarn.client.connect.max.retries.on.timeouts* could be added to control NMProxy connections only. Can you raise a new ticket for this?
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209940#comment-15209940 ]

Laxman commented on YARN-3809:
------------------------------

We are using Hadoop 2.6.0, and we are facing a similar issue with the AM-NM RPC protocol as well. After the AM is notified of a lost NM, I see AM threads hung in containerMgrProxy.stopContainers(stopRequest). The current patch fixes only the RM-AM protocol. Should we have a similar fix for AM-NM as well?
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601070#comment-14601070 ]

Hudson commented on YARN-3809:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/969/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601051#comment-14601051 ]

Hudson commented on YARN-3809:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601369#comment-14601369 ]

Hudson commented on YARN-3809:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601409#comment-14601409 ]

Hudson commented on YARN-3809:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601463#comment-14601463 ]

Hudson commented on YARN-3809:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601445#comment-14601445 ]

Hudson commented on YARN-3809:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600489#comment-14600489 ]

Jun Gong commented on YARN-3809:
--------------------------------

Thanks [~rohithsharma], [~devaraj.k] and [~kasha] for the comments and review. Thanks [~jlowe] for the suggestions, the review, and committing the patch.
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599702#comment-14599702 ]

Hudson commented on YARN-3809:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #8057 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8057/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597810#comment-14597810 ]

Jason Lowe commented on YARN-3809:
----------------------------------

+1 latest patch lgtm. Will commit this later today if there are no objections.
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597622#comment-14597622 ]

Jun Gong commented on YARN-3809:
--------------------------------

As explained previously, the checkstyle and test case errors are not related to the patch.
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594424#comment-14594424 ]

Jun Gong commented on YARN-3809:
--------------------------------

Attached a new patch to address [~jlowe]'s suggestions. Thanks for the review.
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594746#comment-14594746 ]

Hadoop QA commented on YARN-3809:
---------------------------------

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 48s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 52s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 4m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests | 50m 45s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 98m 47s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / bcb3c40 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8299/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8299/console |

This message was automatically generated.
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594300#comment-14594300 ]

Jun Gong commented on YARN-3809:
--------------------------------

[~jlowe] Thank you for the very detailed suggestions. They help me a lot. I will update the patch to address them.

I think it might be better to change the config in ApplicationMasterLauncher#serviceInit; then we only need to change it once, and all newly created AMLaunchers will use this config.
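The "copy the config once, override the IPC setting only in the copy" idea from this comment can be sketched without Hadoop on the classpath. In the sketch below, `java.util.Properties` is a stand-in for Hadoop's `Configuration`, and the class and method names are hypothetical; only the property key `ipc.client.connect.max.retries.on.timeouts` comes from the thread.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for Hadoop's Configuration.
final class LauncherConfSketch {
    static final String IPC_RETRIES_KEY = "ipc.client.connect.max.retries.on.timeouts";

    /**
     * Builds the launcher-private config once (e.g. during service init):
     * copies every entry from base, then overrides the IPC retry count in
     * the copy only, so other users of base are not affected.
     */
    static Properties launcherConf(Properties base, int nmConnectRetries) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty(IPC_RETRIES_KEY, Integer.toString(nmConnectRetries));
        return copy;
    }
}
```

Every launcher created afterwards can share the returned object, which avoids building a new config per NM proxy while keeping the override away from other proxies.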
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593452#comment-14593452 ]
Jun Gong commented on YARN-3809:
[~jlowe] Thanks for the explanation and suggestions. I misunderstood [~kasha]'s suggestion: I thought we just needed to set IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_KEY to a smaller value, but that would pollute other proxies. The new patch addresses your suggestion: it sets ipc.client.connect.max.retries.on.timeouts from a newly introduced config, yarn.resourcemanager..container-management.retries, whose default value is 10. The default retry time will then be 200 secs (10 retries at 20 secs each).
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch, YARN-3809.02.patch
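The "pollute other proxies" concern can be shown with a minimal sketch. It uses java.util.Properties as a stand-in for Hadoop's Configuration, an assumption made only so the example runs without Hadoop on the classpath. The class and method names are hypothetical; the key name and the values 45 and 10 are the ones quoted in this thread.

```java
import java.util.Properties;

// Hypothetical illustration: Properties stands in for Hadoop's Configuration.
class ConfCopySketch {
    static final String RETRIES_KEY = "ipc.client.connect.max.retries.on.timeouts";

    // Returns a private copy of the shared settings with the retry count overridden.
    static Properties withLauncherRetries(Properties shared, int retries) {
        Properties copy = new Properties();
        copy.putAll(shared); // copy first, so the shared settings are never mutated
        copy.setProperty(RETRIES_KEY, Integer.toString(retries));
        return copy;
    }

    public static void main(String[] args) {
        Properties shared = new Properties();
        shared.setProperty(RETRIES_KEY, "45"); // the IPC default quoted in this thread
        Properties launcherConf = withLauncherRetries(shared, 10);
        System.out.println("shared=" + shared.getProperty(RETRIES_KEY)
            + " launcher=" + launcherConf.getProperty(RETRIES_KEY));
    }
}
```

With a real Hadoop Configuration the same idea would presumably be expressed with its copy constructor followed by setInt on the copy, so only the proxy built from the copy sees the lower retry count.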
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593490#comment-14593490 ]
Jason Lowe commented on YARN-3809:
Thanks for updating the patch, Jun.
ContainerManagementProtocolPBClientImpl is not the appropriate place to make this change. That is used by every client of the ContainerManagementProtocol, which includes AMs, etc. AMLauncher.getContainerMgrProxy is a more appropriate place, although there we still don't want to create a new config every time we create an NM client proxy. Creating configs is expensive. An even better place to put this change is in the AMLauncher constructor, where it can create a copy of the conf and then set the IPC config in it, which in turn will be passed to the YarnRPC code to create the proxy.
The new properties should have entries in yarn-default.xml for documentation purposes.
Nit: it is probably not going to be clear to most users what container-management means. I think it would be clearer to use nodemanager since that's used in many other places, so I suggest a property name like yarn.resourcemanager.nodemanager-connect-retries.
Since we're changing the code that sets up the AM launcher thread pool, it'd be really nice to give each of the threads in that pool a descriptive name. It's annoying to see a bunch of threads like pool-1-thread-2 in the jstack all waiting for work but no clue in the name or stack as to what service they belong to.
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch, YARN-3809.02.patch
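The request for descriptive thread names can be sketched with a small ThreadFactory. This is only an illustration using the JDK's java.util.concurrent classes; the class name, the "ApplicationMasterLauncher" prefix, and the " #n" suffix format are assumptions, not what the eventual patch uses.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical factory that names pool threads after the service that owns them.
class NamedThreadFactorySketch implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(0);

    NamedThreadFactorySketch(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable r) {
        // Produces names such as "ApplicationMasterLauncher #0" instead of "pool-1-thread-1".
        return new Thread(r, prefix + " #" + counter.getAndIncrement());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
            10, new NamedThreadFactorySketch("ApplicationMasterLauncher"));
        // Print the name of the worker thread that runs the task.
        pool.submit(() -> System.out.println(Thread.currentThread().getName())).get();
        pool.shutdown();
    }
}
```

With names like these, a jstack dump shows at a glance which idle threads belong to the launcher pool.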
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591806#comment-14591806 ]
Jason Lowe commented on YARN-3809:
I agree the thread pool should be configurable and possibly larger by default, but note that the thread pool size has little to do with the number of AMs running on a node. The pool size needs to be as large as the worst-case number of AMs being launched during the worst-case retry duration to avoid any unnecessary delays. For a 15 minute retry delay on a cluster launching multiple apps per second on average, that's an unreasonable thread pool size.
I agree with [~kasha] that we need to lower the retry timeout as part of this fix. As it is today, we will expire the NM due to lack of heartbeat before we give up on an AM launch retry, which makes no sense. We can update ipc.client.connect.max.retries.on.timeouts and potentially ipc.client.connect.timeout for the conf passed to the NM proxy we create, although we need to make sure we make a copy of the config to avoid polluting other proxies.
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch
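The sizing argument above can be made concrete with a rough calculation. The sketch below assumes a launch rate of 2 apps per second, a made-up figure standing in for "multiple apps per second", and the class and method names are hypothetical. It only multiplies the launch rate by the time a thread can stay blocked.

```java
// Hypothetical back-of-the-envelope sizing: threads needed so that no launch waits
// while every earlier launch is still blocked for the full retry duration.
class PoolSizingSketch {
    static long threadsNeeded(double launchesPerSecond, long blockedSeconds) {
        return (long) Math.ceil(launchesPerSecond * blockedSeconds);
    }

    public static void main(String[] args) {
        // 15 minute retry duration, assumed rate of 2 launches per second.
        System.out.println(threadsNeeded(2.0, 15 * 60));
        // The same assumed rate with a 200 second retry duration.
        System.out.println(threadsNeeded(2.0, 200));
    }
}
```

Under these assumptions the 15 minute case needs on the order of 1800 threads, which is why shortening the retry duration matters more than enlarging the pool.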
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590852#comment-14590852 ]
Karthik Kambatla commented on YARN-3809:
Or, could we make it so we don't wait as long as 15 minutes?
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590849#comment-14590849 ]
Karthik Kambatla commented on YARN-3809:
Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value?
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591093#comment-14591093 ]
Jun Gong commented on YARN-3809:
[~devaraj.k] and [~kasha], thank you for the comments and suggestions.
{quote}
Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value?
{quote}
Threads in the pool only launch and stop AMs, so it would be better for the number of threads in the pool to be at least as large as the maximum number of AMs that could run on a node. Although we cannot know the maximum value for all clusters in advance, a larger value makes handling AMLauncher events faster. Admins could start with the default value and adjust it if they find it is too small.
{quote}
Or, could we make it so we don't wait as long as 15 minutes?
{quote}
Yes, we could make it shorter. I think we also need a larger thread pool, so that it can handle more events at the same time.
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588334#comment-14588334 ]
Hadoop QA commented on YARN-3809:
\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 17m 31s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 21s | The applied patch generated 1 new checkstyle issues (total was 213, now 213). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 54s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests | 50m 49s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 93m 4s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12739875/YARN-3809.01.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b039e69 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8261/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8261/console |
This message was automatically generated.
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589164#comment-14589164 ]
Jun Gong commented on YARN-3809:
The checkstyle error is: YarnConfiguration.java: File length is 2,025 lines (max allowed is 2,000). Does it need to be fixed? The test case failure is addressed in YARN-3790.
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589338#comment-14589338 ]
Devaraj K commented on YARN-3809:
bq. The checkstyle error is : YarnConfiguration.java: File length is 2,025 lines (max allowed is 2,000). Need fix it?
bq. The applied patch generated 1 new checkstyle issues (total was 213, now 213).
I think you don't need to fix this checkstyle issue as part of this JIRA: the checkstyle count is the same after the patch, and the file-length warning exists without the patch as well and would probably take a reasonable amount of effort to refactor away.
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587519#comment-14587519 ]
Rohith commented on YARN-3809:
This is an interesting scenario, but I am not sure why the thread pool size is set to 10 and is not configurable.
bq. the default RPC time out is 15 mins..
I see the RPC timeout is 1 minute; am I missing anything?
{code}
// Default command timeout in milliseconds (1 minute).
static final int DEFAULT_COMMAND_TIMEOUT = 60000;
...
int expireIntvl = conf.getInt(NM_COMMAND_TIMEOUT, DEFAULT_COMMAND_TIMEOUT);
proxy = (ContainerManagementProtocolPB) RPC.getProxy(ContainerManagementProtocolPB.class, clientVersion,
    addr, ugi, conf, NetUtils.getDefaultSocketFactory(conf), expireIntvl);
{code}
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587555#comment-14587555 ]
Jun Gong commented on YARN-3809:
The stack trace is as follows:
{noformat}
2015-06-15 11:16:35,376 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error cleaning master
org.apache.hadoop.net.ConnectTimeoutException: Call From docker-10-240-139-221/10.240.139.221 to docker-10-240-139-234:8041 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=docker-10-240-139-234/10.240.139.234:8041]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.GeneratedConstructorAccessor107.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
        at org.apache.hadoop.ipc.Client.call(Client.java:1414)
        at org.apache.hadoop.ipc.Client.call(Client.java:1363)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy36.stopContainers(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:138)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:263)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=docker-10-240-139-234/10.240.139.234:8041]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
        at org.apache.hadoop.ipc.Client.call(Client.java:1381)
        ... 9 more
{noformat}
The timeout is 20 secs, but the client will retry 45 times (IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_DEFAULT = 45).
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong
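The relation between the 20 second connect timeout, the 45 retries, and the 15 minute hang can be checked with a small calculation. The sketch below is plain arithmetic with hypothetical class and method names. It treats every retry as one full connect timeout, which is the simplification used in this thread, and ignores any off-by-one in how the IPC client counts attempts.

```java
// Hypothetical helper: approximate time a launcher thread stays blocked on a dead NM.
class ConnectRetrySketch {
    static long worstCaseSeconds(int connectTimeoutMillis, int maxRetriesOnTimeouts) {
        return (long) connectTimeoutMillis * maxRetriesOnTimeouts / 1000L;
    }

    public static void main(String[] args) {
        // 20000 ms timeout with the default of 45 retries quoted above.
        long defaultCase = worstCaseSeconds(20000, 45);
        System.out.println(defaultCase + " secs = " + (defaultCase / 60) + " mins");
        // The same timeout with a retry count of 10.
        System.out.println(worstCaseSeconds(20000, 10) + " secs");
    }
}
```

This is where the "15 mins" in the issue description comes from, even though a single connect attempt times out after 20 secs.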
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587474#comment-14587474 ]
Jun Gong commented on YARN-3809:
How about making the thread pool in ApplicationMasterLauncher larger, or making its size configurable?
Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong