[jira] [Updated] (YARN-1123) [YARN-321] Adding ContainerReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-1123:
------------------------------
    Attachment: YARN-1123-6.patch

+1. I created a patch which is almost the same, but fixes a minor format issue.

> [YARN-321] Adding ContainerReport and Protobuf implementation
> --------------------------------------------------------------
>                 Key: YARN-1123
>                 URL: https://issues.apache.org/jira/browse/YARN-1123
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Mayank Bansal
>         Attachments: YARN-1123-1.patch, YARN-1123-2.patch, YARN-1123-3.patch, YARN-1123-4.patch, YARN-1123-5.patch, YARN-1123-6.patch
>
> Like YARN-978, we need a client-oriented class to expose the container history info. Neither Container nor RMContainer is the right one.
[jira] [Resolved] (YARN-1384) RMAppImpl#createApplicationState should call RMServerUtils#createApplicationState to convert the state
[ https://issues.apache.org/jira/browse/YARN-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen resolved YARN-1384.
-------------------------------
    Resolution: Invalid

> RMAppImpl#createApplicationState should call RMServerUtils#createApplicationState to convert the state
> -------------------------------------------------------------------------------------------------------
>                 Key: YARN-1384
>                 URL: https://issues.apache.org/jira/browse/YARN-1384
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: haosdent
>            Priority: Minor
>
> RMAppImpl#createApplicationState should call RMServerUtils#createApplicationState to convert the state instead of duplicating the conversion code. Some code refactoring is required here.
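For context, the duplicated code in question is a plain state-machine mapping from the RM-internal RMAppState to the client-facing YarnApplicationState. Below is a hedged, illustrative sketch of such a conversion helper; the case arms are assumptions for illustration, not the code that was removed in YARN-540:

{code}
// Illustrative RMAppState -> YarnApplicationState conversion; the exact
// mapping here is an assumption, not the removed trunk implementation.
public static YarnApplicationState createApplicationState(RMAppState rmAppState) {
  switch (rmAppState) {
  case NEW:
    return YarnApplicationState.NEW;
  case SUBMITTED:
    return YarnApplicationState.SUBMITTED;
  case ACCEPTED:
    return YarnApplicationState.ACCEPTED;
  case RUNNING:
    return YarnApplicationState.RUNNING;
  case FINISHED:
    return YarnApplicationState.FINISHED;
  case KILLED:
    return YarnApplicationState.KILLED;
  case FAILED:
    return YarnApplicationState.FAILED;
  default:
    throw new YarnRuntimeException("Unknown state passed: " + rmAppState);
  }
}
{code}

Keeping one such helper and delegating to it from RMAppImpl is exactly the refactoring the issue asked for.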
[jira] [Commented] (YARN-1384) RMAppImpl#createApplicationState should call RMServerUtils#createApplicationState to convert the state
[ https://issues.apache.org/jira/browse/YARN-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812676#comment-13812676 ]

Zhijie Shen commented on YARN-1384:
-----------------------------------
I've validated the problem again. RMServerUtils#createApplicationState has already been removed from trunk since YARN-540; branch YARN-321 brings it back, so we may have had some problem when merging branch-2 into YARN-321. Closing this as invalid for now; if we need to fix the duplicate code when merging YARN-321 back to branch-2, let's reopen it. Anyway, thanks for your effort, [~haosd...@gmail.com]!
[jira] [Commented] (YARN-1123) [YARN-321] Adding ContainerReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812679#comment-13812679 ]

Hadoop QA commented on YARN-1123:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12611931/YARN-1123-6.patch
against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api and hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2359//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2359//console

This message is automatically generated.
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812682#comment-13812682 ]

Zhijie Shen commented on YARN-978:
----------------------------------
+1

> [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
> -----------------------------------------------------------------------
>                 Key: YARN-978
>                 URL: https://issues.apache.org/jira/browse/YARN-978
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>             Fix For: YARN-321
>         Attachments: YARN-978-1.patch, YARN-978.10.patch, YARN-978.2.patch, YARN-978.3.patch, YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch, YARN-978.7.patch, YARN-978.8.patch, YARN-978.9.patch
>
> We don't have an ApplicationAttemptReport and Protobuf implementation. Adding that.
> Thanks,
> Mayank
[jira] [Created] (YARN-1388) fair share does not display info in the scheduler page
Liyin Liang created YARN-1388:
---------------------------------
             Summary: fair share does not display info in the scheduler page
                 Key: YARN-1388
                 URL: https://issues.apache.org/jira/browse/YARN-1388
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.2.1
            Reporter: Liyin Liang
[jira] [Updated] (YARN-1388) fair share does not display info in the scheduler page
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyin Liang updated YARN-1388:
------------------------------
    Description: YARN-1044 fixed the min/max/used resource display problem in the scheduler page, but Fair Share has the same problem and needs to be fixed as well.
[jira] [Updated] (YARN-1388) fair share does not display info in the scheduler page
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyin Liang updated YARN-1388:
------------------------------
    Attachment: yarn-1388.diff
[jira] [Commented] (YARN-1388) fair share does not display info in the scheduler page
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812828#comment-13812828 ]

Hadoop QA commented on YARN-1388:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12611950/yarn-1388.diff
against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2360//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2360//console

This message is automatically generated.
[jira] [Commented] (YARN-1388) fair share does not display info in the scheduler page
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812975#comment-13812975 ]

Sandy Ryza commented on YARN-1388:
----------------------------------
+1
[jira] [Commented] (YARN-1320) Custom log4j properties in Distributed shell does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813079#comment-13813079 ]

Vinod Kumar Vavilapalli commented on YARN-1320:
-----------------------------------------------
I doubt the patch is going to work if the remote file-system is HDFS. The propagation of the log4j properties file is via HDFS, and it doesn't look like it is handled correctly. Please check.

> Custom log4j properties in Distributed shell does not work properly.
> ---------------------------------------------------------------------
>                 Key: YARN-1320
>                 URL: https://issues.apache.org/jira/browse/YARN-1320
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications/distributed-shell
>            Reporter: Tassapol Athiapinya
>            Assignee: Xuan Gong
>             Fix For: 2.2.1
>         Attachments: YARN-1320.1.patch, YARN-1320.2.patch, YARN-1320.3.patch, YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.5.patch, YARN-1320.6.patch, YARN-1320.6.patch, YARN-1320.7.patch
>
> Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties.
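For reference, a minimal sketch of the usual way a client ships a custom log4j.properties so it survives on a remote file-system: copy it to the application's staging directory on HDFS, then register it as a LocalResource. The variable names (log4jPropFile, appName, appId, localResources) are assumptions for illustration, not the YARN-1320 patch itself:

{code}
// Copy the local -log_properties file to the remote FS so the NM can fetch it.
FileSystem fs = FileSystem.get(conf);
Path src = new Path(log4jPropFile);
Path dst = new Path(fs.getHomeDirectory(), appName + "/" + appId + "/log4j.properties");
fs.copyFromLocalFile(false, true, src, dst);

// Register it as a LocalResource keyed by the name the container expects;
// note the URL must point at the remote (HDFS) copy, not the local path.
FileStatus status = fs.getFileStatus(dst);
LocalResource log4jRsrc = LocalResource.newInstance(
    ConverterUtils.getYarnUrlFromPath(dst),
    LocalResourceType.FILE, LocalResourceVisibility.APPLICATION,
    status.getLen(), status.getModificationTime());
localResources.put("log4j.properties", log4jRsrc);
{code}

If the patch builds the LocalResource URL from the client-local path instead of the HDFS copy, localization would fail exactly the way the comment above suspects.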
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813089#comment-13813089 ]

Zhijie Shen commented on YARN-979:
----------------------------------
The patch is almost good, with the following minor issues:
* The following javadoc is inconsistent with ApplicationAttemptReport (YARN-978):
{code}
+ *   <li>host - set to N/A</li>
+ *   <li>RPC port - set to -1</li>
+ *   <li>client token - set to N/A</li>
+ *   <li>diagnostics - set to N/A</li>
+ *   <li>tracking URL - set to N/A</li>
{code}
* As is mentioned in the other two jiras, please move "GetApplicationAttemptReportRequestProtoOrBuilder p = viaProto ? proto : builder;" later:
{code}
+  @Override
+  public ApplicationAttemptId getApplicationAttemptId() {
+    GetApplicationAttemptReportRequestProtoOrBuilder p =
+        viaProto ? proto : builder;
+    if (this.applicationAttemptId != null) {
+      return this.applicationAttemptId;
+    }
+    if (!p.hasApplicationAttemptId()) {
+      return null;
+    }
+    this.applicationAttemptId =
+        convertFromProtoFormat(p.getApplicationAttemptId());
+    return this.applicationAttemptId;
+  }
{code}
* You need to change hadoop-yarn-api/pom.xml so that application_history_client.proto gets compiled.

In addition to the patch's issues, I'd like to raise one design issue here, projecting some future problems. This patch adds different APIs for application/attempt/container, which is going to be a superset of the APIs of ApplicationClientProtocol. That's OK as long as we restrict the problem to the AHS domain. However, in the future we'd probably like to integrate ApplicationHistoryProtocol with ApplicationClientProtocol. In other words, from the users' point of view, they may inquire about any application using a client, which makes it transparent whether the application report is received via ApplicationClientProtocol (if the application is running) or via ApplicationHistoryProtocol (if it is done). Then ApplicationClientProtocol's and ApplicationHistoryProtocol's APIs mismatch: users can inquire about finished attempts/containers, but not the running ones. ApplicationClientProtocol may need to add the APIs for attempt/container as well. Alternatively, the API design could keep only getApplicationReport(), with options to load all attempt/container reports or not. Just thinking out loud: personally, I'm inclined toward the current API design, which is more flexible, but I'm a bit concerned about the future integration. Thoughts?

> [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
> ----------------------------------------------------------------------------------------------------
>                 Key: YARN-979
>                 URL: https://issues.apache.org/jira/browse/YARN-979
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>         Attachments: YARN-979-1.patch, YARN-979-3.patch, YARN-979-4.patch, YARN-979.2.patch
>
> ApplicationHistoryProtocol should have the following APIs as well:
> * getApplicationAttemptReport
> * getApplicationAttempts
> * getContainerReport
> * getContainers
> The corresponding request and response classes need to be added as well.
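To make the second review point concrete, here is the getter after the suggested reordering: the proto-or-builder is resolved only when the cached field is absent, so cached reads never touch the protobuf objects. This is a sketch assuming the usual PBImpl fields (proto, builder, viaProto) shown in the patch:

{code}
@Override
public ApplicationAttemptId getApplicationAttemptId() {
  // Serve from the cache first; no need to resolve proto/builder for that.
  if (this.applicationAttemptId != null) {
    return this.applicationAttemptId;
  }
  GetApplicationAttemptReportRequestProtoOrBuilder p =
      viaProto ? proto : builder;
  if (!p.hasApplicationAttemptId()) {
    return null;
  }
  this.applicationAttemptId =
      convertFromProtoFormat(p.getApplicationAttemptId());
  return this.applicationAttemptId;
}
{code}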
[jira] [Updated] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian He updated YARN-1121:
--------------------------
    Attachment: YARN-1121.4.patch

- Add a separate drainingStop flag to indicate serviceStop() is called for draining.
- Move setDrainingStop() to RMStateStore.serviceInit().

> RMStateStore should flush all pending store events before closing
> -------------------------------------------------------------------
>                 Key: YARN-1121
>                 URL: https://issues.apache.org/jira/browse/YARN-1121
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Bikas Saha
>            Assignee: Jian He
>             Fix For: 2.2.1
>         Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, YARN-1121.3.patch, YARN-1121.4.patch
>
> On serviceStop, it should wait for all internal pending events to drain before stopping.
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Vinit Joshi updated YARN-1210:
------------------------------------
    Attachment: YARN-1210.2.patch

> During RM restart, RM should start a new attempt only when previous attempt exits for real
> --------------------------------------------------------------------------------------------
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch, YARN-1210.2.patch
>
> When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart.
> Meanwhile, new apps will proceed as usual as existing apps wait for recovery.
> This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt.
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813109#comment-13813109 ]

Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------
Completely removed the RECOVERED state; the rest of the patch is the same. The only major differences are:
* Before launching a new appAttempt, the RM will check whether any of the application attempts was running before. If so, the RM will wait instead of starting a new application attempt. If no application attempt is found in a running state (anything other than a final state), it launches a new application attempt. (See the sketch below.)
* When the NodeManager receives the resync signal, it kills all the running containers and then reports the killed containers to the RM during RM registration. On receiving the container information, the RM checks whether any of the reported containers is an AM container. If so, it sends a container-failed event to the related app attempt, which eventually starts a new application attempt.
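A hedged sketch of the first check described above: launch a fresh attempt only when every recovered attempt has reached a final state. The helper name is hypothetical; RMApp#getAppAttempts() and the attempt states are the existing RM interfaces:

{code}
// Hypothetical recovery-time check, not the YARN-1210.2.patch code.
private boolean hasLiveRecoveredAttempt(RMApp app) {
  for (RMAppAttempt attempt : app.getAppAttempts().values()) {
    RMAppAttemptState state = attempt.getAppAttemptState();
    if (state != RMAppAttemptState.FINISHED
        && state != RMAppAttemptState.FAILED
        && state != RMAppAttemptState.KILLED) {
      return true;   // a previous attempt may still be running; wait for it
    }
  }
  return false;      // all prior attempts exited for real; safe to launch a new one
}
{code}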
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813115#comment-13813115 ]

Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------
Cancelled the patch as it is based on YARN-674.
[jira] [Commented] (YARN-1388) fair share does not display info in the scheduler page
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813118#comment-13813118 ]

Sangjin Lee commented on YARN-1388:
-----------------------------------
Looks good to me. Thanks for the patch!
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813126#comment-13813126 ]

Hadoop QA commented on YARN-1121:
---------------------------------
{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12611996/YARN-1121.4.patch
against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common and hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2361//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2361//console

This message is automatically generated.
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813139#comment-13813139 ]

Luke Lu commented on YARN-311:
------------------------------
[~djp]: Unfortunately YARN-1343 got in before I tried to merge the patch. Now the patch won't compile, due to the old RMNodeImpl ctor usage in TestRMNodeTransition. Can you rebase the patch?

> Dynamic node resource configuration: core scheduler changes
> -------------------------------------------------------------
>                 Key: YARN-311
>                 URL: https://issues.apache.org/jira/browse/YARN-311
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager, scheduler
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: YARN-311-v1.patch, YARN-311-v10.patch, YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, YARN-311-v9.patch
>
> As the first step, we go for resource change on the RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. This jira will only contain changes in the scheduler.
> The flow to update a node's resource, and the awareness of it in resource scheduling, is:
> 1. A resource update comes through the admin API to the RM and takes effect on RMNodeImpl.
> 2. When the next NM heartbeat for updating status comes, the RMNode's resource change is picked up, and the delta resource is added to the SchedulerNode's availableResource before actual scheduling happens.
> 3. The scheduler does resource allocation according to the new availableResource in the SchedulerNode.
> For more design details, please refer to the proposal and discussions in the parent JIRA: YARN-291.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813183#comment-13813183 ]

Vinod Kumar Vavilapalli commented on YARN-90:
---------------------------------------------
Thanks for the patch, Song! Some quick comments:
- Because you are changing the semantics of checkDirs(), there are more changes that are needed:
-- updateDirsAfterFailure() -> updateConfAfterDirListChange?
-- The log message in updateDirsAfterFailure, "Disk(s) failed.", should be changed to something like "Disk-health report changed:".
- Web UI and web-services are fine for now, I think; nothing to do there.
- Drop the extraneous System.out.println lines throughout the patch.
- Let's drop the metrics changes. We need to expose this end-to-end and not just via metrics: client-side reports, JMX and metrics. Worth tracking that effort separately.
- Tests:
-- testAutoDir() -> testDisksGoingOnAndOff?
-- Can you also validate the health report both when disks go off and when they come back again?
-- Also, just throw unwanted exceptions instead of catching them and printing stack traces.

> NodeManager should identify failed disks becoming good back again
> -------------------------------------------------------------------
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back).
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813196#comment-13813196 ]

Mayank Bansal commented on YARN-979:
------------------------------------
[~zjshen] Thanks for the review.

bq. You need to change hadoop-yarn-api/pom.xml so that application_history_client.proto gets compiled.

It's already there.

bq. In addition to the patch's issues, I'd like to raise one design issue here, projecting some future problems. [...] ApplicationClientProtocol may need to add the APIs for attempt/container as well. [...] Thoughts?

I will create a jira for making ApplicationClientProtocol similar to ApplicationHistoryProtocol.

Thanks,
Mayank
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813197#comment-13813197 ]

Mayank Bansal commented on YARN-979:
------------------------------------
The rest of the comments have been incorporated.

Thanks,
Mayank
[jira] [Updated] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Bansal updated YARN-979:
-------------------------------
    Attachment: YARN-979-5.patch

Updating the latest patch.

Thanks,
Mayank
[jira] [Updated] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated YARN-1222:
-----------------------------------
    Attachment: yarn-1222-4.patch

Updating the patch. If HA is enabled, when any ZK operation results in KeeperException.NoAuthException, the RM is automatically transitioned to Standby state. Added a unit test to verify that fencing works.

> Make improvements in ZKRMStateStore for fencing
> ------------------------------------------------
>                 Key: YARN-1222
>                 URL: https://issues.apache.org/jira/browse/YARN-1222
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Karthik Kambatla
>         Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, yarn-1222-4.patch
>
> Using multi-operations for every ZK interaction. In every operation, automatically creating/deleting a lock znode that is the child of the root znode. This is to achieve fencing by modifying the create/delete permissions on the root znode.
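A minimal sketch of the fencing scheme described above, assuming a ZooKeeper handle zk, ACLs (zkAcl) that only the active RM may create/delete under the root znode, a fencingNodePath child of that root, and a hypothetical transitionToStandby() hook. This is illustrative, not the yarn-1222-4.patch code:

{code}
// Every state-store write is wrapped in a multi() that creates and deletes
// a fencing lock znode; a fenced (standby) RM fails with NoAuthException
// instead of writing stale data.
void fencedWrite(String appNodePath, byte[] appStateData) throws Exception {
  List<Op> ops = new ArrayList<Op>();
  ops.add(Op.create(fencingNodePath, new byte[0], zkAcl, CreateMode.PERSISTENT));
  ops.add(Op.setData(appNodePath, appStateData, -1));  // the actual store write
  ops.add(Op.delete(fencingNodePath, -1));
  try {
    zk.multi(ops);  // all-or-nothing: fails if we no longer own the root znode
  } catch (KeeperException.NoAuthException nae) {
    transitionToStandby();  // the other RM changed the ACLs: we have been fenced
  }
}
{code}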
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813205#comment-13813205 ]

Hadoop QA commented on YARN-979:
--------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12612013/YARN-979-5.patch
against trunk revision .

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2363//console

This message is automatically generated.
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813274#comment-13813274 ]

Hadoop QA commented on YARN-1222:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12612014/yarn-1222-4.patch
against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common, hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api, hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common and hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2364//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2364//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2364//console

This message is automatically generated.
[jira] [Commented] (YARN-1378) Implement a RMStateStore cleaner for deleting application/attempt info
[ https://issues.apache.org/jira/browse/YARN-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813299#comment-13813299 ]

Jian He commented on YARN-1378:
-------------------------------
Hi [~ozawa], this jira is oriented only toward periodically cleaning app/attempt data in the state store; it should not block or be blocked by them, but it may need a code-level rebase.

> Implement a RMStateStore cleaner for deleting application/attempt info
> ------------------------------------------------------------------------
>                 Key: YARN-1378
>                 URL: https://issues.apache.org/jira/browse/YARN-1378
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-1378.1.patch
>
> Now that we are storing the final state of application/attempt instead of removing application/attempt info on application/attempt completion (YARN-891), we need a separate RMStateStore cleaner for cleaning the application/attempt state.
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813301#comment-13813301 ]

Zhijie Shen commented on YARN-1121:
-----------------------------------
One typo: setDraningStop -> setDrainingStop
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813305#comment-13813305 ]

Jian He commented on YARN-1121:
-------------------------------
bq. One typo: setDraningStop -> setDrainingStop

Nice catch! Will fix it in the next patch.
[jira] [Created] (YARN-1389) Merging the ApplicationClientProtocol and ApplicationHistoryProtocol
Mayank Bansal created YARN-1389:
-----------------------------------
             Summary: Merging the ApplicationClientProtocol and ApplicationHistoryProtocol
                 Key: YARN-1389
                 URL: https://issues.apache.org/jira/browse/YARN-1389
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Mayank Bansal
            Assignee: Zhijie Shen

It seems to be expensive to maintain a big number of outstanding t-file writers. RM is likely to run out of the I/O resources. Probably we'd like to limit the number of concurrent outstanding t-file writers, and queue the writing requests.
[jira] [Assigned] (YARN-1389) Merging the ApplicationClientProtocol and ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Bansal reassigned YARN-1389:
-----------------------------------
    Assignee: Mayank Bansal (was: Zhijie Shen)
[jira] [Updated] (YARN-1389) Merging the ApplicationClientProtocol and ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Bansal updated YARN-1389:
--------------------------------
    Description: At some point we need more info in ApplicationClientProtocol which we have in ApplicationHistoryProtocol. We need to merge those.

  was: It seems to be expensive to maintain a big number of outstanding t-file writers. RM is likely to run out of the I/O resources. Probably we'd like to limit the number of concurrent outstanding t-file writers, and queue the writing requests.
[jira] [Commented] (YARN-954) [YARN-321] History Service should create the webUI and wire it to HistoryStorage
[ https://issues.apache.org/jira/browse/YARN-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813320#comment-13813320 ]

Mayank Bansal commented on YARN-954:
------------------------------------
[~devaraj.k] Hi Devaraj, some changes have been made on the YARN-321 branch recently, and we want to get this patch in ASAP. Can you please make the changes, or shall I take this up?

Thanks,
Mayank

> [YARN-321] History Service should create the webUI and wire it to HistoryStorage
> ----------------------------------------------------------------------------------
>                 Key: YARN-954
>                 URL: https://issues.apache.org/jira/browse/YARN-954
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Devaraj K
>         Attachments: YARN-954-3.patch, YARN-954-v0.patch, YARN-954-v1.patch, YARN-954-v2.patch
[jira] [Commented] (YARN-1023) [YARN-321] Webservices REST API's support for Application History
[ https://issues.apache.org/jira/browse/YARN-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813322#comment-13813322 ]

Mayank Bansal commented on YARN-1023:
-------------------------------------
[~devaraj.k] Hi Devaraj, some changes have been made on the YARN-321 branch recently, and we want to get this patch in ASAP. Can you please make the changes, or shall I take this up?

Thanks,
Mayank

> [YARN-321] Webservices REST API's support for Application History
> -------------------------------------------------------------------
>                 Key: YARN-1023
>                 URL: https://issues.apache.org/jira/browse/YARN-1023
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: YARN-321
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>         Attachments: YARN-1023-v0.patch, YARN-1023-v1.patch
[jira] [Commented] (YARN-1374) Resource Manager fails to start due to ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813337#comment-13813337 ]

Vinod Kumar Vavilapalli commented on YARN-1374:
-----------------------------------------------
I agree with both sides. But more to the last point that Karthik made: monitors are getting added to the RM directly, though that wasn't the intention. +1 for this patch as it fixes that issue. Let's file a separate ticket for the CompositeService issue.

Checking this in.

> Resource Manager fails to start due to ConcurrentModificationException
> ------------------------------------------------------------------------
>                 Key: YARN-1374
>                 URL: https://issues.apache.org/jira/browse/YARN-1374
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Devaraj K
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: yarn-1374-1.patch, yarn-1374-1.patch
>
> Resource Manager is failing to start with the below ConcurrentModificationException.
> {code:xml}
> 2013-10-30 20:22:42,371 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
> 2013-10-30 20:22:42,376 INFO org.apache.hadoop.service.AbstractService: Service ResourceManager failed in state INITED; cause: java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>     at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>     at java.util.AbstractList$Itr.next(AbstractList.java:343)
>     at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1010)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:944)
> 2013-10-30 20:22:42,378 INFO org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService: Transitioning to standby
> 2013-10-30 20:22:42,378 INFO org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService: Transitioned to standby
> 2013-10-30 20:22:42,378 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
> java.util.ConcurrentModificationException
>     at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>     at java.util.AbstractList$Itr.next(AbstractList.java:343)
>     at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1010)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:944)
> 2013-10-30 20:22:42,379 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down ResourceManager at HOST-10-18-40-24/10.18.40.24
> ************************************************************/
> {code}
[jira] [Commented] (YARN-1374) Resource Manager fails to start due to ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813346#comment-13813346 ]

Steve Loughran commented on YARN-1374:
--------------------------------------
[~bikassaha] If we clone the list before iterating, the newly added siblings won't cause problems during the init or start operations; they won't get called. But if you then add an uninited service during init, it won't get inited, and if you add an uninited or inited service during start, it won't get started.

Maybe: allow an addition, but the service you add must always be in the same state as the composite service. That way, if you do add a new service, you have to get it into the correct state before the add() call.
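To make the trade-off concrete, a sketch of the clone-before-iterate variant under discussion; the serviceList field is hypothetical and this is not the committed CompositeService code:

{code}
// Iterating over a defensive copy avoids the ConcurrentModificationException,
// at the cost described above: a service added while this loop runs is not
// inited by it, so the caller must get it into the right state itself.
protected void serviceInit(Configuration conf) throws Exception {
  List<Service> snapshot = new ArrayList<Service>(serviceList);
  for (Service service : snapshot) {
    service.init(conf);
  }
  super.serviceInit(conf);
}
{code}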
[jira] [Updated] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian He updated YARN-1121:
--------------------------
    Attachment: YARN-1121.5.patch

- Fixed the typo.
- Added a new DrainEventHandler for ignoring events while draining to stop.
- Created a new field handlerInstance for recording the earlier handler instance.
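For readers following along, an illustrative sketch of what a draining stop looks like in an async dispatcher. The field names (drainEventsOnStop, blockNewEvents, eventQueue, eventHandlingThread) are assumptions for illustration, not the YARN-1121.5.patch itself:

{code}
// On a draining stop: refuse new events, wait for the queue to empty so all
// pending store events are flushed, then shut the handler thread down.
protected void serviceStop() throws Exception {
  if (drainEventsOnStop) {
    blockNewEvents = true;           // the DrainEventHandler ignores new events
    while (!eventQueue.isEmpty()) {
      Thread.sleep(100);             // wait for pending store events to flush
    }
  }
  stopped = true;
  eventHandlingThread.interrupt();
  eventHandlingThread.join();
  super.serviceStop();
}
{code}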
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813409#comment-13813409 ]

Hadoop QA commented on YARN-1121:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12612047/YARN-1121.5.patch
against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common and hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2365//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2365//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2365//console

This message is automatically generated.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813463#comment-13813463 ]

Sandy Ryza commented on YARN-445:
---------------------------------
In 0.21, when a task was going to be killed due to a timeout, a SIGQUIT would be sent to it to dump its stacks to standard out (MAPREDUCE-1119). This was a useful feature that I'm currently working on backporting to branch-1 in MAPREDUCE-5592. It would be good to make sure that whatever we do here can accommodate something similar.

> Ability to signal containers
> -----------------------------
>                 Key: YARN-445
>                 URL: https://issues.apache.org/jira/browse/YARN-445
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Jason Lowe
>            Assignee: Andrey Klochkov
>         Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445.patch
>
> It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc.
> For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes.
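As a concrete illustration of the jstack-on-timeout idea, a hedged sketch of SIGQUIT-before-kill; the pid, the grace period, and the helper name are assumptions, and this is not the MAPREDUCE-1119 code:

{code}
// Ask the task JVM to dump its thread stacks (SIGQUIT makes a JVM write them
// to stdout), give it a moment to finish writing, then proceed with the kill.
void dumpStacksThenKill(String pid, long gracePeriodMs) throws Exception {
  Runtime.getRuntime().exec(new String[] {"kill", "-QUIT", pid}).waitFor();
  Thread.sleep(gracePeriodMs);
  Runtime.getRuntime().exec(new String[] {"kill", "-TERM", pid}).waitFor();
}
{code}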
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813464#comment-13813464 ]

Sandy Ryza commented on YARN-445:
---------------------------------
To expand on that, it would be nice for SIGQUIT-then-SIGTERM-then-SIGKILL not to require multiple RPCs.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813468#comment-13813468 ] Jason Lowe commented on YARN-445: - However, it would also be nice not to always tie SIGQUIT to SIGTERM/SIGKILL. I'd love to give users the ability to diagnose tasks by themselves without killing them in the process. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However, that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
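To make the shape of the API under discussion concrete, here is a minimal sketch of a signal request that is decoupled from stopping the container; all names below are hypothetical illustrations, not the interface actually added for YARN-445:

{code}
// Hypothetical sketch only: illustrative types, not the YARN-445 patch.
public final class SignalContainerSketch {

  /** Signals an AM might want delivered to a container process. */
  enum Signal { QUIT, TERM, KILL, USR1, USR2 }

  /** Names the target container and the exact signal, with no implied kill. */
  static final class SignalContainerRequest {
    private final String containerId;
    private final Signal signal;

    SignalContainerRequest(String containerId, Signal signal) {
      this.containerId = containerId;
      this.signal = signal;
    }

    String getContainerId() { return containerId; }
    Signal getSignal() { return signal; }
  }

  public static void main(String[] args) {
    // SIGQUIT alone: dump stacks to stdout without killing the task,
    // which covers the diagnose-without-kill case discussed above.
    SignalContainerRequest jstack = new SignalContainerRequest(
        "container_1385000000000_0001_01_000002", Signal.QUIT);
    System.out.println(jstack.getContainerId() + " <- " + jstack.getSignal());
  }
}
{code}

A QUIT-then-TERM-then-KILL sequence could then be expressed as an ordered list of such signals in a single RPC, which would also address the concern about multiple round trips.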
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813470#comment-13813470 ] Sandy Ryza commented on YARN-445: - Very true. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However, that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813471#comment-13813471 ] Sandy Ryza commented on YARN-445: - Oops, I didn't realize that feature was the original motivator for this JIRA. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However, that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1266) Adding ApplicationHistoryProtocolPBService to make web apps work
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1266: Attachment: YARN-1266-2.patch Cleaning up the patch and moving the rest of the stuff to the corresponding JIRAs. Thanks, Mayank Adding ApplicationHistoryProtocolPBService to make web apps work --- Key: YARN-1266 URL: https://issues.apache.org/jira/browse/YARN-1266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1266-1.patch, YARN-1266-2.patch Adding ApplicationHistoryProtocolPBService to make web apps work and changing yarn to run the AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813500#comment-13813500 ] Jian He commented on YARN-1210: --- - Instead of passing running containers as a parameter in RegisterNodeManagerRequest, is it possible to just call heartBeat immediately after registerCall and then unBlockNewContainerRequests? That way we can take advantage of the existing heartbeat logic and cover other things, like keeping the app alive for log aggregation after the AM container completes. -- Or at least we can send the list of ContainerStatus (including diagnostics) instead of just container Ids, and also the list of keep-alive apps (separate jira)? - Unnecessary import changes in DefaultContainerExecutor.java and LinuxContainerExecutor, ContainerLaunch, ContainersLauncher - Finished containers may not necessarily be killed. The containers can also finish normally and remain in the NM cache before NM resync. {code} RMAppAttemptContainerFinishedEvent evt = new RMAppAttemptContainerFinishedEvent(appAttemptId, ContainerStatus.newInstance(cId, ContainerState.COMPLETE, "Killed due to RM restart", ExitCode.FORCE_KILLED.getExitCode())); {code} - wrong LOG class name. {code} private static final Log LOG = LogFactory.getLog(RMAppImpl.class); {code} - Isn't it always the case that after this patch only the last attempt can be running? A new attempt will not be launched until the previous attempt reports back that it really exits. If this is the case, it can be a bug. We may only need to check whether the last attempt is finished or not. {code} // check if any application attempt was running // if yes then don't start new application attempt. for (Entry<ApplicationAttemptId, RMAppAttempt> attempt : app.attempts.entrySet()) { boolean appAttemptInFinalState = RMAppAttemptImpl.isAttemptInFinalState(attempt.getValue()); LOG.info("attempt : " + attempt.getKey().toString() + " in final state : " + appAttemptInFinalState); if (!appAttemptInFinalState) { // One of the application attempts is not in a final state. // Not starting new application attempt. return RMAppState.RUNNING; } } {code} - should we return RUNNING or ACCEPTED for apps that are not in a final state? It's ok to return RUNNING in the scope of this patch because anyway we are launching a new attempt. Later on, in work-preserving restart, the RM can crash before the attempt registers; the attempt can register with the RM after the RM comes back, in which case we can then move the app from ACCEPTED to RUNNING? During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help with issues in downstream components like Pig, Hive and Oozie during RM restart. In the meanwhile, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
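If the review's assumption holds - after this patch only the most recent attempt can still be running - the loop above collapses to a single check on the current attempt. A rough sketch of that simplification, using stand-in types rather than the real RMAppImpl code:

{code}
import java.util.EnumSet;

// Stand-in sketch of the reviewer's suggestion; the actual
// RMAppImpl/RMAppAttempt classes are more involved.
class LastAttemptCheckSketch {
  enum AttemptState { NEW, RUNNING, FINISHED, FAILED, KILLED }

  private static final EnumSet<AttemptState> FINAL_STATES =
      EnumSet.of(AttemptState.FINISHED, AttemptState.FAILED, AttemptState.KILLED);

  /** If only the last attempt can be running, checking it alone suffices. */
  static boolean canStartNewAttempt(AttemptState lastAttemptState) {
    return FINAL_STATES.contains(lastAttemptState);
  }

  public static void main(String[] args) {
    System.out.println(canStartNewAttempt(AttemptState.RUNNING));  // false
    System.out.println(canStartNewAttempt(AttemptState.FINISHED)); // true
  }
}
{code}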
[jira] [Commented] (YARN-1266) Adding ApplicationHistoryProtocolPBService to make web apps work
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813505#comment-13813505 ] Hadoop QA commented on YARN-1266: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612074/YARN-1266-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2366//console This message is automatically generated. Adding ApplicationHistoryProtocolPBService to make web apps work --- Key: YARN-1266 URL: https://issues.apache.org/jira/browse/YARN-1266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1266-1.patch, YARN-1266-2.patch Adding ApplicationHistoryProtocolPBService to make web apps work and changing yarn to run the AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1323) Set HTTPS webapp address along with other RPC addresses
[ https://issues.apache.org/jira/browse/YARN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813529#comment-13813529 ] Sandy Ryza commented on YARN-1323: -- +1 Set HTTPS webapp address along with other RPC addresses --- Key: YARN-1323 URL: https://issues.apache.org/jira/browse/YARN-1323 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Labels: ha Attachments: yarn-1323-1.patch YARN-1232 adds the ability to configure multiple RMs, but missed out the https web app address. Need to add that in. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1388) fair share do not display info in the scheduler page
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813540#comment-13813540 ] Liyin Liang commented on YARN-1388: --- Because it is a small UI change, the patch didn't add new tests. Manual steps to verify this patch: 1. Configure RM to use FairScheduler 2. Go to the scheduler page in RM 3. Click any queue to display the detailed info 4. Without this patch, the fair share entry does not display info 5. With this patch, the fair share entry shows memory and vcore info fair share do not display info in the scheduler page Key: YARN-1388 URL: https://issues.apache.org/jira/browse/YARN-1388 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.1 Reporter: Liyin Liang Attachments: yarn-1388.diff YARN-1044 fixed the min/max/used resource display problem in the scheduler page. But the Fair Share has the same problem and needs to be fixed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1323) Set HTTPS webapp address along with other RPC addresses in HAUtil
[ https://issues.apache.org/jira/browse/YARN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1323: - Summary: Set HTTPS webapp address along with other RPC addresses in HAUtil (was: Set HTTPS webapp address along with other RPC addresses) Set HTTPS webapp address along with other RPC addresses in HAUtil - Key: YARN-1323 URL: https://issues.apache.org/jira/browse/YARN-1323 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Labels: ha Attachments: yarn-1323-1.patch YARN-1232 adds the ability to configure multiple RMs, but missed out the https web app address. Need to add that in. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813547#comment-13813547 ] Junping Du commented on YARN-311: - Sure. Will update the patch soon. Thx! Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v10.patch, YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, YARN-311-v9.patch As the first step, we go for resource change on the RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. This jira will only contain changes in the scheduler. The flow to update a node's resource and awareness in resource scheduling is: 1. Resource update is through the admin API to the RM and takes effect on RMNodeImpl. 2. When the next NM heartbeat for updating status comes, the RMNode's resource change will be noticed and the delta resource is added to the schedulerNode's availableResource before actual scheduling happens. 3. The scheduler does resource allocation according to the new availableResource in SchedulerNode. For more design details, please refer to the proposal and discussions in the parent JIRA: YARN-291. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813560#comment-13813560 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~vinodkv] for the review... bq. Does this patch also include YARN-1210? Seems like it, we should separate that code. No.. anything specific? YARN-1210 is more about waiting for the older AM to finish before launching a new AM. bq. Depending on the final patch, I think we should split RMAppManager.submitApp into two, one for regular submit and one for submit after recovery. Splitting the method into 2. * submitApplication - normal application submission * submitRecoveredApplication - submitting a recovered application bq. RMAppState.java change is unnecessary. Fixed. bq. ForwardingEventHandler is a bottleneck for renewals now - especially during submission. We need to have a thread pool. Created a fixed thread pool service with the thread count controllable via configuration (not adding this to yarn-default). Keeping the default thread count at 5. Fair enough? bq. Once we do the above, the old concurrency test should be added back. Yeah.. added that test back. bq. We are undoing most of YARN-1107. Good that we laid the groundwork there. Let's make sure we remove all the dead code. One comment stands out Did I miss anything here? I didn't understand. The comment I have not removed, as it is still valid. bq. The newly added test can have race conditions? We may be lucky in the test, but in a real life scenario, the client has to submit the app and poll for app failure due to invalid tokens I think it will not. For clients, yes, after they submit the application they will have to keep polling to know the status of the application (got accepted or failed due to token renewal). bq. Similarly we should add a test for successful submission after renewal. Sure, added one.. checking for RMAppEvent.START Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
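For context on the thread-pool change described above, here is a minimal sketch of a fixed-size renewal pool; the default of 5 comes from the comment, but the names are illustrative assumptions, not the ones in the actual patch:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a configurable fixed thread pool for token
// renewals; class and method names are assumptions, not the patch's API.
class RenewerPoolSketch {
  static final int DEFAULT_THREAD_COUNT = 5;

  private final ExecutorService pool;

  RenewerPoolSketch(int configuredThreadCount) {
    // Renewals block on RPC to the NameNode, so multiple workers keep one
    // slow or down NameNode from stalling all application submissions.
    int size = configuredThreadCount > 0 ? configuredThreadCount : DEFAULT_THREAD_COUNT;
    pool = Executors.newFixedThreadPool(size);
  }

  void submitRenewal(Runnable renewal) {
    pool.execute(renewal);
  }

  void stop() throws InterruptedException {
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
  }
}
{code}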
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.5.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1279) Expose a client API to allow clients to figure if log aggregation is complete
[ https://issues.apache.org/jira/browse/YARN-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813562#comment-13813562 ] Jian He commented on YARN-1279: --- - LogAggregationState: DISABLE -> DISABLED, NOT_START -> NOT_STARTED - Log aggregation is an NM side config; this is getting it from the RM itself. {code} if (!conf.getBoolean(YarnConfiguration.LOG_AGGREGATION_ENABLED, YarnConfiguration.DEFAULT_LOG_AGGREGATION_ENABLED)) { return LogAggregationState.DISABLE; } {code} - LogAggregationStatus may come via heartbeat before FinalTransition is called, inside which containerLogAggregationStatus is initialized with the containers. In this case, the log status is lost. {code} public void updateLogAggregationStatus(ContainerLogAggregationStatus status) { this.writeLock.lock(); try { if (containerLogAggregationStatus.containsKey(status.getContainerId())) { LogAggregationState currentState = containerLogAggregationStatus.get(status.getContainerId()); if (currentState != LogAggregationState.COMPLETED && currentState != LogAggregationState.FAILED) { if (status.getLogAggregationState() == LogAggregationState.COMPLETED) { LogAggregationCompleted.getAndAdd(1); } else if (status.getLogAggregationState() == LogAggregationState.FAILED) { LogAggregationFailed.getAndAdd(1); } containerLogAggregationStatus.put(status.getContainerId(), status.getLogAggregationState()); } } } finally { this.writeLock.unlock(); } } {code} Expose a client API to allow clients to figure if log aggregation is complete - Key: YARN-1279 URL: https://issues.apache.org/jira/browse/YARN-1279 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Xuan Gong Attachments: YARN-1279.1.patch, YARN-1279.2.patch, YARN-1279.2.patch, YARN-1279.3.patch, YARN-1279.3.patch, YARN-1279.4.patch, YARN-1279.4.patch Expose a client API to allow clients to figure if log aggregation is complete -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1323) Set HTTPS webapp address along with other RPC addresses in HAUtil
[ https://issues.apache.org/jira/browse/YARN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813564#comment-13813564 ] Hudson commented on YARN-1323: -- SUCCESS: Integrated in Hadoop-trunk-Commit #4692 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4692/]) YARN-1323. Set HTTPS webapp address along with other RPC addresses in HAUtil (Karthik Kambatla via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538851) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/HAUtil.java Set HTTPS webapp address along with other RPC addresses in HAUtil - Key: YARN-1323 URL: https://issues.apache.org/jira/browse/YARN-1323 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Labels: ha Fix For: 2.3.0 Attachments: yarn-1323-1.patch YARN-1232 adds the ability to configure multiple RMs, but missed out the https web app address. Need to add that in. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1388) Fair Scheduler page always displays blank fair share
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1388: - Summary: Fair Scheduler page always displays blank fair share (was: fair share do not display info in the scheduler page) Fair Scheduler page always displays blank fair share Key: YARN-1388 URL: https://issues.apache.org/jira/browse/YARN-1388 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.1 Reporter: Liyin Liang Attachments: yarn-1388.diff YARN-1044 fixed the min/max/used resource display problem in the scheduler page. But the Fair Share has the same problem and needs to be fixed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1388) Fair Scheduler page always displays blank fair share
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813566#comment-13813566 ] Sandy Ryza commented on YARN-1388: -- I just committed this. Thanks [~liangly]! Fair Scheduler page always displays blank fair share Key: YARN-1388 URL: https://issues.apache.org/jira/browse/YARN-1388 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.1 Reporter: Liyin Liang Assignee: Liyin Liang Fix For: 2.2.1 Attachments: yarn-1388.diff YARN-1044 fixed the min/max/used resource display problem in the scheduler page. But the Fair Share has the same problem and needs to be fixed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-311: Attachment: YARN-311-v13.patch Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v10.patch, YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, YARN-311-v13.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, YARN-311-v9.patch As the first step, we go for resource change on the RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. This jira will only contain changes in the scheduler. The flow to update a node's resource and awareness in resource scheduling is: 1. Resource update is through the admin API to the RM and takes effect on RMNodeImpl. 2. When the next NM heartbeat for updating status comes, the RMNode's resource change will be noticed and the delta resource is added to the schedulerNode's availableResource before actual scheduling happens. 3. The scheduler does resource allocation according to the new availableResource in SchedulerNode. For more design details, please refer to the proposal and discussions in the parent JIRA: YARN-291. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813574#comment-13813574 ] Junping Du commented on YARN-311: - Updated in the v13 patch. Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v10.patch, YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, YARN-311-v13.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, YARN-311-v9.patch As the first step, we go for resource change on the RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. This jira will only contain changes in the scheduler. The flow to update a node's resource and awareness in resource scheduling is: 1. Resource update is through the admin API to the RM and takes effect on RMNodeImpl. 2. When the next NM heartbeat for updating status comes, the RMNode's resource change will be noticed and the delta resource is added to the schedulerNode's availableResource before actual scheduling happens. 3. The scheduler does resource allocation according to the new availableResource in SchedulerNode. For more design details, please refer to the proposal and discussions in the parent JIRA: YARN-291. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813580#comment-13813580 ] Bikas Saha commented on YARN-1197: -- Wangda, sorry for the delayed response. Was caught up with other work. I will take a look at the new proposal. [~vinodkv] Can you please take a look at the latest proposal? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Assignee: Wangda Tan Attachments: yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197.pdf Currently, YARN cannot support merging several containers on one node into a big container, which would let us incrementally ask for resources, merge them into a bigger one, and launch our processes. The user scenario is described in the comments. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1388) Fair Scheduler page always displays blank fair share
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813581#comment-13813581 ] Hudson commented on YARN-1388: -- SUCCESS: Integrated in Hadoop-trunk-Commit #4693 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4693/]) YARN-1388. Fair Scheduler page always displays blank fair share (Liyin Liang via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538855) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/FairSchedulerPage.java Fair Scheduler page always displays blank fair share Key: YARN-1388 URL: https://issues.apache.org/jira/browse/YARN-1388 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.1 Reporter: Liyin Liang Assignee: Liyin Liang Fix For: 2.2.1 Attachments: yarn-1388.diff YARN-1044 fixed the min/max/used resource display problem in the scheduler page. But the Fair Share has the same problem and needs to be fixed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813638#comment-13813638 ] Hadoop QA commented on YARN-674: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612089/YARN-674.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2368//console This message is automatically generated. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813661#comment-13813661 ] Hadoop QA commented on YARN-311: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612092/YARN-311-v13.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2367//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2367//console This message is automatically generated. Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v10.patch, YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, YARN-311-v13.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, YARN-311-v9.patch As the first step, we go for resource change on the RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. This jira will only contain changes in the scheduler. The flow to update a node's resource and awareness in resource scheduling is: 1. Resource update is through the admin API to the RM and takes effect on RMNodeImpl. 2. When the next NM heartbeat for updating status comes, the RMNode's resource change will be noticed and the delta resource is added to the schedulerNode's availableResource before actual scheduling happens. 3. The scheduler does resource allocation according to the new availableResource in SchedulerNode. For more design details, please refer to the proposal and discussions in the parent JIRA: YARN-291. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813672#comment-13813672 ] Bikas Saha commented on YARN-674: - We were intentionally going through the same submitApplication() method to make sure that all the initialization and setup code paths are consistently followed in both cases by keeping the code path identical as much as possible. The RM would submit a recovered application, in essence proxying a user submitting the application. It's a general pattern followed through the recovery logic - to be minimally invasive to the mainline code path so that we can avoid functional bugs as much as possible. Separating them into 2 methods has resulted in code duplication in both methods without any huge benefit that I can see. It also leaves us susceptible to future code changes being made in one code path and not the other. Why is isSecurityEnabled() being checked at this internal level? The code should not even reach this point if security is not enabled. It should already be taken care of in the public APIs, right? Also, why is it calling rmContext.getDelegationTokenRenewer().addApplication(event) instead of DelegationTokenRenewer.this.addApplication()? Same for rmContext.getDelegationTokenRenewer().applicationFinished(evt); {code} @SuppressWarnings("unchecked") +private void handleDTRenewerEvent( +DelegationTokenRenewerAppSubmitEvent event) { + try { +// Setup tokens for renewal +if (UserGroupInformation.isSecurityEnabled()) { + rmContext.getDelegationTokenRenewer().addApplication(event); + rmContext.getDispatcher().getEventHandler() + .handle(new RMAppEvent(event.getAppicationId(), + event.isApplicationRecovered() ? RMAppEventType.RECOVER + : RMAppEventType.START)); +} + } catch (Throwable t) { {code} These assumptions may make the code brittle to future changes. Also, typo in the comments. We should probably assert that the application state is NEW over here so that the broken assumption is caught at the source instead of at the destination app, causing a state machine crash. {code} +"Unable to add the application to the delegation token renewer.", +t); +// Sending APP_REJECTED is fine, since we assume that the +// RMApp is in NEW state and thus we havne't yet informed the +// Scheduler about the existence of the application +rmContext.getDispatcher().getEventHandler().handle( +new RMAppRejectedEvent(event.getAppicationId(), t.getMessage())); + } {code} typo {code} public ApplicationId getAppicationId() { {code} @Private + @VisibleForTesting??? {code} + //Only for Testing + public int getInProcessDelegationTokenRenewerEventsCount() { +return this.renewerCount.get(); + } {code} Can DelegationTokenRenewerAppSubmitEvent event objects have an event type different from VERIFY_AND_START_APPLICATION? If not, we don't need this check and we can change the constructor of DelegationTokenRenewerAppSubmitEvent to not expect an event type argument. It should set VERIFY_AND_START_APPLICATION within the constructor. {code} + if (evt.getType().equals( + DelegationTokenRenewerEventType.VERIFY_AND_START_APPLICATION) + && evt instanceof DelegationTokenRenewerAppSubmitEvent) { {code} Rename DelegationTokenRenewerThread to not have the misleading Thread in the name? Why is this warning not happening for other services? What's special in the code for DelegationTokenRenewer?
{code} + <!-- Ignore Synchronization issues as they are never going to occur for + methods like serviceInit(), serviceStart() and handle() --> + <Match> + <Class name="org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer" /> + <Bug pattern="IS2_INCONSISTENT_SYNC" /> + </Match> {code} Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
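The constructor change suggested above - fixing the event type inside the subclass so callers cannot pass a different one - might look like the following sketch, with simplified stand-in classes rather than the real DelegationTokenRenewer events:

{code}
// Simplified sketch of the suggested constructor change; the real event
// classes carry application ids, credentials, and more state than this.
class RenewerEventSketch {
  enum DelegationTokenRenewerEventType { VERIFY_AND_START_APPLICATION, FINISH_APPLICATION }

  static class DelegationTokenRenewerEvent {
    private final DelegationTokenRenewerEventType type;
    DelegationTokenRenewerEvent(DelegationTokenRenewerEventType type) { this.type = type; }
    DelegationTokenRenewerEventType getType() { return type; }
  }

  static class DelegationTokenRenewerAppSubmitEvent extends DelegationTokenRenewerEvent {
    private final String applicationId;
    // The type is fixed here, so the instanceof check alone is enough and
    // no caller can construct this event with the wrong event type.
    DelegationTokenRenewerAppSubmitEvent(String applicationId) {
      super(DelegationTokenRenewerEventType.VERIFY_AND_START_APPLICATION);
      this.applicationId = applicationId;
    }
    String getApplicationId() { return applicationId; }
  }
}
{code}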
[jira] [Commented] (YARN-1279) Expose a client API to allow clients to figure if log aggregation is complete
[ https://issues.apache.org/jira/browse/YARN-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813678#comment-13813678 ] Xuan Gong commented on YARN-1279: - bq. LogAggregationState: DISABLE -> DISABLED, NOT_START -> NOT_STARTED Changed. bq. Log aggregation is NM side config, this is getting from RM itself. Yes, you are right. Removed. Will rely on the containerLogAggregationState. bq. LogAggregationStatus may come via heartbeat before FinalTransition is called, inside which containerLogAggregationStatus is initialized with the containers. In this case, the log status is lost. Removed the initialization in FinalTransition. Only getting the number of finished containers at the FinalTransition state. Expose a client API to allow clients to figure if log aggregation is complete - Key: YARN-1279 URL: https://issues.apache.org/jira/browse/YARN-1279 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Xuan Gong Attachments: YARN-1279.1.patch, YARN-1279.2.patch, YARN-1279.2.patch, YARN-1279.3.patch, YARN-1279.3.patch, YARN-1279.4.patch, YARN-1279.4.patch Expose a client API to allow clients to figure if log aggregation is complete -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1279) Expose a client API to allow clients to figure if log aggregation is complete
[ https://issues.apache.org/jira/browse/YARN-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1279: Attachment: YARN-1279.5.patch Expose a client API to allow clients to figure if log aggregation is complete - Key: YARN-1279 URL: https://issues.apache.org/jira/browse/YARN-1279 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Xuan Gong Attachments: YARN-1279.1.patch, YARN-1279.2.patch, YARN-1279.2.patch, YARN-1279.3.patch, YARN-1279.3.patch, YARN-1279.4.patch, YARN-1279.4.patch, YARN-1279.5.patch Expose a client API to allow clients to figure if log aggregation is complete -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n7.patch Uploading a patch rebased after YARN-891 and with fixes according to Jason's comments. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, YARN-261--n7.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813685#comment-13813685 ] Bikas Saha commented on YARN-1121: -- There are 3 new booleans with 8 combinations possible between them. Which combinations are legal? Which are impossible? Some comments will help understand their interaction. Naming could be better, e.g. drainEventsOnStop instead of drainingStopNeeded and drainOnStop instead of drainingStop. {code} + private volatile boolean drained = true; + private volatile boolean drainingStopNeeded = false; + private volatile boolean drainingStop = false; {code} Typo {code} + LOG.info("Ignoring events as AsyncDispatcher is draning to stop."); {code} Isn't this almost a tight loop? Given that storing stuff will be over the network and slow, why not have a wait/notify between this thread and the draining thread? DrainEventHandler sounds misleading. It doesn't really drain; it ignores or drops events. The other thing we can do is take a count of the number of pending events to drain at service stop. Then make sure we drain only those many, thus ignoring the new ones. This removes the need for drainingStop and reduces the combinatorics of the booleans. RMStateStore should flush all pending store events before closing - Key: YARN-1121 URL: https://issues.apache.org/jira/browse/YARN-1121 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Jian He Fix For: 2.2.1 Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch on serviceStop it should wait for all internal pending events to drain before stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813686#comment-13813686 ] Bikas Saha commented on YARN-1121: -- Code for the comment above - Isn't this almost a tight loop? Given that storing stuff will be over the network and slow, why not have a wait/notify between this thread and the draining thread? {code} protected void serviceStop() throws Exception { +if (drainingStopNeeded) { + drainingStop = true; + while(!drained) { +Thread.yield(); + } {code} RMStateStore should flush all pending store events before closing - Key: YARN-1121 URL: https://issues.apache.org/jira/browse/YARN-1121 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Jian He Fix For: 2.2.1 Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch on serviceStop it should wait for all internal pending events to drain before stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
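A wait/notify version of that stop path, sketched with illustrative names rather than the actual AsyncDispatcher fields, would let the stopping thread sleep instead of spinning:

{code}
// Illustrative sketch of replacing the Thread.yield() busy loop with
// wait/notify; names do not match the actual AsyncDispatcher code.
class DrainSketch {
  private final Object drainLock = new Object();
  private boolean drained = false;

  /** Called by the dispatcher thread once the event queue is empty. */
  void markDrained() {
    synchronized (drainLock) {
      drained = true;
      drainLock.notifyAll();
    }
  }

  /** Called from serviceStop(): block until pending events are flushed. */
  void awaitDrain() throws InterruptedException {
    synchronized (drainLock) {
      while (!drained) {
        drainLock.wait(); // sleeps instead of spinning
      }
    }
  }
}
{code}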
[jira] [Commented] (YARN-1307) Rethink znode structure for RM HA
[ https://issues.apache.org/jira/browse/YARN-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813689#comment-13813689 ] Bikas Saha commented on YARN-1307: -- This probably needs major rebasing after the recent changes to the state store APIs that retain completed applications instead of deleting them. Rethink znode structure for RM HA - Key: YARN-1307 URL: https://issues.apache.org/jira/browse/YARN-1307 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1307.1.patch, YARN-1307.2.patch, YARN-1307.3.patch Rethinking the znode structure for RM HA is proposed in some JIRAs (YARN-659, YARN-1222). The motivation of this JIRA is quoted from Bikas' comment in YARN-1222: {quote} We should move to creating a node hierarchy for apps such that all znodes for an app are stored under an app znode instead of the app root znode. This will help in removeApplication and also in scaling better on ZK. The earlier code was written this way to ensure create/delete happens under a root znode for fencing. But given that we have moved to multi-operations globally, this isn't required anymore. {quote} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813702#comment-13813702 ] Zhijie Shen commented on YARN-979: -- I still have one question w.r.t. the annotations of the getters/setters of GetRequest/Response. Some of them are marked as \@Stable, and some are marked as \@Unstable. In addition, some setters are marked as \@Private, and some are marked as \@Public. Do you have a special consideration here? Maybe we should mark all of them as \@Unstable for the initial AHS? bq. I will create the jira for making applicationclientprotocol similar to applicationHistoryProtocol Thanks for filing the ticket. Ideally, we'd like to have the paired ApplicationClientProtocol and ApplicationHistoryProtocol. Then YarnClient can query running applications/attempts/containers from ApplicationClientProtocol and finished ones from ApplicationHistoryProtocol, making it transparent to users. [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol -- Key: YARN-979 URL: https://issues.apache.org/jira/browse/YARN-979 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-979-1.patch, YARN-979-3.patch, YARN-979-4.patch, YARN-979-5.patch, YARN-979.2.patch ApplicationHistoryProtocol should have the following APIs as well: * getApplicationAttemptReport * getApplicationAttempts * getContainerReport * getContainers The corresponding request and response classes need to be added as well. -- This message was sent by Atlassian JIRA (v6.1#6144)
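For reference, the audience/stability markers under discussion are Hadoop's standard classification annotations; a sketch of the convention on a hypothetical record class (not the actual YARN-979 code) could look like:

{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Unstable;

// Hypothetical record, only to illustrate the annotation convention;
// marking everything @Unstable leaves room to evolve the initial AHS API.
@Public
@Unstable
public abstract class GetContainerReportRequestSketch {

  @Public
  @Unstable
  public abstract String getContainerId();

  // Setters are typically @Private: the framework builds the PB record,
  // while end users only read it.
  @Private
  @Unstable
  public abstract void setContainerId(String containerId);
}
{code}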
[jira] [Commented] (YARN-1279) Expose a client API to allow clients to figure if log aggregation is complete
[ https://issues.apache.org/jira/browse/YARN-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813706#comment-13813706 ] Hadoop QA commented on YARN-1279: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612118/YARN-1279.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2370//console This message is automatically generated. Expose a client API to allow clients to figure if log aggregation is complete - Key: YARN-1279 URL: https://issues.apache.org/jira/browse/YARN-1279 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Xuan Gong Attachments: YARN-1279.1.patch, YARN-1279.2.patch, YARN-1279.2.patch, YARN-1279.3.patch, YARN-1279.3.patch, YARN-1279.4.patch, YARN-1279.4.patch, YARN-1279.5.patch Expose a client API to allow clients to figure if log aggregation is complete -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analog APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1389: -- Description: As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. was: At some point we need more infor in applicationClientProtocol which we have in ApplicationHistoryProtocol. We need to merge those. Summary: ApplicationClientProtocol and ApplicationHistoryProtocol should expose analog APIs (was: Merging the ApplicationClientProtocol and ApplicationHistoryProtocol) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analog APIs -- Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813712#comment-13813712 ] Mayank Bansal commented on YARN-955: Thanks [~zjshen] for the review. bq. 1. Is there any special reason to rename ASHService to ApplicationHistoryClientService? It's a more verbose name, and the same as other classes. bq. 2. Inner ApplicationHSClientProtocolHandler is not necessary. ApplicationHistoryClientService can directly implement ApplicationHistoryProtocol, which is what ASHService did before. I used the same design pattern used in the Job History server. Moreover, it's a cleaner design than having the service derive from everything. Secondly, you can have multiple protocol implementations. bq. 3. Incorrect log below: Done. bq. 4. We should use the newInstance method from the record class for GetApplicationAttemptReportResponse and all the other records. Done. bq. 5. Some methods missed @Override, for example They are not overridden methods; those are helper functions. bq. 6. The two methods below are not implemented, but we can do it separately, because we need to implement a DelegationTokenSecretManager first. Those will be implemented once we implement security. bq. 7. Did you miss ApplicationHistoryContext in the patch or is it included in the patch of another jira? HistoryContext is part of YARN-987. bq. 8. Why does the method below have the default access control? It is used in a test. bq. 9. In RM and NM, we usually add a protected create() method for a sub service, such that we can override it and change to another implementation. It is convenient when we want to mock some part of AHS when drafting the test cases. Done. bq. 10. Shall we have the test cases for the ApplicationHistoryProtocol implementation? Done. [YARN-321] Implementation of ApplicationHistoryProtocol --- Key: YARN-955 URL: https://issues.apache.org/jira/browse/YARN-955 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-955-1.patch, YARN-955-2.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-955: --- Attachment: YARN-955-2.patch Adding the latest patch. Thanks, Mayank [YARN-321] Implementation of ApplicationHistoryProtocol --- Key: YARN-955 URL: https://issues.apache.org/jira/browse/YARN-955 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-955-1.patch, YARN-955-2.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813716#comment-13813716 ] Bikas Saha commented on YARN-1222: -- @Private? {code}+ public static String getConfValueForRMInstance(String prefix,{code} If the RM is the one creating the root znode then how can someone else's ACLs be present on that znode? i.e. how can the ACLs on the root znode have any other entries? My concern is that we are only adding new ACLs every time we fail over but never deleting them. Is it possible that we end up creating too many ACLs for the root znode and hit ZK issues? {code} +Id rmId = new Id(zkRootNodeAuthScheme, +DigestAuthenticationProvider.generateDigest( +zkRootNodeUsername + ":" + zkRootNodePassword)); +zkRootNodeAcl.add(new ACL(CREATE_DELETE_PERMS, rmId)); +return zkRootNodeAcl; {code} For both of the above, can we use well-known prefixes for the root znode ACLs (rm-admin-acl and rm-cd-acl)? When fencing we don't touch the rm-admin-acl but remove all rm-cd-acl's. We then add a new rm-cd-acl for ourselves; we don't touch any other ACL. Where is the shared rm-admin-acl being set such that both RMs have admin access to the root znode? How is the following case going to work? How can the root node ACL be set in the conf? Upon becoming active, we have to remove the old RM's cd-acl and set our cd-acl. That cannot be statically set in conf, right? {code} if (HAUtil.isHAEnabled(conf)) { + String zkRootNodeAclConf = HAUtil.getConfValueForRMInstance + (YarnConfiguration.ZK_RM_STATE_STORE_ROOT_NODE_ACL, conf); + if (zkRootNodeAclConf != null) { +zkRootNodeAclConf = ZKUtil.resolveConfIndirection(zkRootNodeAclConf); +try { + zkRootNodeAcl = ZKUtil.parseACLs(zkRootNodeAclConf); +} catch (ZKUtil.BadAclFormatException bafe) { + LOG.error("Invalid format for " + + YarnConfiguration.ZK_RM_STATE_STORE_ROOT_NODE_ACL); + throw bafe; +} + } {code} The test should probably create separate copies of conf for the 2 RMs. Won't we get an exception/error from this? {code}+ rmService.submitApplication(SubmitApplicationRequest.newInstance(asc)); {code} Let's put a comment saying: triggering a state store operation that makes rm1 realize that it's not the master because it got fenced by the store. This and other similar places need an @Private {code}+ @VisibleForTesting + public void createWithRetries({code} Can you please specify in comments which operations are exempt from multi-operation? Looks like only write operations go through multi, the exceptions being initial znode creation and fence-on-active. Right? Can we move this logic into the common RMStateStore and notify it about HA state loss via a standard HA exception? Will the null return make the state store crash? {code} +} catch (KeeperException.NoAuthException nae) { + if (HAUtil.isHAEnabled(getConfig())) { +// Transition to standby +RMHAServiceTarget target = new RMHAServiceTarget( +(YarnConfiguration)getConfig()); +target.getProxy(getConfig(), 1000).transitionToStandby( +new HAServiceProtocol.StateChangeRequestInfo( +HAServiceProtocol.RequestSource.REQUEST_BY_USER_FORCED)); +return null; + } {code} Make improvements in ZKRMStateStore for fencing --- Key: YARN-1222 URL: https://issues.apache.org/jira/browse/YARN-1222 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, yarn-1222-4.patch Using multi-operations for every ZK interaction.
In every operation, automatically creating/deleting a lock znode that is the child of the root znode. This is to achieve fencing by modifying the create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
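To illustrate the fencing scheme being reviewed - each active RM claiming exclusive create/delete on the root znode via its own digest ACL - here is a minimal sketch using the standard ZooKeeper client API; the username/password are placeholders, not the RM's actual configuration:

{code}
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

import org.apache.zookeeper.ZooDefs.Perms;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;
import org.apache.zookeeper.server.auth.DigestAuthenticationProvider;

// Minimal sketch of a digest-based fencing ACL; credentials are placeholders.
class FencingAclSketch {
  private static final int CREATE_DELETE_PERMS = Perms.CREATE | Perms.DELETE;

  /** ACL granting this RM exclusive create/delete on the root znode. */
  static List<ACL> rootNodeFencingAcl(String rmUsername, String rmPassword)
      throws NoSuchAlgorithmException {
    List<ACL> acls = new ArrayList<ACL>();
    Id rmId = new Id("digest",
        DigestAuthenticationProvider.generateDigest(rmUsername + ":" + rmPassword));
    // On failover the new active RM would replace the previous RM's
    // create/delete entry with its own, leaving shared admin ACLs intact.
    acls.add(new ACL(CREATE_DELETE_PERMS, rmId));
    return acls;
  }
}
{code}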
[jira] [Commented] (YARN-987) Adding History Service to use Store and converting Historydata to Report
[ https://issues.apache.org/jira/browse/YARN-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813721#comment-13813721 ] Zhijie Shen commented on YARN-987: -- bq. As we discussed offline, yes, that's similar to the JHS design, and as we decided to go for the cache implementation I think it makes sense to have a clear separation between these two. As we're going to have a cache, the abstraction of ApplicationHistoryContext may be necessary. However, one more question here: webUI and services are going to use ApplicationHistoryContext as well, right? If they are, returning the report PB is actually not necessary for the web. If they're not, webUI and services need a duplicate abstraction combining cache and store, which is not concise in terms of coding. Some detailed comments on the patch: * Add the config to yarn-default.xml as well. Btw, is store.class a bit better, as we have XXXApplicationHistoryStore, not XXXApplicationHistoryStorage? {code} + /** AHS STORAGE CLASS */ + public static final String AHS_STORAGE = AHS_PREFIX + "storage.class"; {code} * The methods in ApplicationHistoryContext need to be annotated as well. * Unnecessary code. ApplicationHistoryStore must be a service. {code} +if (historyStore instanceof Service) { + ((Service) historyStore).init(conf); +} {code} * Is it better to rename getFinal to convertToReport? * For ApplicationReport, you may want to get the history data of its last application attempt to fill the empty fields below. {code} +return ApplicationReport.newInstance(appHistory.getApplicationId(), null, + appHistory.getUser(), appHistory.getQueue(), appHistory +.getApplicationName(), "", 0, null, null, "", "", appHistory +.getStartTime(), appHistory.getFinishTime(), appHistory +.getFinalApplicationStatus(), null, "", 100, appHistory +.getApplicationType(), null); {code} Adding History Service to use Store and converting Historydata to Report Key: YARN-987 URL: https://issues.apache.org/jira/browse/YARN-987 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-987-1.patch, YARN-987-2.patch, YARN-987-3.patch, YARN-987-4.patch -- This message was sent by Atlassian JIRA (v6.1#6144)