[jira] [Commented] (YARN-2397) RM web interface sometimes returns request is a replay error in secure mode
[ https://issues.apache.org/jira/browse/YARN-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090368#comment-14090368 ] Hadoop QA commented on YARN-2397: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660551/apache-yarn-2397.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4559//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4559//console This message is automatically generated. RM web interface sometimes returns request is a replay error in secure mode --- Key: YARN-2397 URL: https://issues.apache.org/jira/browse/YARN-2397 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2397.0.patch The RM web interface sometimes returns a request is a replay error if the default kerberos http filter is enabled. This is because it uses the new RMAuthenticationFilter in addition to the AuthenticationFilter. There is a workaround to set yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled to false. This bug is to fix the code to use only the RMAuthenticationFilter and not both. -- This message was sent by Atlassian JIRA (v6.2#6252)
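The workaround named in the description is a configuration flag; a minimal sketch of applying it, assuming only the standard org.apache.hadoop.conf.Configuration API (the helper class is illustrative, and in practice the property would normally be set in yarn-site.xml):
{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper, not from the patch: disables the RM delegation-token
// auth filter so only one Kerberos AuthenticationFilter checks each request,
// avoiding the double replay check behind the error above.
public class ReplayWorkaround {
  public static Configuration applyWorkaround(Configuration conf) {
    conf.setBoolean(
        "yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled",
        false);
    return conf;
  }
}
{code}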
[jira] [Commented] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090406#comment-14090406 ] Hadoop QA commented on YARN-2249: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660542/YARN-2249.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4560//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4560//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4560//console This message is automatically generated. RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
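To make the race concrete, here is a hedged sketch (not the attached patch; the class and method names are hypothetical and plain Strings stand in for ContainerIds) of a buffer-and-replay approach: park release requests that arrive before recovery instead of dropping them.
{code}
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of handling releases that race with recovery.
public class PendingReleaseBuffer {
  private final Set<String> recovered = new HashSet<>();      // recovered container IDs
  private final Set<String> pendingRelease = new HashSet<>(); // releases that arrived early

  public synchronized void onReleaseRequest(String containerId) {
    if (recovered.contains(containerId)) {
      release(containerId);            // normal path: the scheduler knows the container
    } else {
      pendingRelease.add(containerId); // park the request instead of losing it
    }
  }

  public synchronized void onContainerRecovered(String containerId) {
    recovered.add(containerId);
    if (pendingRelease.remove(containerId)) {
      release(containerId);            // replay the release that came in too early
    }
  }

  private void release(String containerId) {
    // stand-in for the scheduler's actual release logic
  }
}
{code}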
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090420#comment-14090420 ] Sandy Ryza commented on YARN-807: - bq. If you think it's a bug, we can resolve it in YARN-2385. bq. We may need to create a Map<queue-name, app-id> in RMContext. It's also worth considering only holding this map for completed applications, so we don't need to keep two maps for running applications. When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question "which apps are in queue x" can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.prelim2.patch Second patch uploaded that adds expiration to the entries in NM. getContainersToCleanup is used to remove containers currently. Not sure how we can reuse it for acking that the containers were notified to the AM. Are you saying the first time a containerId is in that list, it's for removing it, and the next time it's used to ack that the AM has received it? Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AMs about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
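The protocol the description calls for — keep completed containers until the RM confirms the AM pulled them, and resend everything on re-register — can be outlined as below. This is a hedged illustration only; the names are hypothetical and plain Strings stand in for ContainerIds.
{code}
import java.util.HashSet;
import java.util.Set;

// Hypothetical NM-side bookkeeping for completed-container notifications.
public class CompletedContainerTracker {
  private final Set<String> pendingAmAck = new HashSet<>();

  public synchronized void containerCompleted(String containerId) {
    pendingAmAck.add(containerId);    // report to the RM, but remember until acked
  }

  public synchronized void rmAckedPull(String containerId) {
    pendingAmAck.remove(containerId); // the AM pulled it; safe to forget
  }

  // On re-register after an RM restart, resend everything still un-acked.
  // Duplicates are possible if the RM died between the AM pull and the ack.
  public synchronized Set<String> toResendOnReregister() {
    return new HashSet<>(pendingAmAck);
  }
}
{code}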
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090440#comment-14090440 ] Wangda Tan commented on YARN-807: - bq. It's also worth considering only holding this map for completed applications, so we don't need to keep two maps for running applications. I suggest we do it this way: 1) Rename the scheduler-side getAppsInQueue to getRunningAppsInQueue 2) Create a Map<Queue-name, Set<App-ID>> in RMContext; it will contain completed/running apps. The benefit of storing them separately is that we don't need to query two places when a client wants to get applications. And getRunningAppsInQueue on the scheduler side will be used when we need to query running apps in a queue, like YARN-2378. Thanks, Wangda When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question "which apps are in queue x" can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
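A hedged sketch of the Map<Queue-name, Set<App-ID>> structure proposed above, with illustrative names (Strings stand in for the real queue and application-ID types):
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical per-queue app index of the kind proposed for RMContext.
public class QueueAppIndex {
  private final Map<String, Set<String>> appsByQueue = new HashMap<>();

  public synchronized void addApp(String queue, String appId) {
    appsByQueue.computeIfAbsent(queue, q -> new HashSet<>()).add(appId);
  }

  public synchronized Set<String> getAppsInQueue(String queue) {
    Set<String> apps = appsByQueue.get(queue);
    if (apps == null) {
      return Collections.emptySet();
    }
    return new HashSet<>(apps); // defensive copy for callers
  }
}
{code}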
[jira] [Updated] (YARN-2397) RM web interface sometimes returns request is a replay error in secure mode
[ https://issues.apache.org/jira/browse/YARN-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2397: Attachment: apache-yarn-2397.1.patch Patch to address [~zjshen] comments and fix the test case. RM web interface sometimes returns request is a replay error in secure mode --- Key: YARN-2397 URL: https://issues.apache.org/jira/browse/YARN-2397 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2397.0.patch, apache-yarn-2397.1.patch The RM web interface sometimes returns a request is a replay error if the default kerberos http filter is enabled. This is because it uses the new RMAuthenticationFilter in addition to the AuthenticationFilter. There is a workaround to set yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled to false. This bug is to fix the code to use only the RMAuthenticationFilter and not both. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090447#comment-14090447 ] Sandy Ryza commented on YARN-807: - I just remembered a couple reasons why it's important that we go through the scheduler: * *Getting all the apps underneath a parent queue* - the scheduler holds queue hierarchy information that allows us to return applications in all leaf queues underneath a parent queue. * *Aliases* - In the Fair Scheduler, default is shorthand for root.default, so querying on either of these names should return applications in that queue. I'm open to approaches that don't require going through the scheduler, but I think we should make sure they keep supporting these capabilities. When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question "which apps are in queue x" can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090455#comment-14090455 ] Wangda Tan commented on YARN-807: - Hi Sandy, Thanks for your elaboration. As you said, I agree we need to go through the scheduler, given the two capabilities you mentioned. Maybe a possible way is saving completed apps in the leaf queue as you mentioned. I remember YARN will now evict some apps when the total number of apps exceeds a configured number (like 10,000). We should do such evicting for completed apps in the leaf queue as well. When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question "which apps are in queue x" can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090476#comment-14090476 ] Janos Matyas commented on YARN-2248: Sounds good - let us know if we can help anyhow - we use this feature internally, so once you submit a patch we can check/test on our side as well. Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Fix For: 2.6.0 Attachments: YARN-2248-1.patch, YARN-2248-2.patch, YARN-2248-3.patch We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version as there are differences between 2.4.* and 3.* interfaces. The story behind is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090495#comment-14090495 ] Varun Saxena commented on YARN-2138: Thanks [~jianhe] for the review. I will make the necessary changes and upload a new patch. Sure [~kkambatl], let me know if any further changes are required. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090501#comment-14090501 ] Steve Loughran commented on YARN-675: - This is dangerous if the logs are more than a few gigabytes. In YarnClient, pull AM logs on AM container failure --- Key: YARN-675 URL: https://issues.apache.org/jira/browse/YARN-675 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Li Lu Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to pull its logs from the NM to the client so that they can be displayed immediately to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090502#comment-14090502 ] Steve Loughran commented on YARN-2392: -- ...there's no tests here as there is nothing to test except for visual review of the message. A lot of existing tests do look for the "Failed the application" string at the end of the message, with that string hard coded into the test methods. Those should really be reworked to use a constant string, as otherwise they are very brittle. This patch leaves the relevant text alone to avoid breaking anything. add more diags about app retry limits on AM failures Key: YARN-2392 URL: https://issues.apache.org/jira/browse/YARN-2392 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Attachments: YARN-2392-001.patch # when an app fails the failure count is shown, but not what the global + local limits are. If the two are different, they should both be printed. # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.2#6252)
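A hedged sketch of the constant extraction the comment recommends; the class and constant names are illustrative, not from any patch:
{code}
// Hypothetical home for the shared diagnostic string the tests grep for.
public final class DiagnosticMessages {
  public static final String FAILED_THE_APPLICATION = "Failed the application";

  private DiagnosticMessages() {
  }
}
{code}
Tests could then assert against DiagnosticMessages.FAILED_THE_APPLICATION instead of a hard-coded literal, so a later wording change breaks nothing.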
[jira] [Commented] (YARN-2397) RM web interface sometimes returns request is a replay error in secure mode
[ https://issues.apache.org/jira/browse/YARN-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090566#comment-14090566 ] Hadoop QA commented on YARN-2397: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660578/apache-yarn-2397.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4561//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4561//console This message is automatically generated. RM web interface sometimes returns request is a replay error in secure mode --- Key: YARN-2397 URL: https://issues.apache.org/jira/browse/YARN-2397 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2397.0.patch, apache-yarn-2397.1.patch The RM web interface sometimes returns a request is a replay error if the default kerberos http filter is enabled. This is because it uses the new RMAuthenticationFilter in addition to the AuthenticationFilter. There is a workaround to set yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled to false. This bug is to fix the code to use only the RMAuthenticationFilter and not both. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2138: --- Attachment: YARN-2138.002.patch Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2373) WebAppUtils Should Use configuration.getPassword for Accessing SSL Passwords
[ https://issues.apache.org/jira/browse/YARN-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090659#comment-14090659 ] Varun Vasudev commented on YARN-2373: - [~lmccay] thanks for the patch! Some general questions (since this is part of a larger effort) - 1. For the null case (where WebAppUtils.getPassword() returns null), should we add a warning or an audit log that someone was trying to get a password that was null? 2. Will you update documentation in another ticket (just to let users know that they can use a CredentialProvider instead of using plain text)? Other than that, it looks good to me. WebAppUtils Should Use configuration.getPassword for Accessing SSL Passwords Key: YARN-2373 URL: https://issues.apache.org/jira/browse/YARN-2373 Project: Hadoop YARN Issue Type: Bug Reporter: Larry McCay Attachments: YARN-2373.patch, YARN-2373.patch, YARN-2373.patch As part of HADOOP-10904, this jira represents a change to WebAppUtils to uptake the use of the credential provider API through the new method on Configuration called getPassword. This provides an alternative to storing the passwords in clear text within the ssl-server.xml file while maintaining backward compatibility with that behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2373) WebAppUtils Should Use configuration.getPassword for Accessing SSL Passwords
[ https://issues.apache.org/jira/browse/YARN-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090662#comment-14090662 ] Varun Vasudev commented on YARN-2373: - Missed one more question - are you taking care of changes to ssl-client.xml as well? WebAppUtils Should Use configuration.getPassword for Accessing SSL Passwords Key: YARN-2373 URL: https://issues.apache.org/jira/browse/YARN-2373 Project: Hadoop YARN Issue Type: Bug Reporter: Larry McCay Attachments: YARN-2373.patch, YARN-2373.patch, YARN-2373.patch As part of HADOOP-10904, this jira represents a change to WebAppUtils to uptake the use of the credential provider API through the new method on Configuration called getPassword. This provides an alternative to storing the passwords in clear text within the ssl-server.xml file while maintaining backward compatibility with that behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090679#comment-14090679 ] Hudson commented on YARN-2008: -- FAILURE: Integrated in Hadoop-Yarn-trunk #638 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/638/]) YARN-2008. Fixed CapacityScheduler to calculate headroom based on max available capacity instead of configured max capacity. Contributed by Craig Welch (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616580) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DefaultResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DominantResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCSQueueUtils.java CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Fix For: 2.6.0 Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch, YARN-2008.6.patch, YARN-2008.7.patch, YARN-2008.8.patch, YARN-2008.9.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster, and Q1 and Q2 each currently use 50% of the actual cluster's resources, so there is no actual space available. If we use the current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1, but they have been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example: rootQueue has two children, L1ParentQueue1 (allowed to use up to 80% of its parent) and L1ParentQueue2 (allowed to use 20% of its parent in minimum); L1ParentQueue1 in turn has two children, L2LeafQueue1 (50% of its parent) and L2LeafQueue2 (50% of its parent in minimum). When we calculate the headroom of a user in L2LeafQueue2, the current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 has used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
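The arithmetic in the example above can be written out directly; this is a sketch of the calculation only, not scheduler code:
{code}
// Worked numbers from the YARN-2008 example: the fix bounds a leaf queue's
// max capacity by what its parent can actually still obtain, not merely by
// the configured maximums.
public class HeadroomExample {
  public static void main(String[] args) {
    double configuredMaxParent = 0.80; // L1ParentQueue1: up to 80% of root
    double usedBySibling = 0.40;       // L1ParentQueue2's current usage of root
    double leafShareOfParent = 0.50;   // L2LeafQueue2's share of its parent

    // Old calculation, from configured maximums only: 0.80 * 0.50 = 0.40
    double oldMaxCap = configuredMaxParent * leafShareOfParent;

    // Corrected: the parent can only reach what its sibling has not taken,
    // so min(0.80, 1.0 - 0.40) * 0.50 = 0.60 * 0.50 = 0.30
    double parentAvailable = Math.min(configuredMaxParent, 1.0 - usedBySibling);
    double newMaxCap = parentAvailable * leafShareOfParent;

    System.out.printf("old=%.2f new=%.2f%n", oldMaxCap, newMaxCap);
  }
}
{code}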
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090685#comment-14090685 ] Hudson commented on YARN-2288: -- FAILURE: Integrated in Hadoop-Yarn-trunk #638 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/638/]) YARN-2288. Made persisted data in LevelDB timeline store be versioned. Contributed by Junping Du. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616540) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Fix For: 2.6.0 Attachments: YARN-2288-v2.patch, YARN-2288-v3.patch, YARN-2288-v4.patch, YARN-2288-v5.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090707#comment-14090707 ] Junping Du commented on YARN-2302: -- Thanks for the patch, [~zjshen]! A couple of comments so far: In ApplicationHistoryServer.java,
{code}
+ protected TimelineDataManager timelineDataManager;
{code}
Better to be private, as it only gets consumed within ApplicationHistoryServer.
{code}
timelineACLsManager = createTimelineACLsManager(conf);
{code}
Looks like we don’t need timelineACLsManager anymore except for initiating TimelineDataManager. We can completely remove it (method and variable) after merging
{code}
protected TimelineACLsManager createTimelineACLsManager(Configuration conf) {
  return new TimelineACLsManager(conf);
}

protected TimelineDataManager createTimelineDataManager(Configuration conf) {
  return new TimelineDataManager(timelineStore, timelineACLsManager);
}
{code}
into:
{code}
private TimelineDataManager createTimelineDataManager(Configuration conf) {
  return new TimelineDataManager(timelineStore, new TimelineACLsManager(conf));
}
{code}
The visibility of the method should be private, as it is not consumed outside of the class. There are also some similar unnecessary protected methods around in this class; see if you want to update them here also, or we can do it separately later. In TimelineDataManager.java,
{code}
+ try {
+   if (existingEntity == null) {
+     injectOwnerInfo(entity, callerUGI.getShortUserName());
+   }
+ } catch (YarnException e) {
+   // Skip the entity which messes up the primary filter and record the
+   // error
+   LOG.warn("Skip the timeline entity: " + entityID + ", because "
+       + e.getMessage());
{code}
This exception sounds more serious than just a warn, so Log.error here may make more sense? Refactor TimelineWebServices Key: YARN-2302 URL: https://issues.apache.org/jira/browse/YARN-2302 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2302.1.patch Now TimelineWebServices contains non-trivial logic to process the HTTP requests, manipulate the data, check the access, and interact with the timeline store. I propose to move the data-oriented logic to a middle layer (so called TimelineDataManager), so that TimelineWebServices only processes the requests and calls TimelineDataManager to complete the remaining tasks. By doing this, we make the generic history module reuse TimelineDataManager internally (YARN-2033), invoking the putting/getting methods directly. Otherwise, we have to send the HTTP requests to TimelineWebServices to query the generic history data, which is not an efficient way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090771#comment-14090771 ] Jason Lowe commented on YARN-2398: -- System.exit is being called from the test, which is known to make surefire upset and fail the build. From the test output, it looks like a scheduler event is being dispatched and the test didn't set up a handler for it:
{noformat}
2014-08-08 13:48:28,867 INFO [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(387)) - localhost:0 Node Transitioned from NEW to RUNNING
2014-08-08 13:48:28,867 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(179)) - Error in dispatcher thread
java.lang.Exception: No handler for registered for class org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:724)
2014-08-08 13:48:28,868 INFO [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(184)) - Exiting, bbye..
{noformat}
TestResourceTrackerOnHA crashes --- Key: YARN-2398 URL: https://issues.apache.org/jira/browse/YARN-2398 Project: Hadoop YARN Issue Type: Bug Reporter: Jason Lowe TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
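For illustration, a handler registration of the kind the test appears to be missing might look like the sketch below. This is hedged: whether the actual fix registers a no-op handler or stubs the dispatcher differently is up to the patch; AsyncDispatcher.register and the scheduler event classes are real, while the helper itself is hypothetical.
{code}
import org.apache.hadoop.yarn.event.AsyncDispatcher;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType;

public class TestDispatcherSetup {
  // Hypothetical test helper: swallow scheduler events so the dispatcher
  // never hits the "No handler for registered" fatal path shown above.
  static void registerNoopSchedulerHandler(AsyncDispatcher dispatcher) {
    dispatcher.register(SchedulerEventType.class,
        new EventHandler<SchedulerEvent>() {
          @Override
          public void handle(SchedulerEvent event) {
            // intentionally ignore events the test does not exercise
          }
        });
  }
}
{code}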
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090790#comment-14090790 ] Karthik Kambatla commented on YARN-2352: Both tests pass locally for me, and the failures seen here are unrelated to the patch. Committing this. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch, yarn-2352-3.patch, yarn-2352-4.patch, yarn-2352-5.patch, yarn-2352-6.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090821#comment-14090821 ] Hudson commented on YARN-2008: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1831 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1831/]) YARN-2008. Fixed CapacityScheduler to calculate headroom based on max available capacity instead of configured max capacity. Contributed by Craig Welch (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616580) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DefaultResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DominantResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCSQueueUtils.java CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Fix For: 2.6.0 Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch, YARN-2008.6.patch, YARN-2008.7.patch, YARN-2008.8.patch, YARN-2008.9.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster, and Q1 and Q2 each currently use 50% of the actual cluster's resources, so there is no actual space available. If we use the current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1, but they have been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example: rootQueue has two children, L1ParentQueue1 (allowed to use up to 80% of its parent) and L1ParentQueue2 (allowed to use 20% of its parent in minimum); L1ParentQueue1 in turn has two children, L2LeafQueue1 (50% of its parent) and L2LeafQueue2 (50% of its parent in minimum). When we calculate the headroom of a user in L2LeafQueue2, the current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 has used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090827#comment-14090827 ] Hudson commented on YARN-2288: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1831 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1831/]) YARN-2288. Made persisted data in LevelDB timeline store be versioned. Contributed by Junping Du. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616540) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Fix For: 2.6.0 Attachments: YARN-2288-v2.patch, YARN-2288-v3.patch, YARN-2288-v4.patch, YARN-2288-v5.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2373) WebAppUtils Should Use configuration.getPassword for Accessing SSL Passwords
[ https://issues.apache.org/jira/browse/YARN-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090840#comment-14090840 ] Larry McCay commented on YARN-2373: --- Hi [~vvasudev] - thanks for the review and the good questions: bq. 1. For the null case (where WebAppUtils.getPassword() returns null), should we add a warning or an audit log that someone was trying to get a password that was null? There was no such log or audit record in that case before adding the additional check for an alias in the credential provider - so I didn't add anything new for it. It probably would be a good idea to do so - I don't know that this change makes it any more necessary though. Your question raises an interesting point for the Configuration.getPassword implementation though. I think that it would make sense to log a failure to get a password if there is no provisioned alias and it is configured to not allow fallback to config. We don't currently do that - it will just return null. I think we should file a separate jira for that. bq. 2. Will you update documentation in another ticket (just to let users know that they can use a CredentialProvider instead of using plain text)? We could do that. There is a jira for adding credential provider API documentation already - are you thinking that it needs to have YARN-specific documentation as well? bq. Missed one more question - are you taking care of changes to ssl-client.xml as well? This is a good point. I will have to track down those usages as well and file separate jiras. Are any of these questions/answers blockers for this patch? Thanks again for the review! WebAppUtils Should Use configuration.getPassword for Accessing SSL Passwords Key: YARN-2373 URL: https://issues.apache.org/jira/browse/YARN-2373 Project: Hadoop YARN Issue Type: Bug Reporter: Larry McCay Attachments: YARN-2373.patch, YARN-2373.patch, YARN-2373.patch As part of HADOOP-10904, this jira represents a change to WebAppUtils to uptake the use of the credential provider API through the new method on Configuration called getPassword. This provides an alternative to storing the passwords in clear text within the ssl-server.xml file while maintaining backward compatibility with that behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
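For reference, the Configuration.getPassword pattern this JIRA adopts looks roughly like the following; the property name is illustrative, and getPassword itself (credential provider first, clear-text config fallback, null if neither) is the Hadoop Common API under discussion:
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

public class SslPasswordLookup {
  // Hypothetical helper: resolve an SSL password through the credential
  // provider API, falling back to the clear-text property transparently.
  public static String getSslPassword(Configuration conf, String name)
      throws IOException {
    char[] pass = conf.getPassword(name);
    return pass == null ? null : new String(pass);
  }
}
{code}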
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090846#comment-14090846 ] Hudson commented on YARN-2352: -- FAILURE: Integrated in Hadoop-trunk-Commit #6037 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6037/]) YARN-2352. Add missing file. FairScheduler: Collect metrics on duration of critical methods that affect performance. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616784) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSOpDurations.java YARN-2352. FairScheduler: Collect metrics on duration of critical methods that affect performance. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616769) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/impl/MetricsCollectorImpl.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/lib/MutableStat.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch, yarn-2352-3.patch, yarn-2352-4.patch, yarn-2352-5.patch, yarn-2352-6.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2397) RM web interface sometimes returns request is a replay error in secure mode
[ https://issues.apache.org/jira/browse/YARN-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2397: --- Priority: Critical (was: Major) Target Version/s: 2.6.0 RM web interface sometimes returns request is a replay error in secure mode --- Key: YARN-2397 URL: https://issues.apache.org/jira/browse/YARN-2397 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Critical Attachments: apache-yarn-2397.0.patch, apache-yarn-2397.1.patch The RM web interface sometimes returns a request is a replay error if the default kerberos http filter is enabled. This is because it uses the new RMAuthenticationFilter in addition to the AuthenticationFilter. There is a workaround to set yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled to false. This bug is to fix the code to use only the RMAuthenticationFilter and not both. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090864#comment-14090864 ] Hudson commented on YARN-2288: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1857 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1857/]) YARN-2288. Made persisted data in LevelDB timeline store be versioned. Contributed by Junping Du. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616540) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Fix For: 2.6.0 Attachments: YARN-2288-v2.patch, YARN-2288-v3.patch, YARN-2288-v4.patch, YARN-2288-v5.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090861#comment-14090861 ] Hudson commented on YARN-2352: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1857 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1857/]) YARN-2352. Add missing file. FairScheduler: Collect metrics on duration of critical methods that affect performance. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616784) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSOpDurations.java YARN-2352. FairScheduler: Collect metrics on duration of critical methods that affect performance. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616769) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/impl/MetricsCollectorImpl.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/lib/MutableStat.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch, yarn-2352-3.patch, yarn-2352-4.patch, yarn-2352-5.patch, yarn-2352-6.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090857#comment-14090857 ] Hudson commented on YARN-2008: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1857 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1857/]) YARN-2008. Fixed CapacityScheduler to calculate headroom based on max available capacity instead of configured max capacity. Contributed by Craig Welch (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616580) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DefaultResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DominantResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCSQueueUtils.java CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Fix For: 2.6.0 Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch, YARN-2008.6.patch, YARN-2008.7.patch, YARN-2008.8.patch, YARN-2008.9.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster, and Q1 and Q2 each currently use 50% of the actual cluster's resources, so there is no actual space available. If we use the current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1, but they have been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example: rootQueue has two children, L1ParentQueue1 (allowed to use up to 80% of its parent) and L1ParentQueue2 (allowed to use 20% of its parent in minimum); L1ParentQueue1 in turn has two children, L2LeafQueue1 (50% of its parent) and L2LeafQueue2 (50% of its parent in minimum). When we calculate the headroom of a user in L2LeafQueue2, the current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 has used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090876#comment-14090876 ] Hadoop QA commented on YARN-2138: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660614/YARN-2138.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4562//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4562//console This message is automatically generated. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2396) RpcClientFactoryPBImpl.stopClient always throws due to missing close method
[ https://issues.apache.org/jira/browse/YARN-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chang li updated YARN-2396: --- Attachment: yarn2396.patch RpcClientFactoryPBImpl.stopClient always throws due to missing close method --- Key: YARN-2396 URL: https://issues.apache.org/jira/browse/YARN-2396 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.4.1 Reporter: Jason Lowe Assignee: chang li Attachments: yarn2396.patch RpcClientFactoryPBImpl.stopClient will throw a YarnRuntimeException if the protocol does not have a close method, despite the log message indicating it is ignoring errors. It's interesting to note that none of the YARN protocol classes currently have a close method. -- This message was sent by Atlassian JIRA (v6.2#6252)
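For readers skimming the patch queue, the shape of the fix is small. Here is a hedged sketch of the intended behavior (illustrative only, not the attached yarn2396.patch): look up an optional close method reflectively and genuinely ignore its absence, instead of wrapping NoSuchMethodException in a YarnRuntimeException:
{code}
import java.lang.reflect.Method;

public class StopClientSketch {
  public static void stopClient(Object proxy) {
    try {
      // The protocol may or may not declare a close() method.
      Method close = proxy.getClass().getMethod("close");
      close.invoke(proxy);
    } catch (NoSuchMethodException e) {
      // No close method on this protocol; ignore, as the log message promises.
    } catch (Exception e) {
      // Best-effort cleanup: log and continue instead of rethrowing.
      System.err.println("Ignoring error while stopping client: " + e);
    }
  }
}
{code}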
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090991#comment-14090991 ] Jian He commented on YARN-2212: --- looks good, +1 ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch, YARN-2212.8.patch, YARN-2212.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091002#comment-14091002 ] Karthik Kambatla commented on YARN-2138: Looks good to me. Thanks Varun for looking into this. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091037#comment-14091037 ] Karthik Kambatla edited comment on YARN-2026 at 8/8/14 5:54 PM: Thanks for bearing with us on this JIRA, Ashwin. That patch looks mostly good. Minor comments: # This is a very subjective opinion. In ComputeFairShares, would it be cleaner/simpler to rename existing {{public computeShares}} to {{private computeSharesInternal}}, and add a new {{public computeShares}} that calls the internal version only with active queues? (see the sketch after this message) # Thanks for adding a bunch of tests in TestFairSchedulerFairShare. Post YARN-1474, ## setup() need not call {{scheduler.setRMContext(resourceManager.getRMContext());}} ## configureClusterWithQueuesAndOneNode need not call the following: {code} scheduler.init(conf); scheduler.start(); scheduler.reinitialize(conf, resourceManager.getRMContext()); {code} was (Author: kkambatl): Thanks for bearing with us on this JIRA, Ashwin. That patch looks mostly good. Minor comments: # This is a very subjective opinion. In ComputeFairShares, would it be cleaner/simpler to rename existing {{public computeShares}} to {{private computeSharesInternal}}, and add a new {{public computeShares}} that takes calls the internal version only with active queues? # Thanks for adding a bunch of tests in TestFairSchedulerFairShare. Post YARN-1474, ## setup() need not call {{scheduler.setRMContext(resourceManager.getRMContext());}} ## configureClusterWithQueuesAndOneNode need not call the following: {code} scheduler.init(conf); scheduler.start(); scheduler.reinitialize(conf, resourceManager.getRMContext()); {code} Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen that a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue which has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether each is an active or an inactive (no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue (with a high fair share), the child queues' fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly and preemption doesn't happen even if other (non-sibling) leaf queues are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage is below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
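A minimal sketch of the rename suggested in comment #1 above, under assumptions of my own: the Schedulable interface, the isActive check, and the weight-proportional division below are illustrative stand-ins, not the real ComputeFairShares logic.
{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class ComputeFairSharesSketch {
  interface Schedulable {
    boolean isActive(); // e.g. has at least one runnable app
    double getWeight();
    void setFairShare(double share);
  }

  // Public entry point: only active queues participate in the division.
  public static void computeShares(Collection<? extends Schedulable> all,
                                   double totalResources) {
    List<Schedulable> active = new ArrayList<>();
    for (Schedulable s : all) {
      if (s.isActive()) active.add(s);
    }
    computeSharesInternal(active, totalResources);
  }

  // Private worker: the division logic itself is unchanged and never
  // needs to know that inactive queues were filtered out.
  private static void computeSharesInternal(
      Collection<? extends Schedulable> queues, double totalResources) {
    double totalWeight = 0;
    for (Schedulable s : queues) totalWeight += s.getWeight();
    for (Schedulable s : queues) {
      s.setFairShare(totalWeight == 0
          ? 0 : totalResources * s.getWeight() / totalWeight);
    }
  }
}
{code}
The point of this shape is that callers keep the same public entry point while the division itself only ever sees active queues.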
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091037#comment-14091037 ] Karthik Kambatla commented on YARN-2026: Thanks for bearing with us on this JIRA, Ashwin. That patch looks mostly good. Minor comments: # This is a very subjective opinion. In ComputeFairShares, would it be cleaner/simpler to rename existing {{public computeShares}} to {{private computeSharesInternal}}, and add a new {{public computeShares}} that calls the internal version only with active queues? # Thanks for adding a bunch of tests in TestFairSchedulerFairShare. Post YARN-1474, ## setup() need not call {{scheduler.setRMContext(resourceManager.getRMContext());}} ## configureClusterWithQueuesAndOneNode need not call the following: {code} scheduler.init(conf); scheduler.start(); scheduler.reinitialize(conf, resourceManager.getRMContext()); {code} Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen that a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue which has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether each is an active or an inactive (no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue (with a high fair share), the child queues' fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly and preemption doesn't happen even if other (non-sibling) leaf queues are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage is below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2 - Also note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster: childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources if needed through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2393) Fair Scheduler : Implement static fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091152#comment-14091152 ] Wei Yan commented on YARN-2393: --- Hey, [~ashwinshankar77], would you mind if I take this one? Fair Scheduler : Implement static fair share Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Static fair share is a fair share allocation that considers all (active and inactive) queues. It would be shown on the UI for better predictability of application finish times. We would compute the static fair share only when needed, such as on queue creation or when a node is added/removed. Please see YARN-2026 for the discussion on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2317) Update documentation about how to write YARN applications
[ https://issues.apache.org/jira/browse/YARN-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091154#comment-14091154 ] Zhijie Shen commented on YARN-2317: --- [~gtCarrera9], thanks for updating this document, which I think will be really helpful to new YARN app developers. I read through the updated document, and it looks good to me in general. I have some minor comments so far: 0. Resource Manager - ResourceManager, Node Manager - NodeManager, Application Master - ApplicationMaster 1. application submission context {code} + client can then set up application context, prepare the very first container of {code} 2. Should not be Unix-specific; it is expected to work on Windows as well. {code} Unix environment settings {code} 3. YARN cluster {code} + YARN platform, and handles application execution. It performs operations in an {code} 4. object {code} + AMRMClientAsync objects, with event handling methods specified in a {code} 5. Don't say event. Users don't need to know the internals; 4 callback methods? (see the client sketch after this message) {code} + NMClientAsync. Typical container events include start, stop, status {code} 6. Use Runnable objects to launch containers. can be removed, because it's not necessary to be on a separate thread. {code} +Use Runnable objects to launch containers. Communicate with node managers {code} 7. ContainerManagerProtocol {code} + ApplicationMasterProtocol and ContainerManager) are still preserved. The {code} 8. Is this still valid? {code} + // Set the necessary security tokens as needed + //amContainer.setContainerTokens(containerToken); {code} 9. Perhaps you want to mention unregistration after the AM determines the work is done. 10. In the Useful Links section, how about linking to the analogous webpages on this site: YARN Architecture and Capacity Scheduler 11. In the Sample Code section, maybe we don't want to talk about the IDE. And maybe call it sample application? BTW, I looked into the updated webpage directly, instead of doing a side-by-side comparison between the old and the new webpages. It would be great if you could list the significant changes in your patch, so that the community can be aware of them. Update documentation about how to write YARN applications - Key: YARN-2317 URL: https://issues.apache.org/jira/browse/YARN-2317 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2317-071714.patch, YARN-2317-073014-1.patch, YARN-2317-073014.patch Some information in the WritingYarnApplications webpage is outdated. Some refresh work is needed on this document to reflect the most recent changes in the YARN APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
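Since several of the review comments above concern the async client APIs, here is a compressed AM-side sketch, assuming the Hadoop 2.x AMRMClientAsync API. The handler bodies, heartbeat interval, and registration arguments are placeholders; the document under review covers the full flow.
{code}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinimalAM {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    AMRMClientAsync.CallbackHandler handler = new AMRMClientAsync.CallbackHandler() {
      public void onContainersAllocated(List<Container> containers) {
        // Launch work on the allocated containers, e.g. via NMClientAsync.
      }
      public void onContainersCompleted(List<ContainerStatus> statuses) {
        // Track completions and decide when the job is done.
      }
      public void onShutdownRequest() { /* stop and unregister */ }
      public void onNodesUpdated(List<NodeReport> updated) { }
      public void onError(Throwable e) { /* fail fast */ }
      public float getProgress() { return 0.0f; } // placeholder
    };
    // Heartbeat to the RM every 1000 ms; callbacks run on the client's threads.
    AMRMClientAsync<ContainerRequest> rmClient =
        AMRMClientAsync.createAMRMClientAsync(1000, handler);
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, ""); // host, port, tracking URL
    // ... addContainerRequest(...), wait for the work to finish ...
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rmClient.stop();
  }
}
{code}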
[jira] [Commented] (YARN-2393) Fair Scheduler : Implement static fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091177#comment-14091177 ] Ashwin Shankar commented on YARN-2393: -- Hey [~ywskycn], please go ahead. Fair Scheduler : Implement static fair share Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Static fair share is a fair share allocation that considers all (active and inactive) queues. It would be shown on the UI for better predictability of application finish times. We would compute the static fair share only when needed, such as on queue creation or when a node is added/removed. Please see YARN-2026 for the discussion on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2393) Fair Scheduler : Implement static fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-2393: - Assignee: Wei Yan Fair Scheduler : Implement static fair share Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Static fair share is a fair share allocation that considers all (active and inactive) queues. It would be shown on the UI for better predictability of application finish times. We would compute the static fair share only when needed, such as on queue creation or when a node is added/removed. Please see YARN-2026 for the discussion on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2396) RpcClientFactoryPBImpl.stopClient always throws due to missing close method
[ https://issues.apache.org/jira/browse/YARN-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091250#comment-14091250 ] Hadoop QA commented on YARN-2396: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660649/yarn2396.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4563//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4563//console This message is automatically generated. RpcClientFactoryPBImpl.stopClient always throws due to missing close method --- Key: YARN-2396 URL: https://issues.apache.org/jira/browse/YARN-2396 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.4.1 Reporter: Jason Lowe Assignee: chang li Attachments: yarn2396.patch RpcClientFactoryPBImpl.stopClient will throw a YarnRuntimeException if the protocol does not have a close method, despite the log message indicating it is ignoring errors. It's interesting to note that none of the YARN protocol classes currently have a close method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2396) RpcClientFactoryPBImpl.stopClient always throws due to missing close method
[ https://issues.apache.org/jira/browse/YARN-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091255#comment-14091255 ] Mit Desai commented on YARN-2396: - lgtm, +1 (non-binding). This is a one-line change that removes the throw, because the exception was supposed to be ignored in the first place. RpcClientFactoryPBImpl.stopClient always throws due to missing close method --- Key: YARN-2396 URL: https://issues.apache.org/jira/browse/YARN-2396 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.4.1 Reporter: Jason Lowe Assignee: chang li Attachments: yarn2396.patch RpcClientFactoryPBImpl.stopClient will throw a YarnRuntimeException if the protocol does not have a close method, despite the log message indicating it is ignoring errors. It's interesting to note that none of the YARN protocol classes currently have a close method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2067) FairScheduler update/continuous-scheduling threads should start only after the scheduler is started
[ https://issues.apache.org/jira/browse/YARN-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2067. Resolution: Invalid This has been addressed by other JIRAs already. FairScheduler update/continuous-scheduling threads should start only after the scheduler is started Key: YARN-2067 URL: https://issues.apache.org/jira/browse/YARN-2067 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
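The lifecycle concern behind this issue generalizes: background threads belong in the start phase, not in construction or initialization, so a scheduler that is initialized but never started does not leak threads. A minimal sketch of that pattern under stated assumptions (the init/start/stop method names mirror Hadoop's service convention; this is not the FairScheduler code):
{code}
public class UpdateThreadLifecycleSketch {
  private Thread updateThread;
  private volatile boolean running;

  public void init() {
    // Read configuration only; do NOT start threads here.
  }

  public void start() {
    running = true;
    updateThread = new Thread(() -> {
      while (running) {
        update(); // recompute shares, continuous scheduling, etc.
        try { Thread.sleep(500); } catch (InterruptedException e) { return; }
      }
    }, "FairSchedulerUpdateThread");
    updateThread.setDaemon(true);
    updateThread.start();
  }

  public void stop() throws InterruptedException {
    running = false;
    if (updateThread != null) {
      updateThread.interrupt();
      updateThread.join();
    }
  }

  private void update() { /* placeholder for the periodic work */ }
}
{code}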
[jira] [Updated] (YARN-2277) Add Cross-Origin support to the ATS REST API
[ https://issues.apache.org/jira/browse/YARN-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2277: -- Attachment: YARN-2277-v3.patch Add Cross-Origin support to the ATS REST API Key: YARN-2277 URL: https://issues.apache.org/jira/browse/YARN-2277 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2277-CORS.patch, YARN-2277-JSONP.patch, YARN-2277-v2.patch, YARN-2277-v3.patch, YARN-2277-v3.patch As the Application Timeline Server does not come with a built-in UI, it may make sense to enable JSONP or CORS REST API capabilities to allow a remote UI to access the data directly via JavaScript without the browser's cross-site (same-origin) restrictions getting in the way. An example client might use http://api.jquery.com/jQuery.getJSON/ This can alleviate the need to create a local proxy cache. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2277) Add Cross-Origin support to the ATS REST API
[ https://issues.apache.org/jira/browse/YARN-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091304#comment-14091304 ] Jonathan Eagles commented on YARN-2277: --- [~zjshen], thank you for your feedback, as this is going to have a much bigger impact on Hadoop as a whole. I have provided a minimal CORS filter that will give us an idea of whether this is the direction to go (a bare-bones sketch of such a filter follows this message). Based on the direction of this patch, the scope has widened to creating a general CrossOriginFilter for use within all Hadoop REST APIs. We will probably want to split the different pieces across JIRAs: umbrella, Filter and FilterInitializer, additional configuration, and individual REST servers. This way we can focus on the end goal of getting the Tez UI done in a timely manner without forgetting completeness of CORS support. Add Cross-Origin support to the ATS REST API Key: YARN-2277 URL: https://issues.apache.org/jira/browse/YARN-2277 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2277-CORS.patch, YARN-2277-JSONP.patch, YARN-2277-v2.patch, YARN-2277-v3.patch, YARN-2277-v3.patch As the Application Timeline Server does not come with a built-in UI, it may make sense to enable JSONP or CORS REST API capabilities to allow a remote UI to access the data directly via JavaScript without the browser's cross-site (same-origin) restrictions getting in the way. An example client might use http://api.jquery.com/jQuery.getJSON/ This can alleviate the need to create a local proxy cache. -- This message was sent by Atlassian JIRA (v6.2#6252)
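A bare-bones servlet Filter along the lines discussed above. This is illustrative only, not the attached patch; the echoed origin and the header values are placeholders that a real deployment would restrict via a configured allow-list:
{code}
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class MinimalCrossOriginFilter implements Filter {
  @Override public void init(FilterConfig config) { }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    String origin = request.getHeader("Origin");
    if (origin != null) {
      // A production filter would validate the origin against an allow-list.
      response.setHeader("Access-Control-Allow-Origin", origin);
      response.setHeader("Access-Control-Allow-Methods", "GET, POST, HEAD");
      response.setHeader("Access-Control-Allow-Headers",
          "X-Requested-With, Content-Type, Accept");
    }
    if ("OPTIONS".equals(request.getMethod())) {
      return; // Preflight request: the headers above are the whole answer.
    }
    chain.doFilter(req, res);
  }

  @Override public void destroy() { }
}
{code}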
[jira] [Updated] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2026: - Attachment: YARN-2026-v5.txt Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt, YARN-2026-v5.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen that a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue which has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether each is an active or an inactive (no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue (with a high fair share), the child queues' fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly and preemption doesn't happen even if other (non-sibling) leaf queues are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage is below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2 - Also note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster: childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources if needed through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091308#comment-14091308 ] Ashwin Shankar commented on YARN-2026: -- Thanks [~kasha]. All comments are addressed in the v5 patch. Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt, YARN-2026-v5.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen that a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue which has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether each is an active or an inactive (no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue (with a high fair share), the child queues' fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly and preemption doesn't happen even if other (non-sibling) leaf queues are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage is below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2 - Also note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster: childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources if needed through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212-branch-2.patch ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212-branch-2.patch, YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch, YARN-2212.8.patch, YARN-2212.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2237) MRAppMaster changes for AMRMToken roll-up
[ https://issues.apache.org/jira/browse/YARN-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong resolved YARN-2237. - Resolution: Fixed Fix Version/s: 2.6.0 Fixed and committed with YARN-2212 MRAppMaster changes for AMRMToken roll-up - Key: YARN-2237 URL: https://issues.apache.org/jira/browse/YARN-2237 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2237.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2207) Add ability to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong resolved YARN-2207. - Resolution: Fixed Fix Version/s: 2.6.0 Add ability to roll over AMRMToken -- Key: YARN-2207 URL: https://issues.apache.org/jira/browse/YARN-2207 Project: Hadoop YARN Issue Type: Task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Currently, the master key is fixed after it is created, but that is not ideal. We need to add the ability to roll over the AMRMToken. -- This message was sent by Atlassian JIRA (v6.2#6252)
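The essential trick behind rolling a master key without invalidating in-flight tokens is a two-phase roll: create the next key and start issuing against it, but keep honoring the current key until an activation step retires it. A minimal sketch under stated assumptions (the class, fields, and method names here are illustrative, not the AMRMTokenSecretManager implementation):
{code}
import java.security.SecureRandom;

public class TwoPhaseKeyRollSketch {
  static class MasterKey {
    final int id;
    final byte[] secret;
    MasterKey(int id, byte[] secret) { this.id = id; this.secret = secret; }
  }

  private final SecureRandom random = new SecureRandom();
  private MasterKey current = newKey(0);
  private MasterKey next;

  private MasterKey newKey(int id) {
    byte[] secret = new byte[64];
    random.nextBytes(secret);
    return new MasterKey(id, secret);
  }

  // Phase 1: create the next key. New tokens are issued against it,
  // but tokens signed with the current key are still accepted.
  public synchronized void rollMasterKey() {
    next = newKey(current.id + 1);
  }

  // Phase 2, after every live AM has had a chance to pick up a new token:
  // retire the old key.
  public synchronized void activateNextMasterKey() {
    if (next != null) {
      current = next;
      next = null;
    }
  }

  public synchronized boolean isAcceptedKeyId(int keyId) {
    return keyId == current.id || (next != null && keyId == next.id);
  }
}
{code}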
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091330#comment-14091330 ] Xuan Gong commented on YARN-2212: - Committed into trunk and branch-2. Thanks Jian for review. ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2212-branch-2.patch, YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch, YARN-2212.8.patch, YARN-2212.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-356) Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env
[ https://issues.apache.org/jira/browse/YARN-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved YARN-356. --- Resolution: Duplicate Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env --- Key: YARN-356 URL: https://issues.apache.org/jira/browse/YARN-356 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.0.2-alpha Reporter: Lohit Vijayarenu At present it is difficult to set different Xmx values for the RM and NM without having different yarn-env.sh files. As with HDFS, it would be good to have YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2302: -- Attachment: YARN-2302.2.patch [~djp], thanks for your review. The general response to your comments on ApplicationHistoryServer is that the protected vars/methods are legacy things. Anyway, before it grows worse, I did some more refactoring of this class in the new patch. In addition, I addressed the log level issue in the new patch as well. Refactor TimelineWebServices Key: YARN-2302 URL: https://issues.apache.org/jira/browse/YARN-2302 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2302.1.patch, YARN-2302.2.patch Now TimelineWebServices contains non-trivial logic to process the HTTP requests, manipulate the data, check the access, and interact with the timeline store. I propose to move the data-oriented logic to a middle layer (the so-called TimelineDataManager), so that TimelineWebServices only processes the requests and calls TimelineDataManager to complete the remaining tasks. By doing this, we let the generic history module reuse TimelineDataManager internally (YARN-2033), invoking the putting/getting methods directly. Otherwise, we would have to send HTTP requests to TimelineWebServices to query the generic history data, which is not an efficient way. -- This message was sent by Atlassian JIRA (v6.2#6252)
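A skeletal illustration of the proposed layering. All types below are hypothetical stand-ins, not the patch's real signatures: the web layer only parses requests and delegates, while store access and ACL checking live in the data manager, so the generic history module can call the manager directly without going through HTTP.
{code}
public class TimelineLayeringSketch {
  interface TimelineStore {
    Object getEntity(String id, String type);
  }
  interface AclChecker {
    boolean canRead(String user, Object entity);
  }

  // Middle layer: owns store access and access control.
  static class TimelineDataManager {
    private final TimelineStore store;
    private final AclChecker acls;
    TimelineDataManager(TimelineStore store, AclChecker acls) {
      this.store = store;
      this.acls = acls;
    }
    Object getEntity(String id, String type, String callerUser) {
      Object entity = store.getEntity(id, type);
      if (entity == null || !acls.canRead(callerUser, entity)) {
        return null; // hide both absence and denial
      }
      return entity;
    }
  }

  // Web layer: request handling only, then delegate.
  static class TimelineWebServices {
    private final TimelineDataManager manager;
    TimelineWebServices(TimelineDataManager manager) { this.manager = manager; }
    Object getEntity(String id, String type, String remoteUser) {
      return manager.getEntity(id, type, remoteUser); // no store logic here
    }
  }
}
{code}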
[jira] [Comment Edited] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091364#comment-14091364 ] Zhijie Shen edited comment on YARN-2302 at 8/8/14 10:12 PM: [~djp], thanks for your review. The general response to your comments on ApplicationHistoryServer is that the protected vars/methods are legacy things. Anyway, before it grows worse, I did some more refactoring of this class in the new patch. In addition, I addressed the log level issue in the new patch as well. was (Author: zjshen): [~djp], thanks for your review. The general response to your comments on ApplicationHistoryServer is that protected vars/methods are the legacy things. Anyway before it grows worth, I did some more refactoring for this class in the new patch. In addition, I address the Log level issue in the new patch as well. Refactor TimelineWebServices Key: YARN-2302 URL: https://issues.apache.org/jira/browse/YARN-2302 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2302.1.patch, YARN-2302.2.patch Now TimelineWebServices contains non-trivial logic to process the HTTP requests, manipulate the data, check the access, and interact with the timeline store. I propose to move the data-oriented logic to a middle layer (the so-called TimelineDataManager), so that TimelineWebServices only processes the requests and calls TimelineDataManager to complete the remaining tasks. By doing this, we let the generic history module reuse TimelineDataManager internally (YARN-2033), invoking the putting/getting methods directly. Otherwise, we would have to send HTTP requests to TimelineWebServices to query the generic history data, which is not an efficient way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091388#comment-14091388 ] Jian He commented on YARN-2138: --- Thanks Karthik for the review! Varun, the patch no longer applies on trunk. Mind updating the patch, please? Minor comment: I noticed there seems to be a tab before RMAppAttemptEventType.ATTEMPT_UPDATE_SAVED)); {code} + new RMAppAttemptEvent(applicationAttempt.getAppAttemptId(), + RMAppAttemptEventType.ATTEMPT_UPDATE_SAVED)); {code} Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091415#comment-14091415 ] Hudson commented on YARN-2212: -- FAILURE: Integrated in Hadoop-trunk-Commit #6039 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6039/]) YARN-2212: ApplicationMaster needs to find a way to update the AMRMToken periodically. Contributed by Xuan Gong (xgong: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1616892) * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClientOnRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/AMLauncher.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/AMRMTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRMWithCustomAMLauncher.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2212-branch-2.patch, YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch, YARN-2212.8.patch, YARN-2212.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2249: -- Attachment: YARN-2249.4.patch The new patch addresses the comments from Wangda. RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to the RM to recover the containers. If the RM receives a container release request before the container is actually recovered in the scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
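The shape of the fix being discussed, as a hedged sketch with invented names (not the attached YARN-2249 patch): remember release requests that arrive for containers the scheduler has not recovered yet, and replay them when the NM report finally recovers the container.
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PendingReleaseSketch {
  private final Map<String, Object> recovered = new ConcurrentHashMap<>();
  private final Set<String> pendingRelease = ConcurrentHashMap.newKeySet();

  // AM resync: the release request may arrive before the NM reports the container.
  public void releaseContainer(String containerId) {
    Object container = recovered.get(containerId);
    if (container != null) {
      doRelease(containerId);
    } else {
      pendingRelease.add(containerId); // remember instead of dropping
    }
  }

  // NM report: the container is recovered into the scheduler; replay any
  // release request that arrived early.
  public void recoverContainer(String containerId, Object container) {
    recovered.put(containerId, container);
    if (pendingRelease.remove(containerId)) {
      doRelease(containerId);
    }
  }

  private void doRelease(String containerId) {
    recovered.remove(containerId);
    System.out.println("released " + containerId);
  }
}
{code}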
[jira] [Updated] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2138: --- Attachment: YARN-2138.003.patch Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.003.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091438#comment-14091438 ] Varun Saxena commented on YARN-2138: Thanks [~jianhe] and [~kasha] for the review. I have uploaded a new patch which should apply to trunk. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.003.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091447#comment-14091447 ] Zhijie Shen commented on YARN-1954: --- +1 except some nits: 1. The abstract method can actually be part of AMRMClient(Async) directly, instead of putting it into the impl, right? Just need an additional LOG in AMRMClient(Async). {code} public abstract void waitFor(Supplier<Boolean> check, int checkEveryMillis, int logInterval) throws InterruptedException, IllegalArgumentException; {code} 2. IllegalArgumentException doesn't need to be declared Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
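A minimal sketch of the polling helper under discussion. Assumptions: the quoted signature suggests a Supplier type (the sketch uses java.util.function.Supplier; the actual patch may use a different Supplier), and the log-every-N-polls behavior is approximated with stdout.
{code}
import java.util.function.Supplier;

public class WaitForSketch {
  // Block until check returns true, polling every checkEveryMillis ms and
  // logging every logInterval polls. Mirrors the quoted signature in spirit.
  public static void waitFor(Supplier<Boolean> check,
                             int checkEveryMillis,
                             int logInterval) throws InterruptedException {
    if (check == null) {
      throw new IllegalArgumentException("check cannot be null");
    }
    int polls = 0;
    while (!check.get()) {
      polls++;
      if (logInterval > 0 && polls % logInterval == 0) {
        System.out.println("waitFor: still waiting after " + polls + " checks");
      }
      Thread.sleep(checkEveryMillis);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    long deadline = System.currentTimeMillis() + 300;
    waitFor(() -> System.currentTimeMillis() >= deadline, 50, 2);
    System.out.println("done");
  }
}
{code}
With such a helper, the AM's main thread can replace its hand-written dummy loop with a single call that blocks until the done condition (or unregistration) is observed.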
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091516#comment-14091516 ] Hadoop QA commented on YARN-2026: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660716/YARN-2026-v5.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4564//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4564//console This message is automatically generated. Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt, YARN-2026-v5.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen that a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue which has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether each is an active or an inactive (no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue (with a high fair share), the child queues' fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly and preemption doesn't happen even if other (non-sibling) leaf queues are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage is below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2 - Also note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster: childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources if needed through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091522#comment-14091522 ] Junping Du commented on YARN-2302: -- Thanks for updating the patch, [~zjshen]! Overall, the patch looks good now. Some minor comments: {code} + public TimelinePutResponse postEntities( + TimelineEntities entities, + UserGroupInformation callerUGI) throws YarnException, IOException { +if (entities == null) { {code} Shall we rename this method to putEntities? There is a slight difference between the put and post operations (from a REST perspective): post creates (the first time) while put updates. The internal behavior of the method is like an update and actually calls the put operation, so put would be more proper. In addition, I think we should have javadoc for the public methods in TimelineDataManager.java. The rest looks fine. Refactor TimelineWebServices Key: YARN-2302 URL: https://issues.apache.org/jira/browse/YARN-2302 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2302.1.patch, YARN-2302.2.patch Now TimelineWebServices contains non-trivial logic to process the HTTP requests, manipulate the data, check the access, and interact with the timeline store. I propose to move the data-oriented logic to a middle layer (the so-called TimelineDataManager), so that TimelineWebServices only processes the requests and calls TimelineDataManager to complete the remaining tasks. By doing this, we let the generic history module reuse TimelineDataManager internally (YARN-2033), invoking the putting/getting methods directly. Otherwise, we would have to send HTTP requests to TimelineWebServices to query the generic history data, which is not an efficient way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2277) Add Cross-Origin support to the ATS REST API
[ https://issues.apache.org/jira/browse/YARN-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091527#comment-14091527 ] Hadoop QA commented on YARN-2277: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660715/YARN-2277-v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4565//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4565//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4565//console This message is automatically generated. Add Cross-Origin support to the ATS REST API Key: YARN-2277 URL: https://issues.apache.org/jira/browse/YARN-2277 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2277-CORS.patch, YARN-2277-JSONP.patch, YARN-2277-v2.patch, YARN-2277-v3.patch, YARN-2277-v3.patch As the Application Timeline Server does not come with a built-in UI, it may make sense to enable JSONP or CORS REST API capabilities to allow a remote UI to access the data directly via JavaScript without the browser's cross-site (same-origin) restrictions getting in the way. An example client might use http://api.jquery.com/jQuery.getJSON/ This can alleviate the need to create a local proxy cache. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091539#comment-14091539 ] Hadoop QA commented on YARN-2302: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660727/YARN-2302.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4566//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4566//console This message is automatically generated. Refactor TimelineWebServices Key: YARN-2302 URL: https://issues.apache.org/jira/browse/YARN-2302 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2302.1.patch, YARN-2302.2.patch Now TimelineWebServices contains non-trivial logic to process the HTTP requests, manipulate the data, check access, and interact with the timeline store. I propose moving the data-oriented logic to a middle layer (a so-called TimelineDataManager), so that TimelineWebServices only processes the requests and calls TimelineDataManager to complete the remaining work. By doing this, we let the generic history module reuse TimelineDataManager internally (YARN-2033), invoking the put/get methods directly. Otherwise, we would have to send HTTP requests to TimelineWebServices to query the generic history data, which is not efficient. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2399) FairScheduler: Merge AppSchedulable and FSSchedulerApp into FSAppAttempt
Karthik Kambatla created YARN-2399: -- Summary: FairScheduler: Merge AppSchedulable and FSSchedulerApp into FSAppAttempt Key: YARN-2399 URL: https://issues.apache.org/jira/browse/YARN-2399 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.5.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla FairScheduler keeps two data structures for each application, which makes the code hard to follow. We should merge them into one for better long-term maintainability. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091577#comment-14091577 ] Karthik Kambatla commented on YARN-2026: +1. Checking this in. Fair scheduler: Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt, YARN-2026-v5.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios in which a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue that has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all of its children irrespective of whether a child is active or inactive (no apps running). Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and it has demand greater than that. When there are many queues under a parent queue (with a high fair share), each child queue's fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly, and preemption doesn't happen even if other leaf queues (non-siblings) are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues an 8% fair share. Preemption would happen only if a child queue fell below 4% (0.5*8=4). Let's say that at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 gets a big job that requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue that has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only among its active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2 - Note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster: childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers.
We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources, if needed, through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
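To make the proposed fix concrete, here is a small sketch of dividing a parent's fair share by weight over only its active children. This is a hypothetical helper written for illustration, not the actual ComputeFairShares code:
{code}
import java.util.Arrays;
import java.util.List;

public class ActiveFairShares {
  static class Queue {
    final String name;
    final double weight;
    final boolean active; // true if the queue has at least one running app
    double fairShare;
    Queue(String name, double weight, boolean active) {
      this.name = name;
      this.weight = weight;
      this.active = active;
    }
  }

  /** Splits parentShare over the active children, in proportion to weight. */
  static void compute(List<Queue> children, double parentShare) {
    double activeWeight = 0;
    for (Queue q : children) {
      if (q.active) {
        activeWeight += q.weight;
      }
    }
    if (activeWeight == 0) {
      return; // no active children, nothing to distribute
    }
    for (Queue q : children) {
      q.fairShare = q.active ? parentShare * q.weight / activeWeight : 0;
    }
  }

  public static void main(String[] args) {
    // childQ1 is the only active child of an 80%-fair-share parent.
    Queue childQ1 = new Queue("childQ1", 1, true);
    Queue childQ2 = new Queue("childQ2", 1, false);
    compute(Arrays.asList(childQ1, childQ2), 0.80);
    // childQ1 now holds the full 80%, so its 50%-of-fair-share preemption
    // threshold becomes 40% of the cluster rather than the tiny 4% above.
    System.out.println(childQ1.fairShare); // 0.8
  }
}
{code}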
[jira] [Commented] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091583#comment-14091583 ] Hadoop QA commented on YARN-2249: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660744/YARN-2249.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4567//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4567//console This message is automatically generated. RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch On AM resync after an RM restart, the AM sends its outstanding container release requests back to the new RM. In the meantime, the NMs report container statuses back to the RM to recover the containers. If the RM receives a container release request before the container is actually recovered in the scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: YARN-1954.6.patch Thanks for your review, Zhijie. Updated: 1. Removed AMRMClient(Async)Impl#waitFor and put the implementation directly in AMRMClient(Async). 2. Removed IllegalArgumentException from the method declaration. Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch, YARN-1954.6.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to keep the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
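For illustration, a minimal sketch of how an AM might use such a waitFor. The Guava Supplier-based signature is an assumption based on this discussion and may differ from the final patch; the wrapper class is hypothetical:
{code}
import java.util.concurrent.atomic.AtomicBoolean;
import com.google.common.base.Supplier;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

public class AMMainLoop {
  /**
   * Blocks the AM's main non-daemon thread until the callback handler
   * flips 'done' (e.g. after unregistering), replacing the hand-rolled
   * sleep loop described in this issue.
   */
  static void awaitShutdown(AMRMClientAsync<?> client,
      final AtomicBoolean done) throws InterruptedException {
    client.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return done.get(); // waitFor returns once this check passes
      }
    });
  }
}
{code}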
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: (was: YARN-1954.6.patch) Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to keep the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: YARN-1954.7.patch Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch, YARN-1954.7.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to keep the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: (was: YARN-1954.7.patch) Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch, YARN-1954.7.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to keep the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: YARN-1954.7.patch Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch, YARN-1954.7.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to keep the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091608#comment-14091608 ] Hadoop QA commented on YARN-2138: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660749/YARN-2138.003.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4568//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4568//console This message is automatically generated. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.003.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
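For context, one plausible shape of such a cleanup, heavily simplified and with hypothetical interface names (see the attached patch for the real changes): the parameter that every caller sets to null simply disappears.
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Before (simplified): storedException is null at every call site, so both
// the argument and any branch on it in the handler are dead code.
interface BeforeCleanup {
  void notifyDoneStoringApplication(ApplicationId appId, Exception storedException);
}

// After: the unused argument is gone and callers get simpler.
interface AfterCleanup {
  void notifyDoneStoringApplication(ApplicationId appId);
}
{code}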
[jira] [Commented] (YARN-2026) Fair scheduler: Consider only active queues for computing fairshare
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091609#comment-14091609 ] Ashwin Shankar commented on YARN-2026: -- Thanks a lot, [~kasha] and [~sandyr], for reviewing and committing my patch! Fair scheduler: Consider only active queues for computing fairshare --- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Fix For: 2.6.0 Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt, YARN-2026-v5.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios in which a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue that has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all of its children irrespective of whether a child is active or inactive (no apps running). Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and it has demand greater than that. When there are many queues under a parent queue (with a high fair share), each child queue's fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly, and preemption doesn't happen even if other leaf queues (non-siblings) are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues an 8% fair share. Preemption would happen only if a child queue fell below 4% (0.5*8=4). Let's say that at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 gets a big job that requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue that has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only among its active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2 - Note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster: childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers.
We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources, if needed, through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler: Consider only active queues for computing fairshare
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091610#comment-14091610 ] Hudson commented on YARN-2026: -- FAILURE: Integrated in Hadoop-trunk-Commit #6041 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6041/]) YARN-2026. Fair scheduler: Consider only active queues for computing fairshare. (Ashwin Shankar via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616915) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerFairShare.java Fair scheduler: Consider only active queues for computing fairshare --- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Fix For: 2.6.0 Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt, YARN-2026-v4.txt, YARN-2026-v5.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios in which a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue that has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all of its children irrespective of whether a child is active or inactive (no apps running). Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and it has demand greater than that. When there are many queues under a parent queue (with a high fair share), each child queue's fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly, and preemption doesn't happen even if other leaf queues (non-siblings) are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues an 8% fair share. Preemption would happen only if a child queue fell below 4% (0.5*8=4). Let's say that at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct.
Now, let's say root.HighPriorityQueue.childQ1 gets a big job that requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue that has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only among its active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2 - Note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster: childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources, if needed, through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2399) FairScheduler: Merge AppSchedulable and FSSchedulerApp into FSAppAttempt
[ https://issues.apache.org/jira/browse/YARN-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091612#comment-14091612 ] Hadoop QA commented on YARN-2399: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660780/yarn-2399-1.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4569//console This message is automatically generated. FairScheduler: Merge AppSchedulable and FSSchedulerApp into FSAppAttempt Key: YARN-2399 URL: https://issues.apache.org/jira/browse/YARN-2399 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.5.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2399-1.patch FairScheduler keeps two data structures for each application, which makes the code hard to follow. We should merge them into one for better long-term maintainability. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091621#comment-14091621 ] Hadoop QA commented on YARN-1954: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660786/YARN-1954.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4570//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4570//console This message is automatically generated. Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch, YARN-1954.7.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to keep the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091658#comment-14091658 ] Wangda Tan commented on YARN-2249: -- Jian, thanks for the update. One last comment: could you rename {{mutex}} to {{pendingReleaseMutex}} or something similar? Wangda RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch On AM resync after an RM restart, the AM sends its outstanding container release requests back to the new RM. In the meantime, the NMs report container statuses back to the RM to recover the containers. If the RM receives a container release request before the container is actually recovered in the scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
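For illustration, a sketch of the pattern under discussion, with the lock named for what it guards as suggested above. The surrounding class and method names are hypothetical, not the patch's code: a release request that arrives before its container is recovered is parked and replayed once the NM reports the container.
{code}
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class PendingReleaseTracker {
  // Dedicated lock object, named for the state it guards.
  private final Object pendingReleaseMutex = new Object();
  private final Set<ContainerId> pendingRelease = new HashSet<ContainerId>();

  /** Called when the AM asks to release a container on resync. */
  void releaseOrDefer(ContainerId id, boolean recovered) {
    synchronized (pendingReleaseMutex) {
      if (recovered) {
        doRelease(id);
      } else {
        pendingRelease.add(id); // replay once the NM reports the container
      }
    }
  }

  /** Called when an NM report recovers a container in the scheduler. */
  void onContainerRecovered(ContainerId id) {
    synchronized (pendingReleaseMutex) {
      if (pendingRelease.remove(id)) {
        doRelease(id);
      }
    }
  }

  private void doRelease(ContainerId id) {
    // kill the container and release its resources in the scheduler
  }
}
{code}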