[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276275#comment-14276275 ] Yi Liu commented on YARN-3055:
--
Is it possible that the launcher job finishes first while sub-jobs are still running? If so, the issue exists; if not, the issue is invalid.

The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
--
Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch

After YARN-2964, there is only one timer to renew a token that is shared by jobs. In {{removeApplicationFromRenewal}}, when we go to remove a token that is still shared by other jobs, we do not cancel the token. Likewise, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}}. Otherwise, the existing submitted applications that share this token will no longer get it renewed, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. Consider the following scenario:
*1).* app1 is submitted first, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted.
*2).* app1 finishes, so the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
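For illustration, here is a minimal, self-contained sketch of the reference-counting idea implied above; it is not the actual DelegationTokenRenewer code or the attached patch, and all class and field names are invented for the example. The renewal TimerTask of a shared token is cancelled only when the last application using it is removed.
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;

// Sketch only: reference-count the apps sharing a token so the renewal
// TimerTask is cancelled only when the last app releases the token.
class SharedTokenRenewalSketch {
  static class TokenState {
    TimerTask renewalTask;
    int refCount; // number of running apps still using this token
  }

  private final Map<String, TokenState> allTokens = new HashMap<>();
  private final Timer renewalTimer = new Timer(true);

  synchronized void addApplication(String tokenId, long renewIntervalMs) {
    TokenState state = allTokens.get(tokenId);
    if (state == null) {
      state = new TokenState();
      state.renewalTask = new TimerTask() {
        @Override public void run() { /* renew the token here */ }
      };
      renewalTimer.schedule(state.renewalTask, renewIntervalMs, renewIntervalMs);
      allTokens.put(tokenId, state);
    }
    state.refCount++; // app2/app3 reuse the existing timer, no new schedule
  }

  synchronized void removeApplicationFromRenewal(String tokenId) {
    TokenState state = allTokens.get(tokenId);
    if (state == null) {
      return;
    }
    if (--state.refCount == 0) {
      // Only the last app cancels the timer and drops the token.
      state.renewalTask.cancel();
      allTokens.remove(tokenId);
    }
    // Otherwise keep both the TimerTask and the allTokens entry, so the
    // remaining apps (app2, app3 in the scenario above) keep getting renewals.
  }
}
{code}
With this bookkeeping, app1 finishing only decrements the count, so token1 keeps being renewed for app2 and app3.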
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276287#comment-14276287 ] Jian He commented on YARN-2637:
---
lgtm too, thanks [~cwelch] and [~leftnoteasy]!

maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.38.patch, YARN-2637.39.patch, YARN-2637.40.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch

Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId()
        + " from user: " + application.getUser()
        + " activated in queue: " + getQueueName());
  }
}
{code}
An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200. If a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only max_am_resource_percent of the queue.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
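To make the arithmetic in the example concrete, here is a small runnable sketch, illustrative only and not the committed patch, that gates activation on the sum of actual AM resources instead of an app count derived from minimum_allocation; the class and field names are invented for the example.
{code}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// Sketch: activate pending apps while the *sum of actual AM resources* stays
// under the queue's AM limit, so 5M AMs cannot blow past a 200M AM limit.
public class AmLimitSketch {
  static class App { final long amResourceMb; App(long mb) { amResourceMb = mb; } }

  public static void main(String[] args) {
    long queueCapacityMb = 1024;        // the 1G queue from the example above
    double maxAmResourcePercent = 0.2;  // 20% of the queue for all AMs
    long maxAmMb = (long) (queueCapacityMb * maxAmResourcePercent); // 204M

    Queue<App> pending = new ArrayDeque<>();
    for (int i = 0; i < 200; i++) pending.add(new App(5)); // 200 x 5M AMs
    List<App> active = new ArrayList<>();

    long usedAmMb = 0;
    for (Iterator<App> it = pending.iterator(); it.hasNext();) {
      App app = it.next();
      if (usedAmMb + app.amResourceMb > maxAmMb) {
        break; // AM headroom exhausted
      }
      usedAmMb += app.amResourceMb;
      active.add(app);
      it.remove();
    }
    // Activates 40 apps (40 * 5M = 200M of AMs) instead of all 200.
    System.out.println("activated=" + active.size() + " usedAmMb=" + usedAmMb);
  }
}
{code}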
[jira] [Commented] (YARN-3024) LocalizerRunner should give DIE action when all resources are localized
[ https://issues.apache.org/jira/browse/YARN-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276301#comment-14276301 ] Xuan Gong commented on YARN-3024:
-
[~chengbing.liu] Thanks for working on this ticket. I am starting to look at the patch. Overall looks good, but:
* On the latest patch, it looks like you changed the logic for
{code}
case FETCH_PENDING:
  break;
{code}
Originally, we would directly return the response with LocalizerAction.LIVE. But now we have to do:
{code}
LocalResource next = findNextResource();
if (next != null) {
  try {
    ResourceLocalizationSpec resource =
        NodeManagerBuilderUtils.newResourceLocalizationSpec(next,
            getPathForLocalization(next));
    rsrcs.add(resource);
  } catch (IOException e) {
    LOG.error("local path for PRIVATE localization could not be "
        + "found. Disks might have failed.", e);
  } catch (URISyntaxException e) {
    //TODO fail? Already translated several times...
  }
} else if (pending.isEmpty()) {
  // TODO: Synchronization
  action = LocalizerAction.DIE;
}
response.setLocalizerAction(action);
response.setResourceSpecs(rsrcs);
return response;
{code}
* Could you fix this format:
{code}
+    if (action == LocalizerAction.DIE) {
+      response.setLocalizerAction(action);
+      return response;
+    }
{code}

LocalizerRunner should give DIE action when all resources are localized
---
Key: YARN-3024 URL: https://issues.apache.org/jira/browse/YARN-3024 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3024.01.patch, YARN-3024.02.patch, YARN-3024.03.patch

We have observed that {{LocalizerRunner}} always gives a LIVE action at the end of the localization process. The problem is that {{findNextResource()}} can return null even when {{pending}} was not empty prior to the call. This method removes localized resources from {{pending}}, therefore we should check the return value and give the DIE action when it returns null.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3033) implement NM starting the ATS writer companion
[ https://issues.apache.org/jira/browse/YARN-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276306#comment-14276306 ] Vinod Kumar Vavilapalli commented on YARN-3033:
---
Thanks for filing this, [~sjlee0]! We should try to fit this together with YARN-2141 so that we have one source of the cluster stats.

implement NM starting the ATS writer companion
--
Key: YARN-3033 URL: https://issues.apache.org/jira/browse/YARN-3033 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R

Per design in YARN-2928, implement node managers starting the ATS writer companion.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276303#comment-14276303 ] Vinod Kumar Vavilapalli commented on YARN-2928:
---
Just created origin/YARN-2928 based on origin/branch-2. Let's try keeping it up to date at a pace that suits the branch.
[~Naganarasimha], [~varun_saxena], I see you are willing to help with this feature work, tx! We will have to coordinate a little on how we all move together on this. This may involve some readjustments on the order of the tasks and the assignees, please bear with me. Thanks a bunch again!

Application Timeline Server (ATS) next gen: phase 1
---
Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf

We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2217) Shared cache client side changes
[ https://issues.apache.org/jira/browse/YARN-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2217: --- Attachment: YARN-2217-trunk-v7.patch [~kasha] V7 attached. Added error test cases and coverage around checksum method. Shared cache client side changes Key: YARN-2217 URL: https://issues.apache.org/jira/browse/YARN-2217 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2217-trunk-v1.patch, YARN-2217-trunk-v2.patch, YARN-2217-trunk-v3.patch, YARN-2217-trunk-v4.patch, YARN-2217-trunk-v5.patch, YARN-2217-trunk-v6.patch, YARN-2217-trunk-v7.patch Implement the client side changes for the shared cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276379#comment-14276379 ] Zhijie Shen commented on YARN-2928:
---
Sangjin, some quick thoughts about the second point. Currently, ATS work-preserving restart only involves recovery of the token information, and only in the secured scenario, thanks to ATS's almost stateless nature (YARN-2837). Going forward, depending on how the writer is implemented, we may want to preserve the outstanding timeline data that has been received by the ATS companion but is not yet persisted into the storage backend. IAC, it seems to be a common requirement no matter whether the companion is per-node (e.g., restarting) or per-app (e.g., crashing).

Application Timeline Server (ATS) next gen: phase 1
---
Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf

We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276392#comment-14276392 ] Hadoop QA commented on YARN-2984:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692116/yarn-2984-2.patch against trunk revision c53420f.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainerMetrics
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6326//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6326//console
This message is automatically generated.

Metrics for container's actual memory usage
---
Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-1.patch, yarn-2984-2.patch, yarn-2984-prelim.patch

It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3019) Make work-preserving-recovery the default mechanism for RM recovery
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276394#comment-14276394 ] Hudson commented on YARN-3019:
--
FAILURE: Integrated in Hadoop-trunk-Commit #6857 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6857/])
YARN-3019. Make work-preserving-recovery the default mechanism for RM recovery. (Contributed by Jian He) (junping_du: rev f92e5038000a012229c304bc6e5281411eff2883)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java

Make work-preserving-recovery the default mechanism for RM recovery
---
Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-3019.1.patch

The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default to flip the recovery mode from non-work-preserving to work-preserving recovery.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
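For reference, a tiny sketch of setting the flag explicitly through the Hadoop Configuration API; the property name is the one quoted in the proposal above, and in a real cluster it would normally be set in yarn-site.xml rather than in code.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: the key this JIRA flips to true by default; clusters that want the
// old non-work-preserving behavior can still set it to false explicitly.
public class WorkPreservingRecoveryConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);
    System.out.println(conf.getBoolean(
        "yarn.resourcemanager.work-preserving-recovery.enabled", false));
  }
}
{code}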
[jira] [Commented] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276476#comment-14276476 ] Hadoop QA commented on YARN-2984:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692147/yarn-2984-3.patch against trunk revision f92e503.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6329//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6329//console
This message is automatically generated.

Metrics for container's actual memory usage
---
Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-1.patch, yarn-2984-2.patch, yarn-2984-3.patch, yarn-2984-prelim.patch

It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276490#comment-14276490 ] Sangjin Lee commented on YARN-2928:
---
bq. One additional issue for developing the new feature. We may either create a new sub-module or reuse the current one (applicationhistoryservice), but put it into a blah.blah.v2 package.
My vote is to start from a clean slate with a new source project (e.g. applicationtimelineservice or some other distinct name) and new packages. There is a cost of having to copy source into the new project, but it might not be so bad. That way, it can start clean and small and doesn't have to carry code that is not relevant. Also, it won't be affected by rebasing. What do you think?

Application Timeline Server (ATS) next gen: phase 1
---
Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf

We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2217) Shared cache client side changes
[ https://issues.apache.org/jira/browse/YARN-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276417#comment-14276417 ] Hadoop QA commented on YARN-2217:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692107/YARN-2217-trunk-v7.patch against trunk revision f92e503.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:
org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
org.apache.hadoop.yarn.client.api.impl.TestSharedCacheClientImpl
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6328//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6328//console
This message is automatically generated.

Shared cache client side changes
Key: YARN-2217 URL: https://issues.apache.org/jira/browse/YARN-2217 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2217-trunk-v1.patch, YARN-2217-trunk-v2.patch, YARN-2217-trunk-v3.patch, YARN-2217-trunk-v4.patch, YARN-2217-trunk-v5.patch, YARN-2217-trunk-v6.patch, YARN-2217-trunk-v7.patch

Implement the client side changes for the shared cache.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3058) Fix error msg of tokens activation delay configuration
[ https://issues.apache.org/jira/browse/YARN-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276437#comment-14276437 ] Hadoop QA commented on YARN-3058:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692120/YARN-3058.001.patch against trunk revision c53420f.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6327//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6327//console
This message is automatically generated.

Fix error msg of tokens activation delay configuration
--
Key: YARN-3058 URL: https://issues.apache.org/jira/browse/YARN-3058 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Attachments: YARN-3058.001.patch

{code}
this.rollingInterval = conf.getLong(
    YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS,
    YarnConfiguration.DEFAULT_RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS) * 1000;
...
this.activationDelay =
    (long) (conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS) * 1.5);
...
if (rollingInterval <= activationDelay * 2) {
  throw new IllegalArgumentException(
      YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
          + " should be more than 2 X "
          + YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS);
}
{code}
The error msg should be:
{code}
YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
    + " should be more than 3 X "
    + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS);
{code}
Also, it's {{3 X}} instead of {{2 X}}, since the expiry interval is multiplied by *1.5*. There are a few other places with the same issue.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
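A worked sketch of the corrected check, with illustrative local constants standing in for the YarnConfiguration keys and made-up example values: since activationDelay = 1.5 * the expiry interval, requiring rollingInterval > 2 * activationDelay is the same as requiring it to exceed 3 * the expiry interval, hence {{3 X}}, and the second key named in the message should be RM_NM_EXPIRY_INTERVAL_MS.
{code}
// Sketch only; ROLLING_KEY/EXPIRY_KEY stand in for the YarnConfiguration
// constants quoted above, and the interval values are invented.
public class TokenRollingCheckSketch {
  static final String ROLLING_KEY =
      "RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS";
  static final String EXPIRY_KEY = "RM_NM_EXPIRY_INTERVAL_MS";

  public static void main(String[] args) {
    long expiryIntervalMs = 600_000L;                       // example value
    long activationDelay = (long) (expiryIntervalMs * 1.5); // 900,000 ms
    long rollingInterval = 1_700_000L;                      // example value, ms

    // activationDelay * 2 == 3 * expiryIntervalMs, so the threshold is
    // really "3 X" the expiry interval. This example value fails the check:
    // 1,700,000 <= 1,800,000, so the exception is thrown.
    if (rollingInterval <= activationDelay * 2) {
      throw new IllegalArgumentException(
          ROLLING_KEY + " should be more than 3 X " + EXPIRY_KEY);
    }
    System.out.println("rolling interval OK: " + rollingInterval + " ms");
  }
}
{code}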
[jira] [Commented] (YARN-3024) LocalizerRunner should give DIE action when all resources are localized
[ https://issues.apache.org/jira/browse/YARN-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276470#comment-14276470 ] Chengbing Liu commented on YARN-3024:
-
[~xgong] Thanks for reviewing.
{quote}
on the latest patch, looks like you change the logic for
{quote}
The logic of giving out resources to be localized has indeed changed. Previously, {{LocalizerRunner}} did not give the next resource to {{ContainerLocalizer}} until the previous one had been downloaded. With this patch, {{LocalizerRunner}} does not wait for the previous resource to be downloaded. {{ContainerLocalizer}} can handle that by submitting the download tasks to its CompletionService, which queues those tasks before executing them. The download thread pool of the CompletionService remains a single-thread executor. Therefore, it is possible that {{ContainerLocalizer}} sends multiple {{LocalResourceStatus}}es to {{LocalizerRunner}} through one heartbeat. In this case, I think we should try to find the next resources to be localized even when getting FETCH_PENDING.
I have tested it on a real cluster. I specified a large archive which should take a long time to be localized. The result shows the resources were localized serially, and one heartbeat contained multiple statuses of small files (thus reducing the number of heartbeats).
{quote}
Could you fix this format
{quote}
My bad, I will fix this.

LocalizerRunner should give DIE action when all resources are localized
---
Key: YARN-3024 URL: https://issues.apache.org/jira/browse/YARN-3024 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3024.01.patch, YARN-3024.02.patch, YARN-3024.03.patch

We have observed that {{LocalizerRunner}} always gives a LIVE action at the end of the localization process. The problem is that {{findNextResource()}} can return null even when {{pending}} was not empty prior to the call. This method removes localized resources from {{pending}}, therefore we should check the return value and give the DIE action when it returns null.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
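The queuing behavior described above can be sketched with plain java.util.concurrent types; this is a generic illustration, not the NM code, and the file names are invented. A single-thread CompletionService accepts many download tasks up front yet still executes them serially, so completions can be collected in batches.
{code}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: tasks are queued immediately (no waiting between submissions),
// but the single-thread executor still downloads them one at a time.
public class SerialDownloadQueueSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CompletionService<Path> downloads = new ExecutorCompletionService<>(pool);

    for (int i = 0; i < 3; i++) {            // hand out all pending resources now
      final int n = i;
      downloads.submit(new Callable<Path>() {
        @Override public Path call() throws Exception {
          Thread.sleep(100);                 // stand-in for the actual fetch
          return Paths.get("/tmp/resource-" + n);
        }
      });
    }
    for (int i = 0; i < 3; i++) {            // completions arrive one by one;
      Future<Path> done = downloads.take();  // several may be reported in a
      System.out.println("localized " + done.get()); // single heartbeat
    }
    pool.shutdown();
  }
}
{code}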
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276438#comment-14276438 ] Yongjun Zhang commented on YARN-3021:
-
Thanks a lot [~adhoot] and [~vinodkv]!
{quote}
Having said that, if RM cannot validate the token as valid why would the job itself work? Would not the containers themselves face the same issue using the tokens?
{quote}
Based on the scenario [~qwertymaniac] described in the jira description, the token is from realm B, which cannot be validated by realm A's YARN since A and B don't trust each other. However, the token can be used by the distcp job running in realm A to access B's files (B is the distcp source).
For the scenario described in the jira, I think we are aligned that it would be better to add an additional parameter at job submission time, so the client can tell YARN explicitly that YARN should not try to renew the token. What I wanted to clarify with my earlier question was: if we support this scenario by having YARN not validate the token, do we open any security hole? Anyone could submit a job and ask YARN not to renew the token, right? Thanks.

YARN's delegation-token handling disallows certain trust setups to operate properly
---
Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Attachments: YARN-3021.patch

Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one-way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails because realm B will not trust A's credentials (here, the RM's principal is the renewer).
In the 1.x JobTracker the same call is present, but it is done asynchronously, and once the renewal attempt failed we simply ceased to schedule any further renewal attempts, rather than failing the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble an error back to the client, failing the app submission. This way the old behaviour is retained.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2984: --- Attachment: yarn-2984-3.patch Looks like a timing issue with the test, increased the test timer to fire every 100 ms instead of 50. Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-1.patch, yarn-2984-2.patch, yarn-2984-3.patch, yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2217) Shared cache client side changes
[ https://issues.apache.org/jira/browse/YARN-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276472#comment-14276472 ] Karthik Kambatla commented on YARN-2217:
-
Is it a classpath issue?

Shared cache client side changes
Key: YARN-2217 URL: https://issues.apache.org/jira/browse/YARN-2217 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2217-trunk-v1.patch, YARN-2217-trunk-v2.patch, YARN-2217-trunk-v3.patch, YARN-2217-trunk-v4.patch, YARN-2217-trunk-v5.patch, YARN-2217-trunk-v6.patch, YARN-2217-trunk-v7.patch

Implement the client side changes for the shared cache.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2861) Timeline DT secret manager should not reuse the RM's configs.
[ https://issues.apache.org/jira/browse/YARN-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2861:
--
Attachment: YARN-2861.2.patch
Thanks for the review, Jian! I updated the patch.

Timeline DT secret manager should not reuse the RM's configs.
-
Key: YARN-2861 URL: https://issues.apache.org/jira/browse/YARN-2861 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2861.1.patch, YARN-2861.2.patch

These are the configs for the RM DT secret manager. We should create separate ones for the timeline DT only.
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  long secretKeyInterval =
      conf.getLong(YarnConfiguration.DELEGATION_KEY_UPDATE_INTERVAL_KEY,
          YarnConfiguration.DELEGATION_KEY_UPDATE_INTERVAL_DEFAULT);
  long tokenMaxLifetime =
      conf.getLong(YarnConfiguration.DELEGATION_TOKEN_MAX_LIFETIME_KEY,
          YarnConfiguration.DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT);
  long tokenRenewInterval =
      conf.getLong(YarnConfiguration.DELEGATION_TOKEN_RENEW_INTERVAL_KEY,
          YarnConfiguration.DELEGATION_TOKEN_RENEW_INTERVAL_DEFAULT);
  secretManager = new TimelineDelegationTokenSecretManager(secretKeyInterval,
      tokenMaxLifetime, tokenRenewInterval, 360);
  secretManager.startThreads();
  serviceAddr = TimelineUtils.getTimelineTokenServiceAddress(getConfig());
  super.init(conf);
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
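As an illustration of the direction only, dedicated timeline-service keys would let the timeline DT secret manager stop reading the RM DT configs. The key names and defaults below are placeholders invented for this sketch, not necessarily the ones the patch introduces.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: timeline-specific config keys with their own defaults.
// All three key names are assumed/illustrative, not confirmed patch names.
public class TimelineDtConfigSketch {
  static final String KEY_UPDATE_INTERVAL =
      "yarn.timeline-service.delegation.key-update-interval";  // assumed name
  static final String TOKEN_MAX_LIFETIME =
      "yarn.timeline-service.delegation.token.max-lifetime";   // assumed name
  static final String TOKEN_RENEW_INTERVAL =
      "yarn.timeline-service.delegation.token.renew-interval"; // assumed name

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long keyUpdateInterval = conf.getLong(KEY_UPDATE_INTERVAL, 24 * 60 * 60 * 1000L);
    long tokenMaxLifetime = conf.getLong(TOKEN_MAX_LIFETIME, 7 * 24 * 60 * 60 * 1000L);
    long tokenRenewInterval = conf.getLong(TOKEN_RENEW_INTERVAL, 24 * 60 * 60 * 1000L);
    System.out.printf("update=%d maxLife=%d renew=%d%n",
        keyUpdateInterval, tokenMaxLifetime, tokenRenewInterval);
  }
}
{code}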
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276599#comment-14276599 ] Chen He commented on YARN-1680:
---
To address the twice-blacklisted issue (a node is blacklisted by the app and later by the cluster), I propose two steps (see the sketch after this comment):
1. Every time an app asks for blacklist additions, we check whether the nodes in the additions are in the cluster blacklist (O(m), where m is the number of nodes in the blacklist additions). If so, we remove those nodes from the additions.
2. It is possible that an app unblacklists a node (puts it in the blacklist removals) while the cluster still blacklists it. In this situation, clusterResource does not contain this node's resources, so we need to remove this node from the app's blacklist removal set in the headroom calculation.

availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
--
Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch

There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster slow start is set to 1. A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) becomes unstable (3 maps got killed), so the MRAppMaster blacklists the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster.
The MRAppMaster does not preempt the reducers because, for the reducer preemption calculation, the headroom includes the blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResource that considers the whole cluster's free memory).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
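The two proposed steps can be sketched with plain sets; these are stand-in types and node names, not the RM code.
{code}
import java.util.HashSet;
import java.util.Set;

// Sketch of the two steps proposed above:
// 1) drop from the app's blacklist *additions* any node already in the
//    cluster blacklist (O(m) over the additions);
// 2) drop from the app's blacklist *removals* any node the cluster still
//    blacklists, since clusterResource already excludes that node.
public class BlacklistReconcilerSketch {
  static void reconcile(Set<String> blacklistAdditions,
                        Set<String> blacklistRemovals,
                        Set<String> clusterBlacklist) {
    blacklistAdditions.removeAll(clusterBlacklist); // step 1
    blacklistRemovals.removeAll(clusterBlacklist);  // step 2
  }

  public static void main(String[] args) {
    Set<String> additions = new HashSet<>(Set.of("nm4", "nm2"));
    Set<String> removals = new HashSet<>(Set.of("nm4"));
    Set<String> clusterBlacklist = new HashSet<>(Set.of("nm4"));
    reconcile(additions, removals, clusterBlacklist);
    // nm4 dropped from both sets: the headroom calculation will not
    // double-count its resources.
    System.out.println("additions=" + additions + " removals=" + removals);
  }
}
{code}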
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275700#comment-14275700 ] Craig Welch commented on YARN-2637:
---
Regarding the findbugs report for LeafQueue.lastClusterResource - access to lastClusterResource appears to be synchronized everywhere except getAbsActualCapacity, which I don't actually see being used anywhere - I'm going to add a findbugs exception and a comment on the method so that if it is used in the future synchronization can be addressed.
-re [~leftnoteasy]'s latest:
-re 1 - actually, user limits are based on absolute queue capacity rather than max capacity - this is apparently intentional because, although a queue can exceed its absolute capacity, an individual user is not supposed to, hence my basing the user amlimit on the absolute capacity. The approach I use fits with the original logic in CSQueueUtils, which allows a user the greater of the userlimit share of the absolute capacity or 1/# active users (so if there are fewer users active than would reach the userlimit, they can use the full queue absolute capacity), the only correction being that we are using the actual value of resources used by application masters instead of one based on minalloc.
-re 2 - actually, the snippet provided is not quite correct; some schedulers provide a cpu value as well. In any case, for encapsulation reasons it's better to use the scheduler's value in case its means of determining this changes in the future.
-re 3 - I can't see this making the slightest difference in understandability - since these tests' paths don't populate the rmapps, I would simply be individually putting mocked ones into the map instead of the single mock + matcher for all the apps. The way it is seems clearer to me, as all of the mocking is together instead of distributing the (mock activity, if not mock framework...) process of putting mock rmapps into the collection throughout the test.
-re 4 - interesting, those were already there, but I also couldn't see why. Test passes fine without them, so I removed them.
-re 5 - removed. Uploading updated patch in a few

maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.38.patch, YARN-2637.39.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch

Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId()
        + " from user: " + application.getUser()
        + " activated in queue: " + getQueueName());
  }
}
{code}
An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200. If a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only max_am_resource_percent of the queue.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2637:
--
Attachment: YARN-2637.40.patch

maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.38.patch, YARN-2637.39.patch, YARN-2637.40.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch

Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId()
        + " from user: " + application.getUser()
        + " activated in queue: " + getQueueName());
  }
}
{code}
An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200. If a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only max_am_resource_percent of the queue.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933:
Attachment: YARN-2933-7.patch
Thanks [~wangda], [~jianhe] and [~sunilg] for the reviews. Updated the patch.
Thanks, Mayank

Capacity Scheduler preemption policy should only consider capacity without labels temporarily
-
Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch

Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have a preemption policy to support that. YARN-2498 targets preemption that respects node labels, but we have some gaps in the code base; for example, queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially require refactoring CS, which we need to spend some time thinking about carefully.
For now, what we can do immediately is calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid a regression like the following: a cluster has some nodes with labels and some without; assume queueA isn't satisfied for resources without labels, but for now the preemption policy may preempt resources from nodes with labels for queueA, which is not correct.
Again, this is just a short-term enhancement; YARN-2498 will consider preemption respecting node labels for the Capacity Scheduler, which is our final target.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
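A toy sketch of the short-term rule described above, using plain numbers rather than the preemption-policy code: the policy's universe becomes the total cluster resource minus everything on labeled nodes, so containers on labeled nodes can never be preemption targets.
{code}
// Sketch only; the numbers are invented for illustration.
public class LabelFreePreemptionSketch {
  public static void main(String[] args) {
    long totalClusterMb = 100_000;  // all nodes
    long labeledNodesMb = 30_000;   // nodes carrying any node-label
    long preemptableMb = totalClusterMb - labeledNodesMb;

    // ideal_allocation and preemption run only over preemptableMb, so an
    // unsatisfied queueA never triggers preemption on labeled nodes.
    System.out.println("capacity considered for preemption: "
        + preemptableMb + " MB");
  }
}
{code}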
[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers
[ https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275825#comment-14275825 ] Peter D Kirchner commented on YARN-3020:
-
I investigated the rates in the third paragraph of my comment immediately above, and found that an application is able to make addContainerRequest()s much faster than this. Bear in mind that the elapsed time for making the client-api call to addContainerRequest() is not a measurement of the performance impact of the reported over-requests sent to the server and the resulting over-allocation of containers. It turns out my application has some extrinsic delay in issuing addContainerRequest()s, which predominated in limiting the rate I measured and reported in the third paragraph of the comment immediately above.
To follow up, I measured addContainerRequest() timing with System.nanoTime(). The first call to addContainerRequest() takes around 5 milliseconds. The rest take around half a millisecond on average. Here are some statistics for calling addContainerRequest(): microseconds average=433 count=914 max=11202 min=223. I measure similar times for consecutive calls (without additional application delays in between addContainerRequest()s).
When the over-request bug is fixed, I will still think it tedious to call 1000x for 1000 identical containers, but many applications can probably afford the half second to do so. Arguably, the bug exists in part because of the tediousness of the bookkeeping on the yarn-client-api side for these requests. In the process of bug-fixing or cleanup, a change that re-introduces an integer quantity with the request would be welcome.

n similar addContainerRequest()s produce n*(n+1)/2 containers
-
Key: YARN-3020 URL: https://issues.apache.org/jira/browse/YARN-3020 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Peter D Kirchner Original Estimate: 24h Remaining Estimate: 24h

BUG: If the application master calls addContainerRequest() n times, but with the same priority, I get up to 1+2+3+...+n containers = n*(n+1)/2. The most containers are requested when the interval between calls to addContainerRequest() exceeds the heartbeat interval of calls to allocate() (in AMRMClientImpl's run() method). If the application master calls addContainerRequest() n times, but with a unique priority each time, I get n containers (as I intended).
Analysis: There is a logic problem in AMRMClientImpl.java. Although AMRMClientImpl's allocate() does an ask.clear(), on subsequent calls to addContainerRequest(), addResourceRequest() finds the previous matching remoteRequest and increments the container count rather than starting anew, and does an addResourceRequestToAsk(), which defeats the ask.clear(). From the documentation and code comments, it was hard for me to discern the intended behavior of the API, but the inconsistency reported in this issue suggests one case or the other is implemented incorrectly.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
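The n*(n+1)/2 figure can be modeled directly; this is pure arithmetic, not AMRMClient code. If a heartbeat fires after each addContainerRequest(), the i-th heartbeat re-sends the accumulated count i, so up to 1+2+...+n containers are allocated in total.
{code}
// Worked model of the reported over-request: each heartbeat i asks for i
// containers at the same priority, and the RM fulfills each ask.
public class OverRequestModel {
  public static void main(String[] args) {
    int n = 1000;
    long allocated = 0;
    for (int i = 1; i <= n; i++) {
      allocated += i; // heartbeat i re-sends the accumulated count i
    }
    System.out.println(allocated);              // 500500
    System.out.println((long) n * (n + 1) / 2); // same closed form
  }
}
{code}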
[jira] [Commented] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275686#comment-14275686 ] Eric Payne commented on YARN-2932:
--
[~leftnoteasy], thanks very much for your review and comments:
bq. 1. Rename {{isQueuePreemptable}} to {{getQueuePreemptable}} for getter/setter consistency in {{CapacitySchedulerConfiguration}}
Renamed.
bq. 2. Should consider queue reinitialize when queue preemptable in configuration changes (See {{TestQueueParsing}}). And it's best to add a test for verify that.
I'm sorry, I don't understand what you mean by the use of the word "consider". Calling {{CapacityScheduler.reinitialize}} will follow the queue hierarchy down and eventually call {{AbstractCSQueue#setupQueueConfigs}} for every queue, so I don't think there is any additional code needed, unless I'm missing something. Were you just saying that I need to add a test case for that?
{quote}
3. It's better to remove the {{defaultVal}} parameter in {{CapacitySchedulerConfiguration.isPreemptable}}:
{code}
public boolean isQueuePreemptable(String queue, boolean defaultVal)
{code}
And the default value should be placed in {{CapacitySchedulerConfiguration}}, like other queue configuration options. I understand what you are trying to do is move some logic from the queue to {{CapacitySchedulerConfiguration}}, but I still think it's better to keep {{CapacitySchedulerConfiguration}} simply getting some values from the configuration file.
{quote}
The problem is that without the {{defaultVal}} parameter, {{AbstractCSQueue#isQueuePathHierarchyPreemptable}} can't tell whether the queue has explicitly set its preemptability or is just returning the default. For example:
{code}
root: disable_preemption = true
root.A: disable_preemption (the property is not set)
root.B: disable_preemption = false (the property is explicitly set to false)
{code}
Let's say the {{getQueuePreemptable}} interface is changed to remove the {{defaultVal}} parameter, and that when {{getQueuePreemptable}} calls {{getBoolean}}, it uses {{false}} as the default.
# {{getQueuePreemptable}} calls {{getBoolean}} on {{root}}
## {{getBoolean}} returns {{true}} because the {{disable_preemption}} property is set to {{true}}
## {{getQueuePreemptable}} inverts {{true}} and returns {{false}} (that is, {{root}} has preemption disabled, so it is not preemptable)
# {{getQueuePreemptable}} calls {{getBoolean}} on {{root.A}}
## {{getBoolean}} returns {{false}} because there is no {{disable_preemption}} property set for this queue, so {{getBoolean}} returns the default
## {{getQueuePreemptable}} inverts {{false}} and returns {{true}}
# {{getQueuePreemptable}} calls {{getBoolean}} on {{root.B}}
## {{getBoolean}} returns {{false}} because the {{disable_preemption}} property is set to {{false}} for this queue
## {{getQueuePreemptable}} inverts {{false}} and returns {{true}}
At this point, {{isQueuePathHierarchyPreemptable}} needs to know whether it should use the default preemption setting from {{root}} or the value from each child queue. In the case of {{root.A}}, the value from {{root}} ({{false}}) should be used because {{root.A}} does not have the property set. In the case of {{root.B}}, the value returned for {{root.B}} ({{true}}) should be used because it is explicitly set. But since {{root.A}} and {{root.B}} both returned {{true}}, {{isQueuePathHierarchyPreemptable}} can't tell the difference. Does that make sense?
Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
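The tri-state distinction argued for in the comment above can be sketched with a nullable Boolean; this is a stand-in config map, not CapacitySchedulerConfiguration. A null value means "not set, inherit from the parent", which a plain getBoolean(key, default) cannot express.
{code}
import java.util.HashMap;
import java.util.Map;

// Sketch: null == property unset (inherit), TRUE/FALSE == explicitly set.
public class PreemptionHierarchySketch {
  static final Map<String, Boolean> disablePreemption = new HashMap<>();

  static boolean isPreemptable(String queuePath, boolean parentPreemptable) {
    Boolean disabled = disablePreemption.get(queuePath); // null means unset
    if (disabled == null) {
      return parentPreemptable; // root.A inherits root's setting
    }
    return !disabled;           // root.B's explicit false wins
  }

  public static void main(String[] args) {
    disablePreemption.put("root", true);    // root: preemption disabled
    /* root.A left unset: inherits */
    disablePreemption.put("root.B", false); // explicitly re-enabled

    boolean root = isPreemptable("root", true);
    System.out.println("root=" + root);                             // false
    System.out.println("root.A=" + isPreemptable("root.A", root));  // false
    System.out.println("root.B=" + isPreemptable("root.B", root));  // true
  }
}
{code}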
[jira] [Assigned] (YARN-3035) create a test-only backing storage implementation for ATS writes
[ https://issues.apache.org/jira/browse/YARN-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee reassigned YARN-3035: - Assignee: Sangjin Lee create a test-only backing storage implementation for ATS writes Key: YARN-3035 URL: https://issues.apache.org/jira/browse/YARN-3035 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Per design in YARN-2928, create a test-only bare bone backing storage implementation for ATS writes. We could consider something like a no-op or in-memory storage strictly for development and testing purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3052) provide a very simple POC html ATS UI
[ https://issues.apache.org/jira/browse/YARN-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee reassigned YARN-3052: - Assignee: Sangjin Lee provide a very simple POC html ATS UI - Key: YARN-3052 URL: https://issues.apache.org/jira/browse/YARN-3052 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee As part of accomplishing a minimum viable product, we want to be able to show some UI in html (however crude it is). This subtask calls for creating a barebones UI to do that. This should be replaced later with a better-designed and implemented proper UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3030) set up ATS writer with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee reassigned YARN-3030: - Assignee: Sangjin Lee set up ATS writer with basic request serving structure and lifecycle Key: YARN-3030 URL: https://issues.apache.org/jira/browse/YARN-3030 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Per design in YARN-2928, create an ATS writer as a service, and implement the basic service structure including the lifecycle management. Also, as part of this JIRA, we should come up with the ATS client API for sending requests to this ATS writer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275797#comment-14275797 ] Anubhav Dhoot commented on YARN-3021:
-
Looking at the patch itself, we seem to suppress an error that would earlier be visible to the user. That's going to make it harder to detect genuine failures. MR1 seems to be worse than YARN in this aspect, and we don't need to make it match that behavior. If we really need to skip validation, as you said, adding a feature into YARN where the application could opt in would be better.
Having said that, if the RM cannot validate the token as valid, why would the job itself work? Would not the containers themselves face the same issue using the tokens?

YARN's delegation-token handling disallows certain trust setups to operate properly
---
Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Attachments: YARN-3021.patch

Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one-way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails because realm B will not trust A's credentials (here, the RM's principal is the renewer).
In the 1.x JobTracker the same call is present, but it is done asynchronously, and once the renewal attempt failed we simply ceased to schedule any further renewal attempts, rather than failing the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble an error back to the client, failing the app submission. This way the old behaviour is retained.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276352#comment-14276352 ] Hudson commented on YARN-2637:
--
FAILURE: Integrated in Hadoop-trunk-Commit #6856 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6856/])
YARN-2637. Fixed max-am-resource-percent calculation in CapacityScheduler when activating applications. Contributed by Craig Welch (jianhe: rev c53420f58364b11fbda1dace7679d45534533382)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerLeafQueueInfo.java
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMActiveServiceContext.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRMRPCNodeUpdates.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/MockRMApp.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestCapacitySchedulerPlanFollower.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java

maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Fix For: 2.7.0 Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch,
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276358#comment-14276358 ] Sangjin Lee commented on YARN-2928: --- Regarding the per-node approach, I do have some questions (and observations) in addition to the aspect of losing the isolation/attribution as already discussed. (1) While it may be faster to allocate with the per-node companions, capacity-wise you would end up spending more with the per-node approach, since these per-node companions are always up even though they may be idle for large amounts of time. So if capacity is a concern, you may lose out. Under what circumstances would per-node companions be more advantageous in terms of capacity? (2) I do have a question about the work-preserving aspect of the per-node ATS companion. One implication of making this a per-node thing (i.e. long-running) is that we need to handle work-preserving restart. What if we need to restart the ATS companion? Since the other YARN daemons (RM and NM) allow for work-preserving restarts, we cannot have the ATS companion break that. So that seems to be a requirement? (3) We still need to handle the lifecycle management aspects of it. Previously we said that when the RM allocates an AM it would tell the NM so the NM could spawn the special container. With the per-node approach, the RM would *still* need to tell the NM so that the NM can talk to the per-node ATS companion to initialize the data structure for the given app. These are quick observations. While I do see value in the per-node approach, it's not totally clear how much work it would save over the per-app approach given these observations. What do you think? Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3058) Fix error msg of tokens activation delay configuration
Yi Liu created YARN-3058: Summary: Fix error msg of tokens activation delay configuration Key: YARN-3058 URL: https://issues.apache.org/jira/browse/YARN-3058 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor
{code}
this.rollingInterval = conf.getLong(
    YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS,
    YarnConfiguration.DEFAULT_RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS) * 1000;
...
this.activationDelay = (long) (conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
    YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS) * 1.5);
...
if (rollingInterval <= activationDelay * 2) {
  throw new IllegalArgumentException(
      YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
          + " should be more than 2 X "
          + YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS);
}
{code}
The error msg should be
{code}
YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
    + " should be more than 3 X "
    + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS);
{code}
Also, it should be {{3 X}} instead of {{2 X}}, since the expiry interval is multiplied by *1.5*. There are a few other places with the same issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
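To make the arithmetic concrete: activationDelay is 1.5 X the NM expiry interval, so requiring rollingInterval > 2 X activationDelay is the same as requiring it to exceed 3 X the expiry interval. A minimal, self-contained sketch of the corrected check follows; the class name, method shape, and message text are illustrative stand-ins, not the actual RM secret-manager code.
{code}
public final class TokenKeyRollingCheck {
  private TokenKeyRollingCheck() {}

  public static void validate(long rollingIntervalMs, long nmExpiryIntervalMs) {
    // activationDelay is 1.5 X the NM expiry interval, so requiring
    // rollingInterval > 2 X activationDelay means > 3 X the expiry interval.
    long activationDelayMs = (long) (nmExpiryIntervalMs * 1.5);
    if (rollingIntervalMs <= activationDelayMs * 2) {
      // Corrected message: names the expiry interval and the 3 X factor.
      throw new IllegalArgumentException(
          "container-token master-key rolling interval should be more than"
              + " 3 X the NM expiry interval");
    }
  }

  public static void main(String[] args) {
    validate(86_400_000L, 600_000L); // 1 day vs 10 min expiry: passes
    validate(1_500_000L, 600_000L);  // 25 min <= 2 x 15 min: throws
  }
}
{code}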
[jira] [Commented] (YARN-2217) Shared cache client side changes
[ https://issues.apache.org/jira/browse/YARN-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276364#comment-14276364 ] Hadoop QA commented on YARN-2217: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692107/YARN-2217-trunk-v7.patch against trunk revision 85aec75. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.client.api.impl.TestSharedCacheClientImpl Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6325//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6325//console This message is automatically generated. Shared cache client side changes Key: YARN-2217 URL: https://issues.apache.org/jira/browse/YARN-2217 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2217-trunk-v1.patch, YARN-2217-trunk-v2.patch, YARN-2217-trunk-v3.patch, YARN-2217-trunk-v4.patch, YARN-2217-trunk-v5.patch, YARN-2217-trunk-v6.patch, YARN-2217-trunk-v7.patch Implement the client side changes for the shared cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3058) Fix error msg of tokens activation delay configuration
[ https://issues.apache.org/jira/browse/YARN-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3058: - Attachment: YARN-3058.001.patch Fix error msg of tokens activation delay configuration -- Key: YARN-3058 URL: https://issues.apache.org/jira/browse/YARN-3058 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Attachments: YARN-3058.001.patch
{code}
this.rollingInterval = conf.getLong(
    YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS,
    YarnConfiguration.DEFAULT_RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS) * 1000;
...
this.activationDelay = (long) (conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
    YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS) * 1.5);
...
if (rollingInterval <= activationDelay * 2) {
  throw new IllegalArgumentException(
      YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
          + " should be more than 2 X "
          + YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS);
}
{code}
The error msg should be
{code}
YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
    + " should be more than 3 X "
    + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS);
{code}
Also, it should be {{3 X}} instead of {{2 X}}, since the expiry interval is multiplied by *1.5*. There are a few other places with the same issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276367#comment-14276367 ] Zhijie Shen commented on YARN-2928: --- Thanks for creating the branch, Vinod! One additional issue for developing the new feature: we may either create a new sub-module or reuse the current one, applicationhistoryservice, but put it into a blah.blah.v2 package. The latter way might make project organization a bit easier, given we reuse the existing TS code. But in that case, taking one step back, we would need to correct this sub-module and package naming first, to prevent further propagating the confusing terminology. Thoughts? Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3059) RM web page can not display NM's health report which is healthy
[ https://issues.apache.org/jira/browse/YARN-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang Hao updated YARN-3059: --- Description: If the NM is healthy, its health report cannot be displayed on the RM web page. In the reportHealthStatus function of NodeHealthMonitorExecutor, I found that if the HealthCheckerExitStatus is successful, the output is set to an empty string in setHealthStatus. So I changed the code {{setHealthStatus(true, "", now)}} to {{setHealthStatus(true, shexec.getOutput(), now)}}; then the RM web page can display the NM's health report. Setting the output to an empty string may reduce the data transferred between the RM and NM, but I think we want to see the NM's health report in some cases. was: If the NM is healthy, its health report cannot be displayed on the RM web page. In the reportHealthStatus function of NodeHealthMonitorExecutor, I found that if the HealthCheckerExitStatus is successful, the output is set to an empty string in setHealthStatus. So I changed the code {{setHealthStatus(true, "", now)}} to {{setHealthStatus(true, shexec.getOutput(), now)}}; then the RM web page can display the NM's health report. Setting the output to an empty string may reduce the data transferred between the RM and NM, but I think we want to see the NM's health report in some cases. RM web page can not display NM's health report which is healthy --- Key: YARN-3059 URL: https://issues.apache.org/jira/browse/YARN-3059 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wang Hao If the NM is healthy, its health report cannot be displayed on the RM web page. In the reportHealthStatus function of NodeHealthMonitorExecutor, I found that if the HealthCheckerExitStatus is successful, the output is set to an empty string in setHealthStatus. So I changed the code {{setHealthStatus(true, "", now)}} to {{setHealthStatus(true, shexec.getOutput(), now)}}; then the RM web page can display the NM's health report. Setting the output to an empty string may reduce the data transferred between the RM and NM, but I think we want to see the NM's health report in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3059) RM web page can not display NM's health report which is healthy
Wang Hao created YARN-3059: -- Summary: RM web page can not display NM's health report which is healthy Key: YARN-3059 URL: https://issues.apache.org/jira/browse/YARN-3059 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wang Hao If the NM is healthy, its health report cannot be displayed on the RM web page. In the reportHealthStatus function of NodeHealthMonitorExecutor, I found that if the HealthCheckerExitStatus is successful, the output is set to an empty string in setHealthStatus. So I changed the code {{setHealthStatus(true, "", now)}} to {{setHealthStatus(true, shexec.getOutput(), now)}}; then the RM web page can display the NM's health report. Setting the output to an empty string may reduce the data transferred between the RM and NM, but I think we want to see the NM's health report in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
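For illustration, a toy sketch of the one-line change being proposed: {{shexec.getOutput()}} is the call the reporter names, while the class below is a simplified stand-in for the NM health-monitor internals, not the real API.
{code}
// Toy sketch of the proposed fix: on a successful health-script run,
// report the script's output instead of an empty string, so the RM web
// page has something to show.
class HealthStatusSketch {
  private boolean healthy;
  private String report = "";
  private long lastReportedTime;

  synchronized void setHealthStatus(boolean healthy, String report, long now) {
    this.healthy = healthy;
    this.report = report;
    this.lastReportedTime = now;
  }

  synchronized String getReport() { return report; }

  public static void main(String[] args) {
    HealthStatusSketch status = new HealthStatusSketch();
    String scriptOutput = "disks ok; load ok"; // stands in for shexec.getOutput()
    long now = System.currentTimeMillis();
    // Before the fix: setHealthStatus(true, "", now) -- empty report in the UI.
    status.setHealthStatus(true, scriptOutput, now);
    System.out.println("health report: " + status.getReport());
  }
}
{code}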
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276280#comment-14276280 ] Jian He commented on YARN-3055: --- bq. Is it possible the launcher job finishes firstly, but sub-jobs are still running? This is an existing issue as discussed in https://issues.apache.org/jira/browse/YARN-2964?focusedCommentId=14252218&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14252218. And a long-term solution is to have a group Id for a group of applications so that the token lifetime is tied to a group of applications instead of a single application. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when we are going to remove a token and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276289#comment-14276289 ] Vinod Kumar Vavilapalli commented on YARN-2928: --- bq. On the process side, I propose we do work on a branch with a goal to borrow whatever code is possible from the current Timeline service. Don't see any concerns on this. I am creating a branch now and will get people participating in this branch to be branch committers if they aren't already committers. Irrespective of that, I think we should simply treat it as RTC on the branch: a JIRA for every task, patches uploaded to JIRA and reviewed/committed by someone else, etc. Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276324#comment-14276324 ] Karthik Kambatla commented on YARN-2965: Looking forward to the diagram. I have been thinking about it, but don't have anything concrete in mind yet. :) Enhance Node Managers to monitor and report the resource usage on machines -- Key: YARN-2965 URL: https://issues.apache.org/jira/browse/YARN-2965 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Robert Grandl Assignee: Robert Grandl Attachments: ddoc_RT.docx This JIRA is about augmenting Node Managers to monitor the resource usage on the machine, aggregate these reports, and expose them to the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2984: --- Attachment: yarn-2984-2.patch Updated patch marks the configs Private, improves the test a tad bit and cleans up ContainersMonitorImpl a little more. Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-1.patch, yarn-2984-2.patch, yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276296#comment-14276296 ] Karthik Kambatla commented on YARN-2928: +1 to work on a branch. Developing features on branches seems to be working very well for HDFS folks. I would like for us to adopt the same model; that becomes easier if *all* features are developed on a branch. Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276313#comment-14276313 ] Vinod Kumar Vavilapalli commented on YARN-2965: --- The actual ticket that needs some collaboration is YARN-3033. Agreed, this feature work doesn't need everything that YARN-2928 needs, but they all look similar to me: the responsibility of obtaining stats at the node level. After they are collected on a single node, the stats get forwarded to the RM, per-app agent, etc. I'll make a short diagram to illustrate how all of this can be unified. Enhance Node Managers to monitor and report the resource usage on machines -- Key: YARN-2965 URL: https://issues.apache.org/jira/browse/YARN-2965 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Robert Grandl Assignee: Robert Grandl Attachments: ddoc_RT.docx This JIRA is about augmenting Node Managers to monitor the resource usage on the machine, aggregate these reports, and expose them to the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276318#comment-14276318 ] Sangjin Lee commented on YARN-2928: --- We have an unofficial IRC chatroom open for quick dev discussions on this. It's ##hadoop-ats (note 2 #'s) on irc.freenode.net. Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276340#comment-14276340 ] Karthik Kambatla commented on YARN-2928: It would be nice to create a branch based on trunk instead of branch-2, so we can merge into trunk before branch-2. Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276344#comment-14276344 ] Vinod Kumar Vavilapalli commented on YARN-2928: --- Makes sense, recreated the branch off trunk.. Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2217) Shared cache client side changes
[ https://issues.apache.org/jira/browse/YARN-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276389#comment-14276389 ] Chris Trezzo commented on YARN-2217: Re-kicking QA build to confirm the test failures in org.apache.hadoop.yarn.client.api.impl.TestSharedCacheClientImpl (they passed on my local machine). Shared cache client side changes Key: YARN-2217 URL: https://issues.apache.org/jira/browse/YARN-2217 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2217-trunk-v1.patch, YARN-2217-trunk-v2.patch, YARN-2217-trunk-v3.patch, YARN-2217-trunk-v4.patch, YARN-2217-trunk-v5.patch, YARN-2217-trunk-v6.patch, YARN-2217-trunk-v7.patch Implement the client side changes for the shared cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3024) LocalizerRunner should give DIE action when all resources are localized
[ https://issues.apache.org/jira/browse/YARN-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274915#comment-14274915 ] Hadoop QA commented on YARN-3024: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691878/YARN-3024.03.patch against trunk revision c4cba61. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6318//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6318//console This message is automatically generated. LocalizerRunner should give DIE action when all resources are localized --- Key: YARN-3024 URL: https://issues.apache.org/jira/browse/YARN-3024 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3024.01.patch, YARN-3024.02.patch, YARN-3024.03.patch We have observed that {{LocalizerRunner}} always gives a LIVE action at the end of the localization process. The problem is that {{findNextResource()}} can return null even when {{pending}} was not empty prior to the call. This method removes localized resources from {{pending}}, so we should check the return value and give a DIE action when it returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
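A toy sketch of the control flow the description argues for; all types below are simplified stand-ins for the NM's localizer internals, and only the null-check-then-DIE decision mirrors the proposed fix.
{code}
// Simplified sketch: findNextResource() may return null once everything in
// `pending` has been localized, and in that case the localizer should be
// told to DIE rather than being kept LIVE forever.
import java.util.ArrayDeque;
import java.util.Queue;

class LocalizerRunnerSketch {
  enum LocalizerAction { LIVE, DIE }

  private final Queue<String> pending = new ArrayDeque<>();

  LocalizerRunnerSketch(String... resources) {
    for (String r : resources) {
      pending.add(r);
    }
  }

  // Returns the next resource to hand out; null when nothing is left,
  // mirroring how localized resources get removed from `pending`.
  private String findNextResource() {
    return pending.poll();
  }

  LocalizerAction processHeartbeat() {
    String next = findNextResource();
    if (next == null) {
      return LocalizerAction.DIE; // all resources localized: let it exit
    }
    return LocalizerAction.LIVE;  // hand `next` out and keep it alive
  }

  public static void main(String[] args) {
    LocalizerRunnerSketch runner = new LocalizerRunnerSketch("job.jar");
    System.out.println(runner.processHeartbeat()); // LIVE (job.jar handed out)
    System.out.println(runner.processHeartbeat()); // DIE (pending drained)
  }
}
{code}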
[jira] [Created] (YARN-3055) Fix allTokens issue in DelegationTokenRenewer
Yi Liu created YARN-3055: Summary: Fix allTokens issue in DelegationTokenRenewer Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu In {{removeApplicationFromRenewal}}, when we are going to remove a token and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
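The intent of the fix can be pictured as reference counting: keep one renewal timer per token, and only cancel it (and drop the {{allTokens}} entry) when the last sharing application finishes. A self-contained sketch under that assumption; all names are illustrative, not the DelegationTokenRenewer API.
{code}
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

class SharedTokenRenewerSketch {
  static final class Entry {
    final TimerTask renewTask;
    int refCount;
    Entry(TimerTask task) { this.renewTask = task; this.refCount = 1; }
  }

  private final Timer timer = new Timer(true);
  private final Map<String, Entry> allTokens = new ConcurrentHashMap<>();

  synchronized void addApplication(String token) {
    Entry e = allTokens.get(token);
    if (e == null) {
      TimerTask task = new TimerTask() {
        @Override public void run() { System.out.println("renewing " + token); }
      };
      timer.scheduleAtFixedRate(task, 0, 1000);
      allTokens.put(token, new Entry(task));
    } else {
      e.refCount++; // shared: reuse the single renewal timer
    }
  }

  synchronized void removeApplication(String token) {
    Entry e = allTokens.get(token);
    if (e == null) return;
    if (--e.refCount > 0) return; // other apps still share it: keep renewing
    e.renewTask.cancel();         // last sharer gone: stop renewal
    allTokens.remove(token);
  }

  public static void main(String[] args) throws InterruptedException {
    SharedTokenRenewerSketch renewer = new SharedTokenRenewerSketch();
    renewer.addApplication("token1");    // app1 submits
    renewer.addApplication("token1");    // app2 shares token1
    renewer.removeApplication("token1"); // app1 finishes: renewal keeps going
    Thread.sleep(2500);
    renewer.removeApplication("token1"); // app2 finishes: renewal stops
  }
}
{code}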
[jira] [Created] (YARN-3054) Preempt policy in FairScheduler may cause mapreduce job never finish
Peng Zhang created YARN-3054: Summary: Preempt policy in FairScheduler may cause mapreduce job never finish Key: YARN-3054 URL: https://issues.apache.org/jira/browse/YARN-3054 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Peng Zhang The preemption policy is tied to the schedule policy now. Using the schedule policy's comparator to find preemption candidates cannot guarantee that some subset of containers is never preempted, and this may cause tasks to be preempted repeatedly before they finish, so the job cannot make any progress. I think preemption in YARN should provide the assurances below: 1. MapReduce jobs can get additional resources when others are idle; 2. MapReduce jobs for one user in one queue can still make progress with their min share when others preempt resources back. Maybe always preempting the latest app and container can achieve this? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
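One way to realize "always preempt the latest app and container" is a fixed preemption ordering that is independent of the schedule policy's comparator. An illustrative sketch follows; the Candidate type is a stand-in, not a FairScheduler class.
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PreemptionOrderSketch {
  static final class Candidate {
    final long appStartTime;
    final long containerStartTime;
    final String id;
    Candidate(long appStart, long containerStart, String id) {
      this.appStartTime = appStart;
      this.containerStartTime = containerStart;
      this.id = id;
    }
  }

  // Newest app first, then newest container first within the same app, so
  // older tasks keep their containers and can finish.
  static final Comparator<Candidate> PREEMPT_LATEST_FIRST =
      Comparator.comparingLong((Candidate c) -> c.appStartTime).reversed()
          .thenComparing(Comparator.comparingLong(
              (Candidate c) -> c.containerStartTime).reversed());

  public static void main(String[] args) {
    List<Candidate> candidates = new ArrayList<>();
    candidates.add(new Candidate(100, 300, "old-app/early"));
    candidates.add(new Candidate(200, 250, "new-app/early"));
    candidates.add(new Candidate(200, 400, "new-app/late"));
    candidates.sort(PREEMPT_LATEST_FIRST);
    System.out.println(candidates.get(0).id); // new-app/late is preempted first
  }
}
{code}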
[jira] [Updated] (YARN-3055) Fix allTokens issue in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3055: - Attachment: YARN-3055.001.patch [~jianhe], [~kasha] and [~jlowe], can you help take a look? Fix allTokens issue in DelegationTokenRenewer - Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch In {{removeApplicationFromRenewal}}, when we are going to remove a token and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation
[ https://issues.apache.org/jira/browse/YARN-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275252#comment-14275252 ] Hudson commented on YARN-3027: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #69 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/69/]) YARN-3027. Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation. (adhoot via rkanter) (rkanter: rev ae7bf31fe1c63f323ba5271e50fd0e4425a7510f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation - Key: YARN-3027 URL: https://issues.apache.org/jira/browse/YARN-3027 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-3027.001.patch, YARN-3027.002.patch YARN-2604 added support for updating the maximum allocation resource size based on nodes, but it incorrectly uses the available resource instead of the maximum resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
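The bug and the fix can be pictured with a toy node model: when deriving the dynamic maximum allocation from registered nodes, use each node's total capability rather than whatever happens to be free at that moment. A sketch under simplified assumptions (memory only; illustrative types, not the scheduler's API):
{code}
import java.util.Arrays;
import java.util.List;

class MaxAllocationSketch {
  static final class Node {
    final long totalMemMB; // fixed capacity of the node
    long availableMemMB;   // fluctuates with running containers
    Node(long total, long available) {
      this.totalMemMB = total;
      this.availableMemMB = available;
    }
  }

  static long maxAllocation(List<Node> nodes) {
    long max = 0;
    for (Node n : nodes) {
      max = Math.max(max, n.totalMemMB); // correct: total, not available
    }
    return max;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(new Node(8192, 512), new Node(4096, 4096));
    // Using availableMemMB would report 4096 here and reject valid 8 GB asks.
    System.out.println(maxAllocation(nodes)); // 8192
  }
}
{code}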
[jira] [Commented] (YARN-2957) Create unit test to automatically compare YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275251#comment-14275251 ] Hudson commented on YARN-2957: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #69 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/69/]) YARN-2957. Create unit test to automatically compare YarnConfiguration and yarn-default.xml. (rchiang via rkanter) (rkanter: rev f45163191583eadcfbe0df233a3185fd1b2b78f3) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/conf/TestYarnConfigurationFields.java Create unit test to automatically compare YarnConfiguration and yarn-default.xml Key: YARN-2957 URL: https://issues.apache.org/jira/browse/YARN-2957 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Fix For: 2.7.0 Attachments: YARN-2957.001.patch Create a unit test that will automatically compare the fields in YarnConfiguration and yarn-default.xml. It should throw an error if a property is missing in either the class or the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
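The mechanism behind such a test can be sketched with plain reflection. A stripped-down, illustrative version is below; the real TestYarnConfigurationFields is more thorough (it also checks the reverse direction and maintains skip lists for known exceptions), and the naive "yarn." filter here will also pick up prefix constants and keys that legitimately have no default.
{code}
// Walk the public String constants on YarnConfiguration via reflection and
// flag any "yarn." key that has no entry in yarn-default.xml. Assumes the
// Hadoop YARN jars (and yarn-default.xml) are on the classpath.
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ConfigXmlDriftCheck {
  public static void main(String[] args) throws Exception {
    Configuration xml = new Configuration(false);
    xml.addResource("yarn-default.xml");
    for (Field f : YarnConfiguration.class.getFields()) {
      if (!Modifier.isStatic(f.getModifiers()) || f.getType() != String.class) {
        continue;
      }
      String value = (String) f.get(null);
      // Naive filter: property keys start with "yarn."; DEFAULT_* constants
      // are skipped.
      if (value == null || !value.startsWith("yarn.")
          || f.getName().startsWith("DEFAULT_")) {
        continue;
      }
      if (xml.get(value) == null) {
        System.out.println("missing from yarn-default.xml: " + value);
      }
    }
  }
}
{code}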
[jira] [Commented] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation
[ https://issues.apache.org/jira/browse/YARN-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275270#comment-14275270 ] Hudson commented on YARN-3027: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2004 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2004/]) YARN-3027. Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation. (adhoot via rkanter) (rkanter: rev ae7bf31fe1c63f323ba5271e50fd0e4425a7510f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java * hadoop-yarn-project/CHANGES.txt Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation - Key: YARN-3027 URL: https://issues.apache.org/jira/browse/YARN-3027 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-3027.001.patch, YARN-3027.002.patch YARN-2604 added support for updating the maximum allocation resource size based on nodes, but it incorrectly uses the available resource instead of the maximum resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2643) Don't create a new DominantResourceCalculator on every FairScheduler.allocate call
[ https://issues.apache.org/jira/browse/YARN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275265#comment-14275265 ] Hudson commented on YARN-2643: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2004 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2004/]) YARN-2643. Don't create a new DominantResourceCalculator on every FairScheduler.allocate call. (kasha via rkanter) (rkanter: rev 51881535e659940b1b332d0c5952ee1f9958cc7f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Don't create a new DominantResourceCalculator on every FairScheduler.allocate call -- Key: YARN-2643 URL: https://issues.apache.org/jira/browse/YARN-2643 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.7.0 Attachments: yarn-2643-1.patch, yarn-2643.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2957) Create unit test to automatically compare YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275269#comment-14275269 ] Hudson commented on YARN-2957: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2004 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2004/]) YARN-2957. Create unit test to automatically compare YarnConfiguration and yarn-default.xml. (rchiang via rkanter) (rkanter: rev f45163191583eadcfbe0df233a3185fd1b2b78f3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/conf/TestYarnConfigurationFields.java * hadoop-yarn-project/CHANGES.txt Create unit test to automatically compare YarnConfiguration and yarn-default.xml Key: YARN-2957 URL: https://issues.apache.org/jira/browse/YARN-2957 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Fix For: 2.7.0 Attachments: YARN-2957.001.patch Create a unit test that will automatically compare the fields in YarnConfiguration and yarn-default.xml. It should throw an error if a property is missing in either the class or the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275278#comment-14275278 ] Junping Du commented on YARN-3019: -- bq. The final goal is to support work-preserving recovery only. So the config yarn.resourcemanager.work-preserving-recovery.enabled is not needed any more. Sounds good. Thanks [~jianhe] for the explanation. We can mark the unnecessary configuration as deprecated later. [~aw], if you don't have further comments, I will commit this simple patch soon. Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-3019.1.patch The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default, flipping the recovery mode from non-work-preserving to work-preserving recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
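For anyone setting this explicitly before the default flips, a small sketch with the two recovery switches. The work-preserving key is the one named in this issue; the recovery master-switch key is quoted from memory of the RM restart configuration, so treat it as an assumption. In practice these belong in yarn-site.xml; the programmatic form below is just for illustration.
{code}
import org.apache.hadoop.conf.Configuration;

public class WorkPreservingRecoveryConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Master switch for RM recovery; the work-preserving flag additionally
    // keeps running containers alive across an RM restart.
    conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
    conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);
    System.out.println(conf.getBoolean(
        "yarn.resourcemanager.work-preserving-recovery.enabled", false));
  }
}
{code}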
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275291#comment-14275291 ] Hadoop QA commented on YARN-3055: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691941/YARN-3055.002.patch against trunk revision 08ac062. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6322//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6322//console This message is automatically generated. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when we are going to remove a token and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2643) Don't create a new DominantResourceCalculator on every FairScheduler.allocate call
[ https://issues.apache.org/jira/browse/YARN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275247#comment-14275247 ] Hudson commented on YARN-2643: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #69 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/69/]) YARN-2643. Don't create a new DominantResourceCalculator on every FairScheduler.allocate call. (kasha via rkanter) (rkanter: rev 51881535e659940b1b332d0c5952ee1f9958cc7f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java Don't create a new DominantResourceCalculator on every FairScheduler.allocate call -- Key: YARN-2643 URL: https://issues.apache.org/jira/browse/YARN-2643 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.7.0 Attachments: yarn-2643-1.patch, yarn-2643.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3056) add verification for containerLaunchDuration in TestNodeManagerMetrics.
[ https://issues.apache.org/jira/browse/YARN-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-3056: --- Assignee: zhihai xu add verification for containerLaunchDuration in TestNodeManagerMetrics. --- Key: YARN-3056 URL: https://issues.apache.org/jira/browse/YARN-3056 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial add verification for containerLaunchDuration in TestNodeManagerMetrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275041#comment-14275041 ] Hadoop QA commented on YARN-2679: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691910/YARN-2679.addendum.1.patch against trunk revision 08ac062. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6320//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6320//console This message is automatically generated. Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
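The event-to-event timing the description refers to can be pictured with a toy recorder: stamp the time when LAUNCH_CONTAINER is sent and record the elapsed duration when CONTAINER_LAUNCHED arrives. In the NM the duration lands in the NodeManagerMetrics entry named containerLaunchDuration; the types below are simplified stand-ins, not the NM's event or metrics classes.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LaunchDurationSketch {
  private final Map<String, Long> launchSentAt = new HashMap<>();
  final List<Long> durationsMs = new ArrayList<>();

  void onLaunchContainerEvent(String containerId) {
    launchSentAt.put(containerId, System.currentTimeMillis());
  }

  void onContainerLaunchedEvent(String containerId) {
    Long start = launchSentAt.remove(containerId);
    if (start != null) {
      durationsMs.add(System.currentTimeMillis() - start); // the "prepare time"
    }
  }

  public static void main(String[] args) throws InterruptedException {
    LaunchDurationSketch metrics = new LaunchDurationSketch();
    metrics.onLaunchContainerEvent("container_1");
    Thread.sleep(20); // stand-in for localization and process launch work
    metrics.onContainerLaunchedEvent("container_1");
    System.out.println("launch durations (ms): " + metrics.durationsMs);
  }
}
{code}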
[jira] [Updated] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Attachment: YARN-2679.addendum.1.patch Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch, YARN-2679.addendum.1.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275033#comment-14275033 ] Hadoop QA commented on YARN-2679: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691909/YARN-2679.addendum.1.patch against trunk revision 08ac062. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6319//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6319//console This message is automatically generated. Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3056) add verification for containerLaunchDuration in TestNodeManagerMetrics.
[ https://issues.apache.org/jira/browse/YARN-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3056: Attachment: YARN-3056.000.patch add verification for containerLaunchDuration in TestNodeManagerMetrics. --- Key: YARN-3056 URL: https://issues.apache.org/jira/browse/YARN-3056 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Attachments: YARN-3056.000.patch add verification for containerLaunchDuration in TestNodeManagerMetrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Attachment: (was: YARN-2679.addendum.1.patch) Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275001#comment-14275001 ] zhihai xu commented on YARN-2679: - Sorry, I forgot to add verification in the test. I attached an addendum patch which adds the verification. Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch, YARN-2679.addendum.1.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3056) add verification for containerLaunchDuration in TestNodeManagerMetrics.
zhihai xu created YARN-3056: --- Summary: add verification for containerLaunchDuration in TestNodeManagerMetrics. Key: YARN-3056 URL: https://issues.apache.org/jira/browse/YARN-3056 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: zhihai xu Priority: Trivial add verification for containerLaunchDuration in TestNodeManagerMetrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reopened YARN-2679: - Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Attachment: YARN-2679.addendum.1.patch Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch, YARN-2679.addendum.1.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Attachment: (was: YARN-2679.addendum.1.patch) Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-2679. - Resolution: Fixed Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275038#comment-14275038 ] zhihai xu commented on YARN-2679: - I created YARN-3056 to add verification in the test. Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch Add metrics in NodeManagerMetrics to track the prepare time to launch a container. The prepare time is the duration between sending the ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving the ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3057) Need update apps' runnability when reloading allocation files for FairScheduler
Jun Gong created YARN-3057: -- Summary: Need update apps' runnability when reloading allocation files for FairScheduler Key: YARN-3057 URL: https://issues.apache.org/jira/browse/YARN-3057 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong If we submit an app and the number of running apps in its corresponding leaf queue has reached the max limit, the app will be put into 'nonRunnableApps', and its runnability will only be updated when an app attempt is removed (FairScheduler calls `updateRunnabilityOnAppRemoval` at that time). Suppose only long-running service apps are running: they will never finish, so the submitted app will never be scheduled even if we raise the leaf queue's max limit. I think we need to update apps' runnability when reloading allocation files in FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
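A sketch of what the proposed hook could look like when the allocation file is reloaded; the helpers (getNonRunnableAppsIterator, canAppBeRunnable, makeAppRunnable) are illustrative stand-ins for whatever the eventual patch wires through MaxRunningAppsEnforcer:

{code}
// Sketch only: after maxRunningApps limits change on reload, promote any
// queued app that is now allowed to run.
void updateRunnabilityOnReload(QueueManager queueManager,
    MaxRunningAppsEnforcer enforcer) {
  for (FSLeafQueue queue : queueManager.getLeafQueues()) {
    Iterator<FSAppAttempt> it = queue.getNonRunnableAppsIterator(); // illustrative
    while (it.hasNext()) {
      FSAppAttempt app = it.next();
      if (enforcer.canAppBeRunnable(queue, app.getUser())) { // illustrative
        queue.makeAppRunnable(app);
        it.remove();
      }
    }
  }
}
{code}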
[jira] [Commented] (YARN-3056) add verification for containerLaunchDuration in TestNodeManagerMetrics.
[ https://issues.apache.org/jira/browse/YARN-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275058#comment-14275058 ] Hadoop QA commented on YARN-3056: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691921/YARN-3056.000.patch against trunk revision 08ac062. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6321//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6321//console This message is automatically generated. add verification for containerLaunchDuration in TestNodeManagerMetrics. --- Key: YARN-3056 URL: https://issues.apache.org/jira/browse/YARN-3056 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Attachments: YARN-3056.000.patch add verification for containerLaunchDuration in TestNodeManagerMetrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2957) Create unit test to automatically compare YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275103#comment-14275103 ] Hudson commented on YARN-2957: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #806 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/806/]) YARN-2957. Create unit test to automatically compare YarnConfiguration and yarn-default.xml. (rchiang via rkanter) (rkanter: rev f45163191583eadcfbe0df233a3185fd1b2b78f3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/conf/TestYarnConfigurationFields.java * hadoop-yarn-project/CHANGES.txt Create unit test to automatically compare YarnConfiguration and yarn-default.xml Key: YARN-2957 URL: https://issues.apache.org/jira/browse/YARN-2957 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Fix For: 2.7.0 Attachments: YARN-2957.001.patch Create a unit test that will automatically compare the fields in YarnConfiguration and yarn-default.xml. It should throw an error if a property is missing in either the class or the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
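Stripped to its essence, the committed test's check can be sketched as a reflection pass over YarnConfiguration's public String constants against the properties yarn-default.xml declares. This standalone sketch only illustrates the comparison idea; the real TestYarnConfigurationFields builds on a shared harness:

{code}
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnConfigVsDefaultXmlSketch {
  public static void main(String[] args) throws Exception {
    // Property names declared as public static String constants in the class.
    Set<String> classProps = new HashSet<String>();
    for (Field f : YarnConfiguration.class.getDeclaredFields()) {
      int m = f.getModifiers();
      if (Modifier.isPublic(m) && Modifier.isStatic(m)
          && f.getType() == String.class) {
        String value = (String) f.get(null);
        if (value != null && value.startsWith("yarn.")) {
          classProps.add(value);
        }
      }
    }
    // Property names present in yarn-default.xml.
    Configuration conf = new Configuration(false);
    conf.addResource("yarn-default.xml");
    Set<String> xmlProps = new HashSet<String>();
    for (Map.Entry<String, String> e : conf) {
      xmlProps.add(e.getKey());
    }
    // Report the difference in both directions.
    for (String p : classProps) {
      if (!xmlProps.contains(p)) System.out.println("not in yarn-default.xml: " + p);
    }
    for (String p : xmlProps) {
      if (!classProps.contains(p)) System.out.println("no constant in class: " + p);
    }
  }
}
{code}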
[jira] [Commented] (YARN-3056) add verification for containerLaunchDuration in TestNodeManagerMetrics.
[ https://issues.apache.org/jira/browse/YARN-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275111#comment-14275111 ] Karthik Kambatla commented on YARN-3056: Sorry for missing this in my review of YARN-2679, and thanks for following up. The patch looks good. +1. add verification for containerLaunchDuration in TestNodeManagerMetrics. --- Key: YARN-3056 URL: https://issues.apache.org/jira/browse/YARN-3056 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Attachments: YARN-3056.000.patch add verification for containerLaunchDuration in TestNodeManagerMetrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
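For reference, the kind of verification YARN-3056 adds can be sketched with the MetricsAsserts helpers used throughout Hadoop's metrics tests. The metric names below follow the metrics2 convention of deriving a NumOps counter from a MutableRate; the exact assertions in the patch may differ:

{code}
// Sketch of a test-body fragment, after driving one container launch.
import static org.apache.hadoop.test.MetricsAsserts.assertCounter;
import static org.apache.hadoop.test.MetricsAsserts.getMetrics;
import org.apache.hadoop.metrics2.MetricsRecordBuilder;

metrics.addContainerLaunchDuration(1);
MetricsRecordBuilder rb = getMetrics("NodeManagerMetrics");
// The MutableRate "containerLaunchDuration" surfaces a *NumOps counter.
assertCounter("ContainerLaunchDurationNumOps", 1L, rb);
{code}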
[jira] [Updated] (YARN-3055) Fix allTokens issue in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3055: - Attachment: YARN-3055.002.patch Fix allTokens issue in DelegationTokenRenewer - Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3055) Fix allTokens issue in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3055: - Attachment: (was: YARN-3055.002.patch) Fix allTokens issue in DelegationTokenRenewer - Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) Fix allTokens issue in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275172#comment-14275172 ] Yi Liu commented on YARN-3055: -- Uploaded a new patch. Fix allTokens issue in DelegationTokenRenewer - Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2643) Don't create a new DominantResourceCalculator on every FairScheduler.allocate call
[ https://issues.apache.org/jira/browse/YARN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275099#comment-14275099 ] Hudson commented on YARN-2643: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #806 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/806/]) YARN-2643. Don't create a new DominantResourceCalculator on every FairScheduler.allocate call. (kasha via rkanter) (rkanter: rev 51881535e659940b1b332d0c5952ee1f9958cc7f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Don't create a new DominantResourceCalculator on every FairScheduler.allocate call -- Key: YARN-2643 URL: https://issues.apache.org/jira/browse/YARN-2643 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.7.0 Attachments: yarn-2643-1.patch, yarn-2643.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
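The change is a classic hoist-the-allocation-out-of-the-hot-path fix: DominantResourceCalculator is stateless, so one instance can be created once and reused by every allocate() call instead of being constructed per call. A hedged sketch of the pattern, with an illustrative class wrapper:

{code}
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;

class AllocateHotPathSketch {
  // Constructed once; DominantResourceCalculator holds no per-call state,
  // so sharing it across allocate() invocations is safe.
  private static final ResourceCalculator DOMINANT_RC =
      new DominantResourceCalculator();

  void allocate() {
    // Before the fix: new DominantResourceCalculator() on every call,
    // creating needless garbage on the scheduler's hottest path.
    ResourceCalculator rc = DOMINANT_RC;
    // ... use rc to normalize the incoming resource requests ...
  }
}
{code}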
[jira] [Commented] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation
[ https://issues.apache.org/jira/browse/YARN-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275104#comment-14275104 ] Hudson commented on YARN-3027: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #806 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/806/]) YARN-3027. Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation. (adhoot via rkanter) (rkanter: rev ae7bf31fe1c63f323ba5271e50fd0e4425a7510f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation - Key: YARN-3027 URL: https://issues.apache.org/jira/browse/YARN-3027 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-3027.001.patch, YARN-3027.002.patch YARN-2604 added support for updating the maximum allocation resource size based on nodes. But it incorrectly uses the available resource instead of the maximum resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
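In other words, the maximum-allocation bookkeeping should be driven by each node's total capability, not its momentary availability, which shrinks as containers run. A hedged sketch of the corrected shape; the fields are assumed for illustration:

{code}
// Sketch: recompute the scheduler-wide maximum allocation from node *totals*.
private long maxNodeMemory = -1;
private int maxNodeVCores = -1;

protected void updateMaximumAllocation(SchedulerNode node, boolean added) {
  if (added) {
    Resource total = node.getTotalResource(); // not node.getAvailableResource()
    maxNodeMemory = Math.max(maxNodeMemory, total.getMemory());
    maxNodeVCores = Math.max(maxNodeVCores, total.getVirtualCores());
  }
  // On node removal the maximum must be recomputed over the remaining
  // nodes (omitted here).
}
{code}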
[jira] [Commented] (YARN-3055) Fix allTokens issue in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275115#comment-14275115 ] Yi Liu commented on YARN-3055: -- The token is still not being renewed; I will update the patch later. Fix allTokens issue in DelegationTokenRenewer - Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3055: - Summary: The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer (was: Fix allTokens issue in DelegationTokenRenewer) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275183#comment-14275183 ] Yi Liu commented on YARN-2964: -- It seems this JIRA causes the token to not be renewed properly if it's shared by jobs (oozie). I filed YARN-3055; please take a look. RM prematurely cancels tokens for jobs that submit jobs (oozie) --- Key: YARN-2964 URL: https://issues.apache.org/jira/browse/YARN-2964 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Daryn Sharp Assignee: Jian He Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch The RM used to globally track the unique set of tokens for all apps. It remembered the first job that was submitted with the token. The first job controlled the cancellation of the token. This prevented completion of sub-jobs from canceling tokens used by the main job. As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM liveliness interval) after log aggregation completes. The result is that an oozie job, e.g. pig, that launches many sub-jobs over time will fail if any sub-job is launched 10 min after any other sub-job completes. If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2643) Don't create a new DominantResourceCalculator on every FairScheduler.allocate call
[ https://issues.apache.org/jira/browse/YARN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275081#comment-14275081 ] Hudson commented on YARN-2643: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #72 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/72/]) YARN-2643. Don't create a new DominantResourceCalculator on every FairScheduler.allocate call. (kasha via rkanter) (rkanter: rev 51881535e659940b1b332d0c5952ee1f9958cc7f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Don't create a new DominantResourceCalculator on every FairScheduler.allocate call -- Key: YARN-2643 URL: https://issues.apache.org/jira/browse/YARN-2643 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.7.0 Attachments: yarn-2643-1.patch, yarn-2643.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation
[ https://issues.apache.org/jira/browse/YARN-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275086#comment-14275086 ] Hudson commented on YARN-3027: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #72 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/72/]) YARN-3027. Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation. (adhoot via rkanter) (rkanter: rev ae7bf31fe1c63f323ba5271e50fd0e4425a7510f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation - Key: YARN-3027 URL: https://issues.apache.org/jira/browse/YARN-3027 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-3027.001.patch, YARN-3027.002.patch YARN-2604 added support for updating the maximum allocation resource size based on nodes. But it incorrectly uses the available resource instead of the maximum resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3055: - Description: After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. was: In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
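The fix the description argues for amounts to reference counting: a shared token's timer and its {{allTokens}} entry must survive until the last application referencing the token completes. A sketch of that invariant, with illustrative field and helper names rather than the committed patch:

{code}
// Sketch: only tear down renewal state when the last referencing app is gone.
void removeApplicationFromRenewal(ApplicationId appId) {
  for (DelegationTokenToRenew dttr : tokensForApp(appId)) { // illustrative helper
    dttr.referringAppIds.remove(appId);
    if (!dttr.referringAppIds.isEmpty()) {
      // Other apps still share this token: keep the timerTask and the
      // allTokens entry so scheduled renewal continues.
      continue;
    }
    dttr.timerTask.cancel();
    allTokens.remove(dttr.token);
    cancelToken(dttr); // actually cancel the token with the issuing service
  }
}
{code}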
[jira] [Commented] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation
[ https://issues.apache.org/jira/browse/YARN-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275327#comment-14275327 ] Hudson commented on YARN-3027: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #73 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/73/]) YARN-3027. Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation. (adhoot via rkanter) (rkanter: rev ae7bf31fe1c63f323ba5271e50fd0e4425a7510f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/CHANGES.txt Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation - Key: YARN-3027 URL: https://issues.apache.org/jira/browse/YARN-3027 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-3027.001.patch, YARN-3027.002.patch YARN-2604 added support for updating the maximum allocation resource size based on nodes. But it incorrectly uses the available resource instead of the maximum resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2957) Create unit test to automatically compare YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275326#comment-14275326 ] Hudson commented on YARN-2957: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #73 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/73/]) YARN-2957. Create unit test to automatically compare YarnConfiguration and yarn-default.xml. (rchiang via rkanter) (rkanter: rev f45163191583eadcfbe0df233a3185fd1b2b78f3) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/conf/TestYarnConfigurationFields.java Create unit test to automatically compare YarnConfiguration and yarn-default.xml Key: YARN-2957 URL: https://issues.apache.org/jira/browse/YARN-2957 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Fix For: 2.7.0 Attachments: YARN-2957.001.patch Create a unit test that will automatically compare the fields in YarnConfiguration and yarn-default.xml. It should throw an error if a property is missing in either the class or the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2643) Don't create a new DominantResourceCalculator on every FairScheduler.allocate call
[ https://issues.apache.org/jira/browse/YARN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275322#comment-14275322 ] Hudson commented on YARN-2643: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #73 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/73/]) YARN-2643. Don't create a new DominantResourceCalculator on every FairScheduler.allocate call. (kasha via rkanter) (rkanter: rev 51881535e659940b1b332d0c5952ee1f9958cc7f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Don't create a new DominantResourceCalculator on every FairScheduler.allocate call -- Key: YARN-2643 URL: https://issues.apache.org/jira/browse/YARN-2643 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.7.0 Attachments: yarn-2643-1.patch, yarn-2643.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2957) Create unit test to automatically compare YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275397#comment-14275397 ] Hudson commented on YARN-2957: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2023 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2023/]) YARN-2957. Create unit test to automatically compare YarnConfiguration and yarn-default.xml. (rchiang via rkanter) (rkanter: rev f45163191583eadcfbe0df233a3185fd1b2b78f3) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/conf/TestYarnConfigurationFields.java Create unit test to automatically compare YarnConfiguration and yarn-default.xml Key: YARN-2957 URL: https://issues.apache.org/jira/browse/YARN-2957 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Fix For: 2.7.0 Attachments: YARN-2957.001.patch Create a unit test that will automatically compare the fields in YarnConfiguration and yarn-default.xml. It should throw an error if a property is missing in either the class or the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation
[ https://issues.apache.org/jira/browse/YARN-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275398#comment-14275398 ] Hudson commented on YARN-3027: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2023 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2023/]) YARN-3027. Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation. (adhoot via rkanter) (rkanter: rev ae7bf31fe1c63f323ba5271e50fd0e4425a7510f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/CHANGES.txt Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation - Key: YARN-3027 URL: https://issues.apache.org/jira/browse/YARN-3027 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-3027.001.patch, YARN-3027.002.patch YARN-2604 added support for updating the maximum allocation resource size based on nodes. But it incorrectly uses the available resource instead of the maximum resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2643) Don't create a new DominantResourceCalculator on every FairScheduler.allocate call
[ https://issues.apache.org/jira/browse/YARN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275393#comment-14275393 ] Hudson commented on YARN-2643: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2023 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2023/]) YARN-2643. Don't create a new DominantResourceCalculator on every FairScheduler.allocate call. (kasha via rkanter) (rkanter: rev 51881535e659940b1b332d0c5952ee1f9958cc7f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Don't create a new DominantResourceCalculator on every FairScheduler.allocate call -- Key: YARN-2643 URL: https://issues.apache.org/jira/browse/YARN-2643 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.7.0 Attachments: yarn-2643-1.patch, yarn-2643.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1871) We should eliminate writing *PBImpl code in YARN
[ https://issues.apache.org/jira/browse/YARN-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1871: - Assignee: (was: Wangda Tan) We should eliminate writing *PBImpl code in YARN Key: YARN-1871 URL: https://issues.apache.org/jira/browse/YARN-1871 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.4.0 Reporter: Wangda Tan Attachments: YARN-1871.demo.patch Currently, we need to write PBImpl classes one by one. After running find . -name '*PBImpl*.java' | xargs wc -l under the Hadoop source directory, we can see there are more than 25,000 LOC. I think we should improve this, which will be very helpful for YARN developers making changes to YARN protocols. There are only some limited patterns in the current *PBImpl classes: * Simple types, like string, int32, float * List<?> types * Map<?, ?> types * Enum types Code generation should be enough to generate such PBImpl classes. Some other requirements are: * Leave other related code alone, like service implementations (e.g. ContainerManagerImpl). * (If possible) forward compatibility: developers can write their own PBImpl or generate them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1871) We should eliminate writing *PBImpl code in YARN
[ https://issues.apache.org/jira/browse/YARN-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276086#comment-14276086 ] Wangda Tan commented on YARN-1871: -- Making it unassigned since I don't have bandwidth for this now. We should eliminate writing *PBImpl code in YARN Key: YARN-1871 URL: https://issues.apache.org/jira/browse/YARN-1871 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.4.0 Reporter: Wangda Tan Attachments: YARN-1871.demo.patch Currently, we need to write PBImpl classes one by one. After running find . -name '*PBImpl*.java' | xargs wc -l under the Hadoop source directory, we can see there are more than 25,000 LOC. I think we should improve this, which will be very helpful for YARN developers making changes to YARN protocols. There are only some limited patterns in the current *PBImpl classes: * Simple types, like string, int32, float * List<?> types * Map<?, ?> types * Enum types Code generation should be enough to generate such PBImpl classes. Some other requirements are: * Leave other related code alone, like service implementations (e.g. ContainerManagerImpl). * (If possible) forward compatibility: developers can write their own PBImpl or generate them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
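For context on what would be generated, the hand-written pattern being measured looks roughly like the following for a single optional string field. This is a representative shape, not an exact quote of any one class; a generator would emit a pair like this per field from the .proto definition:

{code}
// Representative PBImpl boilerplate for one optional string field "queue".
@Override
public String getQueue() {
  ApplicationSubmissionContextProtoOrBuilder p = viaProto ? proto : builder;
  return p.hasQueue() ? p.getQueue() : null;
}

@Override
public void setQueue(String queue) {
  maybeInitBuilder(); // copy the proto into a mutable builder on first write
  if (queue == null) {
    builder.clearQueue();
    return;
  }
  builder.setQueue(queue);
}
{code}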
[jira] [Commented] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276101#comment-14276101 ] Wangda Tan commented on YARN-2932: -- [~eepayne], Thanks for the response. *Re 2:* You're partially correct: the queue finally calls setupQueueConfigs when reinitialize is invoked. The CapacityScheduler reinitialization creates a new set of queues and copies the new parameters to your old queues via
{code}
setupQueueConfigs(
    clusterResource,
    newlyParsedLeafQueue.capacity,
    newlyParsedLeafQueue.absoluteCapacity,
    newlyParsedLeafQueue.maximumCapacity,
    newlyParsedLeafQueue.absoluteMaxCapacity,
    ...
{code}
So you need to put the parameter you want to update into setupQueueConfigs as well; without that, the queue will not be refreshed. I didn't find any changes to the parameters of setupQueueConfigs, so I guess that is the case; it's better to add a test to verify it. *Re 3:* You can take a look at how AbstractCSQueue initializes labels:
{code}
// get labels
this.accessibleLabels =
    cs.getConfiguration().getAccessibleNodeLabels(getQueuePath());
// inherit from parent if labels not set
if (this.accessibleLabels == null && parent != null) {
  this.accessibleLabels = parent.getAccessibleNodeLabels();
}
{code}
I think they share similar logic: the node-label code tries to get the value from configuration and, if it is not set, inherits from the parent. With this, you can make the getPreemptable interface in CapacitySchedulerConfiguration work without a defaultVal. Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
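Applied to the preemption flag, the suggested inheritance pattern could be sketched as follows. getPreemptionDisabledOrNull is a hypothetical accessor returning null when the property is absent from capacity-scheduler.xml; it is not a committed API:

{code}
// Sketch: the queue's own setting wins; otherwise fall back to the parent;
// the root defaults to "preemption enabled".
private boolean computePreemptionDisabled(CSQueue parent, String queuePath,
    CapacitySchedulerConfiguration conf) {
  Boolean own = conf.getPreemptionDisabledOrNull(queuePath); // hypothetical
  if (own != null) {
    return own.booleanValue();
  }
  return parent != null && parent.getPreemptionDisabled();
}
{code}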
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276107#comment-14276107 ] Vinod Kumar Vavilapalli commented on YARN-2791: --- Okay folks, I've read the design docs on both YARN-2791 (this JIRA) and YARN-2139. This is indeed part of YARN-2139, and a direct dup of YARN-2618 and other tickets. Yes, YARN-2139 is a much larger effort, but it encompasses both scheduling and isolation. The important tickets of YARN-2139 were already created before this JIRA. I am going to close this as a duplicate in a day unless I see specific tasks that are not covered under YARN-2139. If there are indeed things that are not covered, I urge Swapnil Daingade, Santosh Marella and Yuliya Feldman to file sub-tasks under YARN-2791. As Karthik appealed before, let's have the design discussion over at YARN-2139, merging things that are only here and missing in that JIRA. Due credit will be given to all contributors to the design and implementation there. I am indifferent to who contributes the code, but let's work together, please! Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Attachments: DiskDriveAsResourceInYARN.pdf Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization, as containers starved for I/O bandwidth hold on to other resources like CPU and memory. This problem can be solved by considering disk as a resource and including it when deciding how many containers can run concurrently on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275905#comment-14275905 ] Hadoop QA commented on YARN-2637: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692009/YARN-2637.40.patch against trunk revision 10ac5ab. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6323//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6323//console This message is automatically generated. maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.38.patch, YARN-2637.39.patch, YARN-2637.40.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() +
        " from user: " + application.getUser() +
        " activated in queue: " + getQueueName());
  }
}
{code}
An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200, and if a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
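To make the arithmetic concrete, here is a small worked sketch of the failure mode using the numbers from the description; it is purely illustrative, since the real limits live in LeafQueue:

{code}
public class AmLimitSketch {
  public static void main(String[] args) {
    int queueCapacityMb = 1024;   // queue capacity = 1G
    double maxAmPercent = 0.2;    // maximum-am-resource-percent
    int minAllocationMb = 1;      // minimum_allocation = 1M
    int actualAmMb = 5;           // what each AM really asks for

    int maxAmResourceMb = (int) (queueCapacityMb * maxAmPercent); // ~204 MB
    int maxAmCount = maxAmResourceMb / minAllocationMb;           // ~204 AMs

    // Counting AMs by minimum_allocation lets ~204 * 5 MB = ~1020 MB of AMs
    // activate: essentially the whole queue, not 20% of it.
    System.out.println("AM memory if all activate: "
        + (maxAmCount * actualAmMb) + " MB");
    // The fix: compare the summed *resource* of active AMs, per queue and per
    // user, against maxAmResourceMb before activating the next one.
  }
}
{code}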
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275952#comment-14275952 ] Jian He commented on YARN-3055: --- bq. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. IIUC, this is not the case. Because if the launcher job gets added to the appTokens map first, DelegationTokenRenewer will not add a DelegationTokenToRenew instance for the sub-job. So the tokens in removeApplicationFromRenewal will be empty for the sub-job when the sub-job completes, and the token won't be removed from allTokens. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token that is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, existing applications that share this token will not get it renewed any more, and for newly submitted applications that share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275939#comment-14275939 ] Hadoop QA commented on YARN-2933: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692017/YARN-2933-7.patch against trunk revision 10ac5ab. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6324//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6324//console This message is automatically generated. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have a preemption policy that supports it. YARN-2498 is targeting preemption that respects node labels, but we have some gaps in the code base; for example, queues/FiCaScheduler should be able to get usedResource/pendingResource, etc., by label. These items potentially require refactoring CS, which we need to spend some time thinking about carefully. For now, what we can do immediately is calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regressions like the following: a cluster has some nodes with labels and some without; assume queueA isn't satisfied for resources without labels, yet the preemption policy may preempt resources from labeled nodes for queueA, which is not correct. Again, this is just a short-term enhancement; YARN-2498 will handle preemption respecting node labels for the Capacity Scheduler, which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
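A sketch of the short-term restriction, computing the preemptable total from unlabeled nodes only. The label lookup shown here is illustrative; in this code base it would go through RMNodeLabelsManager, and the exact accessors may differ:

{code}
// Sketch: preemption math should only see capacity backed by unlabeled nodes.
Resource unlabeledClusterResource(Collection<RMNode> nodes,
    RMNodeLabelsManager labelsManager) {
  Resource total = Resource.newInstance(0, 0);
  for (RMNode node : nodes) {
    Set<String> labels = labelsManager.getLabelsOnNode(node.getNodeID());
    if (labels == null || labels.isEmpty()) {
      Resources.addTo(total, node.getTotalCapability());
    }
  }
  return total;
}
{code}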