[jira] [Updated] (YARN-3319) Implement a Fair SchedulerOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3319: -- Attachment: YARN-3319.39.patch Implement a Fair SchedulerOrderingPolicy Key: YARN-3319 URL: https://issues.apache.org/jira/browse/YARN-3319 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3319.13.patch, YARN-3319.14.patch, YARN-3319.17.patch, YARN-3319.35.patch, YARN-3319.39.patch Implement a Fair Comparator for the Scheduler Comparator Ordering Policy which prefers to allocate to SchedulerProcesses with least current usage, very similar to the FairScheduler's FairSharePolicy. The Policy will offer allocations to applications in a queue in order of least resources used, and preempt applications in reverse order (from most resources used). This will include conditional support for sizeBasedWeight style adjustment. An implementation of a Scheduler Comparator for use with the Scheduler Comparator Ordering Policy will be built with the below comparison for ordering applications for container assignment (ascending) and for preemption (descending): current resource usage - less usage is lesser; submission time - earlier is lesser. Optionally, based on a conditional configuration to enable sizeBasedWeight (default false), an adjustment to boost larger applications (to offset the natural preference for smaller applications) will adjust the resource usage value based on demand, dividing it by the below value: Math.log1p(app memory demand) / Math.log(2); In cases where the above is indeterminate (two applications are equal after this comparison), behavior falls back to comparison based on the application name, which is lexically FIFO for that comparison (first submitted is lesser). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
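As an illustrative aside to the entry above, a minimal Java sketch of the described comparison could look like the following; the SchedulableApp type and its accessors are assumptions made for the sketch, not the actual classes in the YARN-3319 patch.
{code}
import java.util.Comparator;

// Illustrative sketch only: SchedulableApp and its accessors are placeholders
// standing in for the real scheduler types, not the API in the YARN-3319 patch.
interface SchedulableApp {
  long getUsedMemory();      // current resource usage (MB)
  long getDemandMemory();    // outstanding memory demand (MB)
  long getSubmissionTime();
  String getName();
}

class FairComparatorSketch implements Comparator<SchedulableApp> {
  private final boolean sizeBasedWeight;

  FairComparatorSketch(boolean sizeBasedWeight) {
    this.sizeBasedWeight = sizeBasedWeight;
  }

  @Override
  public int compare(SchedulableApp a, SchedulableApp b) {
    // Ascending order is used for container assignment; preemption walks it in reverse.
    int cmp = Double.compare(weightedUsage(a), weightedUsage(b)); // less usage is lesser
    if (cmp == 0) {
      cmp = Long.compare(a.getSubmissionTime(), b.getSubmissionTime()); // earlier is lesser
    }
    if (cmp == 0) {
      cmp = a.getName().compareTo(b.getName()); // lexical fallback, effectively FIFO here
    }
    return cmp;
  }

  private double weightedUsage(SchedulableApp app) {
    double usage = app.getUsedMemory();
    if (sizeBasedWeight) {
      // Optionally boost larger applications by dividing usage by log2(1 + demand).
      double divisor = Math.log1p(app.getDemandMemory()) / Math.log(2);
      if (divisor > 0) {
        usage /= divisor;
      }
    }
    return usage;
  }
}
{code}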
[jira] [Updated] (YARN-3318) Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3318: -- Attachment: YARN-3318.39.patch Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior --- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch, YARN-3318.39.patch Create the initial framework required for using OrderingPolicies with SchedulerApplicationAttempts and integrate with the CapacityScheduler. This will include an implementation which is compatible with current FIFO behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392211#comment-14392211 ] Craig Welch commented on YARN-3318: --- [~leftnoteasy] SchedulerProcessEvents replaced with containerAllocated and containerReleased. Serial and SerialEpoch replaced with compareInputOrderTo(), which is option 2 for addressing it, which we settled on offline. Added addSchedulerProcess/removeSchedulerProcess/addAllSchedulerProcesses. Changed configuration so that yarn.scheduler.capacity.root.default.ordering-policy=fair will set up the fair configuration, fifo will set up fifo, fair+fifo will set up compound fair + fifo, etc. It is possible to set up a custom ordering policy class using a different configuration, but the base one will handle the friendly setup. [~vinodkv] bq. It is not entirely clear how the ordering and limits work together - as one policy with multiple facets or multiple policy types This should be modeled as different types of policies, so that they can each focus on their particular purpose and avoid a repetition of the intermingling which has made it difficult to mix, match, and share capabilities. Having multiple policy types is essential to make it easy to combine them as needed. bq. let's split the patch that exposes this to the client side / web UI and in the API records into its own JIRA...premature to support this as a publicly supportable configuration... The goal is to make this available quickly but iteratively, keeping the changes small but making them available for use and feedback. Clearly we can mark things unstable, and communicate that they are not fully mature/subject to change/should be used gently, but we will need to make it possible to activate the feature and use it in order to get that use and feedback. We should grow it organically, gradually, iteratively - think of it as a facet of the policy framework hooked up and available, but with more to follow. bq. ...SchedulableEntity better... well, I'd actually talked [~leftnoteasy] into SchedulerProcess :-) So, we can chew on this a bit more and see where we go. bq. You add/remove applications to/from LeafQueue's policy but addition/removal of containers is an event... This has been refactored along the lines of [~leftnoteasy]'s suggestion; it should now be consistent. bq. The notion of a comparator doesn't make sense to an admin. It is simply a policy... Have modeled policy configuration differently, so the comparator is out of sight (see above). bq. Depending on how ordering and limits come together, they may become properties of a policy I expect them to be distinct: this is specifically an ordering-policy; limits will be other types of limit-policy(ies). Patch with these changes to follow in a few... Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior --- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch Create the initial framework required for using OrderingPolicies with SchedulerApplicationAttempts and integrate with the CapacityScheduler. This will include an implementation which is compatible with current FIFO behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
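To make the friendly-name configuration described in the comment above concrete, here is a rough, hypothetical sketch of how values such as "fair", "fifo", and "fair+fifo" could be resolved; the property name comes from the comment, while the helper and the policy class names are placeholders, not the classes in the actual patch.
{code}
import java.util.ArrayList;
import java.util.List;

// Sketch of resolving the friendly ordering-policy names mentioned above
// ("fair", "fifo", "fair+fifo"). OrderingPolicySketch and the concrete policy
// classes are placeholders, not the classes in the YARN-3318 patch.
public class OrderingPolicyConfigSketch {

  interface OrderingPolicySketch { /* ordering behaviour elided */ }
  static class FifoPolicySketch implements OrderingPolicySketch { }
  static class FairPolicySketch implements OrderingPolicySketch { }

  static List<OrderingPolicySketch> parse(String configured) {
    // e.g. yarn.scheduler.capacity.root.default.ordering-policy = "fair+fifo"
    List<OrderingPolicySketch> policies = new ArrayList<>();
    for (String name : configured.split("\\+")) {
      switch (name.trim()) {
        case "fair": policies.add(new FairPolicySketch()); break;
        case "fifo": policies.add(new FifoPolicySketch()); break;
        default:
          // a custom ordering policy class would be loaded elsewhere, per the comment
          throw new IllegalArgumentException("Unknown ordering policy: " + name);
      }
    }
    return policies;
  }
}
{code}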
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392345#comment-14392345 ] Hadoop QA commented on YARN-3318: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708923/YARN-3318.39.patch against trunk revision 867d5d2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1149 javac compiler warnings (more than the trunk's current 1148 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.api.impl.TestAMRMClient Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7198//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7198//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7198//console This message is automatically generated. Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior --- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch, YARN-3318.39.patch Create the initial framework required for using OrderingPolicies with SchedulerApplicationAttempts and integrate with the CapacityScheduler. This will include an implementation which is compatible with current FIFO behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3334: -- Attachment: YARN-3334.7.patch Last patch looks good to me, but I undid some unnecessary changes in TimelineClientImpl (which seem to have been added for code debugging). Will hold the patch for a while before committing, in case other folks want to take a look. [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334.7.patch After YARN-3039, we have a service discovery mechanism to pass the app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2942: Attachment: ConcatableAggregatedLogsProposal_v4.pdf I've just uploaded ConcatableAggregatedLogsProposal_v4.pdf, with an updated design that uses a slightly modified version of the CombinedAggregatedLogFormat I already wrote (now ConcatableAggregatedLogFormat) and would use HDFS concat to combine the files. [~zjshen], [~kasha], and [~vinodkv], can you take a look at it? Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, ConcatableAggregatedLogsProposal_v4.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
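For context on the proposal above, a minimal sketch of the HDFS concat call it relies on is shown below; the paths are made up for illustration, and the actual layout would follow the design in the PDF.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of combining per-node aggregated log files with HDFS concat,
// as the v4 proposal describes; the paths below are made up for illustration.
public class ConcatLogsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path target = new Path("/tmp/logs/user/logs/application_1_0001/combined.log");
    Path[] perNodeLogs = new Path[] {
        new Path("/tmp/logs/user/logs/application_1_0001/node-1_45454"),
        new Path("/tmp/logs/user/logs/application_1_0001/node-2_45454")
    };

    // concat() appends the source files to the target and removes them;
    // it is only supported by file systems such as HDFS.
    fs.concat(target, perNodeLogs);
  }
}
{code}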
[jira] [Commented] (YARN-3439) RM fails to renew token when Oozie launcher leaves before sub-job finishes
[ https://issues.apache.org/jira/browse/YARN-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393433#comment-14393433 ] Rohini Palaniswamy commented on YARN-3439: -- bq. Essentially the idea is to reference count the tokens and only attempt to cancel them when the token is no longer referenced. Would be a good idea. I think this is the third time we have had delegation token renewal broken for Oozie with the Hadoop 2.x line. RM fails to renew token when Oozie launcher leaves before sub-job finishes -- Key: YARN-3439 URL: https://issues.apache.org/jira/browse/YARN-3439 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3439.001.patch When the Oozie launcher runs a standard MapReduce job (not Pig) it doesn't linger waiting for the sub-job to finish. At that point the RM stops renewing delegation tokens for the launcher job which wreaks havoc on the sub-job if the sub-job runs long enough for the tokens to expire. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning stats to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393494#comment-14393494 ] Wangda Tan commented on YARN-2901: -- +1 for the patch. Will commit it today if no opposite opinions. Add errors and warning stats to RM, NM web UI - Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
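The "custom appender" idea floated in the description above could look roughly like the sketch below, assuming log4j 1.x (which the RM and NM use for logging); it only counts levels, leaving out the time bucketing and top-N exception tracking, and is not the appender in the actual patch.
{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

// Rough sketch of the custom-appender idea: count ERROR/WARN events as they
// are logged. Time bucketing (5 min/1 hour/...) and top-N exceptions omitted.
public class ErrorWarningCountingAppender extends AppenderSkeleton {
  private static final AtomicLong errors = new AtomicLong();
  private static final AtomicLong warnings = new AtomicLong();

  @Override
  protected void append(LoggingEvent event) {
    if (event.getLevel().isGreaterOrEqual(Level.ERROR)) {
      errors.incrementAndGet();
    } else if (event.getLevel().isGreaterOrEqual(Level.WARN)) {
      warnings.incrementAndGet();
    }
  }

  public static long getErrorCount()   { return errors.get(); }
  public static long getWarningCount() { return warnings.get(); }

  @Override
  public void close() { }

  @Override
  public boolean requiresLayout() { return false; }
}
{code}
The counts could then be surfaced by the RM/NM web UI pages the JIRA proposes.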
[jira] [Commented] (YARN-3388) Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit
[ https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393493#comment-14393493 ] Hadoop QA commented on YARN-3388: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709050/YARN-3388-v1.patch against trunk revision eccb7d4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.TestRM The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7201//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7201//console This message is automatically generated. Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit - Key: YARN-3388 URL: https://issues.apache.org/jira/browse/YARN-3388 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch When there are multiple active users in a queue, it should be possible for those users to make use of capacity up-to max_capacity (or close). The resources should be fairly distributed among the active users in the queue. This works pretty well when there is a single resource being scheduled. However, when there are multiple resources the situation gets more complex and the current algorithm tends to get stuck at Capacity. Example illustrated in subsequent comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393545#comment-14393545 ] Wangda Tan commented on YARN-2729: -- Some comments: *1) Configuration:* Instead of distributed_node_labels_prefix, do you think it is better to name it yarn.node-labels.nm.provider? The distributed.node-labels-provider doesn't clearly indicate that it runs on the NM side. I don't want to expose the class in config unless it is necessary; now we have two options, one script-based and another config-based. We can set the two as a white-list; if a given value is not in the white-list, try to load a class from the name. So the option will be: yarn.node-labels.nm.provider = config/script/other-class-name. Revisiting the interval, I think it's better to make it a provider configuration instead of a script-provider-only configuration, since config/script will share it (I remember I had some back-and-forth opinions here). If you agree with the above, the name could be: yarn.node-labels.nm.provider-fetch-interval-ms (and provider-fetch-timeout-ms) And script-related options could be: yarn.node-labels.nm.provider.script.path/opts *2) Implementation of ScriptBasedNodeLabelsProvider* I feel like ScriptBased and ConfigBased can share some implementations: they will both init a timer task, get the interval and run, check the timeout (meaningless for config-based), etc. Can you make an abstract class and have ScriptBased inherit from it? DISABLE_TIMER_CONFIG should be a part of YarnConfiguration; all configuration defaults should be a part of YarnConfiguration. canRun - rename to something like verifyConfiguredScript, and directly throw an exception when something is wrong (so that the admin can know what really happened, such as file not found, no execute permission, etc.), and it should be private and non-static. checkAndThrowLabelName should be called in NodeStatusUpdaterImpl. Labels need to be trim()'d when calling checkAndThrowLabelName(...). Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, YARN-2729.20150402-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
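For illustration, the provider configuration keys proposed in the comment above could be set roughly as follows; these names are Wangda's suggestions in this thread rather than settled configuration keys, and the script path is a made-up example.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of the provider configuration proposed above; the property names are
// suggestions from the comment, not final YarnConfiguration keys.
public class NodeLabelProviderConfigSketch {
  public static Configuration example() {
    Configuration conf = new Configuration();
    // "config", "script", or a fully qualified class name outside the white-list
    conf.set("yarn.node-labels.nm.provider", "script");
    conf.setLong("yarn.node-labels.nm.provider-fetch-interval-ms", 60000L);
    conf.setLong("yarn.node-labels.nm.provider-fetch-timeout-ms", 20000L);
    conf.set("yarn.node-labels.nm.provider.script.path", "/usr/local/bin/node-labels.sh");
    conf.set("yarn.node-labels.nm.provider.script.opts", "");
    return conf;
  }
}
{code}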
[jira] [Updated] (YARN-3439) RM fails to renew token when Oozie launcher leaves before sub-job finishes
[ https://issues.apache.org/jira/browse/YARN-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3439: - Attachment: YARN-3439.001.patch Daryn is out so posting a prototype patch he developed to get some early feedback. Note that this patch can't go in as-is, as it's a work-in-progress that hacks out the automatic HDFS delegation token logic that was added as part of YARN-2704. Essentially the idea is to reference count the tokens and only attempt to cancel them when the token is no longer referenced. Since the launcher job won't complete until it has successfully submitted the sub-job(s), the token will remain referenced throughout the lifespan of the workflow even if the launcher job exits early. RM fails to renew token when Oozie launcher leaves before sub-job finishes -- Key: YARN-3439 URL: https://issues.apache.org/jira/browse/YARN-3439 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3439.001.patch When the Oozie launcher runs a standard MapReduce job (not Pig) it doesn't linger waiting for the sub-job to finish. At that point the RM stops renewing delegation tokens for the launcher job which wreaks havoc on the sub-job if the sub-job runs long enough for the tokens to expire. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
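The reference-counting idea described above can be sketched as follows; the class and method names here are hypothetical and do not reflect the structure of Daryn's actual patch.
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the reference-counting idea: a token is only eligible
// for cancellation once no running application references it.
public class TokenRefCounterSketch<T> {
  private final ConcurrentMap<T, AtomicInteger> refCounts =
      new ConcurrentHashMap<T, AtomicInteger>();

  /** Called when an application that carries this token starts. */
  public void addReference(T token) {
    AtomicInteger count = refCounts.get(token);
    if (count == null) {
      AtomicInteger fresh = new AtomicInteger();
      count = refCounts.putIfAbsent(token, fresh);
      if (count == null) {
        count = fresh;
      }
    }
    count.incrementAndGet();
  }

  /**
   * Called when an application that carries this token finishes.
   * @return true if the token is no longer referenced and may be cancelled.
   */
  public boolean removeReference(T token) {
    AtomicInteger count = refCounts.get(token);
    if (count == null) {
      return true; // unknown token, nothing keeps it alive
    }
    if (count.decrementAndGet() <= 0) {
      refCounts.remove(token);
      return true;
    }
    return false;
  }
}
{code}
With this scheme, a launcher job finishing early only drops its reference; the sub-job's reference keeps the token alive (and renewed) until the whole workflow completes.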
[jira] [Updated] (YARN-3440) ResourceUsage should be copy-on-write
[ https://issues.apache.org/jira/browse/YARN-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3440: Description: In {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceUsage}}, even if it is thread-safe, but Resource returned by getters could be updated by another thread. All Resource objects in ResourceUsage should be copy-on-write, reader will always get a non-changed Resource. And changes apply on Resource acquired by caller will not affect original Resource. was: In {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceUsage }}, even if it is thread-safe, but Resource returned by getters could be updated by another thread. All Resource objects in ResourceUsage should be copy-on-write, reader will always get a non-changed Resource. And changes apply on Resource acquired by caller will not affect original Resource. ResourceUsage should be copy-on-write - Key: YARN-3440 URL: https://issues.apache.org/jira/browse/YARN-3440 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler, yarn Reporter: Wangda Tan Assignee: Li Lu In {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceUsage}}, even if it is thread-safe, but Resource returned by getters could be updated by another thread. All Resource objects in ResourceUsage should be copy-on-write, reader will always get a non-changed Resource. And changes apply on Resource acquired by caller will not affect original Resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
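A minimal sketch of the copy-on-write behaviour described above is shown below: getters hand out a copy and setters install a fresh object, so a Resource obtained by a caller never changes underneath it. This is an illustration under those assumptions, not the actual ResourceUsage code.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch only: a single "used" field standing in for the many Resource fields
// that ResourceUsage tracks per label.
public class CopyOnWriteUsageSketch {
  private volatile Resource used = Resource.newInstance(0, 0);

  public Resource getUsed() {
    // Return a defensive copy; mutating it does not affect the shared state.
    return Resources.clone(used);
  }

  public void setUsed(Resource newUsed) {
    // Install a copy so later changes to the argument do not leak in either.
    this.used = Resources.clone(newUsed);
  }
}
{code}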
[jira] [Commented] (YARN-2890) MiniMRYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393408#comment-14393408 ] Mit Desai commented on YARN-2890: - [~hitesh], did you have any comments on the patch? MiniMRYarnCluster should turn on timeline service if configured to do so Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393412#comment-14393412 ] Hudson commented on YARN-3415: -- FAILURE: Integrated in Hadoop-trunk-Commit #7497 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7497/]) YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/CHANGES.txt Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue -- Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Fix For: 2.8.0 Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constrain kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the spark AM went down without explicitly releasing its requested containers and then one of those containers memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3437: -- Attachment: YARN-3437.001.patch Patch v.1 posted. This is basically a modification of the YARN-2556 patch (and clean-up of issues etc.) to work against the timeline service v.2. Since the new distributed timeline service collectors are tied to applications, I chose the approach of instantiating the base timeline collector within the mapper task, rather than going through the timeline client. Making it go through the timeline client has a number of challenges (see YARN-3378). But this should still be effective as a way to exercise the bulk of the write performance and scalability. You can try this out by doing, for example,
{code}
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT-tests.jar timelineperformance -m 10 -t 1000
{code}
You'll get output like
{noformat}
TRANSACTION RATE (per mapper): 5027.652086 ops/s
IO RATE (per mapper): 5027.652086 KB/s
TRANSACTION RATE (total): 50276.520865 ops/s
IO RATE (total): 50276.520865 KB/s
{noformat}
It is still using pretty simple entities to write to the storage. I'll work on adding handling of job history files later in a different JIRA. I would greatly appreciate your review. Thanks! convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393554#comment-14393554 ] Hadoop QA commented on YARN-3437: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709078/YARN-3437.001.patch against trunk revision 6a6a59d. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7206//console This message is automatically generated. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3388) Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit
[ https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated YARN-3388: - Attachment: YARN-3388-v1.patch Hi [~leftnoteasy]. Uploaded a new version of patch that addresses the inefficiency and adds unit tests. I think label support is better left for a separate jira when labels are fully working with userlimits. Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit - Key: YARN-3388 URL: https://issues.apache.org/jira/browse/YARN-3388 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch When there are multiple active users in a queue, it should be possible for those users to make use of capacity up-to max_capacity (or close). The resources should be fairly distributed among the active users in the queue. This works pretty well when there is a single resource being scheduled. However, when there are multiple resources the situation gets more complex and the current algorithm tends to get stuck at Capacity. Example illustrated in subsequent comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393451#comment-14393451 ] Wangda Tan commented on YARN-2729: -- Apparently Jenkins ran wrong tests, rekicked Jenkins. Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, YARN-2729.20150402-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393450#comment-14393450 ] Hadoop QA commented on YARN-2942: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709065/ConcatableAggregatedLogsProposal_v4.pdf against trunk revision 6a6a59d. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7204//console This message is automatically generated. Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, ConcatableAggregatedLogsProposal_v4.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
[ https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2666: Attachment: (was: YARN-2666.000.patch) TestFairScheduler.testContinuousScheduling fails Intermittently --- Key: YARN-2666 URL: https://issues.apache.org/jira/browse/YARN-2666 Project: Hadoop YARN Issue Type: Test Components: scheduler Reporter: Tsuyoshi Ozawa Assignee: zhihai xu The test fails on trunk.
{code}
Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)  Time elapsed: 0.582 sec  <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3439) RM fails to renew token when Oozie launcher leaves before sub-job finishes
[ https://issues.apache.org/jira/browse/YARN-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393410#comment-14393410 ] Hadoop QA commented on YARN-3439: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709044/YARN-3439.001.patch against trunk revision eccb7d4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7200//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7200//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7200//console This message is automatically generated. RM fails to renew token when Oozie launcher leaves before sub-job finishes -- Key: YARN-3439 URL: https://issues.apache.org/jira/browse/YARN-3439 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3439.001.patch When the Oozie launcher runs a standard MapReduce job (not Pig) it doesn't linger waiting for the sub-job to finish. At that point the RM stops renewing delegation tokens for the launcher job which wreaks havoc on the sub-job if the sub-job runs long enough for the tokens to expire. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393556#comment-14393556 ] Hadoop QA commented on YARN-2729: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708788/YARN-2729.20150402-1.patch against trunk revision 6a6a59d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7205//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7205//console This message is automatically generated. Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, YARN-2729.20150402-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393388#comment-14393388 ] Vinod Kumar Vavilapalli commented on YARN-3318: --- bq. I think it should be fine to make policy interfaces define as well as CapacityScheduler changes together with this patch (only for FifoOrderingPolicy), it's good to see how interfaces and policies work in CS, is it easy or not, etc. We can still do this with patches on two JIRAs - one for the framework, one for CS, one for FS etc. The Fifo one can be here for demonstration, no problem with that. Why is it so hard to focus on one thing in one JIRA? Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior --- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch, YARN-3318.39.patch Create the initial framework required for using OrderingPolicies with SchedulerApplicationAttempts and integrate with the CapacityScheduler. This will include an implementation which is compatible with current FIFO behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393530#comment-14393530 ] Sangjin Lee commented on YARN-3437: --- Added a few folks for review. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-3415: - Summary: Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue (was: Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue -- Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constrain kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the spark AM went down without explicitly releasing its requested containers and then one of those containers memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393374#comment-14393374 ] Wangda Tan commented on YARN-3318: -- [~cwelch], I took a look at your latest patch as well as [~vinodkv]'s suggestions, comments:
*1. I prefer what Vinod suggested: split SchedulerProcess into QueueSchedulable and AppSchedulable, to avoid notes in the FairScheduler interface Schedulable like:*
{code}
/** Start time for jobs in FIFO queues; meaningless for QueueSchedulables.*/
{code}
They can both inherit {{Schedulable}}. With this patch, we can limit ourselves to the AppSchedulable and Schedulable definitions. Also, regarding the schedulable comparator, not all Schedulables fit all comparators; it's meaningless to do FIFO scheduling at the parent queue level. I think:
{code}
Schedulable contains ResourceUsage (class), and name
In addition, AppSchedulable contains compareSubmissionOrderTo(AppSchedulable) and Priority
{code}
*2. About the inheritance relationships between interfaces/classes, it's not very clear to me now; I spent some time getting what they're doing. My suggestion is:*
{code}
FairOrderingPolicy/FifoOrderingPolicy implement OrderingPolicy
FairOrderingPolicy and FifoOrderingPolicy could inherit from AbstractOrderingPolicy with common implementations
FairOrderingPolicy/FifoOrderingPolicy use FairSchedulableComparator/FifoSchedulableComparator
There's no need to invent a SchedulerComparator interface; using the existing Java Comparator interface should be simple and enough.
{code}
*3. Regarding the relationship between OrderingPolicy and comparator:* I understand the purpose of SchedulerComparator is to reduce unnecessary re-sorts of Schedulables being added/modified in OrderingPolicy, but actually we can 1) do this in the OrderingPolicy itself. For example, with my above suggestion, FifoOrderingPolicy will simply ignore container-changed notifications. 2) The Comparator doesn't know about global info; only the OrderingPolicy knows how combinations of Comparators act, and I don't want containerAllocate/Release coupled into the Comparator interface. And we don't need a separate CompoundComparator; this can be put in AbstractOrderingPolicy.
*4. Regarding configuration (CapacitySchedulerConfiguration):* I think we don't need ORDERING_POLICY_CLASS; two options for a very similar purpose can confuse users. I suggest only leaving ordering-policy, and its name can be fifo or fair, regardless of its internal comparator implementation. And in the future we can add priority-fifo, priority-fair. (Note the - in the name doesn't mean AND only; it could be a collaboration of the two instead of a simple combination.) If the user specifies a name not in the white-list of short names given by us, we will try to load a class with that name.
*5. Regarding the longer term plan, LimitPolicy:* This part seems not well discussed; to limit the scope of this JIRA, I think its implementation and definition should happen in a separate ticket. For the longer plan, considering YARN-2986 as well, we may configure a queue like the following:
{code}
<queue name="a">
  <queues>
    <queue name="a1">
      <policy-properties>
        <ordering-policy>fair</ordering-policy>
        <limit-policy>
          <user-limit-policy>
            <enabled>true</enabled>
            <user-limit-percentage>50</user-limit-percentage>
          </user-limit-policy>
          <queue-capacity-policy>
            <capacity>..</capacity>
            <max-capacity>..</max-capacity>
          </queue-capacity-policy>
        </limit-policy>
      </policy-properties>
    </queue>
  </queues>
</queue>
{code}
The changes of this patch in CapacitySchedulerConfiguration seem reasonable; as Craig mentioned, simply marking it unstable or experimental should be enough. The longer term is to define and stabilize YARN-2986 to make a real unified scheduler.
*6. Regarding the scope of this JIRA* I think it should be fine to make the policy interface definitions as well as the CapacityScheduler changes together with this patch (only for FifoOrderingPolicy); it's good to see how the interfaces and policies work in CS, whether it is easy or not, etc. And the following I suggest moving to a separate ticket: 1) UI (Web and CLI) 2) REST 3) PB related changes. As the patch keeps changing, you don't have to maintain the above changes together with the patch. Please feel free to let me know your thoughts. Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior --- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type:
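For illustration only, the class relationships sketched in points 2) and 3) of the comment above could look roughly like this; the names follow the suggestions in the thread, SchedulableEntity is a placeholder, and none of this is the code in the actual patch.
{code}
import java.util.Collection;
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

// Rough illustration of the suggested hierarchy: policies use a plain
// java.util.Comparator, share common code in an abstract base class, and each
// policy decides how to react to container changes.
interface SchedulableEntity { }

interface OrderingPolicy<S extends SchedulableEntity> {
  Collection<S> getAssignmentIterator();
  void addSchedulableEntity(S s);
  void removeSchedulableEntity(S s);
  void containerAllocated(S s);
  void containerReleased(S s);
}

abstract class AbstractOrderingPolicy<S extends SchedulableEntity>
    implements OrderingPolicy<S> {
  protected final ConcurrentSkipListSet<S> entities;

  protected AbstractOrderingPolicy(Comparator<S> comparator) {
    this.entities = new ConcurrentSkipListSet<S>(comparator);
  }

  @Override public Collection<S> getAssignmentIterator() { return entities; }
  @Override public void addSchedulableEntity(S s) { entities.add(s); }
  @Override public void removeSchedulableEntity(S s) { entities.remove(s); }
}

class FifoOrderingPolicy<S extends SchedulableEntity> extends AbstractOrderingPolicy<S> {
  FifoOrderingPolicy(Comparator<S> fifoComparator) { super(fifoComparator); }
  // FIFO order does not depend on usage, so container changes are ignored.
  @Override public void containerAllocated(S s) { }
  @Override public void containerReleased(S s) { }
}

class FairOrderingPolicy<S extends SchedulableEntity> extends AbstractOrderingPolicy<S> {
  FairOrderingPolicy(Comparator<S> fairComparator) { super(fairComparator); }
  // Fair order depends on current usage, so re-position the entity on change.
  @Override public void containerAllocated(S s) { entities.remove(s); entities.add(s); }
  @Override public void containerReleased(S s)  { entities.remove(s); entities.add(s); }
}
{code}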
[jira] [Commented] (YARN-3365) Add support for using the 'tc' tool via container-executor
[ https://issues.apache.org/jira/browse/YARN-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393496#comment-14393496 ] Hadoop QA commented on YARN-3365: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707355/YARN-3365.003.patch against trunk revision 6a6a59d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7203//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7203//console This message is automatically generated. Add support for using the 'tc' tool via container-executor -- Key: YARN-3365 URL: https://issues.apache.org/jira/browse/YARN-3365 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana Attachments: YARN-3365.001.patch, YARN-3365.002.patch, YARN-3365.003.patch We need the following functionality : 1) modify network interface traffic shaping rules - to be able to attach a qdisc, create child classes etc 2) read existing rules in place 3) read stats for the various classes Using tc requires elevated privileges - hence this functionality is to be made available via container-executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393264#comment-14393264 ] Zhijie Shen commented on YARN-3391: --- [~vrushalic], it sounds good to me to set aside the disagreement on the flow name default and move on. As far as I can tell, with the current context info data flow, it's quite simple to change the default value if we figure out a better one later. In addition, the previous debate is also related to how we show flows on the web UI by default. I think we can go back and revisit the defaults once we reach the web UI work, when we should have a better idea about it. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393351#comment-14393351 ] zhihai xu commented on YARN-3415: - [~sandyr], thanks for the review, The latest patch YARN-3415.002.patch is rebased on the latest code base and it passed the Jenkins test. Let me know whether you have more comments for the patch. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue - Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constrain kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the spark AM went down without explicitly releasing its requested containers and then one of those containers memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393347#comment-14393347 ] Vinod Kumar Vavilapalli commented on YARN-3318: --- bq. I think it is useful to split off CS changes into their own JIRA. We can strictly focus on the policy framework here. You missed this, let's please do this. bq. well, I'd actually talked Wangda Tan into SchedulerProcess. So, we can chew on this a bit more and see where we go SchedulerProcess is definitely misleading. It seems to point to a process that is doing scheduling. What you need is a Schedulable / SchedulableEntity / Consumer etc. You could also say SchedulableProcess, but Process is way too overloaded. bq. The goal is to make this available quickly but iteratively, keeping the changes small but making them available for use and feedback. (..) We should grow it organically, gradually, iteratively, think of it as a facet of the policy framework hooked up and available but with more to follow I agree with this, but we are not in a position to support the APIs, CLI, and config names in a supportable manner yet. They may or may not change depending on how parent queue policies and limit policies evolve. For that reason alone, I am saying that (1) don't make the configurations public yet, or put a warning saying that they are unstable, and (2) don't expose them in CLI/REST APIs yet. It's okay to put them in the web UI; web UI scraping is not a contract. bq. You add/remove applications to/from LeafQueue's policy but addition/removal of containers is an event... bq. This has been factored differently along Wangda Tan's suggestion, it should now be consistent It's a bit better now, although we are hard-coding Containers. Can revisit this later. Other comments
- SchedulerApplicationAttempt.getDemand() should be private.
- SchedulerProcess
-- updateCaches() - updateState() / updateSchedulingState() as that is what it is doing?
-- getCachedConsumption() / getCachedDemand(): simply getCurrent*()?
- SchedulerComparator
-- We aren't comparing Schedulers. Given the current name, it should have been SchedulerProcessComparator, but SchedulerProcess itself should be renamed as mentioned before.
-- What is the need for reorderOnContainerAllocate() / reorderOnContainerRelease()?
- Move all the comparator-related classes into their own package.
- SchedulerComparatorPolicy
-- This is really a ComparatorBasedOrderingPolicy. Do we really see a non-comparator-based ordering-policy? We are unnecessarily adding two abstractions - policies and comparators.
-- Use className.getName() instead of hardcoded strings like org.apache.hadoop.yarn.server.resourcemanager.scheduler.policy.FifoComparator
Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior --- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch Create the initial framework required for using OrderingPolicies with SchedulerApplicationAttempts and integrate with the CapacityScheduler. This will include an implementation which is compatible with current FIFO behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
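A small, hypothetical illustration of the last bullet in the comment above; the configuration key used here is made up for the example.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustration of the last point above: refer to a class via Class#getName()
// rather than a hardcoded string. The configuration key here is hypothetical.
public class PolicyClassConfigSketch {
  public static void set(Configuration conf, Class<?> comparatorClass) {
    // e.g. set(conf, SomeComparator.class) instead of
    // conf.set(key, "org.apache.hadoop.yarn...policy.SomeComparator")
    conf.set("yarn.scheduler.capacity.root.default.ordering-policy.comparator-class",
        comparatorClass.getName());
  }
}
{code}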
[jira] [Created] (YARN-3442) Consider abstracting out user, app limits etc into some sort of a LimitPolicy
Vinod Kumar Vavilapalli created YARN-3442: - Summary: Consider abstracting out user, app limits etc into some sort of a LimitPolicy Key: YARN-3442 URL: https://issues.apache.org/jira/browse/YARN-3442 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Similar to the policies being added in YARN-3318 and YARN-3441 for leaf and parent queues, we should consider extracting an abstraction for limits too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393440#comment-14393440 ] Vinod Kumar Vavilapalli commented on YARN-3318: --- Filed YARN-3441 and YARN-3442 for parent queues and for limits. Create Initial OrderingPolicy Framework, integrate with CapacityScheduler LeafQueue supporting present behavior --- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch, YARN-3318.39.patch Create the initial framework required for using OrderingPolicies with SchedulerApplicaitonAttempts and integrate with the CapacityScheduler. This will include an implementation which is compatible with current FIFO behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393441#comment-14393441 ] Junping Du commented on YARN-3334: -- Thanks [~zjshen] for review and comments! bq. but I undo the some unnecessary change in TimelineClientImpl (which seems to be adde for code debugging). I think that is a necessary change. The previous message did not convey much info; in particular, it returned the same message for no response and for a response with a failure. Also, the error code should be logged even when debug is not on, because this is a serious failure and should be reported in a production environment. Thoughts? [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334.7.patch After YARN-3039, we have service discovery mechanism to pass app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
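A minimal sketch of the logging behavior argued for above - distinguishing "no response" from "response with a failure status" and reporting both at ERROR level rather than only under debug. Class and variable names are illustrative, not the actual TimelineClientImpl code:
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class TimelinePutLoggingSketch {
  private static final Log LOG = LogFactory.getLog(TimelinePutLoggingSketch.class);

  void checkPutResponse(Object response, int httpStatus) {
    if (response == null) {
      // Case 1: no response at all from the timeline server.
      LOG.error("Failed to get a response while posting entities to the timeline server.");
    } else if (httpStatus != 200) {
      // Case 2: a response came back but carries a failure status. Log the code
      // at ERROR level unconditionally instead of hiding it behind isDebugEnabled().
      LOG.error("Timeline server returned failure status: " + httpStatus);
    }
  }
}
{code}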
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393476#comment-14393476 ] zhihai xu commented on YARN-3415: - Thanks [~ragarwal] for valuable feedback and filing this issue. Thanks [~sandyr] for valuable feedback and committing the patch! Greatly appreciated. Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue -- Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Fix For: 2.8.0 Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a Spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constraint kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the Spark AM went down without explicitly releasing its requested containers and then one of those containers' memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
[ https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2666: Attachment: YARN-2666.000.patch TestFairScheduler.testContinuousScheduling fails Intermittently --- Key: YARN-2666 URL: https://issues.apache.org/jira/browse/YARN-2666 Project: Hadoop YARN Issue Type: Test Components: scheduler Reporter: Tsuyoshi Ozawa Assignee: zhihai xu Attachments: YARN-2666.000.patch The test fails on trunk. {code} Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) Time elapsed: 0.582 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
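One generic way to make an assertion like the "expected:2 but was:1" above robust against the timing of the continuous-scheduling thread is to poll until the value settles or a deadline passes. This is only an illustration of the pattern, not the actual YARN-2666 fix:
{code}
import java.util.concurrent.Callable;
import org.junit.Assert;

class EventualAssert {
  // Polls the supplied value until it equals 'expected' or 'timeoutMs' elapses,
  // then asserts, so a slightly late scheduling pass does not fail the test.
  static void assertEventually(Callable<Integer> actual, int expected, long timeoutMs)
      throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline && actual.call() != expected) {
      Thread.sleep(50);
    }
    Assert.assertEquals(expected, (int) actual.call());
  }
}
{code}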
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393563#comment-14393563 ] Wangda Tan commented on YARN-3410: -- Thanks for your comment, [~rohithsharma]. But what's the use case of using rmadmin to remove a state while the RM is running? The command is just a way to recover when an app entered an unexpected state and the RM cannot get started; unless there's a use case for doing that, I suggest scoping this to an RM startup option like YARN-2131. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical When the RM state store enters an unexpected state (one example is YARN-2340: an attempt is not in a final state but the app has already completed), the RM can never come up unless the RMStateStore is formatted. I think we should support removing individual application records from RMStateStore so the RM admin can choose between waiting for a fix and formatting the state store. In addition, the RM should be able to report all fatal errors (which will shut down the RM) when doing app recovery; this can save the admin some time in removing apps in a bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393700#comment-14393700 ] Karthik Kambatla commented on YARN-2942: (Canceled the patch to stop Jenkins from evaluating the design doc :) ) [~rkanter] - thanks for updating the design doc. A couple of comments: # If there is an NM X actively concatenating its logs and NM Y can't acquire the lock, what happens? ## Does it do a blocking-wait? If yes, this should likely be in a separate thread. ## I would like for it to be non-blocking. How about a LogConcatenationService in the NM? This service is brought up if you enable log concatenation. This service would periodically go through all of its past aggregated logs and concatenate those that it can acquire a lock for. Delayed concatenation should be okay because we are doing this primarily to handle the problem HDFS has with small files. Also, this way, we don't have to do anything different for NM restart. Forward-looking, this concat service could potentially take input on how busy HDFS is. # I didn't completely understand the point about a config to specify the format. Are you suggesting we have two different on/off configs - one to turn on concatenation and one to specify the format the JHS should be reading? I think we should have just one config, clearly documented so that turning this on on an NM (writer) requires the JHS (reader) to already have it enabled. In case of rolling upgrades, this translates to requiring a JHS upgrade prior to NM upgrade. Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, ConcatableAggregatedLogsProposal_v4.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
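A rough sketch of the non-blocking LogConcatenationService idea described in the comment above: a periodic task walks past aggregated logs and concatenates only those whose lock it can acquire, skipping the rest until the next cycle. All names and helper methods are hypothetical, not part of the YARN-2942 patches:
{code}
import java.util.Collections;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class LogConcatenationServiceSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(long intervalSecs) {
    scheduler.scheduleWithFixedDelay(this::concatenatePending,
        intervalSecs, intervalSecs, TimeUnit.SECONDS);
  }

  private void concatenatePending() {
    for (String appLogDir : listAggregatedLogDirs()) {
      if (tryAcquireLock(appLogDir)) {   // non-blocking attempt
        try {
          concatenate(appLogDir);        // merge per-NM files for this app
        } finally {
          releaseLock(appLogDir);
        }
      }
      // else: another NM holds the lock; this app is retried on the next cycle.
    }
  }

  // Hypothetical helpers, elided.
  private Iterable<String> listAggregatedLogDirs() { return Collections.emptyList(); }
  private boolean tryAcquireLock(String dir) { return false; }
  private void releaseLock(String dir) { }
  private void concatenate(String dir) { }
}
{code}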
[jira] [Updated] (YARN-3366) Outbound network bandwidth : classify/shape traffic originating from YARN containers
[ https://issues.apache.org/jira/browse/YARN-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-3366: Attachment: YARN-3366.001.patch Attaching a patch with an implementation of traffic classification/shaping for traffic originating from YARN containers. This patch depends on changes/patches from https://issues.apache.org/jira/browse/YARN-3365 and https://issues.apache.org/jira/browse/YARN-3443 Outbound network bandwidth : classify/shape traffic originating from YARN containers Key: YARN-3366 URL: https://issues.apache.org/jira/browse/YARN-3366 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sidharta Seethana Assignee: Sidharta Seethana Attachments: YARN-3366.001.patch In order to be able to isolate based on/enforce outbound traffic bandwidth limits, we need a mechanism to classify/shape network traffic in the nodemanager. For more information on the design, please see the attached design document in the parent JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3435) AM container to be allocated Appattempt AM container shown as null
[ https://issues.apache.org/jira/browse/YARN-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393975#comment-14393975 ] Hadoop QA commented on YARN-3435: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709003/YARN-3435.001.patch against trunk revision bad070f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7208//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7208//console This message is automatically generated. AM container to be allocated Appattempt AM container shown as null -- Key: YARN-3435 URL: https://issues.apache.org/jira/browse/YARN-3435 Project: Hadoop YARN Issue Type: Bug Environment: 1RM,1DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Trivial Attachments: Screenshot.png, YARN-3435.001.patch Submit yarn application Open http://rm:8088/cluster/appattempt/appattempt_1427984982805_0003_01 Before the AM container is allocated -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393631#comment-14393631 ] Zhijie Shen commented on YARN-3334: --- If so, I suggest combining the two messages, and recording an error-level log (the first message is actually useless if we always report the second one). [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334.7.patch After YARN-3039, we have service discovery mechanism to pass app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393711#comment-14393711 ] Wangda Tan commented on YARN-3434: -- [~tgraves], I feel like this issue and several related issues are solved by YARN-3243 already. Could you please check if this problem is already solved? Thanks, Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves ULF was set to 1.0 User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, each 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-685) Capacity Scheduler is not distributing the reducers tasks across the cluster
[ https://issues.apache.org/jira/browse/YARN-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-685. - Resolution: Invalid According to test result from [~raviprak], CS fairly distributes reducers to NMs in the cluster. Resolving this as invalid and please reopen this if you still think this is a problem. Capacity Scheduler is not distributing the reducers tasks across the cluster Key: YARN-685 URL: https://issues.apache.org/jira/browse/YARN-685 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.0.4-alpha Reporter: Devaraj K If we have reducers whose total memory required to complete is less than the total cluster memory, it is not assigning the reducers to all the nodes uniformly(~uniformly). Also at that time there are no other jobs or job tasks running in the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393750#comment-14393750 ] Sangjin Lee commented on YARN-3051: --- bq. To plot graphs based on timeseries data, we may need to provide a time window for metrics too. This would be useful in case of getEntity() API. So do we specify this time window separately for each metric to be retrieved or same for all metrics ? My sense is that it should be fine to use the same time window for all metrics. [~gtCarrera9]? [~zjshen]? bq. Queries based on relations i.e. queries such as get all containers for an app. We can return relatesto field while querying for an app. And then client can use this result to fetch detailed info about related entities. Is that fine ? Or we have to be handle it as part of a single query ? For now, let's assume 2 queries from the client side. My thinking was that this is an optimization. If the storage can return two levels of entities efficiently, we could potentially exploit it. But maybe that's nice to have at the moment. bq. Some understanding on how flow id, flow run id will be stored is required. Li just posted the schema design in YARN-3134. That should be helpful. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
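A hypothetical reader-side signature illustrating the "single time window shared by all metrics" option discussed above. Names and types are illustrative only, not the interface that eventually lands in YARN-3051:
{code}
import java.io.IOException;
import java.util.Map;

interface TimelineReaderSketch {
  // Returns the metric time series for one entity, with every requested metric
  // restricted to the same [metricsTimeBegin, metricsTimeEnd] window.
  Map<String, Object> getEntityMetrics(String clusterId, String userId, String flowId,
      long flowRunId, String appId, String entityType, String entityId,
      long metricsTimeBegin, long metricsTimeEnd) throws IOException;
}
{code}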
[jira] [Commented] (YARN-3365) Add support for using the 'tc' tool via container-executor
[ https://issues.apache.org/jira/browse/YARN-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393773#comment-14393773 ] Hudson commented on YARN-3365: -- FAILURE: Integrated in Hadoop-trunk-Commit #7500 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7500/]) YARN-3365. Enhanced NodeManager to support using the 'tc' tool via container-executor for outbound network traffic control. Contributed by Sidharta Seethana. (vinodkv: rev b21c72777ae664b08fd1a93b4f88fa43f2478d94) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java Add support for using the 'tc' tool via container-executor -- Key: YARN-3365 URL: https://issues.apache.org/jira/browse/YARN-3365 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana Fix For: 2.8.0 Attachments: YARN-3365.001.patch, YARN-3365.002.patch, YARN-3365.003.patch We need the following functionality : 1) modify network interface traffic shaping rules - to be able to attach a qdisc, create child classes etc 2) read existing rules in place 3) read stats for the various classes Using tc requires elevated privileges - hence this functionality is to be made available via container-executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3444) Fixed typo (capability)
[ https://issues.apache.org/jira/browse/YARN-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393861#comment-14393861 ] Gabor Liptak commented on YARN-3444: Pull request at https://github.com/apache/hadoop/pull/15 Fixed typo (capability) --- Key: YARN-3444 URL: https://issues.apache.org/jira/browse/YARN-3444 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Reporter: Gabor Liptak Priority: Minor Fixed typo (capability) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) RMTimelineCollector should have the context info of each app
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393615#comment-14393615 ] Sangjin Lee commented on YARN-3390: --- I think we need to either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier from the perspective of callers of putEntities(). RMTimelineCollector should have the context info of each app Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: YARN-3134DataSchema.pdf After some community discussion we're finalizing the Phoenix data schema design for the very first phase. In this phase we focus on storing basic entities and their metrics, configs, and events. The attached document is a summary of our discussion results. Comments are more than welcome. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify our implementation of reading/writing data from/to HBase, and make it easy to build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
[ https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393743#comment-14393743 ] zhihai xu commented on YARN-2666: - Hi [~ozawa], I rebased the patch YARN-2666.000.patch on the latest code base and it passed the Jenkins test. Do you have time to review/commit the patch? Many thanks! TestFairScheduler.testContinuousScheduling fails Intermittently --- Key: YARN-2666 URL: https://issues.apache.org/jira/browse/YARN-2666 Project: Hadoop YARN Issue Type: Test Components: scheduler Reporter: Tsuyoshi Ozawa Assignee: zhihai xu Attachments: YARN-2666.000.patch The test fails on trunk. {code} Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) Time elapsed: 0.582 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2901: - Summary: Add errors and warning metrics page to RM, NM web UI (was: Add errors and warning stats to RM, NM web UI) Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
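A minimal sketch of the "custom appender" idea mentioned in the description above: count WARN and ERROR events so the web UI can render statistics. This only illustrates the mechanism and is not the Log4jWarningErrorMetricsAppender that was actually committed:
{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

public class WarnErrorCountingAppender extends AppenderSkeleton {
  private final AtomicLong warnings = new AtomicLong();
  private final AtomicLong errors = new AtomicLong();

  @Override
  protected void append(LoggingEvent event) {
    // Count by level; a real implementation would also bucket by time window
    // and track the most common exception messages.
    if (event.getLevel().isGreaterOrEqual(Level.ERROR)) {
      errors.incrementAndGet();
    } else if (event.getLevel().isGreaterOrEqual(Level.WARN)) {
      warnings.incrementAndGet();
    }
  }

  public long getWarningCount() { return warnings.get(); }
  public long getErrorCount()   { return errors.get(); }

  @Override public void close() { }
  @Override public boolean requiresLayout() { return false; }
}
{code}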
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393810#comment-14393810 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-trunk-Commit #7501 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7501/]) YARN-2901. Add errors and warning metrics page to RM, NM web UI. (Varun Vasudev via wangda) (wangda: rev bad070fe15a642cc6f3a165612fbd272187e03cb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NavBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NMController.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/ErrorsAndWarningsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/WebServer.java * hadoop-common-project/hadoop-common/src/main/conf/log4j.properties * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NMErrorsAndWarningsPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RmController.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Log4jWarningErrorMetricsAppender.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMErrorsAndWarningsPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/NavBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-3443: Attachment: YARN-3443.001.patch Attaching a patch that 1) separates out the CGroups implementation into a reusable class, 2) creates a 'PrivilegedContainerExecutor' that wraps the container-executor binary and can be used for operations that require elevated privileges, and 3) creates a simple ResourceHandler interface that can be used to plug in support for new resource types. Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM - Key: YARN-3443 URL: https://issues.apache.org/jira/browse/YARN-3443 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana Attachments: YARN-3443.001.patch The current cgroups implementation is closely tied to supporting CPU as a resource. We need to separate out CGroups support as well as provide a simple ResourceHandler subsystem that will enable us to add support for new resource types on the NM - e.g. Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
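A hypothetical shape for the "simple ResourceHandler interface" described above - lifecycle hooks that a per-resource plugin (network, disk, etc.) could implement on the NM. Names and return types are illustrative; the interface actually committed under YARN-3443 may differ:
{code}
import java.util.List;

interface ResourceHandlerSketch {
  /** One-time setup, e.g. mount/verify the relevant cgroup hierarchy. */
  List<String> bootstrap() throws Exception;

  /** Called before a container starts; may return privileged operations to run. */
  List<String> preStart(String containerId) throws Exception;

  /** Called after a container finishes; release any per-container state. */
  List<String> postComplete(String containerId) throws Exception;

  /** NM shutdown hook. */
  List<String> teardown() throws Exception;
}
{code}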
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393645#comment-14393645 ] Sangjin Lee commented on YARN-3391: --- I am fine with tabling this discussion and revisiting it later in the interest of making progress. I just wanted to add my 2 cents that this is something we already see and experience with hRaven so it's not theoretical. That's the context from our side. The way I see it is that apps that do not have the flow name are basically a degenerate case of a single-app flow. This is unrelated to the app-to-flow aggregation. It has to do with the flowRun-to-flow aggregation. And it's something we want the users to do when they can set the flow name. FWIW... Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3365) Add support for using the 'tc' tool via container-executor
[ https://issues.apache.org/jira/browse/YARN-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393789#comment-14393789 ] Sidharta Seethana commented on YARN-3365: - Actually, never mind - it seems like the banned user list wasn't affected. -Sid Add support for using the 'tc' tool via container-executor -- Key: YARN-3365 URL: https://issues.apache.org/jira/browse/YARN-3365 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana Fix For: 2.8.0 Attachments: YARN-3365.001.patch, YARN-3365.002.patch, YARN-3365.003.patch We need the following functionality : 1) modify network interface traffic shaping rules - to be able to attach a qdisc, create child classes etc 2) read existing rules in place 3) read stats for the various classes Using tc requires elevated privileges - hence this functionality is to be made available via container-executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394003#comment-14394003 ] Hadoop QA commented on YARN-3443: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709150/YARN-3443.001.patch against trunk revision bad070f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1150 javac compiler warnings (more than the trunk's current 1148 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7210//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7210//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7210//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7210//console This message is automatically generated. Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM - Key: YARN-3443 URL: https://issues.apache.org/jira/browse/YARN-3443 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana Attachments: YARN-3443.001.patch The current cgroups implementation is closely tied to supporting CPU as a resource . We need to separate out CGroups support as well a provide a simple ResourceHandler subsystem that will enable us to add support for new resource types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
[ https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393725#comment-14393725 ] Hadoop QA commented on YARN-2666: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709083/YARN-2666.000.patch against trunk revision 6a6a59d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7207//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7207//console This message is automatically generated. TestFairScheduler.testContinuousScheduling fails Intermittently --- Key: YARN-2666 URL: https://issues.apache.org/jira/browse/YARN-2666 Project: Hadoop YARN Issue Type: Test Components: scheduler Reporter: Tsuyoshi Ozawa Assignee: zhihai xu Attachments: YARN-2666.000.patch The test fails on trunk. {code} Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) Time elapsed: 0.582 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393723#comment-14393723 ] Sangjin Lee commented on YARN-3334: --- I took a quick look at the latest patch, and it looks good for the most part. However, I do worry about the size of the map produced in the response in ResourceTrackerService. It can be potentially quite large every time and has a potential impact on many things as it is part of the NM heartbeat handling. It's OK for now, but we should try to address it sooner than later. [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334.7.patch After YARN-3039, we have service discovery mechanism to pass app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
Sidharta Seethana created YARN-3443: --- Summary: Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM Key: YARN-3443 URL: https://issues.apache.org/jira/browse/YARN-3443 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana The current cgroups implementation is closely tied to supporting CPU as a resource . We need to separate out CGroups support as well a provide a simple ResourceHandler subsystem that will enable us to add support for new resource types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3444) Fixed typo (capability)
Gabor Liptak created YARN-3444: -- Summary: Fixed typo (capability) Key: YARN-3444 URL: https://issues.apache.org/jira/browse/YARN-3444 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Reporter: Gabor Liptak Priority: Minor Fixed typo (capability) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
[ https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393976#comment-14393976 ] Tsuyoshi Ozawa commented on YARN-2666: -- OK, I'll check it. TestFairScheduler.testContinuousScheduling fails Intermittently --- Key: YARN-2666 URL: https://issues.apache.org/jira/browse/YARN-2666 Project: Hadoop YARN Issue Type: Test Components: scheduler Reporter: Tsuyoshi Ozawa Assignee: zhihai xu Attachments: YARN-2666.000.patch The test fails on trunk. {code} Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) Time elapsed: 0.582 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3365) Add support for using the 'tc' tool via container-executor
[ https://issues.apache.org/jira/browse/YARN-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3365: -- Fix Version/s: 2.8.0 Add support for using the 'tc' tool via container-executor -- Key: YARN-3365 URL: https://issues.apache.org/jira/browse/YARN-3365 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana Fix For: 2.8.0 Attachments: YARN-3365.001.patch, YARN-3365.002.patch, YARN-3365.003.patch We need the following functionality : 1) modify network interface traffic shaping rules - to be able to attach a qdisc, create child classes etc 2) read existing rules in place 3) read stats for the various classes Using tc requires elevated privileges - hence this functionality is to be made available via container-executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3365) Add support for using the 'tc' tool via container-executor
[ https://issues.apache.org/jira/browse/YARN-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393775#comment-14393775 ] Sidharta Seethana commented on YARN-3365: - Thanks, Vinod! we'll need a small patch to undo the banned users change in branch-2. Add support for using the 'tc' tool via container-executor -- Key: YARN-3365 URL: https://issues.apache.org/jira/browse/YARN-3365 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Sidharta Seethana Assignee: Sidharta Seethana Fix For: 2.8.0 Attachments: YARN-3365.001.patch, YARN-3365.002.patch, YARN-3365.003.patch We need the following functionality : 1) modify network interface traffic shaping rules - to be able to attach a qdisc, create child classes etc 2) read existing rules in place 3) read stats for the various classes Using tc requires elevated privileges - hence this functionality is to be made available via container-executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3366) Outbound network bandwidth : classify/shape traffic originating from YARN containers
[ https://issues.apache.org/jira/browse/YARN-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393971#comment-14393971 ] Sidharta Seethana commented on YARN-3366: - Since this patch requires uncommitted changes from https://issues.apache.org/jira/browse/YARN-3443, I am not submitting this patch to a pre-commit build for the time being. Outbound network bandwidth : classify/shape traffic originating from YARN containers Key: YARN-3366 URL: https://issues.apache.org/jira/browse/YARN-3366 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sidharta Seethana Assignee: Sidharta Seethana Attachments: YARN-3366.001.patch In order to be able to isolate based on/enforce outbound traffic bandwidth limits, we need a mechanism to classify/shape network traffic in the nodemanager. For more information on the design, please see the attached design document in the parent JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) RMTimelineCollector should have the context info of each app
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393643#comment-14393643 ] Zhijie Shen commented on YARN-3390: --- bq. I would favor the latter approach +1 RMTimelineCollector should have the context info of each app Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393717#comment-14393717 ] Robert Kanter commented on YARN-2942: - Yes, it does a blocking wait. I think this will end up being in a separate thread anyway because it's being done after uploading the logs to HDFS. However, I think making it a separate service is a good idea anyway. As you said, this handles NM restart, and allows us to later add more flexibility. If you upgrade the JHS before the NM, it's not the end of the world. New logs wouldn't be found by the JHS, but that only hurts users trying to view those logs through the JHS. Once the JHS is updated, they would be viewable. In any case, having the two configs is probably more confusing than it needs to be for the user, and we'd have to take care of the case where the new format is disabled but concatenation is enabled (which is invalid). I think we should just make this one config: either both the new format and concatenation are enabled or neither is. I'll post an updated doc shortly. Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, ConcatableAggregatedLogsProposal_v4.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
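The single-config idea above, sketched with a hypothetical property name (not an actual YARN key): one switch governs both writing the new format and concatenation, so the invalid "new format off / concatenation on" combination cannot be configured:
{code}
import org.apache.hadoop.conf.Configuration;

class LogConcatConfigSketch {
  // Hypothetical key; the key actually chosen in YARN-2942 may differ.
  static final String LOG_CONCAT_ENABLED = "yarn.log-aggregation.concatenation.enabled";

  static boolean isConcatenationEnabled(Configuration conf) {
    return conf.getBoolean(LOG_CONCAT_ENABLED, false);
  }

  // The NM writes the new (concatenatable) format if and only if concatenation
  // is enabled, so there is no separate format switch to get out of sync.
  static boolean writeNewFormat(Configuration conf) {
    return isConcatenationEnabled(conf);
  }
}
{code}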
[jira] [Updated] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2942: Attachment: ConcatableAggregatedLogsProposal_v5.pdf I've uploaded a v5 doc which address those changes. I also clarified a few other things in there too. Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, ConcatableAggregatedLogsProposal_v4.pdf, ConcatableAggregatedLogsProposal_v5.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3436) Doc WebServicesIntro.html Example Rest API url wrong
[ https://issues.apache.org/jira/browse/YARN-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393935#comment-14393935 ] Hadoop QA commented on YARN-3436: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709010/YARN-3436.001.patch against trunk revision bad070f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7209//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7209//console This message is automatically generated. Doc WebServicesIntro.html Example Rest API url wrong Key: YARN-3436 URL: https://issues.apache.org/jira/browse/YARN-3436 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Attachments: YARN-3436.001.patch /docs/current/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html {quote} Response Examples JSON response with single resource HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/{color:red}app{color}/application_1324057493980_0001 Response Status Line: HTTP/1.1 200 OK {quote} Url should be ws/v1/cluster/{color:red}apps{color} . 2 examples on same page are wrong -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393938#comment-14393938 ] Sidharta Seethana commented on YARN-2424: - It looks like different versions of the patch to fix this were committed to branch-2 and trunk? The corresponding changes to LinuxContainerExecutor.java look different. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Assignee: Allen Wittenauer Priority: Blocker Fix For: 2.6.0 Attachments: Y2424-1.patch, YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) RMTimelineCollector should have the context info of each app
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394059#comment-14394059 ] Naganarasimha G R commented on YARN-3390: - Thanks for the feedback [~zjshen] [~sjlee0], bq. either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier from the perspective of callers of putEntities(). I too agree it will be easier from the perspective of callers of putEntities(), but if we go with a map of {{app id to context}}: * the implicit assumption would be that {{putEntities(TimelineEntities)}} will be for the same appId (i.e. will have the same context) * TimelineEntities as such does not have the appId explicitly, so I am planning to modify {{TimelineCollector.getTimelineEntityContext()}} to {{TimelineCollector.getTimelineEntityContext(TimelineEntity.Identifier id)}}, and subclasses of TimelineCollector can take care of mapping the Id to the Context (via AppId) if required. * the code of {{putEntities(TimelineEntities)}} would look something like {code} Iterator<TimelineEntity> iterator = entities.getEntities().iterator(); TimelineEntity next = iterator.hasNext() ? iterator.next() : null; if (next != null) { TimelineCollectorContext context = getTimelineEntityContext(next.getIdentifier()); return writer.write(context.getClusterId(), context.getUserId(), context.getFlowId(), context.getFlowRunId(), context.getAppId(), entities); } {code} If it's OK, then shall I work on it? RMTimelineCollector should have the context info of each app Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3334: - Attachment: YARN-3334-v8.patch Upload v8 patch to address minor comments for log in TimelineClientImpl. [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334-v8.patch, YARN-3334.7.patch After YARN-3039, we have service discovery mechanism to pass app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3445) NM notify RM on running Apps in NM-RM heartbeat
Junping Du created YARN-3445: Summary: NM notify RM on running Apps in NM-RM heartbeat Key: YARN-3445 URL: https://issues.apache.org/jira/browse/YARN-3445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Junping Du Assignee: Junping Du Per discussion in YARN-3334, we need to filter out unnecessary collector info from the RM in the heartbeat response. Our proposal is to add an additional field for running apps in the NM heartbeat request, so the RM only sends back collectors for locally running apps. This is also needed for YARN-914 (graceful decommission): if an NM in the decommissioning stage has no running apps, it can be decommissioned immediately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
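To illustrate the proposed filtering, here is a minimal sketch under the assumption that the NM heartbeat carries a list of running application IDs and the RM keeps a map of registered collector addresses; the class and method names below are hypothetical and not the actual protobuf fields or YARN APIs:
{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the actual ResourceTrackerService code): given the
// app IDs an NM reports as running in its heartbeat, return only the collector
// addresses for those apps instead of the full collector map.
public class CollectorFilterSketch {
  public static Map<String, String> collectorsForNode(
      List<String> runningAppIds, Map<String, String> registeredCollectors) {
    Map<String, String> result = new HashMap<>();
    for (String appId : runningAppIds) {
      String addr = registeredCollectors.get(appId);
      if (addr != null) {
        result.put(appId, addr); // only collectors for locally running apps go back
      }
    }
    return result;
  }
}
{code}
The same running-apps field would also give the RM a cheap way to notice, for YARN-914, that a decommissioning node has no running apps left.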
[jira] [Commented] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394052#comment-14394052 ] Junping Du commented on YARN-3334: -- Thanks [~zjshen] and [~sjlee0] for comments! bq. If so, I suggest combining the two messages together, and record an error-level log (the first message is actually useless, if we always report the second one). That sounds OK. Will update a quick fix. bq. However, I do worry about the size of the map produced in the response in ResourceTrackerService. It can be potentially quite large every time and has a potential impact on many things as it is part of the NM heartbeat handling. It's OK for now, but we should try to address it sooner than later. Just filed YARN-3445 to track this issue. This is also needed for graceful decommission (YARN-914) - a decommissioning node can be terminated earlier by the RM if it has no running apps. [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334.7.patch After YARN-3039, we have service discovery mechanism to pass app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3334: - Attachment: YARN-3334-v6.patch Incorporated [~zjshen]'s comments in the v6 patch. Rebased it to the latest YARN-2928 and verified the e2e test passes. [~zjshen], can you take another look? Thanks! [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch, YARN-3334-v6.patch After YARN-3039, we have service discovery mechanism to pass app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393194#comment-14393194 ] Craig Welch commented on YARN-3293: --- Hey [~vvasudev], it seems that the patch doesn't apply cleanly, can you update to latest trunk? Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3425) NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed
[ https://issues.apache.org/jira/browse/YARN-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392872#comment-14392872 ] Hudson commented on YARN-3425: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #151 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/151/]) YARN-3425. NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed. (Bibin A Chundatt via wangda) (wangda: rev 492239424a3ace9868b6154f44a0f18fa5318235) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed -- Key: YARN-3425 URL: https://issues.apache.org/jira/browse/YARN-3425 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: 1 RM, 1 NM , 1 NN , I DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3425.001.patch Configure yarn.node-labels.enabled to true and yarn.node-labels.fs-store.root-dir /node-labels Start resource manager without starting DN/NM {quote} 2015-03-31 16:44:13,782 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:261) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:267) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:984) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:251) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1207) {quote} {code} protected void stopDispatcher() { AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher; asyncDispatcher.stop(); } {code} Null check missing during stop -- This message was sent by Atlassian JIRA (v6.3.4#6332)
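A minimal sketch of the kind of guard the report points at, assuming the dispatcher field is still null when serviceInit fails before the dispatcher is created; the actual fix is whatever is in the attached YARN-3425.001.patch:
{code}
// Hypothetical sketch, not the committed patch: guard against a dispatcher that
// was never created because serviceInit failed before reaching that point.
protected void stopDispatcher() {
  AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher;
  if (asyncDispatcher != null) {
    asyncDispatcher.stop();
  }
}
{code}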
[jira] [Created] (YARN-3435) AM container to be allocated Appattempt AM container shown as null
Bibin A Chundatt created YARN-3435: -- Summary: AM container to be allocated Appattempt AM container shown as null Key: YARN-3435 URL: https://issues.apache.org/jira/browse/YARN-3435 Project: Hadoop YARN Issue Type: Bug Environment: 1RM,1DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Trivial Submit yarn application Open http://rm:8088/cluster/appattempt/appattempt_1427984982805_0003_01 Before the AM container is allocated -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392866#comment-14392866 ] Hudson commented on YARN-3430: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #151 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/151/]) YARN-3430. Made headroom data available on app attempt page of RM WebUI. Contributed by Xuan Gong. (zjshen: rev 8366a36ad356e6318b8ce6c5c96e201149f811bd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392789#comment-14392789 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #142 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/142/]) YARN-3248. Correct fix version from branch-2.7 to branch-2.8 in the change log. (xgong: rev 2e79f1c2125517586c165a84e99d3c4d38ca0938) * hadoop-yarn-project/CHANGES.txt Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3425) NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed
[ https://issues.apache.org/jira/browse/YARN-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392791#comment-14392791 ] Hudson commented on YARN-3425: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #142 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/142/]) YARN-3425. NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed. (Bibin A Chundatt via wangda) (wangda: rev 492239424a3ace9868b6154f44a0f18fa5318235) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed -- Key: YARN-3425 URL: https://issues.apache.org/jira/browse/YARN-3425 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: 1 RM, 1 NM , 1 NN , I DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3425.001.patch Configure yarn.node-labels.enabled to true and yarn.node-labels.fs-store.root-dir /node-labels Start resource manager without starting DN/NM {quote} 2015-03-31 16:44:13,782 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:261) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:267) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:984) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:251) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1207) {quote} {code} protected void stopDispatcher() { AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher; asyncDispatcher.stop(); } {code} Null check missing during stop -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392784#comment-14392784 ] Hudson commented on YARN-3430: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #142 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/142/]) YARN-3430. Made headroom data available on app attempt page of RM WebUI. Contributed by Xuan Gong. (zjshen: rev 8366a36ad356e6318b8ce6c5c96e201149f811bd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3433) Jersey tests failing with Port in Use -again
[ https://issues.apache.org/jira/browse/YARN-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392497#comment-14392497 ] Steve Loughran commented on YARN-3433: -- {code} com.sun.jersey.test.framework.spi.container.TestContainerException: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:413) at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:384) at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:375) at org.glassfish.grizzly.http.server.NetworkListener.start(NetworkListener.java:549) at org.glassfish.grizzly.http.server.HttpServer.start(HttpServer.java:255) at com.sun.jersey.api.container.grizzly2.GrizzlyServerFactory.createHttpServer(GrizzlyServerFactory.java:326) at com.sun.jersey.api.container.grizzly2.GrizzlyServerFactory.createHttpServer(GrizzlyServerFactory.java:343) at com.sun.jersey.test.framework.spi.container.grizzly2.web.GrizzlyWebTestContainerFactory$GrizzlyWebTestContainer.instantiateGrizzlyWebServer(GrizzlyWebTestContainerFactory.java:219) at com.sun.jersey.test.framework.spi.container.grizzly2.web.GrizzlyWebTestContainerFactory$GrizzlyWebTestContainer.init(GrizzlyWebTestContainerFactory.java:129) at com.sun.jersey.test.framework.spi.container.grizzly2.web.GrizzlyWebTestContainerFactory$GrizzlyWebTestContainer.init(GrizzlyWebTestContainerFactory.java:86) at com.sun.jersey.test.framework.spi.container.grizzly2.web.GrizzlyWebTestContainerFactory.create(GrizzlyWebTestContainerFactory.java:79) at com.sun.jersey.test.framework.JerseyTest.getContainer(JerseyTest.java:342) at com.sun.jersey.test.framework.JerseyTest.init(JerseyTest.java:217) at org.apache.hadoop.yarn.webapp.JerseyTestBase.init(JerseyTestBase.java:27) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps.init(TestRMWebServicesApps.java:111) {code} Jersey tests failing with Port in Use -again Key: YARN-3433 URL: https://issues.apache.org/jira/browse/YARN-3433 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran ASF Jenkins jersey tests failing with port in use exceptions. The YARN-2912 patch tried to fix it, but it defaults to port 9998 and doesn't scan for a spare port —so is too brittle on a busy server -- This message was sent by Atlassian JIRA (v6.3.4#6332)
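One common alternative to a fixed port is to ask the OS for a free ephemeral port before starting the test container; a minimal, self-contained sketch (not the actual JerseyTestBase change):
{code}
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch: bind to port 0 so the OS picks a free port, instead of
// hard-coding 9998 and colliding with other builds on a busy Jenkins host.
public final class FreePortSketch {
  public static int findFreePort() throws IOException {
    try (ServerSocket probe = new ServerSocket(0)) {
      probe.setReuseAddress(true);
      return probe.getLocalPort();
    }
  }
}
{code}
There is still a small race between closing the probe socket and the test container binding to the returned port, but it is far less brittle than a single fixed port.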
[jira] [Commented] (YARN-3425) NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed
[ https://issues.apache.org/jira/browse/YARN-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392543#comment-14392543 ] Hudson commented on YARN-3425: -- FAILURE: Integrated in Hadoop-Yarn-trunk #885 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/885/]) YARN-3425. NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed. (Bibin A Chundatt via wangda) (wangda: rev 492239424a3ace9868b6154f44a0f18fa5318235) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed -- Key: YARN-3425 URL: https://issues.apache.org/jira/browse/YARN-3425 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: 1 RM, 1 NM , 1 NN , I DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3425.001.patch Configure yarn.node-labels.enabled to true and yarn.node-labels.fs-store.root-dir /node-labels Start resource manager without starting DN/NM {quote} 2015-03-31 16:44:13,782 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:261) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:267) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:984) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:251) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1207) {quote} {code} protected void stopDispatcher() { AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher; asyncDispatcher.stop(); } {code} Null check missing during stop -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3433) Jersey tests failing with Port in Use -again
Steve Loughran created YARN-3433: Summary: Jersey tests failing with Port in Use -again Key: YARN-3433 URL: https://issues.apache.org/jira/browse/YARN-3433 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran ASF Jenkins jersey tests failing with port in use exceptions. The YARN-2912 patch tried to fix it, but it defaults to port 9998 and doesn't scan for a spare port —so is too brittle on a busy server -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3433) Jersey tests failing with Port in Use -again
[ https://issues.apache.org/jira/browse/YARN-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-3433: -- Assignee: Brahma Reddy Battula Jersey tests failing with Port in Use -again Key: YARN-3433 URL: https://issues.apache.org/jira/browse/YARN-3433 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula ASF Jenkins jersey tests failing with port in use exceptions. The YARN-2912 patch tried to fix it, but it defaults to port 9998 and doesn't scan for a spare port —so is too brittle on a busy server -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392656#comment-14392656 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2083 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2083/]) YARN-3248. Correct fix version from branch-2.7 to branch-2.8 in the change log. (xgong: rev 2e79f1c2125517586c165a84e99d3c4d38ca0938) * hadoop-yarn-project/CHANGES.txt Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3425) NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed
[ https://issues.apache.org/jira/browse/YARN-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392658#comment-14392658 ] Hudson commented on YARN-3425: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2083 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2083/]) YARN-3425. NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed. (Bibin A Chundatt via wangda) (wangda: rev 492239424a3ace9868b6154f44a0f18fa5318235) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed -- Key: YARN-3425 URL: https://issues.apache.org/jira/browse/YARN-3425 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: 1 RM, 1 NM , 1 NN , I DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3425.001.patch Configure yarn.node-labels.enabled to true and yarn.node-labels.fs-store.root-dir /node-labels Start resource manager without starting DN/NM {quote} 2015-03-31 16:44:13,782 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:261) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:267) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:984) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:251) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1207) {quote} {code} protected void stopDispatcher() { AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher; asyncDispatcher.stop(); } {code} Null check missing during stop -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
Thomas Graves created YARN-3434: --- Summary: Interaction between reservations and userlimit can result in significant ULF violation Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves ULF was set to 1.0. A user was able to consume 1.4X the queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392751#comment-14392751 ] Thomas Graves commented on YARN-3434: - The issue here is that if we allow the user to continue past the user limit checks in assignContainers because they have reservations, then when it gets down into the assignContainer routine and the user is allowed to get a container and the node has space, we don't double-check the user limit. We recheck in all other cases, but this one is missed. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves ULF was set to 1.0. A user was able to consume 1.4X the queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
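A self-contained illustration of the missed recheck described above; the class, method, and numbers are hypothetical and only show the arithmetic of the guard that the reservation path skipped, not the actual LeafQueue code:
{code}
// Hypothetical illustration: even when a user is let through the first
// user-limit check because of reservations, the allocation step should
// re-verify that used + requested still fits under the user limit.
public final class UserLimitRecheckSketch {
  static boolean mayAllocate(long userUsedMB, long requestedMB, long userLimitMB,
      boolean nodeHasSpace) {
    return nodeHasSpace && (userUsedMB + requestedMB <= userLimitMB);
  }

  public static void main(String[] args) {
    // A user already at a 300GB limit asking for one more 8GB container must be
    // refused even if the node has space; skipping this check is how the queue
    // capacity can be overshot by 1.4X.
    System.out.println(mayAllocate(300_000, 8_192, 300_000, true)); // false
  }
}
{code}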
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392541#comment-14392541 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Yarn-trunk #885 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/885/]) YARN-3248. Display count of nodes blacklisted by apps in the web UI. (xgong: rev 4728bdfa15809db4b8b235faa286c65de4a48cf6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlockWithMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java YARN-3248. Correct fix version from branch-2.7 to branch-2.8 in the change log. (xgong: rev 2e79f1c2125517586c165a84e99d3c4d38ca0938) * hadoop-yarn-project/CHANGES.txt Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392536#comment-14392536 ] Hudson commented on YARN-3430: -- FAILURE: Integrated in Hadoop-Yarn-trunk #885 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/885/]) YARN-3430. Made headroom data available on app attempt page of RM WebUI. Contributed by Xuan Gong. (zjshen: rev 8366a36ad356e6318b8ce6c5c96e201149f811bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/CHANGES.txt RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392595#comment-14392595 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #151 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/151/]) YARN-3248. Display count of nodes blacklisted by apps in the web UI. (xgong: rev 4728bdfa15809db4b8b235faa286c65de4a48cf6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlockWithMetrics.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java YARN-3248. Correct fix version from branch-2.7 to branch-2.8 in the change log. (xgong: rev 2e79f1c2125517586c165a84e99d3c4d38ca0938) * hadoop-yarn-project/CHANGES.txt Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392590#comment-14392590 ] Hudson commented on YARN-3430: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #151 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/151/]) YARN-3430. Made headroom data available on app attempt page of RM WebUI. Contributed by Xuan Gong. (zjshen: rev 8366a36ad356e6318b8ce6c5c96e201149f811bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/CHANGES.txt RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3425) NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed
[ https://issues.apache.org/jira/browse/YARN-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392597#comment-14392597 ] Hudson commented on YARN-3425: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #151 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/151/]) YARN-3425. NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed. (Bibin A Chundatt via wangda) (wangda: rev 492239424a3ace9868b6154f44a0f18fa5318235) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java NPE from RMNodeLabelsManager.serviceStop when NodeLabelsManager.serviceInit failed -- Key: YARN-3425 URL: https://issues.apache.org/jira/browse/YARN-3425 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: 1 RM, 1 NM , 1 NN , I DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3425.001.patch Configure yarn.node-labels.enabled to true and yarn.node-labels.fs-store.root-dir /node-labels Start resource manager without starting DN/NM {quote} 2015-03-31 16:44:13,782 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:261) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:267) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:984) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:251) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1207) {quote} {code} protected void stopDispatcher() { AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher; asyncDispatcher.stop(); } {code} Null check missing during stop -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392651#comment-14392651 ] Hudson commented on YARN-3430: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2083 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2083/]) YARN-3430. Made headroom data available on app attempt page of RM WebUI. Contributed by Xuan Gong. (zjshen: rev 8366a36ad356e6318b8ce6c5c96e201149f811bd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
[ https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392687#comment-14392687 ] Thomas Graves commented on YARN-3432: - That will fix it for the capacity scheduler; we need to see if it breaks the FairScheduler though. Cluster metrics have wrong Total Memory when there is reserved memory on CS --- Key: YARN-3432 URL: https://issues.apache.org/jira/browse/YARN-3432 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Brahma Reddy Battula I noticed that when reservations happen while using the Capacity Scheduler, the UI and web services report the wrong total memory. For example, I have 300GB of total memory in my cluster. I allocate 50 and reserve 10. The cluster metrics then report the total memory as 290GB. This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps there is a difference between the fair scheduler and the capacity scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
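A minimal illustration of the arithmetic in the example above, under the assumption that the total-memory metric is derived as available + allocated while reserved memory is excluded from available (which would match the reported 290GB); this is not the actual metrics code:
{code}
// Hypothetical illustration: 300GB capacity, 50GB allocated, 10GB reserved.
public final class TotalMemorySketch {
  public static void main(String[] args) {
    long capacityGB = 300, allocatedGB = 50, reservedGB = 10;
    long availableGB = capacityGB - allocatedGB - reservedGB;       // 240
    long reportedTotalGB = availableGB + allocatedGB;               // 290: reserved is lost
    long expectedTotalGB = availableGB + allocatedGB + reservedGB;  // 300
    System.out.println(reportedTotalGB + " vs " + expectedTotalGB);
  }
}
{code}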
[jira] [Updated] (YARN-3435) AM container to be allocated Appattempt AM container shown as null
[ https://issues.apache.org/jira/browse/YARN-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3435: --- Attachment: Screenshot.png Attaching Screen shot for bug AM container to be allocated Appattempt AM container shown as null -- Key: YARN-3435 URL: https://issues.apache.org/jira/browse/YARN-3435 Project: Hadoop YARN Issue Type: Bug Environment: 1RM,1DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Trivial Attachments: Screenshot.png Submit yarn application Open http://rm:8088/cluster/appattempt/appattempt_1427984982805_0003_01 Before the AM container is allocated -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393015#comment-14393015 ] Zhijie Shen commented on YARN-3334: --- Junping, did you have a chance to look at items 3 and 4 of my last patch comment? One more nit: newTimelineServiceEnabled(config) - systemMetricsPublisherEnabled? [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch After YARN-3039, we have service discovery mechanism to pass app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3436) Doc WebServicesIntro.html Example Rest API url wrong
Bibin A Chundatt created YARN-3436: -- Summary: Doc WebServicesIntro.html Example Rest API url wrong Key: YARN-3436 URL: https://issues.apache.org/jira/browse/YARN-3436 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor /docs/current/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html {quote} Response Examples JSON response with single resource HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/{color:red}app{color}/application_1324057493980_0001 Response Status Line: HTTP/1.1 200 OK {quote} Url should be ws/v1/cluster/{color:red}apps{color} . 2 examples on same page are wrong -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3437) convert load test driver to timeline service v.2
Sangjin Lee created YARN-3437: - Summary: convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3435) AM container to be allocated Appattempt AM container shown as null
[ https://issues.apache.org/jira/browse/YARN-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3435: --- Attachment: YARN-3435.001.patch AM container to be allocated Appattempt AM container shown as null -- Key: YARN-3435 URL: https://issues.apache.org/jira/browse/YARN-3435 Project: Hadoop YARN Issue Type: Bug Environment: 1RM,1DN Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Trivial Attachments: Screenshot.png, YARN-3435.001.patch Submit yarn application Open http://rm:8088/cluster/appattempt/appattempt_1427984982805_0003_01 Before the AM container is allocated -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3438) add a mode to replay MR job history files to the timeline service
Sangjin Lee created YARN-3438: - Summary: add a mode to replay MR job history files to the timeline service Key: YARN-3438 URL: https://issues.apache.org/jira/browse/YARN-3438 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee The subtask covers the work on top of YARN-3437 to add a mode to replay MR job history files to the timeline service storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)