[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496103#comment-14496103 ] Rohith commented on YARN-2268: -- I propose the following way to disallow formatting the state store while an RM is running. For both HA (Active and Standby) and non-HA, it is possible to get the RM state using the REST API getClusterInfo ('ws/v1/cluster/info'). This can be used to identify the RM state, and it is independent of any state store implementation. In HA, the ACTIVE state is checked against all the RM-IDs sequentially. If no ACTIVE RM is found, then format the store; otherwise throw an *ActiveResourceManagerRunningException*. Cons: formatting the state store when HA is enabled is *best effort*; there are scenarios where the RM state can change after one of the RMs has been checked. Kindly share your thoughts on this approach. Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
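The proposed pre-format check can be sketched as follows. This is a hedged illustration, not YARN code: `parse_ha_state` and `ensure_no_active_rm` are hypothetical helpers, and in practice each RM-ID would be probed over HTTP at its `ws/v1/cluster/info` endpoint before the states are compared.

```python
import json

class ActiveResourceManagerRunningException(Exception):
    """Raised when an ACTIVE RM is found, so the format must be aborted."""

def parse_ha_state(cluster_info_json):
    # Extract the haState field from a ws/v1/cluster/info response body.
    return json.loads(cluster_info_json)["clusterInfo"]["haState"]

def ensure_no_active_rm(ha_states):
    # Check all RM-IDs sequentially; best effort only, since a standby RM
    # may transition to ACTIVE after it was probed.
    for rm_id, state in ha_states.items():
        if state == "ACTIVE":
            raise ActiveResourceManagerRunningException(rm_id)
```

Because the check and the format are not atomic, the race the comment describes remains: the caller can only narrow the window, not close it.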
[jira] [Created] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once
Jason Lowe created YARN-3489: Summary: RMServerUtils.validateResourceRequests should only obtain queue info once Key: YARN-3489 URL: https://issues.apache.org/jira/browse/YARN-3489 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Since the label support was added, we now get the queue info for each request being validated in SchedulerUtils.validateResourceRequest. If validateResourceRequests needs to validate a lot of requests at a time (e.g. a large cluster with lots of varied locality in the requests) then it will get the queue info for each request. Since we build the queue info each time, this generates a lot of unnecessary garbage, as the queue isn't changing between requests. We should grab the queue info once and pass it down rather than building it again for each request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
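The fix amounts to hoisting the queue-info lookup out of the per-request loop. A minimal sketch, with `Scheduler` and the validation helpers as hypothetical stand-ins for the real scheduler classes:

```python
class Scheduler:
    def __init__(self):
        self.queue_info_builds = 0

    def get_queue_info(self, queue_name):
        # Expensive: each call allocates a fresh QueueInfo-like object.
        self.queue_info_builds += 1
        return {"queue": queue_name, "labels": ["x"]}

def validate_resource_request(req, queue_info):
    if req["label"] not in queue_info["labels"]:
        raise ValueError("invalid label %s" % req["label"])

def validate_resource_requests(scheduler, queue_name, requests):
    # Fix: obtain the queue info once, outside the loop, and pass it down,
    # since the queue does not change between requests in one batch.
    queue_info = scheduler.get_queue_info(queue_name)
    for req in requests:
        validate_resource_request(req, queue_info)
```

With N requests the lookup now runs once instead of N times, which is exactly the garbage reduction the report asks for.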
[jira] [Commented] (YARN-3476) Nodemanager can fail to delete local logs if log aggregation fails
[ https://issues.apache.org/jira/browse/YARN-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496099#comment-14496099 ] Hadoop QA commented on YARN-3476: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724974/0001-YARN-3476.patch against trunk revision fddd552. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7346//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7346//console This message is automatically generated. 
Nodemanager can fail to delete local logs if log aggregation fails -- Key: YARN-3476 URL: https://issues.apache.org/jira/browse/YARN-3476 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation, nodemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Rohith Attachments: 0001-YARN-3476.patch If log aggregation encounters an error trying to upload the file, the underlying TFile can throw an IllegalStateException, which will bubble up to the top of the thread and prevent the application logs from being deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
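The failure mode and the shape of the fix can be sketched like this (hedged: `aggregate_logs` and its callbacks are hypothetical stand-ins for the NodeManager's aggregation thread, not the actual patch): the local-log deletion must run even when the upload step throws.

```python
def aggregate_logs(upload, delete_local_logs):
    # If upload() raises (e.g. the TFile writer throwing an
    # IllegalStateException in the Java code), the finally block still
    # schedules deletion, so local logs cannot leak on aggregation failure.
    try:
        upload()
    finally:
        delete_local_logs()
```

The exception still propagates for logging and retry policy; only the cleanup is made unconditional.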
[jira] [Updated] (YARN-3477) TimelineClientImpl swallows root cause of retry failures
[ https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3477: - Target Version/s: 2.7.1 Affects Version/s: (was: 3.0.0) 2.7.0 TimelineClientImpl swallows root cause of retry failures Key: YARN-3477 URL: https://issues.apache.org/jira/browse/YARN-3477 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran If the timeline client fails more than the retry count, the original exception is not thrown. Instead, a generic runtime exception is raised saying the retries have run out # the failing exception should be rethrown, ideally via NetUtils.wrapException, to include the URL of the failing endpoint # Otherwise, the raised RTE should (a) state that URL and (b) set the original fault as the inner cause -- This message was sent by Atlassian JIRA (v6.3.4#6332)
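Option (b) above can be sketched as follows. This is an illustration, not the real client: `put_with_retries` and the endpoint URL are hypothetical, and Python's `raise ... from ...` plays the role of setting the inner cause on the runtime exception.

```python
def put_with_retries(do_put, url, retries=3):
    last = None
    for _ in range(retries):
        try:
            return do_put()
        except ConnectionError as e:
            last = e
    # Name the failing endpoint and chain the original fault, so neither
    # the URL nor the root cause is swallowed when retries run out.
    raise RuntimeError("retries exhausted talking to %s" % url) from last
```

A caller catching the `RuntimeError` can then inspect both the message and `__cause__` to diagnose which endpoint failed and why.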
[jira] [Commented] (YARN-3266) RMContext inactiveNodes should have NodeId as map key
[ https://issues.apache.org/jira/browse/YARN-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496385#comment-14496385 ] Hudson commented on YARN-3266: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2114 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2114/]) YARN-3266. RMContext#inactiveNodes should have NodeId as map key. Contributed by Chengbing Liu (jianhe: rev b46ee1e7a31007985b88072d9af3d97c33a261a7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMActiveServiceContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodes.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java RMContext inactiveNodes should have NodeId as map key - Key: YARN-3266 URL: 
https://issues.apache.org/jira/browse/YARN-3266 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3266.01.patch, YARN-3266.02.patch, YARN-3266.03.patch Under the default NM port configuration, which is 0, we have observed in the current version that the lost nodes count is greater than the length of the lost node list. This will happen when we consecutively restart the same NM twice: * NM started at port 10001 * NM restarted at port 10002 * NM restarted at port 10003 * NM:10001 timeout, {{ClusterMetrics#incrNumLostNMs()}}, # lost node=1; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} has 1 element * NM:10002 timeout, {{ClusterMetrics#incrNumLostNMs()}}, # lost node=2; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} still has 1 element Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), {{inactiveNodes}} should be of type {{ConcurrentMap<NodeId, RMNode>}}. If this will break the current API, then the key string should include the NM's port as well. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
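The undercount described above reduces to a map-key collision, which a few lines make concrete (a sketch; the dicts stand in for `inactiveNodes`, and the host/port values are illustrative):

```python
lost_by_host = {}      # current behavior: keyed by host only
lost_by_node_id = {}   # proposed: keyed by NodeId, i.e. (host, port)

for host, port in [("nm-host", 10001), ("nm-host", 10002)]:
    lost_by_host[host] = port             # second put overwrites the first
    lost_by_node_id[(host, port)] = port  # distinct NodeId, distinct entry

# lost_by_host holds 1 entry although 2 NMs timed out;
# lost_by_node_id holds both.
```

Keying by the full NodeId makes the lost-node list length agree with the lost-node metric even when multiple NMs share a host.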
[jira] [Commented] (YARN-3436) Fix URIs in documentation of YARN web service REST APIs
[ https://issues.apache.org/jira/browse/YARN-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496383#comment-14496383 ] Hudson commented on YARN-3436: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2114 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2114/]) YARN-3436. Fix URIs in documentation of YARN web service REST APIs. Contributed by Bibin A Chundatt. (ozawa: rev 05007b45e58bd9052f503cfb8c17bcfd22a686e3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/WebServicesIntro.md * hadoop-yarn-project/CHANGES.txt Fix URIs in documentation of YARN web service REST APIs - Key: YARN-3436 URL: https://issues.apache.org/jira/browse/YARN-3436 Project: Hadoop YARN Issue Type: Bug Components: documentation, resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3436.001.patch /docs/current/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html {quote} Response Examples JSON response with single resource HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/{color:red}app{color}/application_1324057493980_0001 Response Status Line: HTTP/1.1 200 OK {quote} The URL should be ws/v1/cluster/{color:red}apps{color}. Two examples on the same page are wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496384#comment-14496384 ] Hudson commented on YARN-3361: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2114 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2114/]) YARN-3361. CapacityScheduler side changes to support non-exclusive node labels. Contributed by Wangda Tan (jianhe: rev 0fefda645bca935b87b6bb8ca63e6f18340d59f5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestNodeLabelContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/BuilderUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/SchedulingMode.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/Application.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/ResourceUsage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/CHANGES.txt CapacityScheduler side changes
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496239#comment-14496239 ] Thomas Graves commented on YARN-3434: - So I had considered putting it in the ResourceLimits, but ResourceLimits seems to be more of a queue-level thing to me (not a user-level one). For instance, ParentQueue passes this into LeafQueue, and ParentQueue cares nothing about user limits. If you stored it there you would either need to track the user it was for, or track it for all users. ResourceLimits gets updated when nodes are added and removed; we don't need to compute a particular user limit when that happens. So it would then be out of date, or we would change to update it when that happens, but that to me is a fairly large change and not really needed. The user limit calculations are lower down and recomputed per user, per application, per current request regularly, and putting this into the global ResourceLimits, given how it is calculated and used, didn't make sense to me. All you would be using it for is passing it down to assignContainer, and then it would be out of date. If someone else started looking at that value assuming it was up to date, then it would be wrong (unless of course we started updating it as stated above). But it would only be for a single user, not all users, unless again we changed to calculate it for every user whenever something changed. That seems a bit excessive. You are correct that needToUnreserve could go away. I started out on 2.6, which didn't have our changes, and I could have removed it when I added in amountNeededUnreserve. If we were to store it in the global ResourceLimits, then yes, the entire LimitsInfo can go away, including shouldContinue, as you would fall back to using the boolean return from each function. But again, based on my above comments, I'm not sure ResourceLimits is the correct place to put this. I just noticed that we are already keeping the userLimit in the User class; that would be another option.
But again I think we need to make it clear what it is. This particular check is done per application, per user, based on the currently requested Resource. The value stored wouldn't necessarily apply to all of the user's applications, since the resource request size could be different. Thoughts, or is there something I'm missing about ResourceLimits? Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. A user was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
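The kind of check under discussion can be illustrated in miniature (purely hypothetical, not the CapacityScheduler code; `can_assign_to_user` and its arguments are invented for the sketch): counting a user's reserved resources against the limit is what prevents a burst of reservations from letting that user surpass it.

```python
def can_assign_to_user(user_used, user_reserved, requested, user_limit):
    # Including user_reserved closes the window the report describes,
    # where ~1000 rapid 8G reservations pushed one user to 1.4X capacity.
    return user_used + user_reserved + requested <= user_limit
```

As the comment notes, such a value is recomputed per user, per application, per current request, so caching it globally would leave it stale between calls.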
[jira] [Assigned] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once
[ https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3489: -- Assignee: Varun Saxena RMServerUtils.validateResourceRequests should only obtain queue info once - Key: YARN-3489 URL: https://issues.apache.org/jira/browse/YARN-3489 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Since the label support was added, we now get the queue info for each request being validated in SchedulerUtils.validateResourceRequest. If validateResourceRequests needs to validate a lot of requests at a time (e.g. a large cluster with lots of varied locality in the requests) then it will get the queue info for each request. Since we build the queue info each time, this generates a lot of unnecessary garbage, as the queue isn't changing between requests. We should grab the queue info once and pass it down rather than building it again for each request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3471) Fix timeline client retry
[ https://issues.apache.org/jira/browse/YARN-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3471: - Affects Version/s: 2.8.0 Fix timeline client retry - Key: YARN-3471 URL: https://issues.apache.org/jira/browse/YARN-3471 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.8.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3471.1.patch, YARN-3471.2.patch I found that the client retry has some problems: 1. The new put methods will retry on all exceptions, but they should only do so upon ConnectException. 2. We can reuse TimelineClientConnectionRetry to simplify the retry logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
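Point 1 can be sketched as a retry loop that is selective about what it catches (an illustration, not TimelineClientConnectionRetry itself; `retry_on_connect` is a hypothetical helper, with Python's `ConnectionError` playing the role of Java's ConnectException):

```python
def retry_on_connect(op, max_retries=3):
    for attempt in range(max_retries):
        try:
            return op()
        except ConnectionError:
            # Only connection failures are retried; re-raise on the
            # final attempt so the last failure is not swallowed.
            if attempt == max_retries - 1:
                raise
        # Any other exception type propagates immediately, un-retried.
```

Retrying only the transient failure class keeps genuine errors (bad requests, server-side faults) from being masked behind pointless retries.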
[jira] [Commented] (YARN-3436) Fix URIs in documentation of YARN web service REST APIs
[ https://issues.apache.org/jira/browse/YARN-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496304#comment-14496304 ] Hudson commented on YARN-3436: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #165 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/165/]) YARN-3436. Fix URIs in documentation of YARN web service REST APIs. Contributed by Bibin A Chundatt. (ozawa: rev 05007b45e58bd9052f503cfb8c17bcfd22a686e3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/WebServicesIntro.md * hadoop-yarn-project/CHANGES.txt Fix URIs in documentation of YARN web service REST APIs - Key: YARN-3436 URL: https://issues.apache.org/jira/browse/YARN-3436 Project: Hadoop YARN Issue Type: Bug Components: documentation, resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3436.001.patch /docs/current/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html {quote} Response Examples JSON response with single resource HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/{color:red}app{color}/application_1324057493980_0001 Response Status Line: HTTP/1.1 200 OK {quote} The URL should be ws/v1/cluster/{color:red}apps{color}. Two examples on the same page are wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3266) RMContext inactiveNodes should have NodeId as map key
[ https://issues.apache.org/jira/browse/YARN-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496306#comment-14496306 ] Hudson commented on YARN-3266: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #165 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/165/]) YARN-3266. RMContext#inactiveNodes should have NodeId as map key. Contributed by Chengbing Liu (jianhe: rev b46ee1e7a31007985b88072d9af3d97c33a261a7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMActiveServiceContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodes.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/CHANGES.txt RMContext inactiveNodes should have NodeId as map key - Key: YARN-3266 URL: 
https://issues.apache.org/jira/browse/YARN-3266 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3266.01.patch, YARN-3266.02.patch, YARN-3266.03.patch Under the default NM port configuration, which is 0, we have observed in the current version that the lost nodes count is greater than the length of the lost node list. This will happen when we consecutively restart the same NM twice: * NM started at port 10001 * NM restarted at port 10002 * NM restarted at port 10003 * NM:10001 timeout, {{ClusterMetrics#incrNumLostNMs()}}, # lost node=1; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} has 1 element * NM:10002 timeout, {{ClusterMetrics#incrNumLostNMs()}}, # lost node=2; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} still has 1 element Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), {{inactiveNodes}} should be of type {{ConcurrentMap<NodeId, RMNode>}}. If this will break the current API, then the key string should include the NM's port as well. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496305#comment-14496305 ] Hudson commented on YARN-3361: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #165 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/165/]) YARN-3361. CapacityScheduler side changes to support non-exclusive node labels. Contributed by Wangda Tan (jianhe: rev 0fefda645bca935b87b6bb8ca63e6f18340d59f5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestNodeLabelContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/BuilderUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/Application.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/ResourceUsage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/SchedulingMode.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java CapacityScheduler
[jira] [Updated] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-3448: -- Attachment: YARN-3448.8.patch Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch, YARN-3448.5.patch, YARN-3448.7.patch, YARN-3448.8.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read-cache effectiveness based on the unique usage patterns of each database. With 5 separate databases each lookup is much faster. This can also help with I/O by placing the entity and index databases on separate disks. Rolling DBs for the entity and index DBs: 99.9% of the data is in these two sections, at least a 4:1 ratio (index to entity) for Tez. We can replace per-record DB removal with file system removal if we create a rolling set of databases that age out and can be efficiently removed. To do this we must add a constraint to always place an entity's events into its correct rolling DB instance based on start time. This allows us to stitch the data back together while reading, with artificial paging. Relax the synchronous write constraints. If we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes, which can be much faster.
Prefer Sequential writes. sequential writes can be several times faster than random writes. Spend some small effort arranging the writes in such a way that will trend towards sequential write performance over random write performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
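The rolling-DB routing constraint described above (every event of an entity goes to the DB instance covering the entity's start time, so whole DBs can be aged out with a file-system delete) can be sketched roughly as follows. This is an illustrative sketch, not the YARN-3448 implementation; `RollingDbRouter` and its method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the rolling-DB routing constraint: every event of an
// entity is routed to the DB instance that covers the entity's START time, so
// an entire expired DB can be removed with one file-system delete instead of
// record-at-a-time deletion under an exclusive write lock.
public class RollingDbRouter {
    private final long rollingPeriodMillis;

    public RollingDbRouter(long rollingPeriod, TimeUnit unit) {
        this.rollingPeriodMillis = unit.toMillis(rollingPeriod);
    }

    // All events of one entity map to the same bucket because the bucket is
    // derived from the entity's start time, not from each event's timestamp.
    public long bucketFor(long entityStartTimeMillis) {
        return entityStartTimeMillis / rollingPeriodMillis;
    }

    // A DB whose window ended more than TTL ago can be removed wholesale.
    public boolean isExpired(long bucket, long nowMillis, long ttlMillis) {
        long bucketEnd = (bucket + 1) * rollingPeriodMillis;
        return nowMillis - bucketEnd > ttlMillis;
    }

    public static void main(String[] args) {
        RollingDbRouter router = new RollingDbRouter(1, TimeUnit.HOURS);
        System.out.println(router.bucketFor(7_200_000L)); // 2
    }
}
```

Reads then stitch results from the small set of live buckets back together, which is where the "artificial paging" mentioned above comes in.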
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework and FifoOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496519#comment-14496519 ] Hudson commented on YARN-3318: -- FAILURE: Integrated in Hadoop-trunk-Commit #7588 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7588/]) YARN-3318. Create Initial OrderingPolicy Framework and FifoOrderingPolicy. (Craig Welch via wangda) (wangda: rev 5004e753322084e42dfda4be1d2db66677f86a1e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/OrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/ResourceUsage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/MockSchedulableEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/SchedulableEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/AbstractComparatorOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoOrderingPolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/TestFifoOrderingPolicy.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoComparator.java Create Initial OrderingPolicy Framework and FifoOrderingPolicy -- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch, YARN-3318.39.patch, YARN-3318.45.patch, YARN-3318.47.patch, YARN-3318.48.patch, YARN-3318.52.patch, YARN-3318.53.patch, YARN-3318.56.patch, YARN-3318.57.patch, YARN-3318.58.patch, YARN-3318.59.patch, YARN-3318.60.patch, YARN-3318.61.patch Create the initial framework required for using OrderingPolicies and an initial FifoOrderingPolicy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496490#comment-14496490 ] Zhijie Shen commented on YARN-3051: --- Hence, regardless of the implementation details, we logically use: 1. (entity type, entity id) to identify entities that are generated on the same cluster. 2. (cluster id, entity type, entity id) to identify entities globally across clusters. In terms of compatibility, {{getTimelineEntity(entity type, entity id)}} can assume the cluster ID is either the default one or the one configured in yarn-site.xml. Does it sound good? [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
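The identification rule discussed in the comment above can be sketched as a small key class. This is a hypothetical stand-in (`TimelineEntityKey` is not a real ATS class); it only illustrates the compatibility behavior of falling back to the configured default cluster when no cluster id is supplied.

```java
import java.util.Objects;

// Hypothetical sketch: an entity is globally identified by
// (cluster id, entity type, entity id); when the caller omits the cluster id,
// we fall back to the configured default, mirroring the yarn-site.xml
// compatibility behavior described in the comment.
public class TimelineEntityKey {
    private final String clusterId;
    private final String entityType;
    private final String entityId;

    public TimelineEntityKey(String clusterId, String entityType, String entityId,
                             String configuredDefaultCluster) {
        // Compatibility path: no explicit cluster id means "the default cluster".
        this.clusterId = (clusterId != null) ? clusterId : configuredDefaultCluster;
        this.entityType = entityType;
        this.entityId = entityId;
    }

    public String getClusterId() { return clusterId; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TimelineEntityKey)) return false;
        TimelineEntityKey k = (TimelineEntityKey) o;
        return clusterId.equals(k.clusterId)
            && entityType.equals(k.entityType)
            && entityId.equals(k.entityId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(clusterId, entityType, entityId);
    }
}
```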
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496534#comment-14496534 ] Varun Saxena commented on YARN-3051: Updated a WIP patch. Will update the javadoc once everyone is on the same page about the approach and API. Working on unit tests. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496509#comment-14496509 ] Varun Saxena commented on YARN-3051: As per the patch I am currently working on, if clusterid does not come in the query, it is taken from the config. So that's consistent. Although, I was assuming appid would be part of the PK. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496538#comment-14496538 ] Junping Du commented on YARN-3411: -- Thanks [~vrushalic] for delivering the proposal and POC patch; excellent job! Some quick comments from walking through the proposal: bq. Entity Table - primary key components-putting the UserID first helps to distribute writes across the regions in the hbase cluster. Pros: avoids single region hotspotting. Cons: connections would be open to several region servers during writes from per node ATS. Looks like we are trying to get rid of region server hotspotting issues. I agree that this design could help. However, it is still possible that a specific user submits many more applications than anyone else; in that case, the region hotspot issue would still appear, wouldn't it? I think the more general way to solve this problem is salting keys with a prefix. Thoughts? bq. Entity Table - column families-config needs to be stored as key value, not as a blob to enable efficient key based querying based on config param name. storing it in a separate column family helps to avoid scanning over config while reading metrics and vice versa +1. This leverages the strength of a columnar database. We should avoid storing any default values for keys, although this sounds challenging if the TimelineClient only has a configuration object. bq. Entity Table - metrics are written to with an hbase cell timestamp set to top of the minute or top of the 5 minute interval or whatever is decided. This helps in timeseries storage and retrieval in case of querying at the entity level. Can we also let the TimelineCollector do some aggregation of metrics over a similar time interval rather than sending every metric to HBase/Phoenix as it is received? This may help ease some pressure on the backend. bq. Flow by application id table I still think we should figure out some way to store application attempt info. The typical use case here is: for some reason (like a bug or a hardware capability issue), some flow's/application's AM could consistently fail more often than others'. Keeping this info can help us track such issues, can't it? bq. flow summary daily table (aggregation table managed by Phoenix) - could be triggered via coprocessor with each put in flow table or a cron run once per day to aggregate for yesterday (with catchup functionality in case of backlog etc) Doing this on each put into the flow table sounds a little expensive, especially when put activity is very frequent; maybe we should use some batch mode here? In addition, I think we can leverage the per-node TimelineCollector to do some first-level aggregation, which can help relieve the workload on the backend. [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.txt There is work in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we can evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
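The key-salting idea raised in the comment above is a standard HBase technique: prefix the row key with hash(key) mod N so that even a single very active user's rows spread over N buckets instead of hotspotting one region. A minimal sketch, assuming a hypothetical `SaltedRowKey` helper (not part of any ATS schema):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of row-key salting: a one-byte salt prefix derived from
// hash(userId) mod N distributes one user's writes across N region buckets.
// Readers must fan scans out over all N buckets to see every row.
public class SaltedRowKey {
    private final int buckets;

    public SaltedRowKey(int buckets) { this.buckets = buckets; }

    // Deterministic: the same userId always lands in the same bucket, so the
    // salt can be recomputed at read time without extra lookups.
    public int bucketOf(String userId) {
        return Math.floorMod(userId.hashCode(), buckets);
    }

    public byte[] salt(String userId, String rest) {
        // One-byte salt prefix, then the original key components.
        String key = (char) bucketOf(userId) + userId + "!" + rest;
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```

The trade-off is exactly the one debated above: salting removes the per-user hotspot but turns single-range scans into N parallel scans.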
[jira] [Updated] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3051: --- Attachment: YARN-3051.wip.patch [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496503#comment-14496503 ] Sangjin Lee commented on YARN-3051: --- Yep. That's perfect. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496511#comment-14496511 ] Sangjin Lee commented on YARN-3390: --- {quote} For putIfAbsent and remove, I don't use template method pattern, but let the subclass override the super class method and invoke it inside the override implementation, because I'm not sure if we will need pre process or post process, and if we only invoke the process when adding a new collector. If we're sure about template, I'm okay with the template pattern too. {quote} I'm fine with either approach. The main reason I thought of that is I wanted to be clear that the base implementation of putIfAbsent() and remove() is mandatory (i.e. not optional). Since we control all of it (base and subclasses), it might not be such a big deal either way. Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
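The distinction debated above (template method vs. subclass-overrides-and-calls-super) can be sketched as follows. This is an illustrative sketch, not the TimelineCollectorManager code; `BaseCollectorManager` and its hook names are hypothetical. The template variant makes the base bookkeeping mandatory (the final method always runs it) while the hooks stay optional.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the template-method option: putIfAbsent/remove are
// final so the base bookkeeping cannot be skipped; subclasses only override
// the optional post-processing hooks.
abstract class BaseCollectorManager<C> {
    private final Map<String, C> collectors = new ConcurrentHashMap<>();

    public final C putIfAbsent(String appId, C collector) {
        C existing = collectors.putIfAbsent(appId, collector);
        if (existing == null) {
            postPut(appId, collector); // invoked only when a new collector is added
        }
        return existing == null ? collector : existing;
    }

    public final C remove(String appId) {
        C removed = collectors.remove(appId);
        if (removed != null) {
            postRemove(appId, removed);
        }
        return removed;
    }

    // Optional hooks; no-ops by default.
    protected void postPut(String appId, C collector) { }
    protected void postRemove(String appId, C collector) { }
}
```

With override-and-call-super instead, nothing stops a subclass from forgetting the super call, which is the "mandatory base implementation" concern raised in the comment.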
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496528#comment-14496528 ] Hadoop QA commented on YARN-3448: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725620/YARN-3448.8.patch against trunk revision fddd552. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 10 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7347//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7347//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-applicationhistoryservice.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7347//console This message is automatically generated. 
Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch, YARN-3448.5.patch, YARN-3448.7.patch, YARN-3448.8.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can last hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read cache effectiveness based on its unique usage patterns. With 5 separate databases, each lookup is much faster. It also helps I/O to have the entity and index databases on separate disks. Rolling DBs for entity and index DBs. 99.9% of the data is in these two sections, with a 4:1 ratio (index to entity), at least for Tez. We can replace DB record removal with file system removal if we create a rolling set of databases that age out and can be efficiently removed. To do this we must place a constraint to always put an entity's events into its correct rolling db instance based on start time. This allows us to stitch the data back together while reading and to do artificial paging. Relax the synchronous write constraints. If we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes, which can be much faster. Prefer sequential writes. Sequential writes can be several times faster than random writes.
Spend some small effort arranging the writes in a way that trends towards sequential write performance over random write performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3490) Add an application decorator to ClientRMService
Jian Fang created YARN-3490: --- Summary: Add an application decorator to ClientRMService Key: YARN-3490 URL: https://issues.apache.org/jira/browse/YARN-3490 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Jian Fang Per the discussion on MAPREDUCE-6304, a Hadoop cloud service provider wants to hook in some logic to control the allocation of an application on the resource manager side, because it is sometimes impractical to control the client side of a Hadoop cluster in the cloud. A Hadoop service provider and Hadoop users usually have different privileges, control, and access on a Hadoop cluster in the cloud. One good example is that application masters should not be allocated to spot instances on Amazon EC2. To achieve that, an application decorator could be provided to orchestrate the ApplicationSubmissionContext, for example by specifying the AM label expression. Hadoop could provide a dummy decorator that does nothing by default, but it should allow users to replace this decorator with their own to meet their specific needs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
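The decorator hook proposed above could look roughly like the following sketch. None of these names exist in Hadoop; `SubmissionContext` here is a minimal stand-in for the much richer real ApplicationSubmissionContext, and the decorators are hypothetical examples.

```java
// Hypothetical sketch of the proposed hook: the RM would run the submission
// context through a decorator before accepting it. The label value and class
// names below are illustrative assumptions, not real Hadoop APIs.
interface SubmissionContext {
    String getAmNodeLabelExpression();
    void setAmNodeLabelExpression(String expr);
}

interface ApplicationDecorator {
    void decorate(SubmissionContext ctx);
}

// The default no-op decorator the issue suggests Hadoop could ship.
class NoOpDecorator implements ApplicationDecorator {
    public void decorate(SubmissionContext ctx) { }
}

// A provider-supplied decorator, e.g. to keep AMs off spot instances by
// forcing an AM label when the submitter did not set one.
class OnDemandOnlyAmDecorator implements ApplicationDecorator {
    public void decorate(SubmissionContext ctx) {
        if (ctx.getAmNodeLabelExpression() == null) {
            ctx.setAmNodeLabelExpression("ON_DEMAND");
        }
    }
}
```

The design choice mirrors the issue text: the default decorator changes nothing, and providers swap in their own implementation via configuration.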
[jira] [Updated] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
[ https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2605: Issue Type: Sub-task (was: Bug) Parent: YARN-149 [RM HA] Rest api endpoints doing redirect incorrectly - Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong Assignee: Anubhav Dhoot Labels: newbie The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or returning a well-defined error message (json or xml) stating the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
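The 303-based behavior suggested above can be sketched with the JDK's built-in HTTP server: answer REST calls with 303 See Other plus a Location header pointing at the active RM, and a machine-readable body. This is an illustrative sketch only (the RM runs on Jetty, not `com.sun.net.httpserver`), and the host names are placeholders.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Hypothetical sketch of the suggested fix: a standby endpoint that returns
// 303 See Other with a Location header, instead of a meta-refresh HTML page
// that programmatic HTTP clients cannot follow.
public class StandbyRedirect {
    public static HttpServer start(int port, String activeRmBase) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/ws/", exchange -> {
            String target = activeRmBase + exchange.getRequestURI();
            exchange.getResponseHeaders().set("Location", target);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            // Well-defined body stating standby status and the active RM link.
            byte[] body = ("{\"state\":\"standby\",\"activeRm\":\"" + target + "\"}")
                .getBytes("UTF-8");
            exchange.sendResponseHeaders(303, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

A client that does not follow redirects still gets an unambiguous status code and a parseable body, which is exactly what the meta-refresh response lacks.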
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496648#comment-14496648 ] Jian Fang commented on YARN-2306: - Could someone please tell me which JIRA fixed this bug in trunk? I am working on the hadoop 2.6.0 branch and need to see whether I need to fix this issue. Thanks in advance. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the reservation metrics (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
zhihai xu created YARN-3491: --- Summary: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently, FSDownload submission to the thread pool is done in PublicLocalizer#addResource, which runs in the Dispatcher thread, and completed localization handling is done in PublicLocalizer#run, which runs in the PublicLocalizer thread. Because the FSDownload submission to the thread pool in the following code is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel (multithreaded), public resource localization is serialized most of the time. {code} synchronized (pending) { pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath, resource, request.getContext().getStatCache())), request); } {code} There are two more benefits to this change: 1. The Dispatcher thread won't be blocked by the above FSDownload submission; the Dispatcher thread handles most of the time-critical events at the Node Manager. 2. No synchronization is needed on the HashMap (pending), because pending will only be accessed in the PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
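The single-thread design proposed above can be sketched as follows: the dispatcher only enqueues a lightweight request on a blocking queue; one worker thread both submits downloads and reaps completions, so the `pending` map is confined to that thread and needs no locking. This is an illustrative sketch, not NodeManager code; `SingleThreadLocalizer` and the string stand-ins for FSDownload requests are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;

// Hypothetical sketch of the proposed design: addResource() (dispatcher side)
// never blocks on pool submission, and `pending` is touched only by the
// localizer thread that calls runOnce(), so it needs no synchronization.
public class SingleThreadLocalizer {
    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final CompletionService<String> done = new ExecutorCompletionService<>(pool);
    // Thread-confined to the localizer thread: plain HashMap is safe.
    private final Map<Future<String>, String> pending = new HashMap<>();

    // Called from the dispatcher thread: cheap, just enqueues.
    public void addResource(String resource) {
        requests.add(resource);
    }

    // Runs in the localizer thread: submit new work, then reap completions.
    public int runOnce() throws InterruptedException, ExecutionException {
        String r;
        while ((r = requests.poll()) != null) {
            final String res = r;
            pending.put(done.submit(() -> "localized:" + res), r);
        }
        int completed = 0;
        Future<String> f;
        while ((f = done.poll()) != null) {
            pending.remove(f);
            f.get(); // completed-localization handling would go here
            completed++;
        }
        return completed;
    }

    public int pendingCount() { return pending.size(); }

    public void shutdown() { pool.shutdown(); }
}
```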
[jira] [Updated] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2696: - Attachment: YARN-2696.2.patch Attached ver.2 patch, which fixes the findbugs warning and test failures (TestRMDelegationTokens is not related). I've thought about Jian's comment: bq. We can merge PartitionedQueueComparator and nonPartitionedQueueComparator into a single QueueComparator. After thinking about this, I think we cannot: NonPartitionedQueueComparator is stateless and PartitionedQueueComparator is stateful (someone can modify partitionToLookAt for Partitioned..), but we should keep NonPartitionedQueueComparator as-is, always sorting by the default partition. Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler, the parent queue would choose child queues by used resource, from smallest to largest. Now that we support node labels in CapacityScheduler, we should also consider the used resource in child queues per node label when allocating resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496617#comment-14496617 ] Wangda Tan commented on YARN-3434: -- [~tgraves], I think your concern may not be a problem: ResourceLimits will be replaced (instead of updated) on node heartbeat. And the ResourceLimits object itself exists to decouple parent and child (e.g. ParentQueue from its children, LeafQueue from apps): the child doesn't need to understand how the parent computes limits; it only needs to respect them. For example, an app doesn't need to understand how the queue computes queue capacity/user-limit/continuous-reservation-looking; it only needs to know the limit considering all factors, so it can decide to allocate/release-before-allocate/cannot-continue. The usage of ResourceLimits in my mind for the user-limit case is: - ParentQueue computes/sets limits - LeafQueue stores limits (for why, see 1 below) - LeafQueue recomputes/sets user-limit when trying to allocate for each app/priority - LeafQueue checks user-limit as well as other limits when trying to allocate/reserve a container - The user-limit saved in ResourceLimits is only used in the normal allocation/reservation path; if it's a reserved allocation, we reset user-limit to unlimited. 1. Why store limits in LeafQueue instead of passing them down? This is required by headroom computation: an app's headroom is affected by changes in the queue's parent as well as its siblings, and we cannot update every app's headroom when those change, so we recompute headroom when the app heartbeats; that requires storing the latest ResourceLimits in LeafQueue. See YARN-2008 for more information. I hope the above clarifies my suggestion. Please let me know your thoughts.
Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. The user was able to consume 1.4X the queue capacity. It looks like when this application launched, it reserved about 1000 containers of 8G each within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3326) ReST support for getLabelsToNodes
[ https://issues.apache.org/jira/browse/YARN-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496693#comment-14496693 ] Tsuyoshi Ozawa commented on YARN-3326: -- +1, committing this shortly. Hey [~Naganarasimha], could you open a new JIRA to update the documentation for this feature? ReST support for getLabelsToNodes -- Key: YARN-3326 URL: https://issues.apache.org/jira/browse/YARN-3326 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3326.20150310-1.patch, YARN-3326.20150407-1.patch, YARN-3326.20150408-1.patch REST support to retrieve the LabelsToNodes mapping -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496644#comment-14496644 ] Wangda Tan commented on YARN-3434: -- bq. All you would be using it for is passing it down to assignContainer and then it would be out of date. If someone else started looking at that value assuming it was up to date then it would be wrong (unless of course we started updating it as stated above). But it would only be for a single user, not all users unless again we changed to calculate for every user whenever something changed. That seems a bit excessive. To clarify, ResourceLimits is the bridge between parent and child: the parent tells the child, "this is the limit you can use", and LeafQueue does the same thing for apps. ParentQueue doesn't compute/pass down user-limit to LeafQueue at all; LeafQueue does that and makes sure it gets updated for every allocation. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. The user was able to consume 1.4X the queue capacity. It looks like when this application launched, it reserved about 1000 containers of 8G each within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496663#comment-14496663 ] Jason Lowe commented on YARN-3491: -- Could you elaborate a bit on why the submit is time consuming? Unless I'm mistaken, the FSDownload constructor is very cheap, and queueing should simply be tacking an entry on a queue. Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). - Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently, FSDownload submission to the thread pool is done in PublicLocalizer#addResource, which runs in the Dispatcher thread, and completed localization handling is done in PublicLocalizer#run, which runs in the PublicLocalizer thread. Because the FSDownload submission to the thread pool in the following code is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel (multithreaded), public resource localization is serialized most of the time. {code} synchronized (pending) { pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath, resource, request.getContext().getStatCache())), request); } {code} There are two more benefits to this change: 1. The Dispatcher thread won't be blocked by the above FSDownload submission; the Dispatcher thread handles most of the time-critical events at the Node Manager. 2. No synchronization is needed on the HashMap (pending), because pending will only be accessed in the PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3492) AM fails to come up because RM and NM can't connect to each other
Karthik Kambatla created YARN-3492: -- Summary: AM fails to come up because RM and NM can't connect to each other Key: YARN-3492 URL: https://issues.apache.org/jira/browse/YARN-3492 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: pseudo-distributed cluster on a mac Reporter: Karthik Kambatla Priority: Blocker Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The container gets allocated, but doesn't get launched. The NM can't talk to the RM. Logs to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3492) AM fails to come up because RM and NM can't connect to each other
[ https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3492: --- Attachment: yarn-kasha-resourcemanager-kasha-mbp.local.log yarn-kasha-nodemanager-kasha-mbp.local.log AM fails to come up because RM and NM can't connect to each other - Key: YARN-3492 URL: https://issues.apache.org/jira/browse/YARN-3492 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: pseudo-distributed cluster on a mac Reporter: Karthik Kambatla Priority: Blocker Attachments: yarn-kasha-nodemanager-kasha-mbp.local.log, yarn-kasha-resourcemanager-kasha-mbp.local.log Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The container gets allocated, but doesn't get launched. The NM can't talk to the RM. Logs to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496881#comment-14496881 ] Jian He commented on YARN-2696: --- A few minor comments: - add a comment on why the no_label max resource is treated separately. {code} if (nodePartition == null || nodePartition.equals(RMNodeLabelsManager.NO_LABEL)) {code} - getChildrenAllocationIterator -> sortAndGetChildrenAllocationIterator Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler, the parent queue would choose child queues by used resource, from smallest to largest. Now that we support node labels in CapacityScheduler, we should also consider used resource in child queues by node label when allocating resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
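The smallest-used-first ordering described above can be sketched outside the scheduler roughly as follows (illustrative only: CapacityScheduler compares Resource objects per node partition via a ResourceCalculator, not raw longs keyed by queue name):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative only: order child queues by used resource, smallest first.
// With node labels, the value compared would be the usage under the
// partition currently being allocated, not one aggregate number per queue.
public class QueueOrdering {

  public static List<String> order(Map<String, Long> usedByQueue) {
    List<String> names = new ArrayList<>(usedByQueue.keySet());
    names.sort(Comparator.comparingLong(usedByQueue::get));
    return names;
  }
}
```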
[jira] [Commented] (YARN-3354) Container should contains node-labels asked by original ResourceRequests
[ https://issues.apache.org/jira/browse/YARN-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496888#comment-14496888 ] Jian He commented on YARN-3354: --- +1 Container should contains node-labels asked by original ResourceRequests Key: YARN-3354 URL: https://issues.apache.org/jira/browse/YARN-3354 Project: Hadoop YARN Issue Type: Sub-task Components: api, capacityscheduler, nodemanager, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3354.1.patch, YARN-3354.2.patch We proposed non-exclusive node labels in YARN-3214, which makes it possible for non-labeled resource requests to be allocated on labeled nodes that have idle resources. To make preemption work, we need to know an allocated container's original node label: when labeled resource requests come back, we need to kill non-labeled containers running on labeled nodes. This requires adding node-labels to Container; also, the NM needs to store this information and send it back to the RM when the RM restarts, to recover the original container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496892#comment-14496892 ] Jian He commented on YARN-2696: --- - Does this overlap with below {{Resources.equals(queueGuranteedResource, Resources.none()) ? 0}} check ? {code} // make queueGuranteed = minimum_allocation to avoid divided by 0. queueGuranteedResource = Resources.max(rc, totalPartitionResource, queueGuranteedResource, minimumAllocation); {code} Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler. The parent queue will choose child queues by the used resource from smallest to largest. Now we support node label in CapacityScheduler, we should also consider used resource in child queues by node labels when allocating resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496910#comment-14496910 ] Wangda Tan commented on YARN-3434: -- [~tgraves], Make sense to me, especially for the {{local transient variable rather then a globally stored one}}. So I think after the change, flows to use/update ResourceLimit will be: {code} In LeafQueue: Both: updateClusterResource | |-- resource-limit assignContainers | updatestore (only for compute headroom) Only: assignContainers | V check queue limit | V check user limit | V set how-much-should-unreserve to ResourceLimits and pass down {code} Is that what you also think about? Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0 User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, each 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3492) AM fails to come up because RM and NM can't connect to each other
[ https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496918#comment-14496918 ] Tsuyoshi Ozawa commented on YARN-3492: -- [~kasha], could you attach yarn-site.xml and mapred-site.xml for investigation? AM fails to come up because RM and NM can't connect to each other - Key: YARN-3492 URL: https://issues.apache.org/jira/browse/YARN-3492 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: pseudo-distributed cluster on a mac Reporter: Karthik Kambatla Priority: Blocker Attachments: yarn-kasha-nodemanager-kasha-mbp.local.log, yarn-kasha-resourcemanager-kasha-mbp.local.log Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The container gets allocated, but doesn't get launched. The NM can't talk to the RM. Logs to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3404) View the queue name to YARN Application page
[ https://issues.apache.org/jira/browse/YARN-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496920#comment-14496920 ] Jian He commented on YARN-3404: --- +1 View the queue name to YARN Application page Key: YARN-3404 URL: https://issues.apache.org/jira/browse/YARN-3404 Project: Hadoop YARN Issue Type: Improvement Reporter: Ryu Kobayashi Assignee: Ryu Kobayashi Priority: Minor Attachments: YARN-3404.1.patch, YARN-3404.2.patch, YARN-3404.3.patch, YARN-3404.4.patch, screenshot.png We want to display the name of the queue an application uses on the YARN Application page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3492) AM fails to come up because RM and NM can't connect to each other
[ https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3492: --- Attachment: yarn-site.xml mapred-site.xml AM fails to come up because RM and NM can't connect to each other - Key: YARN-3492 URL: https://issues.apache.org/jira/browse/YARN-3492 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: pseudo-distributed cluster on a mac Reporter: Karthik Kambatla Priority: Blocker Attachments: mapred-site.xml, yarn-kasha-nodemanager-kasha-mbp.local.log, yarn-kasha-resourcemanager-kasha-mbp.local.log, yarn-site.xml Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The container gets allocated, but doesn't get launched. The NM can't talk to the RM. Logs to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3005) [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java
[ https://issues.apache.org/jira/browse/YARN-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-3005: Assignee: Kengo Seki [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java Key: YARN-3005 URL: https://issues.apache.org/jira/browse/YARN-3005 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Akira AJISAKA Assignee: Kengo Seki Priority: Trivial Labels: newbie Fix For: 2.7.0 Attachments: YARN-3005.001.patch, YARN-3005.002.patch Since we have moved to JDK7, we can refactor the below if-else statement for String. {code} // TODO JDK7 SWITCH if (REGISTRY_CLIENT_AUTH_KERBEROS.equals(auth)) { access = AccessPolicy.sasl; } else if (REGISTRY_CLIENT_AUTH_DIGEST.equals(auth)) { access = AccessPolicy.digest; } else if (REGISTRY_CLIENT_AUTH_ANONYMOUS.equals(auth)) { access = AccessPolicy.anon; } else { throw new ServiceStateException(E_UNKNOWN_AUTHENTICATION_MECHANISM + "\"" + auth + "\""); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
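The proposed refactor can be sketched standalone as a switch on the String value instead of chained equals() checks (the constant values such as "kerberos" and the enum below are simplified assumptions, not the actual RegistrySecurity declarations):

```java
// Standalone sketch of the JDK7 String-switch refactor proposed in YARN-3005.
// All names and string values here are illustrative stand-ins.
public class AuthSwitch {

  enum AccessPolicy { sasl, digest, anon }

  static AccessPolicy resolve(String auth) {
    switch (auth) {
      case "kerberos":
        return AccessPolicy.sasl;
      case "digest":
        return AccessPolicy.digest;
      case "anonymous":
        return AccessPolicy.anon;
      default:
        throw new IllegalStateException(
            "Unknown authentication mechanism \"" + auth + "\"");
    }
  }
}
```

A String switch compiles to a hashCode dispatch plus equals() confirmation, so behavior matches the if-else chain while reading more clearly.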
[jira] [Updated] (YARN-3326) Support RESTful API for getLabelsToNodes
[ https://issues.apache.org/jira/browse/YARN-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3326: - Summary: Support RESTful API for getLabelsToNodes (was: ReST support for getLabelsToNodes ) Support RESTful API for getLabelsToNodes - Key: YARN-3326 URL: https://issues.apache.org/jira/browse/YARN-3326 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3326.20150310-1.patch, YARN-3326.20150407-1.patch, YARN-3326.20150408-1.patch REST to support to retrieve LabelsToNodes Mapping -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496735#comment-14496735 ] Thomas Graves commented on YARN-3434: - I am not saying the child needs to know how the parent calculates the resource limit. I am saying the user limit, and whether it needs to unreserve to make another reservation, has nothing to do with the parent queue (i.e., it doesn't apply to the parent queue). Remember, I don't need to store the user limit; I need to store whether it needs to unreserve and, if it does, how much it needs to unreserve. When a node heartbeats, it goes through the regular assignments and updates the leafQueue clusterResources based on what the parent passes in. When a node is removed or added, it updates the resource limits (none of these apply to the calculation of whether it needs to unreserve or not). Basically it comes down to: is this information useful outside of the small window between when it is calculated and when it is needed in assignContainer()? My thought is no. And you said it yourself in the last bullet above. Although we have been referring to the userLimit, and perhaps that is the problem. I don't need to store the userLimit, I need to store whether it needs to unreserve and if so how much. Therefore it fits better as a local transient variable rather than a globally stored one. If you store just the userLimit then you need to recalculate stuff, which I'm trying to avoid. I understand why we are storing the current information in ResourceLimits, because it has to do with headroom and parent limits and is recalculated at various points, but the current implementation in canAssignToUser doesn't use headroom at all, and whether we need to unreserve or not on the last call to assignContainers doesn't affect the headroom calculation. Again, basically all we would be doing is placing an extra global variable(s) in the ResourceLimits class just to pass it on down a couple of functions.
That to me is a parameter. Now if we had multiple things needing this or updating it, then to me it fits better in ResourceLimits. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. The user was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers of 8G each within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
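The "local transient value passed down the call chain" idea being debated here could be sketched, with purely hypothetical names, as:

```java
// Hypothetical names throughout -- this only sketches the idea under
// discussion in YARN-3434: compute "how much must be unreserved" at
// user-limit check time and carry it down the call chain in a short-lived
// object, rather than storing it in a long-lived field and recalculating.
public class AllocationLimits {

  final long amountNeededUnreserve; // 0 means nothing must be unreserved

  AllocationLimits(long amountNeededUnreserve) {
    this.amountNeededUnreserve = amountNeededUnreserve;
  }

  // would be computed during the user-limit check and passed as a parameter
  // down to the container-assignment step, then discarded
  static AllocationLimits checkUserLimit(long userUsed, long userLimit, long asked) {
    long over = (userUsed + asked) - userLimit;
    return new AllocationLimits(Math.max(0, over));
  }
}
```

The value is only meaningful in the window between the check and the assignment, which is why the comment argues for a parameter rather than shared state.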
[jira] [Commented] (YARN-3462) Patches applied for YARN-2424 are inconsistent between trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496757#comment-14496757 ] Naganarasimha G R commented on YARN-3462: - Thanks for reviewing and committing, [~qwertymaniac] [~sidharta-s] Patches applied for YARN-2424 are inconsistent between trunk and branch-2 - Key: YARN-3462 URL: https://issues.apache.org/jira/browse/YARN-3462 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Sidharta Seethana Assignee: Naganarasimha G R Fix For: 2.7.1 Attachments: YARN-3462.20150508-1.patch It looks like the changes for YARN-2424 are not the same for trunk (commit 7e75226e68715c3eca9d346c8eaf2f265aa70d23) and branch-2 (commit 5d965f2f3cf97a87603720948aacd4f7877d73c4). Branch-2 has a missing warning, and the documentation is a bit different as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3326) Support RESTful API for getLabelsToNodes
[ https://issues.apache.org/jira/browse/YARN-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496712#comment-14496712 ] Tsuyoshi Ozawa commented on YARN-3326: -- Committed this to trunk and branch-2. Thanks [~Naganarasimha] for your contribution and thanks [~vvasudev] for your review! Support RESTful API for getLabelsToNodes - Key: YARN-3326 URL: https://issues.apache.org/jira/browse/YARN-3326 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: YARN-3326.20150310-1.patch, YARN-3326.20150407-1.patch, YARN-3326.20150408-1.patch REST to support to retrieve LabelsToNodes Mapping -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3326) Support RESTful API for getLabelsToNodes
[ https://issues.apache.org/jira/browse/YARN-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496733#comment-14496733 ] Naganarasimha G R commented on YARN-3326: - Thanks for the review, [~ozawa]. I will check the scope of YARN-2801, and if it doesn't cover this feature I will raise a new JIRA. Support RESTful API for getLabelsToNodes - Key: YARN-3326 URL: https://issues.apache.org/jira/browse/YARN-3326 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: YARN-3326.20150310-1.patch, YARN-3326.20150407-1.patch, YARN-3326.20150408-1.patch REST support to retrieve the LabelsToNodes mapping -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496702#comment-14496702 ] zhihai xu commented on YARN-3491: - I saw the serialization for public resource localization in the following logs: The following log shows two private localization requests and many public localization requests from container_e30_1426628374875_110892_01_000475 {code} 2015-04-07 22:49:56,750 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110892_01_000475 transitioned from NEW to LOCALIZING 2015-04-07 22:49:56,751 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/user/databot/.staging/job_1426628374875_110892/job.xml transitioned from INIT to DOWNLOADING 2015-04-07 22:49:56,751 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/user/databot/.staging/job_1426628374875_110892/job.jar transitioned from INIT to DOWNLOADING 2015-04-07 22:49:56,751 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/tmp/temp182237/tmp-1316042064/reflections.jar transitioned from INIT to DOWNLOADING 2015-04-07 22:49:56,751 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/tmp/temp182237/tmp-327542609/service-media-sdk.jar transitioned from INIT to DOWNLOADING 2015-04-07 22:49:56,751 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/tmp/temp182237/tmp1631960573/service-local-search-sdk.jar transitioned from INIT to DOWNLOADING 2015-04-07 22:49:56,751 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource 
hdfs://nameservice1/tmp/temp182237/tmp-1521315530/ace-geo.jar transitioned from INIT to DOWNLOADING 2015-04-07 22:49:56,751 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/tmp/temp182237/tmp1347512155/cortex-server.jar transitioned from INIT to DOWNLOADING {code} The following log shows how the public resource localizations are processed. {code} 2015-04-07 22:49:56,758 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_e30_1426628374875_110892_01_000475 2015-04-07 22:49:56,758 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp182237/tmp-1316042064/reflections.jar, 1428446867531, FILE, null } 2015-04-07 22:49:56,882 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp182237/tmp-327542609/service-media-sdk.jar, 1428446864128, FILE, null } 2015-04-07 22:49:56,902 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/tmp/temp182237/tmp-1316042064/reflections.jar(-/data2/yarn/nm/filecache/4877652/reflections.jar) transitioned from DOWNLOADING to LOCALIZED 2015-04-07 22:49:57,127 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp182237/tmp1631960573/service-local-search-sdk.jar, 1428446858408, FILE, null } 2015-04-07 22:49:57,145 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/tmp/temp182237/tmp-327542609/service-media-sdk.jar(-/data11/yarn/nm/filecache/4877653/service-media-sdk.jar) transitioned from DOWNLOADING to LOCALIZED 2015-04-07 22:49:57,251 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp182237/tmp-1521315530/ace-geo.jar, 1428446862857, FILE, null } 2015-04-07 22:49:57,270 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://nameservice1/tmp/temp182237/tmp1631960573/service-local-search-sdk.jar(-/data1/yarn/nm/filecache/4877654/service-local-search-sdk.jar) transitioned from DOWNLOADING to LOCALIZED 2015-04-07 22:49:57,383 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp182237/tmp1347512155/cortex-server.jar, 1428446857069, FILE, null } {code} Based on the log, you can see the thread pool is not fully used; only one thread is used. The default thread
[jira] [Commented] (YARN-3005) [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java
[ https://issues.apache.org/jira/browse/YARN-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496708#comment-14496708 ] Akira AJISAKA commented on YARN-3005: - Assigned [~sekikn]. Thanks. [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java Key: YARN-3005 URL: https://issues.apache.org/jira/browse/YARN-3005 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Akira AJISAKA Assignee: Kengo Seki Priority: Trivial Labels: newbie Fix For: 2.7.0 Attachments: YARN-3005.001.patch, YARN-3005.002.patch Since we have moved to JDK7, we can refactor the below if-else statement for String. {code} // TODO JDK7 SWITCH if (REGISTRY_CLIENT_AUTH_KERBEROS.equals(auth)) { access = AccessPolicy.sasl; } else if (REGISTRY_CLIENT_AUTH_DIGEST.equals(auth)) { access = AccessPolicy.digest; } else if (REGISTRY_CLIENT_AUTH_ANONYMOUS.equals(auth)) { access = AccessPolicy.anon; } else { throw new ServiceStateException(E_UNKNOWN_AUTHENTICATION_MECHANISM + "\"" + auth + "\""); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3394) WebApplication proxy documentation is incomplete
[ https://issues.apache.org/jira/browse/YARN-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496725#comment-14496725 ] Tsuyoshi Ozawa commented on YARN-3394: -- Thanks Naganarasimha for your contribution and thanks Jian for your commit! WebApplication proxy documentation is incomplete - Key: YARN-3394 URL: https://issues.apache.org/jira/browse/YARN-3394 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: WebApplicationProxy.html, YARN-3394.20150324-1.patch Webproxy documentation is incomplete hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html 1.Configuration of service start/stop as separate server 2.Steps to start as daemon service 3.Secure mode for Web proxy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496732#comment-14496732 ] Hadoop QA commented on YARN-2696: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725637/YARN-2696.2.patch against trunk revision 9e8309a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7348//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7348//console This message is automatically generated. Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler. The parent queue will choose child queues by the used resource from smallest to largest. 
Now we support node label in CapacityScheduler, we should also consider used resource in child queues by node labels when allocating resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3326) Support RESTful API for getLabelsToNodes
[ https://issues.apache.org/jira/browse/YARN-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496731#comment-14496731 ] Hudson commented on YARN-3326: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7590 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7590/]) YARN-3326. Support RESTful API for getLabelsToNodes. Contributed by Naganarasimha G R. (ozawa: rev e48cedc663b8a26fd62140c8e2907f9b4edd9785) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LabelsToNodesInfo.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodeLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/NodeIDsInfo.java Support RESTful API for getLabelsToNodes - Key: YARN-3326 URL: https://issues.apache.org/jira/browse/YARN-3326 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: YARN-3326.20150310-1.patch, YARN-3326.20150407-1.patch, YARN-3326.20150408-1.patch REST to support to retrieve LabelsToNodes Mapping -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497311#comment-14497311 ] Hadoop QA commented on YARN-3463: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725702/YARN-3463.64.patch against trunk revision 1b89a3e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1149 javac compiler warnings (more than the trunk's current 1147 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7350//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7350//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7350//console This message is automatically generated. 
Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497087#comment-14497087 ] Wangda Tan commented on YARN-3434: -- bq. Or were you saying create a ResourceLimit and pass it as parameter to canAssignToUser and canAssignToThisQueue and modify that instance. That instance would then be passed down though to assignContainer()? I prefer the above one which is according to your previously comment local transient variable rather than a globally stored one. Is this also what you preferred? Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0 User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, each 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3463: -- Attachment: YARN-3463.64.patch rebased to current trunk Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2498) Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497261#comment-14497261 ] Wangda Tan commented on YARN-2498: -- Discussed with [~mayank_bansal]; taking over and working on this, will post patch/implementation-notes soon. Respect labels in preemption policy of capacity scheduler - Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy:
# Recursively calculate {{ideal_assigned}} for each queue. This depends on the available resource, the resource used/pending in each queue, and the guaranteed capacity of each queue.
# Mark to-be-preempted containers: for each over-satisfied queue, mark some containers to be preempted.
# Notify the scheduler about to-be-preempted containers.
We need to respect labels in the cluster for both #1 and #2: For #1, when there is some resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access such labels. For #2, when we decide whether we need to preempt a container, we need to make sure the resource this container holds is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
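The label constraint in stage #1 can be illustrated with a toy model like the one below. Nothing here is the actual ProportionalCapacityPreemptionPolicy code: the class and method names are invented, and the even split per label is only a placeholder for the real guaranteed-capacity math. The sketch shows the one rule under discussion: available resource carrying a label is only added to the {{ideal_assigned}} of queues that can access that label.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of label-aware ideal_assigned calculation (illustrative only).
public class LabelAwareIdealAssigned {
    // availableByLabel: free resource in the cluster, keyed by node label.
    // queueAccessibleLabels: for each queue, the set of labels it may access.
    // Returns ideal_assigned per queue; queues that cannot access a label
    // never receive any of that label's resource.
    public static Map<String, Long> distribute(
            Map<String, Long> availableByLabel,
            Map<String, Set<String>> queueAccessibleLabels) {
        Map<String, Long> idealAssigned = new HashMap<>();
        for (Map.Entry<String, Long> e : availableByLabel.entrySet()) {
            String label = e.getKey();
            // Only queues that can access this label are eligible.
            List<String> eligible = new ArrayList<>();
            for (Map.Entry<String, Set<String>> q : queueAccessibleLabels.entrySet()) {
                if (q.getValue().contains(label)) {
                    eligible.add(q.getKey());
                }
            }
            if (eligible.isEmpty()) {
                continue;
            }
            // Even split stands in for the real capacity-weighted calculation.
            long share = e.getValue() / eligible.size();
            for (String q : eligible) {
                idealAssigned.merge(q, share, Long::sum);
            }
        }
        return idealAssigned;
    }
}
```

With 8 units of a "gpu"-labeled resource and only queue "a" able to access "gpu", all 8 units go to "a" and none to "b", which is exactly the constraint #1 asks for.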
[jira] [Created] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
Sumana Sathish created YARN-3493: Summary: RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Priority: Critical Fix For: 2.7.0
RM fails to come up for the following case:
1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in background and wait for the job to reach running state
3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 before the above job completes
4. Restart RM
5. RM fails to come up with the below error
{code:title=RM error for Mem settings changed}
- RM app submission failed in validating AM resource request for application application_1429094976272_0008
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415) at
{code}
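The check that rejects the recovered request can be sketched as below. This is a simplified, hypothetical stand-in for SchedulerUtils.validateResourceRequest, not the actual implementation: the real method also validates vcores and throws InvalidResourceRequestException, while a plain IllegalArgumentException keeps the sketch self-contained.

```java
// Simplified stand-in for the memory part of
// SchedulerUtils.validateResourceRequest (illustrative only).
public class ValidateSketch {
    // Rejects a request whose memory is negative or above the configured
    // yarn.scheduler.maximum-allocation-mb; a request recovered from the
    // state store is re-validated against the *current* maximum, which is
    // why lowering the maximum breaks RM recovery.
    public static void validate(int requestedMemory, int maxMemory) {
        if (requestedMemory < 0 || requestedMemory > maxMemory) {
            throw new IllegalArgumentException(
                "Invalid resource request, requested memory < 0, or requested memory > "
                + "max configured, requestedMemory=" + requestedMemory
                + ", maxMemory=" + maxMemory);
        }
    }
}
```

In the scenario above, the app was submitted with a 3072 MB request while the maximum was 4000 MB, then the maximum was lowered to 2048 MB; on restart the recovered request fails this check.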
[jira] [Assigned] (YARN-2498) Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2498: Assignee: Wangda Tan (was: Mayank Bansal) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497065#comment-14497065 ] Wangda Tan commented on YARN-3434: -- bq. Are you suggesting we change the patch to modify ResourceLimits and pass down rather than using the LimitsInfo class? Yes, that's my suggestion. bq. at least not without adding the shouldContinue flag to it Kind of. What I'm thinking is we can add amountNeededUnreserve to ResourceLimits. canAssignToThisQueue/User will return a boolean meaning shouldContinue, and set amountNeededUnreserve (instead of limit; we don't need to change limit). That is very similar to your original logic, and we don't need the extra LimitsInfo. After we get the updated ResourceLimits and pass it down, the problem should be resolved. Did I miss anything? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
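The shape of that suggestion might look like the following sketch. ResourceLimits and canAssignToThisQueue come from the discussion above; everything else (the field types, the arithmetic, returning false when the limit is hit) is invented for illustration and is not the actual CapacityScheduler code.

```java
// Hypothetical sketch of a ResourceLimits that records how much reserved
// resource must be released, instead of using a separate LimitsInfo class.
public class ResourceLimitsSketch {
    public long limit;                 // the limit itself, never changed by the check
    public long amountNeededUnreserve; // set when the would-be usage exceeds the limit

    public ResourceLimitsSketch(long limit) {
        this.limit = limit;
    }

    // Returns the shouldContinue boolean: true when the allocation fits.
    // When it does not fit, records on this instance how much reserved
    // resource would have to be unreserved, so the value can travel down
    // the call chain with the limits object.
    public boolean canAssignToThisQueue(long used, long required) {
        if (used + required <= limit) {
            amountNeededUnreserve = 0;
            return true;
        }
        amountNeededUnreserve = used + required - limit;
        return false;
    }
}
```

The point of the design is that the caller already passes the limits object down to assignContainer(), so piggybacking amountNeededUnreserve on it avoids introducing and threading through a second carrier class.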
[jira] [Commented] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497085#comment-14497085 ] zhihai xu commented on YARN-3491: - Hi [~jlowe], thanks for the comment. Queueing is fast, but it takes longer to hand an FSDownload to a new worker thread. Only if all threads in the thread pool are already in use is the submission a quick queue insert (LinkedBlockingQueue#offer). Based on the following code in ThreadPoolExecutor#execute, corePoolSize is the thread pool size, which is 4 in this case. workQueue.offer(command) is fast but addWorker is slow, and the task is only queued when all threads in the thread pool are running.
{code}
public void execute(Runnable command) {
    if (command == null)
        throw new NullPointerException();
    /*
     * Proceed in 3 steps:
     *
     * 1. If fewer than corePoolSize threads are running, try to
     * start a new thread with the given command as its first
     * task. The call to addWorker atomically checks runState and
     * workerCount, and so prevents false alarms that would add
     * threads when it shouldn't, by returning false.
     *
     * 2. If a task can be successfully queued, then we still need
     * to double-check whether we should have added a thread
     * (because existing ones died since last checking) or that
     * the pool shut down since entry into this method. So we
     * recheck state and if necessary roll back the enqueuing if
     * stopped, or start a new thread if there are none.
     *
     * 3. If we cannot queue task, then we try to add a new
     * thread. If it fails, we know we are shut down or saturated
     * and so reject the task.
     */
    int c = ctl.get();
    if (workerCountOf(c) < corePoolSize) {
        if (addWorker(command, true))
            return;
        c = ctl.get();
    }
    if (isRunning(c) && workQueue.offer(command)) {
        int recheck = ctl.get();
        if (! isRunning(recheck) && remove(command))
            reject(command);
        else if (workerCountOf(recheck) == 0)
            addWorker(null, false);
    }
    else if (!addWorker(command, false))
        reject(command);
}
{code}
The issue is: if the time to run one FSDownload (resource localization) is close to the time to run the submit (adding the FSDownload to a worker thread), the oscillation happens and there is only one worker thread running. Then the Dispatcher thread is blocked for a longer time. The above logs prove this situation: LocalizerRunner#addResource, used by the private localizer, takes less than one millisecond to process one REQUEST_RESOURCE_LOCALIZATION event, but PublicLocalizer#addResource, used by the public localizer, takes 124 milliseconds to process one REQUEST_RESOURCE_LOCALIZATION event. Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). - Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource, which runs in the Dispatcher thread, and completed localization handling is done in PublicLocalizer#run, which runs in the PublicLocalizer thread. Because FSDownload submission to the thread pool at the following code is time consuming, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time.
{code}
synchronized (pending) {
    pending.put(queue.submit(new FSDownload(lfs, null, conf,
        publicDirDestPath, resource, request.getContext().getStatCache())),
        request);
}
{code}
Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by the above FSDownload submission; the Dispatcher thread handles most time-critical events at the Node Manager. 2. No synchronization on the HashMap (pending) is needed, because pending will only be accessed in the PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
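The execute() behavior discussed in the comment above is easy to observe directly. The sketch below is assumed JDK behavior, not YARN code, and the class name is invented: it submits six blocking tasks to a pool with corePoolSize 4, so the first four submissions each go through addWorker (thread creation) while the remaining two are merely offered onto the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Demonstrates the two paths in ThreadPoolExecutor#execute: addWorker for the
// first corePoolSize submissions, workQueue.offer for the rest.
public class PoolSubmitDemo {
    // Submits 'tasks' blocking jobs to a fixed pool of 'coreSize' threads and
    // returns { worker threads started, tasks left waiting on the queue }.
    public static int[] submitAndMeasure(int coreSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                coreSize, coreSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        final CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            // First coreSize calls create worker threads (slow addWorker path);
            // later calls only enqueue the task (fast offer path).
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        release.await(); // hold the worker so the pool stays full
                    } catch (InterruptedException ignored) {
                    }
                }
            });
        }
        int[] result = { pool.getPoolSize(), pool.getQueue().size() };
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException ignored) {
        }
        return result;
    }
}
```

With coreSize 4 and 6 tasks, the pool reports 4 started workers and 2 queued tasks, matching the comment's point that only submissions beyond the core pool take the cheap offer path.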
[jira] [Updated] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumana Sathish updated YARN-3493: - Attachment: yarn-yarn-resourcemanager.log.zip
[jira] [Updated] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3493: -- Fix Version/s: (was: 2.7.0)
[jira] [Assigned] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-3493: - Assignee: Jian He
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497129#comment-14497129 ] Jian He commented on YARN-3493: --- [~kasha], I think this happened on a different code path.
[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497139#comment-14497139 ] Hadoop QA commented on YARN-2696: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725687/YARN-2696.3.patch against trunk revision b2e6cf6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7349//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7349//console This message is automatically generated. Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch, YARN-2696.3.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler. The parent queue will choose child queues by the used resource from smallest to largest. 
Now that CapacityScheduler supports node labels, we should also consider the used resource in child queues per node label when allocating resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
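The idea in YARN-2696 can be sketched in a few lines: instead of ordering child queues by their aggregate used resource, order them by what they use under the specific node label being allocated. This is an illustrative sketch only, not the actual CapacityScheduler code; the `QueueInfo` class, the `usedByLabel` map, and tracking usage in MB are all assumptions made for the example.

```java
import java.util.*;

// Illustrative sketch: order child queues by the resource they already use
// under a specific node label, smallest first, so the least-loaded queue
// (for that label) is offered resources first.
class QueueInfo {
    final String name;
    final Map<String, Long> usedByLabel = new HashMap<>(); // label -> used MB

    QueueInfo(String name) { this.name = name; }

    long usedFor(String label) {
        return usedByLabel.getOrDefault(label, 0L);
    }
}

class LabelAwareQueueOrdering {
    // Sort queues by usage under the given label, ascending.
    static List<QueueInfo> sortForLabel(List<QueueInfo> queues, String label) {
        List<QueueInfo> sorted = new ArrayList<>(queues);
        sorted.sort(Comparator.comparingLong((QueueInfo q) -> q.usedFor(label)));
        return sorted;
    }
}
```

With this ordering, a queue that is heavily used overall but idle under the `gpu` label would still be offered `gpu` resources first, which is the point of the change.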
[jira] [Updated] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3463: -- Attachment: YARN-3463.65.patch Suppressed orderingpolicy from appearing in web service responses; it is still shown on the web UI. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3493: -- Attachment: YARN-3493.1.patch Uploaded a patch to ignore this exception on recovery. RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, yarn-yarn-resourcemanager.log.zip RM fails to come up for the following case:
1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in background and wait for the job to reach running state
3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 before the above job completes
4. Restart RM
5. RM fails to come up with the below error
{code:title=RM error for Mem settings changed}
- RM app submission failed in validating AM resource request for application application_1429094976272_0008
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
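The fix direction mentioned in the update above ("ignore this exception on recovery") can be sketched in miniature: during recovery, a request that violates the *current* maximum-allocation setting should fail only that application rather than aborting RM startup. The class and method names below are hypothetical stand-ins, not the actual patch; the real validation lives in SchedulerUtils/RMAppManager.

```java
import java.util.*;

// Hypothetical sketch: tolerate InvalidResourceRequestException while
// recovering apps instead of letting it abort ResourceManager startup.
class InvalidResourceRequestException extends Exception {
    InvalidResourceRequestException(String m) { super(m); }
}

class RecoverySketch {
    // Validate a recovered request against the current max; throws if too big.
    static void validate(int requestedMb, int maxMb)
            throws InvalidResourceRequestException {
        if (requestedMb > maxMb) {
            throw new InvalidResourceRequestException(
                "requestedMemory=" + requestedMb + ", maxMemory=" + maxMb);
        }
    }

    // Returns ids of apps that recovered successfully; invalid ones are
    // skipped rather than failing the whole recovery.
    static List<String> recover(Map<String, Integer> appToRequestMb, int maxMb) {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, Integer> e : appToRequestMb.entrySet()) {
            try {
                validate(e.getValue(), maxMb);
                recovered.add(e.getKey());
            } catch (InvalidResourceRequestException ex) {
                // On recovery, tolerate the mismatch instead of failing startup.
            }
        }
        return recovered;
    }
}
```

This mirrors the scenario in the bug report: an app recovered with a 3072 MB request against a lowered 2048 MB maximum would be skipped, and the RM would still come up.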
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497341#comment-14497341 ] Sangjin Lee commented on YARN-3390: --- I took a pass at the patch, and it looks good for the most part. I would ask you to reconcile the TimelineCollectorManager changes with what I have over on YARN-3437. Again, I have a slight preference for the hook/template methods for the aforementioned reason, but it's not a strong preference one way or another. However, I'm not sure why there is a change for RMContainerAllocator.java. It doesn't look like an intended change? Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497320#comment-14497320 ] Sangjin Lee commented on YARN-3051: --- We chatted offline about the issue of what context is required for the reader API and the uniqueness requirement. I'm not sure if there is a complete agreement on this yet, but at least this is a proposal from us ([~vrushalic], [~jrottinghuis], and me). - for reader calls that ask for sub-application entities, the application id must be specified - uniqueness is similarly defined; (entity type, entity id) uniquely identifies an entity within the scope of a YARN application We feel that this is the most natural way of supporting writes/reads. One scenario to consider is reducing impact on current users of ATS, as v.2 would require app id which v.1 did not require. For that, we would need to update the user library to have a compatibility layer (e.g. tez, etc.). Thoughts? [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
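The uniqueness proposal above — (entity type, entity id) identifies an entity only within the scope of one YARN application — implies that any storage or reader key must carry the application id as well. A minimal sketch of such a composite key (class and field names are illustrative, not from the patch):

```java
import java.util.Objects;

// Sketch of the proposed uniqueness rule: (entity type, entity id) is
// unique *within* a YARN application, so the full key includes the app id.
final class TimelineEntityKey {
    final String appId;
    final String entityType;
    final String entityId;

    TimelineEntityKey(String appId, String entityType, String entityId) {
        this.appId = appId;
        this.entityType = entityType;
        this.entityId = entityId;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof TimelineEntityKey)) return false;
        TimelineEntityKey k = (TimelineEntityKey) o;
        return appId.equals(k.appId)
            && entityType.equals(k.entityType)
            && entityId.equals(k.entityId);
    }

    @Override public int hashCode() {
        return Objects.hash(appId, entityType, entityId);
    }
}
```

Two entities with the same type and id but different application ids are distinct under this rule, which is exactly why v.2 callers (tez, etc.) would need to supply the app id that v.1 did not require.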
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497055#comment-14497055 ] Thomas Graves commented on YARN-3434: - I agree with the Both section. I'm not sure I completely follow the Only section. Are you suggesting we change the patch to modify ResourceLimits and pass it down rather than using the LimitsInfo class? If so, that won't work, at least not without adding the shouldContinue flag to it. Unless you mean keep the LimitsInfo class for use locally in assignContainers and then pass ResourceLimits down to assignContainer with the value of amountNeededUnreserve as the limit. That wouldn't really change much except the object we pass down through the functions. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. The user was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497076#comment-14497076 ] Thomas Graves commented on YARN-3434: - So you are saying add amountNeededUnreserve to ResourceLimits and then set the global currentResourceLimits.amountNeededUnreserve inside of canAssignToUser? This is what I was not in favor of above, and there would be no need to pass it down as a parameter. Or were you saying create a ResourceLimits instance, pass it as a parameter to canAssignToUser and canAssignToThisQueue, and modify that instance, which would then be passed down through to assignContainer()? I don't see how else you would set the ResourceLimits. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
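The option being debated in the two comments above can be sketched as follows: extend the ResourceLimits object already passed down the assignContainers call chain with the amount that must be unreserved, instead of carrying a separate LimitsInfo result. All names and the MB-based arithmetic below are illustrative, not the actual patch.

```java
// Sketch of one debated option: ResourceLimits carries the amount that
// must be unreserved, set as a side effect of the canAssignToUser-style
// check, so no separate LimitsInfo object needs to be passed around.
class ResourceLimitsSketch {
    long limitMb;                  // the normal headroom limit
    long amountNeededUnreserveMb;  // extra amount to unreserve to stay under user limit

    ResourceLimitsSketch(long limitMb) {
        this.limitMb = limitMb;
        this.amountNeededUnreserveMb = 0;
    }

    // canAssignToUser-style check: records how much must be unreserved
    // rather than returning a separate result object.
    boolean canAssign(long userUsedMb, long requestMb, long userLimitMb) {
        if (userUsedMb + requestMb <= userLimitMb) {
            return true;
        }
        // Over the user limit: remember how much the caller would need to
        // unreserve for this request to fit.
        this.amountNeededUnreserveMb = userUsedMb + requestMb - userLimitMb;
        return false;
    }
}
```

Whether this is cleaner than LimitsInfo is exactly the open question in the thread; the sketch only shows that the information can ride along on the object already flowing down to assignContainer().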
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497122#comment-14497122 ] Karthik Kambatla commented on YARN-3493: [~jianhe] - YARN-2010 should have fixed this, right? RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Priority: Critical Fix For: 2.7.0 Attachments: yarn-yarn-resourcemanager.log.zip
[jira] [Commented] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497304#comment-14497304 ] Sangjin Lee commented on YARN-3491: --- I have the same question as [~jlowe]. The actual call {code} synchronized (pending) { pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath, resource, request.getContext().getStatCache())), request); } {code} should be completely non-blocking and there is nothing that's expensive about it with the possible exception of the synchronization. Could you describe the root cause of the slowness you're seeing in some more detail? Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). - Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because FSDownload submission to the thread pool at the following code is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. {code} synchronized (pending) { pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath, resource, request.getContext().getStatCache())), request); } {code} Also there are two more benefits with this change: 1. 
The Dispatcher thread won't be blocked by the above FSDownload submission; the Dispatcher thread handles most of the time-critical events at the NodeManager. 2. Synchronization on the HashMap (pending) is no longer needed, because pending will only be accessed in the PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
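The restructuring proposed above can be sketched with a hand-off queue: the dispatcher thread only enqueues the request (cheap and non-blocking), and the PublicLocalizer thread drains the queue and performs the expensive path resolution and FSDownload submission. This is an illustrative sketch under assumed names; the real code deals in LocalizerResourceRequestEvent objects, and a synchronizedList (as the comment suggests) would work equally well as the hand-off structure.

```java
import java.util.concurrent.*;

// Sketch of the proposed split: dispatcher enqueues, PublicLocalizer
// thread drains and submits the (stand-in) download tasks to the pool.
class PublicLocalizerSketch {
    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    private final ExecutorService downloadPool = Executors.newFixedThreadPool(4);

    // Called from the dispatcher thread: must return immediately.
    void addResource(String resourceKey) {
        requests.add(resourceKey);
    }

    // Runs on the PublicLocalizer thread: does the slow work.
    int drainAndSubmit() {
        int submitted = 0;
        String key;
        while ((key = requests.poll()) != null) {
            final String k = key;
            downloadPool.submit(() -> k.length()); // stand-in for FSDownload
            submitted++;
        }
        return submitted;
    }

    void shutdown() { downloadPool.shutdown(); }
}
```

The design point is that `addResource` now does a single queue insertion, so the dispatcher thread's latency no longer depends on how slow getPathForLocalization or the state store happens to be.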
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497315#comment-14497315 ] Vrushali C commented on YARN-3051: -- Hi [~varun_saxena] As per the discussion in the call today, here is the query document about flow (and user and queue) based queries that I had mentioned (put up on jira YARN-3050): https://issues.apache.org/jira/secure/attachment/12695071/Flow%20based%20queries.docx Also, some points that I think may be helpful: - The reader API is not going to be limited to one or two API calls; different queries will need different core read APIs. For instance, flow-based queries may not need the application id or entity id info, but rather would need the flow id. For example: for a given user, return the flows that were run during this time frame. This query requires only the cluster and user info; neither an entity id, an application id, nor a flow name is needed for the reader API to serve it. This query cannot be boiled down to an entity-level query. - So the reader API should allow for entity-level, application-level, flow-level, user-level, queue-level and cluster-level queries. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497317#comment-14497317 ] Jian He commented on YARN-3493: --- cancel the patch, uploading a newer version. RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, yarn-yarn-resourcemanager.log.zip
[jira] [Assigned] (YARN-3494) Expose AM resource limit and user limit in QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-3494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-3494: Assignee: Rohith Expose AM resource limit and user limit in QueueMetrics Key: YARN-3494 URL: https://issues.apache.org/jira/browse/YARN-3494 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Now that the AM resource limit and user limit are shown on the web UI, it would be useful to expose them in QueueMetrics as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
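The ask in YARN-3494 amounts to tracking two scheduler-computed values as gauges so they can be published alongside the existing queue counters. The real QueueMetrics uses Hadoop's metrics2 framework (@Metric-annotated MutableGaugeLong fields); the sketch below uses plain AtomicLongs as stand-ins, and the method names are assumptions, not the eventual patch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the YARN-3494 idea: expose AM resource limit and
// user AM resource limit as mutable gauges updated by the scheduler.
class QueueMetricsSketch {
    private final AtomicLong amResourceLimitMb = new AtomicLong();
    private final AtomicLong userAmResourceLimitMb = new AtomicLong();

    // Scheduler calls these whenever it recomputes the limits.
    void setAMResourceLimitMb(long mb) { amResourceLimitMb.set(mb); }
    void setUserAmResourceLimitMb(long mb) { userAmResourceLimitMb.set(mb); }

    long getAMResourceLimitMb() { return amResourceLimitMb.get(); }
    long getUserAmResourceLimitMb() { return userAmResourceLimitMb.get(); }
}
```

Publishing through metrics (rather than only the web UI) lets external monitoring and JMX consumers track the limits over time.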
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497378#comment-14497378 ] Rohith commented on YARN-3493: -- +1 (non-binding) RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, YARN-3493.2.patch, yarn-yarn-resourcemanager.log.zip
[jira] [Commented] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497411#comment-14497411 ] zhihai xu commented on YARN-3491: - Hi [~sjlee0], that is a good point. I initially thought queue.submit was the bottleneck. queue.submit is only part of the code in PublicLocalizer#addResource; the bottleneck may come from publicRsrc.getPathForLocalization, since a lot of work was added to LocalResourcesTrackerImpl#getPathForLocalization, such as {{stateStore.startResourceLocalization(user, appId, ((LocalResourcePBImpl) lr).getProto(), localPath);}}. I should describe it more clearly. Based on the log, the issue is that PublicLocalizer#addResource is very slow, which blocks the Dispatcher thread. Looking at the following code in PublicLocalizer#addResource, I felt queue.submit might take most of the CPU cycles, but based on [~jlowe]'s and your comments, the slowness may come from other code such as publicRsrc.getPathForLocalization or dirsHandler.getLocalPathForWrite. Either way, moving all of this code in PublicLocalizer#addResource from the Dispatcher thread to the PublicLocalizer thread should be a good optimization. We can use a synchronized list of LocalizerResourceRequestEvent to store all the events for public resource localization, similar to what LocalizerRunner does for private resource localization. I will do some more profiling to see what the bottleneck in PublicLocalizer#addResource is.
{code}
public void addResource(LocalizerResourceRequestEvent request) {
  // TODO handle failures, cancellation, requests by other containers
  LocalizedResource rsrc = request.getResource();
  LocalResourceRequest key = rsrc.getRequest();
  LOG.info("Downloading public rsrc:" + key);
  /*
   * Here multiple containers may request the same resource. So we need
   * to start downloading only when
   * 1) ResourceState == DOWNLOADING
   * 2) We are able to acquire non blocking semaphore lock.
   * If not we will skip this resource as either it is getting downloaded
   * or it FAILED / LOCALIZED.
   */
  if (rsrc.tryAcquire()) {
    if (rsrc.getState() == ResourceState.DOWNLOADING) {
      LocalResource resource = request.getResource().getRequest();
      try {
        Path publicRootPath =
            dirsHandler.getLocalPathForWrite("." + Path.SEPARATOR
                + ContainerLocalizer.FILECACHE,
                ContainerLocalizer.getEstimatedSize(resource), true);
        Path publicDirDestPath =
            publicRsrc.getPathForLocalization(key, publicRootPath);
        if (!publicDirDestPath.getParent().equals(publicRootPath)) {
          DiskChecker.checkDir(new File(publicDirDestPath.toUri().getPath()));
        }
        // In case this is not a newly initialized nm state, ensure
        // initialized local/log dirs similar to LocalizerRunner
        getInitializedLocalDirs();
        getInitializedLogDirs();
        // explicitly synchronize pending here to avoid future task
        // completing and being dequeued before pending updated
        synchronized (pending) {
          pending.put(queue.submit(new FSDownload(lfs, null, conf,
              publicDirDestPath, resource,
              request.getContext().getStatCache())), request);
        }
      } catch (IOException e) {
        rsrc.unlock();
        publicRsrc.handle(new ResourceFailedLocalizationEvent(request
            .getResource().getRequest(), e.getMessage()));
        LOG.error("Local path for public localization is not found. "
            + "May be disks failed.", e);
      } catch (IllegalArgumentException ie) {
        rsrc.unlock();
        publicRsrc.handle(new ResourceFailedLocalizationEvent(request
            .getResource().getRequest(), ie.getMessage()));
        LOG.error("Local path for public localization is not found. "
            + "Incorrect path. " + request.getResource().getRequest()
            .getPath(), ie);
      } catch (RejectedExecutionException re) {
        rsrc.unlock();
        publicRsrc.handle(new ResourceFailedLocalizationEvent(request
            .getResource().getRequest(), re.getMessage()));
        LOG.error("Failed to submit rsrc " + rsrc + " for download. "
            + "Either queue is full or threadpool is shutdown.", re);
      }
    } else {
      rsrc.unlock();
    }
  }
}
{code}
Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). - Key: YARN-3491
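The hand-off proposed in the comment above (the Dispatcher thread only queueing LocalizerResourceRequestEvents, with path resolution and FSDownload submission done in the PublicLocalizer thread) can be sketched roughly as follows. The event and download types here are simplified stand-ins, not the real YARN classes:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PublicLocalizerSketch extends Thread {
    /** Simplified stand-in for a LocalizerResourceRequestEvent. */
    static class RequestEvent {
        final String resource;
        RequestEvent(String resource) { this.resource = resource; }
    }

    // The Dispatcher thread only enqueues; all heavy work happens in run().
    private final Queue<RequestEvent> pendingEvents = new ConcurrentLinkedQueue<>();
    private final ExecutorService downloadPool = Executors.newFixedThreadPool(4);
    private volatile boolean stopped = false;

    /** Called from the Dispatcher thread: cheap and non-blocking. */
    public void addResource(RequestEvent event) {
        pendingEvents.offer(event);
    }

    public int pendingCount() { return pendingEvents.size(); }

    @Override
    public void run() {
        while (!stopped) {
            RequestEvent event = pendingEvents.poll();
            if (event == null) {
                continue; // the real loop would also wait on completed downloads here
            }
            // Path resolution, disk checks and download submission would all
            // run here, off the Dispatcher thread, in this single consumer.
            downloadPool.submit(() -> download(event.resource));
        }
        downloadPool.shutdown();
    }

    private void download(String resource) { /* FSDownload stand-in */ }
}
```

The key property is that addResource becomes a constant-time enqueue, so the Dispatcher thread can never be blocked by disk checks or state-store writes.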
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497413#comment-14497413 ] Rohith commented on YARN-3493: -- The same problem can occur when RM work-preserving restart is enabled, where a running AM updates its ResourceRequest on the RESYNC command from the RM. This causes an InvalidResourceRequestException to be thrown to the AM, which the AM does not expect. RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, YARN-3493.2.patch, yarn-yarn-resourcemanager.log.zip RM fails to come up for the following case: 1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in background and wait for the job to reach running state 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 before the above job completes 4. Restart RM 5.
RM fails to come up with the below error {code:title= RM error for Mem settings changed} - RM app submission failed in validating AM resource request for application application_1429094976272_0008
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
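The failure mode above is easy to reproduce in isolation: the recovered request was valid against the old 4000 MB ceiling but fails the same check against the restored 2048 MB one. A minimal stand-in for the SchedulerUtils memory validation (simplified, not the real YARN code) looks like:

```java
// Simplified stand-in for SchedulerUtils.validateResourceRequest: the same
// check that passed at submission time (maxMemory=4000) throws during
// recovery once maxMemory has been lowered to 2048.
public class ResourceValidationSketch {
    static class InvalidResourceRequestException extends Exception {
        InvalidResourceRequestException(String msg) { super(msg); }
    }

    static void validateMemory(int requestedMemory, int maxMemory)
            throws InvalidResourceRequestException {
        if (requestedMemory < 0 || requestedMemory > maxMemory) {
            throw new InvalidResourceRequestException(
                "Invalid resource request, requested memory < 0, or requested"
                + " memory > max configured, requestedMemory=" + requestedMemory
                + ", maxMemory=" + maxMemory);
        }
    }
}
```

Any fix therefore has to either skip or relax this validation for recovered applications, since rejecting them aborts RM startup entirely.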
[jira] [Updated] (YARN-3495) Confusing log generated by FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3495: --- Attachment: YARN-3495.patch Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3495.patch 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3495) Confusing log generated by FairScheduler
Brahma Reddy Battula created YARN-3495: -- Summary: Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3495) Confusing log generated by FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497516#comment-14497516 ] Brahma Reddy Battula commented on YARN-3495: Attached the patch; kindly review. YARN-3197 fixed the equivalent log in the CapacityScheduler. Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3495.patch 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
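For reference, the confusing line comes from the completed-container handler receiving a null RMContainer. The usual fix (as YARN-3197 did for the CapacityScheduler) is to downgrade the message to a more informative one at debug level; a simplified sketch of that shape, not the actual patch:

```java
// Simplified sketch of a completed-container handler: instead of logging
// "Null container completed..." at INFO for every unknown container, report
// the container id at DEBUG and return early.
public class NullContainerLogSketch {
    static String handleCompletedContainer(Object rmContainer, String containerId) {
        if (rmContainer == null) {
            // Container was already removed, e.g. a duplicate completion event.
            return "DEBUG: Container " + containerId + " completed with event FINISHED, but corresponding RMContainer doesn't exist.";
        }
        return "INFO: Completed container: " + containerId;
    }
}
```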
[jira] [Commented] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497565#comment-14497565 ] zhihai xu commented on YARN-3491: - Hi [~jlowe] and [~sjlee0], I think I know what the bottleneck in PublicLocalizer#addResource is. I checked old NM logs from the 2.3.0 release code; PublicLocalizer#addResource took less than one millisecond in the 2.3.0 release.
{code}
2014-10-21 18:11:10,956 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-602532977/asm.jar, 1413914982330, FILE, null }
2014-10-21 18:11:10,956 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-983952127/start.jar, 1413914978818, FILE, null }
2014-10-21 18:11:10,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-700474448/jsch.jar, 1413914981670, FILE, null }
2014-10-21 18:11:10,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-295789958/kfs.jar, 1413914974035, FILE, null }
2014-10-21 18:11:10,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp1832142372/datasvc-search.jar, 1413914970738, FILE, null }
2014-10-21 18:11:10,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-1244404847/args4j.jar, 1413914982044, FILE, null }
2014-10-21 18:11:10,957 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp729860031/slf4j-log4j12.jar, 1413914980407, FILE, null }
2014-10-21 18:11:10,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-1748521227/jackson-mapper-asl.jar, 1413914983142, FILE, null }
2014-10-21 18:11:10,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-246818030/jasper-compiler.jar, 1413914979243, FILE, null }
2014-10-21 18:11:10,958 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://nameservice1/tmp/temp-1620691366/tmp-1703279108/spiffy.jar, 1413914974080, FILE, null }
{code}
Then I compared the public localization code; the difference is in LocalResourcesTrackerImpl#getPathForLocalization. The following code was added after the 2.3.0 release:
{code}
rPath = new Path(rPath,
    Long.toString(uniqueNumberGenerator.incrementAndGet()));
Path localPath = new Path(rPath, req.getPath().getName());
LocalizedResource rsrc = localrsrc.get(req);
rsrc.setLocalPath(localPath);
LocalResource lr = LocalResource.newInstance(req.getResource(),
    req.getType(), req.getVisibility(), req.getSize(),
    req.getTimestamp());
try {
  stateStore.startResourceLocalization(user, appId,
      ((LocalResourcePBImpl) lr).getProto(), localPath);
} catch (IOException e) {
  LOG.error("Unable to record localization start for " + rsrc, e);
}
{code}
I think stateStore.startResourceLocalization is most likely the bottleneck. startResourceLocalization stores the state in leveldb, and the leveldb operation is time-consuming because it goes through the JNI interface.
{code}
public void startResourceLocalization(String user, ApplicationId appId,
    LocalResourceProto proto, Path localPath) throws IOException {
  String key = getResourceStartedKey(user, appId, localPath.toString());
  try {
    db.put(bytes(key), proto.toByteArray());
  } catch (DBException e) {
    throw new IOException(e);
  }
}
{code}
I think it would be better to do these leveldb operations in a separate thread using an AsyncDispatcher in NMLeveldbStateStoreService. Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). - Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491
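The suggestion to take the leveldb write off the caller's thread could look roughly like this; a single-threaded executor stands in for the AsyncDispatcher, and the in-memory list is a simplified stand-in for the leveldb-backed NMLeveldbStateStoreService:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: queue state-store writes onto a dedicated thread so the caller
// (here, the thread running addResource) never blocks on the leveldb/JNI put.
public class AsyncStateStoreSketch {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    final List<String> persisted = new CopyOnWriteArrayList<>(); // leveldb stand-in

    /** Returns immediately; the slow "db.put" happens on the writer thread. */
    public void startResourceLocalization(String key, byte[] proto) {
        writer.submit(() -> persisted.add(key));
    }

    /** Drain and stop the writer thread, waiting for queued puts to finish. */
    public void close() throws InterruptedException {
        writer.shutdown();
        writer.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

One design caveat with this approach: a write queued asynchronously can be lost if the NM crashes before the writer thread drains it, so recovery code would have to tolerate missing "localization started" records.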
[jira] [Updated] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). Because pending will be only accessed in PublicLocalizer thread. was: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by above FSDownload submission. Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). 
Because pending will be only accessed in PublicLocalizer thread. Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). - Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). Because pending will be only accessed in PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
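Benefit 2 in the description above is the standard thread-confinement argument: once the pending map is touched only by the PublicLocalizer thread, a plain unsynchronized HashMap suffices. A minimal illustration of the pattern (hypothetical types, not the YARN code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Thread confinement: 'pending' is created and mutated only by the single
// worker thread below, so no synchronized block or concurrent map is needed.
public class ConfinementSketch {
    public static int runConfined(String[] requests) throws InterruptedException {
        final int[] distinct = {0};
        ExecutorService worker = Executors.newSingleThreadExecutor();
        worker.submit(() -> {
            Map<String, Boolean> pending = new HashMap<>(); // unsynchronized, confined
            for (String r : requests) {
                pending.put(r, Boolean.TRUE);
            }
            distinct[0] = pending.size();
        });
        worker.shutdown();
        worker.awaitTermination(5, TimeUnit.SECONDS);
        return distinct[0];
    }
}
```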
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497439#comment-14497439 ] Hadoop QA commented on YARN-3493: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725743/YARN-3493.2.patch against trunk revision 1b89a3e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7352//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7352//console This message is automatically generated. 
RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, YARN-3493.2.patch, yarn-yarn-resourcemanager.log.zip RM fails to come up for the following case: 1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in background and wait for the job to reach running state 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 before the above job completes 4. Restart RM 5. RM fails to come up with the below error {code:title= RM error for Mem settings changed} - RM app submission failed in validating AM resource request for application application_1429094976272_0008 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
[jira] [Updated] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3437: -- Attachment: YARN-3437.002.patch Rebased the patch with the latest from the YARN-2928 branch. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497451#comment-14497451 ] Hadoop QA commented on YARN-3463: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725744/YARN-3463.66.patch against trunk revision 1b89a3e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1148 javac compiler warnings (more than the trunk's current 1147 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7353//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7353//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7353//console This message is automatically generated. 
Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Summary: PublicLocalizer#addResource is too slow. (was: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).) PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). Because pending will be only accessed in PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497354#comment-14497354 ] Hadoop QA commented on YARN-3463: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725714/YARN-3463.65.patch against trunk revision 1b89a3e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1149 javac compiler warnings (more than the trunk's current 1147 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7351//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7351//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7351//console This message is automatically generated. 
Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3493: -- Attachment: YARN-3493.2.patch Uploaded a new patch. RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, YARN-3493.2.patch, yarn-yarn-resourcemanager.log.zip RM fails to come up for the following case: 1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in background and wait for the job to reach running state 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 before the above job completes 4. Restart RM 5. RM fails to come up with the below error {code:title= RM error for Mem settings changed} - RM app submission failed in validating AM resource request for application application_1429094976272_0008 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208) 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(579)) - Failed to load/recover state org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) at
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994) at
[jira] [Commented] (YARN-3495) Confusing log generated by FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497571#comment-14497571 ] Hadoop QA commented on YARN-3495: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725774/YARN-3495.patch against trunk revision 1b89a3e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7355//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7355//console This message is automatically generated. Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3495.patch 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3496) Add a configuration to disable/enable storing localization state in NM StateStore
zhihai xu created YARN-3496: --- Summary: Add a configuration to disable/enable storing localization state in NM StateStore Key: YARN-3496 URL: https://issues.apache.org/jira/browse/YARN-3496 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Add a configuration to disable/enable storing localization state in the NM StateStore. Storing localization state in the levelDB may add some overhead, which may affect NM performance. It would be better to have a configuration to disable/enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
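As a sketch of what such a switch might look like in yarn-site.xml (the property name below is hypothetical, invented for illustration; it is not an actual YARN configuration key):

```xml
<!-- Hypothetical property name: illustrative only, not a real YARN key. -->
<property>
  <name>yarn.nodemanager.recovery.localization.enabled</name>
  <value>false</value>
  <description>If false, the NM skips persisting localization state to the
  levelDB state store, trading recovery fidelity for lower per-resource
  write overhead.</description>
</property>
```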
[jira] [Created] (YARN-3494) Expose AM resource limit and user limit in QueueMetrics
Jian He created YARN-3494: - Summary: Expose AM resource limit and user limit in QueueMetrics Key: YARN-3494 URL: https://issues.apache.org/jira/browse/YARN-3494 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Now that we have the AM resource limit and user limit shown on the web UI, it would be useful to expose them in QueueMetrics as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3463: -- Attachment: YARN-3463.66.patch Fix build warnings, the tests all pass on my box. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently, FSDownload submission to the thread pool is done in PublicLocalizer#addResource, which runs in the Dispatcher thread, while completed localization handling is done in PublicLocalizer#run, which runs in the PublicLocalizer thread. Because PublicLocalizer#addResource is time-consuming, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreaded), public resource localization is serialized most of the time. There are also two more benefits to this change: 1. The Dispatcher thread won't be blocked by the FSDownload submission above; the Dispatcher thread handles most of the time-critical events at the NodeManager. 2. No synchronization is needed on the HashMap (pending), because pending will only be accessed in the PublicLocalizer thread. was: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently, FSDownload submission to the thread pool is done in PublicLocalizer#addResource, which runs in the Dispatcher thread, while completed localization handling is done in PublicLocalizer#run, which runs in the PublicLocalizer thread. Because FSDownload submission to the thread pool in the following code is time-consuming, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreaded), public resource localization is serialized most of the time.
{code}
synchronized (pending) {
  pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath,
      resource, request.getContext().getStatCache())), request);
}
{code}
There are also two more benefits to this change: 1. The Dispatcher thread won't be blocked by the FSDownload submission above; the Dispatcher thread handles most of the time-critical events at the NodeManager. 2. No synchronization is needed on the HashMap (pending), because pending will only be accessed in the PublicLocalizer thread. Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). - Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently, FSDownload submission to the thread pool is done in PublicLocalizer#addResource, which runs in the Dispatcher thread, while completed localization handling is done in PublicLocalizer#run, which runs in the PublicLocalizer thread. Because PublicLocalizer#addResource is time-consuming, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreaded), public resource localization is serialized most of the time. There are also two more benefits to this change: 1. The Dispatcher thread won't be blocked by the FSDownload submission above; the Dispatcher thread handles most of the time-critical events at the NodeManager. 2. No synchronization is needed on the HashMap (pending), because pending will only be accessed in the PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
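The single-thread pattern described above can be sketched with plain java.util.concurrent. This is a minimal, self-contained illustration of the idea, assuming a simplified stand-in for FSDownload; it is not the actual PublicLocalizer code. One thread both submits download tasks to the pool and reaps completions, so no other thread blocks on submission and the pending map needs no locking:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;

// Sketch (illustrative names, not the Hadoop implementation): a single
// "localizer" thread submits tasks and handles completions, so the
// event-dispatcher thread never blocks and `pending` is single-threaded.
public class SingleThreadLocalizerSketch {
    // Stand-in for FSDownload: pretend to fetch a resource by name.
    static Callable<String> download(String resource) {
        return () -> "localized:" + resource;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<String> queue = new ExecutorCompletionService<>(pool);
        // Accessed only from this thread, so a plain HashMap suffices.
        Map<Future<String>, String> pending = new HashMap<>();

        // Submission and completion handling happen in the same thread.
        for (String r : new String[] {"jarA", "jarB", "jarC"}) {
            pending.put(queue.submit(download(r)), r);
        }
        while (!pending.isEmpty()) {
            Future<String> done = queue.take();   // blocks until one finishes
            String request = pending.remove(done);
            System.out.println(request + " -> " + done.get());
        }
        pool.shutdown();
    }
}
```

ExecutorCompletionService fits this design because completed futures are handed back in completion order, letting the one localizer thread interleave submissions and completion handling without polling each future individually.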
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497432#comment-14497432 ] Hadoop QA commented on YARN-3437: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725758/YARN-3437.002.patch against trunk revision 1b89a3e. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7354//console This message is automatically generated. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3492) AM fails to come up because RM and NM can't connect to each other
[ https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497500#comment-14497500 ] Brahma Reddy Battula commented on YARN-3492: [~kasha] Thanks for reporting this issue. I took your mapred-site.xml and yarn-site.xml and started the pseudo-distributed cluster. Containers are getting allocated and the NM is able to connect to the RM. *Please correct me if I am wrong.* *{color:blue} Nodemanager Log{color}*
{noformat}
2015-04-16 09:06:54,130 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -1430616116
2015-04-16 09:06:54,132 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id -751280008
2015-04-16 09:06:54,133 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as host132:42289 with total resource of memory:8192, vCores:8
2015-04-16 09:06:54,133 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests
2015-04-16 09:07:57,684 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1429155383347_0001_01 (auth:SIMPLE)
2015-04-16 09:07:57,772 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1429155383347_0001_01_01 by user hdfs
2015-04-16 09:07:57,797 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1429155383347_0001
2015-04-16 09:07:57,803 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs IP=.132 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1429155383347_0001 CONTAINERID=container_1429155383347_0001_01_01
{noformat}
Did you enable any firewall, or did the host get changed? AM fails to come up because RM and NM can't connect to each other - Key: YARN-3492 URL: https://issues.apache.org/jira/browse/YARN-3492 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: pseudo-distributed cluster on a mac Reporter: Karthik Kambatla Priority: Blocker Attachments: mapred-site.xml, yarn-kasha-nodemanager-kasha-mbp.local.log, yarn-kasha-resourcemanager-kasha-mbp.local.log, yarn-site.xml Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The container gets allocated, but doesn't get launched. The NM can't talk to the RM. Logs to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497416#comment-14497416 ] Rohith commented on YARN-3493: -- bq. The same problem would occur
Just to clarify: I am referring to the InvalidResourceRequestException, not the RM start failure. RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, YARN-3493.2.patch, yarn-yarn-resourcemanager.log.zip RM fails to come up for the following case: 1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in the background and wait for the job to reach the running state 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb at 2048 before the above job completes 4. Restart RM 5.
RM fails to come up with the below error: {code:title=RM error for Mem settings changed}
- RM app submission failed in validating AM resource request for application application_1429094976272_0008
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=3072, maxMemory=2048
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at
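The failure above comes down to a bounds check re-run at recovery time: a request that was valid when maximum-allocation-mb was 4000 fails once the limit drops to 2048. A minimal, self-contained sketch of that check (assumed shape inferred from the error message, not the actual SchedulerUtils code):

```java
// Sketch of the validation implied by the stack trace above; the method
// and exception names mirror the log but the code is illustrative.
public class MaxAllocationCheckSketch {
    static class InvalidResourceRequestException extends Exception {
        InvalidResourceRequestException(String msg) { super(msg); }
    }

    static void validate(int requestedMemory, int maxMemory)
            throws InvalidResourceRequestException {
        if (requestedMemory < 0 || requestedMemory > maxMemory) {
            throw new InvalidResourceRequestException(
                "Invalid resource request, requested memory < 0, or requested"
                + " memory > max configured, requestedMemory=" + requestedMemory
                + ", maxMemory=" + maxMemory);
        }
    }

    public static void main(String[] args) {
        try {
            validate(3072, 2048);  // the recovery-time values from the log
        } catch (InvalidResourceRequestException e) {
            System.out.println(e.getMessage());
        }
    }
}
```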
[jira] [Updated] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2696: - Attachment: YARN-2696.3.patch Addressed all comments from [~jianhe] and fixed the test failure in TestFifoScheduler; uploaded ver.3 patch. Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch, YARN-2696.3.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler, the parent queue would choose child queues by their used resource, from smallest to largest. Now that we support node labels in CapacityScheduler, we should also consider the resource used in child queues per node label when allocating resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
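The idea in YARN-2696 can be sketched as a label-aware sort: when choosing among child queues, compare by the resource used under the node label being allocated, not by total used resource. The names below are illustrative, not the actual CapacityScheduler API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hedged sketch of label-aware queue ordering (illustrative types only).
public class LabelAwareQueueSortSketch {
    static class Queue {
        final String name;
        final Map<String, Integer> usedByLabel;  // label -> used memory (MB)
        Queue(String name, Map<String, Integer> usedByLabel) {
            this.name = name;
            this.usedByLabel = usedByLabel;
        }
        int usedFor(String label) { return usedByLabel.getOrDefault(label, 0); }
    }

    // Least-used-first ordering, but measured only under the given label.
    static List<Queue> sortForLabel(List<Queue> queues, String label) {
        List<Queue> sorted = new ArrayList<>(queues);
        sorted.sort(Comparator.comparingInt(q -> q.usedFor(label)));
        return sorted;
    }

    public static void main(String[] args) {
        Queue a = new Queue("a", Map.of("", 4096, "gpu", 0));
        Queue b = new Queue("b", Map.of("", 1024, "gpu", 2048));
        // For unlabeled allocation b sorts first; for "gpu" allocation a does.
        System.out.println(sortForLabel(List.of(a, b), "").get(0).name);    // b
        System.out.println(sortForLabel(List.of(a, b), "gpu").get(0).name); // a
    }
}
```

The example shows why sorting by total used resource is insufficient: queue "a" is heavily used overall but idle under the "gpu" label, so a gpu-labeled allocation should try it first.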
[jira] [Commented] (YARN-3354) Container should contains node-labels asked by original ResourceRequests
[ https://issues.apache.org/jira/browse/YARN-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496981#comment-14496981 ] Wangda Tan commented on YARN-3354: -- Test failure is not related to the patch. Container should contains node-labels asked by original ResourceRequests Key: YARN-3354 URL: https://issues.apache.org/jira/browse/YARN-3354 Project: Hadoop YARN Issue Type: Sub-task Components: api, capacityscheduler, nodemanager, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3354.1.patch, YARN-3354.2.patch We proposed non-exclusive node labels in YARN-3214, which allows non-labeled resource requests to be allocated on labeled nodes that have idle resources. To make preemption work, we need to know an allocated container's original node label: when labeled resource requests come back, we need to kill non-labeled containers running on labeled nodes. This requires adding node-labels to Container; also, the NM needs to store this information and send it back to the RM on RM restart, to recover the original container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)