[jira] [Updated] (YARN-2458) Add file handling features to the Windows Secure Container Executor LRPC service
[ https://issues.apache.org/jira/browse/YARN-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2458: --- Attachment: YARN-2458.2.patch

A complete implementation that delegates critical file handling (mkdirs) to the privileged service.

Add file handling features to the Windows Secure Container Executor LRPC service
Key: YARN-2458 URL: https://issues.apache.org/jira/browse/YARN-2458 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2458.1.patch, YARN-2458.2.patch

In the WSCE design the nodemanager needs to do certain privileged operations, like changing file ownership to arbitrary users or deleting files owned by the task container user after completion of the task. As we want to remove the Administrator privilege requirement from the nodemanager service, we have to move these operations into the privileged LRPC helper service. Extend the RPC interface to contain methods for changing file ownership and manipulating files, add the JNI client side, and implement the server side. This will piggyback on the existing LRPC service, so there is not much infrastructure to add (run as service, RPC init, authentication and authorization are already solved); it just needs to be implemented.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
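As a rough illustration of the kind of file-handling operations being delegated, the extended helper interface could look something like the sketch below. This is only an illustration with hypothetical names; it is not the actual WSCE RPC definition from the patch.

{code}
import java.io.IOException;

// Illustrative only: a Java-side view of the privileged file operations the
// LRPC helper service would expose on behalf of the (now unprivileged) nodemanager.
public interface PrivilegedFileOperations {
  // Create a directory tree for localization on behalf of the nodemanager.
  void mkdirs(String path) throws IOException;

  // Change ownership of a file or directory to an arbitrary user/group.
  void chown(String path, String user, String group) throws IOException;

  // Delete a path owned by the task container user after the task completes.
  void delete(String path, boolean recursive) throws IOException;
}
{code}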
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127018#comment-14127018 ] Eric Payne commented on YARN-2056: -- [~leftnoteasy], thank you very much for your review comments. I appreciate it.

Regarding the test for appA running on queueB: the following assertion tests that appA is preempted, but the preemption calculations are done per queue.
{code}
+verify(mDisp, times(10)).handle(argThat(new IsPreemptionRequestFor(appA)));
{code}
So, I added the following assertion to make the visual connection between appA and queueB:
{code}
+assertTrue("appA should be running on queueB",
+    mCS.getAppsInQueue(queueB).contains(expectedAttemptOnQueueB));
{code}
So, the purpose is not really to test that the mockQueue/mockApp worked correctly, but to make it obvious that the preemption policy for queueB is being exercised. If you think it's not necessary, I will remove it, but I do like the link.

Disable preemption at Queue level
Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt

We need to be able to disable preemption at individual queue level.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127034#comment-14127034 ] Anubhav Dhoot commented on YARN-2456: - Can we sort the ApplicationStates based on ApplicationState's submitTime or startTime fields when we recover?

Possible deadlock in CapacityScheduler when RM is recovering apps
Key: YARN-2456 URL: https://issues.apache.org/jira/browse/YARN-2456 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2456.1.patch

Consider this scenario:
1. RM is configured with a single queue and only one application can be active at a time.
2. Submit App1, which uses up the queue's whole capacity.
3. Submit App2, which remains pending.
4. Restart RM.
5. App2 is recovered before App1, so App2 is added to the activeApplications list. Now App1 remains pending (because of the max-active-app limit).
6. All containers of App1 are now recovered when the NM registers, and use up the whole queue capacity again.
7. Since the queue is full, App2 cannot proceed to allocate its AM container.
8. In the meanwhile, App1 cannot proceed to become active because of the max-active-app limit.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
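If the ordering idea suggested above were adopted, the change would be roughly the sketch below: sort the recovered ApplicationState entries by submission time before replaying them into the scheduler, so an earlier-submitted app such as App1 is re-added first. The collection name and the per-app recovery call are placeholders, and this assumes ApplicationState exposes a getSubmitTime() accessor.

{code}
// Illustrative only: recover applications in submission order.
List<ApplicationState> appStates =
    new ArrayList<ApplicationState>(recoveredApps.values());   // placeholder map of recovered state
Collections.sort(appStates, new Comparator<ApplicationState>() {
  @Override
  public int compare(ApplicationState a, ApplicationState b) {
    return Long.compare(a.getSubmitTime(), b.getSubmitTime());
  }
});
for (ApplicationState appState : appStates) {
  recoverApplication(appState);  // placeholder for the existing per-app recovery path
}
{code}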
[jira] [Resolved] (YARN-2451) Delete .orig files
[ https://issues.apache.org/jira/browse/YARN-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2451. Resolution: Invalid

This isn't the case anymore - it was either only on my local machine or got fixed in the move to git. HADOOP-10609 added .orig and .rej files to .gitignore.

Delete .orig files
Key: YARN-2451 URL: https://issues.apache.org/jira/browse/YARN-2451 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Karthik Kambatla Assignee: Karthik Kambatla

Looks like we checked in a few .orig files. We should delete them.
{noformat}
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java.orig
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java.orig
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java.orig
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java.orig
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127086#comment-14127086 ] Jason Lowe commented on YARN-2440: --

bq. Specifically in the context of heterogeneous clusters where uniform % configurations can go really bad where the only resort will then be to do per-node configuration - not ideal.

Yes, I could see the heterogeneous cluster being a case where specifying absolute instead of relative may be desirable. My biggest concern is that it's confusing when trying to combine the absolute and relative concepts -- it's not obvious if one overrides the other or if one is relative to the other. Part of my concern in keeping this as simple as possible and the configuration burden to an absolute minimum is that I'm missing the real-world use case. As I mentioned before, I think most users would rather not use the functionality proposed by this JIRA but instead set up peer cgroups for other systems and set their relative cgroup shares appropriately. With this JIRA the CPUs could sit idle despite demand from YARN containers, while a peer cgroup setup allows CPU guarantees without idle CPUs if the demand is there.

Cgroups should allow YARN containers to be limited to allocated cores
Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, screenshot-current-implementation.jpg

The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1477) Improve AM web UI to avoid confusion about AM restart
[ https://issues.apache.org/jira/browse/YARN-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1477: -- Description: Improve AM web UI, Add submitTime field to the AM's web services REST API, improve Elapsed: row time computation, etc. (was: Similar to MAPREDUCE-5052, This is a fix on AM side. Add submitTime field to the AM's web services REST API) Improve AM web UI to avoid confusion about AM restart - Key: YARN-1477 URL: https://issues.apache.org/jira/browse/YARN-1477 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Chen He Assignee: Chen He Labels: features Fix For: 2.6.0 Improve AM web UI, Add submitTime field to the AM's web services REST API, improve Elapsed: row time computation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2496: - Attachment: YARN-2496.patch

Attached patch. This patch is based on trunk, but it cannot be compiled by itself: it would be hard to separate YARN-2500 from YARN-2496 and make each of them compile on its own, so I split them just for easier reviewing. This patch is based on YARN-2493, YARN-2494 and YARN-2500; you need to apply those patches first.

[YARN-796] Changes for capacity scheduler to support allocate resource respect labels
Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch

This JIRA includes:
- Add/parse labels option to {{capacity-scheduler.xml}} similar to other queue options like capacity/maximum-capacity, etc.
- Include a default-label-expression option in queue config; if an app doesn't specify a label-expression, the queue's default-label-expression will be used.
- Check if labels can be accessed by the queue when submitting an app with a label-expression to the queue or updating a ResourceRequest with a label-expression.
- Check labels on the NM when trying to allocate a ResourceRequest on the NM with a label-expression.
- Respect labels when calculating headroom/user-limit.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
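To make the first two bullets of the JIRA description concrete, the per-queue configuration would look roughly like the sketch below. The property keys are hypothetical placeholders meant to show the shape of the change; the actual key names are defined by the patch and may differ.

{code}
// Illustrative only: hypothetical capacity-scheduler keys for queue labels and a
// per-queue default label expression, set alongside the existing capacity options.
// (Configuration is org.apache.hadoop.conf.Configuration.)
Configuration csConf = new Configuration();
csConf.set("yarn.scheduler.capacity.root.queues", "a,b");
csConf.set("yarn.scheduler.capacity.root.a.capacity", "50");
csConf.set("yarn.scheduler.capacity.root.a.labels", "GPU,LARGE_MEM");
csConf.set("yarn.scheduler.capacity.root.a.default-label-expression", "GPU");
{code}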
[jira] [Updated] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2500: - Attachment: YARN-2500.patch

Attached patch. This patch is based on trunk, but it cannot be compiled by itself: it would be hard to separate YARN-2500 from YARN-2496 and make each of them compile on its own, so I split them just for easier reviewing. This patch is based on YARN-2493, YARN-2494 and YARN-2496; you need to apply those patches first.

[YARN-796] Miscellaneous changes in ResourceManager to support labels
Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2492) (Clone of YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127141#comment-14127141 ] Wangda Tan commented on YARN-2492: -- Uploaded patches for YARN-2496 (changes to support node label in CapacityScheduler) and YARN-2500 (misc changes to make RM support labels) (Clone of YARN-796) Allow for (admin) labels on nodes and resource-requests Key: YARN-2492 URL: https://issues.apache.org/jira/browse/YARN-2492 Project: Hadoop YARN Issue Type: Task Components: api, client, resourcemanager Reporter: Wangda Tan Since YARN-796 is a sub JIRA of YARN-397, this JIRA is used to create and track sub tasks and attach split patches for YARN-796. *Let's still keep over-all discussions on YARN-796.* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127167#comment-14127167 ] Hadoop QA commented on YARN-2500: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667432/YARN-2500.patch against trunk revision 90c8ece. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4856//console This message is automatically generated. [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127168#comment-14127168 ] Hadoop QA commented on YARN-2496: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667431/YARN-2496.patch against trunk revision 90c8ece. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4855//console This message is automatically generated. [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2526) Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2526: -- Priority: Minor (was: Major)

Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2526-1.patch

The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator.
{code}
// waiting until the AM container is allocated
while (true) {
  if (response != null && !response.getAllocatedContainers().isEmpty()) {
    // get AM container
    ...
    break;
  }
  // this sleep time is different from HeartBeat
  Thread.sleep(1000);
  // send out empty request
  sendContainerRequest();
  response = responseQueue.take();
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
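One way to get rid of the blocking loop, sketched below, is to let the AM simulator check for its container on each scheduled invocation and simply return when it has not been allocated yet, so the thread goes back to the pool instead of sleeping inside it. This is only an illustration of the idea, not the actual YARN-2526 patch; the helper returning an AllocateResponse and the fields shown are placeholders.

{code}
// Illustrative only: re-check the allocation on every scheduled call and return
// early, instead of looping and sleeping while holding a thread-pool worker.
private boolean amContainerAllocated = false;

protected void requestAMContainer() throws Exception {
  if (amContainerAllocated) {
    return;
  }
  AllocateResponse response = sendContainerRequest();  // placeholder helper
  if (response != null && !response.getAllocatedContainers().isEmpty()) {
    amContainer = response.getAllocatedContainers().get(0);
    amContainerAllocated = true;
  }
  // otherwise just return; the next scheduled heartbeat will try again
}
{code}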
[jira] [Updated] (YARN-2526) Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2526: -- Attachment: YARN-2526-1.patch

Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2526-1.patch

The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator.
{code}
// waiting until the AM container is allocated
while (true) {
  if (response != null && !response.getAllocatedContainers().isEmpty()) {
    // get AM container
    ...
    break;
  }
  // this sleep time is different from HeartBeat
  Thread.sleep(1000);
  // send out empty request
  sendContainerRequest();
  response = responseQueue.take();
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127201#comment-14127201 ] Xuan Gong commented on YARN-2459: - +1 LGTM

RM crashes if App gets rejected for any reason and HA is enabled
Key: YARN-2459 URL: https://issues.apache.org/jira/browse/YARN-2459 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch

If RM HA is enabled and the ZooKeeper store is used for the RM state store: if for any reason an app gets rejected and goes directly from NEW to FAILED, the final transition adds it to the RMApps and completed-apps in-memory structures, but it never makes it to the state store. When the RMApps default limit is reached, the RM starts deleting apps from memory and from the store. At that point it tries to delete this app from the store and fails, which causes the RM to crash.

Thanks,
Mayank

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
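One defensive shape such a fix could take is sketched below: treat a missing node as a no-op when removing an application from the ZooKeeper state store. This is only an illustration; the actual YARN-2459 patch may solve the problem differently (for example, by also recording rejected applications in the store), and the method and field names here should be read as placeholders.

{code}
// Illustrative only: ignore "node does not exist" when cleaning up an app that
// was never written to the store (e.g. rejected straight from NEW to FAILED).
try {
  removeApplicationStateInternal(appState);
} catch (KeeperException.NoNodeException e) {
  LOG.info("Application " + appState.getAppId()
      + " was not found in the state store, skipping removal", e);
}
{code}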
[jira] [Updated] (YARN-2526) Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2526: -- Attachment: YARN-2526-1.patch

Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2526-1.patch, YARN-2526-1.patch

The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator.
{code}
// waiting until the AM container is allocated
while (true) {
  if (response != null && !response.getAllocatedContainers().isEmpty()) {
    // get AM container
    ...
    break;
  }
  // this sleep time is different from HeartBeat
  Thread.sleep(1000);
  // send out empty request
  sendContainerRequest();
  response = responseQueue.take();
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2526) Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2526: -- Attachment: (was: YARN-2526-1.patch)

Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2526-1.patch

The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator.
{code}
// waiting until the AM container is allocated
while (true) {
  if (response != null && !response.getAllocatedContainers().isEmpty()) {
    // get AM container
    ...
    break;
  }
  // this sleep time is different from HeartBeat
  Thread.sleep(1000);
  // send out empty request
  sendContainerRequest();
  response = responseQueue.take();
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.10.patch

Created a new patch, which improves the check of backward compatibility and fixes a missing break in the switch block.

Investigate merging generic-history into the Timeline Store
Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch

Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2526) Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127280#comment-14127280 ] Hadoop QA commented on YARN-2526: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667440/YARN-2526-1.patch against trunk revision 90c8ece. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4857//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4857//console This message is automatically generated.

Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2526-1.patch

The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator.
{code}
// waiting until the AM container is allocated
while (true) {
  if (response != null && !response.getAllocatedContainers().isEmpty()) {
    // get AM container
    ...
    break;
  }
  // this sleep time is different from HeartBeat
  Thread.sleep(1000);
  // send out empty request
  sendContainerRequest();
  response = responseQueue.take();
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2525) yarn logs command gives error on trunk
[ https://issues.apache.org/jira/browse/YARN-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2525: --- Component/s: scripts

yarn logs command gives error on trunk
Key: YARN-2525 URL: https://issues.apache.org/jira/browse/YARN-2525 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Prakash Ramachandran Priority: Minor Labels: newbie

The yarn logs command (trunk branch) gives an error: "Error: Could not find or load main class org.apache.hadoop.yarn.logaggregation.LogDumper". Instead, the class should be org.apache.hadoop.yarn.client.cli.LogsCLI.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2525) yarn logs command gives error on trunk
[ https://issues.apache.org/jira/browse/YARN-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2525: --- Labels: newbie (was: )

yarn logs command gives error on trunk
Key: YARN-2525 URL: https://issues.apache.org/jira/browse/YARN-2525 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Prakash Ramachandran Priority: Minor Labels: newbie

The yarn logs command (trunk branch) gives an error: "Error: Could not find or load main class org.apache.hadoop.yarn.logaggregation.LogDumper". Instead, the class should be org.apache.hadoop.yarn.client.cli.LogsCLI.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.006.patch

In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch Original Estimate: 408h Remaining Estimate: 408h

The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
        - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
        at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
        - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
        - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
        at java.lang.Thread.run(Thread.java:744)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127329#comment-14127329 ] Craig Welch commented on YARN-796: -- This is a bit of a detail, but the current version of the code lowercases the node labels rather than respecting the given name. I don't believe this is what we want. The requirements do request case-insensitive comparison, but that is not the same as changing the case. There are a few options which come to mind:
1. Switch to case-insensitive Sets and Maps for managing the labels - TreeSet and TreeMap can be configured to operate in a case-insensitive fashion; I expect they would be OK to use for node labels.
2. Gate label names on the way in to force consistent case while maintaining case - a Map with a lowercase key and an original-case value could be used to keep all labels for a given set of letters in a consistent case (the original).
3. Drop the requirement for case insensitivity - I'm not sure of the reasoning; I assume it is to prevent mis-types, but I'm not sure it's really so important, and there are still many opportunities for mistyping labels. I'm not sure if protecting against this one case is worth the implementation cost/complexity or the loss of the original case as specified by the user.
I suggest 3, FWIW.

Allow for (admin) labels on nodes and resource-requests
Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4

It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
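For option 1 above, the kind of structure being suggested is roughly the following sketch: a case-insensitive set that still preserves the casing the admin originally supplied.

{code}
// Illustrative only: case-insensitive label lookup that keeps the original casing.
Set<String> nodeLabels = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
nodeLabels.add("GPU");
boolean found = nodeLabels.contains("gpu");     // true: comparison ignores case
String stored = nodeLabels.iterator().next();   // "GPU": the given name is preserved
{code}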
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127332#comment-14127332 ] zhihai xu commented on YARN-1458: - I uploaded a patch, YARN-1458.006.patch, for the first approach: this patch compares with the previous result in the loop, to fix the zero-weight-with-non-zero-minShare issue, and calculates the starting point for rMax using the minimum ratio of minShare/weight, to fix the issue where all queues have a non-zero minShare. Either approach is OK for me, but the second approach is a little simpler and faster than the first approach.

In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch Original Estimate: 408h Remaining Estimate: 408h

The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
        - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
        at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
        - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
        - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
        at java.lang.Thread.run(Thread.java:744)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
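For context, the core of the idea described in the comment is to make the ratio search in ComputeFairShares terminate even when some schedulables have a zero weight but a non-zero minShare. A minimal sketch of that idea is below; it uses the method names visible in the stack trace, but the variable names and the exact placement are illustrative rather than the code in YARN-1458.006.patch.

{code}
// Illustrative only: grow rMax until the consumed resource stops increasing, so a
// zero weight combined with a non-zero minShare cannot make this loop spin forever
// while the scheduler lock is held.
double rMax = 1.0;
int previousUsed = -1;
while (true) {
  // assumed to return the total resource consumed at this weight-to-resource ratio
  int used = resourceUsedWithWeightToResourceRatio(rMax, schedulables, type);
  if (used >= totalResource || used == previousUsed) {
    break;  // either an upper bound was found or no further progress is possible
  }
  previousUsed = used;
  rMax *= 2.0;
}
{code}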
[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127353#comment-14127353 ] Xuan Gong commented on YARN-2456: - I think both ways (sorting the ApplicationStates based on ApplicationId or on ApplicationState's submitTime) are fine. Since all processes are asynchronous, the corner case still exists. [~jianhe] What do you think?

Possible deadlock in CapacityScheduler when RM is recovering apps
Key: YARN-2456 URL: https://issues.apache.org/jira/browse/YARN-2456 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2456.1.patch

Consider this scenario:
1. RM is configured with a single queue and only one application can be active at a time.
2. Submit App1, which uses up the queue's whole capacity.
3. Submit App2, which remains pending.
4. Restart RM.
5. App2 is recovered before App1, so App2 is added to the activeApplications list. Now App1 remains pending (because of the max-active-app limit).
6. All containers of App1 are now recovered when the NM registers, and use up the whole queue capacity again.
7. Since the queue is full, App2 cannot proceed to allocate its AM container.
8. In the meanwhile, App1 cannot proceed to become active because of the max-active-app limit.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127374#comment-14127374 ] Hadoop QA commented on YARN-1458: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667451/YARN-1458.006.patch against trunk revision 90c8ece. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4859//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4859//console This message is automatically generated.

In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch Original Estimate: 408h Remaining Estimate: 408h

The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
        - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
        at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
        - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
        - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
        at java.lang.Thread.run(Thread.java:744)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127384#comment-14127384 ] Hadoop QA commented on YARN-2033: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667445/YARN-2033.10.patch against trunk revision 90c8ece. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4858//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4858//console This message is automatically generated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127403#comment-14127403 ] bc Wong commented on YARN-1530: ---

bq. The current writing channel allows the data to be available on the timeline server immediately

Let's have reliability before speed. I think one of the requirements of ATS is: *the channel for writing events should be reliable.* I'm using *reliable* here in a strong sense, not the TCP-best-effort style of reliability. HDFS is reliable. Kafka is reliable. (They are also scalable and robust.) A normal RPC connection is not. I don't want the ATS to be able to slow down my writes, and therefore my applications, at all. For example, an ATS failover shouldn't pause all my applications for N seconds.

A direct RPC to the ATS for writing seems a poor choice in general. Yes, you could make a distributed, reliable, scalable ATS service to accept write events. But that seems like a lot of work, while we can leverage existing technologies. If the channel itself is pluggable, then we have lots of options. Kafka is a very good choice for sites that already deploy Kafka and know how to operate it. Using HDFS as a channel is also a good default implementation, for people who already know how to scale and manage HDFS. Embedding a Kafka broker with each ATS daemon is also an option, if we're OK with that dependency.

[Umbrella] Store, manage and serve per-framework application-timeline data
Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf

This is a sibling JIRA for YARN-321. Today, each application/framework has to store and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
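To make the "pluggable channel" argument above concrete, the shape being asked for is roughly an interface like the sketch below, with HDFS- or Kafka-backed implementations selected by configuration. The interface name and methods are hypothetical, not an existing YARN API; only TimelineEntities is an existing record class.

{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntities;

// Illustrative only: a pluggable sink for timeline events, so the write path
// does not have to be a direct RPC to the timeline server.
public interface TimelineEventSink {
  // Append a batch of entities; implementations decide durability semantics
  // (e.g. flush to an HDFS file, publish to a Kafka topic).
  void put(TimelineEntities entities) throws IOException;

  // Flush and release any underlying resources (HDFS stream, Kafka producer, ...).
  void close() throws IOException;
}
{code}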
[jira] [Updated] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2440: Attachment: apache-yarn-2440.5.patch

Uploaded new patch to address Vinod's concerns.

bq. containers-limit-cpu-percentage - yarn.nodemanager.resource.percentage-cpu-limit to be consistent? Similarly NM_CONTAINERS_CPU_PERC? I don't like the tag 'resource', it should have been 'resources' but it is what it is.
I'm worried that calling it that will lead users to think it's a percentage of the vcores that they specify. In the patch I've changed it to yarn.nodemanager.resource.percentage-physical-cpu-limit but if you or Jason feel strongly about it, I can change it to yarn.nodemanager.resource.percentage-cpu-limit.

bq. You still have refs to YarnConfiguration.NM_CONTAINERS_CPU_ABSOLUTE in the patch. Similarly the javadoc in NodeManagerHardwareUtils needs to be updated if we are not adding the absolute cpu config. It should no longer refer to number of cores that should be used for YARN containers
Fixed.

bq. TestCgroupsLCEResourcesHandler: You can use mockito if you only want to override num-processors in TestResourceCalculatorPlugin. Similarly in TestNodeManagerHardwareUtils.
Switched to mockito.

bq. The tests may fail on a machine with 4 cores?
Don't think so. The tests mock the getNumProcessors() function so we should be fine.

Cgroups should allow YARN containers to be limited to allocated cores
Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, apache-yarn-2440.5.patch, screenshot-current-implementation.jpg

The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
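As a rough illustration of what a node-level percentage limit means for container CPU, the sketch below assumes the limit is applied against the number of physical cores reported by the resource calculator (for example, 8 cores at a 75% limit leaves 6 cores for containers). The formula and the way the value is read are assumptions for illustration, not the patch's exact code; the property name is the one quoted in the comment above.

{code}
// Illustrative only: derive the CPU available to YARN containers from a
// configured percentage of the node's physical cores.
int numProcessors = plugin.getNumProcessors();   // org.apache.hadoop.yarn.util.ResourceCalculatorPlugin
int percentage = conf.getInt(
    "yarn.nodemanager.resource.percentage-physical-cpu-limit", 100);
float coresForContainers = (numProcessors * percentage) / 100.0f;
{code}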
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127467#comment-14127467 ] Hadoop QA commented on YARN-2440: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667461/apache-yarn-2440.5.patch against trunk revision 2749fc6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4860//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4860//console This message is automatically generated. Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, apache-yarn-2440.5.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2527) NPE in ApplicationACLsManager
Benoy Antony created YARN-2527: -- Summary: NPE in ApplicationACLsManager Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony

NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below:
{code}
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
        at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
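One plausible shape for the fix, sketched below, is to treat an application with no registered ACL entry as "owner/admin access only" instead of dereferencing a null map entry. This is only a guess at the cause of the NPE at ApplicationACLsManager.java:104; the field and variable names here are placeholders, not the real ApplicationACLsManager internals.

{code}
// Illustrative only: guard against an application whose ACLs were never
// registered (or were already removed) rather than throwing an NPE.
Map<ApplicationAccessType, AccessControlList> appAcls = aclsMap.get(applicationId);
if (appAcls == null || appAcls.get(applicationAccessType) == null) {
  // No ACL recorded for this application/access type: only the owner is allowed.
  return callerUGI.getShortUserName().equals(applicationOwner);
}
return appAcls.get(applicationAccessType).isUserAllowed(callerUGI);
{code}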
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127470#comment-14127470 ] Benoy Antony commented on YARN-2527: working on a patch for this issue.

NPE in ApplicationACLsManager
Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony

NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below:
{code}
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
        at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-154) Create Yarn trunk and commit jobs
[ https://issues.apache.org/jira/browse/YARN-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-154: -- Fix Version/s: (was: 3.0.0) 2.0.2-alpha 0.23.5 Create Yarn trunk and commit jobs - Key: YARN-154 URL: https://issues.apache.org/jira/browse/YARN-154 Project: Hadoop YARN Issue Type: Task Reporter: Eli Collins Assignee: Robert Joseph Evans Fix For: 2.0.2-alpha, 0.23.5 Yarn should have Hadoop-Yarn-trunk and Hadoop-Yarn-trunk-Commit jenkins jobs that correspond to the Common, HDFS, and MR ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-2080: --- Attachment: YARN-2080.patch

Thanks [~vinodkv] for reviewing the patch. I am uploading a new patch that includes your feedback:
* Renamed all Yarn config variables as you suggested. I prefer using the standalone configs as it gives us more flexibility.
* Removed duplicate logging in _ClientRMService_ and _ReservationInputValidator_. Consistently uses RMAuditLogger throughout.
* Fixes in AbstractReservationSystem as you suggested.
* Updated stale references to queues in Javadocs of _YarnClient.submitReservation()_.
* _TestYarnClient_ and _TestClientRMService_ use newInstance instead of PBImpls.
* _ReservationRequest.setLeaseDuration()_ was renamed to be simply _setDuration()_.
* Moved _CapacitySchedulerConfiguration_ to YARN-1711.

bq. ReservationInputValidator: Deleting a request shouldn't need validateReservationUpdateRequest-validateReservationDefinition. We only need the ID validation
That's exactly what's being done. ReservationDefinitions are validated only for submission/update.

bq. checkReservationACLs: Today anyone who can submit applications can also submit reservations. We may want to separate them, if you agree, I'll file a ticket for future separation of these ACLs.
I agree. I have a set of follow-up enhancement JIRAs to YARN-1051 in mind, one of which was exactly to consider separation of ACLs as you pointed out.

Admission Control: Integrate Reservation subsystem with ResourceManager
Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch

This JIRA tracks the integration of the Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1215) Yarn URL should include userinfo
[ https://issues.apache.org/jira/browse/YARN-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1215: --- Fix Version/s: (was: 3.0.0) Yarn URL should include userinfo Key: YARN-1215 URL: https://issues.apache.org/jira/browse/YARN-1215 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 3.0.0 Reporter: Chuan Liu Assignee: Chuan Liu Fix For: 2.2.0 Attachments: YARN-1215-trunk.2.patch, YARN-1215-trunk.patch In the {{org.apache.hadoop.yarn.api.records.URL}} class, we don't have an userinfo as part of the URL. When converting a {{java.net.URI}} object into the YARN URL object in {{ConverterUtils.getYarnUrlFromURI()}} method, we will set uri host as the url host. If the uri has a userinfo part, the userinfo is discarded. This will lead to information loss if the original uri has the userinfo, e.g. foo://username:passw...@example.com will be converted to foo://example.com and username/password information is lost during the conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
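To make the information loss above concrete, here is a small self-contained demonstration: rebuilding a URI from only scheme/host/port/path (as a conversion that ignores userinfo effectively does) drops the credentials, while copying userinfo explicitly preserves them. This is a plain java.net.URI illustration, not the ConverterUtils or YARN URL code; the username:password value is a placeholder.
{code}
import java.net.URI;
import java.net.URISyntaxException;

public class UserInfoLossDemo {
  public static void main(String[] args) throws URISyntaxException {
    URI original = new URI("foo://username:password@example.com:8020/data");

    // Lossy rebuild: the userinfo component is not carried over.
    URI lossy = new URI(original.getScheme(), null /* userinfo dropped */,
        original.getHost(), original.getPort(), original.getPath(), null, null);

    // Preserving rebuild: the userinfo component is copied explicitly.
    URI preserved = new URI(original.getScheme(), original.getUserInfo(),
        original.getHost(), original.getPort(), original.getPath(), null, null);

    System.out.println(lossy);      // foo://example.com:8020/data
    System.out.println(preserved);  // foo://username:password@example.com:8020/data
  }
}
{code}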
[jira] [Updated] (YARN-794) YarnClientImpl.submitApplication() to add a timeout
[ https://issues.apache.org/jira/browse/YARN-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-794: -- Fix Version/s: (was: 2.1.0-beta) (was: 3.0.0) YarnClientImpl.submitApplication() to add a timeout --- Key: YARN-794 URL: https://issues.apache.org/jira/browse/YARN-794 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 3.0.0, 2.1.0-beta Reporter: Steve Loughran Priority: Minor {{YarnClientImpl.submitApplication()}} can spin forever waiting for the RM to accept the submission, ignoring interrupts on the sleep. # A timeout allows client applications to recognise and react to a failure of the RM to accept work in a timely manner. # The interrupt exception could be converted to an {{InterruptedIOException}} and raised within the current method signature -- This message was sent by Atlassian JIRA (v6.3.4#6332)
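A minimal sketch of the behaviour the issue asks for: a bounded polling loop that gives up after a configurable timeout and converts interrupts into an InterruptedIOException. The AcceptedCheck callback stands in for the real "has the RM accepted the application yet?" probe; none of these names are YarnClient API.
{code}
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.TimeUnit;

public final class BoundedSubmitPoll {

  /** Hypothetical stand-in for the RM acceptance check; not a real YARN API. */
  interface AcceptedCheck {
    boolean isAccepted() throws IOException;
  }

  static void awaitAccepted(AcceptedCheck check, long timeoutMs, long pollMs)
      throws IOException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!check.isAccepted()) {
      if (System.nanoTime() >= deadline) {
        // Timeout lets the client recognise and react to an unresponsive RM.
        throw new IOException("RM did not accept the submission within " + timeoutMs + " ms");
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException ie) {
        // Surface the interrupt instead of swallowing it inside the sleep.
        Thread.currentThread().interrupt();
        InterruptedIOException iioe =
            new InterruptedIOException("Interrupted while waiting for the RM to accept the submission");
        iioe.initCause(ie);
        throw iioe;
      }
    }
  }
}
{code}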
[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127493#comment-14127493 ] Hadoop QA commented on YARN-2080: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667468/YARN-2080.patch against trunk revision 2749fc6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4861//console This message is automatically generated. Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2528) Cross Origin Filter Http response split vulnerability protection rejects valid origins
Jonathan Eagles created YARN-2528: - Summary: Cross Origin Filter Http response split vulnerability protection rejects valid origins Key: YARN-2528 URL: https://issues.apache.org/jira/browse/YARN-2528 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles URL encoding is too strong a protection against the HTTP response splitting vulnerability, and major browsers reject the encoded Origin. An adequate protection is simply to remove all CRs and LFs, as PHP's header function does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
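A minimal sketch of the protection described above: instead of URL-encoding the Origin value (which browsers then fail to match), strip CR and LF so a header value cannot be used to split the HTTP response. This mirrors the idea only; it is not the actual CrossOriginFilter patch.
{code}
public final class OriginSanitizer {

  /** Remove CR/LF so the value cannot inject additional response headers. */
  static String stripResponseSplitters(String headerValue) {
    if (headerValue == null) {
      return null;
    }
    return headerValue.replace("\r", "").replace("\n", "");
  }

  public static void main(String[] args) {
    String malicious = "http://example.com\r\nSet-Cookie: injected=1";
    // Prints "http://example.comSet-Cookie: injected=1" - no longer splits the response,
    // and a legitimate origin without CR/LF passes through unchanged.
    System.out.println(stripResponseSplitters(malicious));
  }
}
{code}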
[jira] [Updated] (YARN-2528) Cross Origin Filter Http response split vulnerability protection rejects valid origins
[ https://issues.apache.org/jira/browse/YARN-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2528: -- Attachment: YARN-2528-v1.patch Cross Origin Filter Http response split vulnerability protection rejects valid origins -- Key: YARN-2528 URL: https://issues.apache.org/jira/browse/YARN-2528 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2528-v1.patch URL encoding is too strong a protection against the HTTP response splitting vulnerability, and major browsers reject the encoded Origin. An adequate protection is simply to remove all CRs and LFs, as PHP's header function does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated YARN-2527: --- Description: NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. was: NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2528) Cross Origin Filter Http response split vulnerability protection rejects valid origins
[ https://issues.apache.org/jira/browse/YARN-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127575#comment-14127575 ] Hadoop QA commented on YARN-2528: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667480/YARN-2528-v1.patch against trunk revision 2749fc6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4862//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4862//console This message is automatically generated. Cross Origin Filter Http response split vulnerability protection rejects valid origins -- Key: YARN-2528 URL: https://issues.apache.org/jira/browse/YARN-2528 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2528-v1.patch URLEncoding is too strong of a protection for HTTP Response Split Vulnerability protection and major browser reject the encoded Origin. An adequate protection is simply to remove all CRs LFs as in the case of PHP's header function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2528) Cross Origin Filter Http response split vulnerability protection rejects valid origins
[ https://issues.apache.org/jira/browse/YARN-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127592#comment-14127592 ] Jonathan Eagles commented on YARN-2528: --- [~zjshen], sorry to bother you again. Found another bug while working on getting the Tez UI running in a hosted environment. Can you give a review? Cross Origin Filter Http response split vulnerability protection rejects valid origins -- Key: YARN-2528 URL: https://issues.apache.org/jira/browse/YARN-2528 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2528-v1.patch URLEncoding is too strong of a protection for HTTP Response Split Vulnerability protection and major browser reject the encoded Origin. An adequate protection is simply to remove all CRs LFs as in the case of PHP's header function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2526) Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127647#comment-14127647 ] Karthik Kambatla commented on YARN-2526: Thanks for reporting and fixing this, Wei. +1. Committing this. Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2526-1.patch The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator. {code} // waiting until the AM container is allocated while (true) { if (response != null && !response.getAllocatedContainers().isEmpty()) { // get AM container . break; } // this sleep time is different from HeartBeat Thread.sleep(1000); // send out empty request sendContainerRequest(); response = responseQueue.take(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2526: --- Component/s: scheduler-load-simulator Priority: Critical (was: Minor) Target Version/s: 2.6.0 Affects Version/s: 2.5.1 SLS can deadlock when all the threads are taken by AMSimulators --- Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Components: scheduler-load-simulator Affects Versions: 2.5.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2526-1.patch The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator. {code} // waiting until the AM container is allocated while (true) { if (response != null && !response.getAllocatedContainers().isEmpty()) { // get AM container . break; } // this sleep time is different from HeartBeat Thread.sleep(1000); // send out empty request sendContainerRequest(); response = responseQueue.take(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2526: --- Summary: SLS can deadlock when all the threads are taken by AMSimulators (was: Scheduler Load Simulator may enter deadlock if lots of applications submitted to the RM at the same time) SLS can deadlock when all the threads are taken by AMSimulators --- Key: YARN-2526 URL: https://issues.apache.org/jira/browse/YARN-2526 Project: Hadoop YARN Issue Type: Bug Components: scheduler-load-simulator Affects Versions: 2.5.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2526-1.patch The simulation may enter deadlock if all application simulators hold all threads provided by the thread pool, and all wait for AM container allocation. In that case, all AM simulators wait for NM simulators to do heartbeat to allocate resource, and all NM simulators wait for AM simulators to release some threads. The simulator is deadlocked. To solve this deadlock, we need to remove the while() loop in the MRAMSimulator. {code} // waiting until the AM container is allocated while (true) { if (response != null && !response.getAllocatedContainers().isEmpty()) { // get AM container . break; } // this sleep time is different from HeartBeat Thread.sleep(1000); // send out empty request sendContainerRequest(); response = responseQueue.take(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
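As an illustration only, here is a sketch of the general shape of a fix for the deadlock described in the issue: rather than holding a pool thread in a while(true) wait until the AM container arrives, the simulator checks once per scheduled heartbeat and returns, freeing the thread for the NM simulators. The names (AllocateResponseLike, onHeartbeat) are hypothetical and this is not the actual MRAMSimulator change.
{code}
import java.util.List;

public class NonBlockingAmStartSketch {

  /** Hypothetical stand-in for the allocate response; not YARN API. */
  interface AllocateResponseLike {
    List<String> getAllocatedContainers();
  }

  private boolean amContainerAllocated = false;

  /** Called by the simulator's scheduled executor on every heartbeat tick. */
  void onHeartbeat(AllocateResponseLike response) {
    if (!amContainerAllocated) {
      if (response != null && !response.getAllocatedContainers().isEmpty()) {
        amContainerAllocated = true; // AM container obtained; start real work next tick
      }
      return; // no busy-wait: the thread goes back to the pool either way
    }
    // ... normal AM simulation work once the AM container is running ...
  }
}
{code}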
[jira] [Updated] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201409092204.txt [~jianhe], thank you very much for your time in reviewing this patch and your helpful suggestions. {quote} seems we don't need this check, because the returned ApplicationResourceUsageReport for non-active attempt is anyways null. {code} // Only add in the running containers if this is the active attempt. RMAppAttempt currentAttempt = rmContext.getRMApps() .get(attemptId.getApplicationId()).getCurrentAppAttempt(); if (currentAttempt != null currentAttempt.getAppAttemptId().equals(attemptId)) { {code} {quote} You are correct. The above check for {{currentAttempt != null}} is not necessary. With this new patch, I have upmerged again (since it wasn't applying cleanly) and removed this check. [~kkambatl], I would also like to thank you for your help on this patch. Were you okay with the changes I made in response to your suggestions? It would be great if we could move this patch over the goal line soon. Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
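A small worked example of the chargeback formula in the description above: the aggregate is the sum over containers of (reserved memory in MB) times (container lifetime in seconds). The container numbers here are made up for illustration; this is not ResourceManager code.
{code}
public class MemorySecondsExample {

  /** containers[i] = {reservedMb, lifetimeSeconds}. */
  static long memorySeconds(long[][] containers) {
    long total = 0;
    for (long[] c : containers) {
      long reservedMb = c[0];
      long lifetimeSeconds = c[1];
      total += reservedMb * lifetimeSeconds;
    }
    return total;
  }

  public static void main(String[] args) {
    long[][] containers = { {2048, 600}, {1024, 300}, {1024, 300} };
    // 2048*600 + 1024*300 + 1024*300 = 1,843,200 MB-seconds
    System.out.println(memorySeconds(containers) + " MB-seconds");
  }
}
{code}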
[jira] [Comment Edited] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127666#comment-14127666 ] Karthik Kambatla edited comment on YARN-415 at 9/9/14 10:20 PM: Eric - I haven't had a chance to take a look at the latest patch. I trust Jian and you to make sure the concerns are addressed, the suggestions themselves were straight-forward. Thanks for staying patient through this long-drawn JIRA. was (Author: kkambatl): Eric - I haven't had a chance to take a look at the latest patch. I trust Jian and you to make sure the concerns are addressed, the suggestions themselves were straight-forward. Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127666#comment-14127666 ] Karthik Kambatla commented on YARN-415: --- Eric - I haven't had a chance to take a look at the latest patch. I trust Jian and you to make sure the concerns are addressed, the suggestions themselves were straight-forward. Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated YARN-2527: --- Attachment: YARN-2527.patch Attaching a patch which checks if the map of ACLs for an application is null. If null, it uses the default ACL. A new test case is added which checks the normal case as well as the case when the ACL is not set for an application. NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Attachments: YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
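For illustration, a minimal sketch of the guard described in the comment above: if no ACLs were registered for the application, fall back to a default ACL instead of dereferencing a null map entry. The types, the "VIEW_APP" key, and the DEFAULT_YARN_APP_ACL value are placeholders, not the actual ApplicationACLsManager code or patch.
{code}
import java.util.Map;

public class AclLookupSketch {

  // Assumed default for illustration; the real default ACL value may differ.
  private static final String DEFAULT_YARN_APP_ACL = "*";

  static String viewAclFor(Map<String, Map<String, String>> aclsByApp, String appId) {
    Map<String, String> appAcls = aclsByApp.get(appId);
    if (appAcls == null || !appAcls.containsKey("VIEW_APP")) {
      // Previously this path could NPE; fall back to the default ACL instead.
      return DEFAULT_YARN_APP_ACL;
    }
    return appAcls.get("VIEW_APP");
  }
}
{code}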
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127755#comment-14127755 ] Benoy Antony commented on YARN-2527: [~vinodkv], could you please review this jira ? Could you also make me a Yarn contributor so that I can assign the jira to me? NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Attachments: YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1458: --- Attachment: yarn-1458-7.patch Thanks Zhihai. I see the advantage of the second approach. My main concern is readability of the approach. I have taken a stab at making it more readable/maintainable through only cosmetic changes. Can you please take a look and see if these cosmetic changes make sense to you. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1412#comment-1412 ] Hadoop QA commented on YARN-415: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667510/YARN-415.201409092204.txt against trunk revision 28d99db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4863//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4863//console This message is automatically generated. Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. 
We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127795#comment-14127795 ] Hadoop QA commented on YARN-2527: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667532/YARN-2527.patch against trunk revision 28d99db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4864//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4864//console This message is automatically generated. NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Attachments: YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127854#comment-14127854 ] Hadoop QA commented on YARN-1458: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667535/yarn-1458-7.patch against trunk revision 28d99db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4865//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4865//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. 
The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at
[jira] [Updated] (YARN-1712) Admission Control: plan follower
[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1712: --- Attachment: YARN-1712.3.patch Thanks [~jianhe] for your detailed feedback. I am attaching a patch with the following updates: * Made the move-apps logic synchronous; the move is to defReservationQueue (renamed) * Removed the synchronized on scheduler as individual calls are already synchronized * Fixed comment formatting and variable names * Created a common method to calculate lhsRes and rhsRes * Optimized the loop as suggested Some clarifications: * Exceptions are suppressed deliberately as PlanFollower is a background timer thread and we don't want it to exit * _plan.getReservationsAtTime(now)_ is used by others like Replanners. We need the reservations and not just the names even in PlanFollower, so leaving it as is * Tried moving the default queue creation to when PlanQueue is initialized in CapacityScheduler, but it was getting overly complex, mainly due to the relaxed constraint of child capacities =100% for PlanQueues. This is just an additional hashmap lookup with the code being much cleaner, so not moving it for now. If it is still a concern, I can add a flag to Plan and check that instead of CapacityScheduler#getQueue Admission Control: plan follower Key: YARN-1712 URL: https://issues.apache.org/jira/browse/YARN-1712 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations, scheduler Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.3.patch, YARN-1712.patch This JIRA tracks a thread that continuously propagates the current state of an inventory subsystem to the scheduler. As the inventory subsystem stores the plan of how the resources should be subdivided, the work we propose in this JIRA realizes such a plan by dynamically instructing the CapacityScheduler to add/remove/resize queues to follow the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1711: --- Attachment: YARN-1712.3.patch Updated patch to include *CapacitySchedulerConfiguration* based on [~vinodkv]'s [suggestion | https://issues.apache.org/jira/browse/YARN-2080?focusedCommentId=14125994], as the _majority_ of the configurations are for enforcement policies. CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Attachments: YARN-1711.1.patch, YARN-1711.patch, YARN-1712.3.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1711: --- Attachment: (was: YARN-1712.3.patch) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1711: --- Attachment: YARN-1711.2.patch CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1712) Admission Control: plan follower
[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127913#comment-14127913 ] Hadoop QA commented on YARN-1712: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667564/YARN-1712.3.patch against trunk revision 0de563a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4866//console This message is automatically generated. Admission Control: plan follower Key: YARN-1712 URL: https://issues.apache.org/jira/browse/YARN-1712 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations, scheduler Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.3.patch, YARN-1712.patch This JIRA tracks a thread that continuously propagates the current state of an inventory subsystem to the scheduler. As the inventory subsystem store the plan of how the resources should be subdivided, the work we propose in this JIRA realizes such plan by dynamically instructing the CapacityScheduler to add/remove/resize queues to follow the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: yarn-1458-8.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127929#comment-14127929 ] Hadoop QA commented on YARN-1711: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667567/YARN-1712.3.patch against trunk revision 0de563a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4867//console This message is automatically generated. CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127939#comment-14127939 ] zhihai xu commented on YARN-1458: - Hi [~kasha], Your change makes the code much easier to read and maintain. I uploaded a new patch yarn-1458-8.patch with two minor changes based on your patch: use Math.max instead of Math.abs and check schedulables.isEmpty() after handleFixedFairShares. Please review it. Thanks. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
    - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
    at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
    - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
    - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
    at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
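The two changes described in zhihai xu's comment above can be illustrated with a small, self-contained sketch. All class and method names below (FairShareSketch, removeFixedShares, getFixedShare) are illustrative stand-ins, not the code in yarn-1458-8.patch.
{code}
import java.util.Iterator;
import java.util.List;

class FairShareSketch {
  interface Schedulable {
    double getWeight();
    int getFixedShare();   // < 0 means the share is not fixed
  }

  static void computeShares(List<Schedulable> schedulables, int totalResource) {
    // Take fixed-share schedulables out of the computation first
    // (a stand-in for the handleFixedFairShares step mentioned above).
    int remaining = removeFixedShares(schedulables, totalResource);

    // The isEmpty() check mentioned in the comment: without it, the
    // iterative share computation below would run on an empty list.
    if (schedulables.isEmpty()) {
      return;
    }

    // Math.max rather than Math.abs, so a negative or zero weight can never
    // inflate the weight-to-resource ratio.
    double totalWeight = 0.0;
    for (Schedulable s : schedulables) {
      totalWeight += Math.max(s.getWeight(), 0.0);
    }
    double resourcePerWeight = remaining / Math.max(totalWeight, 1.0);
    System.out.println("resource per unit of weight: " + resourcePerWeight);
  }

  private static int removeFixedShares(List<Schedulable> schedulables, int total) {
    int remaining = total;
    Iterator<Schedulable> it = schedulables.iterator();
    while (it.hasNext()) {
      Schedulable s = it.next();
      if (s.getFixedShare() >= 0) {
        remaining -= s.getFixedShare();
        it.remove();
      }
    }
    return Math.max(remaining, 0);
  }
}
{code}
The early return matters for this JIRA: the update thread holds the scheduler lock while computing shares, so any unbounded work in that path keeps other handlers (such as removeApplication in the stack trace above) blocked.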
[jira] [Commented] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127941#comment-14127941 ] Hadoop QA commented on YARN-1711: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667570/YARN-1711.2.patch against trunk revision 0de563a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4868//console This message is automatically generated. CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127967#comment-14127967 ] Hadoop QA commented on YARN-1458: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667572/yarn-1458-8.patch against trunk revision 0de563a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4869//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4869//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it.
The output of the jstack command on the resourcemanager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
    - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
    at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
    - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
    - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
    at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127987#comment-14127987 ] Wangda Tan commented on YARN-2456: -- I think there are too many factors that affect the active applications list after application submission. IMHO, recovering applications by time of creation or submission is not a big deal; keeping it simple and straightforward is more important, so I prefer Jian's method. +1 for the patch. Thanks, Possible deadlock in CapacityScheduler when RM is recovering apps - Key: YARN-2456 URL: https://issues.apache.org/jira/browse/YARN-2456 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2456.1.patch Consider this scenario: 1. RM is configured with a single queue and only one application can be active at a time. 2. Submit App1 which uses up the queue's whole capacity 3. Submit App2 which remains pending. 4. Restart RM. 5. App2 is recovered before App1, so App2 is added to the activeApplications list. Now App1 remains pending (because of max-active-app limit) 6. All containers of App1 are now recovered when NM registers, and use up the whole queue capacity again. 7. Since the queue is full, App2 cannot proceed to allocate AM container. 8. In the meanwhile, App1 cannot proceed to become active because of the max-active-app limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
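For illustration, the ordering-based alternative mentioned in the comment above (recovering applications by their submission time) could look roughly like the sketch below; the RecoveredApp class and its fields are placeholders, not the actual RM recovery code.
{code}
import java.util.Comparator;
import java.util.List;

class RecoveryOrderSketch {
  static class RecoveredApp {            // stand-in for the persisted application state
    final String appId;
    final long submitTime;
    RecoveredApp(String appId, long submitTime) {
      this.appId = appId;
      this.submitTime = submitTime;
    }
  }

  static void recoverInSubmissionOrder(List<RecoveredApp> apps) {
    // Oldest submission first, so the application that was active before the
    // restart is the one considered for activation first after the restart.
    apps.sort(Comparator.comparingLong(a -> a.submitTime));
    for (RecoveredApp app : apps) {
      System.out.println("recovering " + app.appId);
    }
  }
}
{code}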
[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127989#comment-14127989 ] Wangda Tan commented on YARN-2456: -- In addition, I suggest changing the {{deadlock}} in the title to {{livelock}}, because no thread is blocked on a lock here; it is just a state that prevents the applications from making progress. See: http://en.wikipedia.org/wiki/Deadlock#Livelock Wangda Possible deadlock in CapacityScheduler when RM is recovering apps - Key: YARN-2456 URL: https://issues.apache.org/jira/browse/YARN-2456 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2456.1.patch Consider this scenario: 1. RM is configured with a single queue and only one application can be active at a time. 2. Submit App1 which uses up the queue's whole capacity 3. Submit App2 which remains pending. 4. Restart RM. 5. App2 is recovered before App1, so App2 is added to the activeApplications list. Now App1 remains pending (because of max-active-app limit) 6. All containers of App1 are now recovered when NM registers, and use up the whole queue capacity again. 7. Since the queue is full, App2 cannot proceed to allocate AM container. 8. In the meanwhile, App1 cannot proceed to become active because of the max-active-app limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128005#comment-14128005 ] Benoy Antony commented on YARN-2527: [~zjshen], could you please review this jira ? Could you also make me a Yarn contributor so that I can assign the jira to me? NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Attachments: YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
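A minimal sketch of how the reported NPE can arise and be guarded against, assuming a simplified per-application ACL map; the types and method bodies below are illustrative, not the actual ApplicationACLsManager implementation.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class AclsCheckSketch {
  enum AccessType { VIEW_APP, MODIFY_APP }

  private final Map<String, Map<AccessType, String>> appAcls = new ConcurrentHashMap<>();

  boolean checkAccess(String user, AccessType type, String owner, String appId) {
    if (user.equals(owner)) {
      return true;                         // the owner can always view its own app
    }
    Map<AccessType, String> acls = appAcls.get(appId);
    if (acls == null) {
      // Without this guard, acls.get(type) below dereferences null, which is
      // the failure mode shown in the stack trace above.
      return false;
    }
    String allowedUsers = acls.get(type);
    return allowedUsers != null && allowedUsers.contains(user);
  }
}
{code}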
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128011#comment-14128011 ] Wangda Tan commented on YARN-796: - Hi [~cwelch] and [~aw], I agree with #3 as well, since the original motivation is to avoid case typos from users. But looking at other existing YARN configs, like CS queue names, a different case of a queue name means a different queue. I prefer to drop the requirement unless there is a strong opinion to keep it. Thanks, Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
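For illustration, the two options being weighed above (case-insensitive matching to forgive typos versus case-sensitive matching, as with CS queue names) can be sketched as follows; the class and method names are hypothetical and not part of the YARN-796 patches.
{code}
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class NodeLabelCaseSketch {
  private final Set<String> labels = new HashSet<>();
  private final boolean caseInsensitive;

  NodeLabelCaseSketch(boolean caseInsensitive) {
    this.caseInsensitive = caseInsensitive;
  }

  private String normalize(String label) {
    return caseInsensitive ? label.toLowerCase(Locale.ROOT) : label;
  }

  void addLabel(String label) {
    labels.add(normalize(label));
  }

  boolean matches(String requested) {
    // With caseInsensitive=true, "GPU" and "gpu" are the same label; with
    // caseInsensitive=false they are distinct, mirroring CS queue names.
    return labels.contains(normalize(requested));
  }
}
{code}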
[jira] [Commented] (YARN-1712) Admission Control: plan follower
[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128023#comment-14128023 ] Wangda Tan commented on YARN-1712: -- Hi [~subru], I've just taken a look at your latest patch; the code is much cleaner than before, thanks! I don't quite understand what you said: bq. Tried moving the default queue creating to when PlanQueue is initialized in CapacityScheduler but it was getting overly complex mainly due to the relaxed constraint of child capacities =100% for PlanQueues. This is just an additional hashmap lookup with the code being much cleaner so not moving it for now. If it is still a concern, I can add a flag to Plan and check that instead of CapacityScheduler#getQueue Could you please elaborate? In addition, a very minor comment: could you wrap LOG.debug calls within an {{if (LOG.isDebugEnabled())}} block, as is done in other modules? {code} if (LOG.isDebugEnabled()) { // ... } {code} Thanks, Wangda Admission Control: plan follower Key: YARN-1712 URL: https://issues.apache.org/jira/browse/YARN-1712 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations, scheduler Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.3.patch, YARN-1712.patch This JIRA tracks a thread that continuously propagates the current state of an inventory subsystem to the scheduler. As the inventory subsystem stores the plan of how the resources should be subdivided, the work we propose in this JIRA realizes such a plan by dynamically instructing the CapacityScheduler to add/remove/resize queues to follow the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
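A slightly fuller version of the guard requested in the comment above, using the commons-logging API that Hadoop modules use; the method name, parameters, and message are illustrative only.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class DebugGuardSketch {
  private static final Log LOG = LogFactory.getLog(DebugGuardSketch.class);

  void syncPlanToQueue(String planQueueName, float targetCapacity) {
    if (LOG.isDebugEnabled()) {
      // The guard avoids building the message string when debug logging is off.
      LOG.debug("Updating plan queue " + planQueueName
          + " to capacity " + targetCapacity);
    }
  }
}
{code}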
[jira] [Commented] (YARN-2528) Cross Origin Filter Http response split vulnerability protection rejects valid origins
[ https://issues.apache.org/jira/browse/YARN-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128026#comment-14128026 ] Zhijie Shen commented on YARN-2528: --- [~jeagles], no problem. I compared our CrossOriginFilter with the one in Jetty. That one does not seem to do any post-processing of the string obtained from the ORIGIN header. What's the reason we need it for our CrossOriginFilter? According to the test case, you want to avoid the case where the string contains another header, don't you? Doesn't HttpServletResponse.getHeader handle header splitting properly? BTW, it seems that ours only allows one origin in the request header, but Jetty's allows multiple ones. And I found a specification, http://tools.ietf.org/html/draft-abarth-origin-09, which says that ORIGIN can be a list. Any thoughts? Cross Origin Filter Http response split vulnerability protection rejects valid origins -- Key: YARN-2528 URL: https://issues.apache.org/jira/browse/YARN-2528 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2528-v1.patch URLEncoding is too strong a protection against the HTTP Response Split vulnerability, and major browsers reject the encoded Origin. An adequate protection is simply to remove all CRs and LFs, as in the case of PHP's header function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
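A hedged sketch of the behavior discussed above: strip CR and LF from the Origin value instead of URL-encoding it, and tolerate a space-separated list of origins as the cited draft allows. The helper below is illustrative, not the actual CrossOriginFilter code.
{code}
import java.util.ArrayList;
import java.util.List;

class OriginSketch {
  static List<String> parseOrigins(String originHeader) {
    List<String> origins = new ArrayList<>();
    if (originHeader == null) {
      return origins;
    }
    // Removing CR and LF prevents header splitting without mangling
    // legitimate origins the way URL-encoding does.
    String sanitized = originHeader.replaceAll("[\r\n]", "");
    for (String origin : sanitized.trim().split("\\s+")) {
      if (!origin.isEmpty()) {
        origins.add(origin);
      }
    }
    return origins;
  }

  public static void main(String[] args) {
    // Prints [http://foo.example, http://bar.example]
    System.out.println(parseOrigins("http://foo.example http://bar.example"));
  }
}
{code}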
[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2527: -- Assignee: Benoy Antony NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128038#comment-14128038 ] Zhijie Shen commented on YARN-2527: --- [~benoyantony], I've added you as a YARN contributor and assigned this Jira to you. W.r.t. the NPE, did you have a chance to see why the NPE happens? For each submitted app, its acls seem to be added into ApplicationACLsManager. ContainerLaunchContext#getApplicationACLs should return an empty acls map if the user doesn't specify anything, right? NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch NPE in _ApplicationACLsManager_ can result in a 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
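For illustration, the registration path being asked about could default to an empty map at submission time so that later lookups never see a missing entry; the sketch below uses placeholder types, not the real ApplicationACLsManager API.
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class AclRegistrationSketch {
  enum AccessType { VIEW_APP, MODIFY_APP }

  private final Map<String, Map<AccessType, String>> appAcls = new HashMap<>();

  void addApplication(String appId, Map<AccessType, String> acls) {
    // Defaulting to an empty map mirrors the expectation in the comment above:
    // a submission without explicit ACLs still leaves a non-null entry behind.
    appAcls.put(appId, acls == null ? Collections.<AccessType, String>emptyMap() : acls);
  }
}
{code}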
[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128049#comment-14128049 ] Wangda Tan commented on YARN-2158: -- Thanks [~vvasudev] for the fix; it looks good to me. Also thanks for the improvements in the patch. +1, Wangda TestRMWebServicesAppsModification sometimes fails in trunk -- Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Varun Vasudev Priority: Minor Attachments: apache-yarn-2158.0.patch, apache-yarn-2158.1.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
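For context, flakiness of this kind (an "app state incorrect" assertion made right after an action) is often addressed by polling for the expected state with a timeout instead of asserting once; the sketch below is a generic illustration under that assumption, not the apache-yarn-2158 patches.
{code}
class AwaitStateSketch {
  interface StateSupplier {
    String currentState();
  }

  static boolean waitForState(StateSupplier app, String expected, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (expected.equals(app.currentState())) {
        return true;
      }
      Thread.sleep(100);   // give the RM time to process the state transition
    }
    return false;
  }
}
{code}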
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128097#comment-14128097 ] Tsuyoshi OZAWA commented on YARN-2517: -- [~zjshen], [~vinodkv], can we go with the current design (v1 patch)? Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch In some scenarios, we'd like to put timeline entities in another thread so as not to block the current one. It's good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them from a separate thread, and have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
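A minimal sketch of the idea in the description above, assuming placeholder types and method names (not the v1 patch): buffer entities, send them from a background dispatcher thread, and report outcomes through a callback.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class TimelineClientAsyncSketch {
  interface Callback {
    void onSuccess(String entityId);
    void onError(String entityId, Throwable error);
  }

  private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
  private final Callback callback;
  private final Thread dispatcher;
  private volatile boolean running = true;

  TimelineClientAsyncSketch(Callback callback) {
    this.callback = callback;
    this.dispatcher = new Thread(this::dispatchLoop, "timeline-dispatcher");
    this.dispatcher.setDaemon(true);
    this.dispatcher.start();
  }

  /** Callers enqueue entities without waiting for the HTTP round trip. */
  void putEntityAsync(String entityId) {
    buffer.offer(entityId);
  }

  private void dispatchLoop() {
    while (running) {
      String entityId;
      try {
        entityId = buffer.take();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      try {
        // A real client would perform the HTTP PUT of the entity here.
        callback.onSuccess(entityId);
      } catch (RuntimeException e) {
        callback.onError(entityId, e);
      }
    }
  }

  void stop() {
    running = false;
    dispatcher.interrupt();
  }
}
{code}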