[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066145#comment-14066145 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], I've spent some time reviewing and thinking about this JIRA. I have a few suggestions:
1. Revert the changes to SchedulerAppReport; we have already changed ApplicationResourceUsageReport, and memory utilization should be part of the resource usage report.
2. Remove getMemory(VCore)Seconds from RMAppAttempt, and modify RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running resource utilization.
3. Move
{code}
._("Resources:", String.format("%d MB-seconds, %d vcore-seconds", app.getMemorySeconds(), app.getVcoreSeconds()))
{code}
from Application Overview to Application Metrics, and rename it to "Resource Seconds". It should be considered part of the application metrics instead of the overview.
4. Change finishedMemory/VCoreSeconds to AtomicLong in RMAppAttemptMetrics so it can be efficiently accessed by multiple threads.
5. I think it's better to add a new method in SchedulerApplicationAttempt, like getMemoryUtilization, which will only return memory/cpu seconds. We do this to avoid locking the scheduling thread when showing application metrics on the web UI. getMemoryUtilization will be used by RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running resource utilization, and by SchedulerApplicationAttempt#getResourceUsageReport as well. The MemoryUtilization class may contain two fields: runningContainerMemory(VCore)Seconds.
6. Since computing running-container resource utilization is not O(1) (we need to scan all containers under an application), I think it's better to cache a previously computed result and recompute it only after several seconds (maybe 1-3 seconds should be enough) have elapsed. You can also modify SchedulerApplicationAttempt#liveContainers to be a ConcurrentHashMap.
With #6, getting memory utilization to show metrics on the web UI will not lock the scheduling thread at all. Please let me know if you have any comments here. Thanks, Wangda
Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.patch
For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it.
(reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n)
It'd be nice to have this at the app level instead of the job level because:
1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server).
2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
This new metric should be available both through the RM UI and the RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
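To make the caching idea in points 5 and 6 above concrete, here is a minimal, self-contained sketch in plain Java. The class and field names (AppAttemptUsageSketch, liveContainers as a map of arrays, the 3-second interval) are illustrative assumptions, not the actual YARN-415 patch; it only shows how finished usage in an AtomicLong plus a periodically recomputed running-container cache lets a web UI read avoid locking the scheduling thread.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class AppAttemptUsageSketch {
  // Finished usage tracked lock-free, as in suggestion #4 above.
  private final AtomicLong finishedMemorySeconds = new AtomicLong();

  // Live containers keyed by container id: [0] = reserved MB, [1] = start time millis.
  private final Map<String, long[]> liveContainers = new ConcurrentHashMap<String, long[]>();

  // Cached running-container usage, recomputed at most every few seconds (suggestion #6).
  private volatile long cachedRunningMemorySeconds;
  private volatile long lastComputeMillis;
  private static final long RECOMPUTE_INTERVAL_MS = 3000;

  void containerStarted(String id, long reservedMb, long startMillis) {
    liveContainers.put(id, new long[] { reservedMb, startMillis });
  }

  void containerFinished(String id) {
    long[] c = liveContainers.remove(id);
    if (c != null) {
      // reserved MB * lifetime in seconds, folded into the completed total
      long lifeSeconds = (System.currentTimeMillis() - c[1]) / 1000;
      finishedMemorySeconds.addAndGet(c[0] * lifeSeconds);
    }
  }

  long getMemorySeconds() {
    long now = System.currentTimeMillis();
    if (now - lastComputeMillis > RECOMPUTE_INTERVAL_MS) {
      // The O(n) scan over live containers runs only once per interval,
      // so frequent UI reads stay cheap and never block scheduling.
      long running = 0;
      for (long[] c : liveContainers.values()) {
        running += c[0] * ((now - c[1]) / 1000);
      }
      cachedRunningMemorySeconds = running;
      lastComputeMillis = now;
    }
    // completed + running, as in suggestion #2 above
    return finishedMemorySeconds.get() + cachedRunningMemorySeconds;
  }
}
{code}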
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066154#comment-14066154 ] Wangda Tan commented on YARN-2305: -- Thanks for your elaboration, I understand now. I think this is an inconsistency between ParentQueue and LeafQueue; using clusterResource instead of allocated+available can definitely solve this problem. When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg ENV Details: 3 queues: a(50%), b(25%), c(25%), all with max utilization set to 100; a 2-node cluster with total memory of 16GB. Test Steps: Execute the following 3 jobs with different memory configurations for the map, reducer and AM tasks: ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue: when 2GB of memory is in the reserved state, total memory is shown as 15GB and used as 15GB (while total memory is 16GB). -- This message was sent by Atlassian JIRA (v6.2#6252)
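A small self-contained illustration of the arithmetic behind the fix discussed above; the numbers are made up for clarity (they do not reproduce the reporter's exact 15GB reading), and the variable names are not scheduler code.
{code}
public class ReservedMemoryDisplay {
  public static void main(String[] args) {
    long clusterMB   = 16 * 1024;                       // real cluster size: 16 GB
    long allocatedMB = 13 * 1024;                       // handed out to running containers
    long reservedMB  =  2 * 1024;                       // held by a container in reserved state
    long availableMB = clusterMB - allocatedMB - reservedMB;

    // Deriving "total" as allocated + available silently drops the reserved amount...
    long displayedTotalMB = allocatedMB + availableMB;  // 14 GB shown on a 16 GB cluster

    // ...whereas reporting clusterResource directly stays correct regardless of reservations.
    System.out.println("displayed=" + displayedTotalMB + " MB, actual=" + clusterMB + " MB");
  }
}
{code}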
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066156#comment-14066156 ] Wangda Tan commented on YARN-2308: -- I think it should be doable; a missing queue for an application should not make the RM fail to start. NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Priority: Critical I encountered an NPE during RM restart:
{code}
2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:744)
{code}
And the RM will then fail to restart. This is caused by a queue configuration change: I removed some queues and added new queues. So when the RM restarts, it tries to recover historical applications, and when any of those applications' queues have been removed, an NPE is raised. -- This message was sent by Atlassian JIRA (v6.2#6252)
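A hedged sketch of the general shape of the guard being discussed; this is not the actual CapacityScheduler code, and the surrounding variables (queues, rmContext, applicationId) are assumed from context rather than taken from the patch.
{code}
// If the application's queue was removed from the configuration before restart,
// reject just this recovered application instead of letting a NullPointerException
// propagate and abort ResourceManager startup.
CSQueue queue = queues.get(application.getQueueName());
if (queue == null) {
  rmContext.getDispatcher().getEventHandler().handle(
      new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED));
  return;
}
{code}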
[jira] [Assigned] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2301: --- Assignee: Naganarasimha G R Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running the yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https) before LOG-URL is missing. 2) the start-time is printed as milliseconds (e.g. 1405540544844); better to print it in a time format. 3) finish-time is 0 if the container is not yet finished; maybe print N/A instead. 4) maybe have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As the attempt Id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
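A small runnable illustration of items 2) and 3) above: printing epoch milliseconds as a readable timestamp and showing N/A for an unfinished container. The class and method names are made up for the example, not taken from the YARN CLI code.
{code}
import java.text.SimpleDateFormat;
import java.util.Date;

public class ContainerReportFormat {
  // Render a start/finish time: 0 (or negative) means "not finished yet".
  static String formatTime(long millis) {
    return millis <= 0 ? "N/A"
        : new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy").format(new Date(millis));
  }

  public static void main(String[] args) {
    System.out.println(formatTime(1405540544844L)); // readable date instead of raw millis
    System.out.println(formatTime(0L));             // N/A for an unfinished container
  }
}
{code}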
[jira] [Updated] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2319: - Attachment: YARN-2319.0.patch Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invokes the start method, never stop, in TestRMWebServicesDelegationTokens.java: {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
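A minimal sketch of the kind of cleanup such a fix would add (using JUnit 4's org.junit.AfterClass). The field name testMiniKDC is taken from the snippet above; the method name is made up, and the actual patch may differ.
{code}
@AfterClass
public static void stopMiniKdc() {
  if (testMiniKDC != null) {
    testMiniKDC.stop();  // release the KDC's port and background threads after the test class
  }
}
{code}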
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033_ALL.1.patch Upload a patch including the two dependent ones for jenkins to verify. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.1.patch I've made a first patch that includes the whole feature for the timeline-store-based generic history service, plus test cases. In this jira, I don't deprecate the old application history store classes. I'll file another jira for that. Once this jira is done, we should mark those classes deprecated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066186#comment-14066186 ] Zhijie Shen commented on YARN-2319: --- I encountered some test failures today around this test case. Will take a look Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2320) Deprecate existing application history store after we store the history data to timeline store
Zhijie Shen created YARN-2320: - Summary: Deprecate existing application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen After YARN-2033, we should deprecate the application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jiras under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066195#comment-14066195 ] Zhijie Shen commented on YARN-2301: ---
bq. As attempt Id is not shown on console, this is easier for user to just copy the appId and run it, may also be useful for container-preserving AM restart.
You can run yarn appattempt to get the attempt. Anyway, it's arguable whether it is user friendly or not. Given that we are adding a function, I vote for yarn container -list appId. One more comment: “yarn container” can source the container information either from the RM or from the timeline server. When making the changes, please make sure both sides are changed consistently. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running the yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https) before LOG-URL is missing. 2) the start-time is printed as milliseconds (e.g. 1405540544844); better to print it in a time format. 3) finish-time is 0 if the container is not yet finished; maybe print N/A instead. 4) maybe have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As the attempt Id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066210#comment-14066210 ] Hadoop QA commented on YARN-2319: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656480/YARN-2319.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4357//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4357//console This message is automatically generated. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066220#comment-14066220 ] Zhijie Shen commented on YARN-2319: --- I ran through the test cases on trunk again. The failure I encountered before is not related to this. However, it's still good to have the close at the end. The set of test failures seem to be related to other things as well. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066224#comment-14066224 ] Zhijie Shen commented on YARN-2304: --- It happened several times. Another instance: https://issues.apache.org/jira/browse/YARN-2319?focusedCommentId=14066210page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14066210 TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066246#comment-14066246 ] Hadoop QA commented on YARN-2033: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656482/YARN-2033_ALL.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 20 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.resourcemanager.metrics.TestYarnMetricsPublisher {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4358//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4358//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4358//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4358//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4358//console This message is automatically generated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066251#comment-14066251 ] Hudson commented on YARN-1341: -- FAILURE: Integrated in Hadoop-Yarn-trunk #616 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/616/]) YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611512) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/BaseNMTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/security/NMTokenSecretManagerInNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security/TestNMTokenSecretManagerInNM.java Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.6.0 Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2321) NodeManager WebUI get wrong configuration of isPmemCheckEnabled()
Leitao Guo created YARN-2321: Summary: NodeManager WebUI get wrong configuration of isPmemCheckEnabled() Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo The NodeManager WebUI gets the wrong configuration for whether pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066272#comment-14066272 ] Varun Vasudev commented on YARN-2270: - [~ajisakaa] your current patch is ok, but maybe we should skip the test if the ancestor permissions aren't right? If the real issue is the ancestor permissions, then the get() will fail for all the files. Maybe something like -
{noformat}
boolean ancestorPermissionsOK = FSDownload.ancestorsHaveExecutePermissions(fs, basedir, null);
assumeTrue(ancestorPermissionsOK);
{noformat}
The benefit of this approach is that the test gets reported as skipped, and people who are interested in ensuring it runs correctly can fix their build environment to ensure the test runs. Your current approach hides the fact that the test didn't really do what it was expected to do (apart from the log message). TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
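An illustration (with assumed test scaffolding, not the final patch) of how the suggested assumeTrue guard reads in context: when the assumption fails, JUnit reports the test as skipped rather than failed, which is the behaviour argued for above.
{code}
import static org.junit.Assume.assumeTrue;

@Test
public void testDownloadPublicWithStatCache() throws Exception {
  // ancestorsHaveExecutePermissions is the check named in the comment above;
  // fs and basedir are assumed to be the test's existing FileSystem and base path.
  assumeTrue(FSDownload.ancestorsHaveExecutePermissions(fs, basedir, null));
  // ... the rest of the test runs only when the ancestor permissions allow it ...
}
{code}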
[jira] [Updated] (YARN-2321) NodeManager WebUI get wrong configuration of isPmemCheckEnabled()
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2321: - Attachment: YARN-2321.patch NodeManager WebUI get wrong configuration of isPmemCheckEnabled() - Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Attachments: YARN-2321.patch The NodeManager WebUI gets the wrong configuration for whether pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
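For reference, a hedged sketch of the lookup the WebUI is expected to perform. The issue doesn't say whether the bug is reading a different key or a stale Configuration object, so this only shows the intended call; conf is an assumed NodeManager Configuration.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// yarn.nodemanager.pmem-check-enabled, defaulting to the standard YARN default
boolean pmemCheckEnabled = conf.getBoolean(
    YarnConfiguration.NM_PMEM_CHECK_ENABLED,
    YarnConfiguration.DEFAULT_NM_PMEM_CHECK_ENABLED);
{code}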
[jira] [Updated] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-2270: Attachment: YARN-2270.2.patch TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2321) NodeManager WebUI get wrong configuration of isPmemCheckEnabled()
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066291#comment-14066291 ] Hadoop QA commented on YARN-2321: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656497/YARN-2321.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4359//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4359//console This message is automatically generated. NodeManager WebUI get wrong configuration of isPmemCheckEnabled() - Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Attachments: YARN-2321.patch WebUI of NodeManager get the wrong configuration of Pmem enforcement enable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066290#comment-14066290 ] Akira AJISAKA commented on YARN-2270: - Thanks [~vvasudev] for the review! Update the patch to skip test if the basedir doesn't have the ancestor permissions. TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066295#comment-14066295 ] Hadoop QA commented on YARN-2270: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656501/YARN-2270.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4360//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4360//console This message is automatically generated. TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066297#comment-14066297 ] Varun Vasudev commented on YARN-2270: - +1, looks good to me. TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066313#comment-14066313 ] Naganarasimha G R commented on YARN-2301: - Thanks [~zjshen] for the comments. I feel it would be easier to hit a single command, and I would like to add yarn container -list appId. I will consider the changes for container information obtained from the Timeline/History server as well. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running the yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https) before LOG-URL is missing. 2) the start-time is printed as milliseconds (e.g. 1405540544844); better to print it in a time format. 3) finish-time is 0 if the container is not yet finished; maybe print N/A instead. 4) maybe have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As the attempt Id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066342#comment-14066342 ] Hudson commented on YARN-1341: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1835 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1835/]) YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611512) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/BaseNMTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/security/NMTokenSecretManagerInNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security/TestNMTokenSecretManagerInNM.java Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.6.0 Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066365#comment-14066365 ] Hudson commented on YARN-1341: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1808 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1808/]) YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611512) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/BaseNMTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/security/NMTokenSecretManagerInNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security/TestNMTokenSecretManagerInNM.java Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.6.0 Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066376#comment-14066376 ] Jason Lowe commented on YARN-2314: -- While there is cache mismanagement going on as described above, a bigger issue is how this cache interacts with the ClientCache in the RPC layer and how Connection instances behave. Despite this cache's intent to try to limit the number of connected NMs, calling stopProxy does *not* mean the connection and corresponding IPC client thread is removed. Closing a proxy will only shut down threads if there are *no* other instances of that protocol proxy currently open. See ClientCache.stopClient for details. Given that the whole point of the ContainerManagementProtocolProxy cache is to preserve at least one reference to the Client, the IPC Client stop method will never be called in practice, and IPC client threads will never be explicitly torn down as a result of calling stopProxy.
As for Connection instances within the IPC Client, outside of erroneous operation they will only shut down if either they reach their idle timeout or are explicitly told to stop via Client.stop, and the latter will never be called in practice per above. That means the number of IPC client threads lingering around is solely dictated by how fast we're connecting to new nodes and how long the IPC idle timeout is. By default this timeout is 10 seconds, and an AM running a wide-spread large job on a large, idle cluster can easily allocate containers for and connect to all of the nodes in less than 10 seconds. That means we can still have thousands of IPC client threads despite ContainerManagementProtocolProxy's efforts to limit the number of connections.
In simplest terms this is a regression of MAPREDUCE-. That patch explicitly tuned the IPC timeout of ContainerManagement proxies to zero so they would be torn down as soon as we finished the first call. I've verified that setting the IPC timeout to zero prevents the explosion of IPC client threads. That's sort of a ham-fisted fix since it brings the whole point of the NM proxy cache into question. We would be keeping the proxy objects around, but the connection to the NM would need to be re-established each time we reused it. Not sure the cache would be worth much at that point.
If we want to explicitly manage the number of outstanding NM connections without forcing the connections to shut down on each IPC call, then I think we need help from the IPC layer itself. As I mentioned above, I don't think there's an exposed mechanism to close an individual connection of an IPC Client.
So to sum up, we can fix the cache management bugs described in the first comment, but that alone will not prevent thousands of IPC client threads from co-existing. We either need to set the IPC timeout to 0 (which brings the utility of the NM proxy cache into question) or change the IPC layer to allow us to close individual Client connections. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment.
-- This message was sent by Atlassian JIRA (v6.2#6252)
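A sketch of the "set the IPC timeout to zero" mitigation described in the comment above, expressed as a client-side configuration tweak. The variable names are assumptions, and this only illustrates the trade-off discussed (prompt teardown of idle IPC threads at the cost of the NM-proxy cache's connection reuse), not a committed fix.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;

// Copy the AM's configuration so other RPC users keep the default idle timeout.
Configuration cmProxyConf = new Configuration(conf);

// ipc.client.connection.maxidletime = 0: connections to NMs (and their reader
// threads) are dropped as soon as a call finishes, instead of lingering ~10s.
cmProxyConf.setInt(
    CommonConfigurationKeysPublic.IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, 0);
{code}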
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066462#comment-14066462 ] Tsuyoshi OZAWA commented on YARN-2319: -- IIUC, the test failure is caused by JerseyTest. JerseyTest's constructor -> getContainer() -> getBaseURI() always returns the result of {{UriBuilder.fromUri("http://localhost/").port(getPort(9998)).build()}}. If other test jobs are running at the same time, some of them fail to bind the port and the tests fail as a result.
{code}
public JerseyTest(AppDescriptor ad) throws TestContainerException {
    this.tc = getContainer(ad, getTestContainerFactory());
    this.client = getClient(tc, ad);
}

/**
 * Returns the base URI of the application.
 * @return The base URI of the application
 */
protected URI getBaseURI() {
    return UriBuilder.fromUri("http://localhost/")
            .port(getPort(9998)).build();
}
{code}
Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invokes the start method, never stop, in TestRMWebServicesDelegationTokens.java: {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066469#comment-14066469 ] Tsuyoshi OZAWA commented on YARN-2304: -- IIUC, the test failure is caused by JerseyTest. JerseyTest's constructor -> getContainer() -> getBaseURI() always returns the result of UriBuilder.fromUri("http://localhost/").port(getPort(9998)).build(). If other test jobs are running at the same time, some of them fail to bind the port and the tests fail as a result.
{code}
public JerseyTest(AppDescriptor ad) throws TestContainerException {
    this.tc = getContainer(ad, getTestContainerFactory());
    this.client = getClient(tc, ad);
}

/**
 * Returns the base URI of the application.
 * @return The base URI of the application
 */
protected URI getBaseURI() {
    return UriBuilder.fromUri("http://localhost/")
            .port(getPort(9998)).build();
}
{code}
TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
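A self-contained illustration of why the fixed port is fragile and of one common workaround (it is only an assumption here that the tests could be pointed at a free port instead of the hard-coded 9998): asking the OS for an ephemeral port never collides with another concurrently running test.
{code}
import java.io.IOException;
import java.net.ServerSocket;

public class FreePortProbe {
  // Binding to port 0 lets the OS pick an unused ephemeral port.
  static int findFreePort() throws IOException {
    try (ServerSocket s = new ServerSocket(0)) {
      return s.getLocalPort();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Would run JerseyTest against port " + findFreePort());
  }
}
{code}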
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066468#comment-14066468 ] Tsuyoshi OZAWA commented on YARN-2319: -- Oops, sorry, I intended to comment on YARN-2304. Feel free to delete it. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2008: -- Attachment: YARN-2008.1.patch Patch implementing the described behavior... CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Chen He Attachments: YARN-2008.1.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster, and Q1 and Q2 each currently use 50% of the actual cluster's resources, so there is no actual space available. If we use the current method to get headroom, the CapacityScheduler thinks there are still available resources for users in Q1, but they have been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example:
{noformat}
                        rootQueue
                       /         \
         L1ParentQueue1           L1ParentQueue2
  (allowed to use up to 80%       (allowed to use 20% in minimum
   of its parent)                  of its parent)
       /           \
 L2LeafQueue1    L2LeafQueue2
 (50% of its     (50% of its parent
  parent)         in minimum)
{noformat}
When we calculate the headroom of a user in L2LeafQueue2, the current method will think L2LeafQueue2 can use 40% (80%*50%) of the actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 has used 40% of the rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
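A worked version of the numbers in the description above (illustration only, not scheduler code): the leaf's effective max capacity has to be limited by what its ancestors can still actually obtain, not just by the configured percentages.
{code}
public class QueueMaxCapExample {
  public static void main(String[] args) {
    double siblingParentUsed = 0.40; // L1ParentQueue2 already uses 40% of rootQueue
    double parentMax         = 0.80; // L1ParentQueue1 may use up to 80% of rootQueue
    double leafShare         = 0.50; // L2LeafQueue2 gets 50% of L1ParentQueue1

    // Naive calculation ignores the sibling's usage: 80% * 50% = 40% of the cluster.
    double naiveMaxCap = parentMax * leafShare;

    // Capping the parent by what is actually left (100% - 40% = 60%) yields the
    // figure from the description: 60% * 50% = 30% of the cluster.
    double effectiveParent = Math.min(parentMax, 1.0 - siblingParentUsed);
    double correctMaxCap = effectiveParent * leafShare;

    System.out.printf("naive=%.0f%%, correct=%.0f%%%n", naiveMaxCap * 100, correctMaxCap * 100);
  }
}
{code}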
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066510#comment-14066510 ] Craig Welch commented on YARN-2008: --- [~airbots] Chen, I put together a patch; with it, I believe the scenario you describe plays out as it should. Can you have a look? Also, do you mind if I assign this one over to me to see it through? CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Chen He Attachments: YARN-2008.1.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster, and Q1 and Q2 each currently use 50% of the actual cluster's resources, so there is no actual space available. If we use the current method to get headroom, the CapacityScheduler thinks there are still available resources for users in Q1, but they have been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example:
{noformat}
                        rootQueue
                       /         \
         L1ParentQueue1           L1ParentQueue2
  (allowed to use up to 80%       (allowed to use 20% in minimum
   of its parent)                  of its parent)
       /           \
 L2LeafQueue1    L2LeafQueue2
 (50% of its     (50% of its parent
  parent)         in minimum)
{noformat}
When we calculate the headroom of a user in L2LeafQueue2, the current method will think L2LeafQueue2 can use 40% (80%*50%) of the actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 has used 40% of the rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066511#comment-14066511 ] Craig Welch commented on YARN-2008: --- [~wangda], can you have a look at this please? This is the headroom patch for the ancestor-sibling utilization issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066559#comment-14066559 ] Jian He commented on YARN-2208: --- patch looks good AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.9.patch, YARN-2208.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2322) Provide Cli to refresh Admin Acls for Timeline server
Karam Singh created YARN-2322: - Summary: Provide Cli to refresh Admin Acls for Timeline server Key: YARN-2322 URL: https://issues.apache.org/jira/browse/YARN-2322 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Reporter: Karam Singh Provide a CLI to refresh Admin Acls for the Timeline server. Currently rmadmin -refreshAdminAcls provides a facility to refresh Admin Acls for the ResourceManager, but if we want to modify adminAcls for the Timeline server, we currently need to restart it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066589#comment-14066589 ] Hadoop QA commented on YARN-2008: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656531/YARN-2008.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4361//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4361//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2244: Attachment: YARN-2244.005.patch Responded to feedback FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
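For readers unfamiliar with the behaviour referenced from MAPREDUCE-3596, the sketch below illustrates the kind of handling that is missing: when a node heartbeats a running container whose application attempt the scheduler does not know, the scheduler should queue that container for cleanup instead of merely logging it. This is a standalone toy model with hypothetical names, not the FairScheduler's actual code or the attached patch.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the missing behaviour: kill containers whose application attempt is unknown. */
class UnknownAttemptDemo {
    /** Application attempts the scheduler currently knows about. */
    private final Set<String> liveAttempts = new HashSet<>();
    /** Containers the node will be told to clean up in its next heartbeat response. */
    private final List<String> containersToClean = new ArrayList<>();

    void registerAttempt(String attemptId) {
        liveAttempts.add(attemptId);
    }

    /** Called for each container a node reports as running in its heartbeat. */
    void containerReportedRunning(String attemptId, String containerId) {
        if (!liveAttempts.contains(attemptId)) {
            // Previously the FairScheduler only logged this case and let the container keep
            // running; the fix is to schedule the container for cleanup so the NM kills it.
            System.out.println("Unknown attempt " + attemptId + ": cleaning up " + containerId);
            containersToClean.add(containerId);
            return;
        }
        // ... normal bookkeeping for containers of known attempts ...
    }

    List<String> pullContainersToClean() {
        List<String> toClean = new ArrayList<>(containersToClean);
        containersToClean.clear();
        return toClean;
    }

    public static void main(String[] args) {
        UnknownAttemptDemo scheduler = new UnknownAttemptDemo();
        scheduler.registerAttempt("appattempt_1");
        scheduler.containerReportedRunning("appattempt_1", "container_1_000001");  // known, kept
        scheduler.containerReportedRunning("appattempt_9", "container_9_000001");  // unknown, cleaned up
        System.out.println("containers to clean: " + scheduler.pullContainersToClean());
    }
}
{code}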
[jira] [Commented] (YARN-2297) Preemption can prevent progress in small queues
[ https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066615#comment-14066615 ] Sunil G commented on YARN-2297: --- Hi [~gp.leftnoteasy]
bq. 1 Use (guaranteed - used)
I feel this can create a little more starvation for queues configured with less capacity.
bq. 2 combined function like sigmoid(ratio(used, guaranteed)) * (guaranteed - used)
Yes, this makes more sense; it can combine the ratio as well as the difference in a uniform way. I feel more sampling can be done to come up with a better approach. I can check and update you.
Preemption can prevent progress in small queues --- Key: YARN-2297 URL: https://issues.apache.org/jira/browse/YARN-2297 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Priority: Critical Preemption can cause a hang in a single-node cluster: only the AMs run, and no task container can run.
h3. queue configuration
Queues A and B have 1% and 99% capacity respectively. No max capacity.
h3. scenario
Turn on preemption. Configure 1 NM with 4 GB of memory. Use only 2 apps and 1 user. Submit app 1 to queue A. Its AM needs 2 GB, and there is 1 task that needs 2 GB, so it occupies the entire cluster. Submit app 2 to queue B. Its AM needs 2 GB, and there are 3 tasks that need 2 GB each. Instead of app 1 being preempted entirely, the app 1 AM will stay and the app 2 AM will launch, but no task of either app can proceed.
h3. commands
/usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Dmapreduce.map.memory.mb=2000 -Dyarn.app.mapreduce.am.command-opts=-Xmx1800M -Dmapreduce.randomtextwriter.bytespermap=2147483648 -Dmapreduce.job.queuename=A -Dmapreduce.map.maxattempts=100 -Dmapreduce.am.max-attempts=1 -Dyarn.app.mapreduce.am.resource.mb=2000 -Dmapreduce.map.java.opts=-Xmx1800M -Dmapreduce.randomtextwriter.mapsperhost=1 -Dmapreduce.randomtextwriter.totalbytes=2147483648 dir1
/usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep -Dmapreduce.map.memory.mb=2000 -Dyarn.app.mapreduce.am.command-opts=-Xmx1800M -Dmapreduce.job.queuename=B -Dmapreduce.map.maxattempts=100 -Dmapreduce.am.max-attempts=1 -Dyarn.app.mapreduce.am.resource.mb=2000 -Dmapreduce.map.java.opts=-Xmx1800M -m 1 -r 0 -mt 4000 -rt 0 -- This message was sent by Atlassian JIRA (v6.2#6252)
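To make the proposal being discussed concrete, here is a standalone sketch of the combined function sigmoid(ratio(used, guaranteed)) * (guaranteed - used). The steepness, the centring of the sigmoid at ratio = 1.0, and the sample numbers are all assumptions made for illustration; the comment above explicitly leaves the exact shape open.
{code}
/**
 * Standalone sketch of the "combined function" idea discussed above:
 *   score(queue) = sigmoid(used / guaranteed) * (guaranteed - used)
 * Everything here (steepness, centre point, sample numbers) is made up for illustration.
 */
class CombinedScoreDemo {
    // Logistic curve centred at ratio = 1.0 (queue exactly at its guarantee).
    static double sigmoid(double ratio, double steepness) {
        return 1.0 / (1.0 + Math.exp(steepness * (ratio - 1.0)));
    }

    // Higher score = more deserving of the next allocated (or preempted) resource.
    static double combinedScore(double guaranteed, double used, double steepness) {
        double ratio = used / guaranteed;
        return sigmoid(ratio, steepness) * (guaranteed - used);
    }

    public static void main(String[] args) {
        double steepness = 5.0;
        // guaranteed and used expressed as a percentage of the cluster
        double[][] queues = { {1, 0}, {99, 50} };   // a tiny queue and a big queue
        for (double[] q : queues) {
            double diffOnly = q[0] - q[1];
            double combined = combinedScore(q[0], q[1], steepness);
            System.out.printf("guaranteed=%2.0f used=%2.0f  diff-only=%5.1f  combined=%5.2f%n",
                q[0], q[1], diffOnly, combined);
        }
    }
}
{code}
With this particular centring, the sigmoid term mainly suppresses queues already at or above their guarantee, while the (guaranteed - used) term still dominates otherwise; presumably that trade-off is what the proposed sampling of candidate shapes would explore.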
[jira] [Updated] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2008: -- Attachment: YARN-2008.2.patch Added a missing unit test -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066654#comment-14066654 ] Craig Welch commented on YARN-2008: --- The tests seem to pass on my box; I think these are still issues with the build server (tried org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched and org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066712#comment-14066712 ] Hadoop QA commented on YARN-2244: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656543/YARN-2244.005.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4362//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4362//console This message is automatically generated. FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066719#comment-14066719 ] Karthik Kambatla commented on YARN-2244: [~adhoot] - can you check if the test failures are related? FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066741#comment-14066741 ] Hadoop QA commented on YARN-2008: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656545/YARN-2008.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4363//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4363//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066927#comment-14066927 ] Xuan Gong commented on YARN-2208: - Committed to trunk and branch-2. Thanks Jian for review. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.9.patch, YARN-2208.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-810: - Attachment: YARN-810.patch Uploaded a patch for review. (1) Add a configuration field cpu_enforce_ceiling_enabled to the ApplicationSubmissionContext. Each application can set this field to true (default is false) if it wants cpu ceiling enforcement. (2) The RM will notify the NM of the list of containers with cpu_enforce_ceiling_enabled through the heartbeat. The heartbeat response message contains a list of containerIds that are launched on the current node with ceiling enforcement enabled. (3) The CgroupsLCEResource will set the cpu.cfs_period_us and cpu.cfs_quota_us for containers with ceiling enabled. (4) Update the distributed shell example to include the cpu_enforce_ceiling_enabled configuration, so we can test this feature using distributedshell. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN.
First, you can see that CFS is in use in the CGroup, based on the file names:
{noformat}
[criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
total 0
-r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
-r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
-rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
-rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
[criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
100000
[criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
-1
{noformat}
Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes under hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage.
{noformat}
CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ...
{noformat}
When I set the CFS quota:
{noformat}
echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ...
{noformat}
It drops to 1% usage, and you can see the box has room to spare:
{noformat}
Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st
{noformat}
Turning the quota back to -1:
{noformat}
echo -1
{noformat}
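As a rough illustration of how such a ceiling could be derived and applied per container, here is a hedged sketch; the cgroup path, the 100000 us period and the vcore-to-quota rule are assumptions for the sketch, not necessarily what the attached patch does.
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Illustration of a CFS-based hard CPU ceiling for one container cgroup.
 * Paths, the period value and the vcore-to-quota rule are assumptions for this sketch.
 */
class CfsCeilingDemo {
    private static final int PERIOD_US = 100000;   // 0.1s, the period observed in the listing above

    /** quota = period * (physical cores this container's vcore ask corresponds to). */
    static long quotaForContainer(int containerVcores, int nodeVcores, int nodePhysicalCores) {
        double coresAllowed = (double) containerVcores * nodePhysicalCores / nodeVcores;
        return Math.round(PERIOD_US * coresAllowed);
    }

    /** Writes the period and quota files of the container's cgroup (needs suitable permissions). */
    static void applyCeiling(Path containerCgroup, long quotaUs) throws IOException {
        Files.write(containerCgroup.resolve("cpu.cfs_period_us"),
            String.valueOf(PERIOD_US).getBytes(StandardCharsets.UTF_8));
        Files.write(containerCgroup.resolve("cpu.cfs_quota_us"),
            String.valueOf(quotaUs).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // A node exposing 8 vcores backed by 2 physical cores (1:4 ratio); the container asked for 1 vcore.
        long quota = quotaForContainer(1, 8, 2);
        System.out.println("cfs_quota_us = " + quota + " per " + PERIOD_US + "us period");  // 25000 = 1/4 core
        // applyCeiling(java.nio.file.Paths.get("/cgroup/cpu/hadoop-yarn/container_XXXX"), quota);
    }
}
{code}
For a container that asked for 1 vcore on a node with a 1:4 pcore:vcore ratio, this yields a quota of 25000 us per 100000 us period, i.e. a hard cap of a quarter of a physical core, which is exactly the guarantee described in the problem statement.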
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066943#comment-14066943 ] Hudson commented on YARN-2208: -- FAILURE: Integrated in Hadoop-trunk-Commit #5918 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5918/]) YARN-2208. AMRMTokenManager need to have a way to roll over AMRMToken. Contributed by Xuan Gong (xgong: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611820) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/AMRMTokenIdentifier.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/AMRMTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.9.patch, YARN-2208.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066938#comment-14066938 ] Anubhav Dhoot commented on YARN-2244: - Seems unrelated. Most failures were port binding issues (com.sun.jersey.test.framework.spi.container.TestContainerException: java.net.BindException: Address already in use). Will trigger a retest. FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2244: Attachment: YARN-2244.005.patch Retrigger test FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067006#comment-14067006 ] Robert Kanter commented on YARN-2131: - Given that Karthik created YARN-2268 and we can't use the multi operation, I think the addendum patch I uploaded already should be good, right? It simply renames the command from -format to -format-state-store. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067050#comment-14067050 ] Hadoop QA commented on YARN-2244: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656583/YARN-2244.005.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4365//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4365//console This message is automatically generated. FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1342: - Attachment: YARN-1342v4.patch Attaching a patch updated to trunk. Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067063#comment-14067063 ] Hadoop QA commented on YARN-810: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656584/YARN-810.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4364//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4364//console This message is automatically generated. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. 
This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here:
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067078#comment-14067078 ] Craig Welch commented on YARN-2008: --- And, the two which failed this time also pass on my box... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2315: Attachment: (was: YARN-2315.patch) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. --- Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2315.patch Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. In the getQueueInfo function of FSQueue.java, we call setCapacity twice with different parameters, so the first call is overridden by the second call. queueInfo.setCapacity((float) getFairShare().getMemory() / scheduler.getClusterResource().getMemory()); queueInfo.setCapacity((float) getResourceUsage().getMemory() / scheduler.getClusterResource().getMemory()); We should change the second setCapacity call to setCurrentCapacity to configure the current used capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
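For reference, the change the description asks for amounts to the following (a sketch of just the relevant lines of getQueueInfo in FSQueue.java, surrounding method omitted):
{code}
// Before: the second call silently overrides the first one, so QueueInfo.capacity
// ends up holding the current usage instead of the fair share.
queueInfo.setCapacity((float) getFairShare().getMemory() /
    scheduler.getClusterResource().getMemory());
queueInfo.setCapacity((float) getResourceUsage().getMemory() /
    scheduler.getClusterResource().getMemory());

// After: report the fair share as capacity and the usage as the current capacity.
queueInfo.setCapacity((float) getFairShare().getMemory() /
    scheduler.getClusterResource().getMemory());
queueInfo.setCurrentCapacity((float) getResourceUsage().getMemory() /
    scheduler.getClusterResource().getMemory());
{code}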
[jira] [Updated] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2315: Attachment: YARN-2315.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067202#comment-14067202 ] Hadoop QA commented on YARN-1342: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656604/YARN-1342v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4366//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4366//console This message is automatically generated. Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067210#comment-14067210 ] Hadoop QA commented on YARN-2045: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656602/YARN-2045-v7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4367//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4367//console This message is automatically generated. Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045-v6.patch, YARN-2045-v7.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
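Since the YARN-2045 description above is terse, here is a hedged sketch of the kind of check that versioning the persisted NM state enables on recovery. The class name, the version numbers and the compatibility rule are illustrative assumptions, not the actual patch.
{code}
/**
 * Sketch of a major/minor version check for persisted NM state, of the kind this JIRA adds.
 * The compatibility rule (same major = loadable) is an assumption for illustration.
 */
class NMStateVersionDemo {
    static final int CURRENT_MAJOR = 1;
    static final int CURRENT_MINOR = 0;

    /** Returns true if state written by version storedMajor.storedMinor can be loaded. */
    static boolean isCompatible(int storedMajor, int storedMinor) {
        // Minor bumps are treated as additive/compatible; a major bump means the layout changed.
        return storedMajor == CURRENT_MAJOR;
    }

    static void checkVersionOnRecovery(Integer storedMajor, Integer storedMinor) {
        if (storedMajor == null) {
            System.out.println("No version found: writing " + CURRENT_MAJOR + "." + CURRENT_MINOR);
            return;
        }
        if (isCompatible(storedMajor, storedMinor)) {
            System.out.println("Loading state version " + storedMajor + "." + storedMinor);
        } else {
            throw new IllegalStateException("Incompatible NM state version "
                + storedMajor + "." + storedMinor + ", expected major " + CURRENT_MAJOR);
        }
    }

    public static void main(String[] args) {
        checkVersionOnRecovery(null, null);   // fresh store
        checkVersionOnRecovery(1, 2);         // newer minor: still loadable
        // checkVersionOnRecovery(2, 0);      // would throw: major version changed
    }
}
{code}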
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067241#comment-14067241 ] Karthik Kambatla commented on YARN-2244: Latest patch looks good to me. +1. FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067242#comment-14067242 ] Karthik Kambatla commented on YARN-2244: Committing this. FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-810: - Attachment: YARN-810.patch Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names:
{noformat}
[criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
total 0
-r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
-r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
-rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
-rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
[criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
100000
[criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
-1
{noformat}
Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes under hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage.
{noformat}
CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ...
{noformat}
When I set the CFS quota:
{noformat}
echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ...
{noformat}
It drops to 1% usage, and you can see the box has room to spare:
{noformat}
Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st
{noformat}
Turning the quota back to -1:
{noformat}
echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
{noformat}
Burns the cores again:
{noformat}
Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st
CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ...
{noformat}
On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit. Implementation: What do you guys think about introducing a variable to YarnConfiguration: bq.
[jira] [Commented] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067279#comment-14067279 ] Hadoop QA commented on YARN-2315: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656609/YARN-2315.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4368//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4368//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067290#comment-14067290 ] Karthik Kambatla commented on YARN-2273: [~wei.yan] - you mentioned writing a unit test to reproduce the issue. Can we include that in the patch? NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
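The trace points at the node comparator dereferencing a node that was removed while the sort was in flight. One possible shape of a fix is sketched below; this is illustrative only, and the attached patches may instead copy the node list or guard continuousScheduling itself. Field and method names only approximate FairScheduler's.
{code}
// Sketch: tolerate nodes that disappear (DN flap) between collecting the
// node ids and sorting them. Names approximate FairScheduler; the real fix
// may differ.
private class NodeAvailableResourceComparator implements Comparator<NodeId> {
  @Override
  public int compare(NodeId n1, NodeId n2) {
    FSSchedulerNode node1 = nodes.get(n1);
    FSSchedulerNode node2 = nodes.get(n2);
    // A removed node has no entry in 'nodes'; treat it as having nothing
    // available so it sorts last instead of throwing an NPE.
    if (node1 == null && node2 == null) {
      return 0;
    } else if (node1 == null) {
      return 1;
    } else if (node2 == null) {
      return -1;
    }
    // The real comparator uses the scheduler's resource calculator; a plain
    // memory comparison is used here to keep the sketch self-contained.
    return Integer.compare(node2.getAvailableResource().getMemory(),
        node1.getAvailableResource().getMemory());
  }
}
{code}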
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2273: -- Attachment: YARN-2273-replayException.patch [~kasha], uploaded the testcase used before. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067328#comment-14067328 ] Hudson commented on YARN-2244: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5920 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5920/]) YARN-2244. FairScheduler missing handling of containers for unknown application attempts. (Anubhav Dhoot via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611840) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Fix For: 2.6.0 Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch, YARN-2244.005.patch, YARN-2244.005.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler-specific fixes added to handle containers for unknown application attempts. Without these, the FairScheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
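For context on what "handling" means here: instead of only logging the unknown container, the scheduler should tell the NodeManager to clean it up. A rough sketch of that behaviour is below; identifiers are illustrative and this is not the committed diff.
{code}
// Sketch of the behaviour added for containers whose application attempt is
// unknown to the scheduler; names are illustrative, not the actual patch.
for (ContainerStatus status : newlyLaunchedContainers) {
  ContainerId containerId = status.getContainerId();
  if (getCurrentAttemptForContainer(containerId) == null) {
    // Old behaviour: log "unknown container" and let it keep running.
    // New behaviour: ask the hosting NM to kill the orphaned container.
    LOG.info("Killing orphaned container " + containerId
        + " of unknown application attempt "
        + containerId.getApplicationAttemptId());
    rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeCleanContainerEvent(node.getNodeID(), containerId));
  }
}
{code}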
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067347#comment-14067347 ] Hadoop QA commented on YARN-2273: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656686/YARN-2273-replayException.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4370//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4370//console This message is automatically generated. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. --
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.5.patch RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067350#comment-14067350 ] Hadoop QA commented on YARN-810: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656675/YARN-810.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.util.TestFSDownload {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4369//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4369//console This message is automatically generated. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. 
Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067367#comment-14067367 ] Hadoop QA commented on YARN-2211: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656695/YARN-2211.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4371//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4371//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4371//console This message is automatically generated. RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
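At a high level, the patch has to persist the AMRMToken secret manager's keys so a restarted RM can keep validating tokens issued before the failover. A rough sketch of the state that needs to survive is below; the real change extends RMStateStore and its protobuf records, and the field names here are illustrative.
{code}
// Illustrative only: shows what must be written to the RMStateStore on each
// key roll-over and reloaded on recovery; not the actual record classes.
class AMRMTokenSecretManagerStateSketch {
  // Key currently used to sign and verify AMRMTokens.
  private byte[] currentMasterKey;
  // Key activated by the periodic roll-over (YARN-2208) but not yet current,
  // kept so tokens signed with either key stay valid across the switch.
  private byte[] nextMasterKey;
}
{code}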
[jira] [Commented] (YARN-2309) NPE during RM-Restart test scenario
[ https://issues.apache.org/jira/browse/YARN-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067379#comment-14067379 ] Devaraj K commented on YARN-2309: - Dup of YARN-1919. NPE during RM-Restart test scenario --- Key: YARN-2309 URL: https://issues.apache.org/jira/browse/YARN-2309 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Nishan Shetty Priority: Minor During RM restart test scenarios, we hit the exception below. A point to note here is that ZooKeeper was also not stable during this testing; we saw many ZooKeeper exceptions before getting this NPE {code} 2014-07-10 10:49:46,817 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:125) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1039) {code} ZooKeeper exception {code} 2014-07-10 10:49:46,816 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService failed in state INITED; cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1046) at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1017) at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:632) at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:766) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
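Even though this is being closed as a duplicate, the traces show the pattern clearly: serviceInit failed part-way (the ZooKeeper connection was never established), and serviceStop then dereferenced a field that had never been set. A defensive sketch of a null-safe stop is below; the field name follows EmbeddedElectorService, but the exact fix belongs on YARN-1919 and may differ.
{code}
// Sketch: guard serviceStop against fields left null when serviceInit fails
// early; 'elector' stands in for the reference dereferenced at
// EmbeddedElectorService.java:108.
@Override
protected synchronized void serviceStop() throws Exception {
  if (elector != null) {
    elector.quitElection(false);
    elector.terminateConnection();
  }
  super.serviceStop();
}
{code}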