[jira] [Commented] (YARN-2378) Adding support for moving apps between queues in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087311#comment-14087311 ] Varun Vasudev commented on YARN-2378: - [~subru], have you looked at YARN-2248? It also allows you to move apps between queues in CapacityScheduler. Adding support for moving apps between queues in Capacity Scheduler --- Key: YARN-2378 URL: https://issues.apache.org/jira/browse/YARN-2378 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Subramaniam Venkatraman Krishnan Labels: capacity-scheduler Attachments: YARN-2378.patch As discussed with [~leftnoteasy] and [~jianhe], we are breaking up YARN-1707 into smaller patches for manageability. This JIRA will address adding support for moving apps between queues in Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087357#comment-14087357 ] Junping Du commented on YARN-2288: -- The test failure seems to be related to the testbed configuration, not to the patch. Kicking off the Jenkins test again manually. Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
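[Editor's note] For illustration, a minimal sketch of the versioning idea described above, assuming a hypothetical key name, version string, and fail-fast policy for incompatible stores (the actual YARN-2288 patch may differ): stamp a fresh LevelDB store with the current schema version, and refuse to open a store written with an unknown version.
{code}
// Sketch only: the key name, version value, and failure policy below are
// assumptions for illustration, not the actual YARN-2288 patch.
import java.io.File;
import java.io.IOException;
import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

public class TimelineStoreVersionCheck {
  private static final byte[] VERSION_KEY = JniDBFactory.bytes("schema-version"); // hypothetical key
  private static final String CURRENT_VERSION = "1.0"; // hypothetical version

  public static DB openWithVersionCheck(File path) throws IOException {
    DB db = JniDBFactory.factory.open(path, new Options().createIfMissing(true));
    byte[] stored = db.get(VERSION_KEY);
    if (stored == null) {
      // Fresh store: stamp it so future releases can detect the schema.
      db.put(VERSION_KEY, JniDBFactory.bytes(CURRENT_VERSION));
    } else if (!CURRENT_VERSION.equals(JniDBFactory.asString(stored))) {
      // Unknown schema: fail fast instead of misreading the data.
      db.close();
      throw new IOException("Incompatible timeline store schema: "
          + JniDBFactory.asString(stored));
    }
    return db;
  }
}
{code}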
[jira] [Commented] (YARN-1336) Work-preserving nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087360#comment-14087360 ] Junping Du commented on YARN-1336: -- Got it. Will help to review YARN-1337. Thanks [~jlowe]! Work-preserving nodemanager restart --- Key: YARN-1336 URL: https://issues.apache.org/jira/browse/YARN-1336 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: NMRestartDesignOverview.pdf, YARN-1336-rollup-v2.patch, YARN-1336-rollup.patch This serves as an umbrella ticket for tasks related to work-preserving nodemanager restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087380#comment-14087380 ] Hadoop QA commented on YARN-2288: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660025/YARN-2288-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice: org.apache.hadoop.yarn.server.timeline.webapp.TestTimelineWebServices {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4530//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4530//console This message is automatically generated. Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288.patch We have LevelDB-backed TimelineStore, it should have schema version for changes in schema in future. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2374) YARN trunk build failing TestDistributedShell.testDSShell
[ https://issues.apache.org/jira/browse/YARN-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087411#comment-14087411 ] Varun Vasudev commented on YARN-2374: - [~jianhe] and [~gkesavan] spoke offline and fixed the hostname; Junping resubmitted the patch to Jenkins. Thank you to all three. YARN trunk build failing TestDistributedShell.testDSShell - Key: YARN-2374 URL: https://issues.apache.org/jira/browse/YARN-2374 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2374.0.patch, apache-yarn-2374.1.patch, apache-yarn-2374.2.patch, apache-yarn-2374.3.patch, apache-yarn-2374.4.patch The YARN trunk build has been failing for the last few days in the distributed shell module. {noformat} testDSShell(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 27.269 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:188) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: YARN-1954.5.patch Thank you for the comments, [~zjshen]. Updated the patch: 1. Changed to throw IllegalArgumentException when the arguments are invalid. 2. Added a new argument {{logInterval}} to the {{waitFor}} API. 3. Removed unnecessary changes. 4. Changed to check countDownChecker#counter == 3 after waitFor in TestAMRMClient#testWaitFor. 5. Removed an unnecessary synchronized block. Instead, added a synchronized block around {{callback}} so that the main thread reads the correct value, because {{callback.notify}} is updated in another thread. Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or a user-supplied check point, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
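[Editor's note] For illustration, a minimal sketch of the waitFor idea being discussed here; the exact signature, the {{logInterval}} semantics (log every N checks), and the polling interval are assumptions drawn from the comments above, not the committed Hadoop API.
{code}
import java.util.function.Supplier;

public final class WaitForSketch {
  // Blocks until the user-supplied check passes; logs every logInterval loops.
  // Signature and semantics are assumed from the discussion, not the real API.
  public static void waitFor(Supplier<Boolean> check, long checkEveryMillis,
      int logInterval) throws InterruptedException {
    // Change #1 above: invalid arguments throw IllegalArgumentException.
    if (check == null || checkEveryMillis < 0 || logInterval <= 0) {
      throw new IllegalArgumentException("invalid waitFor arguments");
    }
    int loops = 0;
    while (!check.get()) {              // user-supplied check point
      if (++loops % logInterval == 0) {
        System.out.println("waitFor: condition not met yet, still waiting");
      }
      Thread.sleep(checkEveryMillis);   // replaces the busy dummy loop in the AM
    }
  }
}
{code}
An AM main thread could then call something like waitFor(() -> done.get(), 100, 20), where done is an AtomicBoolean flipped by the callback handler, instead of spinning in a hand-written loop.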
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: (was: YARN-1954.5.patch) Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent AM process exiting before all the tasks are done, while unregistration is triggered on a separate another daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it should be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or user supplied check point, such that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: YARN-1954.5.patch Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent AM process exiting before all the tasks are done, while unregistration is triggered on a separate another daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it should be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or user supplied check point, such that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2382) Resource Manager throws InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087458#comment-14087458 ] Nishan Shetty commented on YARN-2382: - Hi [~ywskycn], this issue occurred when the RM was restarted while a job was in progress. Which configuration do you need? Can you please specify? Resource Manager throws InvalidStateTransitonException -- Key: YARN-2382 URL: https://issues.apache.org/jira/browse/YARN-2382 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty {code} 2014-08-05 03:44:47,882 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.18.40.26/10.18.40.26:11578, initiating session 2014-08-05 03:44:47,888 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.18.40.26/10.18.40.26:11578, sessionid = 0x347a051fda60035, negotiated timeout = 1 2014-08-05 03:44:47,889 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-05 03:44:47,890 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-05 03:44:47,890 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at 
current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at
[jira] [Updated] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2138: --- Attachment: YARN-2138.patch Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2130: - Attachment: YARN-2130.8.patch Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext Key: YARN-2130 URL: https://issues.apache.org/jira/browse/YARN-2130 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch, YARN-2130.5.patch, YARN-2130.6.patch, YARN-2130.7-2.patch, YARN-2130.7.patch, YARN-2130.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087482#comment-14087482 ] Tsuyoshi OZAWA commented on YARN-2130: -- Rebased on trunk. Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext Key: YARN-2130 URL: https://issues.apache.org/jira/browse/YARN-2130 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch, YARN-2130.5.patch, YARN-2130.6.patch, YARN-2130.7-2.patch, YARN-2130.7.patch, YARN-2130.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087501#comment-14087501 ] Hadoop QA commented on YARN-1954: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660096/YARN-1954.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.api.impl.TestAMRMClient org.apache.hadoop.yarn.client.api.async.impl.TestAMRMClientAsync org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4531//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4531//console This message is automatically generated. Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent AM process exiting before all the tasks are done, while unregistration is triggered on a separate another daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it should be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or user supplied check point, such that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087523#comment-14087523 ] Hadoop QA commented on YARN-1954: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660096/YARN-1954.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.api.async.impl.TestAMRMClientAsync org.apache.hadoop.yarn.client.api.impl.TestAMRMClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4532//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4532//console This message is automatically generated. Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent AM process exiting before all the tasks are done, while unregistration is triggered on a separate another daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it should be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or user supplied check point, such that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2298) Move TimelineClient to yarn-common project
[ https://issues.apache.org/jira/browse/YARN-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087537#comment-14087537 ] Hudson commented on YARN-2298: -- FAILURE: Integrated in Hadoop-Yarn-trunk #635 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/635/]) YARN-2298. Move TimelineClient to yarn-common project (Contributed by Zhijie Shen) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1616100) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java Move TimelineClient to yarn-common project -- Key: YARN-2298 URL: https://issues.apache.org/jira/browse/YARN-2298 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2298.1.patch To allow RM to reuse the timeline client code, we have to move it out of yarn-client module, due to maven dependency issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2382) Resource Manager throws InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087543#comment-14087543 ] Wei Yan commented on YARN-2382: --- thanks, [~nishan], that's enough information. Resource Manager throws InvalidStateTransitonException -- Key: YARN-2382 URL: https://issues.apache.org/jira/browse/YARN-2382 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty {code} 2014-08-05 03:44:47,882 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.18.40.26/10.18.40.26:11578, initiating session 2014-08-05 03:44:47,888 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.18.40.26/10.18.40.26:11578, sessionid = 0x347a051fda60035, negotiated timeout = 1 2014-08-05 03:44:47,889 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-05 03:44:47,890 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-05 03:44:47,890 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid 
event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at
[jira] [Commented] (YARN-2381) aa
[ https://issues.apache.org/jira/browse/YARN-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087549#comment-14087549 ] Wei Yan commented on YARN-2381: --- Never mind, [~Jackliu91]. aa -- Key: YARN-2381 URL: https://issues.apache.org/jira/browse/YARN-2381 Project: Hadoop YARN Issue Type: Test Reporter: JiankunLiu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087557#comment-14087557 ] Hadoop QA commented on YARN-2138: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660104/YARN-2138.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4533//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4533//console This message is automatically generated. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087583#comment-14087583 ] Hadoop QA commented on YARN-2130: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660110/YARN-2130.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA org.apache.hadoop.yarn.client.TestRMFailover org.apache.hadoop.yarn.client.api.impl.TestAMRMClient org.apache.hadoop.yarn.client.api.impl.TestNMClient org.apache.hadoop.yarn.client.TestGetGroups org.apache.hadoop.yarn.client.TestResourceManagerAdministrationProtocolPBClientImpl org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.client.api.impl.TestYarnClient org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService org.apache.hadoop.yarn.server.resourcemanager.TestRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4534//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4534//console This message is automatically generated. Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext Key: YARN-2130 URL: https://issues.apache.org/jira/browse/YARN-2130 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch, YARN-2130.5.patch, YARN-2130.6.patch, YARN-2130.7-2.patch, YARN-2130.7.patch, YARN-2130.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087593#comment-14087593 ] Varun Saxena commented on YARN-2138: [~jianhe], kindly review the changes in the patch. I have made the following changes: 1. Deleted the classes RMAppUpdatedSavedEvent, RMAppNewSavedEvent, RMAppAttemptNewSavedEvent and RMAppAttemptUpdateSavedEvent, as they offered no functionality beyond the base class once the stored and updated exceptions were removed. 2. Refactored code in RMStateStore and removed the notifyDone* methods. 3. Removed the corresponding exception-handling code in RMAppImpl and RMAppAttemptImpl. 4. Made the necessary changes in test cases. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
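[Editor's note] To illustrate point 1 above, a sketch with hypothetical simplified names (the real events live in the RM's rmapp and attempt packages): once the exception argument is always null, a *SavedEvent subclass carries nothing the base event doesn't, so it can be deleted.
{code}
// Before: the subclass exists only to carry a storedException that callers
// always pass as null. Names here are illustrative stand-ins.
class AppEvent {
  private final String appId; // simplified stand-in for ApplicationId
  AppEvent(String appId) { this.appId = appId; }
}

class AppNewSavedEvent extends AppEvent {
  private final Exception storedException; // always null in practice
  AppNewSavedEvent(String appId, Exception storedException) {
    super(appId);
    this.storedException = storedException;
  }
}

// After: the subclass is deleted; the state store raises new AppEvent(appId)
// directly, and the handlers drop their dead exception-checking branches.
{code}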
[jira] [Updated] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1954: - Attachment: YARN-1954.6.patch Fixed the patch to pass the tests: * Updated the test case. * The following change was unnecessary, so removed it: {quote} 5. Removed an unnecessary synchronized block. Instead, added a synchronized block around callback so that the main thread reads the correct value, because callback.notify is updated in another thread. {quote} Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or a user-supplied check point, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2298) Move TimelineClient to yarn-common project
[ https://issues.apache.org/jira/browse/YARN-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087629#comment-14087629 ] Hudson commented on YARN-2298: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1829 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1829/]) YARN-2298. Move TimelineClient to yarn-common project (Contributed by Zhijie Shen) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1616100) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java Move TimelineClient to yarn-common project -- Key: YARN-2298 URL: https://issues.apache.org/jira/browse/YARN-2298 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2298.1.patch To allow RM to reuse the timeline client code, we have to move it out of yarn-client module, due to maven dependency issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2298) Move TimelineClient to yarn-common project
[ https://issues.apache.org/jira/browse/YARN-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087653#comment-14087653 ] Hudson commented on YARN-2298: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1855 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1855/]) YARN-2298. Move TimelineClient to yarn-common project (Contributed by Zhijie Shen) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1616100) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java Move TimelineClient to yarn-common project -- Key: YARN-2298 URL: https://issues.apache.org/jira/browse/YARN-2298 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2298.1.patch To allow RM to reuse the timeline client code, we have to move it out of yarn-client module, due to maven dependency issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1337) Recover containers upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087665#comment-14087665 ] Jason Lowe commented on YARN-1337: -- I'm unable to reproduce these test failures locally. Checking a few of the test failures shows they are likely all failing because the machine can't look up its own name, e.g. java.net.UnknownHostException: asf901.ygridcore.net: asf901.ygridcore.net. I'll work with ops to get the machine fixed and re-kick Jenkins. Recover containers upon nodemanager restart --- Key: YARN-1337 URL: https://issues.apache.org/jira/browse/YARN-1337 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1337-v1.patch To support work-preserving NM restart, we need to recover the state of the containers from when the nodemanager went down. This includes informing the RM of containers that exited in the interim, a strategy for dealing with the exit codes from those containers, and a way to reacquire the active containers and determine their exit codes when they terminate. The state of finished containers also needs to be recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087690#comment-14087690 ] Hadoop QA commented on YARN-1954: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660129/YARN-1954.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4535//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4535//console This message is automatically generated. Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent AM process exiting before all the tasks are done, while unregistration is triggered on a separate another daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it should be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or user supplied check point, such that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2378) Adding support for moving apps between queues in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087723#comment-14087723 ] Wangda Tan commented on YARN-2378: -- Hi [~subru], Thanks for uploading the patch; I took a look at it. As mentioned by [~vvasudev], there's another JIRA (YARN-2248) related to moving. I think the two JIRAs have different advantages, and I hope you can decide how to merge your work. - YARN-2378 covers the RMApp-related changes, which should be done while moving - YARN-2248 covers more tests for queue metrics. I think another major difference is that YARN-2248 will check queue capacity before moving and YARN-2378 will not. I had an offline discussion with [~curino] about this; here I paste what he said: {code} Imagine I have a busy cluster and want to migrate apps from queue A to queue B. Since we do not provide any transactional semantics from the CLI, it would be quite hard to make sure I can move an app (even if I kill everything in queue B and then invoke move A-B, more apps might show up and crowd the target queue B before I can successfully move). Having move be more sturdy and succeed right away, and enhancing preemption (if needed) to repair invariants, seems a better option in this scenario. I think preemption would already enforce max capacity, and other active JIRAs should deal with user-limit as well. More generally, I think preemption can eventually be our universal rebalancer/enforcer, allowing us to play a bit more fast and loose with move/resizing of queues. {code} I agree with this; another example is queue refresh: when queue capacities are refreshed, some queues may be shrunk below their guaranteed/used resources. We will not stop such a queue refresh, and preemption will take care of it as well. Some other comments about YARN-2378: 1) I think we should implement the state-store write in the move transition: {code} // TODO: Write out change to state store (YARN-1558) // Also take care of RM failover moveEvent.getResult().set(null); {code} 2) There are lots of test failures; I'm afraid the patch broke some major logic, could you please check? I will include a test review in the next iteration. Thanks, Wangda Adding support for moving apps between queues in Capacity Scheduler --- Key: YARN-2378 URL: https://issues.apache.org/jira/browse/YARN-2378 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Subramaniam Venkatraman Krishnan Labels: capacity-scheduler Attachments: YARN-2378.patch As discussed with [~leftnoteasy] and [~jianhe], we are breaking up YARN-1707 into smaller patches for manageability. This JIRA will address adding support for moving apps between queues in Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087732#comment-14087732 ] Tsuyoshi OZAWA commented on YARN-1954: -- It's ready for review. [~zjshen], could you review the latest patch? Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or a user-supplied check point, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1177: --- Attachment: yarn-1177-ancient-version.patch Here is an ancient version of the patch that does *not* apply on the latest trunk. Posting it in case anyone is particularly interested in taking this further before I get to it. Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087760#comment-14087760 ] Karthik Kambatla commented on YARN-2359: +1. Will commit this later today if no one objects. Application is hung without timeout and retry after DNS/network is down. - Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch The application hangs without timeout and retry after the DNS/network goes down. This happens because, right after the container is allocated for the AM, the DNS/network goes down on the node that has the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, but because an IllegalArgumentException (due to the DNS error) is thrown, it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code doesn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application still hangs in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry to the state machine table to handle the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED by adding the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2130: - Attachment: YARN-2130.8.patch Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext Key: YARN-2130 URL: https://issues.apache.org/jira/browse/YARN-2130 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch, YARN-2130.5.patch, YARN-2130.6.patch, YARN-2130.7-2.patch, YARN-2130.7.patch, YARN-2130.8.patch, YARN-2130.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087762#comment-14087762 ] Tsuyoshi OZAWA commented on YARN-2130: -- {quote} java.net.UnknownHostException: asf901.ygridcore.net: asf901.ygridcore.net at java.net.InetAddress.getLocalHost(InetAddress.java:1402) {quote} The test failure looks strange - all of the failures are caused by UnknownHostException. Let me kick CI with the same patch again. Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext Key: YARN-2130 URL: https://issues.apache.org/jira/browse/YARN-2130 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch, YARN-2130.5.patch, YARN-2130.6.patch, YARN-2130.7-2.patch, YARN-2130.7.patch, YARN-2130.8.patch, YARN-2130.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087765#comment-14087765 ] Karthik Kambatla commented on YARN-2352: The test failures seem unrelated; they are caused by java.net.UnknownHostException: asf901.ygridcore.net: asf901.ygridcore.net. YARN-1337 had a similar issue, and it appears to be due to the build machine. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2352: --- Attachment: yarn-2352-2.patch Uploading the same patch again to see if Jenkins would run this on a different machine. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087827#comment-14087827 ] Tsuyoshi OZAWA commented on YARN-2352: -- I found the same test failure caused by java.net.UnknownHostException: asf901.ygridcore.net: asf901.ygridcore.net on YARN-2130. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087843#comment-14087843 ] Hadoop QA commented on YARN-2130: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660150/YARN-2130.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4536//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4536//console This message is automatically generated. Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext Key: YARN-2130 URL: https://issues.apache.org/jira/browse/YARN-2130 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch, YARN-2130.5.patch, YARN-2130.6.patch, YARN-2130.7-2.patch, YARN-2130.7.patch, YARN-2130.8.patch, YARN-2130.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087848#comment-14087848 ] Tsuyoshi OZAWA commented on YARN-2359: -- +1 (non-binding), it looks good to me. Also ran the tests and confirmed that it works. Application is hung without timeout and retry after DNS/network is down. - Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch Application is hung without timeout and retry after DNS/network is down. This happens when, right after the container is allocated for the AM, the DNS/network goes down for the node that hosts the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, an IllegalArgumentException (due to the DNS error) occurs and the attempt stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application is still hung in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED by adding the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087851#comment-14087851 ] Tsuyoshi OZAWA commented on YARN-2130: -- [~kkambatl], could you check the latest patch? I think it addresses all the points you mentioned. Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext Key: YARN-2130 URL: https://issues.apache.org/jira/browse/YARN-2130 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch, YARN-2130.5.patch, YARN-2130.6.patch, YARN-2130.7-2.patch, YARN-2130.7.patch, YARN-2130.8.patch, YARN-2130.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087860#comment-14087860 ] Hadoop QA commented on YARN-2352: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660151/yarn-2352-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4537//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4537//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4537//console This message is automatically generated. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1177: --- Assignee: Wei Yan (was: Karthik Kambatla) Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Wei Yan Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087873#comment-14087873 ] Karthik Kambatla commented on YARN-1177: [~ywskycn] - all yours. Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Wei Yan Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087869#comment-14087869 ] Wei Yan commented on YARN-1177: --- Hey [~kasha], I'm interested in taking it. Could you assign it to me? Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087887#comment-14087887 ] Sandy Ryza commented on YARN-2352: -- IIUC, this patch will only record the duration. If we go that route, I think we should call these metrics lastNodeUpdateDuration, etc. However, would it make sense to go with an approach that records more historical information? For example, RPCMetrics uses a MutableRate to keep stats on the processing time for RPCs, and I think a similar model could work here. Last, is there any need to make the FSPerfMetrics instance static? Right now, I think the Fair Scheduler has managed to avoid any mutable static variables. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
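A rough sketch of the MutableRate approach Sandy references, assuming the org.apache.hadoop.metrics2.lib API as used by RpcMetrics; the FSOpDurations class and metric names here are hypothetical, not the patch's.
{code}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;

// Hypothetical holder class; the patch's actual class and metric names
// may differ.
class FSOpDurations {
  private final MetricsRegistry registry = new MetricsRegistry("FSOpDurations");
  private final MutableRate nodeUpdateRate = registry.newRate("NodeUpdateCall");

  void addNodeUpdateDuration(long durationMs) {
    // Each sample feeds running count/average statistics, unlike a
    // plain gauge that only remembers the last observed duration.
    nodeUpdateRate.add(durationMs);
  }
}
{code}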
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088002#comment-14088002 ] Jian He commented on YARN-2359: --- [~zxu], thanks for working on it. I have a question: bq. The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, an IllegalArgumentException (due to the DNS error) occurs and the attempt stays in state RMAppAttemptState.SCHEDULED. Where in the code is the IllegalArgumentException thrown? Application is hung without timeout and retry after DNS/network is down. - Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch Application is hung without timeout and retry after DNS/network is down. This happens when, right after the container is allocated for the AM, the DNS/network goes down for the node that hosts the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, an IllegalArgumentException (due to the DNS error) occurs and the attempt stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application is still hung in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED by adding the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2374) YARN trunk build failing TestDistributedShell.testDSShell
[ https://issues.apache.org/jira/browse/YARN-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088017#comment-14088017 ] Jian He commented on YARN-2374: --- checking this in. YARN trunk build failing TestDistributedShell.testDSShell - Key: YARN-2374 URL: https://issues.apache.org/jira/browse/YARN-2374 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2374.0.patch, apache-yarn-2374.1.patch, apache-yarn-2374.2.patch, apache-yarn-2374.3.patch, apache-yarn-2374.4.patch The YARN trunk build has been failing for the last few days in the distributed shell module. {noformat} testDSShell(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 27.269 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:188) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088047#comment-14088047 ] zhihai xu commented on YARN-2359: - [~jianhe] The code is in pullNewlyAllocatedContainersAndNMTokens of SchedulerApplicationAttempt.java {code} try { // create container token and NMToken altogether. container.setContainerToken(rmContext.getContainerTokenSecretManager() .createContainerToken(container.getId(), container.getNodeId(), getUser(), container.getResource(), container.getPriority(), rmContainer.getCreationTime())); NMToken nmToken = rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(), getApplicationAttemptId(), container); if (nmToken != null) { nmTokens.add(nmToken); } } catch (IllegalArgumentException e) { // DNS might be down, skip returning this container. LOG.error("Error trying to assign container token and NM token to" + " an allocated container " + container.getId(), e); continue; } {code} When the IllegalArgumentException happens in createContainerToken, the code skips the container, so zero containers are returned in amContainerAllocation. The following code in AMContainerAllocatedTransition in RMAppAttemptImpl.java will then keep retrying CONTAINER_ALLOCATED in the SCHEDULED state. So the IllegalArgumentException causes zero containers to be returned in amContainerAllocation, which causes RMAppAttemptImpl to stay in state RMAppAttemptState.SCHEDULED. {code} if (amContainerAllocation.getContainers().size() == 0) { appAttempt.retryFetchingAMContainer(appAttempt); return RMAppAttemptState.SCHEDULED; } {code} Application is hung without timeout and retry after DNS/network is down. - Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch Application is hung without timeout and retry after DNS/network is down. This happens when, right after the container is allocated for the AM, the DNS/network goes down for the node that hosts the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, an IllegalArgumentException (due to the DNS error) occurs and the attempt stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application is still hung in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED by adding the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2374) YARN trunk build failing TestDistributedShell.testDSShell
[ https://issues.apache.org/jira/browse/YARN-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088060#comment-14088060 ] Hudson commented on YARN-2374: -- FAILURE: Integrated in Hadoop-trunk-Commit #6023 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6023/]) YARN-2374. Fixed TestDistributedShell#testDSShell failure due to hostname dismatch. Contributed by Varun Vasudev (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1616302) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java YARN trunk build failing TestDistributedShell.testDSShell - Key: YARN-2374 URL: https://issues.apache.org/jira/browse/YARN-2374 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2374.0.patch, apache-yarn-2374.1.patch, apache-yarn-2374.2.patch, apache-yarn-2374.3.patch, apache-yarn-2374.4.patch The YARN trunk build has been failing for the last few days in the distributed shell module. {noformat} testDSShell(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 27.269 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:188) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
Mit Desai created YARN-2387: --- Summary: Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.2#6252)
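A self-contained illustration (not the YARN code) of the race the stack trace suggests: a record that lazily merges local state into a shared builder can NPE when one thread nulls the builder while another is mid-merge, and marking the merge/get methods synchronized, as the report proposes for ContainerStatusPBImpl, removes the race.
{code}
// Standalone Java sketch, not the actual PBImpl code: PBImpl records
// lazily merge local state into a builder, then invalidate the builder
// after building the proto. Without synchronization, thread A can null
// the builder while thread B is inside mergeLocalToBuilder().
class LazyRecord {
  private StringBuilder builder = new StringBuilder();
  private String proto;

  // synchronized keeps merge + build + reset atomic per record
  synchronized void mergeLocalToBuilder(String field) {
    builder.append(field);
  }

  synchronized String getProto() {
    proto = builder.toString();
    builder = null;          // builder invalidated after building...
    return proto;
  }

  synchronized void ensureBuilder() {
    if (builder == null) {   // ...and recreated on the next write
      builder = new StringBuilder(proto);
    }
  }
}
{code}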
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088085#comment-14088085 ] Zhijie Shen commented on YARN-2288: --- TestTimelineWebServices fails on trunk; it seems to have been broken by HADOOP-10791. I'll file a separate ticket. Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2388) TestTimelineWebServices fails on trunk after HADOOP-10791
Zhijie Shen created YARN-2388: - Summary: TestTimelineWebServices fails on trunk after HADOOP-10791 Key: YARN-2388 URL: https://issues.apache.org/jira/browse/YARN-2388 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Zhijie Shen https://builds.apache.org/job/PreCommit-YARN-Build/4530//testReport/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2388) TestTimelineWebServices fails on trunk after HADOOP-10791
[ https://issues.apache.org/jira/browse/YARN-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2388: -- Attachment: YARN-2388.1.patch Make a quick fix for the test failure. TestTimelineWebServices fails on trunk after HADOOP-10791 - Key: YARN-2388 URL: https://issues.apache.org/jira/browse/YARN-2388 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2388.1.patch https://builds.apache.org/job/PreCommit-YARN-Build/4530//testReport/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088107#comment-14088107 ] Zhijie Shen commented on YARN-2288: --- bq. If objects in store will get lost after TS restart, we don't need it. What do you think? I neglected the fact that it is persisted. I agree on it. bq. Do we have plan to persistent MemoryTimelineStore? At least we're going to have a HbaseTimelineStore. CURRENT_VERSION_INFO can be case-by-case for each impl, but TS_STORE_VERSION_KEY is going to be a common constant across different impls. In addition, TS_STORE_VERSION_KEY -> TIMELINE_STORE_VERSION_KEY? Some other nits: 1. "T" -> "t"? {code} + "Incompatible version for Timeline store: expecting version " {code} 2. Unnecessary change? {code} - @SuppressWarnings("resource") {code} Other than that, I think the patch is good to go. Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
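For readers following the thread, a small sketch of the kind of version check being discussed, assuming the iq80 LevelDB handle behind LeveldbTimelineStore; the key name, version encoding, and method shape are illustrative guesses, not the patch itself.
{code}
import static org.fusesource.leveldbjni.JniDBFactory.asString;
import static org.fusesource.leveldbjni.JniDBFactory.bytes;

import org.iq80.leveldb.DB;

class TimelineStoreVersionCheck {
  // Hypothetical names; the thread above is still debating whether the
  // key constant should be TS_STORE_VERSION_KEY or TIMELINE_STORE_VERSION_KEY.
  private static final byte[] TIMELINE_STORE_VERSION_KEY =
      bytes("timeline-store-version");
  private static final String CURRENT_VERSION_INFO = "1.0";

  static void checkVersion(DB db) {
    byte[] stored = db.get(TIMELINE_STORE_VERSION_KEY);
    if (stored == null) {
      // Fresh store: stamp the current schema version.
      db.put(TIMELINE_STORE_VERSION_KEY, bytes(CURRENT_VERSION_INFO));
    } else if (!CURRENT_VERSION_INFO.equals(asString(stored))) {
      // Refuse to load data written under an incompatible schema.
      throw new IllegalStateException(
          "Incompatible version for timeline store: expecting version "
          + CURRENT_VERSION_INFO + " but loading version " + asString(stored));
    }
  }
}
{code}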
[jira] [Created] (YARN-2389) Adding support for draining a queue, i.e. killing all apps in the queue
Subramaniam Venkatraman Krishnan created YARN-2389: -- Summary: Adding support for draining a queue, i.e. killing all apps in the queue Key: YARN-2389 URL: https://issues.apache.org/jira/browse/YARN-2389 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to the target. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-2385: --- Component/s: capacityscheduler Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to target. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-2385: --- Summary: Adding support for listing all applications in a queue (was: Adding support for move all applications from a source queue to destination queue) Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to target. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-2385: --- Labels: abstractyarnscheduler (was: fairscheduler) Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to target. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2389) Adding support for draining a queue, i.e. killing all apps in the queue
[ https://issues.apache.org/jira/browse/YARN-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-2389: --- Labels: capacity-scheduler fairscheduler (was: fairscheduler) Adding support for draining a queue, i.e. killing all apps in the queue Key: YARN-2389 URL: https://issues.apache.org/jira/browse/YARN-2389 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: capacity-scheduler, fairscheduler This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to the target. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-2385: --- Description: This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from source queue to target queue and draining a queue, i.e. killing all applications in a queue as proposed by YARN-2389 (was: This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to target.) Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from source queue to target queue and draining a queue, i.e. killing all applications in a queue as proposed by YARN-2389 -- This message was sent by Atlassian JIRA (v6.2#6252)
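A guess at the shape of the AbstractYarnScheduler addition this description proposes; the method name, return type, and null-on-missing-queue contract are assumptions drawn from the description, not the committed interface. A move-all or drain operation (YARN-2389) could then iterate this list and call the existing single-application move or kill paths.
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

// Hypothetical sketch of the proposed scheduler method.
abstract class QueueAppListing {
  // Return the attempt ids of all pending and active applications in
  // the named queue, or null if the queue does not exist (guessed
  // contract).
  abstract List<ApplicationAttemptId> getAppsInQueue(String queueName);
}
{code}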
[jira] [Updated] (YARN-2389) Adding support for draining a queue, i.e. killing all apps in the queue
[ https://issues.apache.org/jira/browse/YARN-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-2389: --- Description: This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to the target. This will use YARN-2385, so it will work for both the Capacity and Fair schedulers. (was: This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to target.) Adding support for draining a queue, i.e. killing all apps in the queue Key: YARN-2389 URL: https://issues.apache.org/jira/browse/YARN-2389 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: capacity-scheduler, fairscheduler This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to the target. This will use YARN-2385, so it will work for both the Capacity and Fair schedulers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2389) Adding support for draining a queue, i.e. killing all apps in the queue
[ https://issues.apache.org/jira/browse/YARN-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-2389: --- Component/s: capacityscheduler Adding support for draining a queue, i.e. killing all apps in the queue Key: YARN-2389 URL: https://issues.apache.org/jira/browse/YARN-2389 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: capacity-scheduler, fairscheduler This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to the target. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2389) Adding support for draining a queue, i.e. killing all apps in the queue
[ https://issues.apache.org/jira/browse/YARN-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan reassigned YARN-2389: -- Assignee: Subramaniam Venkatraman Krishnan (was: Karthik Kambatla) Adding support for draining a queue, i.e. killing all apps in the queue Key: YARN-2389 URL: https://issues.apache.org/jira/browse/YARN-2389 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Subramaniam Venkatraman Krishnan Labels: capacity-scheduler, fairscheduler This is a parallel JIRA to YARN-2378. Fair scheduler already supports moving a single application from one queue to another. This will add support to move all applications from the specified source queue to the target. This will use YARN-2385, so it will work for both the Capacity and Fair schedulers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2374) YARN trunk build failing TestDistributedShell.testDSShell
[ https://issues.apache.org/jira/browse/YARN-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088121#comment-14088121 ] Hudson commented on YARN-2374: -- FAILURE: Integrated in Hadoop-Yarn-trunk #636 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/636/]) YARN-2374. Fixed TestDistributedShell#testDSShell failure due to hostname dismatch. Contributed by Varun Vasudev (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1616302) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java YARN trunk build failing TestDistributedShell.testDSShell - Key: YARN-2374 URL: https://issues.apache.org/jira/browse/YARN-2374 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2374.0.patch, apache-yarn-2374.1.patch, apache-yarn-2374.2.patch, apache-yarn-2374.3.patch, apache-yarn-2374.4.patch The YARN trunk build has been failing for the last few days in the distributed shell module. {noformat} testDSShell(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 27.269 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:188) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088137#comment-14088137 ] Subramaniam Venkatraman Krishnan commented on YARN-2248: Hi [~keyki], we have been working for some time on adding support for move in Capacity Scheduler as part of YARN-2378 (originally YARN-1707), and [~vvasudev] was kind enough to point out that you were doing the same. To prevent duplication, I suggest we merge our work. I looked at your patch, and we are doing essentially the same thing (which was good validation for both of us :)). Based on [~leftnoteasy]'s [feedback | https://issues.apache.org/jira/browse/YARN-2378?focusedCommentId=14087723], I think it would be easiest if I merged your metrics test with the patch I have. Would that be OK? Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Fix For: 2.6.0 Attachments: YARN-2248-1.patch, YARN-2248-2.patch, YARN-2248-3.patch We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version, as there are differences between the 2.4.* and 3.* interfaces. The story behind it is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2378) Adding support for moving apps between queues in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088146#comment-14088146 ] Subramaniam Venkatraman Krishnan commented on YARN-2378: Thanks [~vvasudev] for pointing out the parallel work and [~leftnoteasy] for your feedback. I agree we should merge both, and I have a [proposal | https://issues.apache.org/jira/browse/YARN-2248?focusedCommentId=14088137] based on your review. About your comments on the patch I uploaded: * Thanks for clarifying that we don't need to check capacity before a move. * I will look at implementing the state store in the move transition. * Will look at the test failures and fix them, my bad. Adding support for moving apps between queues in Capacity Scheduler --- Key: YARN-2378 URL: https://issues.apache.org/jira/browse/YARN-2378 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Subramaniam Venkatraman Krishnan Labels: capacity-scheduler Attachments: YARN-2378.patch As discussed with [~leftnoteasy] and [~jianhe], we are breaking up YARN-1707 into smaller patches for manageability. This JIRA will address adding support for moving apps between queues in Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088161#comment-14088161 ] Jian He commented on YARN-2359: --- I see, thanks for your explanation. Looks good to me too. Application is hung without timeout and retry after DNS/network is down. - Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch Application is hung without timeout and retry after DNS/network is down. This happens when, right after the container is allocated for the AM, the DNS/network goes down for the node that hosts the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, an IllegalArgumentException (due to the DNS error) occurs and the attempt stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application is still hung in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED by adding the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088185#comment-14088185 ] Krisztian Horvath commented on YARN-2248: - Hi, as long as we don't break the functionality, we can merge them and try to take the best out of both, so yes. Have you tried your patch with the queue metrics test yet? Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Fix For: 2.6.0 Attachments: YARN-2248-1.patch, YARN-2248-2.patch, YARN-2248-3.patch We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version, as there are differences between the 2.4.* and 3.* interfaces. The story behind it is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1337) Recover containers upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088204#comment-14088204 ] Hadoop QA commented on YARN-1337: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12659958/YARN-1337-v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4538//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4538//console This message is automatically generated. Recover containers upon nodemanager restart --- Key: YARN-1337 URL: https://issues.apache.org/jira/browse/YARN-1337 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1337-v1.patch To support work-preserving NM restart we need to recover the state of the containers when the nodemanager went down. This includes informing the RM of containers that have exited in the interim and a strategy for dealing with the exit codes from those containers along with how to reacquire the active containers and determine their exit codes when they terminate. The state of finished containers also needs to be recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2388) TestTimelineWebServices fails on trunk after HADOOP-10791
[ https://issues.apache.org/jira/browse/YARN-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088238#comment-14088238 ] Hadoop QA commented on YARN-2388: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660201/YARN-2388.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4539//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4539//console This message is automatically generated. TestTimelineWebServices fails on trunk after HADOOP-10791 - Key: YARN-2388 URL: https://issues.apache.org/jira/browse/YARN-2388 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2388.1.patch https://builds.apache.org/job/PreCommit-YARN-Build/4530//testReport/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2388) TestTimelineWebServices fails on trunk after HADOOP-10791
[ https://issues.apache.org/jira/browse/YARN-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088243#comment-14088243 ] Xuan Gong commented on YARN-2388: - +1 LGTM TestTimelineWebServices fails on trunk after HADOOP-10791 - Key: YARN-2388 URL: https://issues.apache.org/jira/browse/YARN-2388 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2388.1.patch https://builds.apache.org/job/PreCommit-YARN-Build/4530//testReport/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2008: -- Attachment: YARN-2008.8.patch Make ResourceCalculator.isInvalidDivisor abstract and move the (corrected) implementations into Default and Dominant, checking for 0 memory, and for 0 memory or 0 vcores, respectively. CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch, YARN-2008.6.patch, YARN-2008.7.patch, YARN-2008.8.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 each currently use 50% of the cluster's resources, and there is no actual space available. With the current method of getting headroom, the CapacityScheduler thinks there are still resources available for users in Q1, but they have already been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example:
rootQueue
+-- L1ParentQueue1 (allowed to use up to 80% of its parent)
|   +-- L2LeafQueue1 (50% of its parent)
|   +-- L2LeafQueue2 (50% of its parent in minimum)
+-- L1ParentQueue2 (allowed to use 20% in minimum of its parent)
When we calculate the headroom of a user in L2LeafQueue2, the current method assumes L2LeafQueue2 can use 40% (80% * 50%) of the actual rootQueue resources. However, without checking how much of L1ParentQueue1's capacity is actually available, we cannot be sure: it is possible that L1ParentQueue2 has already used 40% of the rootQueue resources, in which case L2LeafQueue2 can actually only use 30% (60% * 50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
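The change Craig describes is easiest to see as code. A minimal sketch of the two checks, assuming the 2.x Resource API (getMemory/getVirtualCores); the class and helper names are illustrative, not taken from the patch:
{code}
import org.apache.hadoop.yarn.api.records.Resource;

// Sketch only: the described change makes ResourceCalculator.isInvalidDivisor
// abstract; these static helpers show the two per-calculator checks.
final class IsInvalidDivisorSketch {

  // DefaultResourceCalculator divides by memory alone, so only a
  // zero-memory divisor is invalid.
  static boolean isInvalidDivisorDefault(Resource divisor) {
    return divisor.getMemory() == 0;
  }

  // DominantResourceCalculator divides along both dimensions, so a zero
  // in either memory or vcores makes the divisor invalid.
  static boolean isInvalidDivisorDominant(Resource divisor) {
    return divisor.getMemory() == 0 || divisor.getVirtualCores() == 0;
  }
}
{code}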
[jira] [Assigned] (YARN-1488) Allow containers to delegate resources to another container
[ https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy reassigned YARN-1488: --- Assignee: Arun C Murthy Allow containers to delegate resources to another container --- Key: YARN-1488 URL: https://issues.apache.org/jira/browse/YARN-1488 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Arun C Murthy We should allow containers to delegate resources to another container. This would allow external frameworks to share not just YARN's resource-management capabilities but also its workload-management capabilities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1488) Allow containers to delegate resources to another container
[ https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088309#comment-14088309 ] Arun C Murthy commented on YARN-1488: - I have an early patch that I'll share shortly; this feature request is coming up in a lot of places and has generated a lot of interest. Allow containers to delegate resources to another container --- Key: YARN-1488 URL: https://issues.apache.org/jira/browse/YARN-1488 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Arun C Murthy We should allow containers to delegate resources to another container. This would allow external frameworks to share not just YARN's resource-management capabilities but also its workload-management capabilities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088346#comment-14088346 ] Jian He commented on YARN-2212: --- Looks good overall. Minor comments: - AllocateResponse#newInstance: the first newInstance should not be changed; it's marked stable. - // Should have exception: check the exception type. ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
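For context on what the AM side of this looks like, here is a minimal sketch of consuming a rolled-over AMRMToken from the heartbeat response, assuming the AMRMToken field that this work adds to AllocateResponse; the helper name and the use of ConverterUtils are illustrative, not taken from the patch:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.util.ConverterUtils;

// Sketch only: on each heartbeat, if the RM rolled the master key and
// sent a fresh AMRMToken, install it in the current UGI so the next
// allocate() call authenticates with the new key.
final class AmrmTokenUpdateSketch {
  static void maybeUpdateToken(AllocateResponse response,
      InetSocketAddress rmAddress) throws IOException {
    if (response.getAMRMToken() == null) {
      return; // no roll-over in this heartbeat
    }
    Token<? extends TokenIdentifier> amrmToken =
        ConverterUtils.convertFromYarn(response.getAMRMToken(), rmAddress);
    UserGroupInformation.getCurrentUser().addToken(amrmToken);
  }
}
{code}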
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.8.patch ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch, YARN-2212.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088447#comment-14088447 ] Xuan Gong commented on YARN-2212: - bq. AllocateResponse#newInstance: the first newInstance should not be changed, it’s marked stable FIXED bq. // Should have exception: check exception type FIXED ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch, YARN-2212.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088470#comment-14088470 ] Hadoop QA commented on YARN-2008: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660231/YARN-2008.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4540//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4540//console This message is automatically generated. CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch, YARN-2008.6.patch, YARN-2008.7.patch, YARN-2008.8.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 each currently use 50% of the cluster's resources, and there is no actual space available. With the current method of getting headroom, the CapacityScheduler thinks there are still resources available for users in Q1, but they have already been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example:
rootQueue
+-- L1ParentQueue1 (allowed to use up to 80% of its parent)
|   +-- L2LeafQueue1 (50% of its parent)
|   +-- L2LeafQueue2 (50% of its parent in minimum)
+-- L1ParentQueue2 (allowed to use 20% in minimum of its parent)
When we calculate the headroom of a user in L2LeafQueue2, the current method assumes L2LeafQueue2 can use 40% (80% * 50%) of the actual rootQueue resources. However, without checking how much of L1ParentQueue1's capacity is actually available, we cannot be sure: it is possible that L1ParentQueue2 has already used 40% of the rootQueue resources, in which case L2LeafQueue2 can actually only use 30% (60% * 50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
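To see the numbers end to end, here is the naive versus parent-aware queueMaxCap calculation from the example above as a runnable illustration (arithmetic only, not the patch itself):
{code}
// Worked example of the headroom bug described above, using the
// description's numbers; an illustration, not the fix.
public final class QueueMaxCapSketch {
  public static void main(String[] args) {
    double l1Parent1Max = 0.80;  // L1ParentQueue1 may use up to 80% of root
    double l2Leaf2Share = 0.50;  // L2LeafQueue2 gets 50% of its parent
    double l1Parent2Used = 0.40; // L1ParentQueue2 already uses 40% of root

    // Naive calculation: multiply percentages down the hierarchy and
    // ignore what the parent's siblings are consuming.
    double naive = l1Parent1Max * l2Leaf2Share;

    // Parent-aware calculation: the parent can only grow into whatever
    // the rest of the cluster leaves free (60% here).
    double actual = Math.min(l1Parent1Max, 1.0 - l1Parent2Used) * l2Leaf2Share;

    System.out.printf("naive queueMaxCap  = %.0f%%%n", naive * 100);  // 40%
    System.out.printf("actual queueMaxCap = %.0f%%%n", actual * 100); // 30%
  }
}
{code}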
[jira] [Updated] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2352: --- Attachment: yarn-2352-3.patch Thanks, Sandy, for pointing me to RpcMetrics. MutableRate seemed like a good candidate for the stats we want to collect, so I updated the patch to use it. For MutableRate, I have enabled showing extended stats (stdev, min/max, etc.) by default; in the future, we can add a config to toggle this if we see any particular overhead. Regarding using a singleton: if I don't do this, the tests fail complaining of already-existing metrics for FSDurations. Even QueueMetrics has a static map that it re-uses. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch, yarn-2352-3.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
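A minimal sketch of the singleton-plus-MutableRate shape described above, assuming MetricsRegistry#newRate's extended-stats flag; the class and metric names are guesses for illustration, not the patch:
{code}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;

// Sketch only: one MutableRate per critical FairScheduler method, with
// extended stats (stdev, min/max) enabled, held in a singleton so tests
// don't fail by re-registering "FSDurations".
final class FSDurationsSketch {
  private static FSDurationsSketch instance;

  private final MetricsRegistry registry = new MetricsRegistry("FSDurations");
  // The trailing 'true' turns on extended stats for the rate.
  private final MutableRate nodeUpdateDuration =
      registry.newRate("NodeUpdateDuration", "Handle node event", true);
  private final MutableRate updateDuration =
      registry.newRate("UpdateDuration", "Recompute fair shares", true);

  private FSDurationsSketch() {
  }

  static synchronized FSDurationsSketch getInstance() {
    if (instance == null) {
      instance = new FSDurationsSketch();
    }
    return instance;
  }

  void addNodeUpdateDuration(long durationMs) {
    nodeUpdateDuration.add(durationMs);
  }

  void addUpdateDuration(long durationMs) {
    updateDuration.add(durationMs);
  }
}
{code}
The singleton mirrors the reasoning in the comment: metrics sources register globally, so constructing a second instance in tests would collide with the first.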
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088479#comment-14088479 ] Karthik Kambatla commented on YARN-2359: Checking this in. Application is hung without timeout and retry after DNS/network is down. - Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch The application hangs without timeout or retry after the DNS/network goes down. This happens because, right after the container is allocated for the AM, the DNS/network goes down on the node that holds the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event and, because an IllegalArgumentException (due to the DNS error) is thrown, it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application still hangs in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry to the state machine table that handles the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED, by adding the following code to the StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2359) Application hangs when it fails to launch AM container
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2359: --- Summary: Application hangs when it fails to launch AM container (was: Application is hung without timeout and retry after DNS/network is down. ) Application hangs when it fails to launch AM container --- Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch The application hangs without timeout or retry after the DNS/network goes down. This happens because, right after the container is allocated for the AM, the DNS/network goes down on the node that holds the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event and, because an IllegalArgumentException (due to the DNS error) is thrown, it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application still hangs in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry to the state machine table that handles the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED, by adding the following code to the StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088488#comment-14088488 ] Craig Welch commented on YARN-2008: --- TestAMRestart passes on my box with the changes; build server issue? CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch, YARN-2008.6.patch, YARN-2008.7.patch, YARN-2008.8.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 each currently use 50% of the cluster's resources, and there is no actual space available. With the current method of getting headroom, the CapacityScheduler thinks there are still resources available for users in Q1, but they have already been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example:
rootQueue
+-- L1ParentQueue1 (allowed to use up to 80% of its parent)
|   +-- L2LeafQueue1 (50% of its parent)
|   +-- L2LeafQueue2 (50% of its parent in minimum)
+-- L1ParentQueue2 (allowed to use 20% in minimum of its parent)
When we calculate the headroom of a user in L2LeafQueue2, the current method assumes L2LeafQueue2 can use 40% (80% * 50%) of the actual rootQueue resources. However, without checking how much of L1ParentQueue1's capacity is actually available, we cannot be sure: it is possible that L1ParentQueue2 has already used 40% of the rootQueue resources, in which case L2LeafQueue2 can actually only use 30% (60% * 50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2249: -- Attachment: YARN-2249.1.patch Instead of making client-side changes, I changed the RM to cache the outstanding release requests; a container won't be recovered if it remains in the cache. The cache is cleaned after the NM expiry interval if no such container is reported to the RM for recovery. Uploaded a patch based on that. RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to the RM to recover the containers. If the RM receives a container release request before the container is actually recovered in the scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
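The caching idea Jian describes can be sketched in a few lines. All names and structure below are assumptions for illustration, not the patch:
{code}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;

// Sketch only: remember release requests that arrive before a container
// is recovered, and skip recovery for a container the AM has already
// asked to release.
final class PendingReleaseCacheSketch {
  // ContainerId -> time the release request was received.
  private final Map<ContainerId, Long> pendingRelease =
      new ConcurrentHashMap<ContainerId, Long>();

  // Called when an AM resync sends a release for a not-yet-known container.
  void recordRelease(ContainerId containerId) {
    pendingRelease.put(containerId, System.currentTimeMillis());
  }

  // Called when an NM reports a running container for recovery; a cached
  // entry means the container should be killed instead of recovered.
  boolean shouldSkipRecovery(ContainerId containerId) {
    return pendingRelease.remove(containerId) != null;
  }

  // Entries older than the NM expiry interval can be dropped: any NM that
  // could have reported the container is considered lost by then.
  void cleanup(long nmExpiryIntervalMs) {
    long cutoff = System.currentTimeMillis() - nmExpiryIntervalMs;
    Iterator<Map.Entry<ContainerId, Long>> it =
        pendingRelease.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() < cutoff) {
        it.remove();
      }
    }
  }
}
{code}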
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201408062232.txt [~leftnoteasy], Thank you for your suggestions. I added an end-to-end unit test that covered most of your points. However, I had trouble setting up a test with more than one attempt for the same app. I think I covered the rest. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
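The chargeback formula in the YARN-415 description is simple enough to state as a worked example. A minimal sketch with made-up container numbers (the method and values are illustrative, not from the patch):
{code}
// Sketch only: MB-seconds for an app is the sum over its containers of
// reserved memory (MB) times container lifetime (seconds).
public final class MemorySecondsSketch {

  static long memorySeconds(long reservedMB, long startMillis,
      long finishMillis) {
    return reservedMB * ((finishMillis - startMillis) / 1000);
  }

  public static void main(String[] args) {
    // Two hypothetical containers: 2048 MB for 600 s, 1024 MB for 120 s.
    long total = memorySeconds(2048, 0, 600000)
        + memorySeconds(1024, 0, 120000);
    // 2048*600 + 1024*120 = 1228800 + 122880 = 1351680 MB-seconds
    System.out.println(total + " MB-seconds");
  }
}
{code}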
[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088619#comment-14088619 ] Subramaniam Venkatraman Krishnan commented on YARN-2248: Thanks [~keyki]. I just added all your test cases and ran them; they do pass with my patch, including the queue metrics test. The test cases are quite useful, thanks again. Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Fix For: 2.6.0 Attachments: YARN-2248-1.patch, YARN-2248-2.patch, YARN-2248-3.patch We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version as there are differences between 2.4.* and 3.* interfaces. The story behind it is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2359) Application hangs when it fails to launch AM container
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088634#comment-14088634 ] Hudson commented on YARN-2359: -- FAILURE: Integrated in Hadoop-trunk-Commit #6025 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6025/]) YARN-2359. Application hangs when it fails to launch AM container. (Zhihai Xu via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616375) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java Application hangs when it fails to launch AM container --- Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.6.0 Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch The application hangs without timeout or retry after the DNS/network goes down. This happens because, right after the container is allocated for the AM, the DNS/network goes down on the node that holds the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event and, because an IllegalArgumentException (due to the DNS error) is thrown, it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application still hangs in state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send the RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry to the state machine table that handles the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED, by adding the following code to the StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088646#comment-14088646 ] Hadoop QA commented on YARN-415: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660287/YARN-415.201408062232.txt against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4542//console This message is automatically generated. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088653#comment-14088653 ] Hadoop QA commented on YARN-2212: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660260/YARN-2212.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4541//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4541//console This message is automatically generated. ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch, YARN-2212.7.patch, YARN-2212.7.patch, YARN-2212.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2388) TestTimelineWebServices fails on trunk after HADOOP-10791
[ https://issues.apache.org/jira/browse/YARN-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088654#comment-14088654 ] Zhijie Shen commented on YARN-2388: --- [~xgong], thanks! I'll commit it later today if there are no further comments. TestTimelineWebServices fails on trunk after HADOOP-10791 - Key: YARN-2388 URL: https://issues.apache.org/jira/browse/YARN-2388 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2388.1.patch https://builds.apache.org/job/PreCommit-YARN-Build/4530//testReport/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2288: - Attachment: YARN-2288-v3.patch Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288-v3.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088655#comment-14088655 ] Junping Du commented on YARN-2288: -- Thanks for the review, [~zjshen]! Please see my reply below: bq. but TS_STORE_VERSION_KEY is going to be a common constant across different impls. In addition, TS_STORE_VERSION_KEY - TIMELINE_STORE_VERSION_KEY? Actually, I had a long discussion with Jason on YARN-2045, and both of us think we should keep the API (including public constants) as simple as possible. This key will not be used outside of the class or its subclasses, so there is no hard requirement to put it in the parent class (an interface, actually); the only value of putting it in the parent class is one line of code reuse, which is not necessary for some other subclasses (i.e. MemoryTimelineStore) and brings extra complexity to an interface that is simple now. So I prefer that it stay in the subclass until the HBase implementation is there and we have a strong reason to share it across different impls. Thoughts? I will fix the name issue here. bq. T - t? Nice catch. Will fix it soon. bq. Unnecessary change for - @SuppressWarnings(resource)? That just fixes a javac warning; fixing it in a separate patch sounds like overkill, so I included the fix here. Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288-v3.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
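Junping's point about keeping the version key on the LevelDB-backed subclass is easier to see with a sketch. A minimal illustration of the versioning idea under discussion, assuming the org.iq80.leveldb API; the key name, version string, and method are assumptions, not the patch:
{code}
import java.io.IOException;
import java.nio.charset.Charset;

import org.iq80.leveldb.DB;

// Sketch only: stamp a fresh store with the current schema version and
// refuse to load data written under an incompatible schema.
final class TimelineStoreVersionSketch {
  private static final Charset UTF8 = Charset.forName("UTF-8");
  // Lives on the LevelDB-backed subclass, per the discussion above.
  private static final byte[] TIMELINE_STORE_VERSION_KEY =
      "timeline-store-version".getBytes(UTF8);
  private static final String CURRENT_VERSION = "1.0";

  static void checkVersion(DB db) throws IOException {
    byte[] stored = db.get(TIMELINE_STORE_VERSION_KEY);
    if (stored == null) {
      // Fresh store: record the schema version we are about to write.
      db.put(TIMELINE_STORE_VERSION_KEY, CURRENT_VERSION.getBytes(UTF8));
      return;
    }
    String version = new String(stored, UTF8);
    if (!CURRENT_VERSION.equals(version)) {
      throw new IOException("Incompatible timeline store schema version "
          + version + "; expected " + CURRENT_VERSION);
    }
  }
}
{code}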
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088657#comment-14088657 ] Junping Du commented on YARN-2288: -- Addressed [~zjshen]'s comments in the v3 patch. Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288-v3.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088663#comment-14088663 ] Hadoop QA commented on YARN-2249: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660275/YARN-2249.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4543//console This message is automatically generated. RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to the RM to recover the containers. If the RM receives a container release request before the container is actually recovered in the scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo reassigned YARN-1729: Assignee: Leitao Guo (was: Billie Rinaldi) TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Leitao Guo Fix For: 2.4.0 Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch, YARN-1729.6.patch, YARN-1729.7.patch Primary filters and secondary filter values can be arbitrary JSON-compatible Objects. The web services should determine whether the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.2#6252)
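The rule in the description (parse each filter query parameter as JSON when possible, otherwise treat it as a plain string) can be sketched in a few lines. This uses the Jackson 1.x ObjectMapper that Hadoop of this era ships; the helper name is illustrative, not from the patch:
{code}
import java.io.IOException;

import org.codehaus.jackson.map.ObjectMapper;

// Sketch only: try the raw query-parameter value as JSON first, so
// "123" becomes an Integer, "true" a Boolean, and "{\"a\":1}" a Map;
// anything that is not valid JSON stays a plain String.
final class FilterValueParseSketch {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  static Object parseFilterValue(String raw) {
    try {
      return MAPPER.readValue(raw, Object.class);
    } catch (IOException e) {
      return raw; // not JSON-compatible: keep the literal string
    }
  }
}
{code}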
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088745#comment-14088745 ] Hadoop QA commented on YARN-2352: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660272/yarn-2352-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4544//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4544//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4544//console This message is automatically generated. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch, yarn-2352-2.patch, yarn-2352-2.patch, yarn-2352-3.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2249: -- Attachment: YARN-2249.1.patch RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to the RM to recover the containers. If the RM receives a container release request before the container is actually recovered in the scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088832#comment-14088832 ] Krisztian Horvath commented on YARN-2248: - Is there a chance we can get this committed in 2.6.0? Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Fix For: 2.6.0 Attachments: YARN-2248-1.patch, YARN-2248-2.patch, YARN-2248-3.patch We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version as there are differences between 2.4.* and 3.* interfaces. The story behind it is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2288) Data persistent in timelinestore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088841#comment-14088841 ] Hadoop QA commented on YARN-2288: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660309/YARN-2288-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice: org.apache.hadoop.yarn.server.timeline.webapp.TestTimelineWebServices org.apache.hadoop.yarn.server.timeline.TestLeveldbTimelineStore {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4545//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4545//console This message is automatically generated. Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2288-v2.patch, YARN-2288-v3.patch, YARN-2288-v4.patch, YARN-2288.patch We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)