[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001563#comment-14001563 ] Steve Loughran commented on YARN-1372: -- how long is AM restart likely to take? Should failed AMs with the restart flag set be pushed to the front of any queues, because they are consuming so much cluster resource that finishing fast (or restarting the long-lived service) should get priority? Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, the NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM, the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-registering with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AMs about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
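A minimal, self-contained sketch of the NM-side bookkeeping the description proposes, using plain String container ids and a hypothetical ack callback rather than the real NM/RM protocol classes:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks completed containers until the RM confirms the AM has pulled them. */
public class CompletedContainerTracker {
  // Completed containers reported to the RM but not yet acknowledged as pulled by the AM.
  private final Map<String, String> pendingAck = new ConcurrentHashMap<>();

  /** Called when a container finishes; the status stays pending until acknowledged. */
  public void containerCompleted(String containerId, String status) {
    pendingAck.put(containerId, status);
  }

  /** Statuses to include in the next NM->RM heartbeat. */
  public List<String> statusesToReport() {
    return new ArrayList<>(pendingAck.values());
  }

  /** Called when the RM confirms the AM pulled these containers; only then do we forget them. */
  public void ackPulledByAM(List<String> containerIds) {
    containerIds.forEach(pendingAck::remove);
  }

  /** On NM re-registration after RM restart, resend everything still pending. */
  public List<String> statusesForReregistration() {
    return statusesToReport();
  }

  public static void main(String[] args) {
    CompletedContainerTracker t = new CompletedContainerTracker();
    t.containerCompleted("container_01", "EXITED_WITH_SUCCESS");
    t.containerCompleted("container_02", "KILLED");
    System.out.println(t.statusesToReport());          // both still pending
    t.ackPulledByAM(List.of("container_01"));
    System.out.println(t.statusesForReregistration()); // only container_02 remains
  }
}
{code}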
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001572#comment-14001572 ] Varun Vasudev commented on YARN-1937: - My feedback - 1. admins should be allowed to view all entities - the current patch only allows the owner 2. There should be a way to prevent un-authenticated users from posting entities. In the current patch, the owner is set to null but the entity is saved. Admins should be allowed to insist that users be authenticated before posting entities. Otherwise it looks fine to me. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2072) RM/NM UIs and webservices are missing vcore information
Nathan Roberts created YARN-2072: Summary: RM/NM UIs and webservices are missing vcore information Key: YARN-2072 URL: https://issues.apache.org/jira/browse/YARN-2072 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Change RM and NM UIs and webservices to include virtual cores. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002059#comment-14002059 ] Bikas Saha commented on YARN-1366: -- It would be easier for users if the RM would simply accept the first register from the app and the last finishApplicationMaster() without needing a resync. Let's say that app version 1 was running and we considered it lost because we lost network communication. So the RM started version 2 of the app. Then the RM dies. Then network connectivity for app 1 got restored. Now both v1 and v2 are trying to make allocate calls to the non-existent RM instance. When the RM comes back up, how does it differentiate between v1 and v2 and keep v2 and ask v1 to exit? Does this already work? Until now it may not have been a problem because the RM would always ask these to exit and start a new v3. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
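A hedged, self-contained sketch of the resync behavior described above on the AM side; the AMCommand enum and RM interface here are illustrative stand-ins, not the real YARN protocol types:
{code}
import java.util.ArrayList;
import java.util.List;

public class ResyncingHeartbeat {
  enum AMCommand { NONE, RESYNC, SHUTDOWN }            // illustrative, not the real YARN enum

  interface RM {                                       // hypothetical stand-in for the RM protocol
    AMCommand allocate(int responseId, List<String> asks);
  }

  private final RM rm;
  private final List<String> outstandingAsks = new ArrayList<>(); // everything not yet satisfied
  private int responseId = 0;                           // allocate RPC sequence number

  ResyncingHeartbeat(RM rm) { this.rm = rm; }

  /** One heartbeat; on RESYNC, reset the sequence number and resend all outstanding requests. */
  boolean heartbeat(List<String> newAsks) {
    outstandingAsks.addAll(newAsks);
    AMCommand cmd = rm.allocate(responseId++, newAsks);
    if (cmd == AMCommand.RESYNC) {
      responseId = 0;                                   // restart the allocate sequence from 0
      // The thread below also discusses re-registering before this resend.
      rm.allocate(responseId++, new ArrayList<>(outstandingAsks)); // resend the full outstanding ask
      return true;
    }
    return cmd != AMCommand.SHUTDOWN;                   // SHUTDOWN means this attempt should exit
  }
}
{code}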
[jira] [Created] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
Karthik Kambatla created YARN-2073: -- Summary: FairScheduler starts preempting resources even with free resources on the cluster Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
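A tiny sketch of the guard this report implies, with hypothetical memory-only accounting (the real FairScheduler decision also considers fair shares, multiple resources, and preemption timeouts):
{code}
public class PreemptionGuard {
  /**
   * Preemption should only kick in when the pending request cannot be satisfied
   * from resources that are still free on the cluster.
   */
  static boolean shouldPreempt(long pendingMB, long clusterFreeMB) {
    return pendingMB > clusterFreeMB;
  }

  public static void main(String[] args) {
    System.out.println(shouldPreempt(4096, 8192)); // false: free space is enough, don't preempt
    System.out.println(shouldPreempt(4096, 1024)); // true: request can't fit, consider preemption
  }
}
{code}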
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002140#comment-14002140 ] Vinod Kumar Vavilapalli commented on YARN-2055: --- Hi folks, I filed YARN-2074 to address the orthogonal issue of not failing apps when repeatedly preempting AM containers. Preemption: Jobs are failing due to AMs are getting launched and killed multiple times -- Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal If Queue A does not have enough capacity to run the AM, the AM will borrow capacity from queue B. In that case the AM will be killed when queue B reclaims its capacity, then launched and killed again, and eventually the job will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002141#comment-14002141 ] Vinod Kumar Vavilapalli commented on YARN-2022: --- Hi folks, I filed YARN-2074 to address the orthogonal issue of not failing apps when repeatedly preempting AM containers. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, job J3 will get killed, including its AM. It is better if the AM can be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later, when the cluster is free, maps can be allocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2074: -- Fix Version/s: (was: 2.1.0-beta) Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
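A minimal sketch of the policy YARN-2074 proposes, assuming an illustrative exit-status enum rather than the real ContainerExitStatus constants:
{code}
import java.util.List;

public class AMFailureCounter {
  enum ExitStatus { SUCCESS, FAILED, PREEMPTED, KILLED_BY_RM }   // illustrative values

  /** Count only genuine AM failures; preempted AM containers are excluded from the limit. */
  static long countTowardsMaxAttempts(List<ExitStatus> attemptExitStatuses) {
    return attemptExitStatuses.stream()
        .filter(s -> s != ExitStatus.SUCCESS)
        .filter(s -> s != ExitStatus.PREEMPTED)   // the key change proposed in this JIRA
        .count();
  }

  public static void main(String[] args) {
    List<ExitStatus> history =
        List.of(ExitStatus.PREEMPTED, ExitStatus.PREEMPTED, ExitStatus.FAILED);
    // Only one attempt counts against yarn.resourcemanager.am.max-attempts here.
    System.out.println(countTowardsMaxAttempts(history)); // 1
  }
}
{code}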
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002145#comment-14002145 ] Zhijie Shen commented on YARN-1937: --- Hi Varun, thanks for the review! W.r.t. your concern, see my comments below: bq. 1. admins should be allowed to view all entities - the current patch only allows the owner Yeah, we definitely need to allow admins as well as users/groups on the allowed access list. However, for now, since we don't have an admin module yet, I prefer to defer the admin check until we support the admin role (see YARN-2059, YARN-2060). bq. 2. There should be a way to prevent un-authenticated users from posting entities. In the current patch, the owner is set to null but the entity is saved. Admins should be allowed to insist that users be authenticated before posting entities. IMHO, we should allow un-authenticated users to post entities. Otherwise, an unsecured cluster cannot leverage the timeline service. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
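A hedged sketch of the owner-only check being discussed, with a placeholder hook for the admin and allowed-users checks that are deferred to YARN-2059/YARN-2060; all names here are illustrative, not the patch's actual classes:
{code}
public class TimelineAclsCheck {
  /**
   * Owner-only access control: the caller may read an entity only if it is the owner.
   * Admin and allowed-users/groups checks are deferred until an admin module exists.
   */
  static boolean canAccess(String caller, String entityOwner) {
    if (caller == null || entityOwner == null) {
      return false;            // unauthenticated caller or ownerless entity: deny reads
    }
    if (isAdmin(caller)) {
      return true;             // placeholder: admin support is deferred (YARN-2059/2060)
    }
    return caller.equals(entityOwner);
  }

  private static boolean isAdmin(String caller) {
    return false;              // no admin module yet, so nobody is treated as admin
  }

  public static void main(String[] args) {
    System.out.println(canAccess("zhijie", "zhijie")); // true: owner reads own entity
    System.out.println(canAccess("varun", "zhijie"));  // false: non-owner, non-admin
  }
}
{code}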
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1408: -- Target Version/s: 2.5.0 Fix Version/s: (was: 2.5.0) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable = true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Submit a big jobA to queue A which uses the full cluster capacity. Step 2: Submit a jobB to queue B which would use less than 20% of cluster capacity. The jobA tasks which use queue B capacity are then preempted and killed. This caused the problem below: 1. A new container got allocated for jobA in queue A as per a node update from an NM. 2. This container was preempted immediately. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the task to time out after 30 minutes, as this container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
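One plausible direction for this fix is to make ACQUIRED a tolerated (no-op) event in the KILLED state; the sketch below uses a simple transition table rather than the actual RMContainerImpl StateMachineFactory wiring:
{code}
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class ContainerStateTable {
  enum State { ALLOCATED, ACQUIRED, RUNNING, KILLED }
  enum Event { ACQUIRED, LAUNCHED, KILL }

  // Events that are simply ignored in a given state instead of raising an invalid-transition error.
  private static final Map<State, EnumSet<Event>> IGNORED = new EnumMap<>(State.class);
  static {
    // A container preempted right after allocation may still be reported as acquired
    // by a racing AM heartbeat; tolerate that instead of logging an error.
    IGNORED.put(State.KILLED, EnumSet.of(Event.ACQUIRED));
  }

  static State handle(State current, Event event) {
    if (IGNORED.getOrDefault(current, EnumSet.noneOf(Event.class)).contains(event)) {
      return current;                                   // no-op transition
    }
    switch (current) {
      case ALLOCATED: if (event == Event.ACQUIRED) return State.ACQUIRED; break;
      case ACQUIRED:  if (event == Event.LAUNCHED) return State.RUNNING;  break;
      default: break;
    }
    if (event == Event.KILL) return State.KILLED;       // kill/preempt is legal from any state here
    throw new IllegalStateException("Invalid event: " + event + " at " + current);
  }

  public static void main(String[] args) {
    State s = handle(State.ALLOCATED, Event.KILL);      // preempted immediately after allocation
    System.out.println(handle(s, Event.ACQUIRED));      // stays KILLED, no exception
  }
}
{code}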
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002160#comment-14002160 ] Varun Vasudev commented on YARN-1937: - {quote} IMHO, we should allow un-authenticated to post entities. Otherwise, the unsecured cluster cannot leverage the timeline service. {quote} Sorry, I should have explained myself better. You are entirely correct that unsecured clusters should be able to leverage the timeline service. My point was that in a secure cluster, the admin should be allowed to insist that all posts to the timeline server be authenticated. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002178#comment-14002178 ] Zhijie Shen commented on YARN-1937: --- bq. My point was that in a secure cluster, the admin should be allowed to insist that all posts to the timeline server be authenticated. When authentication is enabled, the putEntities API is only accessible to authenticated users. YARN-1936 is to make the client able to put the timeline data in secure mode. Therefore, we don't need to worry that un-authenticated users will post the timeline data. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002284#comment-14002284 ] Karthik Kambatla edited comment on YARN-1474 at 5/19/14 8:08 PM: - Sorry for prolonging this discussion. If we don't change the {{reinitialize}} signature, we might not need setRMContext at all. Each scheduler can (re)set the local {{RMContext}}, may be we can start with setting it only on null. None of the tests need to change, I think the patch would shrink considerably. Let us open another JIRA to revisit the ResourceScheduler API, and may be we can add the new setRMContext and update reinitialize? What do you think? was (Author: kkambatl): Sorry for the prolonging this discussion. If we don't change the {{reinitialize}} signature, we might not need setRMContext at all. Each scheduler can (re)set the local {{RMContext}}, may be we can start with setting it only on null. None of the tests need to change, I think the patch would be fairly small. Let us open another JIRA to revisit the ResourceScheduler API, and may be we can add the new setRMContext and update reinitialize? What do you think? Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2075) TestRMAdminCLI consistently fail on trunk
Zhijie Shen created YARN-2075: - Summary: TestRMAdminCLI consistently fail on trunk Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
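For reference, the UnsupportedOperationException at AbstractList.remove in this trace is the classic symptom of calling remove() on a fixed-size list such as the one returned by Arrays.asList; a minimal reproduction and the usual fix, independent of the HAAdmin code itself:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeListRemove {
  public static void main(String[] args) {
    List<String> fixed = Arrays.asList("rm1", "rm2");
    try {
      fixed.remove("rm1");                        // throws UnsupportedOperationException
    } catch (UnsupportedOperationException e) {
      System.out.println("Arrays.asList is fixed-size: " + e);
    }

    List<String> mutable = new ArrayList<>(Arrays.asList("rm1", "rm2"));
    mutable.remove("rm1");                        // fine: copy into a mutable list first
    System.out.println(mutable);                  // [rm2]
  }
}
{code}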
[jira] [Commented] (YARN-1935) Security for timeline server
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002318#comment-14002318 ] Zhijie Shen commented on YARN-1935: --- The test failure should be unrelated: YARN-2075. Security for timeline server Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Zhijie Shen Attachments: Timeline_Kerberos_DT_ACLs.patch Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002344#comment-14002344 ] Jian He commented on YARN-1366: --- bq. When the RM comes back up how does it differentiate between v1 and v2 and keep v2 and ask v1 to exit? Does this already work? There’s a response map in AMS to differentiate the attempt; I think this should work already. bq. It would be easier for users if the RM would simply accept the first register from the app and the last finishApplicationMaster() without needing a resync. agree. bq. For the case where the AM's last heartbeat has been sent to the RM, and the RM restarted before finishApplicationMaster() was called, does ApplicationMasterService send resync? Seems we have a race where the allocate call gets the resync and does the re-register even after finishApplicationMaster is called. Checked the MR code: this cannot happen because the allocate thread is interrupted and joined before calling unregister. We may document the API to say that allocate should not be called after finishApplicationMaster, or handle it explicitly in the RM? ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
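The MR behavior described above (stop the allocate heartbeat before unregistering, so a resync can never race with finishApplicationMaster) boils down to the standard interrupt-and-join pattern; a generic sketch with a hypothetical heartbeat loop:
{code}
public class HeartbeatShutdown {
  private volatile boolean running = true;

  private final Thread heartbeatThread = new Thread(() -> {
    while (running && !Thread.currentThread().isInterrupted()) {
      try {
        // the allocate() heartbeat would go here
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();       // exit the loop promptly
      }
    }
  }, "AM heartbeat");

  void start() { heartbeatThread.start(); }

  /** No allocate() can race with unregister: the heartbeat thread is stopped and joined first. */
  void unregister() throws InterruptedException {
    running = false;
    heartbeatThread.interrupt();
    heartbeatThread.join();
    // ... now it is safe to call finishApplicationMaster()
  }

  public static void main(String[] args) throws InterruptedException {
    HeartbeatShutdown am = new HeartbeatShutdown();
    am.start();
    Thread.sleep(100);
    am.unregister();
    System.out.println("unregistered after heartbeat thread joined");
  }
}
{code}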
[jira] [Created] (YARN-2076) Minor error in TestLeafQueue files
Chen He created YARN-2076: - Summary: Minor error in TestLeafQueue files Key: YARN-2076 URL: https://issues.apache.org/jira/browse/YARN-2076 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Chen He Assignee: Chen He Priority: Minor numNodes should be 2 instead of 3 in testReservationExchange() since only two nodes are defined. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002393#comment-14002393 ] Bikas Saha commented on YARN-1366: -- bq. Seems we have a race where the allocate call gets the resync and does the re-register even after finishApplicationMaster is called. Checked the MR code: this cannot happen because the allocate thread is interrupted and joined before calling unregister. We may document the API to say that allocate should not be called after finishApplicationMaster, or handle it explicitly in the RM? If the AMRMClientAsync is not doing this then we should fix it. bq. There’s a response map in AMS to differentiate the attempt, I think this should work already. That is for the running RM, right? How does the restarted RM do it? Currently, absence of an entry for that AM in the responseMap is the cause for asking the AM to resync. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002406#comment-14002406 ] Anubhav Dhoot commented on YARN-1550: - Manually tested by commenting out the line that triggers the START transition in RMAppManager submitApplication. This ensures the app is in NEW and without a currentAttempt, causing the null ref reported (which is now at line 111). this.rmContext.getDispatcher().getEventHandler().handle(new RMAppEvent(applicationId, RMAppEventType.START)); Before the fix, the web page skips rendering the FairScheduler block (some other code path is catching exceptions so that the originally reported 500 does not show up). After the fix, the FairScheduler block renders with no apps listed. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication after the code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
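A minimal sketch of the kind of null guard being verified here, assuming hypothetical app/attempt types rather than the real RMApp classes; an app still in the NEW state has no current attempt yet:
{code}
import java.util.List;
import java.util.Optional;

public class AppsBlockRendering {
  record Attempt(String id) {}
  record App(String id, Attempt currentAttempt) {}     // currentAttempt is null while in NEW

  /** Render one table row per app, skipping attempt details when no attempt exists yet. */
  static void render(List<App> apps) {
    for (App app : apps) {
      String attemptId = Optional.ofNullable(app.currentAttempt())
          .map(Attempt::id)
          .orElse("N/A");                              // previously this dereference threw an NPE
      System.out.println(app.id() + "\t" + attemptId);
    }
  }

  public static void main(String[] args) {
    render(List.of(new App("application_1", new Attempt("appattempt_1_000001")),
                   new App("application_2", null)));   // app still in NEW: renders with N/A
  }
}
{code}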
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002420#comment-14002420 ] Xuan Gong commented on YARN-941: That is fine. This proposal is only focused on updating the AMRMToken for Long Running Services. Proposal: 1. From the RM side, specifically AMRMTokenSecretManager: We need to roll over the AMRMToken periodically. We have two fields which temporarily save the currentMasterKey and nextMasterKey, and have a thread which will periodically activate the nextMasterKey (basically replace currentMasterKey with nextMasterKey). When we need to retrieve the password to do the authentication, we can compare the key_id to get the correct password. 2. ApplicationMasterService: Every time the AMRMToken has been rolled over, we can inform the AM via the regular heartbeat process. Also, we need to save the AMRMToken into the RMStateStore if it has been updated. 3. AMRMClient: When the AM gets the latest AMRMToken, it will update the token. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
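A self-contained sketch of the roll-over scheme in this proposal (current/next master key, periodic activation, password lookup by key id); the key material is a plain byte array and all names are illustrative, not the actual AMRMTokenSecretManager fields:
{code}
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RollingMasterKeys {
  record MasterKey(int keyId, byte[] secret) {}

  private final SecureRandom random = new SecureRandom();
  private final Map<Integer, MasterKey> knownKeys = new ConcurrentHashMap<>();
  private volatile MasterKey currentMasterKey;
  private volatile MasterKey nextMasterKey;             // staged, not yet active

  RollingMasterKeys() {
    currentMasterKey = newKey(1);
    knownKeys.put(currentMasterKey.keyId(), currentMasterKey);
  }

  private MasterKey newKey(int id) {
    byte[] secret = new byte[32];
    random.nextBytes(secret);
    return new MasterKey(id, secret);
  }

  /** Stage a new key; AMs learn about it via the regular allocate heartbeat. */
  synchronized void rollMasterKey() {
    nextMasterKey = newKey(currentMasterKey.keyId() + 1);
    knownKeys.put(nextMasterKey.keyId(), nextMasterKey);
  }

  /** Activate the staged key; in the proposal a background thread calls this periodically. */
  synchronized void activateNextMasterKey() {
    if (nextMasterKey != null) {
      currentMasterKey = nextMasterKey;
      nextMasterKey = null;
      // The real proposal would also persist the updated token/key to the RMStateStore here.
    }
  }

  /** Authenticate by looking up the secret for the key id carried inside the token. */
  byte[] retrievePassword(int tokenKeyId) {
    MasterKey key = knownKeys.get(tokenKeyId);
    if (key == null) throw new SecurityException("Unknown key id " + tokenKeyId);
    return key.secret();
  }

  public static void main(String[] args) {
    RollingMasterKeys mgr = new RollingMasterKeys();
    mgr.rollMasterKey();                                 // key 2 staged, advertised on heartbeats
    System.out.println(mgr.retrievePassword(1).length);  // 32: tokens on the old key still work
    mgr.activateNextMasterKey();                         // key 2 becomes the current master key
    System.out.println(mgr.retrievePassword(2).length);  // 32: tokens on the new key work too
  }
}
{code}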
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002422#comment-14002422 ] Xuan Gong commented on YARN-941: Uploaded a preview patch for the previous proposal. Will add new test cases and do more tests on real clusters. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-941: --- Attachment: YARN-941.preview.patch RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002432#comment-14002432 ] Karthik Kambatla commented on YARN-1366: With the responseMap, I think the best approach is to set the corresponding entry to -1 on resync just like we do for new apps. On register(), we set the entry to 0 and move on just like in the new app case. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002437#comment-14002437 ] Bikas Saha commented on YARN-1366: -- Then what happens when there are 2 versions of the AM running, like I mentioned in the previous comment? How do we prevent v1 from re-connecting with the RM? ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002438#comment-14002438 ] Anubhav Dhoot commented on YARN-1366: - I have a patch uploaded to [YARN-1365|https://issues.apache.org/jira/browse/YARN-1365] that does just that. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002453#comment-14002453 ] Jian He commented on YARN-1366: --- bq. That is for the running RM right? How does the restarted RM to do it? sorry, I meant if we correctly populate the responseMap back for the current active attempt on recovery. The current active attempt should get RESYNC because of the non-null entry and previous dead attempt should get SHUTDOWN because of the empty entry in responseMap. Right, we need code change. We should differentiate the two commands SHUTDOWN and RESYNC. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002455#comment-14002455 ] Karthik Kambatla commented on YARN-1366: Sorry, missed the point in your previous comment. The responseMap should keep track of the AM version, and allow resync/re-register only to the current or later version of the AM. Once the version stored is updated, we should kill/shutdown all previous versions. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002499#comment-14002499 ] Anubhav Dhoot commented on YARN-1366: - To summarize, along with the current changes in YARN-1365 (which sets responseMap to -1 in recovery, i.e. allows the latest known AM to register/finish on resync) we need 2 more changes: a) return SHUTDOWN instead of resync for an empty responseMap (i.e. for any AMs that are not known to be the latest) b) For known last AMs, b.1) allow finishApplicationMaster to succeed when responseMap is set to -1 (i.e. not yet registered but known to be last). b.2) return RESYNC for all allocate calls from known AMs that have not yet registered. b.3) allow register for a known AM after restart (already covered in 1365's current patch) [~rohithsharma] let me know if you mind adding these as well to [YARN-1365|https://issues.apache.org/jira/browse/YARN-1365]. It's needed for fixing the unit test failures in 1365's current patch and will also keep things consistent instead of splitting them across patches. We can keep this patch for all the AM side of things. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
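A compact sketch of the decision table summarized above (unknown attempt gets SHUTDOWN, the last known attempt that has not re-registered gets RESYNC), using an illustrative responseMap keyed by attempt id:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AllocateResponsePolicy {
  enum Command { OK, RESYNC, SHUTDOWN }

  // attemptId -> last responseId; -1 means "recovered/known but not yet (re-)registered".
  private final Map<String, Integer> responseMap = new ConcurrentHashMap<>();

  void recoverAttempt(String attemptId) { responseMap.put(attemptId, -1); }   // RM restart path
  void register(String attemptId)       { responseMap.put(attemptId, 0); }

  Command onAllocate(String attemptId) {
    Integer lastResponseId = responseMap.get(attemptId);
    if (lastResponseId == null) return Command.SHUTDOWN;  // stale attempt (e.g. superseded v1)
    if (lastResponseId == -1)   return Command.RESYNC;    // known last attempt, must re-register
    return Command.OK;
  }

  public static void main(String[] args) {
    AllocateResponsePolicy p = new AllocateResponsePolicy();
    p.recoverAttempt("appattempt_0001_000002");                    // v2 was active at restart
    System.out.println(p.onAllocate("appattempt_0001_000001"));    // SHUTDOWN (old v1)
    System.out.println(p.onAllocate("appattempt_0001_000002"));    // RESYNC
    p.register("appattempt_0001_000002");
    System.out.println(p.onAllocate("appattempt_0001_000002"));    // OK
  }
}
{code}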
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002594#comment-14002594 ] Tsuyoshi OZAWA commented on YARN-1474: -- {quote} If we don't change the reinitialize signature, we might not need setRMContext at all. {quote} [~kkambatl], In this case, we need to call reinitialize() directly from ResourceManager#serviceInit(). Is it acceptable for us? It means that Schedulers#serviceInit() doesn't initialize anything. If it's acceptable for us, I can fix it soon. {code} - try { -scheduler.reinitialize(conf, rmContext); - } catch (IOException ioe) { -throw new RuntimeException("Failed to initialize scheduler", ioe); - } {code} Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002615#comment-14002615 ] Karthik Kambatla commented on YARN-1474: Let me take a closer look. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002657#comment-14002657 ] Tsuyoshi OZAWA commented on YARN-1474: -- It's because serviceInit() doesn't have any interfaces to pass RMContext to schedulers. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-941: --- Attachment: YARN-941.preview.2.patch Added a testcase RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002707#comment-14002707 ] Karthik Kambatla commented on YARN-1474: Thanks [~ozawa] for your patience with the reviews. I guess we can leave setRMContext as is. And, let us handle the incompatible change to reinitialize in a separate JIRA. On an HA cluster, I noticed that the scheduler threads (FS - updateThread, continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. Ideally, the threads should start only on start(). I guess we should adopt a modified version of your earlier patch: # From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} to serviceInit. # Don't call {{reinitialize()}} in serviceInit or serviceStart. # For the individual threads in the schedulers, init them in serviceInit, but call thread.start() in serviceStart() # serviceStop() for FS looks good. We should fix the serviceStop() for CS. # In TestFairScheduler, the following is not required. {code} // To initialize scheduler scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFairSchedulerEventLog, the following is not required. In this case and the above, some tests might require calling resourceManager.startt(). {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFifoScheduler, we don't need the following: {code} scheduler.setRMContext(rm.getRMContext()); {code} # TestFSLeafQueue doesn't need this either: {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, in any other places. # In TestQueueParsing, you might need to call capacityScheduler.init() in addition to or instead of {code} capacityScheduler.reinitialize(conf, null); {code} # In TestRMContainerAllocator, we might have to call init() instead of reinitialize(). # In TestRMWebApp, we should call init() instead of reinitialize() In general, in the tests, # If there is an RM / Mock RM involved, we don't have to call setRMContext and reinitialize as long as RM#init is called. # If there is no RM / Mock RM, we should call a setRMContext followed by init on the scheduler. Subsequent, calls should remain reinitialize Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002707#comment-14002707 ] Karthik Kambatla edited comment on YARN-1474 at 5/20/14 2:01 AM: - Thanks [~ozawa] for your patience with the reviews. I guess we can leave setRMContext as is. And, let us handle the incompatible change to reinitialize in a separate JIRA. On an HA cluster, I noticed that the scheduler threads (FS - updateThread, continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. Ideally, the threads should start only on start(). I guess we should adopt a modified version of your earlier patch: # From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} to serviceInit. # Don't call {{reinitialize()}} in serviceInit or serviceStart. # For the individual threads in the schedulers, init them in serviceInit, but call thread.start() in serviceStart() # serviceStop() for FS looks good. We should fix the serviceStop() for CS. Other comments: # In TestFairScheduler, the following is not required. {code} // To initialize scheduler scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFairSchedulerEventLog, the following is not required. In this case and the above, some tests might require calling resourceManager.startt(). {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFifoScheduler, we don't need the following: {code} scheduler.setRMContext(rm.getRMContext()); {code} # TestFSLeafQueue doesn't need this either: {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, in any other places. # In TestQueueParsing, you might need to call capacityScheduler.init() in addition to or instead of {code} capacityScheduler.reinitialize(conf, null); {code} # In TestRMContainerAllocator, we might have to call init() instead of reinitialize(). # In TestRMWebApp, we should call init() instead of reinitialize() In general, in the tests, # If there is an RM / Mock RM involved, we don't have to call setRMContext and reinitialize as long as RM#init is called. # If there is no RM / Mock RM, we should call a setRMContext followed by init on the scheduler. Subsequent, calls should remain reinitialize was (Author: kkambatl): Thanks [~ozawa] for your patience with the reviews. I guess we can leave setRMContext as is. And, let us handle the incompatible change to reinitialize in a separate JIRA. On an HA cluster, I noticed that the scheduler threads (FS - updateThread, continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. Ideally, the threads should start only on start(). I guess we should adopt a modified version of your earlier patch: # From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} to serviceInit. # Don't call {{reinitialize()}} in serviceInit or serviceStart. # For the individual threads in the schedulers, init them in serviceInit, but call thread.start() in serviceStart() # serviceStop() for FS looks good. We should fix the serviceStop() for CS. # In TestFairScheduler, the following is not required. {code} // To initialize scheduler scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFairSchedulerEventLog, the following is not required. In this case and the above, some tests might require calling resourceManager.startt(). 
{code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFifoScheduler, we don't need the following: {code} scheduler.setRMContext(rm.getRMContext()); {code} # TestFSLeafQueue doesn't need this either: {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, in any other places. # In TestQueueParsing, you might need to call capacityScheduler.init() in addition to or instead of {code} capacityScheduler.reinitialize(conf, null); {code} # In TestRMContainerAllocator, we might have to call init() instead of reinitialize(). # In TestRMWebApp, we should call init() instead of reinitialize() In general, in the tests, # If there is an RM / Mock RM involved, we don't have to call setRMContext and reinitialize as long as RM#init is called. # If there is no RM / Mock RM, we should call a setRMContext followed by init on the scheduler. Subsequent, calls should remain reinitialize Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components:
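A minimal sketch of the lifecycle split recommended in this review (create scheduler threads in serviceInit, start them only in serviceStart, stop them in serviceStop), using a simplified service base class rather than Hadoop's actual AbstractService:
{code}
public class SchedulerAsService {
  /** Simplified stand-in for the YARN service lifecycle. */
  abstract static class SimpleService {
    void init()  { serviceInit(); }
    void start() { serviceStart(); }
    void stop()  { serviceStop(); }
    abstract void serviceInit();
    abstract void serviceStart();
    abstract void serviceStop();
  }

  static class FairSchedulerLike extends SimpleService {
    private Thread updateThread;

    @Override void serviceInit() {
      // Create (but do not start) background threads; also the place for one-time config setup.
      updateThread = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try { Thread.sleep(500); } catch (InterruptedException e) { return; }
          // the periodic fair-share update would run here
        }
      }, "FairScheduler update");
    }

    @Override void serviceStart() { updateThread.start(); }   // threads start only here

    @Override void serviceStop()  { updateThread.interrupt(); } // symmetric cleanup on stop
  }

  public static void main(String[] args) throws Exception {
    FairSchedulerLike scheduler = new FairSchedulerLike();
    scheduler.init();     // safe on a standby RM: nothing is running yet
    scheduler.start();    // active RM: background work begins
    Thread.sleep(100);
    scheduler.stop();
  }
}
{code}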
[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-941: --- Attachment: YARN-941.preview.3.patch RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002715#comment-14002715 ] Xuan Gong commented on YARN-941: Fix some typos RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
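The token-update flow described in YARN-941 (the AM's client fetches fresh tokens via Kerberos, the AM hands them to the RM, and the RM replaces the set it renews) could look roughly like the interface below. This is purely a hypothetical sketch; the interface and method names do not come from the attached patches.
{code}
import java.nio.ByteBuffer;

// Purely hypothetical sketch of the AM-driven token refresh flow described in
// this issue; the interface and method names are not from the attached patches.
public interface TokenUpdateProtocolSketch {

  /**
   * Called by the AM after its client has fetched fresh tokens via Kerberos.
   * The RM replaces the tokens it renews for this application so that future
   * AM launches and HDFS access on the application's behalf keep working.
   */
  void updateApplicationTokens(String applicationId, ByteBuffer freshTokens);
}
{code}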
[jira] [Assigned] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang reassigned YARN-2030: --- Assignee: Binglin Chang Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Now the logic to handle different store events in handleStoreEvent() is as follows: {code} if (event.getType().equals(RMStateStoreEventType.STORE_APP) || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } ... try { if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT) || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) { ... } else { ... } } {code} This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this even if there are no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
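One way to see why a single dispatch point helps is the sketch below, which collapses the repeated getType() checks into a per-type handler lookup. This is only an illustration of the refactoring idea; the actual YARN-2030 change is expected to use YARN's StateMachine utilities, and the handler classes here are hypothetical.
{code}
import java.util.EnumMap;
import java.util.Map;

// Illustration only: collapses the repeated getType() checks into one dispatch
// point. The enum values mirror the snippet above, but the handler classes and
// dispatcher are hypothetical, not the actual RMStateStore change.
public class StoreEventDispatchSketch {

  enum StoreEventType { STORE_APP, UPDATE_APP, STORE_APP_ATTEMPT, UPDATE_APP_ATTEMPT, REMOVE_APP }

  interface StoreEventHandler {
    void handle(Object event) throws Exception;
  }

  private final Map<StoreEventType, StoreEventHandler> handlers =
      new EnumMap<StoreEventType, StoreEventHandler>(StoreEventType.class);

  public StoreEventDispatchSketch() {
    handlers.put(StoreEventType.STORE_APP, new StoreEventHandler() {
      public void handle(Object event) { /* store new application state */ }
    });
    handlers.put(StoreEventType.UPDATE_APP, new StoreEventHandler() {
      public void handle(Object event) { /* update existing application state */ }
    });
    handlers.put(StoreEventType.REMOVE_APP, new StoreEventHandler() {
      public void handle(Object event) { /* remove application state */ }
    });
    // STORE_APP_ATTEMPT / UPDATE_APP_ATTEMPT would be registered the same way.
  }

  public void handleStoreEvent(StoreEventType type, Object event) throws Exception {
    StoreEventHandler handler = handlers.get(type);
    if (handler == null) {
      throw new IllegalStateException("Unexpected store event type: " + type);
    }
    handler.handle(event);   // each event type has exactly one place that handles it
  }
}
{code}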
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002752#comment-14002752 ] Rohith commented on YARN-1366: -- bq. Rohith let me know if you mind if we add these as well to YARN-1365. Agree bq. If the AMRMClientAsync is not doing this then we should fix it. We do not need to fix this. It is handled by setting the keepRunning flag to false. bq. allow finishApplicationMaster to succeed when responseMap is set to -1 (ie not yet registered but known to be last). It would require additional state transitions: RMAppAttemptImpl : LAUNCHED - EnumSet.of(RMAppAttemptState.FINAL_SAVING, RMAppAttemptState.FINISHED) RMAppImpl : ACCEPTED - FINAL_SAVING From the above overall discussion, on resync the existing approach will be used instead of going with a new API. Please let me know if anyone has concerns with this. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
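The resync behaviour described in the issue (reset the allocate RPC sequence number to 0 and replay the full outstanding request instead of shutting down) can be sketched on the AM side as follows. The class and helper names are hypothetical and are not taken from the YARN-1366 patches or from AMRMClient.
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical AM-side handling of the resync signal; the class and helper
// names are illustrative and are not taken from the YARN-1366 patches.
public abstract class ResyncingAllocatorSketch<Req, Resp> {

  private int responseId = 0;                       // allocate RPC sequence number
  private final List<Req> outstanding = new ArrayList<Req>();

  // The actual RPC call and the "is this a resync command?" check depend on
  // the eventual protocol change, so they are left abstract here.
  protected abstract Resp callAllocate(int responseId, List<Req> asks) throws Exception;

  protected abstract boolean isResyncCommand(Resp response);

  public Resp allocate(List<Req> newAsks) throws Exception {
    outstanding.addAll(newAsks);
    Resp response = callAllocate(responseId, newAsks);
    if (isResyncCommand(response)) {
      // RM restarted: reset the sequence number to 0 and replay the entire
      // outstanding request instead of shutting the AM down.
      responseId = 0;
      response = callAllocate(responseId, outstanding);
    }
    responseId++;
    return response;
  }
}
{code}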
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002753#comment-14002753 ] Rohith commented on YARN-1366: -- The overall patch would contain MR and YARN changes. 1. MapReduce change for resending the resource request on resync. 2. AMRMClientImpl from YarnClient providing the benefit of resync. 3. ApplicationMasterService. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2075) TestRMAdminCLI consistently fail on trunk
[ https://issues.apache.org/jira/browse/YARN-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-2075: -- Attachment: YARN-2075.patch Attached a patch. - testTransitionToActive failure: Changed to use ArrayList at HAAdmin#getTargetIds. HAAdmin#getTargetIds used only Arrays.asList, which returns a fixed-size list, so an UnsupportedOperationException occurred when calling remove in HAAdmin#isOtherTargetNodeActive. - testHelp failure: Adjusted the spacing and the --forceactive message in the transitionToActive command usage test. TestRMAdminCLI consistently fail on trunk - Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Attachments: YARN-2075.patch {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
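The root cause described above is that Arrays.asList returns a fixed-size list, so remove() propagates an UnsupportedOperationException, which matches the stack trace in the report. The standalone demo below shows that failure mode and the ArrayList-copy fix; it does not reproduce the HAAdmin code itself.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

// Standalone demo of the failure mode and the fix; it does not reproduce the
// HAAdmin code itself.
public class FixedSizeListDemo {
  public static void main(String[] args) {
    Collection<String> fixed = Arrays.asList("rm1", "rm2");
    try {
      fixed.remove("rm1");       // Arrays.asList returns a fixed-size list
    } catch (UnsupportedOperationException expected) {
      System.out.println("remove() on Arrays.asList fails: " + expected);
    }

    // Copying into a new ArrayList yields a mutable list, so remove() works.
    Collection<String> mutable = new ArrayList<String>(Arrays.asList("rm1", "rm2"));
    mutable.remove("rm1");
    System.out.println("after fix: " + mutable);
  }
}
{code}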
[jira] [Commented] (YARN-1352) Recover LogAggregationService upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002797#comment-14002797 ] Ming Ma commented on YARN-1352: --- Jason, not sure if you will cover NonAggregatingLogHandler in a different jira; there is delayed task state that needs to be restored, similar to the DeletionService jira. Recover LogAggregationService upon nodemanager restart -- Key: YARN-1352 URL: https://issues.apache.org/jira/browse/YARN-1352 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe LogAggregationService state needs to be recovered as part of the work-preserving nodemanager restart feature. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002800#comment-14002800 ] Sunil G commented on YARN-2074: --- Hi Vinod, As per the description, I understand that the AM container can still get preempted as it does now, and the resulting kill/preemption should not result in job failures. In this scenario we may still kill some AM containers, which then have to be re-launched. Keeping a lower priority for all AMs may instead help to kill map/reduce containers from other applications in a similar scenario. As Carlo has mentioned in YARN-2022, there can be extreme corner cases with this approach, but it may help in avoiding the cost of re-launching the AM container. Could you please consider this point also in this Jira. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)