[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001563#comment-14001563 ] Steve Loughran commented on YARN-1372: -- how long is AM restart likely to take? Should failed AMs with the restart flag set be pushed to the front of any queues, because they are consuming so much cluster resource that finishing fast (or restarting the long-lived service) should get priority? Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, the NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM, the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-registering with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AMs about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
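A minimal, self-contained sketch of the NM-side bookkeeping the description proposes, using plain String container ids and a hypothetical ack callback rather than the real NM/RM protocol classes:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks completed containers until the RM confirms the AM has pulled them. */
public class CompletedContainerTracker {
  // Completed containers reported to the RM but not yet acknowledged as pulled by the AM.
  private final Map<String, String> pendingAck = new ConcurrentHashMap<>();

  /** Called when a container finishes; the status stays pending until acknowledged. */
  public void containerCompleted(String containerId, String status) {
    pendingAck.put(containerId, status);
  }

  /** Statuses to include in the next NM->RM heartbeat. */
  public List<String> statusesToReport() {
    return new ArrayList<>(pendingAck.values());
  }

  /** Called when the RM confirms the AM pulled these containers; only then do we forget them. */
  public void ackPulledByAM(List<String> containerIds) {
    containerIds.forEach(pendingAck::remove);
  }

  /** On NM re-registration after RM restart, resend everything still pending. */
  public List<String> statusesForReregistration() {
    return statusesToReport();
  }

  public static void main(String[] args) {
    CompletedContainerTracker t = new CompletedContainerTracker();
    t.containerCompleted("container_01", "EXITED_WITH_SUCCESS");
    t.containerCompleted("container_02", "KILLED");
    System.out.println(t.statusesToReport());          // both still pending
    t.ackPulledByAM(List.of("container_01"));
    System.out.println(t.statusesForReregistration()); // only container_02 remains
  }
}
{code}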
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001572#comment-14001572 ] Varun Vasudev commented on YARN-1937: - My feedback - 1. admins should be allowed to view all entities - the current patch only allows the owner 2. There should be a way to prevent un-authenticated users from posting entities. In the current patch, the owner is set to null but the entity is saved. Admins should be allowed to insist that users be authenticated before posting entities. Otherwise it looks fine to me. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2072) RM/NM UIs and webservices are missing vcore information
Nathan Roberts created YARN-2072: Summary: RM/NM UIs and webservices are missing vcore information Key: YARN-2072 URL: https://issues.apache.org/jira/browse/YARN-2072 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Change RM and NM UIs and webservices to include virtual cores. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002059#comment-14002059 ] Bikas Saha commented on YARN-1366: -- It would be easier for users if the RM would simply accept the first register from the app and the last finishApplicationMaster() without needing a resync. Let's say that app version 1 was running and we considered it lost because we lost network communication. So the RM started version 2 of the app. Then the RM dies. Then network connectivity for app 1 got restored. Now both v1 and v2 are trying to make allocate calls to the non-existent RM instance. When the RM comes back up, how does it differentiate between v1 and v2 and keep v2 and ask v1 to exit? Does this already work? Until now it may not have been a problem because the RM would always ask these to exit and start a new v3. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
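A hedged, self-contained sketch of the resync behavior described above on the AM side; the AMCommand enum and RM interface here are illustrative stand-ins, not the real YARN protocol types:
{code}
import java.util.ArrayList;
import java.util.List;

public class ResyncingHeartbeat {
  enum AMCommand { NONE, RESYNC, SHUTDOWN }            // illustrative, not the real YARN enum

  interface RM {                                       // hypothetical stand-in for the RM protocol
    AMCommand allocate(int responseId, List<String> asks);
  }

  private final RM rm;
  private final List<String> outstandingAsks = new ArrayList<>(); // everything not yet satisfied
  private int responseId = 0;                           // allocate RPC sequence number

  ResyncingHeartbeat(RM rm) { this.rm = rm; }

  /** One heartbeat; on RESYNC, reset the sequence number and resend all outstanding requests. */
  boolean heartbeat(List<String> newAsks) {
    outstandingAsks.addAll(newAsks);
    AMCommand cmd = rm.allocate(responseId++, newAsks);
    if (cmd == AMCommand.RESYNC) {
      responseId = 0;                                   // restart the allocate sequence from 0
      // The thread below also discusses re-registering before this resend.
      rm.allocate(responseId++, new ArrayList<>(outstandingAsks)); // resend the full outstanding ask
      return true;
    }
    return cmd != AMCommand.SHUTDOWN;                   // SHUTDOWN means this attempt should exit
  }
}
{code}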
[jira] [Created] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
Karthik Kambatla created YARN-2073: -- Summary: FairScheduler starts preempting resources even with free resources on the cluster Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
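A tiny sketch of the guard this report implies, with hypothetical memory-only accounting (the real FairScheduler decision also considers fair shares, multiple resources, and preemption timeouts):
{code}
public class PreemptionGuard {
  /**
   * Preemption should only kick in when the pending request cannot be satisfied
   * from resources that are still free on the cluster.
   */
  static boolean shouldPreempt(long pendingMB, long clusterFreeMB) {
    return pendingMB > clusterFreeMB;
  }

  public static void main(String[] args) {
    System.out.println(shouldPreempt(4096, 8192)); // false: free space is enough, don't preempt
    System.out.println(shouldPreempt(4096, 1024)); // true: request can't fit, consider preemption
  }
}
{code}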
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002140#comment-14002140 ] Vinod Kumar Vavilapalli commented on YARN-2055: --- Hi folks, I filed YARN-2074 to address the orthogonal issue of not failing apps when repeatedly preempting AM containers. Preemption: Jobs are failing due to AMs are getting launched and killed multiple times -- Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal If Queue A does not have enough capacity to run the AM, the AM will borrow capacity from queue B. In that case the AM will be killed when queue B reclaims its capacity, then launched and killed again, and eventually the job will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002141#comment-14002141 ] Vinod Kumar Vavilapalli commented on YARN-2022: --- Hi folks, I filed YARN-2074 to address the orthogonal issue of not failing apps when repeatedly preempting AM containers. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, job J3 will get killed, including its AM. It is better if the AM can be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later, when the cluster is free, maps can be allocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2074: -- Fix Version/s: (was: 2.1.0-beta) Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
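A minimal sketch of the policy YARN-2074 proposes, assuming an illustrative exit-status enum rather than the real ContainerExitStatus constants:
{code}
import java.util.List;

public class AMFailureCounter {
  enum ExitStatus { SUCCESS, FAILED, PREEMPTED, KILLED_BY_RM }   // illustrative values

  /** Count only genuine AM failures; preempted AM containers are excluded from the limit. */
  static long countTowardsMaxAttempts(List<ExitStatus> attemptExitStatuses) {
    return attemptExitStatuses.stream()
        .filter(s -> s != ExitStatus.SUCCESS)
        .filter(s -> s != ExitStatus.PREEMPTED)   // the key change proposed in this JIRA
        .count();
  }

  public static void main(String[] args) {
    List<ExitStatus> history =
        List.of(ExitStatus.PREEMPTED, ExitStatus.PREEMPTED, ExitStatus.FAILED);
    // Only one attempt counts against yarn.resourcemanager.am.max-attempts here.
    System.out.println(countTowardsMaxAttempts(history)); // 1
  }
}
{code}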
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002145#comment-14002145 ] Zhijie Shen commented on YARN-1937: --- Hi Varun, thanks for the review! W.r.t. your concern, see my comments below: bq. 1. admins should be allowed to view all entities - the current patch only allows the owner Yeah, we definitely need to allow admins as well as users/groups on the allowed access list. However, for now, since we don't have an admin module yet, I prefer to defer the admin check until we support the admin role (see YARN-2059, YARN-2060). bq. 2. There should be a way to prevent un-authenticated users from posting entities. In the current patch, the owner is set to null but the entity is saved. Admins should be allowed to insist that users be authenticated before posting entities. IMHO, we should allow un-authenticated users to post entities. Otherwise, an unsecured cluster cannot leverage the timeline service. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
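A hedged sketch of the owner-only check being discussed, with a placeholder hook for the admin and allowed-users checks that are deferred to YARN-2059/YARN-2060; all names here are illustrative, not the patch's actual classes:
{code}
public class TimelineAclsCheck {
  /**
   * Owner-only access control: the caller may read an entity only if it is the owner.
   * Admin and allowed-users/groups checks are deferred until an admin module exists.
   */
  static boolean canAccess(String caller, String entityOwner) {
    if (caller == null || entityOwner == null) {
      return false;            // unauthenticated caller or ownerless entity: deny reads
    }
    if (isAdmin(caller)) {
      return true;             // placeholder: admin support is deferred (YARN-2059/2060)
    }
    return caller.equals(entityOwner);
  }

  private static boolean isAdmin(String caller) {
    return false;              // no admin module yet, so nobody is treated as admin
  }

  public static void main(String[] args) {
    System.out.println(canAccess("zhijie", "zhijie")); // true: owner reads own entity
    System.out.println(canAccess("varun", "zhijie"));  // false: non-owner, non-admin
  }
}
{code}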
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1408: -- Target Version/s: 2.5.0 Fix Version/s: (was: 2.5.0) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable = true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Submit a big jobA to queue A which uses the full cluster capacity. Step 2: Submit a jobB to queue B which would use less than 20% of cluster capacity. The jobA tasks which use queue B capacity are then preempted and killed. This caused the problem below: 1. A new container got allocated for jobA in queue A as per a node update from an NM. 2. This container was preempted immediately. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the task to time out after 30 minutes, as this container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
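One plausible direction for this fix is to make ACQUIRED a tolerated (no-op) event in the KILLED state; the sketch below uses a simple transition table rather than the actual RMContainerImpl StateMachineFactory wiring:
{code}
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class ContainerStateTable {
  enum State { ALLOCATED, ACQUIRED, RUNNING, KILLED }
  enum Event { ACQUIRED, LAUNCHED, KILL }

  // Events that are simply ignored in a given state instead of raising an invalid-transition error.
  private static final Map<State, EnumSet<Event>> IGNORED = new EnumMap<>(State.class);
  static {
    // A container preempted right after allocation may still be reported as acquired
    // by a racing AM heartbeat; tolerate that instead of logging an error.
    IGNORED.put(State.KILLED, EnumSet.of(Event.ACQUIRED));
  }

  static State handle(State current, Event event) {
    if (IGNORED.getOrDefault(current, EnumSet.noneOf(Event.class)).contains(event)) {
      return current;                                   // no-op transition
    }
    switch (current) {
      case ALLOCATED: if (event == Event.ACQUIRED) return State.ACQUIRED; break;
      case ACQUIRED:  if (event == Event.LAUNCHED) return State.RUNNING;  break;
      default: break;
    }
    if (event == Event.KILL) return State.KILLED;       // kill/preempt is legal from any state here
    throw new IllegalStateException("Invalid event: " + event + " at " + current);
  }

  public static void main(String[] args) {
    State s = handle(State.ALLOCATED, Event.KILL);      // preempted immediately after allocation
    System.out.println(handle(s, Event.ACQUIRED));      // stays KILLED, no exception
  }
}
{code}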
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002160#comment-14002160 ] Varun Vasudev commented on YARN-1937: - {quote} IMHO, we should allow un-authenticated to post entities. Otherwise, the unsecured cluster cannot leverage the timeline service. {quote} Sorry, I should have explained myself better. You are entirely correct that unsecured clusters should be able to leverage the timeline service. My point was that in a secure cluster, the admin should be allowed to insist that all posts to the timeline server be authenticated. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002178#comment-14002178 ] Zhijie Shen commented on YARN-1937: --- bq. My point was that in a secure cluster, the admin should be allowed to insist that all posts to the timeline server be authenticated. When authentication is enabled, the putEntities API is only accessible to authenticated users. YARN-1936 is to make the client able to put the timeline data in secure mode. Therefore, we don't need to worry that un-authenticated users will post the timeline data. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002284#comment-14002284 ] Karthik Kambatla edited comment on YARN-1474 at 5/19/14 8:08 PM: - Sorry for prolonging this discussion. If we don't change the {{reinitialize}} signature, we might not need setRMContext at all. Each scheduler can (re)set the local {{RMContext}}, may be we can start with setting it only on null. None of the tests need to change, I think the patch would shrink considerably. Let us open another JIRA to revisit the ResourceScheduler API, and may be we can add the new setRMContext and update reinitialize? What do you think? was (Author: kkambatl): Sorry for the prolonging this discussion. If we don't change the {{reinitialize}} signature, we might not need setRMContext at all. Each scheduler can (re)set the local {{RMContext}}, may be we can start with setting it only on null. None of the tests need to change, I think the patch would be fairly small. Let us open another JIRA to revisit the ResourceScheduler API, and may be we can add the new setRMContext and update reinitialize? What do you think? Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2075) TestRMAdminCLI consistently fail on trunk
Zhijie Shen created YARN-2075: - Summary: TestRMAdminCLI consistently fail on trunk Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
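For reference, the UnsupportedOperationException at AbstractList.remove in this trace is the classic symptom of calling remove() on a fixed-size list such as the one returned by Arrays.asList; a minimal reproduction and the usual fix, independent of the HAAdmin code itself:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeListRemove {
  public static void main(String[] args) {
    List<String> fixed = Arrays.asList("rm1", "rm2");
    try {
      fixed.remove("rm1");                        // throws UnsupportedOperationException
    } catch (UnsupportedOperationException e) {
      System.out.println("Arrays.asList is fixed-size: " + e);
    }

    List<String> mutable = new ArrayList<>(Arrays.asList("rm1", "rm2"));
    mutable.remove("rm1");                        // fine: copy into a mutable list first
    System.out.println(mutable);                  // [rm2]
  }
}
{code}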
[jira] [Commented] (YARN-1935) Security for timeline server
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002318#comment-14002318 ] Zhijie Shen commented on YARN-1935: --- The test failure should be unrelated: YARN-2075. Security for timeline server Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Zhijie Shen Attachments: Timeline_Kerberos_DT_ACLs.patch Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002344#comment-14002344 ] Jian He commented on YARN-1366: --- bq. When the RM comes back up how does it differentiate between v1 and v2 and keep v2 and ask v1 to exit? Does this already work? There’s a response map in AMS to differentiate the attempt; I think this should work already. bq. It would be easier for users if the RM would simply accept the first register from the app and the last finishApplicationMaster() without needing a resync. agree. bq. For the case where the AM's last heartbeat has been sent to the RM, and the RM restarted before finishApplicationMaster() was called, does ApplicationMasterService send resync? Seems we have a race where the allocate call gets the resync and does the re-register even after finishApplicationMaster is called. Checked the MR code: this cannot happen because the allocate thread is interrupted and joined before calling unregister. We may document the API to say that allocate should not be called after finishApplicationMaster, or handle it explicitly in the RM? ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
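The MR behavior described above (stop the allocate heartbeat before unregistering, so a resync can never race with finishApplicationMaster) boils down to the standard interrupt-and-join pattern; a generic sketch with a hypothetical heartbeat loop:
{code}
public class HeartbeatShutdown {
  private volatile boolean running = true;

  private final Thread heartbeatThread = new Thread(() -> {
    while (running && !Thread.currentThread().isInterrupted()) {
      try {
        // the allocate() heartbeat would go here
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();       // exit the loop promptly
      }
    }
  }, "AM heartbeat");

  void start() { heartbeatThread.start(); }

  /** No allocate() can race with unregister: the heartbeat thread is stopped and joined first. */
  void unregister() throws InterruptedException {
    running = false;
    heartbeatThread.interrupt();
    heartbeatThread.join();
    // ... now it is safe to call finishApplicationMaster()
  }

  public static void main(String[] args) throws InterruptedException {
    HeartbeatShutdown am = new HeartbeatShutdown();
    am.start();
    Thread.sleep(100);
    am.unregister();
    System.out.println("unregistered after heartbeat thread joined");
  }
}
{code}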
[jira] [Created] (YARN-2076) Minor error in TestLeafQueue files
Chen He created YARN-2076: - Summary: Minor error in TestLeafQueue files Key: YARN-2076 URL: https://issues.apache.org/jira/browse/YARN-2076 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Chen He Assignee: Chen He Priority: Minor numNodes should be 2 instead of 3 in testReservationExchange() since only two nodes are defined. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002393#comment-14002393 ] Bikas Saha commented on YARN-1366: -- bq. Seems we have a race where the allocate call gets the resync and does the re-register even after finishApplicationMaster is called. Checked the MR code: this cannot happen because the allocate thread is interrupted and joined before calling unregister. We may document the API to say that allocate should not be called after finishApplicationMaster, or handle it explicitly in the RM? If the AMRMClientAsync is not doing this then we should fix it. bq. There’s a response map in AMS to differentiate the attempt, I think this should work already. That is for the running RM, right? How does the restarted RM do it? Currently, absence of an entry for that AM in the responseMap is the cause for asking the AM to resync. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002406#comment-14002406 ] Anubhav Dhoot commented on YARN-1550: - Manually tested by commenting out the line that triggers the START transition in RMAppManager submitApplication. This ensures the app is in NEW and without a currentAttempt, causing the null ref reported (which is now at line 111). this.rmContext.getDispatcher().getEventHandler().handle(new RMAppEvent(applicationId, RMAppEventType.START)); Before the fix, the web page skips rendering the FairScheduler block (some other code path is catching exceptions so that the originally reported 500 does not show up). After the fix, the FairScheduler block renders with no apps listed. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication after the code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
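A minimal sketch of the kind of null guard being verified here, assuming hypothetical app/attempt types rather than the real RMApp classes; an app still in the NEW state has no current attempt yet:
{code}
import java.util.List;
import java.util.Optional;

public class AppsBlockRendering {
  record Attempt(String id) {}
  record App(String id, Attempt currentAttempt) {}     // currentAttempt is null while in NEW

  /** Render one table row per app, skipping attempt details when no attempt exists yet. */
  static void render(List<App> apps) {
    for (App app : apps) {
      String attemptId = Optional.ofNullable(app.currentAttempt())
          .map(Attempt::id)
          .orElse("N/A");                              // previously this dereference threw an NPE
      System.out.println(app.id() + "\t" + attemptId);
    }
  }

  public static void main(String[] args) {
    render(List.of(new App("application_1", new Attempt("appattempt_1_000001")),
                   new App("application_2", null)));   // app still in NEW: renders with N/A
  }
}
{code}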
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002420#comment-14002420 ] Xuan Gong commented on YARN-941: That is fine. This proposal is only focused on updating the AMRMToken for Long Running Services. Proposal: 1. From the RM side, specifically AMRMTokenSecretManager: We need to roll over the AMRMToken periodically. We have two fields which temporarily save the currentMasterKey and nextMasterKey, and have a thread which will periodically activate the nextMasterKey (basically replace currentMasterKey with nextMasterKey). When we need to retrieve the password to do the authentication, we can compare the key_id to get the correct password. 2. ApplicationMasterService: Every time the AMRMToken has been rolled over, we can inform the AM via the regular heartbeat process. Also, we need to save the AMRMToken into the RMStateStore if it has been updated. 3. AMRMClient: When the AM gets the latest AMRMToken, it will update the token. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
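A self-contained sketch of the roll-over scheme in this proposal (current/next master key, periodic activation, password lookup by key id); the key material is a plain byte array and all names are illustrative, not the actual AMRMTokenSecretManager fields:
{code}
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RollingMasterKeys {
  record MasterKey(int keyId, byte[] secret) {}

  private final SecureRandom random = new SecureRandom();
  private final Map<Integer, MasterKey> knownKeys = new ConcurrentHashMap<>();
  private volatile MasterKey currentMasterKey;
  private volatile MasterKey nextMasterKey;             // staged, not yet active

  RollingMasterKeys() {
    currentMasterKey = newKey(1);
    knownKeys.put(currentMasterKey.keyId(), currentMasterKey);
  }

  private MasterKey newKey(int id) {
    byte[] secret = new byte[32];
    random.nextBytes(secret);
    return new MasterKey(id, secret);
  }

  /** Stage a new key; AMs learn about it via the regular allocate heartbeat. */
  synchronized void rollMasterKey() {
    nextMasterKey = newKey(currentMasterKey.keyId() + 1);
    knownKeys.put(nextMasterKey.keyId(), nextMasterKey);
  }

  /** Activate the staged key; in the proposal a background thread calls this periodically. */
  synchronized void activateNextMasterKey() {
    if (nextMasterKey != null) {
      currentMasterKey = nextMasterKey;
      nextMasterKey = null;
      // The real proposal would also persist the updated token/key to the RMStateStore here.
    }
  }

  /** Authenticate by looking up the secret for the key id carried inside the token. */
  byte[] retrievePassword(int tokenKeyId) {
    MasterKey key = knownKeys.get(tokenKeyId);
    if (key == null) throw new SecurityException("Unknown key id " + tokenKeyId);
    return key.secret();
  }

  public static void main(String[] args) {
    RollingMasterKeys mgr = new RollingMasterKeys();
    mgr.rollMasterKey();                                 // key 2 staged, advertised on heartbeats
    System.out.println(mgr.retrievePassword(1).length);  // 32: tokens on the old key still work
    mgr.activateNextMasterKey();                         // key 2 becomes the current master key
    System.out.println(mgr.retrievePassword(2).length);  // 32: tokens on the new key work too
  }
}
{code}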
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002422#comment-14002422 ] Xuan Gong commented on YARN-941: Uploaded a preview patch for the previous proposal. Will add new test cases and do more tests on real clusters. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-941: --- Attachment: YARN-941.preview.patch RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002432#comment-14002432 ] Karthik Kambatla commented on YARN-1366: With the responseMap, I think the best approach is to set the corresponding entry to -1 on resync just like we do for new apps. On register(), we set the entry to 0 and move on just like in the new app case. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002437#comment-14002437 ] Bikas Saha commented on YARN-1366: -- Then what happens when there are 2 versions of the AM running, like I mentioned in the previous comment? How do we prevent v1 from re-connecting with the RM? ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002438#comment-14002438 ] Anubhav Dhoot commented on YARN-1366: - I have a patch uploaded to [YARN-1365|https://issues.apache.org/jira/browse/YARN-1365] that does just that. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002453#comment-14002453 ] Jian He commented on YARN-1366: --- bq. That is for the running RM right? How does the restarted RM to do it? sorry, I meant if we correctly populate the responseMap back for the current active attempt on recovery. The current active attempt should get RESYNC because of the non-null entry and previous dead attempt should get SHUTDOWN because of the empty entry in responseMap. Right, we need code change. We should differentiate the two commands SHUTDOWN and RESYNC. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002455#comment-14002455 ] Karthik Kambatla commented on YARN-1366: Sorry, missed the point in your previous comment. The responseMap should keep track of the AM version, and allow resync/re-register only to the current or later version of the AM. Once the version stored is updated, we should kill/shutdown all previous versions. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002499#comment-14002499 ] Anubhav Dhoot commented on YARN-1366: - To summarize, along with the current changes in YARN-1365 (which sets responseMap to -1 in recovery, i.e. allows the latest known AM to register/finish on resync) we need 2 more changes: a) return SHUTDOWN instead of resync for an empty responseMap (i.e. for any AMs that are not known to be the latest) b) For known last AMs, b.1) allow finishApplicationMaster to succeed when responseMap is set to -1 (i.e. not yet registered but known to be last). b.2) return RESYNC for all allocate calls from known AMs that have not yet registered. b.3) allow register for a known AM after restart (already covered in 1365's current patch) [~rohithsharma] let me know if you mind adding these as well to [YARN-1365|https://issues.apache.org/jira/browse/YARN-1365]. It's needed for fixing the unit test failures in 1365's current patch and will also keep things consistent instead of splitting them across patches. We can keep this patch for all the AM side of things. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
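A compact sketch of the decision table summarized above (unknown attempt gets SHUTDOWN, the last known attempt that has not re-registered gets RESYNC), using an illustrative responseMap keyed by attempt id:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AllocateResponsePolicy {
  enum Command { OK, RESYNC, SHUTDOWN }

  // attemptId -> last responseId; -1 means "recovered/known but not yet (re-)registered".
  private final Map<String, Integer> responseMap = new ConcurrentHashMap<>();

  void recoverAttempt(String attemptId) { responseMap.put(attemptId, -1); }   // RM restart path
  void register(String attemptId)       { responseMap.put(attemptId, 0); }

  Command onAllocate(String attemptId) {
    Integer lastResponseId = responseMap.get(attemptId);
    if (lastResponseId == null) return Command.SHUTDOWN;  // stale attempt (e.g. superseded v1)
    if (lastResponseId == -1)   return Command.RESYNC;    // known last attempt, must re-register
    return Command.OK;
  }

  public static void main(String[] args) {
    AllocateResponsePolicy p = new AllocateResponsePolicy();
    p.recoverAttempt("appattempt_0001_000002");                    // v2 was active at restart
    System.out.println(p.onAllocate("appattempt_0001_000001"));    // SHUTDOWN (old v1)
    System.out.println(p.onAllocate("appattempt_0001_000002"));    // RESYNC
    p.register("appattempt_0001_000002");
    System.out.println(p.onAllocate("appattempt_0001_000002"));    // OK
  }
}
{code}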
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002594#comment-14002594 ] Tsuyoshi OZAWA commented on YARN-1474: -- {quote} If we don't change the reinitialize signature, we might not need setRMContext at all. {quote} [~kkambatl], In this case, we need to call reinitialize() directly from ResourceManager#serviceInit(). Is it acceptable for us? It means that Schedulers#serviceInit() doesn't initialize anything. If it's acceptable for us, I can fix it soon. {code} - try { -scheduler.reinitialize(conf, rmContext); - } catch (IOException ioe) { -throw new RuntimeException("Failed to initialize scheduler", ioe); - } {code} Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002615#comment-14002615 ] Karthik Kambatla commented on YARN-1474: Let me take a closer look. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002657#comment-14002657 ] Tsuyoshi OZAWA commented on YARN-1474: -- It's because serviceInit() doesn't have any interfaces to pass RMContext to schedulers. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-941: --- Attachment: YARN-941.preview.2.patch Added a testcase RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002707#comment-14002707 ] Karthik Kambatla commented on YARN-1474: Thanks [~ozawa] for your patience with the reviews. I guess we can leave setRMContext as is. And, let us handle the incompatible change to reinitialize in a separate JIRA. On an HA cluster, I noticed that the scheduler threads (FS - updateThread, continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. Ideally, the threads should start only on start(). I guess we should adopt a modified version of your earlier patch: # From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} to serviceInit. # Don't call {{reinitialize()}} in serviceInit or serviceStart. # For the individual threads in the schedulers, init them in serviceInit, but call thread.start() in serviceStart() # serviceStop() for FS looks good. We should fix the serviceStop() for CS. # In TestFairScheduler, the following is not required. {code} // To initialize scheduler scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFairSchedulerEventLog, the following is not required. In this case and the above, some tests might require calling resourceManager.startt(). {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFifoScheduler, we don't need the following: {code} scheduler.setRMContext(rm.getRMContext()); {code} # TestFSLeafQueue doesn't need this either: {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, in any other places. # In TestQueueParsing, you might need to call capacityScheduler.init() in addition to or instead of {code} capacityScheduler.reinitialize(conf, null); {code} # In TestRMContainerAllocator, we might have to call init() instead of reinitialize(). # In TestRMWebApp, we should call init() instead of reinitialize() In general, in the tests, # If there is an RM / Mock RM involved, we don't have to call setRMContext and reinitialize as long as RM#init is called. # If there is no RM / Mock RM, we should call a setRMContext followed by init on the scheduler. Subsequent, calls should remain reinitialize Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002707#comment-14002707 ] Karthik Kambatla edited comment on YARN-1474 at 5/20/14 2:01 AM: - Thanks [~ozawa] for your patience with the reviews. I guess we can leave setRMContext as is. And, let us handle the incompatible change to reinitialize in a separate JIRA. On an HA cluster, I noticed that the scheduler threads (FS - updateThread, continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. Ideally, the threads should start only on start(). I guess we should adopt a modified version of your earlier patch: # From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} to serviceInit. # Don't call {{reinitialize()}} in serviceInit or serviceStart. # For the individual threads in the schedulers, init them in serviceInit, but call thread.start() in serviceStart() # serviceStop() for FS looks good. We should fix the serviceStop() for CS. Other comments: # In TestFairScheduler, the following is not required. {code} // To initialize scheduler scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFairSchedulerEventLog, the following is not required. In this case and the above, some tests might require calling resourceManager.startt(). {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFifoScheduler, we don't need the following: {code} scheduler.setRMContext(rm.getRMContext()); {code} # TestFSLeafQueue doesn't need this either: {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, in any other places. # In TestQueueParsing, you might need to call capacityScheduler.init() in addition to or instead of {code} capacityScheduler.reinitialize(conf, null); {code} # In TestRMContainerAllocator, we might have to call init() instead of reinitialize(). # In TestRMWebApp, we should call init() instead of reinitialize() In general, in the tests, # If there is an RM / Mock RM involved, we don't have to call setRMContext and reinitialize as long as RM#init is called. # If there is no RM / Mock RM, we should call a setRMContext followed by init on the scheduler. Subsequent, calls should remain reinitialize was (Author: kkambatl): Thanks [~ozawa] for your patience with the reviews. I guess we can leave setRMContext as is. And, let us handle the incompatible change to reinitialize in a separate JIRA. On an HA cluster, I noticed that the scheduler threads (FS - updateThread, continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. Ideally, the threads should start only on start(). I guess we should adopt a modified version of your earlier patch: # From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} to serviceInit. # Don't call {{reinitialize()}} in serviceInit or serviceStart. # For the individual threads in the schedulers, init them in serviceInit, but call thread.start() in serviceStart() # serviceStop() for FS looks good. We should fix the serviceStop() for CS. # In TestFairScheduler, the following is not required. {code} // To initialize scheduler scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFairSchedulerEventLog, the following is not required. In this case and the above, some tests might require calling resourceManager.startt(). 
{code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestFifoScheduler, we don't need the following: {code} scheduler.setRMContext(rm.getRMContext()); {code} # TestFSLeafQueue doesn't need this either: {code} scheduler.serviceInit(conf); scheduler.setRMContext(resourceManager.getRMContext()); {code} # In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, in any other places. # In TestQueueParsing, you might need to call capacityScheduler.init() in addition to or instead of {code} capacityScheduler.reinitialize(conf, null); {code} # In TestRMContainerAllocator, we might have to call init() instead of reinitialize(). # In TestRMWebApp, we should call init() instead of reinitialize() In general, in the tests, # If there is an RM / Mock RM involved, we don't have to call setRMContext and reinitialize as long as RM#init is called. # If there is no RM / Mock RM, we should call a setRMContext followed by init on the scheduler. Subsequent, calls should remain reinitialize Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components:
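A minimal sketch of the lifecycle split recommended in this review (create scheduler threads in serviceInit, start them only in serviceStart, stop them in serviceStop), using a simplified service base class rather than Hadoop's actual AbstractService:
{code}
public class SchedulerAsService {
  /** Simplified stand-in for the YARN service lifecycle. */
  abstract static class SimpleService {
    void init()  { serviceInit(); }
    void start() { serviceStart(); }
    void stop()  { serviceStop(); }
    abstract void serviceInit();
    abstract void serviceStart();
    abstract void serviceStop();
  }

  static class FairSchedulerLike extends SimpleService {
    private Thread updateThread;

    @Override void serviceInit() {
      // Create (but do not start) background threads; also the place for one-time config setup.
      updateThread = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try { Thread.sleep(500); } catch (InterruptedException e) { return; }
          // the periodic fair-share update would run here
        }
      }, "FairScheduler update");
    }

    @Override void serviceStart() { updateThread.start(); }   // threads start only here

    @Override void serviceStop()  { updateThread.interrupt(); } // symmetric cleanup on stop
  }

  public static void main(String[] args) throws Exception {
    FairSchedulerLike scheduler = new FairSchedulerLike();
    scheduler.init();     // safe on a standby RM: nothing is running yet
    scheduler.start();    // active RM: background work begins
    Thread.sleep(100);
    scheduler.stop();
  }
}
{code}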
[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-941: --- Attachment: YARN-941.preview.3.patch RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002715#comment-14002715 ] Xuan Gong commented on YARN-941: Fix some typos RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
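The token-update flow described in YARN-941 (the AM's client fetches fresh tokens via Kerberos, the AM hands them to the RM, and the RM replaces the set it renews) could look roughly like the interface below. This is purely a hypothetical sketch; the interface and method names do not come from the attached patches.
{code}
import java.nio.ByteBuffer;

// Purely hypothetical sketch of the AM-driven token refresh flow described in
// this issue; the interface and method names are not from the attached patches.
public interface TokenUpdateProtocolSketch {

  /**
   * Called by the AM after its client has fetched fresh tokens via Kerberos.
   * The RM replaces the tokens it renews for this application so that future
   * AM launches and HDFS access on the application's behalf keep working.
   */
  void updateApplicationTokens(String applicationId, ByteBuffer freshTokens);
}
{code}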
[jira] [Assigned] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang reassigned YARN-2030: --- Assignee: Binglin Chang Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Now the logic to handle different store events in handleStoreEvent() is as follows: {code} if (event.getType().equals(RMStateStoreEventType.STORE_APP) || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } ... try { if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT) || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) { ... } else { ... } } {code} This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this even if there are no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
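One way to see why a single dispatch point helps is the sketch below, which collapses the repeated getType() checks into a per-type handler lookup. This is only an illustration of the refactoring idea; the actual YARN-2030 change is expected to use YARN's StateMachine utilities, and the handler classes here are hypothetical.
{code}
import java.util.EnumMap;
import java.util.Map;

// Illustration only: collapses the repeated getType() checks into one dispatch
// point. The enum values mirror the snippet above, but the handler classes and
// dispatcher are hypothetical, not the actual RMStateStore change.
public class StoreEventDispatchSketch {

  enum StoreEventType { STORE_APP, UPDATE_APP, STORE_APP_ATTEMPT, UPDATE_APP_ATTEMPT, REMOVE_APP }

  interface StoreEventHandler {
    void handle(Object event) throws Exception;
  }

  private final Map<StoreEventType, StoreEventHandler> handlers =
      new EnumMap<StoreEventType, StoreEventHandler>(StoreEventType.class);

  public StoreEventDispatchSketch() {
    handlers.put(StoreEventType.STORE_APP, new StoreEventHandler() {
      public void handle(Object event) { /* store new application state */ }
    });
    handlers.put(StoreEventType.UPDATE_APP, new StoreEventHandler() {
      public void handle(Object event) { /* update existing application state */ }
    });
    handlers.put(StoreEventType.REMOVE_APP, new StoreEventHandler() {
      public void handle(Object event) { /* remove application state */ }
    });
    // STORE_APP_ATTEMPT / UPDATE_APP_ATTEMPT would be registered the same way.
  }

  public void handleStoreEvent(StoreEventType type, Object event) throws Exception {
    StoreEventHandler handler = handlers.get(type);
    if (handler == null) {
      throw new IllegalStateException("Unexpected store event type: " + type);
    }
    handler.handle(event);   // each event type has exactly one place that handles it
  }
}
{code}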
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002752#comment-14002752 ] Rohith commented on YARN-1366: -- bq. Rohith let me know if you mind if we add these as well to YARN-1365. Agree bq. If the AMRMClientAsync is not doing this then we should fix it. We do not need to fix this. It is handled by setting the keepRunning flag to false. bq. allow finishApplicationMaster to succeed when responseMap is set to -1 (ie not yet registered but known to be last). It would require additional state transitions: RMAppAttemptImpl : LAUNCHED - EnumSet.of(RMAppAttemptState.FINAL_SAVING, RMAppAttemptState.FINISHED) RMAppImpl : ACCEPTED - FINAL_SAVING From the above overall discussion, on resync the existing approach will be used instead of going with a new API. Please let me know if anyone has concerns with this. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
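The resync behaviour described in the issue (reset the allocate RPC sequence number to 0 and replay the full outstanding request instead of shutting down) can be sketched on the AM side as follows. The class and helper names are hypothetical and are not taken from the YARN-1366 patches or from AMRMClient.
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical AM-side handling of the resync signal; the class and helper
// names are illustrative and are not taken from the YARN-1366 patches.
public abstract class ResyncingAllocatorSketch<Req, Resp> {

  private int responseId = 0;                       // allocate RPC sequence number
  private final List<Req> outstanding = new ArrayList<Req>();

  // The actual RPC call and the "is this a resync command?" check depend on
  // the eventual protocol change, so they are left abstract here.
  protected abstract Resp callAllocate(int responseId, List<Req> asks) throws Exception;

  protected abstract boolean isResyncCommand(Resp response);

  public Resp allocate(List<Req> newAsks) throws Exception {
    outstanding.addAll(newAsks);
    Resp response = callAllocate(responseId, newAsks);
    if (isResyncCommand(response)) {
      // RM restarted: reset the sequence number to 0 and replay the entire
      // outstanding request instead of shutting the AM down.
      responseId = 0;
      response = callAllocate(responseId, outstanding);
    }
    responseId++;
    return response;
  }
}
{code}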
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002753#comment-14002753 ] Rohith commented on YARN-1366: -- The overall patch would contain MR and YARN changes. 1. MapReduce change for resending the resource request on resync. 2. AMRMClientImpl from YarnClient providing the benefit of resync. 3. ApplicationMasterService. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2075) TestRMAdminCLI consistently fail on trunk
[ https://issues.apache.org/jira/browse/YARN-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-2075: -- Attachment: YARN-2075.patch Attached a patch. - testTransitionToActive failure: Changed to use ArrayList at HAAdmin#getTargetIds. HAAdmin#getTargetIds used only Arrays.asList, which returns a fixed-size list, so an UnsupportedOperationException occurred when calling remove in HAAdmin#isOtherTargetNodeActive. - testHelp failure: Adjusted the spacing and the --forceactive message in the transitionToActive command usage test. TestRMAdminCLI consistently fail on trunk - Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Attachments: YARN-2075.patch {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
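The root cause described above is that Arrays.asList returns a fixed-size list, so remove() propagates an UnsupportedOperationException, which matches the stack trace in the report. The standalone demo below shows that failure mode and the ArrayList-copy fix; it does not reproduce the HAAdmin code itself.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

// Standalone demo of the failure mode and the fix; it does not reproduce the
// HAAdmin code itself.
public class FixedSizeListDemo {
  public static void main(String[] args) {
    Collection<String> fixed = Arrays.asList("rm1", "rm2");
    try {
      fixed.remove("rm1");       // Arrays.asList returns a fixed-size list
    } catch (UnsupportedOperationException expected) {
      System.out.println("remove() on Arrays.asList fails: " + expected);
    }

    // Copying into a new ArrayList yields a mutable list, so remove() works.
    Collection<String> mutable = new ArrayList<String>(Arrays.asList("rm1", "rm2"));
    mutable.remove("rm1");
    System.out.println("after fix: " + mutable);
  }
}
{code}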
[jira] [Commented] (YARN-1352) Recover LogAggregationService upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002797#comment-14002797 ] Ming Ma commented on YARN-1352: --- Jason, not sure if you will cover NonAggregatingLogHandler in a different jira; there is delayed task state that needs to be restored, similar to the DeletionService jira. Recover LogAggregationService upon nodemanager restart -- Key: YARN-1352 URL: https://issues.apache.org/jira/browse/YARN-1352 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe LogAggregationService state needs to be recovered as part of the work-preserving nodemanager restart feature. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002800#comment-14002800 ] Sunil G commented on YARN-2074: --- Hi Vinod, As per the description, I understand that the AM container can still get preempted as it does now, and the resulting kill/preemption should not result in job failures. In this scenario we may still kill some AM containers, which then have to be re-launched. Keeping a lower priority for all AMs may instead help to kill map/reduce containers from other applications in a similar scenario. As Carlo has mentioned in YARN-2022, there can be extreme corner cases with this approach, but it may help in avoiding the cost of re-launching the AM container. Could you please consider this point also in this Jira. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)