[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-05-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001563#comment-14001563
 ] 

Steve Loughran commented on YARN-1372:
--

How long is AM restart likely to take? Should failed AMs with the restart flag 
set be pushed to the front of any queues? Because they are consuming so much 
cluster resource, finishing fast (or restarting the long-lived service) should 
get priority.

 Ensure all completed containers are reported to the AMs across RM restart
 -

 Key: YARN-1372
 URL: https://issues.apache.org/jira/browse/YARN-1372
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot

 Currently the NM informs the RM about completed containers and then removes 
 those containers from the RM notification list. The RM passes on that 
 completed container information to the AM and the AM pulls this data. If the 
 RM dies before the AM pulls this data then the AM may not be able to get this 
 information again. To fix this, NM should maintain a separate list of such 
 completed container notifications sent to the RM. After the AM has pulled the 
 containers from the RM then the RM will inform the NM about it and the NM can 
 remove the completed container from the new list. Upon re-register with the 
 RM (after RM restart) the NM should send the entire list of completed 
 containers to the RM along with any other containers that completed while the 
 RM was dead. This ensures that the RM can inform the AM's about all completed 
 containers. Some container completions may be reported more than once since 
 the AM may have pulled the container but the RM may die before notifying the 
 NM about the pull.
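
A rough sketch of the NM-side bookkeeping described above (hypothetical class and 
method names, not the actual NodeManager code):

{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CompletedContainerTracker<C> {
  // completed containers not yet acknowledged as pulled by the AM
  private final Set<C> pendingAMAck =
      Collections.newSetFromMap(new ConcurrentHashMap<C, Boolean>());

  void containerCompleted(C containerId) {
    pendingAMAck.add(containerId);          // re-reported on each heartbeat until acked
  }

  void rmConfirmedAMPull(Collection<C> acked) {
    pendingAMAck.removeAll(acked);          // the RM says the AM has pulled these
  }

  List<C> listForReRegister() {
    return new ArrayList<C>(pendingAMAck);  // sent in full when re-registering after RM restart
  }
}
{code}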



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only

2014-05-19 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001572#comment-14001572
 ] 

Varun Vasudev commented on YARN-1937:
-

My feedback -
1. Admins should be allowed to view all entities - the current patch only 
allows the owner.
2. There should be a way to prevent un-authenticated users from posting 
entities. In the current patch, the owner is set to null but the entity is 
saved. Admins should be allowed to insist that users be authenticated before 
posting entities.

Otherwise it looks fine to me. 

 Add entity-level access control of the timeline data for owners only
 

 Key: YARN-1937
 URL: https://issues.apache.org/jira/browse/YARN-1937
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2072) RM/NM UIs and webservices are missing vcore information

2014-05-19 Thread Nathan Roberts (JIRA)
Nathan Roberts created YARN-2072:


 Summary: RM/NM UIs and webservices are missing vcore information
 Key: YARN-2072
 URL: https://issues.apache.org/jira/browse/YARN-2072
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


Change RM and NM UIs and webservices to include virtual cores.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002059#comment-14002059
 ] 

Bikas Saha commented on YARN-1366:
--

It would be easier for users if the RM would simply accept the first register 
from the app and the last finishApplicationMaster() without needing a resync.
Let's say that app version 1 was running and we considered it lost because we 
lost network communication. So the RM started version 2 of the app. Then the RM 
dies. Then network connectivity for app 1 got restored. Now both v1 and v2 are 
trying to make allocate calls to the non-existent RM instance. When the RM 
comes back up, how does it differentiate between v1 and v2, keep v2, and ask 
v1 to exit? Does this already work? Until now it may not have been a problem 
because the RM would always ask these to exit and start a new v3.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.
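
 A rough sketch of the AM-side resync behaviour described above (the class and 
 fields are illustrative; AllocateRequest#newInstance is the real API):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

class ResyncSketch {
  private int responseId = 0;                       // allocate RPC sequence number
  private final List<ResourceRequest> outstandingAsk =
      new ArrayList<ResourceRequest>();             // everything not yet satisfied

  AllocateRequest buildRequestAfterResync(float progress) {
    responseId = 0;                                 // reset the sequence number to 0
    // resend the entire outstanding request; release nothing, no blacklist change
    return AllocateRequest.newInstance(responseId, progress,
        outstandingAsk, new ArrayList<ContainerId>(), null);
  }
}
{code}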



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster

2014-05-19 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-2073:
--

 Summary: FairScheduler starts preempting resources even with free 
resources on the cluster
 Key: YARN-2073
 URL: https://issues.apache.org/jira/browse/YARN-2073
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical


Preemption should kick in only when the currently available slots don't match 
the request. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2014-05-19 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002140#comment-14002140
 ] 

Vinod Kumar Vavilapalli commented on YARN-2055:
---

Hi folks, I filed YARN-2074 to address the orthogonal issue of not failing apps 
when repeatedly preempting AM containers.

 Preemption: Jobs are failing due to AMs are getting launched and killed 
 multiple times
 --

 Key: YARN-2055
 URL: https://issues.apache.org/jira/browse/YARN-2055
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Mayank Bansal

 If Queue A does not have enough capacity to run the AM, the AM will borrow 
 capacity from queue B. In that case, the AM will be killed when queue B 
 reclaims its capacity, then launched and killed again, and eventually the 
 job will fail.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-05-19 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002141#comment-14002141
 ] 

Vinod Kumar Vavilapalli commented on YARN-2022:
---

Hi folks, I filed YARN-2074 to address the orthogonal issue of not failing apps 
when repeatedly preempting AM containers.

 Preempting an Application Master container can be kept as least priority when 
 multiple applications are marked for preemption by 
 ProportionalCapacityPreemptionPolicy
 -

 Key: YARN-2022
 URL: https://issues.apache.org/jira/browse/YARN-2022
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-2022.1.patch


 Cluster Size = 16GB [2NM's]
 Queue A Capacity = 50%
 Queue B Capacity = 50%
 Consider there are 3 applications running in Queue A which has taken the full 
 cluster capacity. 
 J1 = 2GB AM + 1GB * 4 Maps
 J2 = 2GB AM + 1GB * 4 Maps
 J3 = 2GB AM + 1GB * 2 Maps
 Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
 Currently in this scenario, job J3 will get killed, including its AM.
 It is better if AM can be given least priority among multiple applications. 
 In this same scenario, map tasks from J3 and J2 can be preempted.
 Later when cluster is free, maps can be allocated to these Jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-05-19 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2074:
--

Fix Version/s: (was: 2.1.0-beta)

 Preemption of AM containers shouldn't count towards AM failures
 ---

 Key: YARN-2074
 URL: https://issues.apache.org/jira/browse/YARN-2074
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
 containers getting preempted shouldn't count towards AM failures and thus 
 shouldn't eventually fail applications.
 We should explicitly handle AM container preemption/kill as a separate issue 
 and not count it towards the limit on AM failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only

2014-05-19 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002145#comment-14002145
 ] 

Zhijie Shen commented on YARN-1937:
---

Hi Varun, thanks for the review! W.r.t. your concerns, see my comments below:

bq. 1. admins should be allowed to view all entities - the current patch only 
allows the owner

Yeah, we definitely need to allow admins as well as users/groups on the allowed 
access list. However, since we don't have an admin module yet, I prefer to 
defer the admin check until we support the admin role (see YARN-2059, YARN-2060).

bq. 2. There should be a way to prevent un-authenticated users from posting 
entities. In the current patch, the owner is set to null but the entity is 
saved. Admins should be allowed to insist that users be authenticated before 
posting entities.

IMHO, we should allow un-authenticated users to post entities. Otherwise, an 
unsecured cluster cannot leverage the timeline service.

 Add entity-level access control of the timeline data for owners only
 

 Key: YARN-1937
 URL: https://issues.apache.org/jira/browse/YARN-1937
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins

2014-05-19 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1408:
--

Target Version/s: 2.5.0
   Fix Version/s: (was: 2.5.0)

 Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
 timeout for 30mins
 --

 Key: YARN-1408
 URL: https://issues.apache.org/jira/browse/YARN-1408
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.2.0
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
 Yarn-1408.4.patch, Yarn-1408.patch


 Capacity preemption is enabled as follows.
  *  yarn.resourcemanager.scheduler.monitor.enable=true,
  *  
 yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
 Queue = a,b
 Capacity of Queue A = 80%
 Capacity of Queue B = 20%
 Step 1: Assign a big jobA to queue a which uses the full cluster capacity.
 Step 2: Submit a jobB to queue b which would use less than 20% of the cluster 
 capacity.
 The jobA tasks that use queue b's capacity are preempted and killed.
 This caused the problem below:
 1. A new container got allocated for jobA in Queue A as per a node update 
 from an NM.
 2. This container was preempted immediately by the preemption policy.
 Here the ACQUIRED at KILLED invalid state exception came when the next AM 
 heartbeat reached the RM.
 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 ACQUIRED at KILLED
 This also caused the Task to go for a timeout for 30minutes as this Container 
 was already killed by preemption.
 attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only

2014-05-19 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002160#comment-14002160
 ] 

Varun Vasudev commented on YARN-1937:
-

{quote}
IMHO, we should allow un-authenticated users to post entities. Otherwise, an 
unsecured cluster cannot leverage the timeline service.
{quote}

Sorry, I should have explained myself better. You are entirely correct that 
unsecured clusters should be able to leverage the timeline service. My point 
was that in a secure cluster, the admin should be allowed to insist that all 
posts to the timeline server be authenticated.

 Add entity-level access control of the timeline data for owners only
 

 Key: YARN-1937
 URL: https://issues.apache.org/jira/browse/YARN-1937
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1937) Add entity-level access control of the timeline data for owners only

2014-05-19 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002178#comment-14002178
 ] 

Zhijie Shen commented on YARN-1937:
---

bq. My point was that in a secure cluster, the admin should be allowed to 
insist that all posts to the timeline server be authenticated.

When authentication is enabled, the putEntities API is only accessible to 
authenticated users. YARN-1936 is to make the client able to put the 
timeline data in secure mode. Therefore, we don't need to worry about 
un-authenticated users posting the timeline data.

 Add entity-level access control of the timeline data for owners only
 

 Key: YARN-1937
 URL: https://issues.apache.org/jira/browse/YARN-1937
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (YARN-1474) Make schedulers services

2014-05-19 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002284#comment-14002284
 ] 

Karthik Kambatla edited comment on YARN-1474 at 5/19/14 8:08 PM:
-

Sorry for prolonging this discussion. If we don't change the {{reinitialize}} 
signature, we might not need setRMContext at all. Each scheduler can (re)set 
the local {{RMContext}}; maybe we can start with setting it only when it is 
null. None of the tests would need to change, and I think the patch would 
shrink considerably. 

Let us open another JIRA to revisit the ResourceScheduler API, and maybe we 
can add the new setRMContext and update reinitialize there? What do you think? 

was (Author: kkambatl):
Sorry for the prolonging this discussion. If we don't change the 
{{reinitialize}} signature, we might not need setRMContext at all. Each 
scheduler can (re)set the local {{RMContext}}, may be we can start with setting 
it only on null. None of the tests need to change, I think the patch would be 
fairly small. 

Let us open another JIRA to revisit the ResourceScheduler API, and may be we 
can add the new setRMContext and update reinitialize? What do you think? 

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2075) TestRMAdminCLI consistently fail on trunk

2014-05-19 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2075:
-

 Summary: TestRMAdminCLI consistently fail on trunk
 Key: YARN-2075
 URL: https://issues.apache.org/jira/browse/YARN-2075
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen


{code}
Running org.apache.hadoop.yarn.client.TestRMAdminCLI
Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec  
FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI
testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI)  Time 
elapsed: 0.082 sec   ERROR!
java.lang.UnsupportedOperationException: null
at java.util.AbstractList.remove(AbstractList.java:144)
at java.util.AbstractList$Itr.remove(AbstractList.java:360)
at java.util.AbstractCollection.remove(AbstractCollection.java:252)
at 
org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173)
at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144)
at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447)
at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380)
at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318)
at 
org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180)

testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI)  Time elapsed: 0.088 sec 
  FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366)
at 
org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307)
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1935) Security for timeline server

2014-05-19 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002318#comment-14002318
 ] 

Zhijie Shen commented on YARN-1935:
---

The test failure should be unrelated: YARN-2075.

 Security for timeline server
 

 Key: YARN-1935
 URL: https://issues.apache.org/jira/browse/YARN-1935
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Arun C Murthy
Assignee: Zhijie Shen
 Attachments: Timeline_Kerberos_DT_ACLs.patch


 Jira to track work to secure the ATS



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002344#comment-14002344
 ] 

Jian He commented on YARN-1366:
---

bq. When the RM comes back up, how does it differentiate between v1 and v2, 
keep v2, and ask v1 to exit? Does this already work?
There's a response map in AMS to differentiate the attempts; I think this 
should already work.
bq. It would be easier for users if the RM would simply accept the first 
register from the app and the last finishApplicationMaster() without needing a 
resync.
Agree.
bq. For the case where the AM's last heartbeat has been sent to the RM, and the 
RM restarted before finishApplicationMaster() was called, does 
ApplicationMasterService send a resync?
It seems we have a race where the allocate call gets the resync and does the 
re-register even after finishApplicationMaster is called. I checked the MR code; 
this cannot happen there because the allocate thread is interrupted and joined 
before calling unregister. We may document the API to say that allocate should 
not be called after finishApplicationMaster, or handle it explicitly in the RM?
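
For reference, the ordering being relied on looks roughly like this (thread and 
field names are illustrative, not the actual MR code; unregisterApplicationMaster 
is the real AMRMClient call):

{code}
// Illustrative ordering only, not the actual MR AM implementation.
void shutDown() throws Exception {
  allocatorThread.interrupt();    // stop issuing allocate() heartbeats
  allocatorThread.join();         // ensure no allocate call is still in flight
  // only now unregister, so a late allocate cannot race with
  // finishApplicationMaster() and trigger an unexpected resync/re-register
  amRMClient.unregisterApplicationMaster(
      FinalApplicationStatus.SUCCEEDED, "", null);
}
{code}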

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2076) Minor error in TestLeafQueue files

2014-05-19 Thread Chen He (JIRA)
Chen He created YARN-2076:
-

 Summary: Minor error in TestLeafQueue files
 Key: YARN-2076
 URL: https://issues.apache.org/jira/browse/YARN-2076
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Chen He
Assignee: Chen He
Priority: Minor


numNodes should be 2 instead of 3 in testReservationExchange() since only two 
nodes are defined.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002393#comment-14002393
 ] 

Bikas Saha commented on YARN-1366:
--

bq. It seems we have a race where the allocate call gets the resync and does 
the re-register even after finishApplicationMaster is called. I checked the MR 
code; this cannot happen there because the allocate thread is interrupted and 
joined before calling unregister. We may document the API to say that allocate 
should not be called after finishApplicationMaster, or handle it explicitly in 
the RM?
If the AMRMClientAsync is not doing this then we should fix it.

bq.There’s a response map in AMS to differentiate the attempt, I think this 
should work already.
That is for the running RM right? How does the restarted RM to do it? 
Currently, absence of an entry for that AM in the responseMap is the cause for 
asking the AM to resync.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render

2014-05-19 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002406#comment-14002406
 ] 

Anubhav Dhoot commented on YARN-1550:
-

Manually tested by commenting out the line that triggers the START transition 
in RMAppManager#submitApplication, i.e.
this.rmContext.getDispatcher().getEventHandler().handle(new 
RMAppEvent(applicationId, RMAppEventType.START));
This ensures the app stays in NEW without a currentAttempt, causing the null 
ref reported (which is now at line 111).

Before the fix, the web page skips rendering the FairScheduler block (some other 
code path is catching exceptions, so the originally reported 500 does not 
show up). After the fix, the FairScheduler block renders with no apps listed.
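
For context, the guard is roughly along these lines (a sketch only; the actual 
patch may differ in how it renders such apps):

{code}
// Apps still in NEW have no current attempt; skip them instead of hitting an NPE
// while iterating over the apps in FairSchedulerAppsBlock#render.
RMAppAttempt attempt = app.getCurrentAppAttempt();
if (attempt == null) {
  continue;   // the rendered table simply omits such apps
}
{code}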

 NPE in FairSchedulerAppsBlock#render
 

 Key: YARN-1550
 URL: https://issues.apache.org/jira/browse/YARN-1550
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
Reporter: caolong
Priority: Critical
 Fix For: 2.2.1

 Attachments: YARN-1550.001.patch, YARN-1550.patch


 three Steps :
 1、debug at RMAppManager#submitApplication after code
 if (rmContext.getRMApps().putIfAbsent(applicationId, application) !=
 null) {
   String message = "Application with id " + applicationId
   + " is already present! Cannot add a duplicate!";
   LOG.warn(message);
   throw RPCUtil.getRemoteException(message);
 }
 2、submit one application:hadoop jar 
 ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar
  sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 
 -r 1
 3、go in page :http://ip:50030/cluster/scheduler and find 500 ERROR!
 the log:
 {noformat}
 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error 
 handling URI: /cluster/scheduler
 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96)
   at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
   at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-05-19 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002420#comment-14002420
 ] 

Xuan Gong commented on YARN-941:


That is fine. 
This proposal is only focused on updating the AMRMToken for long-running services.

Proposal:
1. On the RM side, specifically in AMRMTokenSecretManager:
We need to roll over the AMRMToken periodically. We keep two fields that 
temporarily hold the currentMasterKey and the nextMasterKey, and have a thread 
that periodically activates the nextMasterKey (basically replacing the 
currentMasterKey with the nextMasterKey). When we need to retrieve the password 
to do the authentication, we compare the key_id to get the correct password. 
(A rough sketch follows below.)

2. ApplicationMasterService:
Every time the AMRMToken has been rolled over, we can inform the AM through the 
regular heartbeat process. Also, we need to save the AMRMToken into the 
RMStateStore if it has been updated.

3. AMRMClient:
When the AM gets the latest AMRMToken, it will update the token.
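
A rough sketch of (1); all class, field, and method names below are illustrative 
stand-ins, not the real AMRMTokenSecretManager internals:

{code}
import java.security.SecureRandom;
import java.util.Arrays;

class RollingAMRMKeySketch {
  static final class MasterKey {
    final int keyId;
    final byte[] secret = new byte[64];
    MasterKey(int keyId) { this.keyId = keyId; new SecureRandom().nextBytes(secret); }
  }

  private volatile MasterKey currentMasterKey = new MasterKey(1);
  private volatile MasterKey nextMasterKey;

  synchronized void rollMasterKey() {            // called from a periodic thread
    nextMasterKey = new MasterKey(currentMasterKey.keyId + 1);
  }

  synchronized void activateNextMasterKey() {    // called after a grace period
    if (nextMasterKey != null) {
      currentMasterKey = nextMasterKey;
      nextMasterKey = null;
    }
  }

  byte[] retrievePassword(int tokenKeyId, byte[] tokenIdentifier) {
    MasterKey next = nextMasterKey;
    // compare the key_id carried by the token to pick the right key
    MasterKey key = (next != null && next.keyId == tokenKeyId) ? next : currentMasterKey;
    return derivePassword(tokenIdentifier, key.secret);
  }

  private byte[] derivePassword(byte[] data, byte[] secret) {
    // stand-in for the real HMAC-based password derivation
    byte[] out = Arrays.copyOf(data, data.length);
    for (int i = 0; i < out.length; i++) out[i] ^= secret[i % secret.length];
    return out;
  }
}
{code}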


 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong

 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-05-19 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002422#comment-14002422
 ] 

Xuan Gong commented on YARN-941:


Uploaded a preview patch for the previous proposal. 
Will add new test cases and do more tests on real clusters. 

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.patch


 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-05-19 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-941:
---

Attachment: YARN-941.preview.patch

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.patch


 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002432#comment-14002432
 ] 

Karthik Kambatla commented on YARN-1366:


With the responseMap, I think the best approach is to set the corresponding 
entry to -1 on resync just like we do for new apps. On register(), we set the 
entry to 0 and move on just like in the new app case.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002437#comment-14002437
 ] 

Bikas Saha commented on YARN-1366:
--

Then what happens when there are 2 versions of the AM running, like I mentioned 
in the previous comment? How do we prevent v1 from re-connecting with the RM?

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002438#comment-14002438
 ] 

Anubhav Dhoot commented on YARN-1366:
-

I have a patch uploaded to 
[YARN-1365|https://issues.apache.org/jira/browse/YARN-1365] that does just that.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002453#comment-14002453
 ] 

Jian He commented on YARN-1366:
---

bq. That is for the running RM, right? How does the restarted RM do it?
Sorry, I meant that we should correctly re-populate the responseMap for the 
current active attempt on recovery. The current active attempt should then get 
RESYNC because of its non-null entry, and a previous dead attempt should get 
SHUTDOWN because of the missing entry in the responseMap. Right, we need a code 
change; we should differentiate the two commands SHUTDOWN and RESYNC.



 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002455#comment-14002455
 ] 

Karthik Kambatla commented on YARN-1366:


Sorry, missed the point in your previous comment. 

The responseMap should keep track of the AM version, and allow 
resync/re-register only to the current or later version of the AM. Once the 
version stored is updated, we should kill/shutdown all previous versions. 

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002499#comment-14002499
 ] 

Anubhav Dhoot commented on YARN-1366:
-

To summarize: along with the current changes in YARN-1365 (which set the 
responseMap entry to -1 on recovery, i.e. allow the latest known AM to 
register/finish on resync), we need 2 more changes (sketched below):
a) return SHUTDOWN instead of RESYNC when there is no responseMap entry (i.e. 
for any AM that is not known to be the latest)
b) For the known last AMs,
b.1) allow finishApplicationMaster to succeed when the responseMap entry is set 
to -1 (i.e. not yet registered but known to be the last). 
b.2) return RESYNC for all allocate calls from known AMs that have not yet 
registered. 
b.3) allow register for the known AM after restart (already covered in 1365's 
current patch)
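
An illustrative sketch of these rules (not the actual ApplicationMasterService 
code; names are made up for clarity):

{code}
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

class ResyncRulesSketch {
  enum Command { SHUTDOWN, RESYNC, NORMAL }

  // responseMap entry: missing = not the last known attempt,
  // -1 = last known attempt recovered but not yet re-registered, >= 0 = registered
  Command onAllocate(Map<ApplicationAttemptId, Integer> responseMap,
                     ApplicationAttemptId attemptId) {
    Integer lastResponseId = responseMap.get(attemptId);
    if (lastResponseId == null) {
      return Command.SHUTDOWN;   // (a) AM not known to be the latest
    }
    if (lastResponseId == -1) {
      return Command.RESYNC;     // (b.2) known last AM that has not registered yet
    }
    return Command.NORMAL;       // registered attempt, normal allocate handling
  }

  boolean canFinish(Map<ApplicationAttemptId, Integer> responseMap,
                    ApplicationAttemptId attemptId) {
    return responseMap.containsKey(attemptId);   // (b.1) -1 is acceptable here
  }
}
{code}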

[~rohithsharma] let me know if you mind if we add these as well to 
[YARN-1365|https://issues.apache.org/jira/browse/YARN-1365]. It's needed for 
fixing the unit test failures in 1365's current patch and will also keep things 
consistent instead of split across patches. We can keep this patch for all the 
AM side of things. 

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 calling resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002594#comment-14002594
 ] 

Tsuyoshi OZAWA commented on YARN-1474:
--

{quote}
If we don't change the reinitialize signature, we might not need setRMContext 
at all. 
{quote}

[~kkambatl], In this case, we need to call reinitialize() directly from 
ResourceManager#serviceInit(). Is it acceptable for us? It means that 
Schedulers#serviceInit() doesn't initialize anything. If it's acceptable for 
us, I can fix it soon.

{code}
-  try {
-    scheduler.reinitialize(conf, rmContext);
-  } catch (IOException ioe) {
-    throw new RuntimeException("Failed to initialize scheduler", ioe);
-  }
{code}

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-19 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002615#comment-14002615
 ] 

Karthik Kambatla commented on YARN-1474:


Let me take a closer look. 

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002657#comment-14002657
 ] 

Tsuyoshi OZAWA commented on YARN-1474:
--

It's because serviceInit() doesn't have any interfaces to pass RMContext to 
schedulers.

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-05-19 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-941:
---

Attachment: YARN-941.preview.2.patch

Added a testcase

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.2.patch, YARN-941.preview.patch


 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-19 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002707#comment-14002707
 ] 

Karthik Kambatla commented on YARN-1474:


Thanks [~ozawa] for your patience with the reviews. I guess we can leave 
setRMContext as is. And, let us handle the incompatible change to reinitialize 
in a separate JIRA.

On an HA cluster, I noticed that the scheduler threads (FS - updateThread, 
continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. 
Ideally, the threads should start only on start(). I guess we should adopt a 
modified version of your earlier patch:
# From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} 
to serviceInit.
# Don't call {{reinitialize()}} in serviceInit or serviceStart.
# For the individual threads in the schedulers, init them in serviceInit, but 
call thread.start() in serviceStart() (see the sketch after this comment)
# serviceStop() for FS looks good. We should fix the serviceStop() for CS.
# In TestFairScheduler, the following is not required.
{code}
// To initialize scheduler
scheduler.setRMContext(resourceManager.getRMContext());
{code}
# In TestFairSchedulerEventLog, the following is not required. In this case and 
the above, some tests might require calling resourceManager.start().
{code}
scheduler.serviceInit(conf);
scheduler.setRMContext(resourceManager.getRMContext());
{code}
# In TestFifoScheduler, we don't need the following:
{code}
scheduler.setRMContext(rm.getRMContext());
{code}
# TestFSLeafQueue doesn't need this either:
{code}
scheduler.serviceInit(conf);
scheduler.setRMContext(resourceManager.getRMContext());
{code}
# In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, 
in any other places. 
# In TestQueueParsing, you might need to call capacityScheduler.init() in 
addition to or instead of 
{code}
capacityScheduler.reinitialize(conf, null);
{code}
# In TestRMContainerAllocator, we might have to call init() instead of 
reinitialize().
# In TestRMWebApp, we should call init() instead of reinitialize()

In general, in the tests,
# If there is an RM / Mock RM involved, we don't have to call setRMContext and 
reinitialize as long as RM#init is called. 
# If there is no RM / Mock RM, we should call a setRMContext followed by init 
on the scheduler. Subsequent calls should remain reinitialize().
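
A hedged sketch of the init/start/stop split from points 1-4 above (the class 
and thread names are illustrative; AbstractService is the real Hadoop service 
base class):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

class SketchScheduler extends AbstractService {
  private Thread updateThread;

  SketchScheduler() { super("SketchScheduler"); }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // absorb the one-time "if (!initialized)" work from reinitialize() here
    updateThread = new Thread(new Runnable() {
      @Override public void run() { updateLoop(); }
    }, "SketchUpdateThread");
    updateThread.setDaemon(true);
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    updateThread.start();          // threads start here, not in init
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    if (updateThread != null) {
      updateThread.interrupt();
      updateThread.join();
    }
    super.serviceStop();
  }

  private void updateLoop() {
    while (!Thread.currentThread().isInterrupted()) {
      // periodic scheduling work would go here
      try { Thread.sleep(500); } catch (InterruptedException e) { return; }
    }
  }
}
{code}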

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (YARN-1474) Make schedulers services

2014-05-19 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002707#comment-14002707
 ] 

Karthik Kambatla edited comment on YARN-1474 at 5/20/14 2:01 AM:
-

Thanks [~ozawa] for your patience with the reviews. I guess we can leave 
setRMContext as is. And, let us handle the incompatible change to reinitialize 
in a separate JIRA.

On an HA cluster, I noticed that the scheduler threads (FS - updateThread, 
continuousSchedulingThread; CS - asyncSchedulerThread) start on init() itself. 
Ideally, the threads should start only on start(). I guess we should adopt a 
modified version of your earlier patch:
# From {{reinitialize()}}, move the part corresponding to {{if (!initialized)}} 
to serviceInit.
# Don't call {{reinitialize()}} in serviceInit or serviceStart.
# For the individual threads in the schedulers, init them in serviceInit, but 
call thread.start() in serviceStart()
# serviceStop() for FS looks good. We should fix the serviceStop() for CS.

Other comments:
# In TestFairScheduler, the following is not required.
{code}
// To initialize scheduler
scheduler.setRMContext(resourceManager.getRMContext());
{code}
# In TestFairSchedulerEventLog, the following is not required. In this case and 
the above, some tests might require calling resourceManager.start().
{code}
scheduler.serviceInit(conf);
scheduler.setRMContext(resourceManager.getRMContext());
{code}
# In TestFifoScheduler, we don't need the following:
{code}
scheduler.setRMContext(rm.getRMContext());
{code}
# TestFSLeafQueue doesn't need this either:
{code}
scheduler.serviceInit(conf);
scheduler.setRMContext(resourceManager.getRMContext());
{code}
# In TestLeafQueue, we should call cs.init() instead of cs.serviceInit(). Also, 
in any other places. 
# In TestQueueParsing, you might need to call capacityScheduler.init() in 
addition to or instead of 
{code}
capacityScheduler.reinitialize(conf, null);
{code}
# In TestRMContainerAllocator, we might have to call init() instead of 
reinitialize().
# In TestRMWebApp, we should call init() instead of reinitialize()

In general, in the tests,
# If there is an RM / Mock RM involved, we don't have to call setRMContext and 
reinitialize as long as RM#init is called. 
# If there is no RM / Mock RM, we should call a setRMContext followed by init 
on the scheduler. Subsequent calls should remain reinitialize().



 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: 

[jira] [Updated] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-05-19 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-941:
---

Attachment: YARN-941.preview.3.patch

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, 
 YARN-941.preview.patch


 When an application is submitted to the RM, it includes a set of tokens that 
 the RM will renew on behalf of the application. These tokens are passed to the 
 AM when the application is launched and are used at launch time to access HDFS 
 and download files on behalf of the application.
 For long-lived applications/services these tokens can expire; the tokens the 
 AM holds then become invalid, and the tokens the RM had will also no longer 
 work to launch a new AM.
 We need to provide an API that allows the RM to replace the current tokens 
 for this application with a new set. To avoid any real race issues, I think 
 this API should be something that the AM calls: the client connects to the AM 
 with a new set of tokens it obtained using Kerberos, and the AM then informs 
 the RM of the new set of tokens and quickly updates its own tokens to use the 
 new ones.
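
The exact shape of such an API is open; purely as an illustration of the flow 
described above, here is a hedged sketch with hypothetical names (none of these 
exist in the current YARN protocols):
{code}
// Hypothetical sketch only: illustrates the AM -> RM token-update flow from the
// description above. These names are NOT part of any existing YARN API.
public interface UpdateTokensExample {

  /** Hypothetical request carrying the freshly obtained, serialized credentials. */
  final class UpdateApplicationTokensRequest {
    private final java.nio.ByteBuffer serializedCredentials;

    public UpdateApplicationTokensRequest(java.nio.ByteBuffer serializedCredentials) {
      this.serializedCredentials = serializedCredentials;
    }

    public java.nio.ByteBuffer getSerializedCredentials() {
      return serializedCredentials;
    }
  }

  // Flow: 1) the client obtains new tokens via Kerberos and hands them to the AM,
  //       2) the AM calls this hypothetical RPC so the RM replaces the tokens it
  //          holds for the application,
  //       3) the AM switches its own connections over to the new tokens.
  void updateApplicationTokens(UpdateApplicationTokensRequest request)
      throws java.io.IOException;
}
{code}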



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-05-19 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002715#comment-14002715
 ] 

Xuan Gong commented on YARN-941:


Fix some typos

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, 
 YARN-941.preview.patch


 When an application is submitted to the RM, it includes a set of tokens that 
 the RM will renew on behalf of the application. These tokens are passed to the 
 AM when the application is launched and are used at launch time to access HDFS 
 and download files on behalf of the application.
 For long-lived applications/services these tokens can expire; the tokens the 
 AM holds then become invalid, and the tokens the RM had will also no longer 
 work to launch a new AM.
 We need to provide an API that allows the RM to replace the current tokens 
 for this application with a new set. To avoid any real race issues, I think 
 this API should be something that the AM calls: the client connects to the AM 
 with a new set of tokens it obtained using Kerberos, and the AM then informs 
 the RM of the new set of tokens and quickly updates its own tokens to use the 
 new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore

2014-05-19 Thread Binglin Chang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binglin Chang reassigned YARN-2030:
---

Assignee: Binglin Chang

 Use StateMachine to simplify handleStoreEvent() in RMStateStore
 ---

 Key: YARN-2030
 URL: https://issues.apache.org/jira/browse/YARN-2030
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
Assignee: Binglin Chang

 Now the logic to handle different store events in handleStoreEvent() is as 
 following:
 {code}
 if (event.getType().equals(RMStateStoreEventType.STORE_APP)
 || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
   ...
   if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
 ...
   } else {
 ...
   }
   ...
   try {
 if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
   ...
 } else {
   ...
 }
   } 
   ...
 } else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
 || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
   ...
   if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
 ...
   } else {
 ...
   }
 ...
 if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
   ...
 } else {
   ...
 }
   }
   ...
 } else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
 ...
 } else {
   ...
 }
 }
 {code}
 This not only confuses people but also easily leads to mistakes. We may 
 leverage a state machine to simplify this, even though there are no real 
 state transitions.
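
Whether this is done with the StateMachineFactory the JIRA proposes or with a 
plain dispatch table, the goal is to key the handling on the event type exactly 
once. Below is a minimal dispatch-map sketch with placeholder types and 
handlers; it is not the actual RMStateStore code:
{code}
// Illustrative sketch only: an EnumMap dispatch keyed on the event type,
// replacing the nested if/else chains shown above.
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Consumer;

enum StoreEventType { STORE_APP, UPDATE_APP, STORE_APP_ATTEMPT, UPDATE_APP_ATTEMPT, REMOVE_APP }

class StoreEvent {
  private final StoreEventType type;
  StoreEvent(StoreEventType type) { this.type = type; }
  StoreEventType getType() { return type; }
}

class StoreEventDispatcher {
  private final Map<StoreEventType, Consumer<StoreEvent>> handlers =
      new EnumMap<>(StoreEventType.class);

  StoreEventDispatcher() {
    handlers.put(StoreEventType.STORE_APP, e -> { /* store app state */ });
    handlers.put(StoreEventType.UPDATE_APP, e -> { /* update app state */ });
    handlers.put(StoreEventType.STORE_APP_ATTEMPT, e -> { /* store attempt state */ });
    handlers.put(StoreEventType.UPDATE_APP_ATTEMPT, e -> { /* update attempt state */ });
    handlers.put(StoreEventType.REMOVE_APP, e -> { /* remove app state */ });
  }

  void handleStoreEvent(StoreEvent event) {
    Consumer<StoreEvent> handler = handlers.get(event.getType());
    if (handler == null) {
      throw new IllegalStateException("Unknown event type: " + event.getType());
    }
    handler.accept(event);
  }
}
{code}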



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002752#comment-14002752
 ] 

Rohith commented on YARN-1366:
--

bq. Rohith let me know if you mind if we add these as well to YARN-1365.
Agree

bq. If the AMRMClientAsync is not doing this then we should fix it.
We need not fix this; it is handled by setting the keepRunning flag to false.

bq. allow finishApplicationMaster to succeed when responseMap is set to -1 (ie 
not yet registered but known to be last). 
It would require additional state transitions: 
RMAppAttemptImpl : LAUNCHED -> 
EnumSet.of(RMAppAttemptState.FINAL_SAVING, RMAppAttemptState.FINISHED)
RMAppImpl : ACCEPTED -> FINAL_SAVING


From the overall discussion above, the existing approach will be used on resync 
instead of going with a new API. Please let me know if anyone has concerns 
about this.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.
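
For illustration only, a rough sketch of the AM-side behaviour described here; 
the helper names (onAllocateResponse, sendAllocate, pendingAsks) are made up 
and are not the real AMRMClient internals:
{code}
// Sketch of the described resync: on a resync signal from the RM, reset the
// allocate RPC sequence number to 0 and resend the entire outstanding request
// set. All names here are placeholders, not actual YARN client APIs.
import java.util.ArrayList;
import java.util.List;

class ResyncSketch {
  private int responseId = 0;
  private final List<String> pendingAsks = new ArrayList<String>();

  void onAllocateResponse(boolean resyncRequested) {
    if (resyncRequested) {
      // Reset the allocate sequence number...
      responseId = 0;
      // ...and resend everything still outstanding in the next allocate call.
      sendAllocate(pendingAsks);
    }
  }

  private void sendAllocate(List<String> asks) {
    // issue allocate(responseId, asks) to the RM (omitted); note that the RM
    // may report some container completions more than once after a restart.
    responseId++;
  }
}
{code}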



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002753#comment-14002753
 ] 

Rohith commented on YARN-1366:
--

The overall patch would contain MR and YARN changes:
1. MapReduce change for resending the resource request on resync.
2. AMRMClientImpl (from the YARN client) providing the benefit of resync.
3. ApplicationMasterService changes.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2075) TestRMAdminCLI consistently fail on trunk

2014-05-19 Thread Kenji Kikushima (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenji Kikushima updated YARN-2075:
--

Attachment: YARN-2075.patch

Attached a patch.
- testTransitionToActive failure: Changed HAAdmin#getTargetIds to use an 
ArrayList. It previously used only Arrays.asList, which returns a fixed-size 
list, so an UnsupportedOperationException occurred when remove was called in 
HAAdmin#isOtherTargetNodeActive (see the example after this list).
- testHelp failure: Adjusted the spacing and the --forceactive message in the 
transitionToActive command usage test.
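
For reference, a minimal standalone illustration of the fixed-size list 
behaviour behind the failure (a generic Java example, not the HAAdmin code 
itself):
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeListDemo {
  public static void main(String[] args) {
    // Arrays.asList returns a fixed-size view backed by the array;
    // remove() throws UnsupportedOperationException.
    List<String> fixed = Arrays.asList("rm1", "rm2");
    try {
      fixed.remove("rm1");
    } catch (UnsupportedOperationException e) {
      System.out.println("remove() not supported on Arrays.asList result");
    }

    // Copying into a real ArrayList makes the list mutable.
    List<String> mutable = new ArrayList<String>(Arrays.asList("rm1", "rm2"));
    mutable.remove("rm1");          // works
    System.out.println(mutable);    // prints [rm2]
  }
}
{code}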

 TestRMAdminCLI consistently fail on trunk
 -

 Key: YARN-2075
 URL: https://issues.apache.org/jira/browse/YARN-2075
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
 Attachments: YARN-2075.patch


 {code}
 Running org.apache.hadoop.yarn.client.TestRMAdminCLI
 Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec 
 <<< FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI
 testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI)  Time 
 elapsed: 0.082 sec  <<< ERROR!
 java.lang.UnsupportedOperationException: null
   at java.util.AbstractList.remove(AbstractList.java:144)
   at java.util.AbstractList$Itr.remove(AbstractList.java:360)
   at java.util.AbstractCollection.remove(AbstractCollection.java:252)
   at 
 org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173)
   at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144)
   at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447)
   at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380)
   at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318)
   at 
 org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180)
 testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI)  Time elapsed: 0.088 
 sec  <<< FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at 
 org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366)
   at 
 org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1352) Recover LogAggregationService upon nodemanager restart

2014-05-19 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002797#comment-14002797
 ] 

Ming Ma commented on YARN-1352:
---

Jason, I'm not sure whether you will cover NonAggregatingLogHandler in a 
different JIRA; there is delayed-task state there that needs to be restored, 
similar to the DeletionService JIRA.

 Recover LogAggregationService upon nodemanager restart
 --

 Key: YARN-1352
 URL: https://issues.apache.org/jira/browse/YARN-1352
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe

 LogAggregationService state needs to be recovered as part of the 
 work-preserving nodemanager restart feature.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-05-19 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002800#comment-14002800
 ] 

Sunil G commented on YARN-2074:
---

Hi Vinod

As per the description, I understand that the AM container can still get 
preempted as it does today, but the resulting kill/preemption should not lead 
to job failures.
In this scenario we may still kill some AM containers, which then have to be 
re-launched. Keeping a lower priority for all AMs may instead help preempt 
map/reduce containers from other applications in a similar scenario.

As Carlo has mentioned in YARN-2022, there can be extreme corner cases with 
this approach, but it may help avoid the cost of re-launching the AM container.
Could you please consider this point in this JIRA as well?

 Preemption of AM containers shouldn't count towards AM failures
 ---

 Key: YARN-2074
 URL: https://issues.apache.org/jira/browse/YARN-2074
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
 containers getting preempted shouldn't count towards AM failures and thus 
 shouldn't eventually fail applications.
 We should explicitly handle AM container preemption/kill as a separate issue 
 and not count it towards the limit on AM failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)