[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-06-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033479#comment-14033479
 ] 

Wangda Tan commented on YARN-2074:
--

[~jianhe], thanks for your clarification. 
I think the testAMPreemptedNotCountedForAMFailures is exactly what I meant. 
LGTM, +1.


 Preemption of AM containers shouldn't count towards AM failures
 ---

 Key: YARN-2074
 URL: https://issues.apache.org/jira/browse/YARN-2074
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, 
 YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch


 One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
 containers getting preempted shouldn't count towards AM failures and thus 
 shouldn't eventually fail applications.
 We should explicitly handle AM container preemption/kill as a separate issue 
 and not count it towards the limit on AM failures.
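 For context, the direction described above can be illustrated with a minimal,
 self-contained sketch. The class and method names below are illustrative and
 are not the actual RMAppAttemptImpl code; it assumes a
 ContainerExitStatus.PREEMPTED constant is available.
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Illustrative sketch only, not the actual RMAppAttemptImpl logic.
public class AmFailureAccounting {

  // Charge a finished AM container against the max-attempt limit only when
  // the exit was the application's own fault; preempted or RM/NM-aborted AM
  // containers are skipped, so they never exhaust the AM retry quota.
  public static boolean countsTowardsMaxAttempts(ContainerStatus amStatus) {
    int exitStatus = amStatus.getExitStatus();
    return exitStatus != ContainerExitStatus.PREEMPTED
        && exitStatus != ContainerExitStatus.ABORTED;
  }
}
{code}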



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033491#comment-14033491
 ] 

Hadoop QA commented on YARN-2022:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650734/YARN-2022.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4010//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4010//console

This message is automatically generated.

 Preempting an Application Master container can be kept as least priority when 
 multiple applications are marked for preemption by 
 ProportionalCapacityPreemptionPolicy
 -

 Key: YARN-2022
 URL: https://issues.apache.org/jira/browse/YARN-2022
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Sunil G
Assignee: Sunil G
 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, 
 YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, 
 Yarn-2022.1.patch


 Cluster Size = 16GB [2NM's]
 Queue A Capacity = 50%
 Queue B Capacity = 50%
 Consider 3 applications running in Queue A that have taken the full cluster 
 capacity. 
 J1 = 2GB AM + 1GB * 4 Maps
 J2 = 2GB AM + 1GB * 4 Maps
 J3 = 2GB AM + 1GB * 2 Maps
 Another job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps].
 Currently in this scenario, job J3 will get killed, including its AM.
 It would be better if the AM were given the least preemption priority among 
 multiple applications. In this same scenario, map tasks from J3 and J2 can be 
 preempted instead; later, when the cluster is free, maps can be allocated to 
 these jobs again.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-06-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2074:
--

Attachment: YARN-2074.7.patch

 Preemption of AM containers shouldn't count towards AM failures
 ---

 Key: YARN-2074
 URL: https://issues.apache.org/jira/browse/YARN-2074
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, 
 YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, 
 YARN-2074.7.patch


 One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
 containers getting preempted shouldn't count towards AM failures and thus 
 shouldn't eventually fail applications.
 We should explicitly handle AM container preemption/kill as a separate issue 
 and not count it towards the limit on AM failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-06-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033496#comment-14033496
 ] 

Jian He commented on YARN-2074:
---

Thanks for pointing out RMAppAttemptImpl.isLastAttempt; there's an existing bug 
when calculating isLastAttempt. I updated the patch and test case accordingly.

 Preemption of AM containers shouldn't count towards AM failures
 ---

 Key: YARN-2074
 URL: https://issues.apache.org/jira/browse/YARN-2074
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, 
 YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, 
 YARN-2074.7.patch


 One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
 containers getting preempted shouldn't count towards AM failures and thus 
 shouldn't eventually fail applications.
 We should explicitly handle AM container preemption/kill as a separate issue 
 and not count it towards the limit on AM failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-06-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2074:
--

Attachment: YARN-2074.7.patch

 Preemption of AM containers shouldn't count towards AM failures
 ---

 Key: YARN-2074
 URL: https://issues.apache.org/jira/browse/YARN-2074
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, 
 YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, 
 YARN-2074.7.patch, YARN-2074.7.patch


 One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
 containers getting preempted shouldn't count towards AM failures and thus 
 shouldn't eventually fail applications.
 We should explicitly handle AM container preemption/kill as a separate issue 
 and not count it towards the limit on AM failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI list command

2014-06-17 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033502#comment-14033502
 ] 

Zhijie Shen commented on YARN-1480:
---

Hi [~kj-ki], thanks for the patch. Here're some meta comments on it:

1. I looked into the current RMWebServices#getApps(), and below is the list of 
options that are missing in ApplicationCLI. "queue" (the current queue option is 
for the movetoqueue command) and "tags" are not covered in the patch. If it's 
not a big addition, would it be better to include these two options in the 
option list?
{code}
  @QueryParam("finalStatus") String finalStatusQuery,
  @QueryParam("user") String userQuery,
  @QueryParam("queue") String queueQuery,
  @QueryParam("limit") String count,
  @QueryParam("startedTimeBegin") String startedBegin,
  @QueryParam("startedTimeEnd") String startedEnd,
  @QueryParam("finishedTimeBegin") String finishBegin,
  @QueryParam("finishedTimeEnd") String finishEnd,
  @QueryParam("applicationTags") Set<String> applicationTags
{code}

2. ApplicationClientProtocol#getApplications already supports the full set of 
filters, while YarnClient does not seem to support the full options yet. IMHO, 
the right way here is to make YarnClient support the full filters, and have 
ApplicationCLI simply call that API. Pulling a long app list from the RM and 
filtering it locally is inefficient.
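
As a rough illustration of the second point, a filtered listing can be pushed 
down to the RM via GetApplicationsRequest instead of filtering locally. The 
helper below is only a sketch; the class name, method name, and the CLI option 
names mentioned in the comments are illustrative, not part of the patch.
{code}
import java.io.IOException;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch: build one fully-filtered request and let the RM do the filtering,
// instead of pulling the whole application list and filtering it in the CLI.
public class FilteredAppList {

  public static List<ApplicationReport> listApps(
      ApplicationClientProtocol rmClient,
      Set<String> users, Set<String> queues, Set<String> tags,
      EnumSet<YarnApplicationState> states, long limit)
      throws YarnException, IOException {

    GetApplicationsRequest request = GetApplicationsRequest.newInstance();
    request.setUsers(users);                 // e.g. a --user option
    request.setQueues(queues);               // list filter, distinct from movetoqueue
    request.setApplicationTags(tags);        // e.g. a --tags option
    request.setApplicationStates(states);    // --appStates-style state filter
    request.setLimit(limit);                 // e.g. a --limit option

    return rmClient.getApplications(request).getApplicationList();
  }
}
{code}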

 RM web services getApps() accepts many more filters than ApplicationCLI 
 list command
 --

 Key: YARN-1480
 URL: https://issues.apache.org/jira/browse/YARN-1480
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Kenji Kikushima
 Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, 
 YARN-1480-5.patch, YARN-1480.patch


 Nowadays RM web services getApps() accepts many more filters than 
 ApplicationCLI list command, which only accepts state and type. IMHO, 
 ideally, different interfaces should provide consistent functionality. Is it 
 better to allow more filters in ApplicationCLI?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033531#comment-14033531
 ] 

Hadoop QA commented on YARN-2074:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650742/YARN-2074.7.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4011//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4011//console

This message is automatically generated.

 Preemption of AM containers shouldn't count towards AM failures
 ---

 Key: YARN-2074
 URL: https://issues.apache.org/jira/browse/YARN-2074
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, 
 YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, 
 YARN-2074.7.patch, YARN-2074.7.patch


 One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
 containers getting preempted shouldn't count towards AM failures and thus 
 shouldn't eventually fail applications.
 We should explicitly handle AM container preemption/kill as a separate issue 
 and not count it towards the limit on AM failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-06-17 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: trust.patch

Test whether this patch can work.

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler
Affects Versions: 2.2.0
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
Reporter: anders
Priority: Minor
  Labels: patch
 Fix For: 2.2.0

 Attachments: trust.patch, trust.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of our critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the API of the 
 OAT server), so I added this feature to Hadoop's scheduling.
 Through the TRUST check service, a node can obtain its own TRUST status and 
 then, via the heartbeat, send that TRUST status to the resource manager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', it will be 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node's health check service.
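 A rough sketch of such a checker follows; it is purely illustrative, and the 
 OatClient interface and all names in it are hypothetical, not part of the 
 attached patch.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: periodically ask an attestation (OAT) server for this
// node's TRUST status so the value can be piggybacked on the NM heartbeat.
public class TrustStatusChecker {

  // Hypothetical client for the OAT attestation server.
  public interface OatClient {
    boolean isTrusted(String hostname) throws Exception;
  }

  private final AtomicBoolean trusted = new AtomicBoolean(false);
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start(final OatClient oat, final String hostname,
      long intervalSeconds) {
    scheduler.scheduleWithFixedDelay(new Runnable() {
      @Override
      public void run() {
        try {
          trusted.set(oat.isTrusted(hostname));
        } catch (Exception e) {
          // Treat attestation failures as untrusted until the next check.
          trusted.set(false);
        }
      }
    }, 0, intervalSeconds, TimeUnit.SECONDS);
  }

  // Read by the heartbeat code; the scheduler would skip nodes reporting false.
  public boolean isTrusted() {
    return trusted.get();
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
{code}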



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033674#comment-14033674
 ] 

Hudson commented on YARN-2167:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/586/])
YARN-2167. LeveldbIterator should get closed in 
NMLeveldbStateStoreService#loadLocalizationState() within finally block. 
Contributed by Junping Du (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java


 LeveldbIterator should get closed in 
 NMLeveldbStateStoreService#loadLocalizationState() within finally block
 

 Key: YARN-2167
 URL: https://issues.apache.org/jira/browse/YARN-2167
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du
 Fix For: 3.0.0, 2.5.0

 Attachments: YARN-2167.patch


 In NMLeveldbStateStoreService#loadLocalizationState(), we have 
 LeveldbIterator to read NM's localization state but it is not get closed in 
 finally block. We should close this connection to DB as a common practice. 
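 A minimal sketch of the close-in-finally pattern being proposed; the method 
 shown is illustrative and the surrounding load logic is omitted.
{code}
import java.io.IOException;

import org.apache.hadoop.yarn.server.utils.LeveldbIterator;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBException;

// Illustrative sketch: iterate over a key range and guarantee the iterator
// (and its handle on the DB) is closed even if iteration fails.
public class LoadStateSketch {

  public static int countEntries(DB db, byte[] keyPrefix) throws IOException {
    LeveldbIterator iter = null;
    try {
      iter = new LeveldbIterator(db);
      iter.seek(keyPrefix);
      int count = 0;
      while (iter.hasNext()) {
        // Real code would check the key prefix and deserialize the value here.
        iter.next();
        count++;
      }
      return count;
    } catch (DBException e) {
      throw new IOException(e);
    } finally {
      if (iter != null) {
        iter.close();   // the point of YARN-2167: always close the iterator
      }
    }
  }
}
{code}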



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033673#comment-14033673
 ] 

Hudson commented on YARN-2159:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/586/])
YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java


 Better logging in SchedulerNode#allocateContainer
 -

 Key: YARN-2159
 URL: https://issues.apache.org/jira/browse/YARN-2159
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Ray Chiang
Assignee: Ray Chiang
Priority: Trivial
  Labels: newbie, supportability
 Fix For: 2.5.0

 Attachments: YARN2159-01.patch


 This bit of code:
 {quote}
 LOG.info("Assigned container " + container.getId() + " of capacity "
 + container.getResource() + " on host " + rmNode.getNodeAddress()
 + ", which currently has " + numContainers + " containers, "
 + getUsedResource() + " used and " + getAvailableResource()
 + " available");
 {quote}
 results in a line like:
 {quote}
 2014-05-30 16:17:43,573 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
 Assigned container container_14000_0009_01_00 of capacity 
 <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently 
 has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> 
 available
 {quote}
 That message is fine in most cases, but looks pretty bad after the last 
 available allocation, since it says something like "vCores:0 available".
 Here is one suggested phrasing:
   - "... which has 18 containers, <memory:27648, vCores:18> used and 
 <memory:3072, vCores:0> available after allocation"
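 For illustration, a message reworded along the suggested lines could be 
 produced by something like the following sketch; the helper class and 
 parameter names are hypothetical, not the attached patch.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Illustrative sketch: emit the reworded allocation message with
// "after allocation" appended so "vCores:0 available" is no longer confusing.
public class AllocationLogSketch {
  private static final Log LOG = LogFactory.getLog(AllocationLogSketch.class);

  public static void logAssigned(String containerId, String capacity,
      String host, int numContainers, String used, String available) {
    LOG.info("Assigned container " + containerId + " of capacity " + capacity
        + " on host " + host + ", which has " + numContainers + " containers, "
        + used + " used and " + available + " available after allocation");
  }
}
{code}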



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033677#comment-14033677
 ] 

Hudson commented on YARN-1339:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/586/])
YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed 
by Jason Lowe) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java


 Recover DeletionService state upon nodemanager restart
 --

 Key: YARN-1339
 URL: https://issues.apache.org/jira/browse/YARN-1339
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Fix For: 2.5.0

 Attachments: YARN-1339.patch, YARN-1339v2.patch, 
 YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, 
 YARN-1339v6.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033679#comment-14033679
 ] 

Hudson commented on YARN-1885:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/586/])
YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs 
where the completed applications previously ran in case of RM restart. 
Contributed by Wangda Tan (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* 

[jira] [Created] (YARN-2169) NMSimulator of sls should catch more Exception

2014-06-17 Thread Beckham007 (JIRA)
Beckham007 created YARN-2169:


 Summary: NMSimulator of sls should catch more Exception
 Key: YARN-2169
 URL: https://issues.apache.org/jira/browse/YARN-2169
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Beckham007


In the method middleStep() of NMSimulator, sending a heartbeat may cause an 
InterruptedException or another Exception if the load is heavy. If these 
exceptions are not handled, the NMSimulator task cannot be added back to the 
executor queue, so the NM will be lost.
In my situation, the pool size is 4000, the NM count is 2000, and the AM count 
is 1500. Some NMs will be lost.
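
A minimal sketch of the kind of defensive handling being proposed; the 
sendHeartbeat() hook is hypothetical and stands in for the RPC call made in 
NMSimulator#middleStep().
{code}
// Illustrative sketch: guard the simulated NM heartbeat so that an unexpected
// exception under heavy load does not prevent the task from being rescheduled.
public abstract class GuardedNmHeartbeat {

  protected abstract void sendHeartbeat() throws Exception;

  public final void middleStep() {
    try {
      sendHeartbeat();
    } catch (InterruptedException e) {
      // Preserve the interrupt flag but keep the simulator task alive.
      Thread.currentThread().interrupt();
    } catch (Exception e) {
      // Log and continue; otherwise the NM disappears from the simulation.
      System.err.println("Heartbeat failed, will retry next round: " + e);
    }
  }
}
{code}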



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2169) NMSimulator of sls should catch more Exception

2014-06-17 Thread Beckham007 (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Beckham007 updated YARN-2169:
-

Attachment: YARN-2169.patch

 NMSimulator of sls should catch more Exception
 --

 Key: YARN-2169
 URL: https://issues.apache.org/jira/browse/YARN-2169
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Beckham007
 Attachments: YARN-2169.patch


 In the method middleStep() of NMSimulator, sending a heartbeat may cause an 
 InterruptedException or another Exception if the load is heavy. If these 
 exceptions are not handled, the NMSimulator task cannot be added back to the 
 executor queue, so the NM will be lost.
 In my situation, the pool size is 4000, the NM count is 2000, and the AM 
 count is 1500. Some NMs will be lost.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2170) Fix components' version information in the web page 'About the Cluster'

2014-06-17 Thread Jun Gong (JIRA)
Jun Gong created YARN-2170:
--

 Summary: Fix components' version information in the web page 
'About the Cluster'
 Key: YARN-2170
 URL: https://issues.apache.org/jira/browse/YARN-2170
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Priority: Minor


In the web page 'About the Cluster', the build version shown for YARN 
components (e.g. the ResourceManager) is currently the same as the Hadoop 
version. It is caused by mistakenly calling getVersion() instead of 
_getVersion() in VersionInfo.java.
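
A sketch of the intended delegation pattern follows. The class and method 
names are illustrative, not the actual YarnVersionInfo source, and it assumes 
VersionInfo exposes a protected per-component constructor and a protected 
_getVersion() instance method.
{code}
import org.apache.hadoop.util.VersionInfo;

// Illustrative sketch: a component's version class must delegate to its own
// instance method _getVersion(), not to the static Hadoop-wide getVersion().
public class ComponentVersionInfo extends VersionInfo {
  private static final ComponentVersionInfo INSTANCE =
      new ComponentVersionInfo("yarn");

  protected ComponentVersionInfo(String component) {
    super(component);
  }

  public static String getComponentVersion() {
    // Correct: reads this component's version-info properties.
    return INSTANCE._getVersion();
    // Buggy variant being fixed: return VersionInfo.getVersion();
  }
}
{code}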



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2170) Fix components' version information in the web page 'About the Cluster'

2014-06-17 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-2170:
---

Attachment: YARN-2170.patch

 Fix components' version information in the web page 'About the Cluster'
 ---

 Key: YARN-2170
 URL: https://issues.apache.org/jira/browse/YARN-2170
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Priority: Minor
 Attachments: YARN-2170.patch


 In the web page 'About the Cluster', the build version shown for YARN 
 components (e.g. the ResourceManager) is currently the same as the Hadoop 
 version. It is caused by mistakenly calling getVersion() instead of 
 _getVersion() in VersionInfo.java.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033829#comment-14033829
 ] 

Hudson commented on YARN-1339:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/])
YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed 
by Jason Lowe) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java


 Recover DeletionService state upon nodemanager restart
 --

 Key: YARN-1339
 URL: https://issues.apache.org/jira/browse/YARN-1339
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Fix For: 2.5.0

 Attachments: YARN-1339.patch, YARN-1339v2.patch, 
 YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, 
 YARN-1339v6.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033826#comment-14033826
 ] 

Hudson commented on YARN-2167:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/])
YARN-2167. LeveldbIterator should get closed in 
NMLeveldbStateStoreService#loadLocalizationState() within finally block. 
Contributed by Junping Du (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java


 LeveldbIterator should get closed in 
 NMLeveldbStateStoreService#loadLocalizationState() within finally block
 

 Key: YARN-2167
 URL: https://issues.apache.org/jira/browse/YARN-2167
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du
 Fix For: 3.0.0, 2.5.0

 Attachments: YARN-2167.patch


 In NMLeveldbStateStoreService#loadLocalizationState(), we have 
 LeveldbIterator to read NM's localization state but it is not get closed in 
 finally block. We should close this connection to DB as a common practice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033831#comment-14033831
 ] 

Hudson commented on YARN-1885:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/])
YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs 
where the completed applications previously ran in case of RM restart. 
Contributed by Wangda Tan (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* 

[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033825#comment-14033825
 ] 

Hudson commented on YARN-2159:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/])
YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java


 Better logging in SchedulerNode#allocateContainer
 -

 Key: YARN-2159
 URL: https://issues.apache.org/jira/browse/YARN-2159
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Ray Chiang
Assignee: Ray Chiang
Priority: Trivial
  Labels: newbie, supportability
 Fix For: 2.5.0

 Attachments: YARN2159-01.patch


 This bit of code:
 {quote}
 LOG.info("Assigned container " + container.getId() + " of capacity "
 + container.getResource() + " on host " + rmNode.getNodeAddress()
 + ", which currently has " + numContainers + " containers, "
 + getUsedResource() + " used and " + getAvailableResource()
 + " available");
 {quote}
 results in a line like:
 {quote}
 2014-05-30 16:17:43,573 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
 Assigned container container_14000_0009_01_00 of capacity 
 <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently 
 has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> 
 available
 {quote}
 That message is fine in most cases, but looks pretty bad after the last 
 available allocation, since it says something like "vCores:0 available".
 Here is one suggested phrasing:
   - "... which has 18 containers, <memory:27648, vCores:18> used and 
 <memory:3072, vCores:0> available after allocation"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-17 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-2171:


 Summary: AMs block on the CapacityScheduler lock during allocate()
 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.4.0, 0.23.10
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical


When AMs heartbeat into the RM via the allocate() call they are blocking on the 
CapacityScheduler lock when trying to get the number of nodes in the cluster 
via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033864#comment-14033864
 ] 

Jason Lowe commented on YARN-2171:
--

When the CapacityScheduler scheduler thread is running full-time due to a 
constant stream of events (e.g. a large number of running applications on a 
large number of cluster nodes), the CapacityScheduler lock is held by that 
scheduler loop most of the time.  As AMs heartbeat into the RM to try to get 
their resources, the capacity scheduler code goes out of its way to avoid 
having the AMs grab the scheduler lock.  Unfortunately this path was missed, 
and the lock is taken just to read this one integer value.  Therefore the AMs 
end up piling up on the scheduler lock, filling all of the IPC handlers of the 
ApplicationMasterService while the rest back up on the call queue.  Once the 
scheduler releases the lock it will quickly try to grab it again, so only a few 
AMs get through the gate and the IPC handlers fill again with the next batch of 
AMs blocking on the scheduler lock.  This causes the average RPC response times 
to skyrocket for AMs.  AMs experience large delays getting their allocations, 
which in turn leads to lower cluster utilization and increased application 
runtimes.
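
One way to remove this contention, sketched with illustrative names rather 
than the eventual patch, is to keep the node count in a lock-free counter that 
allocate() can read without touching the scheduler lock:
{code}
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: the counter is updated from the (locked) node-added /
// node-removed event handlers, but read lock-free from allocate() heartbeats.
public class ClusterNodeCounter {
  private final AtomicInteger numNodeManagers = new AtomicInteger(0);

  public void nodeAdded()   { numNodeManagers.incrementAndGet(); }
  public void nodeRemoved() { numNodeManagers.decrementAndGet(); }

  // Called from allocate(); no scheduler lock required.
  public int getNumClusterNodes() {
    return numNodeManagers.get();
  }
}
{code}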

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical

 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API

2014-06-17 Thread Romain Rigaux (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033874#comment-14033874
 ] 

Romain Rigaux commented on YARN-409:


dup of https://issues.apache.org/jira/browse/YARN-1702?

 Allow apps to be killed via the RM REST API
 ---

 Key: YARN-409
 URL: https://issues.apache.org/jira/browse/YARN-409
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 The RM REST API currently allows getting information about running 
 applications.  Adding the capability to kill applications would allow systems 
 like Hue to perform their functions over HTTP.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2172) Suspend/Resume Hadoop Jobs

2014-06-17 Thread Richard Chen (JIRA)
Richard Chen created YARN-2172:
--

 Summary: Suspend/Resume Hadoop Jobs
 Key: YARN-2172
 URL: https://issues.apache.org/jira/browse/YARN-2172
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager, webapp
Affects Versions: 2.2.0
 Environment: CentOS 6.5, Hadoop 2.2.0
Reporter: Richard Chen
 Fix For: 2.2.0


In a multi-application cluster environment, jobs running inside Hadoop 
application may be of lower-priority than jobs running inside other 
applications like HBase. To give way to other higher-priority jobs inside 
Hadoop, a user or some cluster-level resource scheduling service should be able 
to suspend and/or resume some particular jobs within Hadoop application.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.
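
A minimal sketch of the suspend/resume semantics described above (illustrative 
only, not the team's implementation): while suspended, no new containers are 
granted, but containers that are already running are left untouched.
{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of a per-job suspend flag consulted by the allocation path.
public class SuspendableJob {
  private final AtomicBoolean suspended = new AtomicBoolean(false);

  public void suspend() { suspended.set(true); }
  public void resume()  { suspended.set(false); }

  // Checked before granting another container; running containers are unaffected.
  public boolean mayAllocateNewContainer() {
    return !suspended.get();
  }
}
{code}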



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs

2014-06-17 Thread Richard Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Chen updated YARN-2172:
---

Description: 
In a multi-application cluster environment, jobs running inside Hadoop YARN may 
be of lower-priority than jobs running outside Hadoop YARN like HBase. To give 
way to other higher-priority jobs inside Hadoop, a user or some cluster-level 
resource scheduling service should be able to suspend and/or resume some 
particular jobs within Hadoop YARN.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.

  was:
In a multi-application cluster environment, jobs running inside Hadoop YARN may 
be of lower-priority than jobs running outside Hadoop YARN like HBase. To give 
way to other higher-priority jobs inside Hadoop, a user or some cluster-level 
resource scheduling service should be able to suspend and/or resume some 
particular jobs within Hadoop application.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.


 Suspend/Resume Hadoop Jobs
 --

 Key: YARN-2172
 URL: https://issues.apache.org/jira/browse/YARN-2172
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager, webapp
Affects Versions: 2.2.0
 Environment: CentOS 6.5, Hadoop 2.2.0
Reporter: Richard Chen
  Labels: hadoop, jobs, resume, suspend
 Fix For: 2.2.0

   Original Estimate: 336h
  Remaining Estimate: 336h

 In a multi-application cluster environment, jobs running inside Hadoop YARN 
 may be of lower-priority than jobs running outside Hadoop YARN like HBase. To 
 give way to other higher-priority jobs inside Hadoop, a user or some 
 cluster-level resource scheduling service should be able to suspend and/or 
 resume some particular jobs within Hadoop YARN.
 When target jobs inside Hadoop are suspended, those already allocated and 
 running task containers will continue to run until their completion or active 
 preemption by other ways. But no more new containers would be allocated to 
 the target jobs. In contrast, when suspended jobs are put into resume mode, 
 they will continue to run from the previous job progress and have new task 
 containers allocated to complete the rest of the jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs

2014-06-17 Thread Richard Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Chen updated YARN-2172:
---

Description: 
In a multi-application cluster environment, jobs running inside Hadoop YARN may 
be of lower-priority than jobs running outside Hadoop YARN like HBase. To give 
way to other higher-priority jobs inside Hadoop, a user or some cluster-level 
resource scheduling service should be able to suspend and/or resume some 
particular jobs within Hadoop application.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.

  was:
In a multi-application cluster environment, jobs running inside Hadoop 
application may be of lower-priority than jobs running inside other 
applications like HBase. To give way to other higher-priority jobs inside 
Hadoop, a user or some cluster-level resource scheduling service should be able 
to suspend and/or resume some particular jobs within Hadoop application.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.


 Suspend/Resume Hadoop Jobs
 --

 Key: YARN-2172
 URL: https://issues.apache.org/jira/browse/YARN-2172
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager, webapp
Affects Versions: 2.2.0
 Environment: CentOS 6.5, Hadoop 2.2.0
Reporter: Richard Chen
  Labels: hadoop, jobs, resume, suspend
 Fix For: 2.2.0

   Original Estimate: 336h
  Remaining Estimate: 336h

 In a multi-application cluster environment, jobs running inside Hadoop YARN 
 may be of lower-priority than jobs running outside Hadoop YARN like HBase. To 
 give way to other higher-priority jobs inside Hadoop, a user or some 
 cluster-level resource scheduling service should be able to suspend and/or 
 resume some particular jobs within Hadoop application.
 When target jobs inside Hadoop are suspended, those already allocated and 
 running task containers will continue to run until their completion or active 
 preemption by other ways. But no more new containers would be allocated to 
 the target jobs. In contrast, when suspended jobs are put into resume mode, 
 they will continue to run from the previous job progress and have new task 
 containers allocated to complete the rest of the jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs

2014-06-17 Thread Richard Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Chen updated YARN-2172:
---

Description: 
In a multi-application cluster environment, jobs running inside Hadoop YARN may 
be of lower-priority than jobs running outside Hadoop YARN like HBase. To give 
way to other higher-priority jobs inside Hadoop, a user or some cluster-level 
resource scheduling service should be able to suspend and/or resume some 
particular jobs within Hadoop YARN.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.

My team has completed its implementation and our tests showed it works in a 
rather solid way. 

  was:
In a multi-application cluster environment, jobs running inside Hadoop YARN may 
be of lower-priority than jobs running outside Hadoop YARN like HBase. To give 
way to other higher-priority jobs inside Hadoop, a user or some cluster-level 
resource scheduling service should be able to suspend and/or resume some 
particular jobs within Hadoop YARN.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.

My team has completed its implementation and our tests showed it is working in 
a rather solid way. 


 Suspend/Resume Hadoop Jobs
 --

 Key: YARN-2172
 URL: https://issues.apache.org/jira/browse/YARN-2172
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager, webapp
Affects Versions: 2.2.0
 Environment: CentOS 6.5, Hadoop 2.2.0
Reporter: Richard Chen
  Labels: hadoop, jobs, resume, suspend
 Fix For: 2.2.0

   Original Estimate: 336h
  Remaining Estimate: 336h

 In a multi-application cluster environment, jobs running inside Hadoop YARN 
 may be of lower-priority than jobs running outside Hadoop YARN like HBase. To 
 give way to other higher-priority jobs inside Hadoop, a user or some 
 cluster-level resource scheduling service should be able to suspend and/or 
 resume some particular jobs within Hadoop YARN.
 When target jobs inside Hadoop are suspended, those already allocated and 
 running task containers will continue to run until their completion or active 
 preemption by other ways. But no more new containers would be allocated to 
 the target jobs. In contrast, when suspended jobs are put into resume mode, 
 they will continue to run from the previous job progress and have new task 
 containers allocated to complete the rest of the jobs.
 My team has completed its implementation and our tests showed it works in a 
 rather solid way. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs

2014-06-17 Thread Richard Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Chen updated YARN-2172:
---

Description: 
In a multi-application cluster environment, jobs running inside Hadoop YARN may 
be of lower-priority than jobs running outside Hadoop YARN like HBase. To give 
way to other higher-priority jobs inside Hadoop, a user or some cluster-level 
resource scheduling service should be able to suspend and/or resume some 
particular jobs within Hadoop YARN.

When target jobs inside Hadoop are suspended, those already allocated and 
running task containers will continue to run until their completion or active 
preemption by other ways. But no more new containers would be allocated to the 
target jobs. In contrast, when suspended jobs are put into resume mode, they 
will continue to run from the previous job progress and have new task 
containers allocated to complete the rest of the jobs.

My team has completed its implementation and our tests showed it is working in 
a rather solid way. 

  was:
In a multi-application cluster environment, jobs running inside Hadoop YARN may 
be of lower-priority than jobs running outside Hadoop YARN like HBase. To give 
way to other higher-priority jobs inside Hadoop, a user or some cluster-level 
resource scheduling service should be able to suspend and/or resume some 
particular jobs within Hadoop YARN.

When target jobs inside Hadoop are suspended, their already allocated and 
running task containers continue to run until they complete or are actively 
preempted by other means, but no new containers are allocated to the suspended 
jobs. In contrast, when suspended jobs are resumed, they continue from their 
previous progress and new task containers are allocated to complete the 
remaining work.


 Suspend/Resume Hadoop Jobs
 --

 Key: YARN-2172
 URL: https://issues.apache.org/jira/browse/YARN-2172
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager, webapp
Affects Versions: 2.2.0
 Environment: CentOS 6.5, Hadoop 2.2.0
Reporter: Richard Chen
  Labels: hadoop, jobs, resume, suspend
 Fix For: 2.2.0

   Original Estimate: 336h
  Remaining Estimate: 336h

 In a multi-application cluster environment, jobs running inside Hadoop YARN 
 may be of lower-priority than jobs running outside Hadoop YARN like HBase. To 
 give way to other higher-priority jobs inside Hadoop, a user or some 
 cluster-level resource scheduling service should be able to suspend and/or 
 resume some particular jobs within Hadoop YARN.
 When target jobs inside Hadoop are suspended, their already allocated and 
 running task containers continue to run until they complete or are actively 
 preempted by other means, but no new containers are allocated to the 
 suspended jobs. In contrast, when suspended jobs are resumed, they continue 
 from their previous progress and new task containers are allocated to 
 complete the remaining work.
 My team has completed its implementation, and our tests show it works in a 
 rather solid way. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033906#comment-14033906
 ] 

Hudson commented on YARN-2167:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/])
YARN-2167. LeveldbIterator should get closed in 
NMLeveldbStateStoreService#loadLocalizationState() within finally block. 
Contributed by Junping Du (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603039)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java


 LeveldbIterator should get closed in 
 NMLeveldbStateStoreService#loadLocalizationState() within finally block
 

 Key: YARN-2167
 URL: https://issues.apache.org/jira/browse/YARN-2167
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du
 Fix For: 3.0.0, 2.5.0

 Attachments: YARN-2167.patch


 In NMLeveldbStateStoreService#loadLocalizationState(), we use a 
 LeveldbIterator to read the NM's localization state, but it is not closed in 
 a finally block. As a common practice, we should close this DB resource. 
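A minimal, self-contained sketch of the close-in-finally pattern this fix 
applies; the interface and method names below are illustrative stand-ins, not 
the actual NMLeveldbStateStoreService code.
{quote}
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

// CloseableIterator stands in for LeveldbIterator in this sketch.
public class CloseInFinallySketch {
  interface CloseableIterator extends Iterator<Map.Entry<byte[], byte[]>>, Closeable {}

  static void loadLocalizationState(CloseableIterator iter) throws IOException {
    try {
      while (iter.hasNext()) {
        Map.Entry<byte[], byte[]> entry = iter.next();
        // ... decode the entry and rebuild the in-memory localization state ...
      }
    } finally {
      // Always release the underlying DB resources, even if decoding throws.
      iter.close();
    }
  }
}
{quote}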



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033905#comment-14033905
 ] 

Hudson commented on YARN-2159:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/])
YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603003)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java


 Better logging in SchedulerNode#allocateContainer
 -

 Key: YARN-2159
 URL: https://issues.apache.org/jira/browse/YARN-2159
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Ray Chiang
Assignee: Ray Chiang
Priority: Trivial
  Labels: newbie, supportability
 Fix For: 2.5.0

 Attachments: YARN2159-01.patch


 This bit of code:
 {quote}
 LOG.info("Assigned container " + container.getId() + " of capacity "
 + container.getResource() + " on host " + rmNode.getNodeAddress()
 + ", which currently has " + numContainers + " containers, "
 + getUsedResource() + " used and " + getAvailableResource()
 + " available");
 {quote}
 results in a line like:
 {quote}
 2014-05-30 16:17:43,573 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
 Assigned container container_14000_0009_01_00 of capacity 
 <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently 
 has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> 
 available
 {quote}
 That message is fine in most cases, but looks pretty bad after the last 
 available allocation, since it says something like "vCores:0 available".
 Here is one suggested phrasing:
   - which has 18 containers, <memory:27648, vCores:18> used and 
 <memory:3072, vCores:0> available after allocation
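A hedged sketch of how the statement above might read with the suggested 
phrasing applied; it simply mirrors the snippet quoted earlier and is not 
necessarily the exact text of the committed patch.
{quote}
LOG.info("Assigned container " + container.getId() + " of capacity "
    + container.getResource() + " on host " + rmNode.getNodeAddress()
    + ", which has " + numContainers + " containers, "
    + getUsedResource() + " used and " + getAvailableResource()
    + " available after allocation");
{quote}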



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033909#comment-14033909
 ] 

Hudson commented on YARN-1339:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/])
YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed 
by Jason Lowe) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603036)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java


 Recover DeletionService state upon nodemanager restart
 --

 Key: YARN-1339
 URL: https://issues.apache.org/jira/browse/YARN-1339
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Fix For: 2.5.0

 Attachments: YARN-1339.patch, YARN-1339v2.patch, 
 YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, 
 YARN-1339v6.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts

2014-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033911#comment-14033911
 ] 

Hudson commented on YARN-1885:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/])
YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs 
where the completed applications previously ran in case of RM restart. 
Contributed by Wangda Tan (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603028)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* 

[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-17 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-2171:
-

Attachment: YARN-2171.patch

Patch to use AtomicInteger for the number of nodes so we can avoid grabbing the 
lock to access the value.  I also added a unit test to verify allocate doesn't 
try to grab the capacity scheduler lock.
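A minimal sketch of the approach described above, assuming a hypothetical 
scheduler class; the actual CapacityScheduler field and method names may 
differ.
{quote}
import java.util.concurrent.atomic.AtomicInteger;

// Illustration of keeping the node count in an AtomicInteger so that readers
// do not need to take the coarse scheduler lock.
public class NodeCountSketch {
  private final AtomicInteger numNodes = new AtomicInteger(0);

  // Called while the scheduler already holds its lock for node add/remove events.
  void nodeAdded()   { numNodes.incrementAndGet(); }
  void nodeRemoved() { numNodes.decrementAndGet(); }

  // Called from allocate() heartbeats; no scheduler lock required.
  public int getNumClusterNodes() { return numNodes.get(); }
}
{quote}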

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2171.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API

2014-06-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033950#comment-14033950
 ] 

Sandy Ryza commented on YARN-409:
-

Definitely. Will close this because there seems to be more activity there.

 Allow apps to be killed via the RM REST API
 ---

 Key: YARN-409
 URL: https://issues.apache.org/jira/browse/YARN-409
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 The RM REST API currently allows getting information about running 
 applications.  Adding the capability to kill applications would allow systems 
 like Hue to perform their functions over HTTP.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-409) Allow apps to be killed via the RM REST API

2014-06-17 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved YARN-409.
-

Resolution: Duplicate

 Allow apps to be killed via the RM REST API
 ---

 Key: YARN-409
 URL: https://issues.apache.org/jira/browse/YARN-409
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 The RM REST API currently allows getting information about running 
 applications.  Adding the capability to kill applications would allow systems 
 like Hue to perform their functions over HTTP.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2162) Fair Scheduler: ability to configure minResources and maxResources in terms of percentage

2014-06-17 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar updated YARN-2162:
-

Description: 
minResources and maxResources in fair scheduler configs are expressed in terms 
of absolute numbers X mb, Y vcores. 
As a result, when we expand or shrink our hadoop cluster, we need to 
recalculate and change minResources/maxResources accordingly, which is pretty 
inconvenient.
We can circumvent this problem if we can optionally configure these properties 
in terms of percentage of cluster capacity. 

  was:
minResources and maxResources in fair scheduler configs are expressed in terms 
of absolute numbers X mb, Y vcores. 
As a result, when we expand or shrink our hadoop cluster, we need to 
recalculate and change minResources/maxResources accordingly, which is pretty 
inconvenient.
We can circumvent this problem if we can (optionally) configure these 
properties in terms of percentage of cluster capacity. 


 Fair Scheduler: ability to configure minResources and maxResources in terms 
 of percentage
 -

 Key: YARN-2162
 URL: https://issues.apache.org/jira/browse/YARN-2162
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
  Labels: scheduler

 minResources and maxResources in fair scheduler configs are expressed in 
 terms of absolute numbers X mb, Y vcores. 
 As a result, when we expand or shrink our hadoop cluster, we need to 
 recalculate and change minResources/maxResources accordingly, which is pretty 
 inconvenient.
 We can circumvent this problem if we can optionally configure these 
 properties in terms of percentage of cluster capacity. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2173) Enabling HTTPS for the reader REST APIs

2014-06-17 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2173:
-

 Summary: Enabling HTTPS for the reader REST APIs
 Key: YARN-2173
 URL: https://issues.apache.org/jira/browse/YARN-2173
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2162) Fair Scheduler: ability to optionally configure minResources and maxResources in terms of percentage

2014-06-17 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar updated YARN-2162:
-

Summary: Fair Scheduler: ability to optionally configure minResources and 
maxResources in terms of percentage  (was: Fair Scheduler: ability to configure 
minResources and maxResources in terms of percentage)

 Fair Scheduler: ability to optionally configure minResources and maxResources 
 in terms of percentage
 

 Key: YARN-2162
 URL: https://issues.apache.org/jira/browse/YARN-2162
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
  Labels: scheduler

 minResources and maxResources in fair scheduler configs are expressed in 
 terms of absolute numbers X mb, Y vcores. 
 As a result, when we expand or shrink our hadoop cluster, we need to 
 recalculate and change minResources/maxResources accordingly, which is pretty 
 inconvenient.
 We can circumvent this problem if we can optionally configure these 
 properties in terms of percentage of cluster capacity. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2174) Enabling HTTPS for the writer REST API

2014-06-17 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2174:
-

 Summary: Enabling HTTPS for the writer REST API
 Key: YARN-2174
 URL: https://issues.apache.org/jira/browse/YARN-2174
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2162) Fair Scheduler: ability to optionally configure minResources and maxResources in terms of percentage

2014-06-17 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034019#comment-14034019
 ] 

Ashwin Shankar commented on YARN-2162:
--

[~maysamyabandeh], yes that was the intention. Changed title and description to 
make it clear.

 Fair Scheduler: ability to optionally configure minResources and maxResources 
 in terms of percentage
 

 Key: YARN-2162
 URL: https://issues.apache.org/jira/browse/YARN-2162
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
  Labels: scheduler

 minResources and maxResources in fair scheduler configs are expressed in 
 terms of absolute numbers X mb, Y vcores. 
 As a result, when we expand or shrink our hadoop cluster, we need to 
 recalculate and change minResources/maxResources accordingly, which is pretty 
 inconvenient.
 We can circumvent this problem if we can optionally configure these 
 properties in terms of percentage of cluster capacity. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2174) Enabling HTTPS for the writer REST API

2014-06-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen reassigned YARN-2174:
-

Assignee: Zhijie Shen

 Enabling HTTPS for the writer REST API
 --

 Key: YARN-2174
 URL: https://issues.apache.org/jira/browse/YARN-2174
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2174) Enabling HTTPS for the writer REST API

2014-06-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2174:
--

Description: Since we'd like to allow the application to put timeline data 
from the client, the AM, and even the containers, we need to provide a way to 
distribute the keystore.

 Enabling HTTPS for the writer REST API
 --

 Key: YARN-2174
 URL: https://issues.apache.org/jira/browse/YARN-2174
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Since we'd like to allow the application to put timeline data from the 
 client, the AM, and even the containers, we need to provide a way to 
 distribute the keystore.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart

2014-06-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034021#comment-14034021
 ] 

Junping Du commented on YARN-1341:
--

[~jlowe], thanks for the patch. I am currently reviewing it, and it looks like 
some of the code (e.g., LeveldbIterator, NMStateStoreService) has already been 
committed in other patches. Would you resync the patch against trunk? Thanks!

 Recover NMTokens upon nodemanager restart
 -

 Key: YARN-1341
 URL: https://issues.apache.org/jira/browse/YARN-1341
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
 YARN-1341v4-and-YARN-1987.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2102) More generalized timeline ACLs

2014-06-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2102:
--

Summary: More generalized timeline ACLs  (was: Extend access control for 
configured user/group list)

 More generalized timeline ACLs
 --

 Key: YARN-2102
 URL: https://issues.apache.org/jira/browse/YARN-2102
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Like ApplicationACLsManager, we should also allow configured user/group to 
 access the timeline data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2102) More generalized timeline ACLs

2014-06-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2102:
--

Description: We need to differentiate the access controls for reading and 
writing operations, and we need to think about cross-entity access control. For 
example, if we are executing a workflow of MR jobs that writes the timeline 
data of this workflow, we don't want other users to pollute the timeline data 
of the workflow by putting something under it.  (was: Like 
ApplicationACLsManager, we should also allow configured user/group to access 
the timeline data.)

 More generalized timeline ACLs
 --

 Key: YARN-2102
 URL: https://issues.apache.org/jira/browse/YARN-2102
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 We need to differentiate the access controls for reading and writing 
 operations, and we need to think about cross-entity access control. For 
 example, if we are executing a workflow of MR jobs that writes the 
 timeline data of this workflow, we don't want other users to pollute the 
 timeline data of the workflow by putting something under it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034034#comment-14034034
 ] 

Hadoop QA commented on YARN-2171:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650819/YARN-2171.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4014//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4014//console

This message is automatically generated.

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2171.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit

2014-06-17 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated YARN-2083:
--

Attachment: YARN-2083-2.patch

Moved the test code to TestFSQueue.java.

 In fair scheduler, Queue should not be assigned more containers when its 
 usedResource has reached the maxResource limit
 ---

 Key: YARN-2083
 URL: https://issues.apache.org/jira/browse/YARN-2083
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Yi Tian
  Labels: assignContainer, fair, scheduler
 Fix For: 2.4.1

 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch


 In fair scheduler, FSParentQueue and FSLeafQueue do an 
 assignContainerPreCheck to guarantee the queue is not over its limit.
 But the fitsIn function in Resource.java does not return false when 
 usedResource equals maxResource.
 I think we should create a new function, fitsInWithoutEqual, to use instead 
 of fitsIn in this case.
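An illustrative sketch of the proposed strict check (the fitsInWithoutEqual 
idea above); this is not the actual patch, it only compares memory and vcores, 
and the "strict in both dimensions" reading is an assumption.
{quote}
import org.apache.hadoop.yarn.api.records.Resource;

// One reasonable reading of the proposal: the "smaller" resource fits only if it is
// strictly below the "bigger" resource in both dimensions, so a queue whose usage
// already equals its max is not assigned more containers.
public class StrictFitsSketch {
  public static boolean fitsInWithoutEqual(Resource smaller, Resource bigger) {
    return smaller.getMemory() < bigger.getMemory()
        && smaller.getVirtualCores() < bigger.getVirtualCores();
  }
}
{quote}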



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()

2014-06-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated YARN-868:
-

Target Version/s: 2.5.0  (was: 2.1.0-beta)

 YarnClient should set the service address in tokens returned by 
 getRMDelegationToken()
 --

 Key: YARN-868
 URL: https://issues.apache.org/jira/browse/YARN-868
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah

 Either the client should set this information into the token or the client 
 layer should expose an api that returns the service address.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034060#comment-14034060
 ] 

Vinod Kumar Vavilapalli commented on YARN-2171:
---

The code changes look fine enough to me.

The test is not so useful beyond validating this ticket, but that's okay. I see 
that we don't have any test validating the number of nodes itself explicitly; 
shall we add that here?

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2171.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-365) Each NM heartbeat should not generate an event for the Scheduler

2014-06-17 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-365:


Attachment: YARN-365.branch-0.23.patch

Patch for branch-0.23.  RM unit tests pass, and I manually tested it as well on 
a single-node cluster forcing the scheduler to run slower than the heartbeat 
interval.

 Each NM heartbeat should not generate an event for the Scheduler
 

 Key: YARN-365
 URL: https://issues.apache.org/jira/browse/YARN-365
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager, scheduler
Affects Versions: 0.23.5
Reporter: Siddharth Seth
Assignee: Xuan Gong
 Fix For: 2.1.0-beta

 Attachments: Prototype2.txt, Prototype3.txt, YARN-365.1.patch, 
 YARN-365.10.patch, YARN-365.2.patch, YARN-365.3.patch, YARN-365.4.patch, 
 YARN-365.5.patch, YARN-365.6.patch, YARN-365.7.patch, YARN-365.8.patch, 
 YARN-365.9.patch, YARN-365.branch-0.23.patch


 Follow up from YARN-275
 https://issues.apache.org/jira/secure/attachment/12567075/Prototype.txt



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034154#comment-14034154
 ] 

Hadoop QA commented on YARN-2083:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650834/YARN-2083-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSQueue

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4015//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4015//console

This message is automatically generated.

 In fair scheduler, Queue should not be assigned more containers when its 
 usedResource has reached the maxResource limit
 ---

 Key: YARN-2083
 URL: https://issues.apache.org/jira/browse/YARN-2083
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Yi Tian
  Labels: assignContainer, fair, scheduler
 Fix For: 2.4.1

 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch


 In fair scheduler, FSParentQueue and FSLeafQueue do an 
 assignContainerPreCheck to guarantee the queue is not over its limit.
 But the fitsIn function in Resource.java does not return false when 
 usedResource equals maxResource.
 I think we should create a new function, fitsInWithoutEqual, to use instead 
 of fitsIn in this case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-06-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034160#comment-14034160
 ] 

Vinod Kumar Vavilapalli commented on YARN-1972:
---

bq. All in all a very high privilege required for NM. We are considering a 
future iteration in which we extract the privileged operations into a dedicated 
NT service (=daemon) and bestow the high privileges only to this service.
Thanks. Let's document this in a Windows-specific docs page.

bq. You are launching so many commands for every container - to chown files, to 
copy files etc.
bq. We'll measure. [..]  I don't think that moving the localization into native 
code would result in much benefit over a proper Java implementation.
I'd file an investigation ticket to track this.

bq. DCE and WCE no longer create user file cache, this is done solely by the 
localizer initDirs. DCE Test modified to reflect this. DCE.createUserCacheDirs 
renamed to createUserAppCacheDirs accordingly
Regarding the division of responsibility between the commands launched before 
starting the localizer and the work that happens inside the localizer: 
unfortunately, this still isn't ideal. Having the userCache created by the 
ContainerExecutor but not the file-cache is asymmetric and confusing. I propose 
that we split this refactoring into a separate JIRA and stick to your original 
code. Apologies for the back-and-forth on this one.

bq. There is more feedback to address (DRY between LCE and WCE localization 
launch, proper place for localization classpath jar).
So, you will work on those here, right?

Looks fine otherwise, except for the above comments and a request for some 
basic documentation.

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * changing the DCE-created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changing the actual container run command to use the 'createAsUser' command 
 of the winutils task instead of 'create'
 * running the localization as a standalone process instead of an in-process 
 Java method call. This in turn relies on the winutils createAsUser feature to 
 run the localization as the job user.
  
 When compared to the LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The WCE design came from a practical trial-and-error approach. I had to iron 
 out some issues around the Windows script shell limitations (command line 
 length) to get it to work, the biggest issue being the huge CLASSPATH that is 
 commonplace in Hadoop container executions. The job container itself already 
 deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 
 for details. For the WCE localizer launch as a separate container the same 
 issue had to be resolved, and I used the same 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set `yarn.nodemanager.container-executor.class` 
 to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group that the nodemanager service principal is a member of 
 (the equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml.
 For the WCE to work the nodemanager must run as a service principal that is a 
 member of the local Administrators group or as LocalSystem. This is derived 
 from the need to invoke the LoadUserProfile API, which mentions these 
 requirements in its specification. This is in addition 

[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-06-17 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034179#comment-14034179
 ] 

Remus Rusanu commented on YARN-1972:


Thanks for the update Vinod. I have updated the item description to act as 
documentation. Do you think anything more is needed?

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * changing the DCE-created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changing the actual container run command to use the 'createAsUser' command 
 of the winutils task instead of 'create'
 * running the localization as a standalone process instead of an in-process 
 Java method call. This in turn relies on the winutils createAsUser feature to 
 run the localization as the job user.
  
 When compared to the LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The WCE design came from a practical trial-and-error approach. I had to iron 
 out some issues around the Windows script shell limitations (command line 
 length) to get it to work, the biggest issue being the huge CLASSPATH that is 
 commonplace in Hadoop container executions. The job container itself already 
 deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 
 for details. For the WCE localizer launch as a separate container the same 
 issue had to be resolved, and I used the same 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set `yarn.nodemanager.container-executor.class` 
 to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group that the nodemanager service principal is a member of 
 (the equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml (a 
 minimal configuration sketch follows this description).
 For the WCE to work the nodemanager must run as a service principal that is a 
 member of the local Administrators group or as LocalSystem. This is derived 
 from the need to invoke the LoadUserProfile API, which mentions these 
 requirements in its specification. This is in addition to the SE_TCB 
 privilege mentioned in YARN-1063, but this requirement automatically implies 
 that the SE_TCB privilege is held by the nodemanager. For the Linux speakers 
 in the audience, the requirement is basically to run the NM as root.
 h2. Dedicated high privilege Service
 Due to the high privilege required by the WCE we had discussed the need to 
 isolate the high-privilege operations into a separate process, an 'executor' 
 service that is solely responsible for starting the containers (including the 
 localizer). The NM would have to authenticate, authorize and communicate with 
 this service via an IPC mechanism and use this service to launch the 
 containers. I still believe we'll end up deploying such a service, but the 
 effort to onboard such a new platform-specific service onto the project is 
 not trivial.
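A hedged configuration sketch for the two settings named in the Deployment 
Requirements above, shown as programmatic Configuration calls rather than a 
yarn-site.xml fragment; the group name is a placeholder, not a recommended 
value.
{quote}
import org.apache.hadoop.conf.Configuration;

public class WceConfigSketch {
  public static Configuration configure(Configuration conf) {
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
    // Placeholder group: any Windows security group the NM service principal belongs to.
    conf.set("yarn.nodemanager.windows-secure-container-executor.group",
        "HadoopNodeManagers");
    return conf;
  }
}
{quote}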



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-06-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034186#comment-14034186
 ] 

Jian He commented on YARN-1367:
---

[~adhoot], mind updating the patch please? I'm happy to work on it if you are 
busy.

 After restart NM should resync with the RM without killing containers
 -

 Key: YARN-1367
 URL: https://issues.apache.org/jira/browse/YARN-1367
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
 Attachments: YARN-1367.prototype.patch


 After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
  Upon receiving the resync response, the NM kills all containers and 
 re-registers with the RM. The NM should be changed to not kill the container 
 and instead inform the RM about all currently running containers including 
 their allocations etc. After the re-register, the NM should send all pending 
 container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-06-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034268#comment-14034268
 ] 

Vinod Kumar Vavilapalli commented on YARN-1972:
---

That looks fine. I was suggesting we create one more document at 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/.

You can create that doc and add it to the patch together with addressing my 
review in the last comment.

Tx again for working on this, it's almost there.. 

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * changing the DCE-created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changing the actual container run command to use the 'createAsUser' command 
 of the winutils task instead of 'create'
 * running the localization as a standalone process instead of an in-process 
 Java method call. This in turn relies on the winutils createAsUser feature to 
 run the localization as the job user.
  
 When compared to the LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The WCE design came from a practical trial-and-error approach. I had to iron 
 out some issues around the Windows script shell limitations (command line 
 length) to get it to work, the biggest issue being the huge CLASSPATH that is 
 commonplace in Hadoop container executions. The job container itself already 
 deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 
 for details. For the WCE localizer launch as a separate container the same 
 issue had to be resolved, and I used the same 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set `yarn.nodemanager.container-executor.class` 
 to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group that the nodemanager service principal is a member of 
 (the equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml.
 For the WCE to work the nodemanager must run as a service principal that is a 
 member of the local Administrators group or as LocalSystem. This is derived 
 from the need to invoke the LoadUserProfile API, which mentions these 
 requirements in its specification. This is in addition to the SE_TCB 
 privilege mentioned in YARN-1063, but this requirement automatically implies 
 that the SE_TCB privilege is held by the nodemanager. For the Linux speakers 
 in the audience, the requirement is basically to run the NM as root.
 h2. Dedicated high privilege Service
 Due to the high privilege required by the WCE we had discussed the need to 
 isolate the high-privilege operations into a separate process, an 'executor' 
 service that is solely responsible for starting the containers (including the 
 localizer). The NM would have to authenticate, authorize and communicate with 
 this service via an IPC mechanism and use this service to launch the 
 containers. I still believe we'll end up deploying such a service, but the 
 effort to onboard such a new platform-specific service onto the project is 
 not trivial.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-06-17 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034359#comment-14034359
 ] 

Anubhav Dhoot commented on YARN-1367:
-

I am still working on it and will have an update soon.

 After restart NM should resync with the RM without killing containers
 -

 Key: YARN-1367
 URL: https://issues.apache.org/jira/browse/YARN-1367
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
 Attachments: YARN-1367.prototype.patch


 After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
  Upon receiving the resync response, the NM kills all containers and 
 re-registers with the RM. The NM should be changed to not kill the container 
 and instead inform the RM about all currently running containers including 
 their allocations etc. After the re-register, the NM should send all pending 
 container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-17 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-2171:
-

Attachment: YARN-2171v2.patch

The point of the unit test was to catch regressions at a high level. If anyone 
changes the code such that calling allocate() grabs the scheduler lock, the 
test will fail, whether that's a regression in this particular method or in 
some newly added method that ApplicationMasterService or the CapacityScheduler 
itself calls and that grabs the lock (a minimal sketch of this test idea 
appears after this comment).

I added a separate unit test to exercise the getNumClusterNodes method.

The AHS unit test failure seems unrelated, and it passes for me locally even 
with this change.
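A self-contained sketch of the regression-test idea described above: hold the 
coarse lock in one thread and assert that the lock-free read still completes 
within a timeout. The class, field, and method names are illustrative 
stand-ins, not the actual YARN test code.
{quote}
import java.util.concurrent.*;

public class LockRegressionSketch {
  static final Object schedulerLock = new Object();    // stands in for the scheduler's lock
  static volatile int numNodes = 3;                     // stands in for the lock-free node count

  static int getNumClusterNodes() { return numNodes; }  // must never need schedulerLock

  public static void main(String[] args) throws Exception {
    CountDownLatch lockHeld = new CountDownLatch(1);
    Thread holder = new Thread(() -> {
      synchronized (schedulerLock) {
        lockHeld.countDown();
        try { Thread.sleep(10_000); } catch (InterruptedException ignored) { }
      }
    });
    holder.start();
    lockHeld.await();   // make sure the lock is held before we read

    ExecutorService pool = Executors.newSingleThreadExecutor();
    Callable<Integer> readTask = LockRegressionSketch::getNumClusterNodes;
    Future<Integer> read = pool.submit(readTask);
    // Times out (and would fail a real test) if the read ever contends on schedulerLock.
    System.out.println("nodes = " + read.get(2, TimeUnit.SECONDS));
    holder.interrupt();
    pool.shutdownNow();
  }
}
{quote}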

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2171.patch, YARN-2171v2.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-06-17 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2175:
---

 Summary: Container localization has no timeouts and tasks can be 
stuck there for a long time
 Key: YARN-2175
 URL: https://issues.apache.org/jira/browse/YARN-2175
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot


There are no timeouts that can be used to limit the time taken by various 
container startup operations. Localization, for example, could take a long time 
and there is no way to kill a task if it is stuck in one of these states. These 
delays may have nothing to do with the task itself and could be an issue within 
the platform. 

Ideally there should be configurable time limits for the various states within 
the NodeManager. The RM does not care about most of these; they only concern 
the AM and the NM. We can start by making these global configurable defaults, 
and in the future we can make it fancier by letting the AM override them in the 
start-container request.

This JIRA will be used to limit localization time; we can open others if we 
feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-06-17 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2175:


Affects Version/s: 2.4.0

 Container localization has no timeouts and tasks can be stuck there for a 
 long time
 ---

 Key: YARN-2175
 URL: https://issues.apache.org/jira/browse/YARN-2175
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot

 There are no timeouts that can be used to limit the time taken by various 
 container startup operations. Localization, for example, could take a long 
 time and there is no way to kill a task if it is stuck in one of these 
 states. These delays may have nothing to do with the task itself and could be 
 an issue within the platform. 
 Ideally there should be configurable time limits for the various states 
 within the NodeManager. The RM does not care about most of these; they only 
 concern the AM and the NM. We can start by making these global configurable 
 defaults, and in the future we can make it fancier by letting the AM override 
 them in the start-container request.
 This JIRA will be used to limit localization time; we can open others if we 
 feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-06-17 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-2175:
---

Assignee: Anubhav Dhoot

 Container localization has no timeouts and tasks can be stuck there for a 
 long time
 ---

 Key: YARN-2175
 URL: https://issues.apache.org/jira/browse/YARN-2175
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot

 There are no timeouts that can be used to limit the time taken by various 
 container startup operations. Localization, for example, could take a long 
 time and there is no way to kill a task if it is stuck in one of these 
 states. These delays may have nothing to do with the task itself and could be 
 an issue within the platform. 
 Ideally there should be configurable time limits for the various states 
 within the NodeManager. The RM does not care about most of these; they only 
 concern the AM and the NM. We can start by making these global configurable 
 defaults, and in the future we can make it fancier by letting the AM override 
 them in the start-container request.
 This JIRA will be used to limit localization time; we can open others if we 
 feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2176) CapacityScheduler loops over all running applications rather than actively requesting apps

2014-06-17 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-2176:


 Summary: CapacityScheduler loops over all running applications 
rather than actively requesting apps
 Key: YARN-2176
 URL: https://issues.apache.org/jira/browse/YARN-2176
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.4.0
Reporter: Jason Lowe


The capacity scheduler's performance is dominated primarily by 
LeafQueue.assignContainers, which currently loops over all applications that 
are running in the queue.  It would be more efficient if we looped over just 
the applications that are actively asking for resources rather than all 
applications, as there could be thousands of applications running but only a 
few hundred that are currently asking for resources.
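
A minimal, generic sketch of that idea (illustrative only, not the actual 
LeafQueue code): track the applications that currently have outstanding 
requests in a separate set and let the scheduling loop walk only that set.

{code:java}
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch, not CapacityScheduler code: applications with pending
// resource requests are kept separate from all running apps, so the
// assignContainers-style loop only iterates the (much smaller) active set.
public class ActiveAppTrackerSketch<A> {
  private final Set<A> runningApps = new LinkedHashSet<A>();
  private final Set<A> appsWithPendingRequests = new LinkedHashSet<A>();

  public synchronized void appStarted(A app) {
    runningApps.add(app);
  }

  public synchronized void appFinished(A app) {
    runningApps.remove(app);
    appsWithPendingRequests.remove(app);
  }

  // Called whenever an AM adds requests or its requests become fully satisfied.
  public synchronized void requestsUpdated(A app, boolean hasPending) {
    if (hasPending) {
      appsWithPendingRequests.add(app);
    } else {
      appsWithPendingRequests.remove(app);
    }
  }

  // Scheduling iterates a snapshot of only the actively requesting apps.
  public synchronized Set<A> appsToSchedule() {
    return new LinkedHashSet<A>(appsWithPendingRequests);
  }
}
{code}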



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-06-17 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034405#comment-14034405
 ] 

Anubhav Dhoot commented on YARN-1367:
-

I am still working on it and will have it ready soon.





 After restart NM should resync with the RM without killing containers
 -

 Key: YARN-1367
 URL: https://issues.apache.org/jira/browse/YARN-1367
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
 Attachments: YARN-1367.prototype.patch


 After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
 Upon receiving the resync response, the NM kills all containers and 
 re-registers with the RM. The NM should be changed to not kill the containers 
 and instead inform the RM about all currently running containers, including 
 their allocations etc. After the re-register, the NM should send all pending 
 container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps

2014-06-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-1373.
---

Resolution: Duplicate
  Assignee: Omkar Vinit Joshi  (was: Anubhav Dhoot)

bq. Currently the RM moves recovered app attempts to a terminal recovered 
state and starts a new attempt.
This is no longer an issue - it hasn't been since YARN-1210. Even in 
non-work-preserving RM restart, the RM explicitly never kills the AMs; it's the 
nodes that kill all containers - this was done in YARN-1210. The state machines 
are already set up correctly, so no changes are needed here. Closing as a 
duplicate of YARN-1210.

 Transition RMApp and RMAppAttempt state to RUNNING after restart for 
 recovered running apps
 ---

 Key: YARN-1373
 URL: https://issues.apache.org/jira/browse/YARN-1373
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi

 Currently the RM moves recovered app attempts to a terminal recovered 
 state and starts a new attempt. Instead, it will have to transition the last 
 attempt to a running state such that it can proceed as normal once the 
 running attempt has resynced with the ApplicationMasterService (YARN-1365 and 
 YARN-1366). If the RM had started the application container before dying, then 
 the AM would be up and trying to contact the RM. The RM may have died 
 before launching the container. For this case, the RM should wait for the AM 
 liveness period and issue a kill for the stored master container. 
 It should transition this attempt to some RECOVER_ERROR state and proceed to 
 start a new attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer

2014-06-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2174:
--

Summary: Enabling HTTPs for the writer REST API of TimelineServer  (was: 
Enabling HTTPs for the writer REST API)

 Enabling HTTPs for the writer REST API of TimelineServer
 

 Key: YARN-2174
 URL: https://issues.apache.org/jira/browse/YARN-2174
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Since we'd like to allow the application to put the timeline data at the 
 client, the AM and even the containers, we need to provide a way to 
 distribute the keystore.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2173) Enabling HTTPS for the reader REST APIs of TimelineServer

2014-06-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2173:
--

Summary: Enabling HTTPS for the reader REST APIs of TimelineServer  (was: 
Enabling HTTPS for the reader REST APIs)

 Enabling HTTPS for the reader REST APIs of TimelineServer
 -

 Key: YARN-2173
 URL: https://issues.apache.org/jira/browse/YARN-2173
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034448#comment-14034448
 ] 

Vinod Kumar Vavilapalli commented on YARN-2052:
---

bq. BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. 
Ideally this should use the container-allocation timestamp, and we should depend 
on that instead of comparing container IDs. In any case, let's fix it separately.

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034452#comment-14034452
 ] 

Jian He commented on YARN-2052:
---

Another question is how we are going to show the containerId string, 
specifically in the toString() method. If we just use the original containerId 
string + UUID, it'll be inconvenient for debugging as the UUID has no meaning. 

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034456#comment-14034456
 ] 

Hadoop QA commented on YARN-2171:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650880/YARN-2171v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4016//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4016//console

This message is automatically generated.

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2171.patch, YARN-2171v2.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.
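
One possible shape of a fix, sketched here for illustration only (this is not 
the attached patch): keep the cluster node count in an atomic counter that is 
updated on node add/remove, so getNumClusterNodes() can be answered without 
taking the scheduler lock during allocate().

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch, not the actual CapacityScheduler change.
public class ClusterNodeCountSketch {
  private final AtomicInteger numNodes = new AtomicInteger();

  public void nodeAdded()   { numNodes.incrementAndGet(); }
  public void nodeRemoved() { numNodes.decrementAndGet(); }

  // Lock-free read; AM heartbeats no longer contend on the scheduler lock here.
  public int getNumClusterNodes() { return numNodes.get(); }
}
{code}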



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034474#comment-14034474
 ] 

Tsuyoshi OZAWA commented on YARN-2052:
--

Vinod, OK. I'll create a new JIRA to address it.

{quote}
Another question is how we are going to show the containerId string, 
specifically in the toString() method. If we just use the original containerId 
string + UUID, it'll be inconvenient for debugging as the UUID has no meaning. 
{quote}

From a developer's point of view, you're right. One idea is to show the RM_ID 
instead of a UUID, validating the RM_ID at startup time to confirm it does not 
include an underscore. One concern with this approach is that we'd break backward 
compatibility of yarn-site.xml. If we can accept that, it's the better approach.
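
A tiny sketch of that startup validation (the class and method names are made 
up for illustration, not existing code): reject an RM_ID that contains an 
underscore, since '_' is the separator inside the ID strings.

{code:java}
// Hypothetical sketch of the suggested startup check; not existing YARN code.
public class RmIdValidationSketch {
  public static void validateRmId(String rmId) {
    if (rmId == null || rmId.contains("_")) {
      throw new IllegalArgumentException(
          "RM_ID must be non-null and must not contain '_': " + rmId);
    }
  }
}
{code}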

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1341) Recover NMTokens upon nodemanager restart

2014-06-17 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1341:
-

Attachment: YARN-1341v5.patch

Thanks for taking a look, Junping!  I've updated the patch to trunk.

 Recover NMTokens upon nodemanager restart
 -

 Key: YARN-1341
 URL: https://issues.apache.org/jira/browse/YARN-1341
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
 YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034541#comment-14034541
 ] 

Jian He commented on YARN-2052:
---

There seems to be another problem with the randomId approach: if the user wants 
to kill the container, the user has to be aware of the random ID.

Had an offline discussion with Vinod. Maybe it's still better to persist some 
sequence number to indicate the number of RM restarts when RM starts up. Today 
containerId#id is an int (32 bits); we could reserve some bits at the front for 
the number of RM restarts, e.g. 32 bits divided into 8 bits for the number of RM 
restarts and 24 bits for the container sequence number. Each time the RM 
restarts, we increase the RM sequence number. Also, we should have a follow-up 
jira to change the containerId/appId from integer to long and deprecate the old 
one. [~ozawa], do you agree?
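
A rough sketch of that bit split, just to make the arithmetic concrete 
(illustrative only, not the eventual ContainerId change): the high 8 bits carry 
the RM-restart count, capping it at 256 restarts, and the low 24 bits carry the 
per-app container sequence number, allowing about 16.7 million containers per app.

{code:java}
// Illustrative sketch of the 8/24-bit split described above; not real ContainerId code.
public final class EpochContainerIdSketch {
  private static final int EPOCH_BITS = 8;
  private static final int SEQUENCE_BITS = 32 - EPOCH_BITS;          // 24
  private static final int SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1; // 0x00FFFFFF

  // The RM-restart count occupies the high 8 bits, the sequence the low 24 bits.
  static int encode(int rmRestartCount, int containerSequence) {
    return (rmRestartCount << SEQUENCE_BITS) | (containerSequence & SEQUENCE_MASK);
  }

  static int restartCountOf(int id) { return id >>> SEQUENCE_BITS; }
  static int sequenceOf(int id)     { return id & SEQUENCE_MASK; }
}
{code}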

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034588#comment-14034588
 ] 

Hadoop QA commented on YARN-1341:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650914/YARN-1341v5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4017//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4017//console

This message is automatically generated.

 Recover NMTokens upon nodemanager restart
 -

 Key: YARN-1341
 URL: https://issues.apache.org/jira/browse/YARN-1341
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
 YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails

2014-06-17 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034612#comment-14034612
 ] 

Daryn Sharp commented on YARN-2147:
---

I don't think the patch handles the use case it's designed for.  If job 
submission failed with a bland "Read timed out", then logging all the tokens in 
the RM log doesn't help the end user, nor does the RM log even answer the 
question "which token timed out?".

What you really want to do is change 
{{DelegationTokenRenewer#handleAppSubmitEvent}} to trap exceptions from 
{{renewToken}}.  Wrap the exception with a more descriptive exception that 
stringifies to the user as "Can't renew token blah: Read timed out".
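
A minimal sketch of that wrapping (illustrative only; the renewToken helper 
below is a stand-in for {{DelegationTokenRenewer#renewToken}}, and its exact 
signature is assumed):

{code:java}
import java.io.IOException;
import org.apache.hadoop.security.token.Token;

// Sketch of wrapping a renewal failure with the offending token, so the client
// sees "Can't renew token <token>: <original error>" instead of a bare timeout.
public class TokenRenewWrapSketch {
  static void renewWithContext(Token<?> token) throws IOException {
    try {
      renewToken(token);
    } catch (IOException e) {
      throw new IOException(
          "Can't renew token " + token + ": " + e.getMessage(), e);
    }
  }

  // Placeholder standing in for the real DelegationTokenRenewer#renewToken call.
  private static void renewToken(Token<?> token) throws IOException {
    // real renewal would happen here
  }
}
{code}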

 client lacks delegation token exception details when application submit fails
 -

 Key: YARN-2147
 URL: https://issues.apache.org/jira/browse/YARN-2147
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Chen He
Priority: Minor
 Attachments: YARN-2147-v2.patch, YARN-2147.patch


 When a client submits an application and the delegation token process fails, 
 the client can lack critical details needed to understand the nature of the 
 error.  Only the message of the error exception is conveyed to the client, 
 which sometimes isn't enough to debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2144) Add logs when preemption occurs

2014-06-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034624#comment-14034624
 ] 

Jian He commented on YARN-2144:
---

the patch needs rebase, can you update please? thx

 Add logs when preemption occurs
 ---

 Key: YARN-2144
 URL: https://issues.apache.org/jira/browse/YARN-2144
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.5.0
Reporter: Tassapol Athiapinya
Assignee: Wangda Tan
 Attachments: AM-page-preemption-info.png, YARN-2144.patch, 
 YARN-2144.patch, YARN-2144.patch


 There should be easy-to-read logs when preemption does occur. 
 1. For debugging purposes, the RM should log this.
 2. For administrative purposes, the RM web UI should have a page that shows 
 recent preemption events.
 RM logs should have the following properties (see the sketch after this list):
 * Logs are retrievable while an application is still running, and are flushed often.
 * Can distinguish between AM container preemption and task container 
 preemption, with the container ID shown.
 * Should be INFO-level logs.
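
As a sketch of the kind of INFO-level line being asked for (illustrative only, 
not the attached patch; the class and method names are invented):

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative sketch: one INFO line per preempted container, flagging whether
// it was the AM container and including the container ID.
public class PreemptionLogSketch {
  private static final Log LOG = LogFactory.getLog(PreemptionLogSketch.class);

  public static void logPreemption(ContainerId containerId, boolean isAmContainer) {
    LOG.info("Preempting " + (isAmContainer ? "AM" : "task")
        + " container " + containerId);
  }
}
{code}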



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034637#comment-14034637
 ] 

Tsuyoshi OZAWA commented on YARN-2052:
--

Basically, I agree with the approach. If we take the sequence-number approach, 
we should define the behavior when the sequence number overflows. One simple way 
is to fall back to the RM restart implemented in YARN-128. After changing the 
containerId/appId from integer to long, it'll happen very rarely. [~jianhe], 
what do you think about the behavior?

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034691#comment-14034691
 ] 

Bikas Saha commented on YARN-2052:
--

bq. Had an offline discussion with Vinod. Maybe it's still better to persist 
some sequence number to indicate the number of RM restarts when RM starts up.
Is this the same as the epoch number that was mentioned earlier in this jira? 
https://issues.apache.org/jira/browse/YARN-2052?focusedCommentId=13996675page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13996675.
 Seems to me that it's the same, with the epoch number renamed to num-rm-restarts.



 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps

2014-06-17 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034700#comment-14034700
 ] 

Bikas Saha commented on YARN-1373:
--

Sorry, I am not clear on how this is a dup. This jira is tracking new behavior in 
the RM that will transition a recovered RMAppImpl/RMAppAttemptImpl app (one that 
is actually still running) to a RUNNING state instead of a terminal recovered state. 
This is to ensure that the state machines are in the correct state for the 
running AM to resync and continue as running. This is not related to killing 
the app master process on the NM.

 Transition RMApp and RMAppAttempt state to RUNNING after restart for 
 recovered running apps
 ---

 Key: YARN-1373
 URL: https://issues.apache.org/jira/browse/YARN-1373
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi

 Currently the RM moves recovered app attempts to a terminal recovered 
 state and starts a new attempt. Instead, it will have to transition the last 
 attempt to a running state such that it can proceed as normal once the 
 running attempt has resynced with the ApplicationMasterService (YARN-1365 and 
 YARN-1366). If the RM had started the application container before dying, then 
 the AM would be up and trying to contact the RM. The RM may have died 
 before launching the container. For this case, the RM should wait for the AM 
 liveness period and issue a kill for the stored master container. 
 It should transition this attempt to some RECOVER_ERROR state and proceed to 
 start a new attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2144) Add logs when preemption occurs

2014-06-17 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2144:
-

Attachment: YARN-2144.patch

Rebased patch to latest trunk.

 Add logs when preemption occurs
 ---

 Key: YARN-2144
 URL: https://issues.apache.org/jira/browse/YARN-2144
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.5.0
Reporter: Tassapol Athiapinya
Assignee: Wangda Tan
 Attachments: AM-page-preemption-info.png, YARN-2144.patch, 
 YARN-2144.patch, YARN-2144.patch, YARN-2144.patch


 There should be easy-to-read logs when preemption does occur. 
 1. For debugging purposes, the RM should log this.
 2. For administrative purposes, the RM web UI should have a page that shows 
 recent preemption events.
 RM logs should have the following properties:
 * Logs are retrievable while an application is still running, and are flushed often.
 * Can distinguish between AM container preemption and task container 
 preemption, with the container ID shown.
 * Should be INFO-level logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034702#comment-14034702
 ] 

Tsuyoshi OZAWA commented on YARN-2052:
--

[~bikassaha], Yes, I think it's the same.

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034716#comment-14034716
 ] 

Jian He commented on YARN-2052:
---

bq. One simple way is to fall back to the RM restart implemented in YARN-128
Can you clarify what you mean by that?

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034722#comment-14034722
 ] 

Tsuyoshi OZAWA commented on YARN-2052:
--

I meant starting apps from a clean state after the restart, like RM restart 
phase 1. If the sequence numbers are reset to zero, some applications can behave 
unexpectedly because {{ContainerId#compareTo}} no longer works correctly. If the 
apps start from a clean state, we can avoid that situation.

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2144) Add logs when preemption occurs

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034725#comment-14034725
 ] 

Hadoop QA commented on YARN-2144:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650937/YARN-2144.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4018//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4018//console

This message is automatically generated.

 Add logs when preemption occurs
 ---

 Key: YARN-2144
 URL: https://issues.apache.org/jira/browse/YARN-2144
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.5.0
Reporter: Tassapol Athiapinya
Assignee: Wangda Tan
 Attachments: AM-page-preemption-info.png, YARN-2144.patch, 
 YARN-2144.patch, YARN-2144.patch, YARN-2144.patch


 There should be easy-to-read logs when preemption does occur. 
 1. For debugging purposes, the RM should log this.
 2. For administrative purposes, the RM web UI should have a page that shows 
 recent preemption events.
 RM logs should have the following properties:
 * Logs are retrievable while an application is still running, and are flushed often.
 * Can distinguish between AM container preemption and task container 
 preemption, with the container ID shown.
 * Should be INFO-level logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034731#comment-14034731
 ] 

Bikas Saha commented on YARN-2052:
--

Why would ContainerId#compareTo fail? Existing containerIds should remain 
unchanged after an RM restart. Only new container ids should have a different 
epoch number.

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034732#comment-14034732
 ] 

Bikas Saha commented on YARN-2052:
--

Ah. I did not see the rest of the comment. Yes, integer overflow is a problem. 
We should make it a long in the same release as the epoch number addition so 
that we don't have to worry about that.

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit

2014-06-17 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated YARN-2083:
--

Attachment: YARN-2083-3.patch

Small change for YARN-1474 (Make schedulers services). 


 In fair scheduler, Queue should not be assigned more containers when its 
 usedResource has reached the maxResource limit
 ---

 Key: YARN-2083
 URL: https://issues.apache.org/jira/browse/YARN-2083
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Yi Tian
  Labels: assignContainer, fair, scheduler
 Fix For: 2.4.1

 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, 
 YARN-2083.patch


 In the fair scheduler, FSParentQueue and FSLeafQueue do an 
 assignContainerPreCheck to guarantee the queue is not over its limit.
 But the fitsIn function in Resource.java does not return false when the 
 usedResource equals the maxResource.
 I think we should create a new function, fitsInWithoutEqual, and use it instead 
 of fitsIn in this case.
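
Roughly, the proposed strict check could look like the sketch below 
(illustrative only, not the attached patch; it assumes the Hadoop 2.x Resource 
accessors getMemory() and getVirtualCores()). Unlike fitsIn, it returns false as 
soon as either dimension has reached the maximum, so no further containers are 
assigned once the queue is at its limit.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

// Sketch of the proposed fitsInWithoutEqual: strict comparison, so a queue that
// has already reached its maxResource is not assigned any more containers.
public class FitsInWithoutEqualSketch {
  public static boolean fitsInWithoutEqual(Resource smaller, Resource bigger) {
    return smaller.getMemory() < bigger.getMemory()
        && smaller.getVirtualCores() < bigger.getVirtualCores();
  }
}
{code}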



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-06-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034746#comment-14034746
 ] 

Tsuyoshi OZAWA commented on YARN-2052:
--

{quote}
We should make it a long in the same release as the epoch number addition so 
that we don't have to worry about that.
{quote}

+1 to doing this in the same release. We plan to do the improvement in another 
JIRA. That's OK, but I think it's important that we decide the behavior when the 
overflow happens. We have two options: just aborting the RM for now, or starting 
apps from a clean state after the restart. We're planning to make the id a long 
just after this JIRA, so we can take the aborting approach to prevent unexpected 
behavior, for simplicity. [~bikassaha], [~jianhe], what do you think about this?

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch


 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit

2014-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034777#comment-14034777
 ] 

Hadoop QA commented on YARN-2083:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12650950/YARN-2083-3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4019//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4019//console

This message is automatically generated.

 In fair scheduler, Queue should not be assigned more containers when its 
 usedResource has reached the maxResource limit
 ---

 Key: YARN-2083
 URL: https://issues.apache.org/jira/browse/YARN-2083
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Yi Tian
  Labels: assignContainer, fair, scheduler
 Fix For: 2.4.1

 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, 
 YARN-2083.patch


 In the fair scheduler, FSParentQueue and FSLeafQueue do an 
 assignContainerPreCheck to guarantee the queue is not over its limit.
 But the fitsIn function in Resource.java does not return false when the 
 usedResource equals the maxResource.
 I think we should create a new function, fitsInWithoutEqual, and use it instead 
 of fitsIn in this case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit

2014-06-17 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated YARN-2083:
--

Fix Version/s: (was: 2.4.1)

 In fair scheduler, Queue should not be assigned more containers when its 
 usedResource has reached the maxResource limit
 ---

 Key: YARN-2083
 URL: https://issues.apache.org/jira/browse/YARN-2083
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Yi Tian
  Labels: assignContainer, fair, scheduler
 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, 
 YARN-2083.patch


 In the fair scheduler, FSParentQueue and FSLeafQueue do an 
 assignContainerPreCheck to guarantee the queue is not over its limit.
 But the fitsIn function in Resource.java does not return false when the 
 usedResource equals the maxResource.
 I think we should create a new function, fitsInWithoutEqual, and use it instead 
 of fitsIn in this case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit

2014-06-17 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034812#comment-14034812
 ] 

Yi Tian commented on YARN-2083:
---

[~ywskycn], thanks for your advice. YARN-2083-3.patch works fine on trunk, and 
YARN-2083-2.patch works fine on branch-2.4.1.
Is it possible to apply this patch to the YARN project?

 In fair scheduler, Queue should not be assigned more containers when its 
 usedResource has reached the maxResource limit
 ---

 Key: YARN-2083
 URL: https://issues.apache.org/jira/browse/YARN-2083
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Yi Tian
  Labels: assignContainer, fair, scheduler
 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, 
 YARN-2083.patch


 In the fair scheduler, FSParentQueue and FSLeafQueue do an 
 assignContainerPreCheck to guarantee the queue is not over its limit.
 But the fitsIn function in Resource.java does not return false when the 
 usedResource equals the maxResource.
 I think we should create a new function, fitsInWithoutEqual, and use it instead 
 of fitsIn in this case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)