[jira] [Updated] (YARN-2255) YARN Audit logging not added to log4j.properties

2014-07-07 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2255:
---

Description: The log4j.properties file that is part of the Hadoop package 
doesn't have YARN audit logging tied to it. This leads to audit logs being 
generated in the normal log files. Audit logs should be generated in a 
separate log file.  (was: The log4j.properties file that is part of the Hadoop 
package doesn't have YARN audit logging tied to it. This leads to audit logs 
being generated in the normal logs. Audit logs should be generated separately.)

 YARN Audit logging not added to log4j.properties
 

 Key: YARN-2255
 URL: https://issues.apache.org/jira/browse/YARN-2255
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Varun Saxena

 The log4j.properties file that is part of the Hadoop package doesn't have 
 YARN audit logging tied to it. This leads to audit logs being generated in 
 the normal log files. Audit logs should be generated in a separate log file.
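 A minimal sketch of the kind of wiring the description asks for. The 
 RMAuditLogger/NMAuditLogger logger names are the real audit logger classes; 
 the appender name, file name, and property key below are illustrative, not 
 the committed configuration.
 {code}
 # Route ResourceManager audit events to their own rolling file (sketch only).
 rm.audit.logger=INFO,RMAUDIT
 log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger=${rm.audit.logger}
 log4j.additivity.org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger=false
 log4j.appender.RMAUDIT=org.apache.log4j.DailyRollingFileAppender
 log4j.appender.RMAUDIT.File=${hadoop.log.dir}/rm-audit.log
 log4j.appender.RMAUDIT.layout=org.apache.log4j.PatternLayout
 log4j.appender.RMAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
 # An analogous NMAUDIT appender would cover the NodeManager side.
 {code}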



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2255) YARN Audit logging not added to log4j.properties

2014-07-07 Thread Varun Saxena (JIRA)
Varun Saxena created YARN-2255:
--

 Summary: YARN Audit logging not added to log4j.properties
 Key: YARN-2255
 URL: https://issues.apache.org/jira/browse/YARN-2255
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Varun Saxena


The log4j.properties file that is part of the Hadoop package doesn't have YARN 
audit logging tied to it. This leads to audit logs being generated in the 
normal logs. Audit logs should be generated separately.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2255) YARN Audit logging not added to log4j.properties

2014-07-07 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2255:
---

Description: The log4j.properties file that is part of the Hadoop package 
doesn't have YARN audit logging tied to it. This leads to audit logs being 
generated in normal log files. Audit logs should be generated in a separate 
log file.  (was: The log4j.properties file that is part of the Hadoop package 
doesn't have YARN audit logging tied to it. This leads to audit logs being 
generated in the normal log files. Audit logs should be generated in a 
separate log file.)

 YARN Audit logging not added to log4j.properties
 

 Key: YARN-2255
 URL: https://issues.apache.org/jira/browse/YARN-2255
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Varun Saxena

 The log4j.properties file that is part of the Hadoop package doesn't have 
 YARN audit logging tied to it. This leads to audit logs being generated in 
 normal log files. Audit logs should be generated in a separate log file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2256) Too many nodemanager audit logs are generated

2014-07-07 Thread Varun Saxena (JIRA)
Varun Saxena created YARN-2256:
--

 Summary: Too many nodemanager audit logs are generated
 Key: YARN-2256
 URL: https://issues.apache.org/jira/browse/YARN-2256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Varun Saxena


The following audit logs are generated too many times (due to the possibility 
of a large number of containers):
1. In the NM - audit logs corresponding to starting, stopping, and finishing a 
container
2. In the RM - audit logs corresponding to the AM allocating a container and 
the AM releasing a container

We can have different log levels even for audit logs and set these 
container-related logs to DEBUG.
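A minimal illustration of the idea rather than an actual fix: if the 
per-container audit events were emitted at DEBUG (today NMAuditLogger and 
RMAuditLogger log successes at INFO), the audit logger level would let 
operators opt in to them. The logger and appender names below are assumptions.
{code}
# Keep NM audit logging at INFO so per-container start/stop/finish events
# (assumed here to be logged at DEBUG) stay out of the audit log by default.
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=INFO,NMAUDIT
# Turn them on when troubleshooting:
# log4j.logger.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=DEBUG,NMAUDIT
{code}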



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2256) Too many nodemanager audit logs are generated

2014-07-07 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2256:
---

Description: 
The following audit logs are generated too many times (due to the possibility 
of a large number of containers):
1. In the NM - audit logs corresponding to starting, stopping, and finishing a 
container
2. In the RM - audit logs corresponding to the AM allocating a container and 
the AM releasing a container

We can have different log levels even for NM and RM audit logs and set these 
container-related logs to DEBUG.

  was:
The following audit logs are generated too many times (due to the possibility 
of a large number of containers):
1. In the NM - audit logs corresponding to starting, stopping, and finishing a 
container
2. In the RM - audit logs corresponding to the AM allocating a container and 
the AM releasing a container

We can have different log levels even for audit logs and set these 
container-related logs to DEBUG.


 Too many nodemanager audit logs are generated
 -

 Key: YARN-2256
 URL: https://issues.apache.org/jira/browse/YARN-2256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Varun Saxena

 The following audit logs are generated too many times (due to the possibility 
 of a large number of containers):
 1. In the NM - audit logs corresponding to starting, stopping, and finishing 
 a container
 2. In the RM - audit logs corresponding to the AM allocating a container and 
 the AM releasing a container
 We can have different log levels even for NM and RM audit logs and set these 
 container-related logs to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2256) Too many nodemanager audit logs are generated

2014-07-07 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2256:
---

Description: 
The following audit logs are generated too many times (due to the possibility 
of a large number of containers):
1. In the NM - audit logs corresponding to starting, stopping, and finishing a 
container
2. In the RM - audit logs corresponding to the AM allocating a container and 
the AM releasing a container

We can have different log levels even for NM and RM audit logs and move these 
successful container-related logs to DEBUG.

  was:
The following audit logs are generated too many times (due to the possibility 
of a large number of containers):
1. In the NM - audit logs corresponding to starting, stopping, and finishing a 
container
2. In the RM - audit logs corresponding to the AM allocating a container and 
the AM releasing a container

We can have different log levels even for NM and RM audit logs and set these 
container-related logs to DEBUG.


 Too many nodemanager audit logs are generated
 -

 Key: YARN-2256
 URL: https://issues.apache.org/jira/browse/YARN-2256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Varun Saxena

 The following audit logs are generated too many times (due to the possibility 
 of a large number of containers):
 1. In the NM - audit logs corresponding to starting, stopping, and finishing 
 a container
 2. In the RM - audit logs corresponding to the AM allocating a container and 
 the AM releasing a container
 We can have different log levels even for NM and RM audit logs and move these 
 successful container-related logs to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust001.patch)

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
 trust002.patch, trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***
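 A rough sketch of the shape of such a service (class and method names here 
 are hypothetical and not the attached patch): a periodic checker, analogous 
 to the node health check service, whose result the NM would report to the RM 
 on the heartbeat.
 {code}
 import java.util.Timer;
 import java.util.TimerTask;
 import org.apache.hadoop.service.AbstractService;

 // Hypothetical sketch only. OatClient stands in for whatever client talks to
 // the OAT server; it is not a real Hadoop class.
 public class NodeTrustCheckerService extends AbstractService {
   private final OatClient oatClient;
   private volatile boolean trusted = true;   // last known TRUST status
   private Timer timer;

   public NodeTrustCheckerService(OatClient oatClient) {
     super(NodeTrustCheckerService.class.getName());
     this.oatClient = oatClient;
   }

   @Override
   protected void serviceStart() throws Exception {
     timer = new Timer("NodeTrustChecker", true);
     timer.scheduleAtFixedRate(new TimerTask() {
       @Override
       public void run() {
         trusted = oatClient.isNodeTrusted();  // query the OAT server
       }
     }, 0, 60 * 1000L);
     super.serviceStart();
   }

   /** Read by the node status updater and sent with the heartbeat. */
   public boolean isTrusted() {
     return trusted;
   }

   @Override
   protected void serviceStop() throws Exception {
     if (timer != null) {
       timer.cancel();
     }
     super.serviceStop();
   }
 }
 {code}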



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust002.patch)

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: trust.patch

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust.patch)

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Ratandeep Ratti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053413#comment-14053413
 ] 

Ratandeep Ratti commented on YARN-2252:
---

Wei, calling fairscheduler.stop() will not stop the threads. It seems that 
the continuousScheduling thread and the update thread are not handling 
interrupts properly. Though we call [schedulingThread | 
updateThread].interrupt(), we also need to keep checking the interrupt flag in 
the while loops of these threads (see the sketch below).
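A minimal illustration of that pattern (not the FairScheduler code itself; the 
helper and constant names are placeholders):
{code}
// The worker loop checks the interrupt flag and restores it after a swallowed
// InterruptedException, so thread.interrupt() from stop() really ends the loop.
Thread updateThread = new Thread(new Runnable() {
  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        update();                           // placeholder: one update/scheduling pass
        Thread.sleep(UPDATE_INTERVAL_MS);   // placeholder interval constant
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // re-set the flag; loop condition exits
      }
    }
  }
}, "FairSchedulerUpdateThread");

// in stop():
// updateThread.interrupt();
// updateThread.join();
{code}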

 Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
 

 Key: YARN-2252
 URL: https://issues.apache.org/jira/browse/YARN-2252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: trunk-win
Reporter: Ratandeep Ratti
Assignee: Wei Yan
  Labels: hadoop2, scheduler, yarn

 This test case is failing sporadically on my machine. I think I have a 
 plausible explanation for this.
 It seems that when the scheduler is asked for resources, the resource 
 requests being constructed have no preference for the hosts (nodes).
 The two mock hosts constructed both have 8192 MB of memory.
 The containers (resources) being requested each require 1024 MB of memory, 
 hence a single node can satisfy both resource requests for the application.
 At the end of the test case it is asserted that the containers (resource 
 requests) run on different nodes, but since we haven't specified any node 
 preferences when requesting the resources, the scheduler (at times) runs both 
 containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Patrick Liu (JIRA)
Patrick Liu created YARN-2257:
-

 Summary: Add user to queue mapping in Fair-Scheduler
 Key: YARN-2257
 URL: https://issues.apache.org/jira/browse/YARN-2257
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Patrick Liu


Currently, the fair scheduler supports two modes: a default queue or an 
individual queue for each user.
Apparently, the default queue is not a good option, because resources cannot 
be managed per user or group.
However, an individual queue for each user is not good enough either, 
especially when connecting YARN with Hive. There will be an increasing number 
of Hive users in a corporate environment, and if we create a queue per user, 
resource management becomes hard to maintain.

I think the problem can be solved like this (see the sketch below):
1. Define a user-to-queue mapping in fair-scheduler.xml. Inside each queue, use 
aclSubmitApps to control which users may submit.
2. Each time a user submits an app to YARN, if the user is mapped to a queue, 
the app will be scheduled to that queue; otherwise, the app will be submitted 
to the default queue.
3. If the user cannot pass the aclSubmitApps limits, the app will not be 
accepted.
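A hypothetical illustration of what such a mapping could look like in the 
allocation file. aclSubmitApps and queuePlacementPolicy are existing 
fair-scheduler constructs; the commented-out user-mapping rule is the new 
piece this proposal asks for and does not exist today.
{code}
<?xml version="1.0"?>
<allocations>
  <queue name="analytics">
    <!-- existing construct: restrict who may submit to this queue -->
    <aclSubmitApps>hiveuser1,hiveuser2 analysts</aclSubmitApps>
  </queue>

  <queuePlacementPolicy>
    <!-- proposed, not yet implemented: map specific users to a named queue -->
    <!-- <rule name="userMapping" user="hiveuser1" queue="analytics"/> -->
    <rule name="specified" create="false"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
{code}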





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053419#comment-14053419
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654280/trust.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4207//console

This message is automatically generated.

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: t.patch

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust.patch)

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053457#comment-14053457
 ] 

bc Wong commented on YARN-796:
--

[~yufeldman] [~sdaingade], I just read your proposal 
(LabelBasedScheduling.pdf). A few comments:

1. *I would let each node report its own labels.* The current proposal 
specifies the node-label mapping in a centralized file. This seems 
operationally unfriendly, as the file is hard to maintain.
* You need to get the DNS name right, which could be hard for a multi-homed 
setup.
* The proposal uses regexes on FQDNs, such as {{perfnode.*}}. This may work if 
the hostnames are set up by IT like that. But in reality, I've seen lots of 
sites where the FQDN is like {{stmp09wk0013.foobar.com}}, where stmp refers 
to the data center, wk0013 refers to worker 13, and so on. Now imagine a 
centralized node-label mapping file with 2000 nodes named like that. It'd be 
a nightmare.

Instead, each node can supply its own labels, via 
{{yarn.nodemanager.node.labels}} (which specifies labels directly) or 
{{yarn.nodemanager.node.labelFile}} (which points to a file that has a single 
line containing all the labels). It's easy to generate the label file for each 
node: the admin can have puppet push it out, populate it when the VM is built, 
or compute it in a local script by inspecting /proc (oh, I have 192GB, so add 
the label largeMem). There is little room for mistakes. A small sketch of such 
a script follows at the end of this comment.

The NM can still periodically refresh its own labels and update the RM via the 
heartbeat mechanism. The RM should also expose a node label report, i.e. the 
real-time information about all nodes and their labels.

2. *Labels are per-container, not per-app. Right?* The doc keeps mentioning 
application label, ApplicationLabelExpression, etc. Should those be container 
labels instead? I just want to confirm that each container request can carry 
its own label expression. Example use case: only the mappers need GPUs, not 
the reducers.

3. *Can we fail container requests with no satisfying nodes?* In 
Considerations, #5, you wrote that the app would be in a waiting state. It 
seems that a fail-fast behaviour would be better: if no node can satisfy the 
label expression, then it's better to tell the client no. Very likely somebody 
made a typo somewhere.
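To make the local-script idea concrete, here is a small sketch of my own (the 
label names and the 128GB threshold are arbitrary examples) that inspects the 
node and writes the one-line file that the proposed 
{{yarn.nodemanager.node.labelFile}} would point to:
{code}
import java.io.FileWriter;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

// Illustration only: derive a few labels from the local machine and write them
// as a single comma-separated line.
public class WriteNodeLabels {
  public static void main(String[] args) throws IOException {
    OperatingSystemMXBean os = (OperatingSystemMXBean)
        ManagementFactory.getOperatingSystemMXBean();
    long ramGb = os.getTotalPhysicalMemorySize() / (1024L * 1024 * 1024);

    StringBuilder labels = new StringBuilder();
    labels.append(System.getProperty("os.name").toLowerCase().replace(' ', '_'));
    labels.append(',').append(System.getProperty("os.arch"));
    if (ramGb >= 128) {
      labels.append(",largeMem");
    }

    String path = args.length > 0 ? args[0] : "/etc/hadoop/node.labels";
    try (FileWriter out = new FileWriter(path)) {
      out.write(labels.toString());
      out.write('\n');
    }
  }
}
{code}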



 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053465#comment-14053465
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654288/t.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4208//console

This message is automatically generated.

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: t.patch)

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: trust .patch, trust.patch, trust.patch, trust003.patch, 
 trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: t.patch

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053504#comment-14053504
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654294/t.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4209//console

This message is automatically generated.

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
 trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's 
 TRUST status in the cluster (we can get the TRUST status from the OAT 
 server's API), so I added this feature into Hadoop's scheduling.
 Through the TRUST check service, a node can get its own TRUST status and 
 then, through the heartbeat, send the TRUST status to the ResourceManager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', the node is 
 abandoned until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Ratandeep Ratti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ratandeep Ratti updated YARN-2252:
--

Attachment: YARN-2252-1.patch

Wei, since I already have a patch for this on my 2.3 Hadoop branch (for my 
internal org), I'm also submitting it here. Please have a look and see if it 
is fine by you.

 Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
 

 Key: YARN-2252
 URL: https://issues.apache.org/jira/browse/YARN-2252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: trunk-win
Reporter: Ratandeep Ratti
Assignee: Wei Yan
  Labels: hadoop2, scheduler, yarn
 Attachments: YARN-2252-1.patch


 This test case is failing sporadically on my machine. I think I have a 
 plausible explanation for this.
 It seems that when the scheduler is asked for resources, the resource 
 requests being constructed have no preference for the hosts (nodes).
 The two mock hosts constructed both have 8192 MB of memory.
 The containers (resources) being requested each require 1024 MB of memory, 
 hence a single node can satisfy both resource requests for the application.
 At the end of the test case it is asserted that the containers (resource 
 requests) run on different nodes, but since we haven't specified any node 
 preferences when requesting the resources, the scheduler (at times) runs both 
 containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)
Nishan Shetty, Huawei created YARN-2258:
---

 Summary: Aggregation of MR job logs failing when Resourcemanager 
switches
 Key: YARN-2258
 URL: https://issues.apache.org/jira/browse/YARN-2258
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty, Huawei


1. Install RM in HA mode

2. Run a job with many tasks

3. Induce an RM switchover while the job is in progress

Observe that log aggregation fails for the job that is running when the 
ResourceManager switchover is induced.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053555#comment-14053555
 ] 

Hudson commented on YARN-1367:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #606 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/606/])
YARN-1367. Changed NM to not kill containers on NM resync if RM work-preserving 
restart is enabled. Contributed by Anubhav Dhoot (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608334)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java


 After restart NM should resync with the RM without killing containers
 -

 Key: YARN-1367
 URL: https://issues.apache.org/jira/browse/YARN-1367
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
 Fix For: 2.5.0

 Attachments: YARN-1367.001.patch, YARN-1367.002.patch, 
 YARN-1367.003.patch, YARN-1367.prototype.patch


 After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
  Upon receiving the resync response, the NM kills all containers and 
 re-registers with the RM. The NM should be changed to not kill the container 
 and instead inform the RM about all currently running containers including 
 their allocations etc. After the re-register, the NM should send all pending 
 container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)
Nishan Shetty, Huawei created YARN-2259:
---

 Summary: NM-Local dir cleanup failing when Resourcemanager switches
 Key: YARN-2259
 URL: https://issues.apache.org/jira/browse/YARN-2259
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1
 Environment: Induce an RM switchover while a job is in progress

Observe that NM-Local dir cleanup fails when the ResourceManager switches.

Reporter: Nishan Shetty, Huawei






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty, Huawei updated YARN-2259:


Environment: 



  was:
Induce an RM switchover while a job is in progress

Observe that NM-Local dir cleanup fails when the ResourceManager switches.



 NM-Local dir cleanup failing when Resourcemanager switches
 --

 Key: YARN-2259
 URL: https://issues.apache.org/jira/browse/YARN-2259
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1
 Environment: 
Reporter: Nishan Shetty, Huawei

 Induce an RM switchover while a job is in progress
 Observe that NM-Local dir cleanup fails when the ResourceManager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty, Huawei updated YARN-2259:


Description: 
Induce an RM switchover while a job is in progress

Observe that NM-Local dir cleanup fails when the ResourceManager switches.

 NM-Local dir cleanup failing when Resourcemanager switches
 --

 Key: YARN-2259
 URL: https://issues.apache.org/jira/browse/YARN-2259
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1
 Environment: Induce an RM switchover while a job is in progress
 Observe that NM-Local dir cleanup fails when the ResourceManager switches.
Reporter: Nishan Shetty, Huawei

 Induce an RM switchover while a job is in progress
 Observe that NM-Local dir cleanup fails when the ResourceManager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes

2014-07-07 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2013:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-522

 The diagnostics is always the ExitCodeException stack when the container 
 crashes
 

 Key: YARN-2013
 URL: https://issues.apache.org/jira/browse/YARN-2013
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, 
 YARN-2013.3-2.patch, YARN-2013.3.patch


 When a container crashes, an ExitCodeException will be thrown from Shell. 
 Default/LinuxContainerExecutor captures the exception and puts the exception 
 stack into the diagnostics. Therefore, the exception stack is always the same. 
 {code}
 String diagnostics = "Exception from container-launch: \n"
     + StringUtils.stringifyException(e) + "\n" + shExec.getOutput();
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
     diagnostics));
 {code}
 In addition, it seems that the exception always has an empty message, as 
 there's no message from stderr. Hence the diagnostics are not of much use for 
 users to analyze the reason for the container crash.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053674#comment-14053674
 ] 

Hudson commented on YARN-1367:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1797 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1797/])
YARN-1367. Changed NM to not kill containers on NM resync if RM work-preserving 
restart is enabled. Contributed by Anubhav Dhoot (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608334)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java


 After restart NM should resync with the RM without killing containers
 -

 Key: YARN-1367
 URL: https://issues.apache.org/jira/browse/YARN-1367
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
 Fix For: 2.5.0

 Attachments: YARN-1367.001.patch, YARN-1367.002.patch, 
 YARN-1367.003.patch, YARN-1367.prototype.patch


 After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
  Upon receiving the resync response, the NM kills all containers and 
 re-registers with the RM. The NM should be changed to not kill the container 
 and instead inform the RM about all currently running containers including 
 their allocations etc. After the re-register, the NM should send all pending 
 container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053704#comment-14053704
 ] 

Allen Wittenauer commented on YARN-796:
---


bq. Instead, each node can supply its own labels, via 
yarn.nodemanager.node.labels (which specifies labels directly) or 
yarn.nodemanager.node.labelFile (which points to a file that has a single line 
containing all the labels). It's easy to generate the label file for each node. 

Why not just generate this on the node manager, a la the health check or 
topology? Provide a hook to actually execute the script or the class and have 
the NM run it at a user-defined period, including just at boot. [... and 
before it gets asked, yes, certain classes of hardware *do* allow such dynamic 
change.]

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2248) Capacity Scheduler changes for moving apps between queues

2014-07-07 Thread Krisztian Horvath (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Horvath updated YARN-2248:


Attachment: YARN-2248-2.patch

 Capacity Scheduler changes for moving apps between queues
 -

 Key: YARN-2248
 URL: https://issues.apache.org/jira/browse/YARN-2248
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Janos Matyas
Assignee: Janos Matyas
 Attachments: YARN-2248-1.patch, YARN-2248-2.patch


 We would like to have the capability (same as the Fair Scheduler has) to move 
 applications between queues. 
 We have made a baseline implementation and tests to start with - and we would 
 like the community to review, come up with suggestions and finally have this 
 contributed. 
 The current implementation is available for 2.4.1 - so the first thing is 
 that we'd need to identify the target version as there are differences 
 between 2.4.* and 3.* interfaces.
 The story behind is available at 
 http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ 
 and the baseline implementation and test at:
 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924
 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java
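 For context, a minimal client-side sketch of the operation (the 
 {{YarnClient#moveApplicationAcrossQueues}} API already exists; the Capacity 
 Scheduler support behind it is what this JIRA adds). The application id and 
 queue name below are placeholders.
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.api.records.ApplicationId;
 import org.apache.hadoop.yarn.client.api.YarnClient;

 public class MoveAppExample {
   public static void main(String[] args) throws Exception {
     YarnClient yarnClient = YarnClient.createYarnClient();
     yarnClient.init(new Configuration());
     yarnClient.start();
     try {
       // Placeholder application id (cluster timestamp + sequence number).
       ApplicationId appId = ApplicationId.newInstance(1404711000000L, 42);
       yarnClient.moveApplicationAcrossQueues(appId, "targetQueue");
     } finally {
       yarnClient.stop();
     }
   }
 }
 {code}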



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053744#comment-14053744
 ] 

Steve Loughran commented on YARN-2242:
--

+1 for retaining/propagating as much information from the shell exception as 
possible. Also, if {{this.getTrackingUrl()}} returns null, that line of output 
should be skipped.
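A tiny sketch of that suggestion (my own illustration, not the patch; the 
method and parameter names are placeholders):
{code}
// Keep the shell exception's own message/output and skip the tracking-URL
// line entirely when it is null, instead of printing "null".
static String buildDiagnostics(String exceptionMsg, String shellOutput,
    String trackingUrl) {
  StringBuilder sb = new StringBuilder("AM Container launch failed: ")
      .append(exceptionMsg).append('\n')
      .append(shellOutput).append('\n');
  if (trackingUrl != null) {
    sb.append("Tracking URL: ").append(trackingUrl).append('\n');
  }
  return sb.toString();
}
{code}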

 Improve exception information on AM launch crashes
 --

 Key: YARN-2242
 URL: https://issues.apache.org/jira/browse/YARN-2242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu
 Fix For: 2.6.0

 Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
 YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch


 Currently, each time the AM container crashes during launch, both the console 
 and the web UI only report a ShellExitCodeException. This is not only 
 unhelpful but sometimes confusing. With the help of the log aggregator, 
 container logs are actually aggregated and can be very helpful for debugging. 
 One possible way to improve the whole process is to send a pointer to the 
 aggregated logs to the programmer when reporting exception information.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053747#comment-14053747
 ] 

Hudson commented on YARN-1367:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1824 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1824/])
YARN-1367. Changed NM to not kill containers on NM resync if RM work-preserving 
restart is enabled. Contributed by Anubhav Dhoot (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608334)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java


 After restart NM should resync with the RM without killing containers
 -

 Key: YARN-1367
 URL: https://issues.apache.org/jira/browse/YARN-1367
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
 Fix For: 2.5.0

 Attachments: YARN-1367.001.patch, YARN-1367.002.patch, 
 YARN-1367.003.patch, YARN-1367.prototype.patch


 After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
  Upon receiving the resync response, the NM kills all containers and 
 re-registers with the RM. The NM should be changed to not kill the container 
 and instead inform the RM about all currently running containers including 
 their allocations etc. After the re-register, the NM should send all pending 
 container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues

2014-07-07 Thread Krisztian Horvath (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053750#comment-14053750
 ] 

Krisztian Horvath commented on YARN-2248:
-

Can anyone take a look at the patch? I have some concerns regarding the live 
containers.

Movement steps:

1. Check if the target queue has enough capacity, plus some more validation; 
throw an exception otherwise (same as with the FairScheduler)
2. Remove the app attempt from the current queue
3. Release resources used by live containers on this queue
4. Remove the application upwards to the root (--numApplications)
5. Update QueueMetrics
6. Set the new queue in the application
7. Allocate resources consumed by the live containers (basically the resource 
usage moved here from the original queue)
8. Submit a new app attempt
9. Add the application (++numApplications)

 Capacity Scheduler changes for moving apps between queues
 -

 Key: YARN-2248
 URL: https://issues.apache.org/jira/browse/YARN-2248
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Janos Matyas
Assignee: Janos Matyas
 Attachments: YARN-2248-1.patch, YARN-2248-2.patch


 We would like to have the capability (same as the Fair Scheduler has) to move 
 applications between queues. 
 We have made a baseline implementation and tests to start with - and we would 
 like the community to review, come up with suggestions and finally have this 
 contributed. 
 The current implementation is available for 2.4.1 - so the first thing is 
 that we'd need to identify the target version as there are differences 
 between 2.4.* and 3.* interfaces.
 The story behind is available at 
 http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ 
 and the baseline implementation and test at:
 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924
 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-07-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053779#comment-14053779
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

I talked with [~jianhe] offline. We'll change ContainerId based on the 
following design (see the sketch below):

1. Make containerId long. Add ContainerId#newInstance(ApplicationAttemptId 
appAttemptId, long containerId) as a factory method.
2. Mark {{getId}} as deprecated.
3. Remove the epoch field from {{ContainerId}}.
4. Add {{getContainerId}} to return the 64-bit id including the epoch.
5. {{ContainerId#toString}} will return the string _containerId (lower 32 bits 
of the container id)_epoch (upper 32 bits of the container id)
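A sketch of the id packing implied by points 1, 4, and 5 (the class, constants, 
and string format below are illustrative, not the final API):
{code}
// Upper bits carry the epoch (RM restart generation), lower bits the sequence.
public final class ContainerIdSketch {
  private static final int SEQUENCE_BITS = 32;               // illustrative split
  private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

  static long toLongId(int epoch, long sequence) {
    return ((long) epoch << SEQUENCE_BITS) | (sequence & SEQUENCE_MASK);
  }

  static int epochOf(long containerId) {
    return (int) (containerId >>> SEQUENCE_BITS);
  }

  static long sequenceOf(long containerId) {
    return containerId & SEQUENCE_MASK;
  }

  // Point 5, roughly: keep the familiar lower-bits id in the string and append
  // the epoch only when it is non-zero.
  static String format(long containerId) {
    int epoch = epochOf(containerId);
    String base = "container_" + sequenceOf(containerId);
    return epoch == 0 ? base : base + "_e" + epoch;
  }
}
{code}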


 ContainerId can overflow with RM restart
 

 Key: YARN-2229
 URL: https://issues.apache.org/jira/browse/YARN-2229
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, 
 YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch


 On YARN-2052, we changed the containerId format: the upper 10 bits are for 
 the epoch, the lower 22 bits are for the sequence number of ids. This is for 
 preserving the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
 {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
 {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
 after the RM restarts 1024 times.
 To avoid the problem, it's better to make containerId long. We need to define 
 the new format of the container id while preserving backward compatibility on 
 this JIRA.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2252:
--

Assignee: (was: Wei Yan)

 Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
 

 Key: YARN-2252
 URL: https://issues.apache.org/jira/browse/YARN-2252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: trunk-win
Reporter: Ratandeep Ratti
  Labels: hadoop2, scheduler, yarn
 Attachments: YARN-2252-1.patch


 This test case is failing sporadically on my machine. I think I have a 
 plausible explanation for this.
 It seems that when the scheduler is asked for resources, the resource 
 requests being constructed have no preference for the hosts (nodes).
 The two mock hosts constructed both have 8192 MB of memory.
 The containers (resources) being requested each require 1024 MB of memory, 
 hence a single node can satisfy both resource requests for the application.
 At the end of the test case it is asserted that the containers (resource 
 requests) run on different nodes, but since we haven't specified any node 
 preferences when requesting the resources, the scheduler (at times) runs both 
 containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053790#comment-14053790
 ] 

Wei Yan commented on YARN-2252:
---

Thanks, [~rdsr]. I'll take a look later.

 Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
 

 Key: YARN-2252
 URL: https://issues.apache.org/jira/browse/YARN-2252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: trunk-win
Reporter: Ratandeep Ratti
  Labels: hadoop2, scheduler, yarn
 Attachments: YARN-2252-1.patch


 This test case is failing sporadically on my machine. I think I have a 
 plausible explanation for this.
 It seems that when the scheduler is asked for resources, the resource 
 requests being constructed have no preference for the hosts (nodes).
 The two mock hosts constructed both have 8192 MB of memory.
 The containers (resources) being requested each require 1024 MB of memory, 
 hence a single node can satisfy both resource requests for the application.
 At the end of the test case it is asserted that the containers (resource 
 requests) run on different nodes, but since we haven't specified any node 
 preferences when requesting the resources, the scheduler (at times) runs both 
 containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-415:


Attachment: YARN-415.201407071542.txt

This new patch addresses findbugs issues.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.
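
As a rough illustration of the proposed metric, memory-seconds could be accumulated per application as below. This is a hedged sketch only; the ContainerUsage class and its fields are hypothetical placeholders, not YARN API:

{code}
import java.util.List;

public class MemorySecondsSketch {
  /** Hypothetical view of a finished (or still running) container. */
  public static class ContainerUsage {
    final long reservedMb;    // memory reserved for the container, in MB
    final long startTimeMs;
    final long endTimeMs;     // for running containers, "now"
    ContainerUsage(long reservedMb, long startTimeMs, long endTimeMs) {
      this.reservedMb = reservedMb;
      this.startTimeMs = startTimeMs;
      this.endTimeMs = endTimeMs;
    }
  }

  /** memory-seconds = sum(reserved MB * container lifetime in seconds). */
  public static long memorySeconds(List<ContainerUsage> containers) {
    long total = 0;
    for (ContainerUsage c : containers) {
      total += c.reservedMb * ((c.endTimeMs - c.startTimeMs) / 1000);
    }
    return total;
  }
}
{code}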



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053846#comment-14053846
 ] 

Sandy Ryza commented on YARN-2257:
--

Definitely needed.  This should be implemented as a QueuePlacementRule.

 Add user to queue mapping in Fair-Scheduler
 ---

 Key: YARN-2257
 URL: https://issues.apache.org/jira/browse/YARN-2257
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Patrick Liu
  Labels: features

 Currently, the fair-scheduler supports two modes: a default queue or an individual 
 queue for each user.
 Apparently, the default queue is not a good option, because resources 
 cannot be managed per user or group.
 However, an individual queue for each user is not good enough either, especially when 
 connecting yarn with hive. There will be an increasing number of hive users in a corporate 
 environment, and if we create a queue for every user, resource management becomes 
 hard to maintain.
 I think the problem can be solved like this:
 1. Define a user-queue mapping in Fair-Scheduler.xml. Inside each queue, use 
 aclSubmitApps to control which users may submit.
 2. Each time a user submits an app to yarn, if the user is mapped to a queue, 
 the app will be scheduled to that queue; otherwise, the app will be submitted 
 to the default queue.
 3. If the user does not pass the aclSubmitApps check, the app will not be accepted.
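
A minimal sketch of the proposed placement logic, in plain Java rather than actual scheduler code; the map and ACL set are stand-ins for whatever fair-scheduler.xml would configure:

{code}
import java.util.Map;
import java.util.Set;

public class UserQueueMappingSketch {
  /** Returns the queue the app should go to, or null if it must be rejected. */
  public static String placeApp(String user,
                                Map<String, String> userToQueue,
                                Map<String, Set<String>> queueSubmitAcls) {
    // 1. If the user is mapped to a queue, use that queue; otherwise use "default".
    String queue = userToQueue.containsKey(user) ? userToQueue.get(user) : "default";
    // 2. Enforce the queue's aclSubmitApps: reject if the user is not allowed.
    Set<String> acl = queueSubmitAcls.get(queue);
    if (acl != null && !acl.contains(user)) {
      return null; // 3. The app is not accepted.
    }
    return queue;
  }
}
{code}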



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2014-07-07 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053850#comment-14053850
 ] 

Sangjin Lee commented on YARN-2238:
---

Ping? I'd like to find out more about this. Is this an expected behavior?

As in the previous comment, this issue boils down to filtering by search terms 
when a search had been done previously. However, in the case of a search by key, 
value, or source chain, the search term is not displayed in the UI, which makes 
this really strange. I'd appreciate comments on this.

 filtering on UI sticks even if I move away from the page
 

 Key: YARN-2238
 URL: https://issues.apache.org/jira/browse/YARN-2238
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.4.0
Reporter: Sangjin Lee
 Attachments: filtered.png


 The main data table in many web pages (RM, AM, etc.) seems to show an 
 unexpected filtering behavior.
 If I filter the table by typing something in the key or value field (or I 
 suspect any search field), the data table gets filtered. The example I used 
 is the job configuration page for a MR job. That is expected.
 However, when I move away from that page and visit any other web page of the 
 same type (e.g. a job configuration page), the page is rendered with the 
 filtering! That is unexpected.
 What's even stranger is that it does not render the filtering term. As a 
 result, I have a page that's mysteriously filtered but doesn't tell me what 
 it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread Yuliya Feldman (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053885#comment-14053885
 ] 

Yuliya Feldman commented on YARN-796:
-

[~bcwalrus]
Thank you for your comments

Regarding:
The NM can still periodically refresh its own labels, and update the RM via 
the heartbeat mechanism. The RM should also expose a node label report, which 
is the real-time information of all nodes and their labels.
Yes - you would have a yarn command, showlabels, that would show all the 
labels in the cluster:
yarn rmadmin -showlabels

Regarding:
2. Labels are per-container, not per-app. Right? The doc keeps mentioning 
application label, ApplicationLabelExpression, etc. Should those be 
container label instead? I just want to confirm that each container request 
can carry its own label expression. Example use case: Only the mappers need 
GPU, not the reducers.

The proposal here is to have labels per application, not per container, though it is 
not that hard to specify a label per container (rather, per Request). 
There are pros and cons for both (per container and per app):
pros for per-app - the only place to setLabel is ApplicationSubmissionContext
cons for per-app - as you said, you may want one configuration for Mappers and another 
for Reducers
cons for container-level labels - every application that wants to take 
advantage of the labels will have to code it in their AppMaster while creating 
ResourceRequests

Regarding: 
--- The proposal uses regexes on FQDN, such as perfnode.*. 

The file with labels does not need to contain regexes for FQDNs, since it will be 
based solely on which hostname is used in the isBlackListed() method.
But I am surely open to suggestions on how to get labels from nodes, as long as it is not 
a high burden on the Cluster Admin, who needs to specify labels per node on the 
node.

Regarding:
--- Can we fail container requests with no satisfying nodes?

I think it would be the same behavior as for any other Request that cannot be 
satisfied because queues were set up incorrectly, or there is no free resource 
available at the moment. How would you differentiate between those cases?






 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053889#comment-14053889
 ] 

Hadoop QA commented on YARN-415:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12654334/YARN-415.201407071542.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4210//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4210//console

This message is automatically generated.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue

2014-07-07 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053959#comment-14053959
 ] 

Mayank Bansal commented on YARN-2069:
-

hi [~wangda],

Thanks for the review. I updated the patch, please take a look. Let me answer 
your questions.
bq. In ProportionalCapacityPreemptionPolicy,
bq. 1) balanceUserLimitsinQueueForPreemption()
bq. 1.1, I think there's a bug when multiple applications under a same user 
(say Jim) in a queue, and usage of Jim is over user-limit.
Any of Jim's applications will be tried to be preempted 
(total-resource-used-by-Jim - user-limit).
We should remember resourcesToClaimBackFromUser and initialRes for each user 
(not reset them when handling each application)
And it's better to add test to make sure this behavior is correct.

We need to maintain the reverse order of application submission, which can only 
be done by iterating through the applications, as we want to preempt the applications 
that were submitted last. 

bq. 1.2, Some debug logging should be removed like
Done

bq. 1.3, This check should be unnecessary
Done

bq. 2) preemptFrom
bq. I noticed this method will be called multiple times for a same application 
within a editSchedule() call.
bq. The reservedContainers will be calculated multiple times.
bq. An alternative way to do this is to cache
This method will be executed for all the applications only once, as we will 
be removing all reservations, and for the apps whose reservation has already been 
removed it would be a no-op.

bq.In LeafQueue,
bq. 1) I think it's better to remember user limit, no need to compute it every 
time, add a method like getUserLimit() to leafQueue should be better.
That value is not static and changes every time based on cluster utilization, 
and that's why I am calculating it every time.

bq. 1) Should we preempt containers equally from users when there're multiple 
users beyond user-limit in a queue?
That's not good; it should be based on who submitted last and is over the user limit. 
It may not be fair, but we want to preempt the last jobs first.

bq. 2) Should we preempt containers equally from applications in a same user? 
(Heap-like data structure maybe helpful to solve 1/2)
No, as mentioned above.

bq. 3) Should user-limit preemption be configurable?
I think if we just make preemption configurable, that's enough. Thoughts?

Thanks,
Mayank

 Add cross-user preemption within CapacityScheduler's leaf-queue
 ---

 Key: YARN-2069
 URL: https://issues.apache.org/jira/browse/YARN-2069
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Mayank Bansal
 Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
 YARN-2069-trunk-3.patch


 Preemption today only works across queues and moves around resources across 
 queues per demand and usage. We should also have user-level preemption within 
 a queue, to balance capacity across users in a predictable manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053970#comment-14053970
 ] 

Eric Payne commented on YARN-415:
-

@wheeleast : Thank you very much for taking the time to review this patch.

Can you please make sure you reviewed the latest patch? There were some old 
patches that contained changes to AppSchedulingInfo, but not the recent ones.

Also, please keep in mind that YARN-415 needs to calculate resource usage for 
running applications as well as completed ones. To do this, it needs access to 
the live containers, the list of which is kept in the SchedulerApplicationAttempt 
object.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053976#comment-14053976
 ] 

Eric Payne commented on YARN-415:
-

[~leftnoteasy], Sorry, I don't think the previous post worked. Trying it again:

Thank you very much for taking the time to review this patch.

Can you please make sure you reviewed the latest patch? There were some old 
patches that contained changes to AppSchedulingInfo, but not the recent ones.

Also, please keep in mind that YARN-415 needs to calculate resource usage for 
running applications as well as completed ones. To do this, it needs access to 
the live containers, the list of which is kept in the SchedulerApplicationAttempt 
object.


 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054117#comment-14054117
 ] 

Vinod Kumar Vavilapalli commented on YARN-2257:
---

Some part of this is a core YARN feature and shouldn't be built for each 
scheduler - the part about maintaining user-queue mappings and then accepting 
submissions from users automatically to those queues. The configuration can be 
per scheduler.

I propose we fix it in general. Will edit the subject if there is no 
disagreement.

 Add user to queue mapping in Fair-Scheduler
 ---

 Key: YARN-2257
 URL: https://issues.apache.org/jira/browse/YARN-2257
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Patrick Liu
  Labels: features

 Currently, the fair-scheduler supports two modes: a default queue or an individual 
 queue for each user.
 Apparently, the default queue is not a good option, because resources 
 cannot be managed per user or group.
 However, an individual queue for each user is not good enough either, especially when 
 connecting yarn with hive. There will be an increasing number of hive users in a corporate 
 environment, and if we create a queue for every user, resource management becomes 
 hard to maintain.
 I think the problem can be solved like this:
 1. Define a user-queue mapping in Fair-Scheduler.xml. Inside each queue, use 
 aclSubmitApps to control which users may submit.
 2. Each time a user submits an app to yarn, if the user is mapped to a queue, 
 the app will be scheduled to that queue; otherwise, the app will be submitted 
 to the default queue.
 3. If the user does not pass the aclSubmitApps check, the app will not be accepted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054157#comment-14054157
 ] 

Karthik Kambatla commented on YARN-2257:


Agree with both Sandy and Vinod. It looks like there is merit in making the 
QueuePlacementRule general enough to support all schedulers? 

 Add user to queue mapping in Fair-Scheduler
 ---

 Key: YARN-2257
 URL: https://issues.apache.org/jira/browse/YARN-2257
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Patrick Liu
  Labels: features

 Currently, the fair-scheduler supports two modes: a default queue or an individual 
 queue for each user.
 Apparently, the default queue is not a good option, because resources 
 cannot be managed per user or group.
 However, an individual queue for each user is not good enough either, especially when 
 connecting yarn with hive. There will be an increasing number of hive users in a corporate 
 environment, and if we create a queue for every user, resource management becomes 
 hard to maintain.
 I think the problem can be solved like this:
 1. Define a user-queue mapping in Fair-Scheduler.xml. Inside each queue, use 
 aclSubmitApps to control which users may submit.
 2. Each time a user submits an app to yarn, if the user is mapped to a queue, 
 the app will be scheduled to that queue; otherwise, the app will be submitted 
 to the default queue.
 3. If the user does not pass the aclSubmitApps check, the app will not be accepted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2258:
---

Issue Type: Sub-task  (was: Bug)
Parent: YARN-149

 Aggregation of MR job logs failing when Resourcemanager switches
 

 Key: YARN-2258
 URL: https://issues.apache.org/jira/browse/YARN-2258
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty, Huawei

 1. Install RM in HA mode
 2. Run a job with many tasks
 3. Induce an RM switchover while the job is in progress
 Observe that log aggregation fails for the job that is running when the 
 Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2259:
---

Issue Type: Sub-task  (was: Bug)
Parent: YARN-149

 NM-Local dir cleanup failing when Resourcemanager switches
 --

 Key: YARN-2259
 URL: https://issues.apache.org/jira/browse/YARN-2259
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
 Environment: 
Reporter: Nishan Shetty, Huawei

 Induce an RM switchover while a job is in progress.
 Observe that NM local-dir cleanup fails when the Resourcemanager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-07 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054168#comment-14054168
 ] 

Robert Joseph Evans commented on YARN-611:
--

Why are you using java serialization for the retry policy?  There are too many 
problems with java serialization, especially if we are persisting it into a DB, 
like the state store.  Please switch to using something like protocol buffers 
that will allow for forward/backward compatible modifications going forward.

In the javadocs for RMApp.setRetryCount it would be good to explain what the retry 
count actually is and does.

In the constructor for RMAppAttemptImpl there is special logic to call setup 
only for WindowsSlideAMRetryCountResetPolicy.  This completely loses the 
abstraction that the AMResetCountPolicy interface should be providing.  Please 
update the interface so that you don't need special case code for a single 
implementation.

In RMAppAttemptImpl you mark setMaybeLastAttemptFlag as Private; this really 
should have been done in the parent interface. In the parent interface you also 
add myBeLastAttempt(); this too should be marked as Private, and both of them 
should have comments noting that they are for testing.

 Add an AM retry count reset window to YARN RM
 -

 Key: YARN-611
 URL: https://issues.apache.org/jira/browse/YARN-611
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong
 Attachments: YARN-611.1.patch


 YARN currently has the following config:
 yarn.resourcemanager.am.max-retries
 This config defaults to 2, and defines how many times to retry a failed AM 
 before failing the whole YARN job. YARN counts an AM as failed if the node 
 that it was running on dies (the NM will timeout, which counts as a failure 
 for the AM), or if the AM dies.
 This configuration is insufficient for long running (or infinitely running) 
 YARN jobs, since the machine (or NM) that the AM is running on will 
 eventually need to be restarted (or the machine/NM will fail). In such an 
 event, the AM has not done anything wrong, but this is counted as a failure 
 by the RM. Since the retry count for the AM is never reset, eventually, at 
 some point, the number of machine/NM failures will result in the AM failure 
 count going above the configured value for 
 yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
 job as failed, and shut it down. This behavior is not ideal.
 I propose that we add a second configuration:
 yarn.resourcemanager.am.retry-count-window-ms
 This configuration would define a window of time that would define when an AM 
 is well behaved, and it's safe to reset its failure count back to zero. 
 Every time an AM fails the RmAppImpl would check the last time that the AM 
 failed. If the last failure was less than retry-count-window-ms ago, and the 
 new failure count is > max-retries, then the job should fail. If the AM has 
 never failed, the retry count is < max-retries, or if the last failure was 
 OUTSIDE the retry-count-window-ms, then the job should be restarted. 
 Additionally, if the last failure was outside the retry-count-window-ms, then 
 the failure count should be set back to 0.
 This would give developers a way to have well-behaved AMs run forever, while 
 still failing mis-behaving AMs after a short period of time.
 I think the work to be done here is to change the RmAppImpl to actually look 
 at app.attempts, and see if there have been more than max-retries failures in 
 the last retry-count-window-ms milliseconds. If there have, then the job 
 should fail, if not, then the job should go forward. Additionally, we might 
 also need to add an endTime in either RMAppAttemptImpl or 
 RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
 failure.
 Thoughts?
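
A minimal sketch of the windowed check being proposed; the method and parameter names are illustrative, not the eventual YARN configuration keys or RM internals:

{code}
import java.util.List;

public class AmRetryWindowSketch {
  /**
   * Returns true if the app should be failed: more than maxRetries attempt
   * failures fall inside the last windowMs milliseconds. Failures outside the
   * window are effectively "forgotten", which resets the count for a
   * well-behaved AM.
   */
  public static boolean shouldFailApp(List<Long> failureTimesMs,
                                      int maxRetries,
                                      long windowMs,
                                      long nowMs) {
    int failuresInWindow = 0;
    for (long t : failureTimesMs) {
      if (nowMs - t <= windowMs) {
        failuresInWindow++;
      }
    }
    return failuresInWindow > maxRetries;
  }
}
{code}

This matches the suggestion above of walking app.attempts and counting only the failures whose end time lies within the window.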



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Attachment: YARN-2001.2.patch

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch


 After failover, the RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. the RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does the RM 
 accept new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.
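
Either flavor of the threshold boils down to a simple gate the RM could consult before scheduling; a hedged sketch with illustrative names, not the actual RM code:

{code}
public class SchedulingGateSketch {
  /**
   * The RM starts honoring new container requests once either enough NMs have
   * re-registered after failover, or a grace timeout has elapsed.
   */
  public static boolean safeToSchedule(int nodesRejoined,
                                       int minNodesThreshold,
                                       long failoverTimeMs,
                                       long graceTimeoutMs,
                                       long nowMs) {
    boolean enoughNodes = nodesRejoined >= minNodesThreshold;
    boolean timedOut = (nowMs - failoverTimeMs) >= graceTimeoutMs;
    return enoughNodes || timedOut;
  }
}
{code}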



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054240#comment-14054240
 ] 

Jian He commented on YARN-2001:
---

Uploaded a new patch:
- added a unit test.
- fixed a bug in testAppReregisterOnRMWorkPreservingRestart

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch


 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain amount of nodes. i.e. RM waits 
 until a certain amount of nodes joining before accepting new container 
 requests.  Or it could simply be a timeout, only after the timeout RM accepts 
 new requests. 
 NMs joined after the threshold can be treated as new NMs and instructed to 
 kill all its containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054275#comment-14054275
 ] 

Hadoop QA commented on YARN-2001:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654403/YARN-2001.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4211//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4211//console

This message is automatically generated.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch


 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain amount of nodes. i.e. RM waits 
 until a certain amount of nodes joining before accepting new container 
 requests.  Or it could simply be a timeout, only after the timeout RM accepts 
 new requests. 
 NMs joined after the threshold can be treated as new NMs and instructed to 
 kill all its containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054283#comment-14054283
 ] 

Jian He commented on YARN-2208:
---


Some comments on the patch:
- RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS -> 
RM_AM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
- am-rm-tokens.master-key-rolling-interval-secs -> 
am-tokens.master-key-rolling-interval-secs
- RM_NMTOKEN ?
{code}
  YarnConfiguration.RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
  + " should be more than 2 X "
  + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS);
{code}
- Should we cache the AM token password instead of re-computing the password each 
time an RPC is invoked?
{code}
org.apache.hadoop.security.token.Token<AMRMTokenIdentifier> token =
rm1.getRMContext().getRMApps().get(appAttemptId.getApplicationId())
.getRMAppAttempt(appAttemptId).getAMRMToken();
try {
  UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
  ugi.addTokenIdentifier(token.decodeIdentifier());
} catch (IOException e) {
  throw new YarnRuntimeException(e);
}
{code}
- please fix the test failure also
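
On the caching question above, one hedged way to avoid re-computing the password on every RPC would be a small per-attempt cache; this is a sketch only, not the actual AMRMTokenSecretManager code, and computePassword() is a placeholder:

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AmTokenPasswordCacheSketch {
  // Keyed by the application attempt id (a String here to keep the sketch self-contained).
  private final ConcurrentMap<String, byte[]> passwordCache =
      new ConcurrentHashMap<String, byte[]>();

  /** Compute the password once per attempt and reuse it for later RPCs. */
  public byte[] getPassword(String appAttemptId) {
    byte[] cached = passwordCache.get(appAttemptId);
    if (cached == null) {
      cached = computePassword(appAttemptId);            // the expensive step
      byte[] raced = passwordCache.putIfAbsent(appAttemptId, cached);
      if (raced != null) {
        cached = raced;                                   // another RPC thread won the race
      }
    }
    return cached;
  }

  /** Drop all cached passwords, e.g. when the master key is rolled over. */
  public void invalidateAll() {
    passwordCache.clear();
  }

  /** Placeholder for the real password computation (hypothetical). */
  private byte[] computePassword(String appAttemptId) {
    return appAttemptId.getBytes();
  }
}
{code}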

 AMRMTokenManager need to have a way to roll over AMRMToken
 --

 Key: YARN-2208
 URL: https://issues.apache.org/jira/browse/YARN-2208
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2113:


Summary: Add cross-user preemption within CapacityScheduler's leaf-queue  
(was: CS queue level preemption should respect user-limits)

 Add cross-user preemption within CapacityScheduler's leaf-queue
 ---

 Key: YARN-2113
 URL: https://issues.apache.org/jira/browse/YARN-2113
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Fix For: 2.5.0


 This is different from (even if related to, and likely share code with) 
 YARN-2069.
 YARN-2069 focuses on making sure that even if queue has its guaranteed 
 capacity, it's individual users are treated in-line with their limits 
 irrespective of when they join in.
 This JIRA is about respecting user-limits while preempting containers to 
 balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2069:


Summary: CS queue level preemption should respect user-limits  (was: Add 
cross-user preemption within CapacityScheduler's leaf-queue)

 CS queue level preemption should respect user-limits
 

 Key: YARN-2069
 URL: https://issues.apache.org/jira/browse/YARN-2069
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Mayank Bansal
 Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
 YARN-2069-trunk-3.patch


 Preemption today only works across queues and moves around resources across 
 queues per demand and usage. We should also have user-level preemption within 
 a queue, to balance capacity across users in a predictable manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2113:


Description: Preemption today only works across queues and moves around 
resources across queues per demand and usage. We should also have user-level 
preemption within a queue, to balance capacity across users in a predictable 
manner.  (was: This is different from (even if related to, and likely share 
code with) YARN-2069.

YARN-2069 focuses on making sure that even if queue has its guaranteed 
capacity, it's individual users are treated in-line with their limits 
irrespective of when they join in.

This JIRA is about respecting user-limits while preempting containers to 
balance queue capacities.)

 Add cross-user preemption within CapacityScheduler's leaf-queue
 ---

 Key: YARN-2113
 URL: https://issues.apache.org/jira/browse/YARN-2113
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Fix For: 2.5.0


 Preemption today only works across queues and moves around resources across 
 queues per demand and usage. We should also have user-level preemption within 
 a queue, to balance capacity across users in a predictable manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2069:


Description: 
This is different from (even if related to, and likely share code with) 
YARN-2113.

YARN-2113 focuses on making sure that even if a queue has its guaranteed 
capacity, its individual users are treated in line with their limits 
irrespective of when they join in.

This JIRA is about respecting user-limits while preempting containers to 
balance queue capacities.

  was:Preemption today only works across queues and moves around resources 
across queues per demand and usage. We should also have user-level preemption 
within a queue, to balance capacity across users in a predictable manner.


 CS queue level preemption should respect user-limits
 

 Key: YARN-2069
 URL: https://issues.apache.org/jira/browse/YARN-2069
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Mayank Bansal
 Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
 YARN-2069-trunk-3.patch


 This is different from (even if related to, and likely share code with) 
 YARN-2113.
 YARN-2113 focuses on making sure that even if a queue has its guaranteed 
 capacity, its individual users are treated in line with their limits 
 irrespective of when they join in.
 This JIRA is about respecting user-limits while preempting containers to 
 balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Jian He (JIRA)
Jian He created YARN-2260:
-

 Summary: Add containers to launchedContainers list in RMNode on 
container recovery
 Key: YARN-2260
 URL: https://issues.apache.org/jira/browse/YARN-2260
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He


The justLaunchedContainers map in RMNode should be re-populated when container 
is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2260:
--

Attachment: YARN-2260.1.patch

Patch to re-populate the launchedContainers list in RMNode on recovery:
- changed the launchedContainers type from a map to a set, as a set is enough.
- added unit tests.
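
The recovery path is, roughly, about recording every container reported by a re-registering NM. A hedged sketch of that idea using the public NMContainerStatus record, though not the actual RMNodeImpl code:

{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.api.protocolrecords.NMContainerStatus;

public class RecoveredContainersSketch {
  /** Rebuild the node's launched-containers set from the NM's recovery report. */
  public static Set<ContainerId> rebuildLaunchedContainers(
      List<NMContainerStatus> recoveredContainers) {
    Set<ContainerId> launched = new HashSet<ContainerId>();
    for (NMContainerStatus status : recoveredContainers) {
      launched.add(status.getContainerId());
    }
    return launched;
  }
}
{code}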

 Add containers to launchedContainers list in RMNode on container recovery
 -

 Key: YARN-2260
 URL: https://issues.apache.org/jira/browse/YARN-2260
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2260.1.patch


 The justLaunchedContainers map in RMNode should be re-populated when 
 container is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054334#comment-14054334
 ] 

Wangda Tan commented on YARN-2069:
--

Hi [~mayank_bansal],
Thanks for your comments,
I think the change of title/description should be correct; this patch is 
targeted at making cross-queue preemption respect user-limits.

I think your other comments all make sense to me. Only below one,
bq. We need to maintian the reverse order of application submission which only 
can be done iterating through applications as we want to preempt applications 
which are last submitted.
IMHO, this is reasonable but conflict with this JIRA's scope, let me give you 
an example. 
Assume a queue has 10 apps, each app has 5 containers (1G for each container, 
so the queue has 50G mem used). There are two users, each with 5 apps. User-limit 
is 15G, and the queue's absolute capacity is 30G.
The first 5 apps belong to user-A, the last 5 apps belong to user-B.
In your current method, 20 containers will be preempted from user-B and nothing will 
be preempted from user-A.
After preemption, only 5 containers are left for user-B, and 25 containers are left for 
user-A. User-limit is not respected here.

Does this make sense to you?

Thanks,
Wangda
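
For comparison, a user-limit-aware policy would take from each user only what exceeds the limit. A small sketch using the numbers in the example above; this is pure arithmetic, not the ProportionalCapacityPreemptionPolicy code:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

public class UserLimitPreemptionSketch {
  public static void main(String[] args) {
    Map<String, Integer> usedGbByUser = new LinkedHashMap<String, Integer>();
    usedGbByUser.put("user-A", 25);
    usedGbByUser.put("user-B", 25);
    int userLimitGb = 15;
    int toReclaimGb = 20; // 50G used - 30G queue capacity

    for (Map.Entry<String, Integer> e : usedGbByUser.entrySet()) {
      if (toReclaimGb <= 0) {
        break;
      }
      // Take only what this user holds above the user-limit.
      int overLimit = Math.max(0, e.getValue() - userLimitGb);
      int preempt = Math.min(overLimit, toReclaimGb);
      toReclaimGb -= preempt;
      // With these numbers: 10G from user-A and 10G from user-B,
      // leaving both users at the 15G user-limit.
      System.out.println("Preempt " + preempt + "G from " + e.getKey());
    }
  }
}
{code}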

 CS queue level preemption should respect user-limits
 

 Key: YARN-2069
 URL: https://issues.apache.org/jira/browse/YARN-2069
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Mayank Bansal
 Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
 YARN-2069-trunk-3.patch


 This is different from (even if related to, and likely share code with) 
 YARN-2113.
 YARN-2113 focuses on making sure that even if queue has its guaranteed 
 capacity, it's individual users are treated in-line with their limits 
 irrespective of when they join in.
 This JIRA is about respecting user-limits while preempting containers to 
 balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054341#comment-14054341
 ] 

Jian He commented on YARN-2258:
---

Hi [~nishan], thanks for reporting this. Do you mind sharing some logs, 
specifically from where the log aggregation failure happens? 

 Aggregation of MR job logs failing when Resourcemanager switches
 

 Key: YARN-2258
 URL: https://issues.apache.org/jira/browse/YARN-2258
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty, Huawei

 1.Install RM in HA mode
 2.Run a job with more tasks
 3.Induce RM switchover while job is in progress
 Observe that log aggregation fails for the job which is running when  
 Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054365#comment-14054365
 ] 

Jian He commented on YARN-2001:
---

bq. found more issues: it is possible for the RM to receive the 
release-container-request (sent by the AM on resync) before the containers are 
actually recovered
Opened YARN-2249 to take care of this.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch


 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain amount of nodes. i.e. RM waits 
 until a certain amount of nodes joining before accepting new container 
 requests.  Or it could simply be a timeout, only after the timeout RM accepts 
 new requests. 
 NMs joined after the threshold can be treated as new NMs and instructed to 
 kill all its containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054368#comment-14054368
 ] 

Li Lu commented on YARN-2242:
-

Changes in YARN-2013 do preserve the shell exception information while enhancing 
the overall user experience on AM launch crashes. So I agree that we should merge 
these two issues and keep working on the patch for YARN-2013.  

 Improve exception information on AM launch crashes
 --

 Key: YARN-2242
 URL: https://issues.apache.org/jira/browse/YARN-2242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu
 Fix For: 2.6.0

 Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
 YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch


 Now, each time the AM container crashes during launch, both the console and the 
 webpage UI report only a ShellExitCodeException. This is not only unhelpful, 
 but sometimes confusing. With the help of the log aggregator, container logs are 
 actually aggregated and can be very helpful for debugging. One possible way 
 to improve the whole process is to send a pointer to the aggregated logs to 
 the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu resolved YARN-2242.
-

Resolution: Duplicate

Closing as a duplicate of YARN-2013. 

 Improve exception information on AM launch crashes
 --

 Key: YARN-2242
 URL: https://issues.apache.org/jira/browse/YARN-2242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu
 Fix For: 2.6.0

 Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
 YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch


 Now, each time the AM container crashes during launch, both the console and the 
 webpage UI report only a ShellExitCodeException. This is not only unhelpful, 
 but sometimes confusing. With the help of the log aggregator, container logs are 
 actually aggregated and can be very helpful for debugging. One possible way 
 to improve the whole process is to send a pointer to the aggregated logs to 
 the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054376#comment-14054376
 ] 

Hadoop QA commented on YARN-2260:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654414/YARN-2260.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
  
org.apache.hadoop.yarn.server.resourcemanager.TestRMNodeTransitions
  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4212//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4212//console

This message is automatically generated.

 Add containers to launchedContainers list in RMNode on container recovery
 -

 Key: YARN-2260
 URL: https://issues.apache.org/jira/browse/YARN-2260
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2260.1.patch


 The justLaunchedContainers map in RMNode should be re-populated when 
 container is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes

2014-07-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054382#comment-14054382
 ] 

Li Lu commented on YARN-2013:
-

Hi [~ozawa], I just closed YARN-2242 as a duplicate of this issue. Could you 
please add back the diagnostics information that I removed in my patch? I 
can do this cleanup if you don't want to. 

 The diagnostics is always the ExitCodeException stack when the container 
 crashes
 

 Key: YARN-2013
 URL: https://issues.apache.org/jira/browse/YARN-2013
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, 
 YARN-2013.3-2.patch, YARN-2013.3.patch


 When a container crashes, an ExitCodeException will be thrown from Shell. 
 Default/LinuxContainerExecutor captures the exception and puts the exception 
 stack into the diagnostics. Therefore, the exception stack is always the same. 
 {code}
 String diagnostics = "Exception from container-launch: \n"
 + StringUtils.stringifyException(e) + "\n" + shExec.getOutput();
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
 {code}
 In addition, it seems that the exception always has an empty message, as 
 there's no message from stderr. Hence the diagnostics are not of much use for 
 users to analyze the reason for the container crash.
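
A hedged sketch of the kind of diagnostics message being asked for, combining the exit code with the shell output instead of only the exception stack; the variable names are placeholders, not the actual ContainerExecutor fields:

{code}
public class ContainerDiagnosticsSketch {
  /** Build a diagnostics string that stays useful even when the exception message is empty. */
  public static String buildDiagnostics(int exitCode, String shellOutput) {
    StringBuilder sb = new StringBuilder();
    sb.append("Exception from container-launch, exit code: ").append(exitCode).append('\n');
    if (shellOutput != null && !shellOutput.isEmpty()) {
      sb.append("Shell output: ").append(shellOutput).append('\n');
    } else {
      sb.append("Container produced no output on stderr/stdout.\n");
    }
    return sb.toString();
  }
}
{code}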



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054410#comment-14054410
 ] 

Junping Du commented on YARN-2242:
--

bq. you are swallowing diagnostics from the container. 
My bad. How could I miss this?
bq. Imagine AM container failing due to localization failure, we want to show 
the right diagnostics there. The solution for this ticket is to change the 
message on the NM side, not the RM side.
As you mentioned, YARN-2013 already addresses the NM side. We have agreed 
above to address the RM side separately here, to provide more diagnostic 
info.

 Improve exception information on AM launch crashes
 --

 Key: YARN-2242
 URL: https://issues.apache.org/jira/browse/YARN-2242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu
 Fix For: 2.6.0

 Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
 YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch


 Now, each time the AM container crashes during launch, both the console and the 
 webpage UI report only a ShellExitCodeException. This is not only unhelpful, 
 but sometimes confusing. With the help of the log aggregator, container logs are 
 actually aggregated and can be very helpful for debugging. One possible way 
 to improve the whole process is to send a pointer to the aggregated logs to 
 the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reopened YARN-2242:
--


[~gtCarrera9], I am reopening this jira for an improvement patch. Would you deliver 
one per [~vinodkv] and [~ste...@apache.org]'s suggestions? That includes adding back 
status.getDiagnostics(), and handling the case where trackerUrl is null.

 Improve exception information on AM launch crashes
 --

 Key: YARN-2242
 URL: https://issues.apache.org/jira/browse/YARN-2242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu
 Fix For: 2.6.0

 Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
 YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch


 Now, each time the AM container crashes during launch, both the console and the 
 webpage UI report only a ShellExitCodeException. This is not only unhelpful, 
 but sometimes confusing. With the help of the log aggregator, container logs are 
 actually aggregated and can be very helpful for debugging. One possible way 
 to improve the whole process is to send a pointer to the aggregated logs to 
 the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: abc.patch

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: abc.patch, t.patch, trust .patch, trust.patch, 
 trust.patch, trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's TRUST 
 status in the cluster (we can get the TRUST status via the API of the OAT 
 server), so I added this feature to Hadoop's scheduler.
 Through the TRUST check service, a node can get its own TRUST status and
 then, through the heartbeat, send the TRUST status to the resource manager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', it will be 
 skipped until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***
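A purely hypothetical sketch of the service described above (not the attached patch): a NodeManager-side daemon thread that polls an OAT server and caches the result so the heartbeat can report it. The class name, endpoint format, and plain-text 'true'/'false' response are all assumptions.
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

public class TrustCheckService extends Thread {

  // Hypothetical endpoint, e.g. http://oat-server/trust?host=<nm-host>
  private final String oatQueryUrl;
  private final long intervalMs;
  private final AtomicBoolean trusted = new AtomicBoolean(true);

  public TrustCheckService(String oatQueryUrl, long intervalMs) {
    super("TrustCheckService");
    this.oatQueryUrl = oatQueryUrl;
    this.intervalMs = intervalMs;
    setDaemon(true);
  }

  /** Latest TRUST status; the NM heartbeat would attach this value. */
  public boolean isTrusted() {
    return trusted.get();
  }

  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        trusted.set(queryOatServer());
      } catch (Exception e) {
        // Mirror the health-check service: treat errors as untrusted.
        trusted.set(false);
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException ie) {
        return;
      }
    }
  }

  private boolean queryOatServer() throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(oatQueryUrl).openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line = in.readLine();
      return line != null && "true".equalsIgnoreCase(line.trim());
    } finally {
      conn.disconnect();
    }
  }
}
{code}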



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054427#comment-14054427
 ] 

Li Lu commented on YARN-2242:
-

[~djp] Sure, I'll do that. 

 Improve exception information on AM launch crashes
 --

 Key: YARN-2242
 URL: https://issues.apache.org/jira/browse/YARN-2242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu
 Fix For: 2.6.0

 Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
 YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch


 Currently, each time the AM container crashes during launch, both the console and the 
 web UI only report a ShellExitCodeException. This is not only unhelpful, 
 but sometimes confusing. With the help of the log aggregator, container logs are 
 actually aggregated and can be very helpful for debugging. One possible way 
 to improve the whole process is to send a pointer to the aggregated logs to 
 the programmer when reporting exception information.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes

2014-07-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054428#comment-14054428
 ] 

Junping Du commented on YARN-2013:
--

[~gtCarrera9], I reopened YARN-2242 since we agreed to address the RM and NM sides 
separately. Let's do an improved patch on that JIRA. 
[~ozawa], thanks for the patch here, which is headed in a good direction. Do you think we 
should do a similar thing for LinuxContainerExecutor? If so, please add it. Also, I 
think it would be better to add some unit tests (e.g., in TestContainerLaunch.java) 
to verify the messages.
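A rough idea of the kind of unit check suggested here (illustrative only, not part of TestContainerLaunch.java, and it assumes a Unix shell with bash): run a command that writes to stderr and fails, then assert that the stderr text shows up in the diagnostics built from the ExitCodeException.
{code}
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;

import org.apache.hadoop.util.Shell.ExitCodeException;
import org.apache.hadoop.util.Shell.ShellCommandExecutor;
import org.junit.Test;

public class TestLaunchDiagnosticsSketch {

  @Test
  public void testStderrSurfacesInDiagnostics() throws Exception {
    // Run a command that writes to stderr and exits non-zero, the way a broken
    // launch script would.
    ShellCommandExecutor shExec = new ShellCommandExecutor(
        new String[] { "bash", "-c", "echo 'bad container config' 1>&2; exit 1" });
    try {
      shExec.execute();
      fail("expected the launch command to fail");
    } catch (ExitCodeException e) {
      // Shell collects the child's stderr into the exception message, so a
      // diagnostics string built from it should carry the real cause.
      String diagnostics = "Exception from container-launch: " + e.getMessage()
          + "\n" + shExec.getOutput();
      assertTrue(diagnostics.contains("bad container config"));
    }
  }
}
{code}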


 The diagnostics is always the ExitCodeException stack when the container 
 crashes
 

 Key: YARN-2013
 URL: https://issues.apache.org/jira/browse/YARN-2013
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, 
 YARN-2013.3-2.patch, YARN-2013.3.patch


 When a container crashes, an ExitCodeException is thrown from Shell. 
 Default/LinuxContainerExecutor catches the exception and puts the exception 
 stack into the diagnostics. Therefore, the diagnostics are always the same exception stack. 
 {code}
 String diagnostics = "Exception from container-launch: \n"
 + StringUtils.stringifyException(e) + "\n" + shExec.getOutput();
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
 {code}
 In addition, it seems that the exception always has an empty message, as 
 there is no message from stderr. Hence, the diagnostics are not of much use for 
 users to analyze the reason for the container crash.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054436#comment-14054436
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654438/abc.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4214//console

This message is automatically generated.

 Add one service to check the nodes' TRUST status 
 -

 Key: YARN-2142
 URL: https://issues.apache.org/jira/browse/YARN-2142
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager, scheduler, webapp
 Environment: OS:Ubuntu 13.04; 
 JAVA:OpenJDK 7u51-2.4.4-0
 Only in branch-2.2.0.
Reporter: anders
Priority: Minor
  Labels: features
 Attachments: abc.patch, t.patch, trust .patch, trust.patch, 
 trust.patch, trust003.patch, trust2.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Because of the critical computing environment, we must check every node's TRUST 
 status in the cluster (we can get the TRUST status via the API of the OAT 
 server), so I added this feature to Hadoop's scheduler.
 Through the TRUST check service, a node can get its own TRUST status and
 then, through the heartbeat, send the TRUST status to the resource manager for 
 scheduling.
 In the scheduling step, if a node's TRUST status is 'false', it will be 
 skipped until its TRUST status turns to 'true'.
 ***The logic of this feature is similar to the node health check service.
 ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054437#comment-14054437
 ] 

Hadoop QA commented on YARN-2260:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654432/YARN-2260.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4213//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4213//console

This message is automatically generated.

 Add containers to launchedContainers list in RMNode on container recovery
 -

 Key: YARN-2260
 URL: https://issues.apache.org/jira/browse/YARN-2260
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2260.1.patch, YARN-2260.2.patch


 The justLaunchedContainers map in RMNode should be re-populated when a 
 container is sent from the NM for recovery.
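A rough sketch of what that re-population amounts to; launchedContainers follows the field named in the description, but the surrounding code is illustrative rather than the attached patch.
{code}
// Illustrative sketch only -- not the YARN-2260 patch.
import java.util.List;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

final class ContainerRecoverySketch {
  private ContainerRecoverySketch() {}

  // Re-populate the RMNode's launched-containers tracking from the statuses the
  // NM reports on re-registration, so containers already running before the RM
  // restart are not lost.
  static void recoverLaunchedContainers(Set<ContainerId> launchedContainers,
      List<ContainerStatus> nmReportedContainers) {
    for (ContainerStatus status : nmReportedContainers) {
      if (status.getState() == ContainerState.RUNNING) {
        launchedContainers.add(status.getContainerId());
      }
    }
  }
}
{code}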



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2131) Add a way to nuke the RMStateStore

2014-07-07 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054439#comment-14054439
 ] 

Karthik Kambatla commented on YARN-2131:


Looks good. +1. 

One nit that I can fix at commit time: rename ZKStore#deleteWithRetriesHelper 
to recursivedeleteWithRetries and add a comment about recursion. 

Also, maybe in another JIRA, we should make sure we don't format the store while an RM is 
actively running. I am not sure how easy that is, particularly in an HA setting. 
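For context, the recursive delete that the rename refers to roughly amounts to the following. This is a hedged sketch against the plain ZooKeeper client API, without the retry wrapper, and not the actual RMStateStore code.
{code}
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

final class ZkRecursiveDeleteSketch {
  private ZkRecursiveDeleteSketch() {}

  /** Delete a znode and everything under it; retries are intentionally omitted. */
  static void deleteRecursively(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    List<String> children = zk.getChildren(path, false);
    for (String child : children) {
      deleteRecursively(zk, path + "/" + child);
    }
    zk.delete(path, -1); // -1 matches any znode version
  }
}
{code}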

 Add a way to nuke the RMStateStore
 --

 Key: YARN-2131
 URL: https://issues.apache.org/jira/browse/YARN-2131
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Robert Kanter
 Attachments: YARN-2131.patch, YARN-2131.patch


 There are cases when we don't want to recover past applications, but recover 
 applications going forward. To do this, one has to clear the store. Today, 
 there is no easy way to do this and users should understand how each store 
 works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-796:


Attachment: Node-labels-Requirements-Design-doc-V1.pdf

I've attached the design doc -- Node-labels-Requirements-Design-doc-V1.pdf. 
This is a doc we're working on; any feedback is welcome, and we can continuously 
improve the design doc.

Thanks,
Wangda Tan

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, 
 Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture, etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously, we need to support admin operations for adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054471#comment-14054471
 ] 

Zhijie Shen commented on YARN-2158:
---

[~vvasudev], the patch seems to add additional debugging information for the 
test case. However, do you know the exact reason why the test case fails 
occasionally?

 TestRMWebServicesAppsModification sometimes fails in trunk
 --

 Key: YARN-2158
 URL: https://issues.apache.org/jira/browse/YARN-2158
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Varun Vasudev
Priority: Minor
 Attachments: apache-yarn-2158.0.patch


 From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
 {code}
 Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
 testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
   Time elapsed: 2.297 sec   FAILURE!
 java.lang.AssertionError: app state incorrect
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054500#comment-14054500
 ] 

Nishan Shetty commented on YARN-2258:
-

Successful flow
{code}
ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1(1032483,114):2014-07-06
 22:01:52,928 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from NEW to INITING
ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1(1032499,114):2014-07-06
 22:01:52,974 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from INITING to RUNNING
ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1(1033850,114):2014-07-06
 22:02:56,905 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from RUNNING to 
APPLICATION_RESOURCES_CLEANINGUP
ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1(1033853,114):2014-07-06
 22:02:57,048 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from 
APPLICATION_RESOURCES_CLEANINGUP to FINISHED
{code}

Failed flow (the application never transitions past RUNNING on this NM, so its log aggregation never starts)
{code}
ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1(1074500,114):2014-07-06
 22:37:03,775 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0056 transitioned from NEW to INITING
ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1(1074502,114):2014-07-06
 22:37:03,860 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0056 transitioned from INITING to RUNNING
{code}



 Aggregation of MR job logs failing when Resourcemanager switches
 

 Key: YARN-2258
 URL: https://issues.apache.org/jira/browse/YARN-2258
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty

 1. Install RM in HA mode
 2. Run a job with many tasks
 3. Induce an RM switchover while the job is in progress
 Observe that log aggregation fails for the job that is running when the 
 Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054503#comment-14054503
 ] 

Varun Vasudev commented on YARN-2158:
-

[~zjshen] I'm unsure why the test fails occasionally. I suspect the app is in 
the New/Submitted/Failed state but the test expects it to be in the Accepted or 
Killed state. The patch above will let us know the next time the test fails.
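For illustration, the kind of assertion-message improvement being discussed might look like the following; the helper is hypothetical, not the actual apache-yarn-2158.0.patch. A call such as assertAppStateIn(state, "ACCEPTED", "KILLED") would then report the observed state when the intermittent failure recurs.
{code}
// Illustrative sketch only -- fail with the observed state in the message
// instead of a bare "app state incorrect".
import static org.junit.Assert.assertTrue;

import java.util.Arrays;
import java.util.List;

public final class AppStateAssertSketch {
  private AppStateAssertSketch() {}

  public static void assertAppStateIn(String actualState, String... expectedStates) {
    List<String> expected = Arrays.asList(expectedStates);
    assertTrue("app state incorrect: expected one of " + expected
        + " but was " + actualState, expected.contains(actualState));
  }
}
{code}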

 TestRMWebServicesAppsModification sometimes fails in trunk
 --

 Key: YARN-2158
 URL: https://issues.apache.org/jira/browse/YARN-2158
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Varun Vasudev
Priority: Minor
 Attachments: apache-yarn-2158.0.patch


 From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
 {code}
 Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
 testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
   Time elapsed: 2.297 sec   FAILURE!
 java.lang.AssertionError: app state incorrect
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2258:


Affects Version/s: (was: 2.4.1)
   2.4.0

 Aggregation of MR job logs failing when Resourcemanager switches
 

 Key: YARN-2258
 URL: https://issues.apache.org/jira/browse/YARN-2258
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager
Affects Versions: 2.4.0
Reporter: Nishan Shetty

 1. Install RM in HA mode
 2. Run a job with many tasks
 3. Induce an RM switchover while the job is in progress
 Observe that log aggregation fails for the job that is running when the 
 Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2259:


Affects Version/s: (was: 2.4.1)
   2.4.0

 NM-Local dir cleanup failing when Resourcemanager switches
 --

 Key: YARN-2259
 URL: https://issues.apache.org/jira/browse/YARN-2259
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.0
 Environment: 
Reporter: Nishan Shetty

 Induce an RM switchover while a job is in progress.
 Observe that NM local-dir cleanup fails when the Resourcemanager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054533#comment-14054533
 ] 

Zhijie Shen commented on YARN-2158:
---

Makes sense. Let's commit this patch. Once the intermittent test failure happens 
again, we will have more information.

 TestRMWebServicesAppsModification sometimes fails in trunk
 --

 Key: YARN-2158
 URL: https://issues.apache.org/jira/browse/YARN-2158
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Varun Vasudev
Priority: Minor
 Attachments: apache-yarn-2158.0.patch


 From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
 {code}
 Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
 testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
   Time elapsed: 2.297 sec   FAILURE!
 java.lang.AssertionError: app state incorrect
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054543#comment-14054543
 ] 

Hudson commented on YARN-2158:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5838 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5838/])
YARN-2158. Improved assertion messages of TestRMWebServicesAppsModification. 
Contributed by Varun Vasudev. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608667)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java


 TestRMWebServicesAppsModification sometimes fails in trunk
 --

 Key: YARN-2158
 URL: https://issues.apache.org/jira/browse/YARN-2158
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Varun Vasudev
Priority: Minor
 Attachments: apache-yarn-2158.0.patch


 From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
 {code}
 Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
 testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
   Time elapsed: 2.297 sec   FAILURE!
 java.lang.AssertionError: app state incorrect
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)