[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated

2014-07-07 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054574#comment-14054574
 ] 

Varun Saxena commented on YARN-2256:


I will make the following changes:
1. Create new logSuccess and logFailure methods with an additional parameter 
indicating the log level. This can be an enum in RMAuditLogger and NMAuditLogger.
2. The existing logSuccess method in RMAuditLogger will continue printing logs 
at INFO level. The new method can be used to print logs at the appropriate level.
3. Change the container-related logs to DEBUG (a rough sketch of the new 
overload is below).
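
A minimal sketch of what such an overload could look like, assuming a hypothetical Level 
enum inside the audit logger; the names are illustrative only and not the actual YARN-2256 patch:
{code}
// Illustrative sketch only -- not the actual YARN-2256 patch. It assumes a
// hypothetical Level enum and shows how an overloaded logSuccess could route
// per-container events to DEBUG while the old signature keeps logging at INFO.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class NMAuditLoggerSketch {
  private static final Log LOG = LogFactory.getLog(NMAuditLoggerSketch.class);

  /** Hypothetical log-level selector for audit entries. */
  public enum Level { INFO, DEBUG }

  // Existing behaviour: always INFO.
  public static void logSuccess(String user, String operation, String component) {
    logSuccess(user, operation, component, Level.INFO);
  }

  // New overload: the caller picks the level, e.g. DEBUG for container events.
  public static void logSuccess(String user, String operation, String component,
      Level level) {
    String entry = "USER=" + user + "\tOPERATION=" + operation
        + "\tTARGET=" + component + "\tRESULT=SUCCESS";
    if (level == Level.DEBUG) {
      if (LOG.isDebugEnabled()) {
        LOG.debug(entry);
      }
    } else {
      LOG.info(entry);
    }
  }
}
{code}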


> Too many nodemanager audit logs are generated
> -
>
> Key: YARN-2256
> URL: https://issues.apache.org/jira/browse/YARN-2256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.4.0
>Reporter: Varun Saxena
>
> The following audit logs are generated too many times (due to the possibility of a 
> large number of containers):
> 1. In NM - audit logs corresponding to starting, stopping and finishing a 
> container
> 2. In RM - audit logs corresponding to the AM allocating a container and the AM 
> releasing a container
> We can have different log levels even for NM and RM audit logs and move these 
> successful container-related logs to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup

2014-07-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054571#comment-14054571
 ] 

Vinod Kumar Vavilapalli commented on YARN-2261:
---

The proposal here is to have a YARN application-level cleanup container that 
runs only as the last thing for an application in the cluster.
 - In a way, we already have this today, as we let AMs hang around for a while 
(by default, 10 minutes) *after* unregister - this feature makes it explicit.
 - For those who have been around this space for a while, this is akin to 
MR job-cleanup.
 - This feature lets apps submit a separate container-launch-context for 
cleanup, one that is only run after the app is done for real.
 - Clearly, it will
-- be optional.
-- have timeouts on how much time it can take to finish (default, 
overridable, and an upper limit; default = today's grace time for AMs to exit after 
unregister?)
-- have resource-request limits as usual.
-- possibly have its own retries (cleanup failure != application failure, as 
today?)

Some challenges
 - The cleanup container may not get resources because the cluster may have become 
busy after the final AM exits. A solution is to reserve (part of) the resources used 
by the last AM for the cleanup container. A purely hypothetical sketch of the 
proposed flow follows.
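
To make the proposal concrete, a purely hypothetical sketch follows. ContainerLaunchContext, 
Resource and Records are real YARN API records, but the cleanup-specific setters shown in the 
comments do not exist today; they only illustrate the shape of the feature being proposed:
{code}
// Hypothetical illustration of the proposal -- no such cleanup API exists in
// YARN today. The records used are real; the commented-out setters are invented.
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.Records;

public class CleanupContainerSketch {
  static void addCleanup(ApplicationSubmissionContext appContext) {
    // A separate launch context that should run only after the app is done for real.
    ContainerLaunchContext cleanupCtx = Records.newRecord(ContainerLaunchContext.class);
    cleanupCtx.setCommands(Collections.singletonList(
        "/bin/bash cleanup.sh 1><LOG_DIR>/cleanup.out 2><LOG_DIR>/cleanup.err"));

    Resource cleanupResource = Resource.newInstance(512 /* MB */, 1 /* vcores */);

    // Invented methods, shown only to make the proposal concrete:
    // appContext.setCleanupContainer(cleanupCtx, cleanupResource);
    // appContext.setCleanupTimeoutMs(10 * 60 * 1000); // ~ today's post-unregister grace
  }
}
{code}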

> YARN should have a way to run post-application cleanup
> --
>
> Key: YARN-2261
> URL: https://issues.apache.org/jira/browse/YARN-2261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> See MAPREDUCE-5956 for context. Specific options are at 
> https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054566#comment-14054566
 ] 

Hadoop QA commented on YARN-2248:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654330/YARN-2248-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4215//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4215//console

This message is automatically generated.

> Capacity Scheduler changes for moving apps between queues
> -
>
> Key: YARN-2248
> URL: https://issues.apache.org/jira/browse/YARN-2248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Janos Matyas
>Assignee: Janos Matyas
> Attachments: YARN-2248-1.patch, YARN-2248-2.patch
>
>
> We would like to have the capability (same as the Fair Scheduler has) to move 
> applications between queues. 
> We have made a baseline implementation and tests to start with - and we would 
> like the community to review, come up with suggestions and finally have this 
> contributed. 
> The current implementation is available for 2.4.1 - so the first thing is 
> that we'd need to identify the target version as there are differences 
> between 2.4.* and 3.* interfaces.
> The story behind is available at 
> http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ 
> and the baseline implementation and test at:
> https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924
> https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2261) YARN should have a way to run post-application cleanup

2014-07-07 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-2261:
-

 Summary: YARN should have a way to run post-application cleanup
 Key: YARN-2261
 URL: https://issues.apache.org/jira/browse/YARN-2261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


See MAPREDUCE-5956 for context. Specific options are at 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054543#comment-14054543
 ] 

Hudson commented on YARN-2158:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5838 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5838/])
YARN-2158. Improved assertion messages of TestRMWebServicesAppsModification. 
Contributed by Varun Vasudev. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608667)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java


> TestRMWebServicesAppsModification sometimes fails in trunk
> --
>
> Key: YARN-2158
> URL: https://issues.apache.org/jira/browse/YARN-2158
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Varun Vasudev
>Priority: Minor
> Attachments: apache-yarn-2158.0.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
> {code}
> Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
> testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
>   Time elapsed: 2.297 sec  <<< FAILURE!
> java.lang.AssertionError: app state incorrect
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054533#comment-14054533
 ] 

Zhijie Shen commented on YARN-2158:
---

Makes sense. Let's commit this patch. Once the intermittent test failure happens 
again, we will have more information.

> TestRMWebServicesAppsModification sometimes fails in trunk
> --
>
> Key: YARN-2158
> URL: https://issues.apache.org/jira/browse/YARN-2158
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Varun Vasudev
>Priority: Minor
> Attachments: apache-yarn-2158.0.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
> {code}
> Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
> testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
>   Time elapsed: 2.297 sec  <<< FAILURE!
> java.lang.AssertionError: app state incorrect
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2259:


Affects Version/s: (was: 2.4.1)
   2.4.0

> NM-Local dir cleanup failing when Resourcemanager switches
> --
>
> Key: YARN-2259
> URL: https://issues.apache.org/jira/browse/YARN-2259
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
> Environment: 
>Reporter: Nishan Shetty
>
> Induce an RM switchover while a job is in progress.
> Observe that NM-Local dir cleanup fails when the Resourcemanager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2258:


Affects Version/s: (was: 2.4.1)
   2.4.0

> Aggregation of MR job logs failing when Resourcemanager switches
> 
>
> Key: YARN-2258
> URL: https://issues.apache.org/jira/browse/YARN-2258
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.4.0
>Reporter: Nishan Shetty
>
> 1. Install RM in HA mode
> 2. Run a job with many tasks
> 3. Induce an RM switchover while the job is in progress
> Observe that log aggregation fails for the job that is running when the 
> Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054503#comment-14054503
 ] 

Varun Vasudev commented on YARN-2158:
-

[~zjshen] I'm unsure why the test fails occasionally. I suspect the app is in 
the New/Submitted/Failed state but the test expects it to be in the Accepted or 
Killed state. The patch above will let us know the next time the test fails.
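
As a rough illustration of the direction (a hedged sketch, not the actual 
apache-yarn-2158.0.patch; the helper name only mirrors the verifyAppStateJson frame in the 
stack trace below), including the observed state in the assertion message turns 
"app state incorrect" into something actionable:
{code}
// Hedged sketch of the "report the actual app state on failure" idea.
import static org.junit.Assert.assertTrue;

import java.util.Arrays;

public class AppStateAssertSketch {
  static void verifyAppState(String actualState, String... expectedStates) {
    boolean matched = Arrays.asList(expectedStates).contains(actualState);
    // On failure this reports what the state actually was, e.g.
    // "app state incorrect: expected one of [ACCEPTED, KILLED] but was FAILED".
    assertTrue("app state incorrect: expected one of "
        + Arrays.toString(expectedStates) + " but was " + actualState, matched);
  }
}
{code}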

> TestRMWebServicesAppsModification sometimes fails in trunk
> --
>
> Key: YARN-2158
> URL: https://issues.apache.org/jira/browse/YARN-2158
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Varun Vasudev
>Priority: Minor
> Attachments: apache-yarn-2158.0.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
> {code}
> Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
> testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
>   Time elapsed: 2.297 sec  <<< FAILURE!
> java.lang.AssertionError: app state incorrect
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054500#comment-14054500
 ] 

Nishan Shetty commented on YARN-2258:
-

Successful flow
{code}
"ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1"(1032483,114):2014-07-06
 22:01:52,928 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from NEW to INITING
"ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1"(1032499,114):2014-07-06
 22:01:52,974 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from INITING to RUNNING
"ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1"(1033850,114):2014-07-06
 22:02:56,905 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from RUNNING to 
APPLICATION_RESOURCES_CLEANINGUP
"ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1"(1033853,114):2014-07-06
 22:02:57,048 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0022 transitioned from 
APPLICATION_RESOURCES_CLEANINGUP to FINISHED
{code}

Failed flow
{code}
"ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1"(1074500,114):2014-07-06
 22:37:03,775 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0056 transitioned from NEW to INITING
"ftp(1):/home/testos/install/hadoop/logs/yarn-testos-nodemanager-HOST-10-18-40-153.log.1"(1074502,114):2014-07-06
 22:37:03,860 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Application application_1404662892762_0056 transitioned from INITING to RUNNING
{code}



> Aggregation of MR job logs failing when Resourcemanager switches
> 
>
> Key: YARN-2258
> URL: https://issues.apache.org/jira/browse/YARN-2258
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.4.1
>Reporter: Nishan Shetty
>
> 1. Install RM in HA mode
> 2. Run a job with many tasks
> 3. Induce an RM switchover while the job is in progress
> Observe that log aggregation fails for the job that is running when the 
> Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-07-07 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054471#comment-14054471
 ] 

Zhijie Shen commented on YARN-2158:
---

[~vvasudev], the patch seems to add additional debugging information for the 
test case. However, do you know the exact reason why the test case will fail 
occasionally?

> TestRMWebServicesAppsModification sometimes fails in trunk
> --
>
> Key: YARN-2158
> URL: https://issues.apache.org/jira/browse/YARN-2158
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Varun Vasudev
>Priority: Minor
> Attachments: apache-yarn-2158.0.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
> {code}
> Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
> testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
>   Time elapsed: 2.297 sec  <<< FAILURE!
> java.lang.AssertionError: app state incorrect
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-796:


Attachment: Node-labels-Requirements-Design-doc-V1.pdf

I've attached the design doc -- "Node-labels-Requirements-Design-doc-V1.pdf". 
This is a doc we're working on; any feedback is welcome, and we can continuously 
improve it.

Thanks,
Wangda Tan

> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, 
> Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2131) Add a way to nuke the RMStateStore

2014-07-07 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054439#comment-14054439
 ] 

Karthik Kambatla commented on YARN-2131:


Looks good. +1. 

One nit that I can fix at commit time: rename ZKStore#deleteWithRetriesHelper 
to recursivedeleteWithRetries and add a comment about the recursion. 

Also, maybe in another JIRA, we should not format if an RM is actively 
running. I am not sure how easy this is, particularly in an HA setting. 
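
For illustration, a minimal sketch of a recursive delete-with-retries over a raw ZooKeeper 
handle; the actual helper in the ZK-backed store may be structured differently (for example, 
reusing the store's own retry wrapper):
{code}
// Minimal sketch only -- illustrates the recursion the rename above refers to.
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ZkRecursiveDeleteSketch {
  static void recursiveDeleteWithRetries(ZooKeeper zk, String path, int maxRetries)
      throws Exception {
    // Delete the children first (recursion), then the node itself.
    List<String> children = zk.getChildren(path, false);
    for (String child : children) {
      recursiveDeleteWithRetries(zk, path + "/" + child, maxRetries);
    }
    for (int attempt = 0; ; attempt++) {
      try {
        zk.delete(path, -1); // -1 means: ignore the znode version
        return;
      } catch (KeeperException.ConnectionLossException e) {
        if (attempt >= maxRetries) {
          throw e;
        }
      }
    }
  }
}
{code}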

> Add a way to nuke the RMStateStore
> --
>
> Key: YARN-2131
> URL: https://issues.apache.org/jira/browse/YARN-2131
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
> Attachments: YARN-2131.patch, YARN-2131.patch
>
>
> There are cases when we don't want to recover past applications, but recover 
> applications going forward. To do this, one has to clear the store. Today, 
> there is no easy way to do this and users should understand how each store 
> works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054437#comment-14054437
 ] 

Hadoop QA commented on YARN-2260:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654432/YARN-2260.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4213//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4213//console

This message is automatically generated.

> Add containers to launchedContainers list in RMNode on container recovery
> -
>
> Key: YARN-2260
> URL: https://issues.apache.org/jira/browse/YARN-2260
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2260.1.patch, YARN-2260.2.patch
>
>
> The justLaunchedContainers map in RMNode should be re-populated when 
> container is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054436#comment-14054436
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654438/abc.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4214//console

This message is automatically generated.

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: abc.patch, t.patch, trust .patch, trust.patch, 
> trust.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I added this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and
> then, through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if the node's TRUST status is 'false', it will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054427#comment-14054427
 ] 

Li Lu commented on YARN-2242:
-

[~djp] sure I'll do that. 

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
> YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and the 
> web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes

2014-07-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054428#comment-14054428
 ] 

Junping Du commented on YARN-2013:
--

[~gtCarrera9], I reopened YARN-2242 as we agreed to address the RM and NM sides 
separately. Let's do an improved patch on that JIRA. 
[~ozawa], thanks for the patch here, which is headed in a good direction. Do you think 
we should do a similar thing for LinuxContainerExecutor? If so, please add it. Also, I 
think it is better to add some unit tests (e.g. in TestContainerLaunch.java) 
to verify the messages.
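
As a hedged sketch of that direction (not the actual YARN-2013 patch): build the diagnostics 
from the exit code and the executor's captured output instead of the full exception stack, so 
the Default and Linux container executors could share the same helper and a unit test can 
assert on the message:
{code}
// Hedged sketch only -- not the actual YARN-2013 patch.
public class ContainerDiagnosticsSketch {
  static String buildDiagnostics(int exitCode, String shellOutput) {
    StringBuilder sb = new StringBuilder();
    sb.append("Exception from container-launch.\n");
    sb.append("Container exited with a non-zero exit code ").append(exitCode).append("\n");
    if (shellOutput != null && !shellOutput.isEmpty()) {
      // Surface the captured stderr/stdout rather than the identical stack trace.
      sb.append("Shell output: ").append(shellOutput).append("\n");
    }
    return sb.toString();
  }
}
{code}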


> The diagnostics is always the ExitCodeException stack when the container 
> crashes
> 
>
> Key: YARN-2013
> URL: https://issues.apache.org/jira/browse/YARN-2013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Zhijie Shen
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2013.1.patch, YARN-2013.2.patch, 
> YARN-2013.3-2.patch, YARN-2013.3.patch
>
>
> When a container crashes, an ExitCodeException will be thrown from Shell. 
> Default/LinuxContainerExecutor captures the exception and puts the exception 
> stack into the diagnostics. Therefore, the exception stack is always the same. 
> {code}
> String diagnostics = "Exception from container-launch: \n"
> + StringUtils.stringifyException(e) + "\n" + shExec.getOutput();
> container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
> diagnostics));
> {code}
> In addition, it seems that the exception always has an empty message, as 
> there's no message from stderr. Hence the diagnostics are not of much use for 
> users to analyze the reason for the container crash.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: abc.patch

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: abc.patch, t.patch, trust .patch, trust.patch, 
> trust.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I added this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and
> then, through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if the node's TRUST status is 'false', it will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reopened YARN-2242:
--


[~gtCarrera9], I reopened this JIRA for an improvement patch. Would you deliver 
one per [~vinodkv]'s and [~ste...@apache.org]'s suggestions? This includes adding back 
status.getDiagnostics() and handling the case where trackerUrl = null.

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
> YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and the 
> web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054410#comment-14054410
 ] 

Junping Du commented on YARN-2242:
--

bq. you are swallowing diagnostics from the container. 
My bad. How could I miss this?
bq. Imagine AM container failing due to localization failure, we want to show 
the right diagnostics there. The solution for this ticket is to change the 
message on the NM side, not the RM side.
As you mentioned, YARN-2013 already addressed the NM side. Here we have 
agreed above to address the RM side separately to provide more diagnostic 
info.

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
> YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and the 
> web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2260:
--

Attachment: YARN-2260.2.patch

Fix test failures

> Add containers to launchedContainers list in RMNode on container recovery
> -
>
> Key: YARN-2260
> URL: https://issues.apache.org/jira/browse/YARN-2260
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2260.1.patch, YARN-2260.2.patch
>
>
> The justLaunchedContainers map in RMNode should be re-populated when 
> container is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes

2014-07-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054382#comment-14054382
 ] 

Li Lu commented on YARN-2013:
-

Hi [~ozawa], I just closed YARN-2242 as a duplicate of this issue. Could you 
please add back the diagnostics information that I removed in my patch? I 
can do this cleanup if you don't want to. 

> The diagnostics is always the ExitCodeException stack when the container 
> crashes
> 
>
> Key: YARN-2013
> URL: https://issues.apache.org/jira/browse/YARN-2013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Zhijie Shen
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2013.1.patch, YARN-2013.2.patch, 
> YARN-2013.3-2.patch, YARN-2013.3.patch
>
>
> When a container crashes, an ExitCodeException will be thrown from Shell. 
> Default/LinuxContainerExecutor captures the exception and puts the exception 
> stack into the diagnostics. Therefore, the exception stack is always the same. 
> {code}
> String diagnostics = "Exception from container-launch: \n"
> + StringUtils.stringifyException(e) + "\n" + shExec.getOutput();
> container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
> diagnostics));
> {code}
> In addition, it seems that the exception always has an empty message, as 
> there's no message from stderr. Hence the diagnostics are not of much use for 
> users to analyze the reason for the container crash.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054376#comment-14054376
 ] 

Hadoop QA commented on YARN-2260:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654414/YARN-2260.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
  
org.apache.hadoop.yarn.server.resourcemanager.TestRMNodeTransitions
  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4212//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4212//console

This message is automatically generated.

> Add containers to launchedContainers list in RMNode on container recovery
> -
>
> Key: YARN-2260
> URL: https://issues.apache.org/jira/browse/YARN-2260
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2260.1.patch
>
>
> The justLaunchedContainers map in RMNode should be re-populated when 
> container is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu resolved YARN-2242.
-

Resolution: Duplicate

Closed as a duplicate of YARN-2013. 

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
> YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and the 
> web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054368#comment-14054368
 ] 

Li Lu commented on YARN-2242:
-

The changes in YARN-2013 preserve the shell exception information while enhancing the 
overall user experience on AM launch crashes. So I agree that we should merge 
these two issues and keep working on the YARN-2013 patch.

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
> YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch
>
>
> Now on each time AM Container crashes during launch, both the console and the 
> webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, 
> but sometimes confusing. With the help of log aggregator, container logs are 
> actually aggregated, and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054365#comment-14054365
 ] 

Jian He commented on YARN-2001:
---

bq. found more issues that the RM may receive the 
release-container-request (sent by the AM on resync) before the containers are 
actually recovered
Opened YARN-2249 to take care of this.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch
>
>
> After failover, RM may require a certain threshold to determine whether it’s 
> safe to make scheduling decisions and start accepting new container requests 
> from AMs. The threshold could be a certain amount of nodes. i.e. RM waits 
> until a certain amount of nodes joining before accepting new container 
> requests.  Or it could simply be a timeout, only after the timeout RM accepts 
> new requests. 
> NMs joined after the threshold can be treated as new NMs and instructed to 
> kill all its containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054346#comment-14054346
 ] 

Wangda Tan commented on YARN-2069:
--

Hi [~mayank_bansal],
Please let me know if I haven't understood this correctly.
For the following code snippet,
{code}
1  Resource userLimitforQueue = qT.leafQueue.computeUserLimit(fc,
2  clusterResource, Resources.none());
3  if (Resources.lessThan(rc, clusterResource, userLimitforQueue,
4  qT.leafQueue.getUser(fc.getUser()).getConsumedResources())) {
5
6// As we have used more resources the user limit,
7// we need to claim back the resources equivalent to
8// consumed resources by user - user limit
9Resource resourcesToClaimBackFromUser = Resources.subtract(qT.leafQueue
10  .getUser(fc.getUser()).getConsumedResources(), userLimitforQueue);
{code}
Lines 1-4 check whether we need to preempt resources from an application. Because 
preemption is a delayed behavior, the "preemptFrom" will not change the return 
value of qT.leafQueue.getUser(fc.getUser()).getConsumedResources(). So for the 5 
apps of user B, each app will be preempted by the amount computed at lines 9-10. 
Is that correct?

> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch
>
>
> This is different from (even if related to, and likely shares code with) 
> YARN-2113.
> YARN-2113 focuses on making sure that even if a queue has its guaranteed 
> capacity, its individual users are treated in line with their limits 
> irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to 
> balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054341#comment-14054341
 ] 

Jian He commented on YARN-2258:
---

Hi [~nishan], thanks for reporting this. Do you mind sharing some logs, 
specifically around where the log aggregation failure happens?

> Aggregation of MR job logs failing when Resourcemanager switches
> 
>
> Key: YARN-2258
> URL: https://issues.apache.org/jira/browse/YARN-2258
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.4.1
>Reporter: Nishan Shetty, Huawei
>
> 1. Install RM in HA mode
> 2. Run a job with many tasks
> 3. Induce an RM switchover while the job is in progress
> Observe that log aggregation fails for the job that is running when the 
> Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054339#comment-14054339
 ] 

Mayank Bansal commented on YARN-2069:
-

[~wangda]

bq. Assume a queue has 10 apps, each app has 5 containers (1G for each 
container, so the queue has 50G mem used). There are two users, each with 5 apps. 
The user-limit is 15G, and the queue's absolute capacity is 30G.
And the first 5 apps belong to user-A, the last 5 apps belong to user-B.
In your current method, user-B will have 20 containers preempted and user-A will 
have none preempted.
After preemption, only 5 containers are left for user-B, and 25 containers are left for 
user-A. User-limit is respected here.

No, if User A has a 15G limit then it will be preempted only 15 GB, and then 
B's tasks will be preempted.



> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch
>
>
> This is different from (even if related to, and likely shares code with) 
> YARN-2113.
> YARN-2113 focuses on making sure that even if a queue has its guaranteed 
> capacity, its individual users are treated in line with their limits 
> irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to 
> balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054334#comment-14054334
 ] 

Wangda Tan commented on YARN-2069:
--

Hi [~mayank_bansal],
Thanks for your comments,
I think the change of title/description should be correct; this patch is 
targeted at making cross-queue preemption respect user limits.

I think your other comments all make sense to me. Only the one below:
bq. We need to maintain the reverse order of application submission which only 
can be done iterating through applications as we want to preempt applications 
which are last submitted.
IMHO, this is reasonable but conflicts with this JIRA's scope; let me give you 
an example. 
Assume a queue has 10 apps, each app has 5 containers (1G for each container, 
so the queue has 50G mem used). There are two users, each with 5 apps. The user-limit 
is 15G, and the queue's absolute capacity is 30G.
And the first 5 apps belong to user-A, the last 5 apps belong to user-B.
In your current method, user-B will have 20 containers preempted and user-A will 
have none preempted.
After preemption, only 5 containers are left for user-B, and 25 containers are left for 
user-A. User-limit is respected here.

Does this make sense to you?

Thanks,
Wangda

> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch
>
>
> This is different from (even if related to, and likely shares code with) 
> YARN-2113.
> YARN-2113 focuses on making sure that even if a queue has its guaranteed 
> capacity, its individual users are treated in line with their limits 
> irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to 
> balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2260:
--

Attachment: YARN-2260.1.patch

Patch to re-populate the launchedContainers list in RMNode on recovery:
- Changed the launchedContainers type from a map to a set, as a set is sufficient.
- Added unit tests.

> Add containers to launchedContainers list in RMNode on container recovery
> -
>
> Key: YARN-2260
> URL: https://issues.apache.org/jira/browse/YARN-2260
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2260.1.patch
>
>
> The justLaunchedContainers map in RMNode should be re-populated when 
> container is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery

2014-07-07 Thread Jian He (JIRA)
Jian He created YARN-2260:
-

 Summary: Add containers to launchedContainers list in RMNode on 
container recovery
 Key: YARN-2260
 URL: https://issues.apache.org/jira/browse/YARN-2260
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He


The justLaunchedContainers map in RMNode should be re-populated when container 
is sent from NM for recovery.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2069:


Description: 
This is different from (even if related to, and likely shares code with) 
YARN-2113.

YARN-2113 focuses on making sure that even if a queue has its guaranteed 
capacity, its individual users are treated in line with their limits 
irrespective of when they join in.

This JIRA is about respecting user-limits while preempting containers to 
balance queue capacities.

  was:Preemption today only works across queues and moves around resources 
across queues per demand and usage. We should also have user-level preemption 
within a queue, to balance capacity across users in a predictable manner.


> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch
>
>
> This is different from (even if related to, and likely shares code with) 
> YARN-2113.
> YARN-2113 focuses on making sure that even if a queue has its guaranteed 
> capacity, its individual users are treated in line with their limits 
> irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to 
> balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2113:


Description: Preemption today only works across queues and moves around 
resources across queues per demand and usage. We should also have user-level 
preemption within a queue, to balance capacity across users in a predictable 
manner.  (was: This is different from (even if related to, and likely share 
code with) YARN-2069.

YARN-2069 focuses on making sure that even if a queue has its guaranteed 
capacity, its individual users are treated in line with their limits 
irrespective of when they join in.

This JIRA is about respecting user-limits while preempting containers to 
balance queue capacities.)

> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2113
> URL: https://issues.apache.org/jira/browse/YARN-2113
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Fix For: 2.5.0
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2069:


Summary: CS queue level preemption should respect user-limits  (was: Add 
cross-user preemption within CapacityScheduler's leaf-queue)

> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2014-07-07 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2113:


Summary: Add cross-user preemption within CapacityScheduler's leaf-queue  
(was: CS queue level preemption should respect user-limits)

> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2113
> URL: https://issues.apache.org/jira/browse/YARN-2113
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Fix For: 2.5.0
>
>
> This is different from (even if related to, and likely shares code with) 
> YARN-2069.
> YARN-2069 focuses on making sure that even if a queue has its guaranteed 
> capacity, its individual users are treated in line with their limits 
> irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to 
> balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054283#comment-14054283
 ] 

Jian He commented on YARN-2208:
---


Some comments on the patch:
- RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS -> 
RM_AM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
- am-rm-tokens.master-key-rolling-interval-secs -> 
am-tokens.master-key-rolling-interval-secs
- RM_NMTOKEN ?
{code}
  YarnConfiguration.RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS
  + " should be more than 2 X "
  + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS);
{code}
- Should we cache the AM token password instead of re-computing the password 
each time an RPC is invoked? (a small sketch of this caching idea follows at 
the end of this comment)
{code}
org.apache.hadoop.security.token.Token token =
rm1.getRMContext().getRMApps().get(appAttemptId.getApplicationId())
.getRMAppAttempt(appAttemptId).getAMRMToken();
try {
  UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
  ugi.addTokenIdentifier(token.decodeIdentifier());
} catch (IOException e) {
  throw new YarnRuntimeException(e);
}
{code}
- please fix the test failure also
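
As a rough illustration of that caching idea, here is a minimal, self-contained 
sketch (hypothetical class and names, not the actual patch or the YARN API):
{code}
import java.util.function.Supplier;

// Hypothetical sketch only: derive the AM token password once and reuse it,
// instead of re-computing it on every RPC. The supplier stands in for whatever
// the secret manager does today (an assumption, not the real API).
final class CachedTokenPassword {
    private final Supplier<byte[]> compute;
    private volatile byte[] cached;

    CachedTokenPassword(Supplier<byte[]> compute) {
        this.compute = compute;
    }

    byte[] get() {
        byte[] password = cached;
        if (password == null) {
            synchronized (this) {
                if (cached == null) {
                    cached = compute.get();   // computed exactly once
                }
                password = cached;
            }
        }
        return password;
    }
}
{code}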

> AMRMTokenManager need to have a way to roll over AMRMToken
> --
>
> Key: YARN-2208
> URL: https://issues.apache.org/jira/browse/YARN-2208
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054275#comment-14054275
 ] 

Hadoop QA commented on YARN-2001:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654403/YARN-2001.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4211//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4211//console

This message is automatically generated.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch
>
>
> After failover, RM may require a certain threshold to determine whether it’s 
> safe to make scheduling decisions and start accepting new container requests 
> from AMs. The threshold could be a certain amount of nodes. i.e. RM waits 
> until a certain amount of nodes joining before accepting new container 
> requests.  Or it could simply be a timeout, only after the timeout RM accepts 
> new requests. 
> NMs joined after the threshold can be treated as new NMs and instructed to 
> kill all its containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054240#comment-14054240
 ] 

Jian He commented on YARN-2001:
---

Uploaded a new patch:
- added a unit test.
- fixed a bug in testAppReregisterOnRMWorkPreservingRestart

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch
>
>
> After failover, RM may require a certain threshold to determine whether it’s 
> safe to make scheduling decisions and start accepting new container requests 
> from AMs. The threshold could be a certain amount of nodes. i.e. RM waits 
> until a certain amount of nodes joining before accepting new container 
> requests.  Or it could simply be a timeout, only after the timeout RM accepts 
> new requests. 
> NMs joined after the threshold can be treated as new NMs and instructed to 
> kill all its containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Attachment: YARN-2001.2.patch

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch
>
>
> After failover, RM may require a certain threshold to determine whether it’s 
> safe to make scheduling decisions and start accepting new container requests 
> from AMs. The threshold could be a certain amount of nodes. i.e. RM waits 
> until a certain amount of nodes joining before accepting new container 
> requests.  Or it could simply be a timeout, only after the timeout RM accepts 
> new requests. 
> NMs joined after the threshold can be treated as new NMs and instructed to 
> kill all its containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-07 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054168#comment-14054168
 ] 

Robert Joseph Evans commented on YARN-611:
--

Why are you using java serialization for the retry policy?  There are too many 
problems with java serialization, especially if we are persisting it into a DB, 
like the state store.  Please switch to using something like protocol buffers 
that will allow for forward/backward compatible modifications going forward.

In the javadocs for RMApp.setRetryCount it would be good to explain what the 
retry count actually is and does.

In the constructor for RMAppAttemptImpl there is special logic to call setup 
only for WindowsSlideAMRetryCountResetPolicy.  This completely loses the 
abstraction that the AMResetCountPolicy interface should be providing.  Please 
update the interface so that you don't need special case code for a single 
implementation.
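
A minimal sketch of what such an interface-level hook could look like 
(hypothetical names, not the patch itself):
{code}
// Hypothetical sketch: give every reset-policy implementation the same
// lifecycle hook, so RMAppAttemptImpl never needs to special-case a class.
interface AMRetryCountResetPolicy {
    /** Called once when the attempt is constructed, for every implementation. */
    void setup(long resetWindowMs);

    /** True if the failure count should be reset, given the last failure time. */
    boolean shouldResetRetryCount(long lastFailureTimeMs, long nowMs);
}

// A no-op implementation shows that callers need no special-case code.
final class NeverResetPolicy implements AMRetryCountResetPolicy {
    @Override public void setup(long resetWindowMs) { /* nothing to do */ }
    @Override public boolean shouldResetRetryCount(long lastFailureTimeMs,
        long nowMs) {
        return false;
    }
}
{code}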

In RMAppAttemptImpl you mark setMaybeLastAttemptFlag as Private; this really 
should have been done in the parent interface. In the parent interface you also 
add myBeLastAttempt(); this too should be marked as Private, and both of them 
should have comments noting that they are for testing.

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM 
> is "well behaved", and it's safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail, if not, then the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2259:
---

Issue Type: Sub-task  (was: Bug)
Parent: YARN-149

> NM-Local dir cleanup failing when Resourcemanager switches
> --
>
> Key: YARN-2259
> URL: https://issues.apache.org/jira/browse/YARN-2259
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
> Environment: 
>Reporter: Nishan Shetty, Huawei
>
> Induce RM switchover while job is in progress
> Observe that NM-Local dir cleanup failing when Resourcemanager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2258:
---

Issue Type: Sub-task  (was: Bug)
Parent: YARN-149

> Aggregation of MR job logs failing when Resourcemanager switches
> 
>
> Key: YARN-2258
> URL: https://issues.apache.org/jira/browse/YARN-2258
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.4.1
>Reporter: Nishan Shetty, Huawei
>
> 1.Install RM in HA mode
> 2.Run a job with more tasks
> 3.Induce RM switchover while job is in progress
> Observe that log aggregation fails for the job which is running when  
> Resourcemanager switchover is induced.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054157#comment-14054157
 ] 

Karthik Kambatla commented on YARN-2257:


Agree with both Sandy and Vinod. It looks like there is merit in making the 
QueuePlacementRule general enough to support all schedulers?

> Add user to queue mapping in Fair-Scheduler
> ---
>
> Key: YARN-2257
> URL: https://issues.apache.org/jira/browse/YARN-2257
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Patrick Liu
>  Labels: features
>
> Currently, the fair-scheduler supports two modes, default queue or individual 
> queue for each user.
> Apparently, the default queue is not a good option, because the resources 
> cannot be managed for each user or group.
> However, individual queue for each user is not good enough. Especially when 
> connecting yarn with hive. There will be increasing hive users in a corporate 
> environment. If we create a queue for a user, the resource management will be 
> hard to maintain.
> I think the problem can be solved like this:
> 1. Define user->queue mapping in Fair-Scheduler.xml. Inside each queue, use 
> aclSubmitApps to control user's ability.
> 2. Each time a user submit an app to yarn, if the user has mapped to a queue, 
> the app will be scheduled to that queue; otherwise, the app will be submitted 
> to default queue.
> 3. If the user cannot pass aclSubmitApps limits, the app will not be accepted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054117#comment-14054117
 ] 

Vinod Kumar Vavilapalli commented on YARN-2257:
---

Some part of this is a core YARN feature and shouldn't be built for each 
scheduler - the part about maintaining user-queue mappings and then 
automatically accepting submissions from users into those queues. The 
configuration can be per scheduler.

I propose we fix it in general. Will edit the subject if there is no 
disagreement.
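
A small, self-contained sketch of the mapping behavior proposed in the issue 
description below (hypothetical class, purely illustrative; not any 
scheduler's API):
{code}
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposal quoted below: map a user to a queue if a
// mapping exists, otherwise fall back to the default queue; reject the
// submission if the user fails the queue's aclSubmitApps check.
final class UserQueuePlacementSketch {
    private final Map<String, String> userToQueue;        // from scheduler config
    private final Map<String, Set<String>> aclSubmitApps; // queue -> allowed users

    UserQueuePlacementSketch(Map<String, String> userToQueue,
                             Map<String, Set<String>> aclSubmitApps) {
        this.userToQueue = userToQueue;
        this.aclSubmitApps = aclSubmitApps;
    }

    /** Returns the target queue, or null if the submission should be rejected. */
    String placeApplication(String user) {
        String queue = userToQueue.getOrDefault(user, "default");
        Set<String> allowed = aclSubmitApps.get(queue);
        boolean permitted = allowed == null || allowed.contains(user)
            || allowed.contains("*");
        return permitted ? queue : null;
    }
}
{code}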

> Add user to queue mapping in Fair-Scheduler
> ---
>
> Key: YARN-2257
> URL: https://issues.apache.org/jira/browse/YARN-2257
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Patrick Liu
>  Labels: features
>
> Currently, the fair-scheduler supports two modes, default queue or individual 
> queue for each user.
> Apparently, the default queue is not a good option, because the resources 
> cannot be managed for each user or group.
> However, individual queue for each user is not good enough. Especially when 
> connecting yarn with hive. There will be increasing hive users in a corporate 
> environment. If we create a queue for a user, the resource management will be 
> hard to maintain.
> I think the problem can be solved like this:
> 1. Define user->queue mapping in Fair-Scheduler.xml. Inside each queue, use 
> aclSubmitApps to control user's ability.
> 2. Each time a user submit an app to yarn, if the user has mapped to a queue, 
> the app will be scheduled to that queue; otherwise, the app will be submitted 
> to default queue.
> 3. If the user cannot pass aclSubmitApps limits, the app will not be accepted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053976#comment-14053976
 ] 

Eric Payne commented on YARN-415:
-

[~leftnoteasy], Sorry, I don't think the previous post worked. Trying it again:

Thank you very much for taking the time to review this patch.

Can you please make sure you reviewed the latest patch? There were some old 
patches that contained changes to AppSchedulingInfo, but not the recent ones.

Also, please keep in mind that YARN-415 needs to calculate resource usage for 
running applications as well as completed ones. To do this, it needs access to 
the live containers, a list that is kept in the SchedulerApplicationAttempt 
object.
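
As a purely illustrative aside, the MB-seconds formula quoted in the issue 
description below can be sketched as follows (hypothetical types, not the 
YARN-415 patch):
{code}
import java.util.List;

// Minimal, self-contained sketch of the chargeback formula:
// sum over containers of (reserved MB of container i * lifetime of container i).
class ContainerUsage {
    final long reservedMb;    // memory reserved for the container
    final long startMillis;   // container start time
    final Long finishMillis;  // null while the container is still running

    ContainerUsage(long reservedMb, long startMillis, Long finishMillis) {
        this.reservedMb = reservedMb;
        this.startMillis = startMillis;
        this.finishMillis = finishMillis;
    }
}

final class ChargebackSketch {
    /** MB-seconds over all containers, counting live ones up to nowMillis. */
    static long memoryMbSeconds(List<ContainerUsage> containers, long nowMillis) {
        long total = 0;
        for (ContainerUsage c : containers) {
            long end = (c.finishMillis != null) ? c.finishMillis : nowMillis;
            total += c.reservedMb * ((end - c.startMillis) / 1000);
        }
        return total;
    }
}
{code}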


> Capture memory utilization at the app-level for chargeback
> --
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053970#comment-14053970
 ] 

Eric Payne commented on YARN-415:
-

@wheeleast : Thank you very much for taking the time to review this patch.

Can you please make sure you reviewed the latest patch? There were some old 
patches that contained changes to AppSchedulingInfo, but not the recent ones.

Also, please keep in mind that YARN-415 needs to calculate resource usage for 
running applications as well as completed ones. To do this, it needs access to 
the live containers, a list that is kept in the SchedulerApplicationAttempt 
object.

> Capture memory utilization at the app-level for chargeback
> --
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue

2014-07-07 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053959#comment-14053959
 ] 

Mayank Bansal commented on YARN-2069:
-

hi [~wangda],

Thanks for the review. I updated the patch, please take a look. Let me answer 
your questions.
bq. In ProportionalCapacityPreemptionPolicy,
bq. 1) balanceUserLimitsinQueueForPreemption()
bq. 1.1, I think there's a bug when multiple applications under a same user 
(say Jim) in a queue, and usage of Jim is over user-limit.
Any of Jim's applications will be tried to be preempted 
(total-resource-used-by-Jim - user-limit).
We should remember resourcesToClaimBackFromUser and initialRes for each user 
(not reset them when handling each application)
And it's better to add test to make sure this behavior is correct.

We need to maintain the reverse order of application submission, which can only 
be done by iterating through the applications, as we want to preempt the 
applications that were submitted last. 
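
A tiny sketch of that ordering (hypothetical helper, not the patch):
{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: walk a user's applications in reverse submission order
// so the most recently submitted apps are considered for preemption first.
final class ReverseSubmissionOrder {
    static <A> Deque<A> lastSubmittedFirst(List<A> appsInSubmissionOrder) {
        Deque<A> stack = new ArrayDeque<>();
        for (A app : appsInSubmissionOrder) {
            stack.push(app);  // the last submitted app ends up on top
        }
        return stack;         // iteration order: most recently submitted first
    }
}
{code}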

bq. 1.2, Some debug logging should be removed like
Done

bq. 1.3, This check should be unnecessary
Done

bq. 2) preemptFrom
bq. I noticed this method will be called multiple times for a same application 
within a editSchedule() call.
bq. The reservedContainers will be calculated multiple times.
bq. An alternative way to do this is to cache
This method will be executed only once for all the applications, as we will be 
removing all reservations; for the apps whose reservations have already been 
removed, it would be a no-op.

bq.In LeafQueue,
bq. 1) I think it's better to remember user limit, no need to compute it every 
time, add a method like getUserLimit() to leafQueue should be better.
That value is not static and changes every time based on cluster utilization, 
which is why I am calculating it every time.

bq. 1) Should we preempt containers equally from users when there're multiple 
users beyond user-limit in a queue?
No, it should be based on who submitted last and is over the user limit. It is 
not strictly fair, but we want to preempt the last-submitted jobs first.

bq. 2) Should we preempt containers equally from applications in a same user? 
(Heap-like data structure maybe helpful to solve 1/2)
No, as mentioned above.

bq. 3) Should user-limit preemption be configurable?
I think just having preemption configurable is enough. Thoughts?

Thanks,
Mayank

> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053889#comment-14053889
 ] 

Hadoop QA commented on YARN-415:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12654334/YARN-415.201407071542.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4210//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4210//console

This message is automatically generated.

> Capture memory utilization at the app-level for chargeback
> --
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread Yuliya Feldman (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053885#comment-14053885
 ] 

Yuliya Feldman commented on YARN-796:
-

[~bcwalrus]
Thank you for your comments

Regarding:
The NM can still periodically refresh its own labels, and update the RM via 
the heartbeat mechanism. The RM should also expose a "node label report", which 
is the real-time information of all nodes and their labels.
Yes - there would be a yarn command, "yarn rmadmin -showlabels", that would 
show all the labels in the cluster.

Regarding:
2. Labels are per-container, not per-app. Right? The doc keeps mentioning 
"application label", "ApplicationLabelExpression", etc. Should those be 
"container label" instead? I just want to confirm that each container request 
can carry its own label expression. Example use case: Only the mappers need 
GPU, not the reducers.

The proposal here is to have labels per application, not per container, though 
it is not that hard to specify a label per container (rather, per Request).
There are pros and cons for both (per container and per app):
pros for per-app - the only place to "setLabel" is ApplicationSubmissionContext
cons for per-app - as you said, you may want one configuration for Mappers and 
another for Reducers
cons for container-level labels - every application that wants to take 
advantage of labels will have to code it in its AppMaster while creating 
ResourceRequests

Regarding: 
--- The proposal uses regexes on FQDN, such as perfnode.*. 

The file with labels does not need to contain regexes for FQDNs, since it will 
be based solely on the "hostname" that is used in the "isBlackListed()" method.
But I am certainly open to suggestions on how to get labels from nodes, as long 
as it does not put a high burden on the cluster admin, who needs to specify 
labels per node on the node.

Regarding:
--- Can we fail container requests with no satisfying nodes?

I think it would be the same behavior as for any other request that cannot be 
satisfied because queues were set up incorrectly, or there is no free resource 
available at the moment. How would you differentiate between those cases?






> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, YARN-796.patch
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2014-07-07 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053850#comment-14053850
 ] 

Sangjin Lee commented on YARN-2238:
---

Ping? I'd like to find out more about this. Is this an "expected" behavior?

As in the previous comment, this issue boils down to filtering by search terms 
when a search had been done previously. However, in the case of a search by 
key, value, or source chain, the search term is not displayed in the UI, which 
makes this really strange. I'd appreciate comments on this.

> filtering on UI sticks even if I move away from the page
> 
>
> Key: YARN-2238
> URL: https://issues.apache.org/jira/browse/YARN-2238
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.4.0
>Reporter: Sangjin Lee
> Attachments: filtered.png
>
>
> The main data table in many web pages (RM, AM, etc.) seems to show an 
> unexpected filtering behavior.
> If I filter the table by typing something in the key or value field (or I 
> suspect any search field), the data table gets filtered. The example I used 
> is the job configuration page for a MR job. That is expected.
> However, when I move away from that page and visit any other web page of the 
> same type (e.g. a job configuration page), the page is rendered with the 
> filtering! That is unexpected.
> What's even stranger is that it does not render the filtering term. As a 
> result, I have a page that's mysteriously filtered but doesn't tell me what 
> it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053846#comment-14053846
 ] 

Sandy Ryza commented on YARN-2257:
--

Definitely needed.  This should be implemented as a QueuePlacementRule.

> Add user to queue mapping in Fair-Scheduler
> ---
>
> Key: YARN-2257
> URL: https://issues.apache.org/jira/browse/YARN-2257
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Patrick Liu
>  Labels: features
>
> Currently, the fair-scheduler supports two modes, default queue or individual 
> queue for each user.
> Apparently, the default queue is not a good option, because the resources 
> cannot be managed for each user or group.
> However, individual queue for each user is not good enough. Especially when 
> connecting yarn with hive. There will be increasing hive users in a corporate 
> environment. If we create a queue for a user, the resource management will be 
> hard to maintain.
> I think the problem can be solved like this:
> 1. Define user->queue mapping in Fair-Scheduler.xml. Inside each queue, use 
> aclSubmitApps to control user's ability.
> 2. Each time a user submit an app to yarn, if the user has mapped to a queue, 
> the app will be scheduled to that queue; otherwise, the app will be submitted 
> to default queue.
> 3. If the user cannot pass aclSubmitApps limits, the app will not be accepted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-07 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-415:


Attachment: YARN-415.201407071542.txt

This new patch addresses findbugs issues.

> Capture memory utilization at the app-level for chargeback
> --
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053790#comment-14053790
 ] 

Wei Yan commented on YARN-2252:
---

Thanks, [~rdsr]. I'll take a look later.

> Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
> 
>
> Key: YARN-2252
> URL: https://issues.apache.org/jira/browse/YARN-2252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: trunk-win
>Reporter: Ratandeep Ratti
>  Labels: hadoop2, scheduler, yarn
> Attachments: YARN-2252-1.patch
>
>
> This test-case is failing sporadically on my machine. I think I have a 
> plausible explanation  for this.
> It seems that when the Scheduler is being asked for resources, the resource 
> requests that are being constructed have no preference for the hosts (nodes).
> The two mock hosts constructed, both have a memory of 8192 mb.
> The containers(resources) being requested each require a memory of 1024mb, 
> hence a single node can execute both the resource requests for the 
> application.
> In the end of the test-case it is being asserted that the containers 
> (resource requests) be executed on different nodes, but since we haven't 
> specified any preferences for nodes when requesting the resources, the 
> scheduler (at times) executes both the containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2252:
--

Assignee: (was: Wei Yan)

> Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
> 
>
> Key: YARN-2252
> URL: https://issues.apache.org/jira/browse/YARN-2252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: trunk-win
>Reporter: Ratandeep Ratti
>  Labels: hadoop2, scheduler, yarn
> Attachments: YARN-2252-1.patch
>
>
> This test-case is failing sporadically on my machine. I think I have a 
> plausible explanation  for this.
> It seems that when the Scheduler is being asked for resources, the resource 
> requests that are being constructed have no preference for the hosts (nodes).
> The two mock hosts constructed, both have a memory of 8192 mb.
> The containers(resources) being requested each require a memory of 1024mb, 
> hence a single node can execute both the resource requests for the 
> application.
> In the end of the test-case it is being asserted that the containers 
> (resource requests) be executed on different nodes, but since we haven't 
> specified any preferences for nodes when requesting the resources, the 
> scheduler (at times) executes both the containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-07-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053779#comment-14053779
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

I talked with [~jianhe] offline. We'll change ContainerId based on the 
following design (a rough sketch follows the list):

1. Make containerId long. Add ContainerId#newInstance(ApplicationAttemptId 
appAttemptId, long containerId) as a factory method.
2. Mark {{getId}} as deprecated.
3. Remove epoch field from {{ContainerId}}.
4. Add {{getContainerId}} to return 64bit id including epoch.
5. {{ContainerId#toString}} will return  __
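
A rough sketch of how the epoch and sequence number could be packed into one 
long (the 40/24 bit split below is an assumption for illustration, not the 
agreed format, and the class is hypothetical):
{code}
// Hypothetical sketch only: pack an epoch and a sequence number into a single
// 64-bit container id, in the spirit of the design above.
final class LongContainerIdSketch {
    private static final int SEQUENCE_BITS = 40;
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

    private final long containerId;

    private LongContainerIdSketch(long containerId) {
        this.containerId = containerId;
    }

    static LongContainerIdSketch newInstance(int epoch, long sequenceNumber) {
        return new LongContainerIdSketch(
            ((long) epoch << SEQUENCE_BITS) | (sequenceNumber & SEQUENCE_MASK));
    }

    /** 64-bit id including the epoch, like the proposed getContainerId(). */
    long getContainerId() {
        return containerId;
    }

    /** Sequence number without the epoch, like the deprecated getId(). */
    long getSequenceNumber() {
        return containerId & SEQUENCE_MASK;
    }
}
{code}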


> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, 
> YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, its better to make containerId long. We need to define 
> the new format of container Id with preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues

2014-07-07 Thread Krisztian Horvath (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053750#comment-14053750
 ] 

Krisztian Horvath commented on YARN-2248:
-

Can anyone take a look at the patch? I've some concerns regarding the live 
containers.

Movement steps (a rough outline in code follows the list):

1, Check if the target queue has enough capacity and some more validation, 
exception otherwise (same as with FairScheduler)
2, Remove the app attempt from the current queue
3, Release resources used by live containers on this queue 
4, Remove application upwards root (--numApplications)
5, QueueMetrics update 
6, Set new queue in application
7, Allocate resources consumed by the live containers (basically the resource 
usage moved here from the original queue)
8, Submit new app attempt
9, Add application (++numApplications)
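
A rough outline of those steps in code (hypothetical interfaces, purely to make 
the ordering explicit; this is not the CapacityScheduler API):
{code}
// Hypothetical interfaces only, to make the ordering of the steps above
// explicit; they are not part of the actual patch.
interface MovableQueue {
    void validateMoveInto(String appId);           // capacity / ACL checks
    void removeAttempt(String appId);              // detach attempt, --numApplications
    void releaseLiveContainerUsage(String appId);  // give back used resources
    void chargeLiveContainerUsage(String appId);   // account usage on this queue
    void addAttempt(String appId);                 // resubmit attempt, ++numApplications
}

final class AppMoveSketch {
    static void move(String appId, MovableQueue source, MovableQueue target) {
        target.validateMoveInto(appId);          // step 1: validate, else throw
        source.removeAttempt(appId);             // steps 2, 4, 5 on the source
        source.releaseLiveContainerUsage(appId); // step 3
        target.chargeLiveContainerUsage(appId);  // steps 6, 7 on the target
        target.addAttempt(appId);                // steps 8, 9 on the target
    }
}
{code}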

> Capacity Scheduler changes for moving apps between queues
> -
>
> Key: YARN-2248
> URL: https://issues.apache.org/jira/browse/YARN-2248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Janos Matyas
>Assignee: Janos Matyas
> Attachments: YARN-2248-1.patch, YARN-2248-2.patch
>
>
> We would like to have the capability (same as the Fair Scheduler has) to move 
> applications between queues. 
> We have made a baseline implementation and tests to start with - and we would 
> like the community to review, come up with suggestions and finally have this 
> contributed. 
> The current implementation is available for 2.4.1 - so the first thing is 
> that we'd need to identify the target version as there are differences 
> between 2.4.* and 3.* interfaces.
> The story behind is available at 
> http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ 
> and the baseline implementation and test at:
> https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924
> https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053747#comment-14053747
 ] 

Hudson commented on YARN-1367:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1824 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1824/])
YARN-1367. Changed NM to not kill containers on NM resync if RM work-preserving 
restart is enabled. Contributed by Anubhav Dhoot (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608334)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java


> After restart NM should resync with the RM without killing containers
> -
>
> Key: YARN-1367
> URL: https://issues.apache.org/jira/browse/YARN-1367
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Fix For: 2.5.0
>
> Attachments: YARN-1367.001.patch, YARN-1367.002.patch, 
> YARN-1367.003.patch, YARN-1367.prototype.patch
>
>
> After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
>  Upon receiving the resync response, the NM kills all containers and 
> re-registers with the RM. The NM should be changed to not kill the container 
> and instead inform the RM about all currently running containers including 
> their allocations etc. After the re-register, the NM should send all pending 
> container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053744#comment-14053744
 ] 

Steve Loughran commented on YARN-2242:
--

+1 for retaining/propagating as much information from the shell exception as 
possible. Also, if {{this.getTrackingUrl()}} returns null, that line of output 
should be skipped.
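
A minimal sketch of that second point (hypothetical method and names, not the 
actual code path):
{code}
// Hypothetical sketch: build the diagnostics text and skip the tracking-URL
// line entirely when no URL is available.
final class DiagnosticsSketch {
    static String buildDiagnostics(String shellOutput, String trackingUrl) {
        StringBuilder diag = new StringBuilder("Exception from container-launch:\n");
        diag.append(shellOutput).append('\n');
        if (trackingUrl != null && !trackingUrl.isEmpty()) {
            diag.append("Tracking URL: ").append(trackingUrl).append('\n');
        }
        return diag.toString();
    }
}
{code}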

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, 
> YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch
>
>
> Now on each time AM Container crashes during launch, both the console and the 
> webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, 
> but sometimes confusing. With the help of log aggregator, container logs are 
> actually aggregated, and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2248) Capacity Scheduler changes for moving apps between queues

2014-07-07 Thread Krisztian Horvath (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Horvath updated YARN-2248:


Attachment: YARN-2248-2.patch

> Capacity Scheduler changes for moving apps between queues
> -
>
> Key: YARN-2248
> URL: https://issues.apache.org/jira/browse/YARN-2248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Janos Matyas
>Assignee: Janos Matyas
> Attachments: YARN-2248-1.patch, YARN-2248-2.patch
>
>
> We would like to have the capability (same as the Fair Scheduler has) to move 
> applications between queues. 
> We have made a baseline implementation and tests to start with - and we would 
> like the community to review, come up with suggestions and finally have this 
> contributed. 
> The current implementation is available for 2.4.1 - so the first thing is 
> that we'd need to identify the target version as there are differences 
> between 2.4.* and 3.* interfaces.
> The story behind is available at 
> http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ 
> and the baseline implementation and test at:
> https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924
> https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053704#comment-14053704
 ] 

Allen Wittenauer commented on YARN-796:
---


bq. Instead, each node can supply its own labels, via 
yarn.nodemanager.node.labels (which specifies labels directly) or 
yarn.nodemanager.node.labelFile (which points to a file that has a single line 
containing all the labels). It's easy to generate the label file for each node. 

Why not just generate this on the node manager, a la the health check or 
topology script? Provide a hook to actually execute the script or the class and 
have the NM run it at a user-defined period, including "just at boot". [... and 
before it gets asked, yes, certain classes of hardware *do* allow such dynamic 
change.]
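
A small sketch of such a hook (hypothetical class; the period handling and the 
reporting step are assumptions, not NodeManager behavior):
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: run a user-supplied label script once at boot and then
// on a configurable period, similar in spirit to the NM health-check script.
final class NodeLabelScriptRunner {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** periodMs <= 0 means "run just at boot". */
    void start(Supplier<String> runLabelScript, long periodMs) {
        Runnable refresh = () -> {
            String labels = runLabelScript.get();  // e.g. exec the admin's script
            // reporting `labels` to the RM via the next heartbeat is not shown
        };
        refresh.run();                             // once at boot
        if (periodMs > 0) {
            scheduler.scheduleAtFixedRate(refresh, periodMs, periodMs,
                TimeUnit.MILLISECONDS);
        }
    }
}
{code}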

> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, YARN-796.patch
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053674#comment-14053674
 ] 

Hudson commented on YARN-1367:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1797 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1797/])
YARN-1367. Changed NM to not kill containers on NM resync if RM work-preserving 
restart is enabled. Contributed by Anubhav Dhoot (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608334)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java


> After restart NM should resync with the RM without killing containers
> -
>
> Key: YARN-1367
> URL: https://issues.apache.org/jira/browse/YARN-1367
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Fix For: 2.5.0
>
> Attachments: YARN-1367.001.patch, YARN-1367.002.patch, 
> YARN-1367.003.patch, YARN-1367.prototype.patch
>
>
> After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
>  Upon receiving the resync response, the NM kills all containers and 
> re-registers with the RM. The NM should be changed to not kill the container 
> and instead inform the RM about all currently running containers including 
> their allocations etc. After the re-register, the NM should send all pending 
> container completions to the RM as usual.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes

2014-07-07 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2013:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-522

> The diagnostics is always the ExitCodeException stack when the container 
> crashes
> 
>
> Key: YARN-2013
> URL: https://issues.apache.org/jira/browse/YARN-2013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Zhijie Shen
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2013.1.patch, YARN-2013.2.patch, 
> YARN-2013.3-2.patch, YARN-2013.3.patch
>
>
> When a container crashes, ExitCodeException will be thrown from Shell. 
> Default/LinuxContainerExecutor captures the exception, put the exception 
> stack into the diagnostic. Therefore, the exception stack is always the same. 
> {code}
> String diagnostics = "Exception from container-launch: \n"
> + StringUtils.stringifyException(e) + "\n" + shExec.getOutput();
> container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
> diagnostics));
> {code}
> In addition, it seems that the exception always has a empty message as 
> there's no message from stderr. Hence the diagnostics is not of much use for 
> users to analyze the reason of container crash.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty, Huawei updated YARN-2259:


Description: 
Induce RM switchover while job is in progress

Observe that NM-Local dir cleanup failing when Resourcemanager switches.

> NM-Local dir cleanup failing when Resourcemanager switches
> --
>
> Key: YARN-2259
> URL: https://issues.apache.org/jira/browse/YARN-2259
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
> Environment: Induce RM switchover while job is in progress
> Observe that NM-Local dir cleanup failing when Resourcemanager switches.
>Reporter: Nishan Shetty, Huawei
>
> Induce RM switchover while job is in progress
> Observe that NM-Local dir cleanup failing when Resourcemanager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty, Huawei updated YARN-2259:


Environment: 



  was:
Induce RM switchover while job is in progress

Observe that NM-Local dir cleanup failing when Resourcemanager switches.



> NM-Local dir cleanup failing when Resourcemanager switches
> --
>
> Key: YARN-2259
> URL: https://issues.apache.org/jira/browse/YARN-2259
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
> Environment: 
>Reporter: Nishan Shetty, Huawei
>
> Induce RM switchover while job is in progress
> Observe that NM-Local dir cleanup failing when Resourcemanager switches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)
Nishan Shetty, Huawei created YARN-2259:
---

 Summary: NM-Local dir cleanup failing when Resourcemanager switches
 Key: YARN-2259
 URL: https://issues.apache.org/jira/browse/YARN-2259
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1
 Environment: Induce RM switchover while job is in progress

Observe that NM-Local dir cleanup failing when Resourcemanager switches.

Reporter: Nishan Shetty, Huawei






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

2014-07-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053555#comment-14053555
 ] 

Hudson commented on YARN-1367:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #606 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/606/])
YARN-1367. Changed NM to not kill containers on NM resync if RM work-preserving 
restart is enabled. Contributed by Anubhav Dhoot (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608334)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java


> After restart NM should resync with the RM without killing containers
> -
>
> Key: YARN-1367
> URL: https://issues.apache.org/jira/browse/YARN-1367
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Fix For: 2.5.0
>
> Attachments: YARN-1367.001.patch, YARN-1367.002.patch, 
> YARN-1367.003.patch, YARN-1367.prototype.patch
>
>
> After RM restart, the RM sends a resync response to NMs that heartbeat to it. 
> Upon receiving the resync response, the NM kills all containers and 
> re-registers with the RM. The NM should be changed to not kill the containers 
> and instead inform the RM about all currently running containers, including 
> their allocations etc. After the re-register, the NM should send all pending 
> container completions to the RM as usual.
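
For reference, a minimal sketch of turning on the work-preserving recovery flag 
that the commit message mentions (only the standard yarn-site property name is 
assumed; the class is just for the demo and is not part of the patch):

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class WorkPreservingRecoveryDemo {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // with this enabled, the NM keeps running containers across an RM resync
    conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled",
        true);
    System.out.println(conf.getBoolean(
        "yarn.resourcemanager.work-preserving-recovery.enabled", false));
  }
}
{code}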



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Ratandeep Ratti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ratandeep Ratti updated YARN-2252:
--

Attachment: YARN-2252-1.patch

Wei, since I already have a patch for this on my 2.3 Hadoop branch (for my 
internal org), I'm also submitting it here. Please have a look and see if it is 
fine by you.

> Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
> 
>
> Key: YARN-2252
> URL: https://issues.apache.org/jira/browse/YARN-2252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: trunk-win
>Reporter: Ratandeep Ratti
>Assignee: Wei Yan
>  Labels: hadoop2, scheduler, yarn
> Attachments: YARN-2252-1.patch
>
>
> This test case is failing sporadically on my machine. I think I have a 
> plausible explanation for this.
> It seems that when the scheduler is asked for resources, the resource 
> requests being constructed have no preference for hosts (nodes).
> The two mock hosts constructed both have 8192 MB of memory.
> The containers (resources) being requested each require 1024 MB of memory, 
> so a single node can satisfy both resource requests for the 
> application.
> At the end of the test case it is asserted that the containers 
> (resource requests) run on different nodes, but since we haven't 
> specified any node preferences when requesting the resources, the 
> scheduler (at times) runs both containers (requests) on the same node.
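
To make the locality point concrete, here is a hedged, standalone illustration 
(not the attached patch; the host name and class name are invented): a request 
against ResourceRequest.ANY carries no host preference, so either 8192 MB node 
may end up running both 1024 MB containers, while host-specific requests are 
what would pin them to particular nodes.

{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class LocalityDemo {
  public static void main(String[] args) {
    Priority pri = Priority.newInstance(1);
    Resource oneGb = Resource.newInstance(1024, 1);
    // "*": no node preference, so the scheduler may place both on one NM
    ResourceRequest anywhere =
        ResourceRequest.newInstance(pri, ResourceRequest.ANY, oneGb, 2);
    // naming a host is what actually expresses a node preference
    ResourceRequest onNode1 =
        ResourceRequest.newInstance(pri, "node1.example.com", oneGb, 1);
    System.out.println(anywhere + "\n" + onNode1);
  }
}
{code}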



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches

2014-07-07 Thread Nishan Shetty, Huawei (JIRA)
Nishan Shetty, Huawei created YARN-2258:
---

 Summary: Aggregation of MR job logs failing when Resourcemanager 
switches
 Key: YARN-2258
 URL: https://issues.apache.org/jira/browse/YARN-2258
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty, Huawei


1. Install RM in HA mode

2. Run a job with many tasks

3. Induce an RM switchover while the job is in progress

Observe that log aggregation fails for the job that is running when the 
ResourceManager switchover is induced.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053504#comment-14053504
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654294/t.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4209//console

This message is automatically generated.

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***
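
Purely as an illustration of the flow described above (every name below is 
hypothetical; nothing here exists in YARN or in the attached patches): the node 
asks the OAT server for its TRUST status, reports it with the heartbeat, and 
the scheduler skips nodes whose status is false.

{code}
public class TrustCheckSketch {
  /** Hypothetical wrapper around the OAT server's API. */
  interface OatClient {
    boolean isTrusted(String host);
  }

  /** Polled periodically on the node, much like the health-check service. */
  static boolean trustStatus(OatClient oat, String host) {
    return oat.isTrusted(host);
  }

  /** A node reporting 'false' is skipped until its status turns 'true'. */
  static boolean schedulable(boolean reportedTrustStatus) {
    return reportedTrustStatus;
  }

  public static void main(String[] args) {
    OatClient oat = new OatClient() {
      @Override
      public boolean isTrusted(String host) {
        return true; // stub; a real client would call the OAT server here
      }
    };
    System.out.println(schedulable(trustStatus(oat, "node1")));
  }
}
{code}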



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: t.patch

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: t.patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust .patch, trust.patch, trust.patch, trust003.patch, 
> trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053465#comment-14053465
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654288/t.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4208//console

This message is automatically generated.

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053457#comment-14053457
 ] 

bc Wong commented on YARN-796:
--

[~yufeldman] & [~sdaingade], I just read your proposal 
(LabelBasedScheduling.pdf). A few comments:

1. *Would let each node report its own labels.* The current proposal specifies 
the node-label mapping in a centralized file. This seems operationally 
unfriendly, as the file is hard to maintain.
* You need to get the DNS name right, which could be hard for a multi-homed 
setup.
* The proposal uses regexes on FQDN, such as {{perfnode.*}}. This may work if 
the hostnames are set up by IT like that. But in reality, I've seen lots of 
sites where the FQDN is like {{stmp09wk0013.foobar.com}}, where "stmp" refers 
to the data center, and "wk0013" refers to "worker 13", and other weird stuff 
like that. Now imagine a centralized node-label mapping file covering 2000 
nodes with such names. It'd be a nightmare.

Instead, each node can supply its own labels, via 
{{yarn.nodemanager.node.labels}} (which specifies labels directly) or 
{{yarn.nodemanager.node.labelFile}} (which points to a file that has a single 
line containing all the labels). It's easy to generate the label file for each 
node. The admin can have puppet push it out, or populate it when the VM is 
built, or compute it in a local script by inspecting /proc. (Oh I have 192GB, 
so add the label "largeMem".) There is little room for mistake.
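
To show how simple such a generator could be, here is a hedged sketch (the 
memory threshold, output path, and class name are invented, and the labelFile 
property itself is only what this comment proposes, not an existing YARN key): 
read /proc/meminfo and emit a one-line label file.

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class WriteNodeLabels {
  public static void main(String[] args) throws IOException {
    List<String> labels = new ArrayList<String>();
    // MemTotal in /proc/meminfo is reported in kB
    for (String line : Files.readAllLines(Paths.get("/proc/meminfo"),
        StandardCharsets.UTF_8)) {
      if (line.startsWith("MemTotal:")) {
        long kb = Long.parseLong(line.replaceAll("[^0-9]", ""));
        if (kb >= 128L * 1024 * 1024) {   // >= 128 GB
          labels.add("largeMem");
        }
      }
    }
    // one comma-separated line; the NM would re-read it on refresh
    StringBuilder out = new StringBuilder();
    for (String label : labels) {
      if (out.length() > 0) {
        out.append(',');
      }
      out.append(label);
    }
    Files.write(Paths.get("/etc/hadoop/node.labels"),
        out.toString().getBytes(StandardCharsets.UTF_8));
  }
}
{code}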

The NM can still periodically refresh its own labels and update the RM via 
the heartbeat mechanism. The RM should also expose a "node label report", which 
gives the real-time information about all nodes and their labels.

2. *Labels are per-container, not per-app. Right?* The doc keeps mentioning 
"application label", "ApplicationLabelExpression", etc. Should those be 
"container label" instead? I just want to confirm that each container request 
can carry its own label expression. Example use case: Only the mappers need 
GPU, not the reducers.

3. *Can we fail container requests with no satisfying nodes?* In 
"Considerations, #5", you wrote that the app would be in a waiting state. It 
seems that fail-fast behaviour would be better. If no node can satisfy the label 
expression, then it's better to tell the client "no". Very likely somebody made 
a typo somewhere.



> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, YARN-796.patch
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust.patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: t.patch

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: t.patch, trust .patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2257) Add user to queue mapping in Fair-Scheduler

2014-07-07 Thread Patrick Liu (JIRA)
Patrick Liu created YARN-2257:
-

 Summary: Add user to queue mapping in Fair-Scheduler
 Key: YARN-2257
 URL: https://issues.apache.org/jira/browse/YARN-2257
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Patrick Liu


Currently, the Fair Scheduler supports two modes: a default queue, or an 
individual queue for each user.
Apparently, the default queue is not a good option, because the resources 
cannot be managed per user or group.
However, an individual queue for each user is not good enough either, 
especially when connecting YARN with Hive: the number of Hive users in a 
corporate environment keeps growing, and if we create a queue for every user, 
resource management becomes hard to maintain.

I think the problem can be solved like this (sketched after the list):
1. Define a user->queue mapping in fair-scheduler.xml. Inside each queue, use 
aclSubmitApps to control the user's ability to submit.
2. Each time a user submits an app to YARN, if the user is mapped to a queue, 
the app will be scheduled to that queue; otherwise, the app will be submitted 
to the default queue.
3. If the user cannot pass the aclSubmitApps limits, the app will not be 
accepted.
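
A hypothetical fair-scheduler.xml fragment sketching what the mapping could 
look like (the userQueueMapping element is invented for this proposal and does 
not exist today; queue and aclSubmitApps are existing Fair Scheduler settings):

{code}
<?xml version="1.0"?>
<allocations>
  <queue name="analytics">
    <!-- existing setting: who may submit to this queue -->
    <aclSubmitApps>alice,bob analysts</aclSubmitApps>
  </queue>

  <!-- proposed (does not exist yet): route these users' apps to "analytics";
       anyone not listed falls through to the default queue -->
  <userQueueMapping>
    <user name="alice" queue="analytics"/>
    <user name="bob" queue="analytics"/>
  </userQueueMapping>
</allocations>
{code}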





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053419#comment-14053419
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12654280/trust.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4207//console

This message is automatically generated.

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling

2014-07-07 Thread Ratandeep Ratti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053413#comment-14053413
 ] 

Ratandeep Ratti commented on YARN-2252:
---

Wei, calling fairscheduler.stop() will not stop the threads. It seems that the 
threads "continuousScheduling thread" and "update thread" are not handling 
interrupts properly. Though we are calling [schedulingThread | 
updateThread].interrupt(), we also need to keep checking the interrupt flag in 
the while loop of these threads (see the sketch below).
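
A minimal, self-contained sketch of that pattern (plain JDK; the sleep interval 
and class name are made up, and none of this is the FairScheduler code): the 
loop re-checks the interrupt flag and restores it after an interrupted sleep, 
so that interrupt() actually ends the thread.

{code}
public class InterruptibleLoopDemo {
  public static void main(String[] args) throws Exception {
    Thread updateThread = new Thread(new Runnable() {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            // one continuous-scheduling / update pass would go here
            Thread.sleep(500);
          } catch (InterruptedException e) {
            // sleep() clears the interrupt flag; restore it so the
            // while condition sees it and the loop exits
            Thread.currentThread().interrupt();
          }
        }
      }
    });
    updateThread.start();
    Thread.sleep(2000);
    updateThread.interrupt();  // without the checks above this would be swallowed
    updateThread.join();
  }
}
{code}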

> Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
> 
>
> Key: YARN-2252
> URL: https://issues.apache.org/jira/browse/YARN-2252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: trunk-win
>Reporter: Ratandeep Ratti
>Assignee: Wei Yan
>  Labels: hadoop2, scheduler, yarn
>
> This test case is failing sporadically on my machine. I think I have a 
> plausible explanation for this.
> It seems that when the scheduler is asked for resources, the resource 
> requests being constructed have no preference for hosts (nodes).
> The two mock hosts constructed both have 8192 MB of memory.
> The containers (resources) being requested each require 1024 MB of memory, 
> so a single node can satisfy both resource requests for the 
> application.
> At the end of the test case it is asserted that the containers 
> (resource requests) run on different nodes, but since we haven't 
> specified any node preferences when requesting the resources, the 
> scheduler (at times) runs both containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust.patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust002.patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: trust.patch

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
> trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-07 Thread anders (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anders updated YARN-2142:
-

Attachment: (was: trust001.patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust .patch, trust.patch, trust.patch, trust.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status from the OAT server's API), 
> so I add this feature into Hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, 
> through the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if a node's TRUST status is 'false', the node will be 
> abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)