[jira] [Commented] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049665#comment-14049665
 ] 

Hadoop QA commented on YARN-2241:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653540/YARN-2241.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4174//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4174//console

This message is automatically generated.

> ZKRMStateStore: On startup, show nicer messages when znodes already exist
> -
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch, YARN-2241.patch
>
>
> When using the ZKRMStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected, as these znodes already exist 
> from before.  We should catch these and print nicer messages.
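
As a rough illustration of the fix being asked for, the sketch below catches ZooKeeper's NodeExistsException and logs a short informational message instead of letting the stack trace bubble up. The class, method and logger names are illustrative only, not the actual ZKRMStateStore code.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeCreateExample {
  private static final Log LOG = LogFactory.getLog(ZnodeCreateExample.class);

  // Create a znode if it is missing; on RM restart the node is usually
  // already there, which is expected and not an error.
  static void createIfMissing(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, null, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Expected when the store was populated by a previous RM run.
      LOG.info(path + " znode already exists, skipping creation");
    }
  }
}
{code}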



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist

2014-07-01 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-2241:


Attachment: YARN-2241.patch

You're right, it doesn't fail without the fix; I must have checked it against 
something slightly different from the old code when I tried it.  In that case I 
don't think we need the test; it's a pretty simple fix and I was able to verify 
that it worked correctly.

I've uploaded a new patch that doesn't have the test.

> ZKRMStateStore: On startup, show nicer messages when znodes already exist
> -
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch, YARN-2241.patch
>
>
> When using the ZKRMStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected, as these znodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049646#comment-14049646
 ] 

Hadoop QA commented on YARN-1366:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653538/YARN-1366.11.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4173//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4173//console

This message is automatically generated.

> AM should implement Resync with the ApplicationMasterService instead of 
> shutting down
> -
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.10.patch, 
> YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, 
> YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, 
> YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, 
> YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0, and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM, then 
> things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.
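
For illustration only, here is a rough sketch of the AM-side flow the description asks for: on a resync signal, reset the allocate sequence number to 0, re-register, and resend the entire outstanding request. The identifiers (resyncRequested, resendOutstandingRequests, lastResponseId) are placeholders, not the actual AMRMClient API.
{code}
// Hypothetical AM-side resync handling; only the flow mirrors the description above.
class ResyncSketch {
  private int lastResponseId = 0;

  void onAllocateResponse(Object response) {
    if (resyncRequested(response)) {
      // Resync: reset the allocate RPC sequence number to 0 ...
      lastResponseId = 0;
      // ... re-register with the RM ...
      registerApplicationMaster();
      // ... and resend the entire outstanding request.
      resendOutstandingRequests();
    }
  }

  private boolean resyncRequested(Object response) { return false; } // placeholder
  private void registerApplicationMaster() {}                        // placeholder
  private void resendOutstandingRequests() {}                        // placeholder
}
{code}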



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2245) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used

2014-07-01 Thread Sangjin Lee (JIRA)
Sangjin Lee created YARN-2245:
-

 Summary: AM throws ClassNotFoundException with job classloader 
enabled if custom output format/committer is used
 Key: YARN-2245
 URL: https://issues.apache.org/jira/browse/YARN-2245
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee


With the job classloader enabled, the MR AM throws ClassNotFoundException if a 
custom output format class is specified.

{noformat}
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
com.foo.test.TestOutputFormat not found
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:473)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:374)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1459)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1456)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1389)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
com.foo.test.TestOutputFormat not found
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895)
at 
org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:222)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:469)
... 8 more
Caused by: java.lang.ClassNotFoundException: Class 
com.foo.test.TestOutputFormat not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893)
... 10 more
{noformat}
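
A small sketch of why the lookup can fail: {{Configuration.getClass}} resolves class names through the Configuration's classloader, so if the job classloader is not set on the Configuration, a user-supplied class like com.foo.test.TestOutputFormat (the name from the stack trace above) is not visible to it. Whether this is the actual fix made for this JIRA is not implied; the snippet only illustrates the mechanism.
{code}
import org.apache.hadoop.conf.Configuration;

public class JobClassLoaderExample {
  // If jobClassLoader can see the user's job jar, pointing the Configuration
  // at it lets getClassByName()/getClass() resolve custom classes.
  static Class<?> resolveOutputFormat(Configuration conf, ClassLoader jobClassLoader)
      throws ClassNotFoundException {
    conf.setClassLoader(jobClassLoader);
    return conf.getClassByName("com.foo.test.TestOutputFormat");
  }
}
{code}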



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down

2014-07-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-1366:
-

Attachment: YARN-1366.11.patch

Updated the patch to fix the findbugs warning.

> AM should implement Resync with the ApplicationMasterService instead of 
> shutting down
> -
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.10.patch, 
> YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, 
> YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, 
> YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, 
> YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0, and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM, then 
> things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049621#comment-14049621
 ] 

Hadoop QA commented on YARN-2229:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653536/YARN-2229.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4172//console

This message is automatically generated.

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, 
> YARN-2229.3.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch, the lower 22 bits are for the sequence number of ids. This is for preserving 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after 
> the RM restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility in this 
> JIRA.
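
As a concrete illustration of the layout described above (upper 10 bits for the epoch, lower 22 bits for the sequence number) and of why the epoch overflows after 1024 RM restarts, here is a small arithmetic sketch. It is not the actual ContainerId code.
{code}
public class ContainerIdLayoutExample {
  // Pack a 10-bit epoch and a 22-bit sequence number into one int,
  // following the YARN-2052 layout described above.
  static int packId(int epoch, int sequence) {
    return (epoch << 22) | (sequence & 0x3FFFFF);
  }

  public static void main(String[] args) {
    System.out.println(packId(1, 5));   // epoch 1, container 5
    System.out.println(1 << 10);        // only 1024 distinct epoch values fit in 10 bits
  }
}
{code}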



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-01 Thread Lijuan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijuan Zhang updated YARN-2142:
---

Labels: features patch  (was: patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status through the API of the OAT 
> server), so I added this feature to Hadoop's scheduling.
> Through the TRUST check service, a node can get its own TRUST status and 
> then, via the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if the node's TRUST status is 'false', it will be 
> skipped until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***
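
To make the proposal more concrete, here is a hedged sketch of the shape such a service could take: the node periodically asks the OAT server for its TRUST status (similar to the health check service) and caches the result so it can be reported with the next heartbeat. All names below (TrustChecker, OatClient) are hypothetical; the actual branch-2.2.0 patch is not reproduced here.
{code}
// Illustrative only; not code from the attached patches.
public class TrustChecker {

  /** Hypothetical client for the OAT server's status API. */
  interface OatClient {
    boolean isTrusted(String nodeHostname) throws Exception;
  }

  private final OatClient oatClient;
  private volatile boolean trusted = false;

  public TrustChecker(OatClient oatClient) {
    this.oatClient = oatClient;
  }

  /** Called periodically, like the node health check script. */
  public void refresh(String nodeHostname) {
    try {
      trusted = oatClient.isTrusted(nodeHostname);
    } catch (Exception e) {
      trusted = false; // treat an unreachable OAT server as untrusted
    }
  }

  /** Value the NM would send to the RM with its heartbeat. */
  public boolean isTrusted() {
    return trusted;
  }
}
{code}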



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-01 Thread Lijuan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijuan Zhang updated YARN-2142:
---

Labels: features  (was: features patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: features
> Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status through the API of the OAT 
> server), so I added this feature to Hadoop's scheduling.
> Through the TRUST check service, a node can get its own TRUST status and 
> then, via the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if the node's TRUST status is 'false', it will be 
> skipped until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-01 Thread Lijuan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijuan Zhang updated YARN-2142:
---

Affects Version/s: (was: 2.2.0)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: patch
> Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST 
> status in the cluster (we can get the TRUST status through the API of the OAT 
> server), so I added this feature to Hadoop's scheduling.
> Through the TRUST check service, a node can get its own TRUST status and 
> then, via the heartbeat, send the TRUST status to the resource manager for 
> scheduling.
> In the scheduling step, if the node's TRUST status is 'false', it will be 
> skipped until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (YARN-2244) FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596

2014-07-01 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reopened YARN-2244:
-


You resolved the bug as a duplicate of itself. Reopening it.

> FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596 
> -
>
> Key: YARN-2244
> URL: https://issues.apache.org/jira/browse/YARN-2244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
>
> FairScheduler is missing changes made in patch MAPREDUCE-3596. Important 
> fixes in that patch include handling unknown containers. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-

Attachment: YARN-2229.3.patch

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, 
> YARN-2229.3.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch, the lower 22 bits are for the sequence number of ids. This is for preserving 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after 
> the RM restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility in this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049610#comment-14049610
 ] 

Hadoop QA commented on YARN-1366:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653530/YARN-1366.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4171//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4171//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4171//console

This message is automatically generated.

> AM should implement Resync with the ApplicationMasterService instead of 
> shutting down
> -
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.10.patch, 
> YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, 
> YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, 
> YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0, and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM, then 
> things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049605#comment-14049605
 ] 

Li Lu commented on YARN-2242:
-

Hi [~zjshen], you're right that the two jiras have a lot of shared focus. If I 
understand correctly, the available patch for YARN-2013 focused on launch time, 
while the patch here focuses on helping users make use of the logs generated by the 
log aggregator. As a whole package, I think these two patches can alleviate the 
problem of launch-time crashes. 

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and 
> the web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 
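
A minimal sketch of the idea in the description: when reporting the launch failure, include a pointer to the aggregated logs along with the exit code, instead of only the raw exit-code exception. The method and its parameters are illustrative, not the attached patch.
{code}
public class DiagnosticsExample {
  // Build a diagnostics string that tells the user where to look,
  // rather than only echoing the shell exit-code exception.
  static String buildDiagnostics(String containerId, int exitCode, String aggregatedLogsUrl) {
    return "Container " + containerId + " exited with exit code " + exitCode + ". "
        + "The container logs have been aggregated; see " + aggregatedLogsUrl
        + " for the AM launch output.";
  }
}
{code}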



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049598#comment-14049598
 ] 

Li Lu commented on YARN-2242:
-

Hi [~djp], sure, I can definitely do that. One small question: since this 
patch is fairly trivial, could you please give some suggestions on how to build or 
modify a unit test for it? I'm hoping this part is already tested somewhere in 
the existing UTs, and some modification would suffice. Thanks! 

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and 
> the web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2194) Add Cgroup support for RedHat 7

2014-07-01 Thread Beckham007 (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049593#comment-14049593
 ] 

Beckham007 commented on YARN-2194:
--

+1.
A new LCEResourceHandler is needed. To support more resource isolation, we also 
need init(), preExecute() and postExecute() for the different resources. 
Adding an abstract CgroupsResourceManager and its implementations 
CPUResourceManager/MemResourceManager would be good.
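
For illustration, a hypothetical sketch of the kind of abstraction suggested above: a common base with init/preExecute/postExecute hooks and per-resource implementations. None of these class names exist in YARN; they only sketch the comment's proposal.
{code}
// Hypothetical shape of the proposed abstraction; not existing YARN classes.
abstract class CgroupsResourceManager {
  /** Set up the cgroup hierarchy (or delegate to systemd on RedHat 7). */
  abstract void init() throws Exception;

  /** Place the container into this resource's cgroup before it starts. */
  abstract void preExecute(String containerId) throws Exception;

  /** Clean up the container's cgroup after it finishes. */
  abstract void postExecute(String containerId);
}

class CpuResourceManager extends CgroupsResourceManager {
  void init() { /* create or attach the cpu cgroup */ }
  void preExecute(String containerId) { /* write the container's cpu shares/quota */ }
  void postExecute(String containerId) { /* remove the container's cpu cgroup */ }
}

class MemResourceManager extends CgroupsResourceManager {
  void init() { /* create or attach the memory cgroup */ }
  void preExecute(String containerId) { /* set the container's memory limits */ }
  void postExecute(String containerId) { /* remove the container's memory cgroup */ }
}
{code}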

> Add Cgroup support for RedHat 7
> ---
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Wei Yan
>
> In previous versions of RedHat, we could build custom cgroup hierarchies with 
> the cgconfig command from the libcgroup package. Starting with RedHat 7, the 
> libcgroup package is deprecated and is not recommended, since it can easily 
> create conflicts with the default cgroup hierarchy. Systemd is provided and 
> recommended for cgroup management. We need to add support for 
> this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down

2014-07-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-1366:
-

Attachment: YARN-1366.10.patch

I updated the patch, addressing the comments. Please review.

> AM should implement Resync with the ApplicationMasterService instead of 
> shutting down
> -
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.10.patch, 
> YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, 
> YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, 
> YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0, and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM, then 
> things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049589#comment-14049589
 ] 

Zhijie Shen commented on YARN-2242:
---

Hi [~gtCarrera9], the useless ExitCodeException stack is not limited to the AM 
container; whenever a container crashes, we always get this message. 
Previously, I filed a similar ticket: YARN-2013. [~ozawa] was working on it, but 
I didn't have a chance to look into it. Maybe you want to consolidate the two 
jiras.

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and 
> the web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049587#comment-14049587
 ] 

Junping Du commented on YARN-2242:
--

Hi [~gtCarrera9], thanks for contributing a patch here! Would you mind adding a 
unit test to verify your exception messages? I will review your patch.

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and 
> the web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049586#comment-14049586
 ] 

Vinod Kumar Vavilapalli commented on YARN-2175:
---

That is a reasonable proposal, but I'd like to see if there are any other bugs 
that are causing this to happen. Have we seen this in practice? If so, what is 
the underlying reason? Too big a resource? The source file-system is down? Or 
does the NM have a bug? We should try to address the right individual problem with 
its solution before we put in a band-aid, which may still be useful for issues 
that we cannot address directly, if there are any.

Contrast this with mapreduce.task.timeout. Arguably the config helped users 
time out their jobs, but from my experience it prevented us from focusing on 
fixing point bugs that were hidden in the framework for a long time - it kind 
of hides the issues. It is still useful for those unmanageable and unsolvable 
bugs, but I'd rather first fix the point problems and then put in the band-aid. 
Thoughts?

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> ---
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long time, 
> and there is no automated way to kill a task if it's stuck in these states. 
> These may have nothing to do with the task itself and could be an issue 
> within the platform.
> Ideally there should be configurable limits for the various states within the 
> NodeManager. The RM does not care about most of these, 
> and it's only between the AM and the NM. We can start by making these global 
> configurable defaults, and in the future we can make it fancier by letting the AM 
> override them in the start-container request. 
> This jira will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2244) FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2244.
---

Resolution: Duplicate

I see that YARN-2244 is already filed. Closing as dup. Please reopen if you 
disagree.

> FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596 
> -
>
> Key: YARN-2244
> URL: https://issues.apache.org/jira/browse/YARN-2244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
>
> FairScheduler is missing changes made in patch MAPREDUCE-3596. Important 
> fixes in that patch include handling unknown containers. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049577#comment-14049577
 ] 

Hadoop QA commented on YARN-2242:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12653521/YARN-2242-070114-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4170//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4170//console

This message is automatically generated.

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and 
> the web UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down

2014-07-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049572#comment-14049572
 ] 

Jian He commented on YARN-1366:
---

I meant, can we do this?
{code}
synchronized (this) {
  // reset lastResponseId to 0
  lastResponseId = 0;
  release.addAll(this.pendingRelease);
  blacklistAdditions.addAll(this.blacklistedNodes);
  for (Map<String, TreeMap<Resource, ResourceRequestInfo>> rr :
      remoteRequestsTable.values()) {
    for (Map<Resource, ResourceRequestInfo> capabalities : rr.values()) {
      for (ResourceRequestInfo request : capabalities.values()) {
        addResourceRequestToAsk(request.remoteRequest);
      }
    }
  }
}
// re-register with RM
registerApplicationMaster();
{code}
and "lastResponseId = 0;" may be put in the registerApplicationMaster call also?

> AM should implement Resync with the ApplicationMasterService instead of 
> shutting down
> -
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, 
> YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, 
> YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, 
> YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0, and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM, then 
> things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049566#comment-14049566
 ] 

Vinod Kumar Vavilapalli commented on YARN-2074:
---

bq. Talked with Vinod offline, the big problem with this is even if we don't 
count AM preemption towards AM failures on RM side, MR AM itself checks the 
attempt id against the max-attempt count for recovery. Work around is to reset 
the MAX-ATTEMPT env each time launching the AM which sounds a bit hacky though.
Filed MAPREDUCE-5956 for this.

> Preemption of AM containers shouldn't count towards AM failures
> ---
>
> Key: YARN-2074
> URL: https://issues.apache.org/jira/browse/YARN-2074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jian He
> Fix For: 2.5.0
>
> Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, 
> YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, 
> YARN-2074.7.patch, YARN-2074.7.patch, YARN-2074.8.patch
>
>
> One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
> containers getting preempted shouldn't count towards AM failures and thus 
> shouldn't eventually fail applications.
> We should explicitly handle AM container preemption/kill as a separate issue 
> and not count it towards the limit on AM failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down

2014-07-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049563#comment-14049563
 ] 

Rohith commented on YARN-1366:
--

bq. These two synchronized block can be merged into one ? 
I separated these intentionally to handle a very corner-case scenario: after the AM 
gets a resync, it goes to re-register with the RM. In the worst case, if the RM goes 
down again during that period, registerApplicationMaster starts retrying both RMs. 
The thought was not to block AMRMClient operations such as updateBlacklist, 
addContainerRequest and so on during that time. Do you think the time taken to 
retry is small enough that they can be blocked?


> AM should implement Resync with the ApplicationMasterService instead of 
> shutting down
> -
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, 
> YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, 
> YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, 
> YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0, and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM, then 
> things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2131) Add a way to nuke the RMStateStore

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049561#comment-14049561
 ] 

Hadoop QA commented on YARN-2131:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653517/YARN-2131.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4168//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4168//console

This message is automatically generated.

> Add a way to nuke the RMStateStore
> --
>
> Key: YARN-2131
> URL: https://issues.apache.org/jira/browse/YARN-2131
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
> Attachments: YARN-2131.patch
>
>
> There are cases when we don't want to recover past applications, but do want to 
> recover applications going forward. To do this, one has to clear the store. Today, 
> there is no easy way to do this, and users have to understand how each store 
> works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2022:
--

Fix Version/s: 2.5.0

> Preempting an Application Master container can be kept as least priority when 
> multiple applications are marked for preemption by 
> ProportionalCapacityPreemptionPolicy
> -
>
> Key: YARN-2022
> URL: https://issues.apache.org/jira/browse/YARN-2022
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sunil G
>Assignee: Sunil G
> Fix For: 2.5.0
>
> Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, 
> YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, 
> YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, 
> Yarn-2022.1.patch
>
>
> Cluster Size = 16GB [2NM's]
> Queue A Capacity = 50%
> Queue B Capacity = 50%
> Consider there are 3 applications running in Queue A which has taken the full 
> cluster capacity. 
> J1 = 2GB AM + 1GB * 4 Maps
> J2 = 2GB AM + 1GB * 4 Maps
> J3 = 2GB AM + 1GB * 2 Maps
> Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
> Currently in this scenario, job J3 will get killed, including its AM.
> It is better if the AM can be given the least priority among multiple applications. 
> In this same scenario, map tasks from J3 and J2 can be preempted.
> Later, when the cluster is free, maps can be allocated to these jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2204) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049553#comment-14049553
 ] 

Hudson commented on YARN-2204:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5806 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5806/])
YARN-2204. Explicitly enable vmem check in 
TestContainersMonitor#testContainerKillOnMemoryOverflow. (Anubhav Dhoot via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607231)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java


> TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
> ---
>
> Key: YARN-2204
> URL: https://issues.apache.org/jira/browse/YARN-2204
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Trivial
> Fix For: 2.5.0
>
> Attachments: YARN-2204.patch, YARN-2204_addendum.patch, 
> YARN-2204_addendum.patch
>
>
> TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049552#comment-14049552
 ] 

Hudson commented on YARN-2022:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5806 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5806/])
YARN-2022 Preempting an Application Master container can be kept as least 
priority when multiple applications are marked for preemption by 
ProportionalCapacityPreemptionPolicy (Sunil G via mayank) (mayank: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607227)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java


> Preempting an Application Master container can be kept as least priority when 
> multiple applications are marked for preemption by 
> ProportionalCapacityPreemptionPolicy
> -
>
> Key: YARN-2022
> URL: https://issues.apache.org/jira/browse/YARN-2022
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, 
> YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, 
> YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, 
> Yarn-2022.1.patch
>
>
> Cluster Size = 16GB [2NM's]
> Queue A Capacity = 50%
> Queue B Capacity = 50%
> Consider there are 3 applications running in Queue A which has taken the full 
> cluster capacity. 
> J1 = 2GB AM + 1GB * 4 Maps
> J2 = 2GB AM + 1GB * 4 Maps
> J3 = 2GB AM + 1GB * 2 Maps
> Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
> Currently in this scenario, job J3 will get killed, including its AM.
> It is better if the AM can be given the least priority among multiple applications. 
> In this same scenario, map tasks from J3 and J2 can be preempted.
> Later, when the cluster is free, maps can be allocated to these jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049548#comment-14049548
 ] 

Hadoop QA commented on YARN-611:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653515/YARN-611.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4166//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4166//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4166//console

This message is automatically generated.

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM 
> is "well behaved", and it's safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail, if not, then the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?
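
For clarity, a small sketch of the windowed check described above: only failures that happened within the last retry-count-window-ms count toward max-retries, so a long-running AM that fails only occasionally is effectively reset to zero. The class and field names are illustrative, not the RMAppImpl code.
{code}
import java.util.ArrayDeque;
import java.util.Deque;

public class RetryWindowExample {
  private final int maxRetries;
  private final long retryCountWindowMs;
  private final Deque<Long> failureTimes = new ArrayDeque<Long>();

  public RetryWindowExample(int maxRetries, long retryCountWindowMs) {
    this.maxRetries = maxRetries;
    this.retryCountWindowMs = retryCountWindowMs;
  }

  /** Record an AM failure; return true if the app should now be failed. */
  public boolean onAmFailure(long now) {
    failureTimes.addLast(now);
    // Failures older than the window no longer count, which is the
    // "reset the failure count back to zero" behavior for well-behaved AMs.
    while (!failureTimes.isEmpty()
        && now - failureTimes.peekFirst() > retryCountWindowMs) {
      failureTimes.removeFirst();
    }
    return failureTimes.size() > maxRetries;
  }
}
{code}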



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2242:


Attachment: YARN-2242-070114-1.patch

Second version; minimized the change set. A test is not included since this 
patch only changes output on the web UI, and it can be verified on any AM 
launch crash. 

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049546#comment-14049546
 ] 

Hadoop QA commented on YARN-2242:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12653509/YARN-2242-070114.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4167//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4167//console

This message is automatically generated.

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist

2014-07-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049541#comment-14049541
 ] 

Karthik Kambatla commented on YARN-2241:


I am okay with leaving the test in there to guard against regressions that 
throw these exceptions again in the future. 

> ZKRMStateStore: On startup, show nicer messages when znodes already exist
> -
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch
>
>
> When using the RMZKStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected as these nodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist

2014-07-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2241:
---

Summary: ZKRMStateStore: On startup, show nicer messages when znodes 
already exist  (was: Show nicer messages when ZNodes already exist in 
ZKRMStateStore on startup)

> ZKRMStateStore: On startup, show nicer messages when znodes already exist
> -
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch
>
>
> When using the RMZKStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected as these nodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup

2014-07-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049539#comment-14049539
 ] 

Karthik Kambatla commented on YARN-2241:


I'll have to retract that +1. The fix is good, but the test doesn't do much. 

Actually, the test doesn't fail without the fix. Is that intentional? Without 
this patch, these exceptions are merely logged and not thrown. I am okay with a 
patch without the test since we are just changing the logging.

> Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
> --
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch
>
>
> When using the RMZKStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected as these nodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2224) Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow

2014-07-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2224:
---

Summary: Explicitly enable vmem check in 
TestContainersMonitor#testContainerKillOnMemoryOverflow  (was: Let 
TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of 
the default settings)

> Explicitly enable vmem check in 
> TestContainersMonitor#testContainerKillOnMemoryOverflow
> ---
>
> Key: YARN-2224
> URL: https://issues.apache.org/jira/browse/YARN-2224
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: nodemanager
>Affects Versions: 2.4.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Trivial
>  Labels: newbie
> Attachments: YARN-2224.patch
>
>
> If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false, the test 
> will fail. Make the test not rely on the default settings; instead, let it 
> verify that once the setting is turned on, the memory check actually happens. 
> See YARN-2225, which suggests turning the default off.
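A minimal sketch of the suggested change, assuming the standard
YarnConfiguration key for the vmem check (the surrounding test setup is
omitted):

{code}
// Hypothetical test-setup sketch: enable the vmem check explicitly so the
// test no longer depends on DEFAULT_NM_VMEM_CHECK_ENABLED.
YarnConfiguration conf = new YarnConfiguration();
conf.setBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED, true);
// ... start the ContainersMonitor with this conf and assert that the
// container is killed on memory overflow, as the existing test already does.
{code}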



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049534#comment-14049534
 ] 

Hadoop QA commented on YARN-2142:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653518/trust.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4169//console

This message is automatically generated.

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
>Affects Versions: 2.2.0
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: patch
> Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of critical computing environments, we must check every node's TRUST 
> status in the cluster (the TRUST status can be obtained through the OAT 
> server's API), so I added this feature to Hadoop's scheduler.
> Through the TRUST check service, a node can obtain its own TRUST status and 
> then send it to the ResourceManager via the heartbeat for scheduling.
> In the scheduling step, if a node's TRUST status is 'false', it will be 
> skipped until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup

2014-07-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049532#comment-14049532
 ] 

Karthik Kambatla commented on YARN-2241:


Looks good. +1

> Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
> --
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch
>
>
> When using the RMZKStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected as these nodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-07-01 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049529#comment-14049529
 ] 

Mayank Bansal commented on YARN-2022:
-

+1, committing.

Thanks [~sunilg] for the patch.

Thanks [~vinodkv] and [~wangda] for the reviews.

Thanks,
Mayank

> Preempting an Application Master container can be kept as least priority when 
> multiple applications are marked for preemption by 
> ProportionalCapacityPreemptionPolicy
> -
>
> Key: YARN-2022
> URL: https://issues.apache.org/jira/browse/YARN-2022
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, 
> YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, 
> YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, 
> Yarn-2022.1.patch
>
>
> Cluster Size = 16GB [2NM's]
> Queue A Capacity = 50%
> Queue B Capacity = 50%
> Consider there are 3 applications running in Queue A which has taken the full 
> cluster capacity. 
> J1 = 2GB AM + 1GB * 4 Maps
> J2 = 2GB AM + 1GB * 4 Maps
> J3 = 2GB AM + 1GB * 2 Maps
> Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
> Currently in this scenario, job J3 will get killed, including its AM.
> It would be better if the AM could be given the least priority among multiple 
> applications. In this same scenario, map tasks from J3 and J2 could be 
> preempted instead.
> Later, when the cluster is free, maps can be allocated to these jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-01 Thread Lijuan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijuan Zhang updated YARN-2142:
---

Attachment: trust.patch

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
>Affects Versions: 2.2.0
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: patch
> Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of critical computing environments, we must check every node's TRUST 
> status in the cluster (the TRUST status can be obtained through the OAT 
> server's API), so I added this feature to Hadoop's scheduler.
> Through the TRUST check service, a node can obtain its own TRUST status and 
> then send it to the ResourceManager via the heartbeat for scheduling.
> In the scheduling step, if a node's TRUST status is 'false', it will be 
> skipped until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2131) Add a way to nuke the RMStateStore

2014-07-01 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-2131:


Attachment: YARN-2131.patch

The patch adds a {{deleteStore()}} method to RMStateStore and implementations 
for the ZKRMStateStore and the FileSystemRMStateStore; this gets called when 
you run {{yarn resourcemanager -format}}.

I also added a unit test and verified that it works in a cluster with the 
ZKRMStateStore and also the FileSystemRMStateStore with both the local FS and 
HDFS.
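A rough sketch of the shape such a method could take; this is illustrative only
(simplified class bodies, not the actual patch), with the FileSystem-backed
variant shown and the ZK variant analogously removing its root znode:

{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch only; the real RMStateStore and FileSystemRMStateStore
// have many more members. Invoked via "yarn resourcemanager -format".
abstract class StateStoreSketch {
  /** Remove all persisted RM state so the store starts clean. */
  protected abstract void deleteStore() throws Exception;
}

class FileSystemStoreSketch extends StateStoreSketch {
  private FileSystem fs;        // initialized during store start-up
  private Path rootDirPath;     // root directory holding the RM state

  @Override
  protected void deleteStore() throws Exception {
    fs.delete(rootDirPath, true);   // recursive delete of the store root
  }
}
{code}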

> Add a way to nuke the RMStateStore
> --
>
> Key: YARN-2131
> URL: https://issues.apache.org/jira/browse/YARN-2131
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
> Attachments: YARN-2131.patch
>
>
> There are cases when we don't want to recover past applications, but recover 
> applications going forward. To do this, one has to clear the store. Today, 
> there is no easy way to do this and users should understand how each store 
> works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-01 Thread Lijuan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijuan Zhang updated YARN-2142:
---

Attachment: (was: test.patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
>Affects Versions: 2.2.0
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: patch
> Attachments: trust.patch, trust.patch, trust001.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of critical computing environments, we must check every node's TRUST 
> status in the cluster (the TRUST status can be obtained through the OAT 
> server's API), so I added this feature to Hadoop's scheduler.
> Through the TRUST check service, a node can obtain its own TRUST status and 
> then send it to the ResourceManager via the heartbeat for scheduling.
> In the scheduling step, if a node's TRUST status is 'false', it will be 
> skipped until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status

2014-07-01 Thread Lijuan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijuan Zhang updated YARN-2142:
---

Attachment: (was: trust.patch)

> Add one service to check the nodes' TRUST status 
> -
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager, scheduler, webapp
>Affects Versions: 2.2.0
> Environment: OS:Ubuntu 13.04; 
> JAVA:OpenJDK 7u51-2.4.4-0
> Only in branch-2.2.0.
>Reporter: anders
>Priority: Minor
>  Labels: patch
> Attachments: trust.patch, trust.patch, trust001.patch, 
> trust002.patch, trust003.patch, trust2.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Because of critical computing environments, we must check every node's TRUST 
> status in the cluster (the TRUST status can be obtained through the OAT 
> server's API), so I added this feature to Hadoop's scheduler.
> Through the TRUST check service, a node can obtain its own TRUST status and 
> then send it to the ResourceManager via the heartbeat for scheduling.
> In the scheduling step, if a node's TRUST status is 'false', it will be 
> skipped until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node health check service.
> ***Only in branch-2.2.0, not in trunk***



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-01 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-611:
---

Attachment: YARN-611.1.patch

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM 
> is "well behaved", and it's safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail, if not, then the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049512#comment-14049512
 ] 

Xuan Gong commented on YARN-611:


Here is my proposal:
We can make this resetCountPolicy pluggable (if users have other requirements, 
we can implement more policies for them). For now, we will provide 
WindowsSlideAMRetryCountResetPolicy. To use this policy, users initialize it 
with a parameter that defines the period of time, in milliseconds, after which 
the AM retry count is reset, and put the policy into the 
ApplicationSubmissionContext. The RMApp and RMAppAttempt can then pick up the 
policy and use it. We also need to change how we decide whether an app attempt 
is the last retry. We can use:
{code}
maxAppAttempts == (getNumFailedAppAttempts() + 1 - this.attemptResetCount)
{code}
to do the calculation.
Note: getNumFailedAppAttempts() counts how many previous attempts really 
failed (excluding preemption, NM resync, hardware errors, and RM 
restart/failover).

this.attemptResetCount tracks the number of failures that should be reset. 
Every resetCountPolicy should provide a way to compute this number over time. 
For WindowsSlideAMRetryCountResetPolicy, after the AM has run successfully for 
a period of time, we can set this.attemptResetCount to the number of 
previously failed attempts.

We also need to provide a way to rebuild the this.attemptResetCount value when 
an RM restart/failover happens.
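A minimal sketch of how such a sliding-window policy could look; apart from the
class name taken from the proposal above, the method and field names are
illustrative:

{code}
// Hypothetical sketch of the sliding-window reset policy described above.
public class WindowsSlideAMRetryCountResetPolicy {
  private final long windowMs;   // period after which the AM retry count resets

  public WindowsSlideAMRetryCountResetPolicy(long windowMs) {
    this.windowMs = windowMs;
  }

  /** Number of previously failed attempts that should be "forgotten". */
  public int computeAttemptResetCount(long currentAttemptStartTime, long now,
      int numReallyFailedAttempts) {
    // If the current AM has run longer than the window, reset the count.
    return (now - currentAttemptStartTime >= windowMs)
        ? numReallyFailedAttempts : 0;
  }
}
{code}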

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM 
> is "well behaved", and it's safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail, if not, then the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049513#comment-14049513
 ] 

Xuan Gong commented on YARN-611:


Uploaded a patch for this proposal.

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM 
> is "well behaved", and it's safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail, if not, then the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2242:


Attachment: YARN-2242-070114.patch

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2242:


Attachment: (was: YARN-2242-070114.patch)

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2242:


Attachment: YARN-2242-070114.patch

This patch removes the confusing ShellExitCodeException output and instead 
"points to" the logs generated by each container attempt. For console users, 
the new exception information reports the URL of the application tracking 
page and reminds them to click the links on that page to see the exception 
details. 
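As a purely illustrative sketch of what such a "pointer" message could look
like (the method name and exact wording are assumptions, not the patch itself):

{code}
// Hypothetical sketch: point users at the aggregated logs instead of only
// surfacing the raw ShellExitCodeException.
static String buildDiagnostics(int exitCode, String trackingUrl) {
  return "AM Container exited with exitCode: " + exitCode + "\n"
      + "For more detailed output, check the application tracking page: "
      + trackingUrl + "\n"
      + "Then, click on links to logs of each attempt.";
}
{code}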

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-2242-070114.patch
>
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049487#comment-14049487
 ] 

Hadoop QA commented on YARN-2241:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653496/YARN-2241.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4165//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4165//console

This message is automatically generated.

> Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
> --
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch
>
>
> When using the RMZKStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected as these nodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Moved] (YARN-2244) FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596

2014-07-01 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot moved MAPREDUCE-5955 to YARN-2244:


Component/s: (was: scheduler)
 fairscheduler
Key: YARN-2244  (was: MAPREDUCE-5955)
Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596 
> -
>
> Key: YARN-2244
> URL: https://issues.apache.org/jira/browse/YARN-2244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
>
> FairScheduler is missing changes made in patch MAPREDUCE-3596. Important 
> fixes there include handling unknown containers. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049469#comment-14049469
 ] 

Hudson commented on YARN-1713:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5805 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5805/])
YARN-1713. Added get-new-app and submit-app functionality to RM web services. 
Contributed by Varun Vasudev. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607216)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ApplicationSubmissionContextInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ContainerLaunchContextInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CredentialsInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LocalResourceInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NewApplication.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ResourceInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm


> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Fix For: 2.5.0
>
> Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, 
> apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, 
> apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu reassigned YARN-2242:
---

Assignee: Li Lu

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2243) Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor

2014-07-01 Thread Ted Yu (JIRA)
Ted Yu created YARN-2243:


 Summary: Order of arguments for Preconditions.checkNotNull() is 
wrong in SchedulerApplicationAttempt ctor
 Key: YARN-2243
 URL: https://issues.apache.org/jira/browse/YARN-2243
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor


{code}
  public SchedulerApplicationAttempt(ApplicationAttemptId applicationAttemptId, 
  String user, Queue queue, ActiveUsersManager activeUsersManager,
  RMContext rmContext) {
Preconditions.checkNotNull("RMContext should not be null", rmContext);
{code}
Order of arguments is wrong for Preconditions.checkNotNull().
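For reference, Guava's Preconditions.checkNotNull takes the reference to check
first and the error message second, so the corrected call would look like:

{code}
// Corrected argument order: reference first, error message second.
Preconditions.checkNotNull(rmContext, "RMContext should not be null");
{code}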



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2242:


Description: Currently, each time the AM container crashes during launch, both 
the console and the web UI only report a ShellExitCodeException. This is not 
only unhelpful but sometimes confusing. With the help of the log aggregator, 
container logs are actually aggregated and can be very helpful for debugging. 
One possible way to improve the whole process is to send a "pointer" to the 
aggregated logs to the programmer when reporting exception information.   (was: 
Currently, each time the AM container crashes during launch, both the console 
and the web UI only report a ShellExitCodeException. This is not only 
unhelpful but sometimes confusing. With the help of the log aggregator, 
container logs are actually aggregated and can be very helpful for debugging. 
One possible way to improve the whole process is to send a "pointer" to the 
aggregated logs to the programmer. )

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2242:


Description: Currently, each time the AM container crashes during launch, both 
the console and the web UI only report a ShellExitCodeException. This is not 
only unhelpful but sometimes confusing. With the help of the log aggregator, 
container logs are actually aggregated and can be very helpful for debugging. 
One possible way to improve the whole process is to send a "pointer" to the 
aggregated logs to the programmer. 

> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>
> Currently, each time the AM container crashes during launch, both the console 
> and the web UI only report a ShellExitCodeException. This is not only 
> unhelpful but sometimes confusing. With the help of the log aggregator, 
> container logs are actually aggregated and can be very helpful for debugging. 
> One possible way to improve the whole process is to send a "pointer" to the 
> aggregated logs to the programmer. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2233) Implement web services to create, renew and cancel delegation tokens

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2233:
--

 Component/s: resourcemanager
Priority: Blocker  (was: Major)
Target Version/s: 2.5.0

Marked for 2.5 and making it a blocker, as I'd like to get this in to make the 
RM web services usable.

> Implement web services to create, renew and cancel delegation tokens
> 
>
> Key: YARN-2233
> URL: https://issues.apache.org/jira/browse/YARN-2233
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-2233.0.patch
>
>
> Implement functionality to create, renew and cancel delegation tokens.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2242) Improve exception information on AM launch crashes

2014-07-01 Thread Li Lu (JIRA)
Li Lu created YARN-2242:
---

 Summary: Improve exception information on AM launch crashes
 Key: YARN-2242
 URL: https://issues.apache.org/jira/browse/YARN-2242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049454#comment-14049454
 ] 

Vinod Kumar Vavilapalli commented on YARN-1713:
---

+1, looks good. Compiled the docs and read them; they seem fine. Checking this in.

> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, 
> apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, 
> apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup

2014-07-01 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-2241:


Attachment: YARN-2241.patch

The Exception catching was simply in the wrong place; I moved it to the right 
place and it now prints a nicer DEBUG message instead of the exceptions/stack 
traces.  I also added a unit test.
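A minimal sketch of that kind of handling, assuming a raw ZooKeeper client and
a LOG field (illustrative only, not the actual patch):

{code}
// Hypothetical sketch: treat an already-existing znode as normal during
// startup and log at DEBUG instead of letting the stack trace bubble up.
try {
  zkClient.create(path, data, zkAcl, CreateMode.PERSISTENT);
} catch (KeeperException.NodeExistsException e) {
  if (LOG.isDebugEnabled()) {
    LOG.debug(path + " znode already exists, skipping create");
  }
}
{code}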

> Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
> --
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
> Attachments: YARN-2241.patch
>
>
> When using the RMZKStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected as these nodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-07-01 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049414#comment-14049414
 ] 

bc Wong commented on YARN-941:
--

I'm fine with [~xgong]'s solution. I'd still like to see something more generic 
to make tokens (HDFS token, HBase token, etc) work with long running apps 
though. Perhaps I'll pursue the "arbitrary expiration time" approach in another 
jira.

{quote}
RPC privacy is a very expensive solution for AM-RM communication. First, it 
needs setup so AM/RM have access to key infrastructure - having this burden on 
all applications is not reasonable. This is compounded by the fact that we use 
AMRMTokens in non-secure mode too. Second, AM - RM communication is a very 
chatty protocol, it's likely the overhead is huge..
{quote}

True security is often costly. The web/consumer industry went through the same 
exercise with HTTP vs HTTPS. You can get at least 10x better performance with 
HTTP. But in the end, everybody decided that it's worth it. And passing tokens 
around without RPC privacy is just like sending passwords around on HTTP 
without SSL.

{quote}
Unfortunately with long running services (the focus of this JIRA), this attack 
and its success is not as unlikely. This is the very reason why we roll 
master-keys every so often in the first place.
{quote}

With the rolling master key, it's unlikely that an attacker could gather enough 
ciphertext to mount that attack. Besides, a longer key would require so much 
computation to break that the attack would be infeasible.

Anyway, appreciate your response, and I'll follow up in another jira.

> RM Should have a way to update the tokens it has for a running application
> --
>
> Key: YARN-941
> URL: https://issues.apache.org/jira/browse/YARN-941
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Joseph Evans
>Assignee: Xuan Gong
> Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, 
> YARN-941.preview.4.patch, YARN-941.preview.patch
>
>
> When an application is submitted to the RM it includes with it a set of 
> tokens that the RM will renew on behalf of the application, that will be 
> passed to the AM when the application is launched, and will be used when 
> launching the application to access HDFS to download files on behalf of the 
> application.
> For long lived applications/services these tokens can expire, and then the 
> tokens that the AM has will be invalid, and the tokens that the RM had will 
> also not work to launch a new AM.
> We need to provide an API that will allow the RM to replace the current 
> tokens for this application with a new set.  To avoid any real race issues, I 
> think this API should be something that the AM calls, so that the client can 
> connect to the AM with a new set of tokens it got using kerberos, then the AM 
> can inform the RM of the new set of tokens and quickly update its tokens 
> internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup

2014-07-01 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-2241:


Component/s: resourcemanager

> Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
> --
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Minor
>
> When using the RMZKStateStore, if you restart the RM, you get a bunch of 
> stack traces with messages like 
> {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /rmstore}}.  This is expected as these nodes already exist 
> from before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup

2014-07-01 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-2241:
---

 Summary: Show nicer messages when ZNodes already exist in 
ZKRMStateStore on startup
 Key: YARN-2241
 URL: https://issues.apache.org/jira/browse/YARN-2241
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Minor


When using the RMZKStateStore, if you restart the RM, you get a bunch of stack 
traces with messages like 
{{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists for /rmstore}}.  This is expected as these nodes already exist from 
before.  We should catch these and print nicer messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049381#comment-14049381
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

Sorry for the repeated compile errors. The attached patch works fine on my 
local machine. Let me try again.

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId long. We need to define 
> the new container id format while preserving backward compatibility on this 
> JIRA.
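To make the current 32-bit layout concrete, a small illustration of the packing
described above (epoch and sequence are illustrative int variables):

{code}
// Illustrative only: pack/unpack the current 32-bit id layout
// (upper 10 bits = epoch, lower 22 bits = sequence number).
int packed = (epoch << 22) | (sequence & 0x3FFFFF);
int epochBack = packed >>> 22;         // overflows once epoch reaches 1024
int sequenceBack = packed & 0x3FFFFF;
{code}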



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049373#comment-14049373
 ] 

Hadoop QA commented on YARN-2229:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653475/YARN-2229.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4164//console

This message is automatically generated.

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-

Attachment: YARN-2229.2.patch

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049350#comment-14049350
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

No, it isn't. Changing the containerId type from int to long can break backward 
compatibility because {{ConverterUtils#toContainerId(str)}} cannot parse a 
container id string that carries a 64-bit containerId.
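
A hedged illustration of that parsing concern; the real container id string 
format is simplified here, but the int-versus-long distinction is the point:

{code}
public class ContainerIdParseSketch {
  // e.g. "container_1404205031943_0001_01_000005": the last field is the container id.
  static long parseIdField(String containerIdStr) {
    String[] parts = containerIdStr.split("_");
    String idField = parts[parts.length - 1];
    // Integer.parseInt(idField) throws NumberFormatException once the value no
    // longer fits in 32 bits; Long.parseLong accepts both old and new ids.
    return Long.parseLong(idField);
  }

  public static void main(String[] args) {
    System.out.println(parseIdField("container_1404205031943_0001_01_000005"));
    System.out.println(parseIdField("container_1404205031943_0001_01_4294967299"));
  }
}
{code}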

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049326#comment-14049326
 ] 

Hadoop QA commented on YARN-1713:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12653450/apache-yarn-1713.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4163//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4163//console

This message is automatically generated.

> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, 
> apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, 
> apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049308#comment-14049308
 ] 

Jian He commented on YARN-2229:
---

Is this patch supposed to change the containerId type from int to long? 

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure

2014-07-01 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049300#comment-14049300
 ] 

Li Lu commented on YARN-675:


[~zjshen], I'd like to work on this. Would you mind if I take this over? 
Thanks! 

> In YarnClient, pull AM logs on AM container failure
> ---
>
> Key: YARN-675
> URL: https://issues.apache.org/jira/browse/YARN-675
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: 2.0.4-alpha
>Reporter: Sandy Ryza
>Assignee: Zhijie Shen
>
> Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to 
> pull its logs from the NM to the client so that they can be displayed 
> immediately to the user.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-1713:


Attachment: apache-yarn-1713.10.patch

bq. XmlRootElement for ApplicationId -> NewApplication

bq. Rename refs to AppId: {Cluster ApplicationId API}

Fixed.

bq. in the documentation. Need to fix all this documentation to not say 
ApplicationID. Similarly rename http:///ws/v1/cluster/apps/id

Fixed.

bq. I think you should create a writable APIs section in the doc, add a 
disclaimer saying this is alpha+public-unstable and then put the new APIs in 
there, so we can let it bake in for a release or two.

Fixed.

> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, 
> apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, 
> apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049223#comment-14049223
 ] 

Anubhav Dhoot commented on YARN-2175:
-

I should clarify that the AM can kill this container manually, but each AM would 
have to implement the logic to detect when localization is taking too long and 
then kill the container. Updating the description.
We can make it much simpler for administrators and AM writers by having an 
automatic way to mitigate this. The NodeManager knows each state of the 
container. Instead of having a back and forth between the AM and the NM, it 
will be easier if we just let the NM do this. We can start with a configurable 
timeout with a reasonable default; in the future we can add the ability for the 
AM to override it in the container request.
Let me know what you think.
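
A minimal sketch of the configurable-timeout idea; the property name below is 
purely hypothetical and does not exist in YarnConfiguration today:

{code}
import org.apache.hadoop.conf.Configuration;

public class LocalizationTimeoutSketch {
  // Hypothetical property name, for illustration only; no such key exists yet.
  static final String LOCALIZATION_TIMEOUT_MS =
      "yarn.nodemanager.localization.timeout-ms";
  static final long DEFAULT_LOCALIZATION_TIMEOUT_MS = 10 * 60 * 1000L;

  // True if a container that entered LOCALIZING at localizingSinceMs should be
  // killed by the NM because it exceeded the configured timeout.
  static boolean shouldKill(Configuration conf, long localizingSinceMs, long nowMs) {
    long timeoutMs =
        conf.getLong(LOCALIZATION_TIMEOUT_MS, DEFAULT_LOCALIZATION_TIMEOUT_MS);
    return timeoutMs > 0 && (nowMs - localizingSinceMs) > timeoutMs;
  }
}
{code}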

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> ---
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it's stuck in one of 
> these states. These may have nothing to do with the task itself and could be 
> an issue within the platform.
> Ideally there should be configurable limits for the various states within the 
> NodeManager. The RM does not care about most of these; they only concern the 
> AM and the NM. We can start by making these global configurable defaults, and 
> in the future we can make it fancier by letting the AM override them in the 
> start-container request.
> This jira will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2175:


Description: 
There are no timeouts that can be used to limit the time taken by various 
container startup operations. Localization, for example, could take a long 
time, and there is no automated way to kill a task if it's stuck in one of 
these states. These may have nothing to do with the task itself and could be an 
issue within the platform.

Ideally there should be configurable limits for the various states within the 
NodeManager. The RM does not care about most of these; they only concern the AM 
and the NM. We can start by making these global configurable defaults, and in 
the future we can make it fancier by letting the AM override them in the 
start-container request.

This jira will be used to limit localization time, and we will open others if 
we feel we need to limit other operations.

  was:
There are no timeouts that can be used to limit the time taken by various 
container startup operations. Localization, for example, could take a long 
time, and there is no way to kill a task if it's stuck in one of these states. 
These may have nothing to do with the task itself and could be an issue within 
the platform.

Ideally there should be configurable limits for the various states within the 
NodeManager. The RM does not care about most of these; they only concern the AM 
and the NM. We can start by making these global configurable defaults, and in 
the future we can make it fancier by letting the AM override them in the 
start-container request.

This jira will be used to limit localization time, and we will open others if 
we feel we need to limit other operations.


> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> ---
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it's stuck in one of 
> these states. These may have nothing to do with the task itself and could be 
> an issue within the platform.
> Ideally there should be configurable limits for the various states within the 
> NodeManager. The RM does not care about most of these; they only concern the 
> AM and the NM. We can start by making these global configurable defaults, and 
> in the future we can make it fancier by letting the AM override them in the 
> start-container request.
> This jira will be used to limit localization time, and we will open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2175:


Description: 
There are no timeouts that can be used to limit the time taken by various 
container startup operations. Localization, for example, could take a long 
time, and there is no automated way to kill a task if it's stuck in one of 
these states. These may have nothing to do with the task itself and could be an 
issue within the platform.

Ideally there should be configurable limits for the various states within the 
NodeManager. The RM does not care about most of these; they only concern the AM 
and the NM. We can start by making these global configurable defaults, and in 
the future we can make it fancier by letting the AM override them in the 
start-container request.

This jira will be used to limit localization time, and we can open others if we 
feel we need to limit other operations.

  was:
There are no timeouts that can be used to limit the time taken by various 
container startup operations. Localization, for example, could take a long 
time, and there is no automated way to kill a task if it's stuck in one of 
these states. These may have nothing to do with the task itself and could be an 
issue within the platform.

Ideally there should be configurable limits for the various states within the 
NodeManager. The RM does not care about most of these; they only concern the AM 
and the NM. We can start by making these global configurable defaults, and in 
the future we can make it fancier by letting the AM override them in the 
start-container request.

This jira will be used to limit localization time, and we will open others if 
we feel we need to limit other operations.


> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> ---
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it's stuck in one of 
> these states. These may have nothing to do with the task itself and could be 
> an issue within the platform.
> Ideally there should be configurable limits for the various states within the 
> NodeManager. The RM does not care about most of these; they only concern the 
> AM and the NM. We can start by making these global configurable defaults, and 
> in the future we can make it fancier by letting the AM override them in the 
> start-container request.
> This jira will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2224) Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings

2014-07-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049207#comment-14049207
 ] 

Karthik Kambatla commented on YARN-2224:


+1

I wish there were a simpler solution. 

> Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective 
> of the default settings
> -
>
> Key: YARN-2224
> URL: https://issues.apache.org/jira/browse/YARN-2224
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: nodemanager
>Affects Versions: 2.4.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Trivial
>  Labels: newbie
> Attachments: YARN-2224.patch
>
>
> If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false, the test 
> will fail. Make the test not rely on the default settings; instead, let it 
> verify that once the setting is turned on, the memory check is actually 
> performed. See YARN-2225, which suggests turning the default off.
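
A sketch of the approach being +1'd above, assuming the standard 
YarnConfiguration key: the test sets the vmem check explicitly instead of 
inheriting whatever the default happens to be.

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class VmemCheckTestConf {
  static YarnConfiguration confForTest() {
    YarnConfiguration conf = new YarnConfiguration();
    // Turn the check on explicitly, so the assertion verifies the kill-on-overflow
    // behaviour itself rather than whatever DEFAULT_NM_VMEM_CHECK_ENABLED is.
    conf.setBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED, true);
    return conf;
  }
}
{code}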



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2224) Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings

2014-07-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2224:
---

  Component/s: nodemanager
 Priority: Trivial  (was: Major)
 Target Version/s: 2.5.0
Affects Version/s: 2.4.1
   Labels: newbie  (was: )
   Issue Type: Test  (was: Bug)

> Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective 
> of the default settings
> -
>
> Key: YARN-2224
> URL: https://issues.apache.org/jira/browse/YARN-2224
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: nodemanager
>Affects Versions: 2.4.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Trivial
>  Labels: newbie
> Attachments: YARN-2224.patch
>
>
> If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false, the test 
> will fail. Make the test not rely on the default settings; instead, let it 
> verify that once the setting is turned on, the memory check is actually 
> performed. See YARN-2225, which suggests turning the default off.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049199#comment-14049199
 ] 

Vinod Kumar Vavilapalli commented on YARN-1713:
---

Looks much better. Final set of nits:
 - XmlRootElement for ApplicationId -> NewApplication
 - Rename refs to AppId: {Cluster ApplicationId API} in the documentation. Need 
to fix all this documentation to not say ApplicationID.
 - Similarly rename http:///ws/v1/cluster/apps/id
 - I think you should create a writable APIs section in the doc, add a 
disclaimer saying this is alpha+public-unstable and then put the new APIs in 
there, so we can let it bake in for a release or two.

> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, 
> apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, 
> apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049175#comment-14049175
 ] 

Jian He commented on YARN-2001:
---

bq. Insufficient state etc.
Right, I found more issues: the RM may receive the release-container request 
(sent by the AM on resync) before the containers are actually recovered. So we 
need to make sure the previous release request is also processed correctly on 
recovery.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch
>
>
> After failover, the RM may require a certain threshold to determine whether 
> it's safe to make scheduling decisions and start accepting new container 
> requests from AMs. The threshold could be a certain number of nodes, i.e. the 
> RM waits until a certain number of nodes have joined before accepting new 
> container requests. Or it could simply be a timeout: only after the timeout 
> does the RM accept new requests.
> NMs that join after the threshold can be treated as new NMs and instructed to 
> kill all of their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down

2014-07-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049167#comment-14049167
 ] 

Jian He commented on YARN-1366:
---

-  SecurityUtil.java loads configurations during class loading.
I see. 

Patch looks good to me, just two more minor comments:
- Can these two synchronized blocks be merged into one?
{code}
synchronized (this) {
  // reset lastResponseId to 0
  lastResponseId = 0;
  release.addAll(this.pendingRelease);
  blacklistAdditions.addAll(this.blacklistedNodes);
}
// re register with RM
registerApplicationMaster();

synchronized (this) {
  for (Map<String, TreeMap<Resource, ResourceRequestInfo>> rr :
      remoteRequestsTable.values()) {
    for (Map<Resource, ResourceRequestInfo> capabalities : rr.values()) {
      for (ResourceRequestInfo request : capabalities.values()) {
        addResourceRequestToAsk(request.remoteRequest);
      }
    }
  }
}
{code}
- Is the following reset of responseId in unregisterApplicationMaster needed at 
all?
{code}
  synchronized (this) {
// reset lastResponseId to 0
lastResponseId = 0;
  }
{code}

> AM should implement Resync with the ApplicationMasterService instead of 
> shutting down
> -
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, 
> YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, 
> YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, 
> YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM instead. Resync means resetting the allocate RPC 
> sequence number to 0, and the AM should send its entire set of outstanding 
> requests to the RM. Note that if the AM is making its first allocate call to 
> the RM, things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since it last synced with the AM. 
> Some container completions may be reported more than once.
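
A condensed sketch of the resync behaviour the description asks for; the member 
and method names are illustrative, not the real AMRMClientImpl ones:

{code}
public class AmResyncSketch {
  private int lastResponseId;

  // On receiving the resync signal from the ApplicationMasterService,
  // re-register instead of shutting down.
  synchronized void onResync() {
    lastResponseId = 0;            // reset the allocate RPC sequence number
    reRegisterWithRM();            // registerApplicationMaster() in the real client
    resendOutstandingRequests();   // replay all pending asks and releases
    // After this point completed containers may be reported again, so callers
    // must tolerate duplicate completion events.
  }

  void reRegisterWithRM() { /* placeholder */ }
  void resendOutstandingRequests() { /* placeholder */ }
}
{code}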



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2240) yarn logs can get corrupted if the aggregator does not have permissions to the log file it tries to read

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049104#comment-14049104
 ] 

Vinod Kumar Vavilapalli commented on YARN-2240:
---

[~mitdesai], this is interesting. We had seen a bunch of errors that we 
couldn't find the root cause for. Mind pasting the exception messages that you 
see on the client or the error message in the logs?

> yarn logs can get corrupted if the aggregator does not have permissions to 
> the log file it tries to read
> 
>
> Key: YARN-2240
> URL: https://issues.apache.org/jira/browse/YARN-2240
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.5.0
>Reporter: Mit Desai
>
> When the log aggregator is aggregating the logs, it writes the file length 
> first. It then tries to open the log file, and if it does not have permission 
> to do so, it ends up just writing an error message to the aggregated logs.
> The mismatch between the recorded file length and the length of what was 
> actually written makes the aggregated logs corrupt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2139) Add support for disk IO isolation/scheduling for containers

2014-07-01 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2139:
--

Attachment: Disk_IO_Scheduling_Design_1.pdf

Attached a design draft.

> Add support for disk IO isolation/scheduling for containers
> ---
>
> Key: YARN-2139
> URL: https://issues.apache.org/jira/browse/YARN-2139
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: Disk_IO_Scheduling_Design_1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2240) yarn logs can get corrupted if the aggregator does not have permissions to the log file it tries to read

2014-07-01 Thread Mit Desai (JIRA)
Mit Desai created YARN-2240:
---

 Summary: yarn logs can get corrupted if the aggregator does not 
have permissions to the log file it tries to read
 Key: YARN-2240
 URL: https://issues.apache.org/jira/browse/YARN-2240
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Mit Desai


When the log aggregator is aggregating the logs, it writes the file length 
first. It then tries to open the log file, and if it does not have permission 
to do so, it ends up just writing an error message to the aggregated logs.

The mismatch between the recorded file length and the length of what was 
actually written makes the aggregated logs corrupt.
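
A simplified sketch of the safer ordering implied by the description; this is 
not the real AggregatedLogFormat code, just the idea that the declared length 
must match whatever bytes actually follow it:

{code}
import java.io.DataOutputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class LogAppendSketch {
  // Decide what the payload is (real log bytes or an error message) before
  // writing its length, so the declared length always matches what follows.
  static void appendLogFile(DataOutputStream out, File logFile) throws Exception {
    byte[] payload;
    if (logFile.canRead()) {
      payload = Files.readAllBytes(logFile.toPath());
    } else {
      payload = ("Error: aggregator cannot read " + logFile.getName())
          .getBytes(StandardCharsets.UTF_8);
    }
    out.writeUTF(String.valueOf(payload.length)); // length field
    out.write(payload);                           // exactly that many bytes
  }
}
{code}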



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2146) Yarn logs aggregation error

2014-07-01 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048910#comment-14048910
 ] 

Mit Desai commented on YARN-2146:
-

I looked at it. The problem is due to a corner case in the fix. I will file 
another JIRA to track the issue. Thanks, [~airbots].

> Yarn logs aggregation error
> ---
>
> Key: YARN-2146
> URL: https://issues.apache.org/jira/browse/YARN-2146
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chen He
>
> When I run "yarn logs -applicationId application_xxx > /tmp/application_xxx", 
> it creates the file, also shows part of the logs on the terminal screen, and 
> reports the following error:
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:430)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566)
>   at 
> org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139)
>   at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137)
>   at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens

2014-07-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048865#comment-14048865
 ] 

Zhijie Shen commented on YARN-2233:
---

Thanks Varun for the patch. In general, the patch looks good, and I like the 
detailed test cases :-) Here are some points I'd like you to help clarify 
further:

1. 
bq. It should be noted that when cancelling a token, the token to be cancelled 
is specified by setting a header.

Any reason for specifying the token in a header? If there's something 
non-intuitive, maybe we should add some in-code comments for other developers?

2. The RPC get-delegation-token API doesn't have these fields, but they seem 
nice to have. We may want to file a JIRA.
{code}
+long currentExpiration = ident.getIssueDate() + tokenRenewInterval;
+long maxValidity = ident.getMaxDate();
{code}

3. Is it possible to reuse KerberosTestUtils in hadoop-auth?

4. Is this supposed to test an invalid request body? It doesn't look like the 
invalid-body construction in the later tests.
{code}
+response =
+resource().path("ws").path("v1").path("cluster")
+  .path("delegation-token").accept(contentType)
+  .entity(dtoken, mediaType).post(ClientResponse.class);
+assertEquals(Status.BAD_REQUEST, response.getClientResponseStatus());
{code}

Some minor issues:

1. No need for "== true".
{code}
+if (usePrincipal == true) {
{code}
Similarly,
{code}
+if (KerberosAuthenticationHandler.TYPE.equals(authType) == false) {
{code}

2. If I remember correctly, callerUGI.doAs will throw 
UndeclaredThrowableException, which wraps the real raised exception. However, 
UndeclaredThrowableException is a RuntimeException, so this code cannot catch it.
{code}
+try {
+  resp =
+      callerUGI
+        .doAs(new PrivilegedExceptionAction<GetDelegationTokenResponse>() {
+          @Override
+          public GetDelegationTokenResponse run() throws IOException,
+              YarnException {
+            GetDelegationTokenRequest createReq =
+                GetDelegationTokenRequest.newInstance(renewer);
+            return rm.getClientRMService().getDelegationToken(createReq);
+          }
+        });
+} catch (Exception e) {
+  LOG.info("Create delegation token request failed", e);
+  throw e;
+}
{code}

3. Can't we simply return respToken? The framework should generate the "OK" 
status automatically, right?
{code}
+return Response.status(Status.OK).entity(respToken).build();
{code}

4. You can call tk.decodeIdentifier directly.
{code}
+RMDelegationTokenIdentifier ident = new RMDelegationTokenIdentifier();
+ByteArrayInputStream buf = new ByteArrayInputStream(tk.getIdentifier());
+DataInputStream in = new DataInputStream(buf);
+ident.readFields(in);
{code}
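
In other words, the four lines above could collapse to something like this 
(sketch; error handling omitted):

{code}
// Sketch: Token#decodeIdentifier() replaces the manual readFields() above.
static RMDelegationTokenIdentifier decode(
    org.apache.hadoop.security.token.Token<RMDelegationTokenIdentifier> tk)
    throws java.io.IOException {
  return tk.decodeIdentifier();
}
{code}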

> Implement web services to create, renew and cancel delegation tokens
> 
>
> Key: YARN-2233
> URL: https://issues.apache.org/jira/browse/YARN-2233
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2233.0.patch
>
>
> Implement functionality to create, renew and cancel delegation tokens.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048774#comment-14048774
 ] 

Varun Vasudev commented on YARN-1713:
-

The test failure is unrelated.

> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, 
> apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, 
> apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048727#comment-14048727
 ] 

Hadoop QA commented on YARN-1713:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12653360/apache-yarn-1713.9.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The test build failed in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4162//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4162//console

This message is automatically generated.

> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, 
> apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, 
> apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048714#comment-14048714
 ] 

Hadoop QA commented on YARN-2228:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653356/YARN-2228.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4161//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4161//console

This message is automatically generated.

> TimelineServer should load pseudo authentication filter when authentication = 
> simple
> 
>
> Key: YARN-2228
> URL: https://issues.apache.org/jira/browse/YARN-2228
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2228.1.patch
>
>
> When Kerberos authentication is not enabled, we should let the timeline 
> server work with the pseudo authentication filter. In this way, the server is 
> able to detect the request user by checking "user.name".
> On the other hand, the timeline client should append "user.name" in the 
> unsecure case as well, so that ACLs keep working in this case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1565) Add a way for YARN clients to get critical YARN system properties from the RM

2014-07-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048704#comment-14048704
 ] 

Steve Loughran commented on YARN-1565:
--

I think this should be part of the REST API; we would just publish some JSON 
that provides this information to local and remote systems (a rough sketch 
follows the list):

# the values listed above
# all the special expanded variables you can use in command creation
# a select subset of YARN/Hadoop properties: defaultFS, yarn.vmem, and some 
other props we think are useful for clients and debugging. We shouldn't publish 
the whole aggregate of -site.xml values, as that can leak private keys to 
object stores.
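
Purely illustrative sketch of the kind of read-only endpoint being proposed; the 
resource path and field names are hypothetical, and nothing like this exists in 
RMWebServices yet:

{code}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical resource path and DAO, for illustration only.
@Path("/ws/v1/cluster/system-properties")
public class ClusterSystemPropertiesResource {

  public static class SystemPropertiesInfo {
    public String defaultFS;
    public String applicationClasspath;
    public String pathSeparator;
    public String clusterOS;
  }

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public SystemPropertiesInfo get() {
    SystemPropertiesInfo info = new SystemPropertiesInfo();
    info.pathSeparator = java.io.File.pathSeparator;
    info.clusterOS = System.getProperty("os.name");
    // defaultFS and the classpath would come from a vetted subset of the RM's
    // Configuration, deliberately not the whole aggregate *-site.xml.
    return info;
  }
}
{code}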

> Add a way for YARN clients to get critical YARN system properties from the RM
> -
>
> Key: YARN-1565
> URL: https://issues.apache.org/jira/browse/YARN-1565
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>
> If you are trying to build up an AM request, you need to know
> # the limits on memory, cores, etc. for the chosen queue
> # the existing YARN classpath
> # the path separator for the target platform (so your classpath comes out 
> right)
> # the cluster OS, in case you need some OS-specific changes
> The classpath can be in yarn-site.xml, but a remote client may not have that. 
> The site XML files don't list queue resource limits, the cluster OS, or the 
> path separator.
> A way to query the RM for these values would make it easier for YARN clients 
> to build up AM submissions with less guesswork and client-side config.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service

2014-07-01 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-1713:


Attachment: apache-yarn-1713.9.patch

bq. dao.AppId -> NewApplication? Similarly createApplicationId() -> 
createNewApplication() and createNewAppId() too in RMWebServices.

Fixed.

bq. Similarly rename vars in the test-case.

Fixed.

bq. AppSubmissionContextInfo -> AppSubmissionSubmissionContextInfo

Renamed to ApplicationSubmissionContextInfo.

bq. ContainerLaunchContextInfo's XML element name 'containerinfo' needs to be 
updated.

Fixed.

bq. ResourceInfo.vCores -> virtualCores with xml name as virtual-cores

This field is already being used as part of a published API, so we should 
probably leave it as is.

bq. CredentialsInfo.delegation-tokens -> simply tokens

Fixed.

bq. Can we keep the validation logic the same for RPCs and web services? You 
have additional checks in web services that don't quite exist in RPCs. I still 
see some w.r.t. the CLC?

Fixed.

I've also updated the documentation.

> Implement getnewapplication and submitapp as part of RM web service
> ---
>
> Key: YARN-1713
> URL: https://issues.apache.org/jira/browse/YARN-1713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, 
> apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, 
> apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, 
> apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, 
> apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, 
> apache-yarn-1713.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple

2014-07-01 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2228:
--

Attachment: YARN-2228.1.patch

Created a patch to make the following major changes:

1. Always load TimelineAuthenticationFilter when the timeline server is up.

2. Completely separate the timeline authentication configuration dependency 
from the common part. All timeline authentication configurations start with 
"yarn.timeline-service.http.authentication".

3. When y.t.h.a.type = simple, TimelineAuthenticationFilter uses 
PseudoAuthenticationHandler to process the request. It allows the timeline 
server to get the user name if the user specifies "user.name" in the URL 
params, and to use it as the owner of the entity that the user posts. In this 
way, we can enable timeline ACLs even when Kerberos authentication is not 
enabled (aka insecure mode). When y.t.h.a.type = kerberos, everything works as 
before.

4. Updated TestTimelineWebServices to test ACLs under the "simple" 
authentication type instead of mocking the user name.

I've verified the patch locally in both secure and insecure clusters, and it 
looked generally fine.
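
For context, a small sketch of what the "simple" mode described above looks like 
from the client side; the property key is inferred from the prefix given in 
point 2, and the host, port, and user are placeholders:

{code}
import org.apache.hadoop.conf.Configuration;

public class TimelineSimpleAuthSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Key inferred from the "yarn.timeline-service.http.authentication" prefix
    // described in point 2; check the patch for the exact name.
    conf.set("yarn.timeline-service.http.authentication.type", "simple");

    // In simple mode a pseudo-authenticated request carries the caller's name
    // explicitly via the user.name query parameter.
    String url = "http://timeline.example.com:8188/ws/v1/timeline?user.name=alice";
    System.out.println(url);
  }
}
{code}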

> TimelineServer should load pseudo authentication filter when authentication = 
> simple
> 
>
> Key: YARN-2228
> URL: https://issues.apache.org/jira/browse/YARN-2228
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2228.1.patch
>
>
> When Kerberos authentication is not enabled, we should let the timeline 
> server work with the pseudo authentication filter. In this way, the server is 
> able to detect the request user by checking "user.name".
> On the other hand, the timeline client should append "user.name" in the 
> unsecure case as well, so that ACLs keep working in this case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2239) Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048657#comment-14048657
 ] 

Hadoop QA commented on YARN-2239:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653340/YARN-2239.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4159//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4159//console

This message is automatically generated.

> Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()
> ---
>
> Key: YARN-2239
> URL: https://issues.apache.org/jira/browse/YARN-2239
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Kenji Kikushima
>Assignee: Kenji Kikushima
>Priority: Trivial
> Attachments: YARN-2239.patch
>
>
> In ClusterMetrics, the other get*NMs() methods have a "Num" prefix (e.g. 
> getNumLostNMs()/getNumRebootedNMs()).
> For naming consistency, we should rename getUnhealthyNMs() to 
> getNumUnhealthyNMs().



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048623#comment-14048623
 ] 

Hadoop QA commented on YARN-2229:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653339/YARN-2229.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4160//console

This message is automatically generated.

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2239) Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()

2014-07-01 Thread Kenji Kikushima (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenji Kikushima updated YARN-2239:
--

Attachment: YARN-2239.patch

Attached a patch.

> Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()
> ---
>
> Key: YARN-2239
> URL: https://issues.apache.org/jira/browse/YARN-2239
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Kenji Kikushima
>Assignee: Kenji Kikushima
>Priority: Trivial
> Attachments: YARN-2239.patch
>
>
> In ClusterMetrics, the other get*NMs() methods have a "Num" prefix (e.g. 
> getNumLostNMs()/getNumRebootedNMs()).
> For naming consistency, we should rename getUnhealthyNMs() to 
> getNumUnhealthyNMs().



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2239) Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()

2014-07-01 Thread Kenji Kikushima (JIRA)
Kenji Kikushima created YARN-2239:
-

 Summary: Rename ClusterMetrics#getUnhealthyNMs() to 
getNumUnhealthyNMs()
 Key: YARN-2239
 URL: https://issues.apache.org/jira/browse/YARN-2239
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Kenji Kikushima
Assignee: Kenji Kikushima
Priority: Trivial


In ClusterMetrics, the other get*NMs() methods have a "Num" prefix (e.g. 
getNumLostNMs()/getNumRebootedNMs()).
For naming consistency, we should rename getUnhealthyNMs() to 
getNumUnhealthyNMs().



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-

Attachment: YARN-2229.2.patch

Fixed compile error.

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-

Attachment: (was: YARN-2229-wip.01.patch)

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048593#comment-14048593
 ] 

Hadoop QA commented on YARN-2229:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653335/YARN-2229.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4158//console

This message is automatically generated.

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229-wip.01.patch, YARN-2229.1.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048589#comment-14048589
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

Attached a patch based on the idea described above.

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229-wip.01.patch, YARN-2229.1.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId long. We need to define 
> the new format of container id on this JIRA while preserving backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2232) ClientRMService doesn't allow delegation token owner to cancel their own token in secure mode

2014-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048587#comment-14048587
 ] 

Hadoop QA commented on YARN-2232:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12653327/apache-yarn-2232.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4157//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4157//console

This message is automatically generated.

> ClientRMService doesn't allow delegation token owner to cancel their own 
> token in secure mode
> -
>
> Key: YARN-2232
> URL: https://issues.apache.org/jira/browse/YARN-2232
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2232.0.patch, apache-yarn-2232.1.patch, 
> apache-yarn-2232.2.patch
>
>
> The ClientRMService doesn't allow delegation token owners to cancel their own 
> tokens. The root cause is this piece of code from the cancelDelegationToken 
> function -
> {noformat}
> String user = getRenewerForToken(token);
> ...
> private String getRenewerForToken(Token token) 
> throws IOException {
>   UserGroupInformation user = UserGroupInformation.getCurrentUser();
>   UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
>   // we can always renew our own tokens
>   return loginUser.getUserName().equals(user.getUserName())
>   ? token.decodeIdentifier().getRenewer().toString()
>   : user.getShortUserName();
> }
> {noformat}
> It ends up passing the user's short name to the cancelToken function, whereas 
> AbstractDelegationTokenSecretManager::cancelToken expects the full user name. 
> This bug occurs in secure mode and is not an issue with simple auth.
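
For reference, below is a minimal, untested sketch of one way the renewer 
lookup could return the full user name instead of the short name; it is 
illustrative only and not necessarily what the attached patches do. The 
generic type parameter on {{Token}} is assumed here (it is elided in the 
snippet quoted above).

{noformat}
// Illustrative sketch only, not the attached patch: return the full user name
// (getUserName) rather than the short name, so that
// AbstractDelegationTokenSecretManager#cancelToken matches the token owner in
// secure mode.
private String getRenewerForToken(Token<RMDelegationTokenIdentifier> token)
    throws IOException {
  UserGroupInformation user = UserGroupInformation.getCurrentUser();
  UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
  // we can always renew our own tokens
  return loginUser.getUserName().equals(user.getUserName())
      ? token.decodeIdentifier().getRenewer().toString()
      : user.getUserName();   // full name instead of getShortUserName()
}
{noformat}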



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2229) Making ContainerId long type

2014-07-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-

Attachment: YARN-2229.1.patch

> Making ContainerId long type
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229-wip.01.patch, YARN-2229.1.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch, and the lower 22 bits are for the Id sequence number. This preserves the 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid that problem, it's better to make containerId a long. We need to define 
> the new container Id format while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.2#6252)