[jira] [Commented] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049665#comment-14049665 ]

Hadoop QA commented on YARN-2241:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653540/YARN-2241.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4174//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4174//console

This message is automatically generated.
> ZKRMStateStore: On startup, show nicer messages when znodes already exist
> -------------------------------------------------------------------------
>
> Key: YARN-2241
> URL: https://issues.apache.org/jira/browse/YARN-2241
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: Robert Kanter
> Assignee: Robert Kanter
> Priority: Minor
> Attachments: YARN-2241.patch, YARN-2241.patch
>
> When using the ZKRMStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected, as these nodes already exist from before. We should catch these and print nicer messages.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
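The fix described above amounts to catching the exception and logging one friendly line instead of a stack trace. A minimal self-contained sketch of that pattern; `KeeperException.NodeExistsException` lives in the ZooKeeper client library, so a stand-in exception class is used here, and the `safeCreate` helper is hypothetical, not the actual ZKRMStateStore method:

```java
// Stand-in for org.apache.zookeeper.KeeperException.NodeExistsException,
// so this sketch compiles without the ZooKeeper client on the classpath.
class NodeExistsException extends Exception {
    NodeExistsException(String path) {
        super("KeeperErrorCode = NodeExists for " + path);
    }
}

class ZnodeCreator {
    // Hypothetical helper: create a znode, but treat "already exists" as a
    // normal condition on RM restart and report one nice line instead of
    // letting the exception propagate as a stack trace.
    static String safeCreate(java.util.Set<String> existing, String path) {
        try {
            create(existing, path);
            return "Created znode " + path;
        } catch (NodeExistsException e) {
            // Expected when the RM restarts; no stack trace needed.
            return "znode " + path + " already exists; skipping creation";
        }
    }

    // Simulates the ZooKeeper create call against an in-memory "tree".
    static void create(java.util.Set<String> existing, String path)
            throws NodeExistsException {
        if (!existing.add(path)) {
            throw new NodeExistsException(path);
        }
    }
}
```

On a fresh store the first branch fires; on restart the catch branch produces the quieter message.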
[jira] [Updated] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated YARN-2241:
--------------------------------

Attachment: YARN-2241.patch

You're right, it doesn't fail without the fix; I must have checked it with something slightly different from the old code when I tried it. In that case I don't think we need the test; it's a pretty simple fix and I was able to verify that it worked correctly. I've uploaded a new patch that doesn't have the test.

> ZKRMStateStore: On startup, show nicer messages when znodes already exist
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049646#comment-14049646 ]

Hadoop QA commented on YARN-1366:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653538/YARN-1366.11.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4173//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4173//console

This message is automatically generated.
> AM should implement Resync with the ApplicationMasterService instead of shutting down
> -------------------------------------------------------------------------------------
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
> Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
> The ApplicationMasterService currently sends a resync response, to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM instead. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM, things should proceed as normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once.
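The resync behavior described above (reset the sequence number to 0, resend the entire outstanding request, tolerate duplicate completion reports) can be sketched in a few lines. All class and field names here are illustrative; the real AM-side code lives in AMRMClient and differs in detail:

```java
import java.util.*;

// Illustrative sketch of the AM-side resync behavior described above.
class AmAllocator {
    int responseId = 0;                                   // allocate RPC sequence number
    final List<String> outstanding = new ArrayList<>();   // pending resource asks
    final Set<String> seenCompletions = new HashSet<>();  // dedup completed containers

    // Build the asks for the next allocate call.
    List<String> nextAllocateAsks(boolean resyncRequested) {
        if (resyncRequested) {
            responseId = 0;                       // reset the sequence number
            return new ArrayList<>(outstanding);  // resend the entire outstanding request
        }
        responseId++;
        return Collections.emptyList();           // normally only deltas are sent
    }

    // The RM may report the same completion more than once after a resync,
    // so the AM has to deduplicate; returns false for a duplicate report.
    boolean recordCompletion(String containerId) {
        return seenCompletions.add(containerId);
    }
}
```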
[jira] [Created] (YARN-2245) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
Sangjin Lee created YARN-2245:
------------------------------

Summary: AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
Key: YARN-2245
URL: https://issues.apache.org/jira/browse/YARN-2245
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee

With the job classloader enabled, the MR AM throws ClassNotFoundException if a custom output format class is specified.

{noformat}
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:473)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:374)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1459)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1456)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1389)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895)
	at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:222)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:469)
	... 8 more
Caused by: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893)
	... 10 more
{noformat}
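The bottom of the stack trace shows the core of the problem: `Configuration.getClassByName` resolves the class name against a classloader that does not see the user's classes. The failure mode can be reproduced in isolation; `com.foo.test.TestOutputFormat` is the hypothetical class name from the report, and `tryLoad` is an illustrative helper, not a Hadoop API:

```java
class ClassLoaderProbe {
    // Resolve a class name against a given loader, roughly mirroring what
    // Configuration.getClassByName does; returns null instead of throwing,
    // so the two outcomes are easy to compare.
    static Class<?> tryLoad(String name, ClassLoader loader) {
        try {
            return Class.forName(name, false, loader);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```

One plausible fix direction (an assumption, not the patch on this JIRA) is to make sure the Configuration used by the AM resolves classes through the job classloader, e.g. via `Configuration.setClassLoader`, before the output format class is looked up.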
[jira] [Updated] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith updated YARN-1366:
-------------------------

Attachment: YARN-1366.11.patch

Updated the patch to fix the findbugs warning.

> AM should implement Resync with the ApplicationMasterService instead of shutting down
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049621#comment-14049621 ]

Hadoop QA commented on YARN-2229:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653536/YARN-2229.3.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:red}-1 javac{color}. The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4172//console

This message is automatically generated.

> Making ContainerId long type
> ----------------------------
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Tsuyoshi OZAWA
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the epoch, and the lower 22 bits are for the sequence number of the ids. This preserves the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after the RM restarts 1024 times. To avoid the problem, it's better to make containerId a long. We need to define the new container id format on this JIRA while preserving backward compatibility.
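The bit layout in the description can be written out explicitly. The numbers follow the YARN-2052 format quoted above (10-bit epoch, 22-bit sequence in a 32-bit id); the helper names and the 40-bit split in the long variant are illustrative assumptions, not the format finally committed on this JIRA:

```java
class ContainerIdBits {
    static final int SEQ_BITS = 22;   // lower 22 bits: sequence number
    // upper 10 bits: epoch (number of RM restarts)

    // Pack epoch and sequence into the 32-bit id per YARN-2052.
    static int packInt(int epoch, int seq) {
        return (epoch << SEQ_BITS) | (seq & ((1 << SEQ_BITS) - 1));
    }

    static int epochOf(int id) {
        return id >>> SEQ_BITS;
    }

    // A 10-bit epoch wraps after 2^10 = 1024 RM restarts, which is the
    // overflow concern above.  Widening the id to a long removes the
    // practical limit; the 40-bit sequence field here is an assumption.
    static long packLong(long epoch, long seq) {
        return (epoch << 40) | (seq & ((1L << 40) - 1));
    }
}
```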
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lijuan Zhang updated YARN-2142:
-------------------------------

Labels: features patch (was: patch)

> Add one service to check the nodes' TRUST status
> ------------------------------------------------
>
> Key: YARN-2142
> URL: https://issues.apache.org/jira/browse/YARN-2142
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, resourcemanager, scheduler, webapp
> Environment: OS: Ubuntu 13.04; JAVA: OpenJDK 7u51-2.4.4-0. Only in branch-2.2.0.
> Reporter: anders
> Priority: Minor
> Labels: features
> Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, trust002.patch, trust003.patch, trust2.patch
>
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status from the API of the OAT server), so I added this feature into hadoop's scheduler.
> Through the TRUST check service, a node can get its own TRUST status and then, through the heartbeat, send the TRUST status to the resource manager for scheduling.
> In the scheduling step, if a node's TRUST status is 'false', it will be abandoned until its TRUST status turns to 'true'.
> ***The logic of this feature is similar to the node's health check service.***
> ***Only in branch-2.2.0, not in trunk***
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lijuan Zhang updated YARN-2142:
-------------------------------

Labels: features (was: features patch)

> Add one service to check the nodes' TRUST status
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lijuan Zhang updated YARN-2142:
-------------------------------

Affects Version/s: (was: 2.2.0)

> Add one service to check the nodes' TRUST status
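The flow proposed in the YARN-2142 description (a trust flag riding the node heartbeat, with the scheduler skipping untrusted nodes until their status turns true) can be sketched as follows; all names here are hypothetical, not code from the attached patches:

```java
import java.util.*;

// Illustrative sketch of the proposed flow: the TRUST status reported by
// each node (obtained node-side from the OAT server in the proposal) rides
// the heartbeat, and scheduling skips nodes whose status is false.
class TrustAwareScheduler {
    final Map<String, Boolean> trustByNode = new HashMap<>();

    // Heartbeat handler: record the node's reported TRUST status.
    void onHeartbeat(String nodeId, boolean trusted) {
        trustByNode.put(nodeId, trusted);
    }

    // Only nodes currently reported as trusted are candidates for containers;
    // an untrusted node is abandoned until its status turns back to true.
    List<String> schedulableNodes() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : trustByNode.entrySet()) {
            if (e.getValue()) {
                out.add(e.getKey());
            }
        }
        Collections.sort(out);
        return out;
    }
}
```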
[jira] [Reopened] (YARN-2244) FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anubhav Dhoot reopened YARN-2244:
---------------------------------

You resolved the bug as a duplicate of itself. Reopening it.

> FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596
> ----------------------------------------------------------------------------
>
> Key: YARN-2244
> URL: https://issues.apache.org/jira/browse/YARN-2244
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Reporter: Anubhav Dhoot
> Assignee: Anubhav Dhoot
> Priority: Critical
>
> We are missing changes from patch MAPREDUCE-3596 in FairScheduler. Important fixes in there include handling unknown containers.
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA updated YARN-2229:
---------------------------------

Attachment: YARN-2229.3.patch

> Making ContainerId long type
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049610#comment-14049610 ]

Hadoop QA commented on YARN-1366:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653530/YARN-1366.10.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4171//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4171//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4171//console

This message is automatically generated.
> AM should implement Resync with the ApplicationMasterService instead of shutting down
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049605#comment-14049605 ]

Li Lu commented on YARN-2242:
-----------------------------

Hi [~zjshen], you're right that the two jiras share much of their focus. If I understand correctly, the available patch for YARN-2013 focused on launch time, and the patch here focuses on helping users make use of the logs generated by the log aggregator. Taken together, I think these two patches can alleviate the problem of launch-time crashes.

> Improve exception information on AM launch crashes
> --------------------------------------------------
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Li Lu
> Assignee: Li Lu
> Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch
>
> Currently, each time the AM container crashes during launch, both the console and the web UI report only a ShellExitCodeException. This is not only unhelpful, but sometimes confusing. With the help of the log aggregator, container logs are actually aggregated and can be very helpful for debugging. One possible way to improve the whole process is to send a "pointer" to the aggregated logs to the programmer when reporting exception information.
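The "pointer to the aggregated logs" idea from the description can be sketched as a small diagnostics decorator. The helper name and the URL pattern below are assumptions for illustration, not the API of the attached patch:

```java
class DiagnosticsHelper {
    // Hypothetical helper: append a pointer to the container logs to the
    // bare exit-code diagnostics, so users see where to look instead of
    // only a ShellExitCodeException stack.
    static String withLogPointer(String diagnostics, String nodeHttpAddress,
                                 String containerId, String user) {
        return diagnostics
            + "\nContainer logs: http://" + nodeHttpAddress
            + "/node/containerlogs/" + containerId + "/" + user;
    }
}
```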
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049598#comment-14049598 ]

Li Lu commented on YARN-2242:
-----------------------------

Hi [~djp], sure, I can definitely do that. One small question: since this patch is fairly trivial, could you please give some suggestions on how to build or modify a unit test for it? I'm hoping this part is already tested somewhere in the existing UTs, and some modification would suffice. Thanks!

> Improve exception information on AM launch crashes
[jira] [Commented] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049593#comment-14049593 ]

Beckham007 commented on YARN-2194:
----------------------------------

+1. A new LCEResourceHandler is needed. To support more resource isolation, we also need init(), preExecute() and postExecute() for each resource type. Adding an abstract CgroupsResourceManager with implementations such as CPUResourceManager/MemResourceManager would be good.

> Add Cgroup support for RedHat 7
> -------------------------------
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Wei Yan
> Assignee: Wei Yan
>
> In previous versions of RedHat, we could build custom cgroup hierarchies with the cgconfig command from the libcgroup package. From RedHat 7, the libcgroup package is deprecated and it is not recommended to use it, since it can easily create conflicts with the default cgroup hierarchy. systemd is provided and recommended for cgroup management. We need to add support for this.
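The abstraction suggested in the comment above could take roughly this shape. The class and method names follow the commenter's suggestion; the bodies are invented stand-ins (no real cgroup files are touched here):

```java
// Rough shape of the suggested abstraction; method bodies are invented.
abstract class CgroupsResourceManager {
    abstract void init();                                        // locate/mount the controller
    abstract String preExecute(String containerId, int amount);  // write the limit
    abstract void postExecute(String containerId);               // clean up after exit
    // Under systemd on RedHat 7, these would go through slices/scopes
    // rather than raw cgroup directories.
}

class CpuResourceManager extends CgroupsResourceManager {
    final java.util.Map<String, Integer> shares = new java.util.HashMap<>();

    void init() { /* would verify the cpu controller is available */ }

    String preExecute(String containerId, int cpuShares) {
        shares.put(containerId, cpuShares);
        return "cpu.shares=" + cpuShares;   // the value that would be written
    }

    void postExecute(String containerId) {
        shares.remove(containerId);
    }
}
```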
[jira] [Updated] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith updated YARN-1366:
-------------------------

Attachment: YARN-1366.10.patch

I updated the patch addressing the comments. Please review.

> AM should implement Resync with the ApplicationMasterService instead of shutting down
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049589#comment-14049589 ]

Zhijie Shen commented on YARN-2242:
-----------------------------------

Hi [~gtCarrera9], the useless ExitCodeException stack is not limited to the AM container; whenever a container crashes, we see this message. Previously, I filed a similar ticket: YARN-2013. [~ozawa] was working on it, but I didn't have a chance to look into it. Maybe you want to consolidate the two jiras.

> Improve exception information on AM launch crashes
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049587#comment-14049587 ]

Junping Du commented on YARN-2242:
----------------------------------

Hi [~gtCarrera9], thanks for contributing a patch here! Would you mind adding a unit test to verify your exception messages? I will review your patch.

> Improve exception information on AM launch crashes
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049586#comment-14049586 ]

Vinod Kumar Vavilapalli commented on YARN-2175:
-----------------------------------------------

That is a reasonable proposal, but I'd like to see if there are any other bugs that are causing this to happen. Have we seen this in practice? If so, what is the underlying reason? Too big a resource? The source file system is down? Or does the NM have a bug? We should try to address the right individual problem with its own solution before we put in a band-aid, which may still be useful for issues that we cannot address directly, if any.

Contrast this with mapreduce.task.timeout. Arguably the config helped users time out their jobs, but in my experience it prevented us from focusing on fixing point bugs that were hidden in the framework for a long time; it kind of hides the issues. It is still useful for those unmanageable and unsolvable bugs, but I'd rather first fix the point problems and then put in the band-aid. Thoughts?

> Container localization has no timeouts and tasks can be stuck there for a long time
> -----------------------------------------------------------------------------------
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.4.0
> Reporter: Anubhav Dhoot
> Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no automated way to kill a task if it is stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; it's only between the AM and the NM. We can start by making these global configurable defaults, and in the future we can make it fancier by letting the AM override them in the start-container request. This jira will be used to limit localization time, and we can open others if we feel we need to limit other operations.
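The configurable-limit idea from the description can be sketched with a plain `Future` timeout around the localization work; the method name and the sentinel return values are invented for illustration, not NodeManager APIs:

```java
import java.util.concurrent.*;

class LocalizationRunner {
    // Sketch of a configurable limit on one container-startup phase: run the
    // localization work, and fail it if it exceeds a (hypothetical) NM-side
    // timeout instead of letting the container hang in LOCALIZING forever.
    static String runWithTimeout(Callable<String> localization, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> f = pool.submit(localization);
            try {
                return f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true);                  // interrupt the stuck download
                return "LOCALIZATION_TIMED_OUT";
            } catch (Exception e) {
                return "LOCALIZATION_FAILED";
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

As the comment above argues, this is deliberately a band-aid: it bounds how long a stuck localization can hold a container, but it does not diagnose why the download was stuck.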
[jira] [Resolved] (YARN-2244) FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-2244. --- Resolution: Duplicate I see that YARN-2244 is already filed. Closing as dup. Please reopen if you disagree. > FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596 > - > > Key: YARN-2244 > URL: https://issues.apache.org/jira/browse/YARN-2244 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > > We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Important > fixes in that include handling unknown containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049577#comment-14049577 ] Hadoop QA commented on YARN-2242: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653521/YARN-2242-070114-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4170//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4170//console This message is automatically generated. 
> Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch > > > Now, each time an AM container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeException. This is not only unhelpful, > but sometimes confusing. With the help of the log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049572#comment-14049572 ] Jian He commented on YARN-1366: --- I meant, can we do this ?
{code}
synchronized (this) {
  // reset lastResponseId to 0
  lastResponseId = 0;
  release.addAll(this.pendingRelease);
  blacklistAdditions.addAll(this.blacklistedNodes);
  for (Map<String, TreeMap<Resource, ResourceRequestInfo>> rr :
      remoteRequestsTable.values()) {
    for (Map<Resource, ResourceRequestInfo> capabilities : rr.values()) {
      for (ResourceRequestInfo request : capabilities.values()) {
        addResourceRequestToAsk(request.remoteRequest);
      }
    }
  }
}
// re-register with RM
registerApplicationMaster();
{code}
and "lastResponseId = 0;" may be put in the registerApplicationMaster call also ? > AM should implement Resync with the ApplicationMasterService instead of > shutting down > - > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, > YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, > YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, > YARN-1366.prototype.patch, YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0, and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049566#comment-14049566 ] Vinod Kumar Vavilapalli commented on YARN-2074: --- bq. Talked with Vinod offline, the big problem with this is even if we don't count AM preemption towards AM failures on RM side, MR AM itself checks the attempt id against the max-attempt count for recovery. Work around is to reset the MAX-ATTEMPT env each time launching the AM which sounds a bit hacky though. Filed MAPREDUCE-5956 for this.. > Preemption of AM containers shouldn't count towards AM failures > --- > > Key: YARN-2074 > URL: https://issues.apache.org/jira/browse/YARN-2074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Jian He > Fix For: 2.5.0 > > Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, > YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, > YARN-2074.7.patch, YARN-2074.7.patch, YARN-2074.8.patch > > > One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM > containers getting preempted shouldn't count towards AM failures and thus > shouldn't eventually fail applications. > We should explicitly handle AM container preemption/kill as a separate issue > and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049563#comment-14049563 ] Rohith commented on YARN-1366: -- bq. These two synchronized blocks can be merged into one? I separated these intentionally to handle a very narrow corner scenario: after the AM gets a resync, it goes on to re-register with the RM. In the worst case, if the RM goes down again during this period, registerApplicationMaster starts retrying against both RMs. The idea was not to block AMRMClient operations such as updateBlacklist, addContainerRequest, and so on during that time. Do you think the retry time is short enough that these operations can be blocked? > AM should implement Resync with the ApplicationMasterService instead of > shutting down > - > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, > YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, > YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, > YARN-1366.prototype.patch, YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0, and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to nuke the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049561#comment-14049561 ] Hadoop QA commented on YARN-2131: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653517/YARN-2131.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4168//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4168//console This message is automatically generated. > Add a way to nuke the RMStateStore > -- > > Key: YARN-2131 > URL: https://issues.apache.org/jira/browse/YARN-2131 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter > Attachments: YARN-2131.patch > > > There are cases when we don't want to recover past applications, but recover > applications going forward. To do this, one has to clear the store. 
Today, > there is no easy way to do this, and users have to understand how each store > works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2022: -- Fix Version/s: 2.5.0 > Preempting an Application Master container can be kept as least priority when > multiple applications are marked for preemption by > ProportionalCapacityPreemptionPolicy > - > > Key: YARN-2022 > URL: https://issues.apache.org/jira/browse/YARN-2022 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sunil G >Assignee: Sunil G > Fix For: 2.5.0 > > Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, > YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, > YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, > Yarn-2022.1.patch > > > Cluster Size = 16GB [2NM's] > Queue A Capacity = 50% > Queue B Capacity = 50% > Consider there are 3 applications running in Queue A which has taken the full > cluster capacity. > J1 = 2GB AM + 1GB * 4 Maps > J2 = 2GB AM + 1GB * 4 Maps > J3 = 2GB AM + 1GB * 2 Maps > Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. > Currently in this scenario, Jobs J3 will get killed including its AM. > It is better if AM can be given least priority among multiple applications. > In this same scenario, map tasks from J3 and J2 can be preempted. > Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
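The behavior YARN-2022 asks for above — consider AM containers for preemption only after all other candidates — comes down to an ordering over running containers. A minimal sketch of that ordering follows; the `Candidate` class is a hypothetical stand-in, not the actual RMContainer API used by ProportionalCapacityPreemptionPolicy:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order preemption victims so AM containers come last.
public class PreemptionOrder {
  static class Candidate {
    final String id;
    final boolean isAMContainer;
    Candidate(String id, boolean isAM) { this.id = id; this.isAMContainer = isAM; }
  }

  /** Returns candidates in preemption order: non-AM containers first. */
  static List<Candidate> victimsFirst(List<Candidate> running) {
    List<Candidate> out = new ArrayList<>(running);
    // false (non-AM) sorts before true (AM), so AM containers are picked last
    out.sort(Comparator.comparing(c -> c.isAMContainer));
    return out;
  }
}
```

In the scenario above, this ordering would preempt map tasks from J3 and J2 before touching any AM container.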
[jira] [Commented] (YARN-2204) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049553#comment-14049553 ] Hudson commented on YARN-2204: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5806 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5806/]) YARN-2204. Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow. (Anubhav Dhoot via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607231) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2204 > URL: https://issues.apache.org/jira/browse/YARN-2204 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Trivial > Fix For: 2.5.0 > > Attachments: YARN-2204.patch, YARN-2204_addendum.patch, > YARN-2204_addendum.patch > > > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049552#comment-14049552 ] Hudson commented on YARN-2022: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5806 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5806/]) YARN-2022 Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy (Sunil G via mayank) (mayank: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607227) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java > Preempting an Application Master container can be kept as least priority when > multiple applications are marked for preemption by > ProportionalCapacityPreemptionPolicy > - > > Key: YARN-2022 > URL: https://issues.apache.org/jira/browse/YARN-2022 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, > YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, > YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, > Yarn-2022.1.patch > > > Cluster Size = 16GB [2NM's] > Queue A Capacity = 50% > Queue B Capacity = 50% > Consider there are 3 applications running in Queue A which has taken the full > cluster capacity. > J1 = 2GB AM + 1GB * 4 Maps > J2 = 2GB AM + 1GB * 4 Maps > J3 = 2GB AM + 1GB * 2 Maps > Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. > Currently in this scenario, Jobs J3 will get killed including its AM. > It is better if AM can be given least priority among multiple applications. > In this same scenario, map tasks from J3 and J2 can be preempted. > Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049548#comment-14049548 ] Hadoop QA commented on YARN-611: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653515/YARN-611.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4166//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4166//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4166//console This message is automatically generated. 
> Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
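The proposed window logic above can be sketched as a small state machine; the class and field names here are illustrative, not the actual RmAppImpl code:

```java
// Hypothetical sketch of yarn.resourcemanager.am.retry-count-window-ms:
// failures inside the window accumulate; a quiet window resets the count.
public class AmRetryWindow {
  final int maxRetries;   // yarn.resourcemanager.am.max-retries
  final long windowMs;    // proposed retry-count-window-ms
  int failureCount = 0;
  long lastFailureTime = -1;

  AmRetryWindow(int maxRetries, long windowMs) {
    this.maxRetries = maxRetries;
    this.windowMs = windowMs;
  }

  /** Records one AM failure; returns true if the whole app should now fail. */
  boolean onAmFailure(long now) {
    if (lastFailureTime >= 0 && now - lastFailureTime > windowMs) {
      failureCount = 0;   // AM was well behaved for a full window: reset
    }
    failureCount++;
    lastFailureTime = now;
    return failureCount > maxRetries;
  }
}
```

A well-behaved AM whose failures are always spaced further apart than the window can then be restarted forever, while a crash-looping AM still fails after max-retries attempts.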
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Attachment: YARN-2242-070114-1.patch Second version; minimized the change set. A test is not included since this patch only changes an output on the webpage UI, and it can be verified on any AM launch crash. > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch > > > Now, each time an AM container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeException. This is not only unhelpful, > but sometimes confusing. With the help of the log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049546#comment-14049546 ] Hadoop QA commented on YARN-2242: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653509/YARN-2242-070114.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4167//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4167//console This message is automatically generated. 
> Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch > > > Now, each time an AM container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeException. This is not only unhelpful, > but sometimes confusing. With the help of the log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049541#comment-14049541 ] Karthik Kambatla commented on YARN-2241: I am okay with leaving the test in there to avoid regressions of throwing exceptions in the future. > ZKRMStateStore: On startup, show nicer messages when znodes already exist > - > > Key: YARN-2241 > URL: https://issues.apache.org/jira/browse/YARN-2241 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Minor > Attachments: YARN-2241.patch > > > When using the RMZKStateStore, if you restart the RM, you get a bunch of > stack traces with messages like > {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /rmstore}}. This is expected as these nodes already exist > from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2241: --- Summary: ZKRMStateStore: On startup, show nicer messages when znodes already exist (was: Show nicer messages when ZNodes already exist in ZKRMStateStore on startup) > ZKRMStateStore: On startup, show nicer messages when znodes already exist > - > > Key: YARN-2241 > URL: https://issues.apache.org/jira/browse/YARN-2241 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Minor > Attachments: YARN-2241.patch > > > When using the RMZKStateStore, if you restart the RM, you get a bunch of > stack traces with messages like > {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /rmstore}}. This is expected as these nodes already exist > from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049539#comment-14049539 ] Karthik Kambatla commented on YARN-2241: I'll have to retract that +1. The fix is good, but the test doesn't do much. Actually, the test doesn't fail without the fix. Is that intentional? Without this patch, these exceptions are merely logged and not thrown. I am okay with a patch without the test since we are just changing the logging. > Show nicer messages when ZNodes already exist in ZKRMStateStore on startup > -- > > Key: YARN-2241 > URL: https://issues.apache.org/jira/browse/YARN-2241 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Minor > Attachments: YARN-2241.patch > > > When using the RMZKStateStore, if you restart the RM, you get a bunch of > stack traces with messages like > {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /rmstore}}. This is expected as these nodes already exist > from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
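The shape of the fix being reviewed here — treat NodeExists as an expected condition on restart and log a short message instead of a stack trace — can be sketched as below. This is self-contained for illustration: the `Store` interface, method name, and local `NodeExistsException` are stand-ins for ZKRMStateStore's actual API and `org.apache.zookeeper.KeeperException.NodeExistsException`:

```java
// Hypothetical sketch: "node already exists" is benign when the RM restarts,
// so it should produce a one-line message, not a propagated exception.
public class SafeCreate {
  static class NodeExistsException extends Exception {
    NodeExistsException(String path) { super("NodeExists for " + path); }
  }

  interface Store { void create(String path) throws NodeExistsException; }

  /** Returns a log-style message; never propagates NodeExists as an error. */
  static String createRootZnode(Store store, String path) {
    try {
      store.create(path);
      return "Created znode " + path;
    } catch (NodeExistsException e) {
      // Expected on RM restart: the znode survives from the previous run.
      return path + " znode already exists";
    }
  }
}
```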
[jira] [Updated] (YARN-2224) Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow
[ https://issues.apache.org/jira/browse/YARN-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2224: --- Summary: Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow (was: Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings) > Explicitly enable vmem check in > TestContainersMonitor#testContainerKillOnMemoryOverflow > --- > > Key: YARN-2224 > URL: https://issues.apache.org/jira/browse/YARN-2224 > Project: Hadoop YARN > Issue Type: Test > Components: nodemanager >Affects Versions: 2.4.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Trivial > Labels: newbie > Attachments: YARN-2224.patch > > > If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false the test > will fail. Make the test pass not rely on the default settings but just let > it verify that once the setting is turned on it actually does the memory > check. See YARN-2225 which suggests we turn the default off. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049534#comment-14049534 ] Hadoop QA commented on YARN-2142: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653518/trust.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4169//console This message is automatically generated. > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler, webapp >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 > Only in branch-2.2.0. >Reporter: anders >Priority: Minor > Labels: patch > Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, > trust002.patch, trust003.patch, trust2.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of our critical computing environment, we must check every node's TRUST > status in the cluster (we can get the TRUST status through the API of the OAT > server), so I added this feature into Hadoop's scheduler. > Through the TRUST check service, a node can get its own TRUST status and > then, through the heartbeat, send it to the ResourceManager for > scheduling. > In the scheduling step, if a node's TRUST status is 'false', it is > skipped until its TRUST status turns to 'true'. > ***The logic of this feature is similar to the node health check service. > ***Only in branch-2.2.0, not in trunk*** -- This message was sent by Atlassian JIRA (v6.2#6252)
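The flow described in the issue — each node reports a TRUST flag via heartbeat, and the scheduler skips untrusted nodes until their status turns true again — can be sketched as follows. All names here are hypothetical; the actual patch targets branch-2.2.0 scheduler code and queries an OAT server:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: filter schedulable nodes by a heartbeat-reported flag,
// analogous to how the node health check service excludes unhealthy nodes.
public class TrustFilter {
  static class Node {
    final String host;
    boolean trusted;   // would be refreshed from the OAT server on each heartbeat
    Node(String host, boolean trusted) { this.host = host; this.trusted = trusted; }
  }

  /** Nodes eligible for scheduling: only those whose TRUST status is true. */
  static List<Node> schedulable(List<Node> cluster) {
    List<Node> out = new ArrayList<>();
    for (Node n : cluster) {
      if (n.trusted) out.add(n);   // untrusted nodes are skipped entirely
    }
    return out;
  }
}
```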
[jira] [Commented] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049532#comment-14049532 ] Karthik Kambatla commented on YARN-2241: Looks good. +1 > Show nicer messages when ZNodes already exist in ZKRMStateStore on startup > -- > > Key: YARN-2241 > URL: https://issues.apache.org/jira/browse/YARN-2241 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Minor > Attachments: YARN-2241.patch > > > When using the RMZKStateStore, if you restart the RM, you get a bunch of > stack traces with messages like > {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /rmstore}}. This is expected as these nodes already exist > from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
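The fix described in this issue amounts to catching the duplicate-node error and logging one short line instead of a stack trace. A self-contained sketch of that pattern (the {{NodeExistsException}} and {{safeCreate}} helper below are stand-ins for illustration, not the actual ZooKeeper or ZKRMStateStore code):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the "create if absent" pattern from YARN-2241. The real code
// would catch org.apache.zookeeper.KeeperException.NodeExistsException;
// a stand-in exception and an in-memory "store" keep this example runnable.
public class SafeCreateSketch {
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) {
            super("KeeperErrorCode = NodeExists for " + path);
        }
    }

    private final Set<String> znodes = new HashSet<>();

    // Raw create: throws if the znode already exists, like ZooKeeper's create().
    void create(String path) throws NodeExistsException {
        if (!znodes.add(path)) {
            throw new NodeExistsException(path);
        }
    }

    // Wrapper: treats an existing node as success and logs one line
    // instead of printing a full stack trace.
    boolean safeCreate(String path) {
        try {
            create(path);
            return true;  // newly created
        } catch (NodeExistsException e) {
            System.out.println(path + " already exists, skipping create");
            return false; // already present, expected on restart
        }
    }

    public static void main(String[] args) {
        SafeCreateSketch store = new SafeCreateSketch();
        assert store.safeCreate("/rmstore");   // first startup: created
        assert !store.safeCreate("/rmstore");  // restart: exists, no stack trace
    }
}
```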
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049529#comment-14049529 ] Mayank Bansal commented on YARN-2022: - + 1 committing Thanks [~sunilg] for the patch. Thanks [~vinodkv] and [~wangda] for the reviews. Thanks, Mayank > Preempting an Application Master container can be kept as least priority when > multiple applications are marked for preemption by > ProportionalCapacityPreemptionPolicy > - > > Key: YARN-2022 > URL: https://issues.apache.org/jira/browse/YARN-2022 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, > YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, > YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, > Yarn-2022.1.patch > > > Cluster Size = 16GB [2NM's] > Queue A Capacity = 50% > Queue B Capacity = 50% > Consider there are 3 applications running in Queue A which has taken the full > cluster capacity. > J1 = 2GB AM + 1GB * 4 Maps > J2 = 2GB AM + 1GB * 4 Maps > J3 = 2GB AM + 1GB * 2 Maps > Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. > Currently in this scenario, Jobs J3 will get killed including its AM. > It is better if AM can be given least priority among multiple applications. > In this same scenario, map tasks from J3 and J2 can be preempted. > Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijuan Zhang updated YARN-2142: --- Attachment: trust.patch > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler, webapp >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 > Only in branch-2.2.0. >Reporter: anders >Priority: Minor > Labels: patch > Attachments: trust.patch, trust.patch, trust.patch, trust001.patch, > trust002.patch, trust003.patch, trust2.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of critical computing environment ,we must test every node's TRUST > status in the cluster (We can get the TRUST status by the API of OAT > sever),So I add this feature into hadoop's schedule . > By the TRUST check service ,node can get the TRUST status of itself, > then through the heartbeat ,send the TRUST status to resource manager for > scheduling. > In the scheduling step,if the node's TRUST status is 'false', it will be > abandoned until it's TRUST status turn to 'true'. > ***The logic of this feature is similar to node's health checkservice. > ***Only in branch-2.2.0 , not in trunk*** -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2131) Add a way to nuke the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2131: Attachment: YARN-2131.patch The patch adds a {{deleteStore()}} method to RMStateStore and implementations for the ZKRMStateStore and the FileSystemRMStateStore; this gets called when you run {{yarn resourcemanager -format}}. I also added a unit test and verified that it works in a cluster with the ZKRMStateStore and also the FileSystemRMStateStore with both the local FS and HDFS. > Add a way to nuke the RMStateStore > -- > > Key: YARN-2131 > URL: https://issues.apache.org/jira/browse/YARN-2131 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter > Attachments: YARN-2131.patch > > > There are cases when we don't want to recover past applications, but recover > applications going forward. To do this, one has to clear the store. Today, > there is no easy way to do this and users should understand how each store > works. -- This message was sent by Atlassian JIRA (v6.2#6252)
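As a rough illustration of what formatting a state store does, the sketch below models a {{deleteStore()}}-style operation as a recursive delete of the store's root path over an in-memory set of paths. It is hypothetical, not the code in the attached patch.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of "yarn resourcemanager -format": remove the store's
// root znode/directory and everything beneath it, leaving other paths alone.
public class FormatSketch {
    static boolean isUnder(String root, String path) {
        return path.equals(root) || path.startsWith(root + "/");
    }

    // Recursively delete the root and all of its descendants.
    static void deleteStore(NavigableSet<String> paths, String root) {
        paths.removeIf(p -> isUnder(root, p));
    }

    public static void main(String[] args) {
        NavigableSet<String> paths = new TreeSet<>();
        paths.add("/rmstore");
        paths.add("/rmstore/ZKRMStateRoot");
        paths.add("/rmstore/ZKRMStateRoot/RMAppRoot");
        paths.add("/other");
        deleteStore(paths, "/rmstore");
        // Only the unrelated path survives the format.
        assert paths.size() == 1 && paths.contains("/other");
    }
}
```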
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijuan Zhang updated YARN-2142: --- Attachment: (was: test.patch) > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler, webapp >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 > Only in branch-2.2.0. >Reporter: anders >Priority: Minor > Labels: patch > Attachments: trust.patch, trust.patch, trust001.patch, > trust002.patch, trust003.patch, trust2.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of critical computing environment ,we must test every node's TRUST > status in the cluster (We can get the TRUST status by the API of OAT > sever),So I add this feature into hadoop's schedule . > By the TRUST check service ,node can get the TRUST status of itself, > then through the heartbeat ,send the TRUST status to resource manager for > scheduling. > In the scheduling step,if the node's TRUST status is 'false', it will be > abandoned until it's TRUST status turn to 'true'. > ***The logic of this feature is similar to node's health checkservice. > ***Only in branch-2.2.0 , not in trunk*** -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijuan Zhang updated YARN-2142: --- Attachment: (was: trust.patch) > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler, webapp >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 > Only in branch-2.2.0. >Reporter: anders >Priority: Minor > Labels: patch > Attachments: trust.patch, trust.patch, trust001.patch, > trust002.patch, trust003.patch, trust2.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of critical computing environment ,we must test every node's TRUST > status in the cluster (We can get the TRUST status by the API of OAT > sever),So I add this feature into hadoop's schedule . > By the TRUST check service ,node can get the TRUST status of itself, > then through the heartbeat ,send the TRUST status to resource manager for > scheduling. > In the scheduling step,if the node's TRUST status is 'false', it will be > abandoned until it's TRUST status turn to 'true'. > ***The logic of this feature is similar to node's health checkservice. > ***Only in branch-2.2.0 , not in trunk*** -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-611: --- Attachment: YARN-611.1.patch > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. 
If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
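The window mechanism proposed in this description can be sketched as a small self-contained class: a failure counts toward max-retries only if it happened within the last retry-count-window-ms, and older failures are forgotten, which is equivalent to resetting the count. Names like {{FailureWindow}} are illustrative, not the actual YARN code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the proposed AM retry-count window. Failures outside the
// window are dropped, so a long-running, well-behaved AM never exhausts
// its retries, while a crash-looping AM still fails quickly.
public class FailureWindow {
    private final long windowMs;
    private final int maxRetries;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureWindow(long windowMs, int maxRetries) {
        this.windowMs = windowMs;
        this.maxRetries = maxRetries;
    }

    // Record a failure at time 'now' (ms) and report whether the
    // application should now be failed for good.
    boolean recordFailureAndCheck(long now) {
        failureTimes.addLast(now);
        // Forget failures that fell outside the window.
        while (now - failureTimes.peekFirst() > windowMs) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() > maxRetries;
    }

    public static void main(String[] args) {
        FailureWindow w = new FailureWindow(1000, 2);
        assert !w.recordFailureAndCheck(0);    // 1 failure in window
        assert !w.recordFailureAndCheck(100);  // 2 failures: at max-retries
        assert w.recordFailureAndCheck(200);   // 3 failures inside the window
        assert !w.recordFailureAndCheck(5000); // window passed: count reset
    }
}
```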
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049512#comment-14049512 ] Xuan Gong commented on YARN-611: Here is my proposal: We can make this resetCountPolicy pluggable (if users have other requirements, we can implement more policies for them). For now, we will provide WindowsSlideAMRetryCountResetPolicy. To use this policy, users instantiate it with a parameter that defines the period of time, in milliseconds, after which the AM retry count will be reset, and they can put this policy into ApplicationSubmissionContext so that the RMApp and RMAppAttempt can retrieve and use it. Also, we need to change how we decide whether this AppAttempt is the lastRetry. We can use: {code} maxAppAttempts == (getNumFailedAppAttempts() + 1 - this.attemptResetCount) {code} to do the calculation. Note: getNumFailedAppAttempts() counts how many previous attempts really failed (excluding preemption, NM resync, hardware errors, and RM restart/failover). this.attemptResetCount tracks the number of failures that we should reset; every resetCountPolicy should provide a way to calculate this number on time. For WindowsSlideAMRetryCountResetPolicy, after the AM has run successfully for a period of time, we can set this.attemptResetCount to the number of previous attempts that really failed. 
Also, we need to provide a way to re-build the this.attemptResetCount value when RM restart/failover happens > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. 
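The lastRetry calculation from the proposal above can be sketched in isolation. Only the formula mirrors the comment; the class and method names here are illustrative, not the actual RMAppAttempt code.

```java
// Sketch of the proposed last-retry check: failures forgiven by the reset
// policy (attemptResetCount) are subtracted back out, so a long-lived AM
// never exhausts its attempts.
public class LastRetrySketch {
    static boolean isLastRetry(int maxAppAttempts, int numFailedAppAttempts,
                               int attemptResetCount) {
        return maxAppAttempts == (numFailedAppAttempts + 1 - attemptResetCount);
    }

    public static void main(String[] args) {
        // 2 real failures, none reset: the 3rd attempt is the last one.
        assert isLastRetry(3, 2, 0);
        // Same 2 failures, both forgiven by the reset policy: not the last.
        assert !isLastRetry(3, 2, 2);
    }
}
```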
If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049513#comment-14049513 ] Xuan Gong commented on YARN-611: Upload a patch for this proposal > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. 
If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Attachment: YARN-2242-070114.patch > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2242-070114.patch > > > Now on each time AM Container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, > but sometimes confusing. With the help of log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Attachment: (was: YARN-2242-070114.patch) > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > > Now on each time AM Container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, > but sometimes confusing. With the help of log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Attachment: YARN-2242-070114.patch This patch disables the confusing output of ShellExitCodeException, and "points to" the logs generated by each container attempt. For console users, the new exception information reports the URL where the application is traced. Then the exception information reminds users to click on the links in the same page to check out exception details. > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2242-070114.patch > > > Now on each time AM Container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, > but sometimes confusing. With the help of log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049487#comment-14049487 ] Hadoop QA commented on YARN-2241: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653496/YARN-2241.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4165//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4165//console This message is automatically generated. 
> Show nicer messages when ZNodes already exist in ZKRMStateStore on startup > -- > > Key: YARN-2241 > URL: https://issues.apache.org/jira/browse/YARN-2241 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Minor > Attachments: YARN-2241.patch > > > When using the RMZKStateStore, if you restart the RM, you get a bunch of > stack traces with messages like > {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /rmstore}}. This is expected as these nodes already exist > from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2244) FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot moved MAPREDUCE-5955 to YARN-2244: Component/s: (was: scheduler) fairscheduler Key: YARN-2244 (was: MAPREDUCE-5955) Project: Hadoop YARN (was: Hadoop Map/Reduce) > FairScheduler missing fixes made in other schedulers in patch MAPREDUCE-3596 > - > > Key: YARN-2244 > URL: https://issues.apache.org/jira/browse/YARN-2244 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > > We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Important > fixes in that include handling unknown containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049469#comment-14049469 ] Hudson commented on YARN-1713: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5805 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5805/]) YARN-1713. Added get-new-app and submit-app functionality to RM web services. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607216) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ApplicationSubmissionContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ContainerLaunchContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CredentialsInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LocalResourceInfo.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NewApplication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ResourceInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm > Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Fix For: 2.5.0 > > Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, > apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, > apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu reassigned YARN-2242: --- Assignee: Li Lu > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > > Now on each time AM Container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, > but sometimes confusing. With the help of log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2243) Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor
Ted Yu created YARN-2243: Summary: Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor Key: YARN-2243 URL: https://issues.apache.org/jira/browse/YARN-2243 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor {code} public SchedulerApplicationAttempt(ApplicationAttemptId applicationAttemptId, String user, Queue queue, ActiveUsersManager activeUsersManager, RMContext rmContext) { Preconditions.checkNotNull("RMContext should not be null", rmContext); {code} Order of arguments is wrong for Preconditions.checkNotNull(). -- This message was sent by Atlassian JIRA (v6.2#6252)
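As the report notes, Guava's {{Preconditions.checkNotNull(reference, errorMessage)}} takes the reference first, so the call above checks the String literal (which is never null) and silently ignores {{rmContext}}. A self-contained illustration of the corrected argument order, using the JDK's {{java.util.Objects.requireNonNull}} in place of the Guava dependency (it has the same reference-first, message-second signature); the {{checkedAssign}} helper is hypothetical:

```java
import java.util.Objects;

// Demonstrates the corrected argument order for a null check:
// reference first, error message second. With the arguments swapped,
// the check validates the message string and can never fail.
public class CheckNotNullOrder {
    static Object checkedAssign(Object rmContext) {
        return Objects.requireNonNull(rmContext, "RMContext should not be null");
    }

    public static void main(String[] args) {
        // A non-null reference passes through unchanged.
        assert "ctx".equals(checkedAssign("ctx"));
        boolean threw = false;
        try {
            checkedAssign(null);
        } catch (NullPointerException expected) {
            threw = true; // null is now actually rejected
        }
        assert threw;
    }
}
```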
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Description: Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a "pointer" to the aggregated logs to the programmer when reporting exception information. (was: Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a "pointer" to the aggregated logs to the programmer. ) > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu > > Now on each time AM Container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, > but sometimes confusing. With the help of log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Description: Currently, each time the AM container crashes during launch, both the console and the web UI report only a ShellExitCodeException. This is not only unhelpful, but sometimes confusing. With the help of the log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a "pointer" to the aggregated logs to the programmer. > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu > > Currently, each time the AM container crashes during launch, both the console and the > web UI report only a ShellExitCodeException. This is not only unhelpful, > but sometimes confusing. With the help of the log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2233: -- Component/s: resourcemanager Priority: Blocker (was: Major) Target Version/s: 2.5.0 Marked for 2.5 and making it a blocker as I'd like to get it in to make RM web-services usable.. > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2233.0.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2242) Improve exception information on AM launch crashes
Li Lu created YARN-2242: --- Summary: Improve exception information on AM launch crashes Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049454#comment-14049454 ] Vinod Kumar Vavilapalli commented on YARN-1713: --- +1 looks good. Compiled the docs and read them - seem fine. Checking this in.. > Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, > apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, > apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2241: Attachment: YARN-2241.patch The Exception catching was simply in the wrong place; I moved it to the right place and it now prints a nicer DEBUG message instead of the exceptions/stack traces. I also added a unit test. > Show nicer messages when ZNodes already exist in ZKRMStateStore on startup > -- > > Key: YARN-2241 > URL: https://issues.apache.org/jira/browse/YARN-2241 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Minor > Attachments: YARN-2241.patch > > > When using the RMZKStateStore, if you restart the RM, you get a bunch of > stack traces with messages like > {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /rmstore}}. This is expected as these nodes already exist > from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
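The shape of the fix described above can be sketched as follows. This is a hypothetical illustration, not the actual YARN-2241 patch; the method name safeCreate is invented, and KeeperException is a local stand-in for org.apache.zookeeper.KeeperException so the example is self-contained.

```java
// Sketch: tolerate znodes that already exist when recreating the store
// layout on RM startup, logging at DEBUG instead of dumping a stack trace.
class KeeperException extends Exception {
    static class NodeExistsException extends KeeperException {}
}

public class ZkStoreStartupSketch {
    static boolean exists = true; // simulate /rmstore surviving a previous RM run

    static void create(String path) throws KeeperException {
        if (exists) {
            throw new KeeperException.NodeExistsException();
        }
    }

    static void safeCreate(String path) throws KeeperException {
        try {
            create(path);
        } catch (KeeperException.NodeExistsException e) {
            // Expected after a restart: the node is already there.
            System.out.println("DEBUG: znode " + path + " already exists, skipping create");
        }
    }

    public static void main(String[] args) throws KeeperException {
        safeCreate("/rmstore");
    }
}
```

Other KeeperException subtypes still propagate, so genuine ZooKeeper failures are not swallowed.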
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049414#comment-14049414 ] bc Wong commented on YARN-941: -- I'm fine with [~xgong]'s solution. I'd still like to see something more generic to make tokens (HDFS token, HBase token, etc.) work with long-running apps though. Perhaps I'll pursue the "arbitrary expiration time" approach in another jira.
{quote}
RPC privacy is a very expensive solution for AM-RM communication. First, it needs setup so AM/RM have access to key infrastructure - having this burden on all applications is not reasonable. This is compounded by the fact that we use AMRMTokens in non-secure mode too. Second, AM - RM communication is a very chatty protocol; it's likely the overhead is huge.
{quote}
True, security is often costly. The web/consumer industry went through the same exercise with HTTP vs HTTPS. You can get at least 10x better performance with HTTP. But in the end, everybody decided that it's worth it. And passing tokens around without RPC privacy is just like sending passwords around over HTTP without SSL.
{quote}
Unfortunately with long running services (the focus of this JIRA), this attack and its success is not as unlikely. This is the very reason why we roll master-keys every so often in the first place.
{quote}
With the rolling master key, it's unlikely for the attacker to gather enough ciphertext to mount that attack. Besides, a longer key would require so much computation to attack that it'd be infeasible. Anyway, I appreciate your response, and I'll follow up in another jira. 
> RM Should have a way to update the tokens it has for a running application > -- > > Key: YARN-941 > URL: https://issues.apache.org/jira/browse/YARN-941 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Joseph Evans >Assignee: Xuan Gong > Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, > YARN-941.preview.4.patch, YARN-941.preview.patch > > > When an application is submitted to the RM it includes with it a set of > tokens that the RM will renew on behalf of the application, that will be > passed to the AM when the application is launched, and will be used when > launching the application to access HDFS to download files on behalf of the > application. > For long lived applications/services these tokens can expire, and then the > tokens that the AM has will be invalid, and the tokens that the RM had will > also not work to launch a new AM. > We need to provide an API that will allow the RM to replace the current > tokens for this application with a new set. To avoid any real race issues, I > think this API should be something that the AM calls, so that the client can > connect to the AM with a new set of tokens it got using kerberos, then the AM > can inform the RM of the new set of tokens and quickly update its tokens > internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2241: Component/s: resourcemanager > Show nicer messages when ZNodes already exist in ZKRMStateStore on startup > -- > > Key: YARN-2241 > URL: https://issues.apache.org/jira/browse/YARN-2241 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Minor > > When using the RMZKStateStore, if you restart the RM, you get a bunch of > stack traces with messages like > {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /rmstore}}. This is expected as these nodes already exist > from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
Robert Kanter created YARN-2241: --- Summary: Show nicer messages when ZNodes already exist in ZKRMStateStore on startup Key: YARN-2241 URL: https://issues.apache.org/jira/browse/YARN-2241 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Minor When using the RMZKStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected as these nodes already exist from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049381#comment-14049381 ] Tsuyoshi OZAWA commented on YARN-2229: -- Sorry for the repeated compile errors. The attached patch works well on my local machine. Let me try again. > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049373#comment-14049373 ] Hadoop QA commented on YARN-2229: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653475/YARN-2229.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4164//console This message is automatically generated. > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.2.patch > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049350#comment-14049350 ] Tsuyoshi OZAWA commented on YARN-2229: -- No, it isn't. Changing the containerId type from int to long can break backward compatibility, because {{ConverterUtils#toContainerId(str)}} cannot parse a container id string that carries a 64-bit containerId. > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.2.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
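The 10/22-bit split under discussion can be illustrated with a small sketch of the bit arithmetic (this is not YARN's actual implementation, just the layout described in the issue):

```java
public class ContainerIdBits {
    // Per the description: within a 32-bit id, the upper 10 bits carry the
    // RM restart epoch and the lower 22 bits the container sequence number.
    static final int SEQ_BITS = 22;
    static final int SEQ_MASK = (1 << SEQ_BITS) - 1;

    static int pack(int epoch, int seq) {
        return (epoch << SEQ_BITS) | (seq & SEQ_MASK);
    }

    static int epochOf(int id) { return id >>> SEQ_BITS; }
    static int seqOf(int id)   { return id & SEQ_MASK; }

    public static void main(String[] args) {
        int id = pack(3, 42);
        System.out.println(epochOf(id) + " " + seqOf(id)); // 3 42

        // The overflow concern raised here: 10 bits hold only 1024 epochs,
        // so after the RM restarts 1024 times the epoch field wraps to 0.
        System.out.println(epochOf(pack(1024, 42))); // 0
    }
}
```

Widening the id to a long would leave room for a much larger epoch field, which is exactly why parsing of old 32-bit id strings becomes a compatibility question.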
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049326#comment-14049326 ] Hadoop QA commented on YARN-1713: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653450/apache-yarn-1713.10.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4163//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4163//console This message is automatically generated. 
> Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, > apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, > apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049308#comment-14049308 ] Jian He commented on YARN-2229: --- This patch is supposed to change containerId type from int to long? > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.2.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049300#comment-14049300 ] Li Lu commented on YARN-675: [~zjshen], I'd like to work on this. Would you mind if I take this over? Thanks! > In YarnClient, pull AM logs on AM container failure > --- > > Key: YARN-675 > URL: https://issues.apache.org/jira/browse/YARN-675 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Affects Versions: 2.0.4-alpha >Reporter: Sandy Ryza >Assignee: Zhijie Shen > > Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to > pull its logs from the NM to the client so that they can be displayed > immediately to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-1713: Attachment: apache-yarn-1713.10.patch
bq. XmlRootElement for ApplicationId -> NewApplication
Fixed.
bq. Rename refs to AppId: {Cluster ApplicationId API} in the documentation. Need to fix all this documentation to not say ApplicationID. Similarly rename http:///ws/v1/cluster/apps/id
Fixed.
bq. I think you should create a writable APIs section in the doc, add a disclaimer saying this is alpha+public-unstable and then put the new APIs in there, so we can let it bake in for a release or two.
Fixed.
> Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, > apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, > apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049223#comment-14049223 ] Anubhav Dhoot commented on YARN-2175: - I should clarify the AM can kill this container manually but each AM will have to implement this logic to detect when localization takes longer and kill when its taking too long. Updating description. We can make it much simpler for administrators and AM writers by having an automatic way to mitigate this. The NodeManager knows each state of the container. Instead of having a back and forth between AM and NM, it will be easier if we just let this be done by NM. We can start with a configurable timeout with a reasonable default. In future we can add ability in the AM to override this during the container request. Lemme know what you guys think. > Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization for example could take a long time > and there is no automated way to kill an task if its stuck in these states. > These may have nothing to do with the task itself and could be an issue > within the platform. > Ideally there should be configurable limits for various states within the > NodeManager to limit various states. The RM does not care about most of these > and its only between AM and the NM. We can start by making these global > configurable defaults and in future we can make it fancier by letting AM > override them in the start container request. > This jira will be used to limit localization time and we can open others if > we feel we need to limit other operations. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2175: Description: There are no timeouts that can be used to limit the time taken by various container startup operations. Localization for example could take a long time and there is no automated way to kill an task if its stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for various states within the NodeManager to limit various states. The RM does not care about most of these and its only between AM and the NM. We can start by making these global configurable defaults and in future we can make it fancier by letting AM override them in the start container request. This jira will be used to limit localization time and we open others if we feel we need to limit other operations. was: There are no timeouts that can be used to limit the time taken by various container startup operations. Localization for example could take a long time and there is no way to kill an task if its stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for various states within the NodeManager to limit various states. The RM does not care about most of these and its only between AM and the NM. We can start by making these global configurable defaults and in future we can make it fancier by letting AM override them in the start container request. This jira will be used to limit localization time and we open others if we feel we need to limit other operations. 
> Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization for example could take a long time > and there is no automated way to kill an task if its stuck in these states. > These may have nothing to do with the task itself and could be an issue > within the platform. > Ideally there should be configurable limits for various states within the > NodeManager to limit various states. The RM does not care about most of these > and its only between AM and the NM. We can start by making these global > configurable defaults and in future we can make it fancier by letting AM > override them in the start container request. > This jira will be used to limit localization time and we open others if we > feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2175: Description: There are no timeouts that can be used to limit the time taken by various container startup operations. Localization for example could take a long time and there is no automated way to kill an task if its stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for various states within the NodeManager to limit various states. The RM does not care about most of these and its only between AM and the NM. We can start by making these global configurable defaults and in future we can make it fancier by letting AM override them in the start container request. This jira will be used to limit localization time and we can open others if we feel we need to limit other operations. was: There are no timeouts that can be used to limit the time taken by various container startup operations. Localization for example could take a long time and there is no automated way to kill an task if its stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for various states within the NodeManager to limit various states. The RM does not care about most of these and its only between AM and the NM. We can start by making these global configurable defaults and in future we can make it fancier by letting AM override them in the start container request. This jira will be used to limit localization time and we open others if we feel we need to limit other operations. 
> Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization for example could take a long time > and there is no automated way to kill an task if its stuck in these states. > These may have nothing to do with the task itself and could be an issue > within the platform. > Ideally there should be configurable limits for various states within the > NodeManager to limit various states. The RM does not care about most of these > and its only between AM and the NM. We can start by making these global > configurable defaults and in future we can make it fancier by letting AM > override them in the start container request. > This jira will be used to limit localization time and we can open others if > we feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2224) Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings
[ https://issues.apache.org/jira/browse/YARN-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049207#comment-14049207 ] Karthik Kambatla commented on YARN-2224: +1 I wish there were a simpler solution. > Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective > of the default settings > - > > Key: YARN-2224 > URL: https://issues.apache.org/jira/browse/YARN-2224 > Project: Hadoop YARN > Issue Type: Test > Components: nodemanager >Affects Versions: 2.4.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Trivial > Labels: newbie > Attachments: YARN-2224.patch > > > If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false the test > will fail. Make the test pass not rely on the default settings but just let > it verify that once the setting is turned on it actually does the memory > check. See YARN-2225 which suggests we turn the default off. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2224) Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings
[ https://issues.apache.org/jira/browse/YARN-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2224: --- Component/s: nodemanager Priority: Trivial (was: Major) Target Version/s: 2.5.0 Affects Version/s: 2.4.1 Labels: newbie (was: ) Issue Type: Test (was: Bug) > Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective > of the default settings > - > > Key: YARN-2224 > URL: https://issues.apache.org/jira/browse/YARN-2224 > Project: Hadoop YARN > Issue Type: Test > Components: nodemanager >Affects Versions: 2.4.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Trivial > Labels: newbie > Attachments: YARN-2224.patch > > > If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false the test > will fail. Make the test pass not rely on the default settings but just let > it verify that once the setting is turned on it actually does the memory > check. See YARN-2225 which suggests we turn the default off. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049199#comment-14049199 ] Vinod Kumar Vavilapalli commented on YARN-1713: --- Looks much better. Final set of nits:
- XmlRootElement for ApplicationId -> NewApplication
- Rename refs to AppId: {Cluster ApplicationId API} in the documentation. Need to fix all this documentation to not say ApplicationID.
- Similarly rename http:///ws/v1/cluster/apps/id
- I think you should create a writable APIs section in the doc, add a disclaimer saying this is alpha+public-unstable and then put the new APIs in there, so we can let it bake in for a release or two.
> Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, > apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, > apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049175#comment-14049175 ] Jian He commented on YARN-2001: --- bq. Insufficient state etc. Right. Found another issue: the RM may receive the release-container request (sent by the AM on resync) before the containers are actually recovered. So we need to make sure the previous release request is also processed correctly on recovery. > Threshold for RM to accept requests from AM after failover > -- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2001.1.patch > > > After failover, the RM may require a certain threshold to determine whether it’s > safe to make scheduling decisions and start accepting new container requests > from AMs. The threshold could be a certain number of nodes, i.e. the RM waits > until a certain number of nodes have joined before accepting new container > requests. Or it could simply be a timeout; only after the timeout does the RM accept > new requests. > NMs joined after the threshold can be treated as new NMs and instructed to > kill all their containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049167#comment-14049167 ] Jian He commented on YARN-1366: --- - SecurityUtil.java loads configurations during class loading. I see. Patch looks good to me, just two more minor comments: - Can these two synchronized blocks be merged into one? {code} synchronized (this) { // reset lastResponseId to 0 lastResponseId = 0; release.addAll(this.pendingRelease); blacklistAdditions.addAll(this.blacklistedNodes); } // re-register with RM registerApplicationMaster(); synchronized (this) { for (Map<String, TreeMap<Resource, ResourceRequestInfo>> rr : remoteRequestsTable.values()) { for (Map<Resource, ResourceRequestInfo> capabalities : rr.values()) { for (ResourceRequestInfo request : capabalities.values()) { addResourceRequestToAsk(request.remoteRequest); } } } } {code} - The following reset of responseId in unregisterApplicationMaster is not needed? {code} synchronized (this) { // reset lastResponseId to 0 lastResponseId = 0; } {code} > AM should implement Resync with the ApplicationMasterService instead of > shutting down > - > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, > YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, > YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, > YARN-1366.prototype.patch, YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0 and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. 
The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
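The resync contract described above (reset the allocate RPC sequence number to 0, then re-send the entire outstanding request) can be sketched in plain Java. The class and field names below are illustrative stand-ins, not the real AMRMClient code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in sketch of the resync step: reset the allocate sequence number
// and re-queue the full outstanding state under a single lock, so no
// concurrent allocate() call can interleave between the two steps.
public class ResyncSketch {
  private int lastResponseId = 42;  // stale sequence number from before failover
  private final List<String> pendingRelease =
      new ArrayList<>(Arrays.asList("c1", "c2"));
  private final List<String> release = new ArrayList<>();

  synchronized void resync() {
    lastResponseId = 0;              // resync resets the RPC sequence to 0
    release.addAll(pendingRelease);  // re-send every outstanding release request
  }

  public static void main(String[] args) {
    ResyncSketch s = new ResyncSketch();
    s.resync();
    System.out.println(s.lastResponseId + " " + s.release);
  }
}
```

Making the whole method synchronized captures the intent of Jian He's comment above: the reset and the re-queueing should be atomic with respect to other callers.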
[jira] [Commented] (YARN-2240) yarn logs can get corrupted if the aggregator does not have permissions to the log file it tries to read
[ https://issues.apache.org/jira/browse/YARN-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049104#comment-14049104 ] Vinod Kumar Vavilapalli commented on YARN-2240: --- [~mitdesai], this is interesting. We had seen a bunch of errors that we couldn't find the root cause for. Mind pasting the exception messages that you see on the client or the error message in the logs? > yarn logs can get corrupted if the aggregator does not have permissions to > the log file it tries to read > > > Key: YARN-2240 > URL: https://issues.apache.org/jira/browse/YARN-2240 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.5.0 >Reporter: Mit Desai > > When the log aggregator is aggregating the logs, it writes the file length > first. Then tries to open the log file and if it does not have permission to > do that, it ends up just writing an error message to the aggregated logs. > The mismatch between the file length and the actual length here makes the > aggregated logs corrupted. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2139: -- Attachment: Disk_IO_Scheduling_Design_1.pdf Attach a design draft. > Add support for disk IO isolation/scheduling for containers > --- > > Key: YARN-2139 > URL: https://issues.apache.org/jira/browse/YARN-2139 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: Disk_IO_Scheduling_Design_1.pdf > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2240) yarn logs can get corrupted if the aggregator does not have permissions to the log file it tries to read
Mit Desai created YARN-2240: --- Summary: yarn logs can get corrupted if the aggregator does not have permissions to the log file it tries to read Key: YARN-2240 URL: https://issues.apache.org/jira/browse/YARN-2240 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Mit Desai When the log aggregator is aggregating the logs, it writes the file length first. It then tries to open the log file and, if it does not have permission to do so, it ends up writing just an error message to the aggregated logs. The mismatch between the recorded file length and the actual length makes the aggregated logs corrupted. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048910#comment-14048910 ] Mit Desai commented on YARN-2146: - I looked at it. The problem is due to a corner case in the fix. I will file another JIRA to track the issue. Thanks [~airbots] > Yarn logs aggregation error > --- > > Key: YARN-2146 > URL: https://issues.apache.org/jira/browse/YARN-2146 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chen He > > When I run "yarn logs -applicationId application_xxx > /tmp/application_xxx", > it creates the file, shows part of the logs on the terminal screen, and reports the > following error: > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Long.parseLong(Long.java:430) > at java.lang.Long.parseLong(Long.java:483) > at > org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) > at > org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) > at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) > at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199) -- This message was sent by Atlassian JIRA (v6.2#6252)
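The NumberFormatException above is consistent with the length-mismatch failure mode described in YARN-2240: the reader trusts a recorded length, reads past (or short of) the real record boundary, and then tries to parse payload bytes as the next length field. A minimal, self-contained illustration of that mechanism — using a hypothetical length-prefixed layout, not the actual AggregatedLogFormat:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LengthPrefixDemo {
  public static void main(String[] args) throws IOException {
    // The writer recorded length 25, but only a short error message was
    // actually written (mimicking the permission failure), followed by
    // the next record ("7\nnextlog").
    String record = "25\nCannot read file\n" + "7\nnextlog";
    BufferedReader r = new BufferedReader(new StringReader(record));

    int len = Integer.parseInt(r.readLine());  // recorded length: 25
    char[] buf = new char[len];
    int n = 0;
    while (n < len) {                          // read exactly `len` chars,
      int k = r.read(buf, n, len - n);         // running past the real message
      if (k < 0) break;
      n += k;
    }

    try {
      Long.parseLong(r.readLine());            // lands mid-payload
    } catch (NumberFormatException e) {
      System.out.println("corrupted");         // the reader sees garbage
    }
  }
}
```

The parse failure here is the same shape as the `Long.parseLong` frame in the stack trace above.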
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048865#comment-14048865 ] Zhijie Shen commented on YARN-2233: --- Thanks Varun for the patch. In general, the patch looks good, and I like the detailed test cases :-) Here are some points I'd like you to help clarify: 1. bq. It should be noted that when cancelling a token, the token to be cancelled is specified by setting a header. Any reason for specifying the token in a header? If there's something non-intuitive, maybe we should have some in-code comments for other developers? 2. The RPC get-delegation-token API doesn't have these fields, but it seems nice to have them. We may want to file a Jira. {code} +long currentExpiration = ident.getIssueDate() + tokenRenewInterval; +long maxValidity = ident.getMaxDate(); {code} 3. Is it possible to reuse KerberosTestUtils in hadoop-auth? 4. Is this supposed to test an invalid request body? It doesn't look like the invalid body construction in the later tests. {code} +response = +resource().path("ws").path("v1").path("cluster") + .path("delegation-token").accept(contentType) + .entity(dtoken, mediaType).post(ClientResponse.class); +assertEquals(Status.BAD_REQUEST, response.getClientResponseStatus()); {code} Some minor issues: 1. No need for "== true". {code} +if (usePrincipal == true) { {code} Similarly, {code} +if (KerberosAuthenticationHandler.TYPE.equals(authType) == false) { {code} 2. If I remember correctly, callerUGI.doAs will throw UndeclaredThrowableException, which wraps the real raised exception. However, UndeclaredThrowableException is a RuntimeException, so this code cannot capture it. 
{code} +try { + resp = + callerUGI +.doAs(new PrivilegedExceptionAction<GetDelegationTokenResponse>() { + @Override + public GetDelegationTokenResponse run() throws IOException, + YarnException { +GetDelegationTokenRequest createReq = +GetDelegationTokenRequest.newInstance(renewer); +return rm.getClientRMService().getDelegationToken(createReq); + } +}); +} catch (Exception e) { + LOG.info("Create delegation token request failed", e); + throw e; +} {code} 3. Can't we simply return respToken? The framework should generate the "OK" status automatically, right? {code} +return Response.status(Status.OK).entity(respToken).build(); {code} 4. You can call tk.decodeIdentifier directly. {code} +RMDelegationTokenIdentifier ident = new RMDelegationTokenIdentifier(); +ByteArrayInputStream buf = new ByteArrayInputStream(tk.getIdentifier()); +DataInputStream in = new DataInputStream(buf); +ident.readFields(in); {code} > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2233.0.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
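On the UndeclaredThrowableException point above: when a call throws a checked exception its interface does not declare, the JDK wraps it in java.lang.reflect.UndeclaredThrowableException, and the real cause must be unwrapped via getCause(). The mechanism can be reproduced without Hadoop using a dynamic proxy; this is an illustration of the JDK behavior, not of UserGroupInformation itself:

```java
import java.lang.reflect.Proxy;
import java.lang.reflect.UndeclaredThrowableException;

public class UnwrapDemo {
  // The interface declares no checked exceptions, analogous to a doAs
  // action whose declared exceptions don't cover the one actually thrown.
  interface Action { String run(); }

  public static void main(String[] args) {
    Action proxy = (Action) Proxy.newProxyInstance(
        Action.class.getClassLoader(),
        new Class<?>[] { Action.class },
        (p, m, a) -> { throw new Exception("real cause"); });
    try {
      proxy.run();
    } catch (UndeclaredThrowableException ute) {
      // A catch (Exception e) would also catch this wrapper, but it only
      // sees the wrapper type; the real exception is the cause.
      System.out.println(ute.getCause().getMessage());
    }
  }
}
```

This is why a catch block that inspects only the caught exception's own type can miss the real failure.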
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048774#comment-14048774 ] Varun Vasudev commented on YARN-1713: - The test failure is unrelated. > Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, > apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, > apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048727#comment-14048727 ] Hadoop QA commented on YARN-1713: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653360/apache-yarn-1713.9.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4162//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4162//console This message is automatically generated. 
> Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, > apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, > apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048714#comment-14048714 ] Hadoop QA commented on YARN-2228: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653356/YARN-2228.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4161//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4161//console This message is automatically generated. 
> TimelineServer should load pseudo authentication filter when authentication = > simple > > > Key: YARN-2228 > URL: https://issues.apache.org/jira/browse/YARN-2228 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2228.1.patch > > > When kerberos authentication is not enabled, we should let the timeline > server to work with pseudo authentication filter. In this way, the sever is > able to detect the request user by checking "user.name". > On the other hand, timeline client should append "user.name" in un-secure > case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1565) Add a way for YARN clients to get critical YARN system properties from the RM
[ https://issues.apache.org/jira/browse/YARN-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048704#comment-14048704 ] Steve Loughran commented on YARN-1565: -- I think this should be part of the REST API: we just publish some JSON that provides this information to local and remote systems # the values listed above # all the special expanded variables you can use in command creation # a select subset of YARN/Hadoop properties: defaultFS, yarn.vmem, and some other props we think are useful for clients and debugging. We shouldn't publish the whole aggregate -site.xml values, as that can leak private keys to object stores. > Add a way for YARN clients to get critical YARN system properties from the RM > - > > Key: YARN-1565 > URL: https://issues.apache.org/jira/browse/YARN-1565 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Steve Loughran > > If you are trying to build up an AM request, you need to know > # the limits of memory, core &c for the chosen queue > # the existing YARN classpath > # the path separator for the target platform (so your classpath comes out > right) > # cluster OS: in case you need some OS-specific changes > The classpath can be in yarn-site.xml, but a remote client may not have that. > The site-xml file doesn't list Queue resource limits, cluster OS or the path > separator. > A way to query the RM for these values would make it easier for YARN clients > to build up AM submissions with less guesswork and client-side config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-1713: Attachment: apache-yarn-1713.9.patch bq.dao.AppId -> NewApplication? Similarly createApplicationId() -> createNewApplication() and createNewAppId() too in RMWebServices. Fixed. bq.Similarly rename vars in the test-case. Fixed. bq.AppSubmissionContextInfo -> AppSubmissionSubmissionContextInfo Renamed to ApplicationSubmissionContextInfo. bq.ContainerLaunchContextInfo 's XML element name 'containerinfo' needs to be updated. Fixed. bq.ResourceInfo.vCores -> virtualCores with xml name as virtual-cores This field is already being used as part of a published API so we probably should leave it as is. bq.CredentialsInfo.delegation-tokens -> simply tokens Fixed. bq.Can we keep the validation logic same for RPCs and web-services? You have additional checks in web-services that don't quite exist in RPCs. I still see some w.r.t CLC? Fixed. I've also updated the documentation. > Implement getnewapplication and submitapp as part of RM web service > --- > > Key: YARN-1713 > URL: https://issues.apache.org/jira/browse/YARN-1713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, > apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, > apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, > apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, > apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, > apache-yarn-1713.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2228: -- Attachment: YARN-2228.1.patch Created a patch to make the following major changes: 1. Always load TimelineAuthenticationFilter when the timeline server is up. 2. Completely separate the timeline authentication configuration dependency from the common part. All timeline authentication configurations start with "yarn.timeline-service.http.authentication". 3. When y.t.h.a.type = simple, TimelineAuthenticationFilter uses PseudoAuthenticationHandler to process the request. It allows the timeline server to get the user name if the user specifies "user.name" in the URL param, and to use it as the owner of the entity that the user posts. In this way, we can enable timeline ACLs even when kerberos authentication is not enabled (aka insecure mode). When y.t.h.a.type = kerberos, everything works as before. 4. Updated TestTimelineWebServices to test ACLs under the "simple" authentication type instead of mocking the user name. I've verified the patch locally in both secure and insecure clusters, and it looked generally fine. > TimelineServer should load pseudo authentication filter when authentication = > simple > > > Key: YARN-2228 > URL: https://issues.apache.org/jira/browse/YARN-2228 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2228.1.patch > > > When kerberos authentication is not enabled, we should let the timeline > server work with the pseudo authentication filter. In this way, the server is > able to detect the request user by checking "user.name". > On the other hand, the timeline client should append "user.name" in the un-secure > case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
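For the client-side behavior mentioned in the description (appending "user.name" in the insecure case), the request URL would take roughly this shape. The host, port, and path below are assumptions for illustration only; the point is just the query parameter:

```java
import java.net.URLEncoder;

public class UserNameParam {
  public static void main(String[] args) throws Exception {
    // Hypothetical timeline-server endpoint; only the user.name query
    // parameter is what the pseudo authentication filter inspects.
    String base = "http://localhost:8188/ws/v1/timeline";
    String url = base + "?user.name=" + URLEncoder.encode("alice", "UTF-8");
    System.out.println(url);
  }
}
```

With this parameter present, the server can attribute posted entities to "alice" even without Kerberos.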
[jira] [Commented] (YARN-2239) Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()
[ https://issues.apache.org/jira/browse/YARN-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048657#comment-14048657 ] Hadoop QA commented on YARN-2239: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653340/YARN-2239.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4159//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4159//console This message is automatically generated. > Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs() > --- > > Key: YARN-2239 > URL: https://issues.apache.org/jira/browse/YARN-2239 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Kenji Kikushima >Assignee: Kenji Kikushima >Priority: Trivial > Attachments: YARN-2239.patch > > > In ClusterMetrics, other get NMs() methods have "Num" prefix. (Ex. 
> getNumLostNMs()/getNumRebootedNMs()) > For naming consistency, we should rename getUnhealthyNMs() to > getNumUnhealthyNMs(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048623#comment-14048623 ] Hadoop QA commented on YARN-2229: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653339/YARN-2229.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4160//console This message is automatically generated. > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.2.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
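The YARN-2052 format quoted in the description (upper 10 bits for epoch, lower 22 bits for the sequence number, hence overflow after 2^10 = 1024 restarts) packs into an int as sketched below. The helper names are illustrative, not the actual ContainerId code:

```java
public class ContainerIdBits {
  static final int SEQ_BITS = 22;                    // lower 22 bits: sequence number
  static final int SEQ_MASK = (1 << SEQ_BITS) - 1;   // remaining upper 10 bits: epoch

  static int pack(int epoch, int seq) { return (epoch << SEQ_BITS) | (seq & SEQ_MASK); }
  static int epochOf(int id)          { return id >>> SEQ_BITS; }
  static int seqOf(int id)            { return id & SEQ_MASK; }

  public static void main(String[] args) {
    // After 1024 RM restarts the 10-bit epoch wraps around, which is the
    // overflow concern motivating the move to a long id on this JIRA.
    int id = pack(3, 12345);
    System.out.println(epochOf(id) + " " + seqOf(id));
  }
}
```

Widening the id to a long would leave room for a much larger epoch while preserving the existing getId() semantics.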
[jira] [Updated] (YARN-2239) Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()
[ https://issues.apache.org/jira/browse/YARN-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-2239: -- Attachment: YARN-2239.patch Attached a patch. > Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs() > --- > > Key: YARN-2239 > URL: https://issues.apache.org/jira/browse/YARN-2239 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Kenji Kikushima >Assignee: Kenji Kikushima >Priority: Trivial > Attachments: YARN-2239.patch > > > In ClusterMetrics, other get NMs() methods have "Num" prefix. (Ex. > getNumLostNMs()/getNumRebootedNMs()) > For naming consistency, we should rename getUnhealthyNMs() to > getNumUnhealthyNMs(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2239) Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs()
Kenji Kikushima created YARN-2239: - Summary: Rename ClusterMetrics#getUnhealthyNMs() to getNumUnhealthyNMs() Key: YARN-2239 URL: https://issues.apache.org/jira/browse/YARN-2239 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Kenji Kikushima Assignee: Kenji Kikushima Priority: Trivial In ClusterMetrics, other get NMs() methods have "Num" prefix. (Ex. getNumLostNMs()/getNumRebootedNMs()) For naming consistency, we should rename getUnhealthyNMs() to getNumUnhealthyNMs(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.2.patch Fixed compile error. > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.2.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: (was: YARN-2229-wip.01.patch) > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048593#comment-14048593 ] Hadoop QA commented on YARN-2229: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653335/YARN-2229.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4158//console This message is automatically generated. > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229-wip.01.patch, YARN-2229.1.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048589#comment-14048589 ] Tsuyoshi OZAWA commented on YARN-2229: -- Attached a patch based on the idea described above. > Making ContainerId long type > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229-wip.01.patch, YARN-2229.1.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2232) ClientRMService doesn't allow delegation token owner to cancel their own token in secure mode
[ https://issues.apache.org/jira/browse/YARN-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048587#comment-14048587 ] Hadoop QA commented on YARN-2232:
----------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653327/apache-yarn-2232.2.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4157//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4157//console

This message is automatically generated.

> ClientRMService doesn't allow delegation token owner to cancel their own
> token in secure mode
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-2232
>                 URL: https://issues.apache.org/jira/browse/YARN-2232
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: apache-yarn-2232.0.patch, apache-yarn-2232.1.patch, apache-yarn-2232.2.patch
>
> The ClientRMService doesn't allow delegation token owners to cancel their own tokens.
> The root cause is this piece of code from the cancelDelegationToken function:
> {noformat}
> String user = getRenewerForToken(token);
> ...
> private String getRenewerForToken(Token token)
>     throws IOException {
>   UserGroupInformation user = UserGroupInformation.getCurrentUser();
>   UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
>   // we can always renew our own tokens
>   return loginUser.getUserName().equals(user.getUserName())
>       ? token.decodeIdentifier().getRenewer().toString()
>       : user.getShortUserName();
> }
> {noformat}
> It ends up passing the user's short name to the cancelToken function, whereas AbstractDelegationTokenSecretManager::cancelToken expects the full user name. This bug occurs in secure mode and is not an issue with simple auth.
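The short-name/full-name mismatch described above can be sketched without any Hadoop dependencies. The sketch below is illustrative only: `shortName` is a hypothetical stand-in for what {{UserGroupInformation#getShortUserName()}} does to a Kerberos-style principal, and `mayCancel` stands in for an ownership check keyed on the full name, as in {{AbstractDelegationTokenSecretManager#cancelToken}}.

```java
// Illustrative sketch (not Hadoop code) of why passing a short name where a
// full name is expected only breaks in secure mode. With Kerberos, the full
// user name is a principal such as "alice@EXAMPLE.COM" while the short name
// is just "alice"; with simple auth the two are identical.
public class ShortNameSketch {
    // Hypothetical stand-in for UserGroupInformation#getShortUserName():
    // drop the host and realm components of a Kerberos-style principal.
    static String shortName(String fullName) {
        int end = fullName.length();
        int slash = fullName.indexOf('/');
        int at = fullName.indexOf('@');
        if (slash >= 0) end = Math.min(end, slash);
        if (at >= 0) end = Math.min(end, at);
        return fullName.substring(0, end);
    }

    // Hypothetical ownership check keyed on the FULL user name, like
    // AbstractDelegationTokenSecretManager#cancelToken.
    static boolean mayCancel(String tokenOwnerFullName, String caller) {
        return tokenOwnerFullName.equals(caller);
    }

    public static void main(String[] args) {
        // Secure mode: the owner is a full principal, so passing the short
        // name fails the ownership check and the cancel is rejected.
        String secureOwner = "alice@EXAMPLE.COM";
        System.out.println(mayCancel(secureOwner, shortName(secureOwner))); // prints "false"

        // Simple auth: full name == short name, so the bug never shows up.
        String simpleOwner = "alice";
        System.out.println(mayCancel(simpleOwner, shortName(simpleOwner))); // prints "true"
    }
}
```

This matches the issue's observation that the bug is invisible with simple auth: only under Kerberos do the full and short forms of a user name actually differ.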
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229:
---------------------------------

    Attachment: YARN-2229.1.patch