[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630042#comment-14630042 ] Wangda Tan commented on YARN-2003: -- Thanks [~sunilg] for the update. A few more comments regarding the latest patch:
- I suggest deferring the consideration of queue checking. We are currently changing how queue mapping is done; ideally it should happen before submission to the scheduler (maybe before assigning the application priority), see YARN-3635.
- The assumption that the queue will exist before submission to the scheduler is not always valid: with queue mapping, the scheduler can create the queue when accepting the application. I suggest removing the check for the queue's existence. Instead, you can have a private method that gets the priority by queue name; if the queue does not exist, you can assign the default priority to the application.
- Priority comparison should use Priority.compareTo.
Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it. Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
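The lookup-with-default-fallback Wangda suggests, plus Priority-based comparison, might look roughly like this. This is a minimal sketch: the Priority class below is a stand-in (the real org.apache.hadoop.yarn.api.records.Priority differs), and the map, method name, and default value are hypothetical, not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.yarn.api.records.Priority (an assumption;
// the real class differs). Here a higher integer means higher priority.
class Priority implements Comparable<Priority> {
    final int value;
    Priority(int value) { this.value = value; }
    @Override
    public int compareTo(Priority other) { return Integer.compare(value, other.value); }
}

public class PriorityLookupSketch {
    static final Priority DEFAULT_PRIORITY = new Priority(0);

    // Hypothetical per-queue priority table; in the RM this would come from
    // the CapacityScheduler configuration.
    static final Map<String, Priority> queuePriorities = new HashMap<>();

    // Suggested shape: look up the priority by queue name and fall back to
    // the default when the queue does not (yet) exist.
    static Priority getPriorityByQueueName(String queueName) {
        return queuePriorities.getOrDefault(queueName, DEFAULT_PRIORITY);
    }
}
```

Comparisons then go through compareTo rather than touching the underlying integers, so callers stay correct if the priority ordering convention changes.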
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630090#comment-14630090 ] Bibin A Chundatt commented on YARN-3932: Hi [~leftnoteasy], I think we should iterate over {{liveContainers}} and get the sum of the resources used. Any thoughts? SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: ApplicationReport.jpg The Application Resource Report is shown wrong when a node label is used.
1. Submit an application with a NodeLabel
2. Check the RM UI for resources used: Allocated CPU VCores and Allocated Memory MB are always {{zero}}
{code}
public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
  AggregateAppResourceUsage runningResourceUsage =
      getRunningAggregateAppResourceUsage();
  Resource usedResourceClone =
      Resources.clone(attemptResourceUsage.getUsed());
  Resource reservedResourceClone =
      Resources.clone(attemptResourceUsage.getReserved());
  return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
      reservedContainers.size(), usedResourceClone, reservedResourceClone,
      Resources.add(usedResourceClone, reservedResourceClone),
      runningResourceUsage.getMemorySeconds(),
      runningResourceUsage.getVcoreSeconds());
}
{code}
This should be {{attemptResourceUsage.getUsed(label)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
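Bibin's suggestion (summing usage over {{liveContainers}} instead of relying on the unlabeled total) can be sketched with stand-in types; the Resource class and method below are illustrative, not the actual YARN classes.

```java
import java.util.List;

// Minimal stand-ins (assumptions, not the real YARN classes) to illustrate
// summing usage over the live containers as suggested above.
class Resource {
    final long memory;   // MB
    final int vcores;
    Resource(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
    static Resource add(Resource a, Resource b) {
        return new Resource(a.memory + b.memory, a.vcores + b.vcores);
    }
}

public class UsageReportSketch {
    // Sum the allocated resources of every live container, independent of
    // which node-label partition each container was placed on.
    static Resource usedFromLiveContainers(List<Resource> liveContainerResources) {
        Resource total = new Resource(0, 0);
        for (Resource r : liveContainerResources) {
            total = Resource.add(total, r);
        }
        return total;
    }
}
```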
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630472#comment-14630472 ] Hadoop QA commented on YARN-3905: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 14s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 8m 29s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 23s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 1 new checkstyle issues (total was 39, now 40). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 23s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 9s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. 
| | | | 40m 39s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745708/YARN-3905.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 0bda84f | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/diffcheckstylehadoop-yarn-server-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8562/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8562/console | This message was automatically generated. Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. 
{noformat}
The stack trace is as follows:
{code}
2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001
2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01.
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
        at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
        at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630593#comment-14630593 ] Sangjin Lee commented on YARN-3906: --- The bulk of the work is done, but I'd like to wait until YARN-3908 is committed and then update the changes. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Per discussions on YARN-3815, we need to split the application entities from the main entity table into their own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629300#comment-14629300 ] Hadoop QA commented on YARN-3535: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 14s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 5 new checkstyle issues (total was 338, now 343). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 22s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 51m 30s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 89m 45s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745572/0005-YARN-3535.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 3ec0a04 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8554/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8554/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8554/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8554/console | This message was automatically generated. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3852) Add docker container support to container-executor
[ https://issues.apache.org/jira/browse/YARN-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629276#comment-14629276 ] Varun Vasudev commented on YARN-3852: - Thanks for the patch [~ashahab]. The patch isn't working for me. There are two issues -
# No default value for docker.binary. I think we should assume this to be docker and allow it to be overridden.
# The docker launch fails due to
{code}
if (change_effective_user(user_uid, user_gid) != 0)
{code}
in launch_docker_container_as_user. For docker run to work, the effective user needs to be root (something like change_effective_user(0, user_gid) is probably the right way).
Some other issues -
# {code}
-static const char* DEFAULT_BANNED_USERS[] = {"yarn", "mapred", "hdfs", "bin", 0};
+static const char* DEFAULT_BANNED_USERS[] = {"mapred", "hdfs", "bin", 0};
{code}
Why are you removing the yarn user from the banned users? I'm guessing this is due to a branch-2/trunk issue. The yarn user is banned in trunk but not in branch-2.
# A couple of formatting fixes
{code}
+ fprintf(LOGFILE, "done opening pid\n");
+fflush(LOGFILE);
{code}
and
{code}
+fprintf(LOGFILE, "done writing pid to tmp\n");
+ fflush(LOGFILE);
{code}
# Can we change the error message below to a more descriptive one?
{code}
+ fprintf(ERRORFILE, "Error reading\n");
+ fflush(ERRORFILE);
{code}
# In parse_docker_command_file
{code}
+ int read;
{code}
should we use ssize_t instead of int?
# In parse_docker_command_file, we have some exit(1) calls - can we change these to use the error codes in container-executor.h?
# In run_docker
{code}
+ free(docker_binary);
+ free(args);
+ free(docker_command_with_binary);
+ free(docker_command);
+ exit_code = DOCKER_RUN_FAILED;
+ }
+ exit_code = 0;
+ return exit_code;
{code}
The exit code from the function will always be 0.
# Formatting
{code}
+int create_script_paths(const char *work_dir,
+ const char *script_name, const char *cred_file,
+ char** script_file_dest, char** cred_file_dest,
+ int* container_file_source, int* cred_file_source) {
{code}
# In create_script_paths, we use a bunch of gotos, but the goto target doesn't have any special logic or handling. Can we avoid using the gotos?
# {code}
+//kill me now.
{code}
No need for the commentary :)
# In main.c
{code}
+char * resources = argv[optind++];// key,value pair describing resources
+char * resources_key = malloc(strlen(resources));
+char * resources_value = malloc(strlen(resources));
{code}
Can we move the declarations of resources, resources_key and resources_value out of the case block (since the same variables are used in two case blocks)?
Add docker container support to container-executor --- Key: YARN-3852 URL: https://issues.apache.org/jira/browse/YARN-3852 Project: Hadoop YARN Issue Type: Sub-task Components: yarn Reporter: Sidharta Seethana Assignee: Abin Shahab Attachments: YARN-3852.patch For security reasons, we need to ensure that access to the docker daemon and the ability to run docker containers is restricted to privileged users (i.e. users running applications should not have direct access to docker). In order to ensure the node manager can run docker commands, we need to add docker support to the container-executor binary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
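The exit-code issue Varun flags in run_docker (the failure branch sets exit_code, then a later assignment unconditionally resets it to 0) is easy to see in a minimal sketch. This illustration is in Java rather than the patch's C, and the error constant's value is made up:

```java
public class ExitCodeSketch {
    static final int DOCKER_RUN_FAILED = 29; // hypothetical error constant

    // Shape of the reviewed code: the failure branch sets the error code,
    // but the following assignment unconditionally clobbers it with 0.
    static int buggy(boolean runFailed) {
        int exitCode = 0;
        if (runFailed) {
            exitCode = DOCKER_RUN_FAILED;
        }
        exitCode = 0; // always executed, so the error code is lost
        return exitCode;
    }

    // One possible fix: return early from the failure branch.
    static int fixed(boolean runFailed) {
        if (runFailed) {
            return DOCKER_RUN_FAILED;
        }
        return 0;
    }
}
```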
[jira] [Updated] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3174: - Summary: Consolidate the NodeManager and NodeManagerRestart documentation into one (was: Consolidate the NodeManager documentation into one) Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629268#comment-14629268 ] Sunil G commented on YARN-2005: --- Thanks [~adhoot], and sorry for the delayed response. bq. The nodes are removed from blacklist once the launch of the AM happens to limit this issue. Yes, I feel this will be fine. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM, but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3174: - Affects Version/s: 2.7.1 Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3174: - Component/s: documentation Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629308#comment-14629308 ] Hudson commented on YARN-3174: -- FAILURE: Integrated in Hadoop-trunk-Commit #8171 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8171/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629253#comment-14629253 ] Arun Suresh commented on YARN-3535: --- The patch looks good! Thanks for working on this, [~peng.zhang] and [~rohithsharma]. +1, pending a successful Jenkins run with the latest patch. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3926) Extend the YARN resource model for easier resource-type management and profiles
[ https://issues.apache.org/jira/browse/YARN-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629288#comment-14629288 ] Karthik Kambatla commented on YARN-3926: Thanks a bunch for putting this proposal together, Varun. We are in dire need of improvements to our resource model, and the proposal goes a long way in addressing some of these issues. Huge +1 to this effort. Comments on the proposal itself:
# There is significant overlap between resource-types.xml and node-resources.xml. It would be nice to consolidate at least these parts.
# Can we avoid the mismatch between the resource types on the RM and NM altogether?
# Can we avoid different restart paths for adding and removing resources?
# I really like the concise configs proposed at the end of the document.
What do you think of the following modifications to the proposal to address the above wishes? I have clearly not thought about this as much before making these suggestions, so please feel free to shoot them down.
# How about calling them yarn.resource-types, yarn.resource-types.memory.*, yarn.resource-types.cpu.*? Further, memory/cpu-specific configs could be made simpler per the suggestions later in the document.
# yarn.scheduler.resource-types is a subset of yarn.resource-types, and captures the resource-types the scheduler supports. This could be in yarn-site on the RM.
# yarn.nodemanager.resource-types.monitored and yarn.nodemanager.resource-types.enforced are also subsets of yarn.resource-types and could define the resources the NM monitors and enforces respectively. These could be in yarn-site on the NM. I understand isolation is out of scope here, but it would be nice to have configs that lend themselves to future work.
# yarn.nodemanager.[resources|resource-types].available could be a map where each key should be an entry in yarn.resource-types. You mention capturing node-labels etc. similarly. Could you elaborate on your thoughts, at least informally? It would be super nice to have a path in mind even if we were to do it as follow-up work.
Extend the YARN resource model for easier resource-type management and profiles --- Key: YARN-3926 URL: https://issues.apache.org/jira/browse/YARN-3926 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Proposal for modifying resource model and profiles.pdf Currently, there are efforts to add support for various resource-types such as disk (YARN-2139), network (YARN-2140), and HDFS bandwidth (YARN-2681). These efforts all aim to add support for a new resource type and are fairly involved. In addition, once support is added, it becomes harder for users to specify the resources they need: all existing jobs have to be modified, or have to use the minimum allocation. This ticket is a proposal to extend the YARN resource model to a more flexible model which makes it easier to support additional resource-types. It also considers the related aspect of “resource profiles” which allow users to easily specify the various resources they need for any given container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
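The subset relationships Karthik describes (e.g. yarn.scheduler.resource-types being a subset of yarn.resource-types) imply a simple validation step when loading configuration. A rough sketch, with hypothetical parsing and method names not tied to any actual YARN code:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ResourceTypesConfigSketch {
    // Parse a comma-separated resource-type list, e.g. "memory,cpu,disk".
    static Set<String> parseTypes(String csv) {
        Set<String> types = new LinkedHashSet<>();
        for (String t : csv.split(",")) {
            if (!t.trim().isEmpty()) {
                types.add(t.trim());
            }
        }
        return types;
    }

    // The proposal above says e.g. yarn.scheduler.resource-types should be a
    // subset of yarn.resource-types; this checks that invariant.
    static boolean isValidSubset(String clusterTypesCsv, String subsetCsv) {
        return parseTypes(clusterTypesCsv).containsAll(parseTypes(subsetCsv));
    }
}
```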
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629302#comment-14629302 ] Tsuyoshi Ozawa commented on YARN-3805: -- [~iwasakims] could you rebase it? Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629249#comment-14629249 ] Arun Suresh commented on YARN-3535: --- I meant for the FairScheduler... but looks like your new patch has it... thanks ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629250#comment-14629250 ] Akira AJISAKA commented on YARN-2578: - Thanks [~iwasakims] for creating the patch. One comment and one question from me. bq. The default value is 0 in order to keep current behaviour. 1. We would like to fix this bug, so defaulting to 1 min is good for me. 2. Would you tell me why {{Client.getRpcTimeout}} returns 0 if {{ipc.client.ping}} is false? NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: YARN-2578.002.patch, YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged, or when the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected, as expected. The NM should then re-register with the new active RM. This re-registration takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment:
- create a cluster with 3 nodes
  node 1: ZK, NN, JN, ZKFC, DN, RM, NM
  node 2: ZK, NN, JN, ZKFC, DN, RM, NM
  node 3: ZK, JN, DN, NM
- start all services and make sure they are in good health
- kill the network connection of the active RM using one of the network kills from above
- observe the NN and RM failover
- the DNs fail over to the new active NN
- the NM does not recover for a long time
- the logs show a long delay and traces show no change at all
The stack traces of the NM all show the same set of threads.
The main thread which should be used in the re-register is the Node Status Updater. This thread is stuck in:
{code}
"Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.ipc.Client.call(Client.java:1395)
        - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.Client.call(Client.java:1362)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
{code}
The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out, and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
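The fix direction described above (bounding the blocking heartbeat call with the configured RPC timeout instead of waiting indefinitely) can be illustrated generically. This is not the Hadoop IPC API; it is a plain java.util.concurrent sketch of the same idea:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TimeoutCallSketch {
    // Generic illustration (not the Hadoop IPC API): bound a blocking call
    // with a timeout instead of waiting indefinitely, which is what the NM
    // heartbeat proxy should do with the configured RPC timeout. A
    // TimeoutException here would let the caller retry against the new
    // active RM instead of hanging for 15+ minutes.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            return exec.submit(call).get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            exec.shutdownNow();
        }
    }
}
```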
[jira] [Commented] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629263#comment-14629263 ] Tsuyoshi Ozawa commented on YARN-3174: -- +1 Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629283#comment-14629283 ] Masatake Iwasaki commented on YARN-3174: Thanks, [~ozawa]! Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629296#comment-14629296 ] zhihai xu commented on YARN-3535: - Sorry for coming late into this issue. The latest patch looks good to me except one nit: can we make {{ContainerRescheduledTransition}} a child class of {{FinishedTransition}}, similar to {{KillTransition}}? Then we can call {{super.transition(container, event);}} instead of {{new FinishedTransition().transition(container, event);}}. I think this will make the code more readable and match the other transition class implementations. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
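The subclassing shape zhihai suggests can be sketched with stand-in types. The container/event classes below are assumptions for illustration, not the real rmcontainer classes:

```java
// Stand-in types (assumptions); the real classes live in the
// org.apache.hadoop.yarn.server.resourcemanager.rmcontainer package.
class RMContainerEvent {}

class RMContainerStub {
    final StringBuilder trace = new StringBuilder();
}

class FinishedTransition {
    public void transition(RMContainerStub container, RMContainerEvent event) {
        container.trace.append("finished;"); // common finished handling
    }
}

// zhihai's suggested shape: subclass FinishedTransition and call
// super.transition(...) instead of instantiating FinishedTransition inline.
public class ContainerRescheduledTransition extends FinishedTransition {
    @Override
    public void transition(RMContainerStub container, RMContainerEvent event) {
        container.trace.append("restore-request;"); // give the ResourceRequest back
        super.transition(container, event);          // then run the shared logic
    }
}
```

The subclass runs its extra restore step first and then reuses the parent's cleanup, matching how the other transition classes are structured.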
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629410#comment-14629410 ] wangfeng commented on YARN-2809: Patching this onto hadoop-2.6.0 failed; console output:
patch -u -p0 < YARN-2809-v3.patch
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
Hunk #1 succeeded at 984 (offset -16 lines).
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
Hunk #1 FAILED at 22. Hunk #2 succeeded at 33 (offset -4 lines). Hunk #3 succeeded at 71 (offset -5 lines). Hunk #4 succeeded at 105 (offset -5 lines). Hunk #5 succeeded at 266 (offset -10 lines). Hunk #6 succeeded at 338 (offset -10 lines).
1 out of 6 hunks FAILED -- saving rejects to file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java.rej
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
Implement workaround for linux kernel panic when removing cgroup Key: YARN-2809 URL: https://issues.apache.org/jira/browse/YARN-2809 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Environment: RHEL 6.4 Reporter: Nathan Roberts Assignee: Nathan Roberts Fix For: 2.7.0 Attachments: YARN-2809-v2.patch, YARN-2809-v3.patch, YARN-2809.patch Some older versions of linux have a bug that can cause a kernel panic when the LCE attempts to remove a cgroup. It is a race condition so it's a bit rare but on a few thousand node cluster it can result in a couple of panics per day. 
This is the commit that likely (haven't verified) fixes the problem in linux: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.39.y&id=068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267 Details will be added in comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
Dongwook Kwon created YARN-3929: --- Summary: Uncleaning option for local app log files with log-aggregation feature Key: YARN-3929 URL: https://issues.apache.org/jira/browse/YARN-3929 Project: Hadoop YARN Issue Type: New Feature Components: log-aggregation Affects Versions: 2.6.0, 2.4.0 Reporter: Dongwook Kwon Priority: Minor Although it makes sense to delete local app log files once AppLogAggregator has copied all files to the remote location (HDFS), I have some use cases that need the local app log files left in place after they are copied to HDFS, mostly for our own backup purposes. I would like to use the log-aggregation feature of YARN and back up the app log files too. Without this option, the files have to be copied from HDFS back to local storage again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629429#comment-14629429 ] Tsuyoshi Ozawa commented on YARN-3805: -- Checking this in. Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
[ https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongwook Kwon updated YARN-3929: Attachment: YARN-3929.01.patch Could you review this patch? Thanks. Uncleaning option for local app log files with log-aggregation feature -- Key: YARN-3929 URL: https://issues.apache.org/jira/browse/YARN-3929 Project: Hadoop YARN Issue Type: New Feature Components: log-aggregation Affects Versions: 2.4.0, 2.6.0 Reporter: Dongwook Kwon Priority: Minor Attachments: YARN-3929.01.patch Although it makes sense to delete local app log files once AppLogAggregator has copied all files to the remote location (HDFS), I have some use cases that need the local app log files left in place after they are copied to HDFS, mostly for our own backup purposes. I would like to use the log-aggregation feature of YARN and back up the app log files too. Without this option, the files have to be copied from HDFS back to local storage again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629491#comment-14629491 ] Naganarasimha G R commented on YARN-3931: - Hi [~kyungwan nam], thanks for raising the issue. I have assigned this JIRA to myself, but if you are interested in looking into it further and solving it, please reassign. default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629489#comment-14629489 ] kyungwan nam commented on YARN-3931: node-label-expression is initialized to an empty string
{code}
...
public ApplicationSubmissionContextInfo() {
  applicationId = "";
  applicationName = "";
  containerInfo = new ContainerLaunchContextInfo();
  resource = new ResourceInfo();
  priority = Priority.UNDEFINED.getPriority();
  isUnmanagedAM = false;
  cancelTokensWhenComplete = true;
  keepContainers = false;
  applicationType = "";
  tags = new HashSet<String>();
  appNodeLabelExpression = "";
  amContainerNodeLabelExpression = "";
}
{code}
but the check is only for whether node-label-expression is null:
{code}
// check labels in the resource request.
String labelExp = resReq.getNodeLabelExpression();
// if queue has default label expression, and RR doesn't have, use the
// default label expression of queue
if (labelExp == null && queueInfo != null) {
  labelExp = queueInfo.getDefaultNodeLabelExpression();
  resReq.setNodeLabelExpression(labelExp);
}
{code}
default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
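The null-vs-empty mismatch can be reproduced in isolation. The {{resolveLabel}} helper below is hypothetical scaffolding (not a YARN API), and treating the empty string the same as null is one possible fix, not necessarily the committed one:

```java
public class LabelFallbackDemo {
    // Hypothetical helper: pick the effective node-label expression for a
    // request, falling back to the queue default. A REST-submitted request
    // carries "" (not null), so a null-only check silently skips the fallback.
    static String resolveLabel(String requestLabel, String queueDefault) {
        if (requestLabel == null || requestLabel.isEmpty()) {
            return queueDefault;  // apply the queue's default-node-label-expression
        }
        return requestLabel;
    }

    public static void main(String[] args) {
        System.out.println(resolveLabel(null, "large_disk"));  // large_disk
        System.out.println(resolveLabel("", "large_disk"));    // large_disk (a null-only check would keep "")
        System.out.println(resolveLabel("gpu", "large_disk")); // gpu
    }
}
```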
[jira] [Assigned] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-3931: --- Assignee: Naganarasimha G R default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629555#comment-14629555 ] Ajith S commented on YARN-3885: --- The test failure is not because of the patch. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when the preemption policy is {{ProportionalCapacityPreemptionPolicy}}, the piece of code in {{cloneQueues}} that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
kyungwan nam created YARN-3931: -- Summary: default-node-label-expression doesn’t apply when an application is submitted by RM rest api Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629412#comment-14629412 ] Varun Saxena commented on YARN-3928: Duplicate of MAPREDUCE-6402 launch application master on specific host -- Key: YARN-3928 URL: https://issues.apache.org/jira/browse/YARN-3928 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.6.0 Environment: Ubuntu 12.04 Reporter: Wenrui Hi, is there a way to launch the application master on a specific host? If we cannot do this with a managed AM launcher, is it possible to achieve it with an unmanaged AM launcher? I find it quite necessary to place the application master on a specific host in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629411#comment-14629411 ] Sunil G commented on YARN-3535: --- Thank you [~peng.zhang] and [~asuresh] for correcting. bq.that notification will happen only AFTER the recoverResourceRequest has completed.. since it will be handled by the same dispatcher Yes. I missed this. Ordering will be corrected here. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629464#comment-14629464 ] Hudson commented on YARN-3805: -- FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8173/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629465#comment-14629465 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8173/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
Dian Fu created YARN-3930: - Summary: FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Reporter: Dian Fu Assignee: Dian Fu When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception during {{ensureAppendEditlogFile}} for some reason, which leaves the edit log output stream unclosed. As a result, the next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
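The general fix shape — close the stream on the failure path so the lease is released — can be sketched independently of the actual patch. {{StreamFactory}} and {{openForAppend}} below are hypothetical scaffolding, not YARN or HDFS APIs:

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseOnFailureDemo {
    // Hypothetical scaffolding standing in for the HDFS append path.
    interface StreamFactory {
        Closeable open() throws IOException;
        void prepare(Closeable out) throws IOException;  // may throw, like the failing append
    }

    // If the setup step throws, close the stream before propagating, so a
    // later reopen doesn't have to recover a lease we ourselves still hold.
    static Closeable openForAppend(StreamFactory factory) throws IOException {
        Closeable out = factory.open();
        try {
            factory.prepare(out);
            return out;
        } catch (IOException e) {
            out.close();  // don't leak the stream (or the lease) on error
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        final boolean[] closed = {false};
        StreamFactory failing = new StreamFactory() {
            public Closeable open() { return () -> closed[0] = true; }
            public void prepare(Closeable out) throws IOException {
                throw new IOException("simulated append failure");
            }
        };
        try {
            openForAppend(failing);
        } catch (IOException expected) {
            System.out.println("stream closed after failure: " + closed[0]); // true
        }
    }
}
```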
[jira] [Updated] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated YARN-3930: -- Attachment: YARN-3930.001.patch A simple patch attached. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception during {{ensureAppendEditlogFile}} for some reason, which leaves the edit log output stream unclosed. As a result, the next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated YARN-3885: -- Attachment: YARN-3885.08.patch ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when the preemption policy is {{ProportionalCapacityPreemptionPolicy}}, the piece of code in {{cloneQueues}} that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629452#comment-14629452 ] zhihai xu commented on YARN-3535: - Also, because {{containerCompleted}} and {{pullNewlyAllocatedContainersAndNMTokens}} are synchronized, it is guaranteed that if the AM gets the container, {{ContainerRescheduledEvent}} ({{recoverResourceRequestForContainer}}) won't be called. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629418#comment-14629418 ] Tsuyoshi Ozawa commented on YARN-3805: -- +1, pending for Jenkins. Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629446#comment-14629446 ] Hadoop QA commented on YARN-3885: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 12s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 46s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 50s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 18s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 23s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 61m 19s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 99m 23s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745584/YARN-3885.08.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 90bda9c | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8555/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8555/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8555/console | This message was automatically generated. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when the preemption policy is {{ProportionalCapacityPreemptionPolicy}}, the piece of code in {{cloneQueues}} that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629348#comment-14629348 ] Sunil G commented on YARN-3535: --- Hi [~rohithsharma] and [~peng.zhang], after seeing this patch, I feel there may be a synchronization problem. Please correct me if I am wrong. In the ContainerRescheduledTransition code, it's used like
{code}
+ container.eventHandler.handle(new ContainerRescheduledEvent(container));
+ new FinishedTransition().transition(container, event);
{code}
Hence ContainerRescheduledEvent is fired to the Scheduler dispatcher, which will process {{recoverResourceRequestForContainer}} in a separate thread. Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked and the closure of this container will be processed. If the Scheduler dispatcher is slower in processing due to pending event queue length, there are chances that recoverResourceRequest may not be correct. I feel we can introduce a new Event in {{RMContainerImpl}} from ALLOCATED to WAIT_FOR_REQUEST_RECOVERY, and the scheduler can fire back an event to {{RMContainerImpl}} to indicate that recovery of the resource request is complete. This can move the state forward to KILLED in {{RMContainerImpl}}. Please share your thoughts. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629394#comment-14629394 ] Arun Suresh commented on YARN-3535: --- bq. I think recoverResourceRequest will not be affected by whether container finished event is processed faster. Cause recoverResourceRequest only process the ResourceRequest in container and not care its status. I agree with [~peng.zhang] here. IIUC, The {{recoverResourceRequest}} only affects state of the Scheduler and the SchedulerApp. In any case, the fact that the container is killed (the outcome of the {{RMAppAttemptContainerFinishedEvent}} fired by {{FinishedTransition#transition}}) will be notified to the Scheduler.. and that notification will happen only AFTER the recoverResourceRequest has completed.. since it will be handled by the same dispatcher. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
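The ordering argument rests on {{AsyncDispatcher}} draining a single FIFO queue on one thread. A stripped-down model of just that property (not the real {{AsyncDispatcher}}; names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class DispatcherOrderDemo {
    // Single-threaded FIFO dispatcher: events are handled strictly in the
    // order they were enqueued, so a "recover" enqueued before a "finished"
    // is always processed first.
    public static List<String> dispatchInOrder(List<String> events) throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(events);
        List<String> handled = new ArrayList<>();
        Thread dispatcher = new Thread(() -> {
            String e;
            while ((e = queue.poll()) != null) {
                handled.add(e);  // the single handler thread preserves order
            }
        });
        dispatcher.start();
        dispatcher.join();
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(dispatchInOrder(List.of("recoverResourceRequest", "containerFinished")));
        // [recoverResourceRequest, containerFinished]
    }
}
```

A multi-threaded or multi-queue dispatcher would not give this guarantee, which is why the "same dispatcher" observation settles the concern.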
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629369#comment-14629369 ] Peng Zhang commented on YARN-3535: -- bq. there are chances that recoverResourceRequest may not be correct. Sorry, I didn't catch this; maybe I missed something. I think {{recoverResourceRequest}} will not be affected by whether the container-finished event is processed faster, because {{recoverResourceRequest}} only processes the ResourceRequest in the container and doesn't care about its status. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3805: --- Attachment: YARN-3805.002.patch I rebased the patch. Thanks for pinging me, [~ozawa]. Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629423#comment-14629423 ] Hadoop QA commented on YARN-3805:
| (/) *{color:green}+1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 3m 42s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | site | 2m 59s | Site still builds. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| | | 7m 5s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745590/YARN-3805.002.patch |
| Optional Tests | site |
| git revision | trunk / 90bda9c |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8556/console |
This message was automatically generated.
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629615#comment-14629615 ] Hudson commented on YARN-3805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629646#comment-14629646 ] Hadoop QA commented on YARN-3930:
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 8s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 2 new checkstyle issues (total was 14, now 15). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 19s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 34s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. |
| | | | 40m 1s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745596/YARN-3930.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8557/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8557/console |
This message was automatically generated.
FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception:
{code}
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168)
at
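For reference, the shape of fix the issue title calls for is a close in a finally block (or try-with-resources), so that a failed append cannot leave the edit-log file open and trigger the lease-recovery error above. This is only a hedged sketch — the method name, return convention, and stream parameter are illustrative, not the actual FileSystemNodeLabelsStore API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class EditLogWriter {
    // Writes one record and guarantees the stream is closed, even when
    // write() throws. Returns whether the write itself succeeded.
    static boolean writeRecord(OutputStream editLog, byte[] record) {
        try {
            editLog.write(record);
            return true;
        } catch (IOException e) {
            return false;              // caller decides how to surface this
        } finally {
            try {
                editLog.close();       // runs even when write() threw
            } catch (IOException ignored) {
                // close failures are ignored here for brevity
            }
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // prints true
        System.out.println(writeRecord(out, new byte[]{1, 2, 3}));
    }
}
```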
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629722#comment-14629722 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/CHANGES.txt * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629728#comment-14629728 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629716#comment-14629716 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629715#comment-14629715 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629719#comment-14629719 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629724#comment-14629724 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629618#comment-14629618 ] Hudson commented on YARN-90: SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629616#comment-14629616 ] Hudson commented on YARN-3174: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-project/src/site/site.xml
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629717#comment-14629717 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629660#comment-14629660 ] Varun Saxena commented on YARN-3877: [~chris.douglas], thanks for the review. Yes, you are correct that this config is not required for the test; will remove it, and will move the relevant test code into a separate test. YarnClientImpl.submitApplication swallows exceptions Key: YARN-3877 URL: https://issues.apache.org/jira/browse/YARN-3877 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.7.2 Reporter: Steve Loughran Assignee: Varun Saxena Priority: Minor Attachments: YARN-3877.01.patch When {{YarnClientImpl.submitApplication}} spins waiting for the application to be accepted, any interruption during its sleep() calls is logged and swallowed. This makes it hard to interrupt the thread during shutdown. Really it should throw some form of exception and let the caller deal with it.
[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3877: --- Attachment: YARN-3877.02.patch
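The change YARN-3877 asks for can be sketched like this — names such as {{waitForAcceptance}} and {{pollIntervalMs}} are illustrative, not the actual YarnClientImpl code. Rather than logging and swallowing the InterruptedException raised during the sleep, restore the interrupt flag and rethrow so the caller can shut down promptly:

```java
import java.io.IOException;

public class SubmitWait {
    // Poll loop sketch: on interrupt, re-set the thread's interrupt status
    // and surface the failure instead of swallowing it.
    static void waitForAcceptance(long pollIntervalMs, int maxPolls)
            throws IOException {
        for (int i = 0; i < maxPolls; i++) {
            // ... check the application report here ...
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                throw new IOException(
                    "Interrupted waiting for app acceptance", e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread.currentThread().interrupt();  // simulate a shutdown interrupt
        try {
            waitForAcceptance(10, 1);
            System.out.println("not interrupted");
        } catch (IOException e) {
            // prints interrupted: Interrupted waiting for app acceptance
            System.out.println("interrupted: " + e.getMessage());
        }
    }
}
```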
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629750#comment-14629750 ] Lei Guo commented on YARN-3928: --- [~varun_saxena], I read this JIRA as a host-preference requirement during container allocation; it's not a duplicate of MAPREDUCE-6402. [~wenrui], can you confirm? launch application master on specific host -- Key: YARN-3928 URL: https://issues.apache.org/jira/browse/YARN-3928 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.6.0 Environment: Ubuntu 12.04 Reporter: Wenrui Hi, is there a way to launch the application master on a specific host? If we cannot do this with a managed AM launcher, is it possible to achieve with an unmanaged AM launcher? I find it quite necessary to place the application master on a specific host in some scenarios.
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629622#comment-14629622 ] Hudson commented on YARN-3805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629623#comment-14629623 ] Hudson commented on YARN-3174: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-project/src/site/site.xml * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629625#comment-14629625 ] Hudson commented on YARN-90: SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629718#comment-14629718 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629723#comment-14629723 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629753#comment-14629753 ] Varun Saxena commented on YARN-3928: Oh, then it is not. Misread the JIRA title. Apologies.
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629811#comment-14629811 ] Hadoop QA commented on YARN-3877:
| (/) *{color:green}+1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 34s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 28s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 53s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 6m 55s | Tests passed in hadoop-yarn-client. |
| | | | 43m 31s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745625/YARN-3877.02.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8558/artifact/patchprocess/testrun_hadoop-yarn-client.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8558/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8558/console |
This message was automatically generated.
[jira] [Updated] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3784: -- Attachment: 0002-YARN-3784.patch Uploading a new version of the patch. Initially the RM sent only a list of container IDs in the preemption message. This patch improves that to also include a timeout along with each container ID. The new timeout is an optional param in the proto. [~chris.douglas] Could you please take a look. Indicate preemption timeout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. This introduces a timeout duration along with that container list so that the AM knows how much time it will get to do a graceful shutdown of its containers (assuming a preemption policy is loaded in the AM). This will help in NM decommissioning scenarios, where the NM will be decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed forcefully by the RM after it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629790#comment-14629790 ] Naganarasimha G R commented on YARN-3931: - [~kyungwan nam], Good that you are trying to contribute :). We need to request a committer to add you to the list of contributors, but in the meantime you can upload the patch with a test case and I can help you review it. [~wangda tan], Can you please add [~kyungwan nam] to the contributor list and assign him this JIRA? default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
[ https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629998#comment-14629998 ] Xuan Gong commented on YARN-3929: - [~dongwook] Does this configuration: yarn.nodemanager.delete.debug-delay-sec satisfy your requirement? Uncleaning option for local app log files with log-aggregation feature -- Key: YARN-3929 URL: https://issues.apache.org/jira/browse/YARN-3929 Project: Hadoop YARN Issue Type: New Feature Components: log-aggregation Affects Versions: 2.4.0, 2.6.0 Reporter: Dongwook Kwon Priority: Minor Attachments: YARN-3929.01.patch Although it makes sense to delete local app log files once the AppLogAggregator has copied all files to the remote location (HDFS), I have some use cases that need to leave the local app log files in place after they are copied to HDFS, mostly for backup purposes. I would like to use the log-aggregation feature of YARN and also back up the app log files. Without this option, the files have to be copied from HDFS back to local storage again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
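For reference, the setting Xuan Gong points to delays (rather than disables) the NM's deletion of local files, including local app log directories, after an application finishes. A minimal yarn-site.xml fragment; the one-day value below is only an illustrative choice:

```xml
<!-- Keep NM local files (including local app log dirs) for a day after the
     application finishes, instead of deleting them as soon as log aggregation
     completes. 86400 is an illustrative value; the default of 0 deletes
     immediately. -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>86400</value>
</property>
```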
[jira] [Updated] (YARN-3893) Both RMs in active state when Admin#transitionToActive fails in refreshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3893: Issue Type: Sub-task (was: Bug) Parent: YARN-149 Both RMs in active state when Admin#transitionToActive fails in refreshAll() -- Key: YARN-3893 URL: https://issues.apache.org/jira/browse/YARN-3893 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml Cases that can cause this: # The capacity scheduler XML is wrongly configured during the switch # Refresh ACL failure due to configuration # Refresh user group failure due to configuration Both RMs will continuously try to become active {code} dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm1 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable active dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm2 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable active {code} # Both web UIs show active # Status is shown as active for both RMs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3931: - Assignee: kyungwan nam (was: Naganarasimha G R) default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: kyungwan nam * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630010#comment-14630010 ] Wangda Tan commented on YARN-3931: -- Thanks for raising the issue [~kyungwan nam]. I just added you to the contributor list and assigned the JIRA to you. default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: kyungwan nam * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630017#comment-14630017 ] Wangda Tan commented on YARN-3930: -- [~dian.fu], Thanks for working on the JIRA. Patch looks good, will commit soon. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception when {{ensureAppendEditlogFile}} is called, which leaves the edit log output stream unclosed. As a result, the next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
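The fix direction implied above is a close-on-failure pattern around the stream setup. A minimal sketch under stated assumptions: the class, method, and stream type below are illustrative stand-ins, not the real FileSystemNodeLabelsStore code.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of the close-on-failure pattern: if setting up the append stream
// fails partway, close whatever was opened before rethrowing, so the next
// attempt does not find the stream (and its HDFS lease) still held.
// All names here are illustrative stand-ins, not the real YARN classes.
public class SafeAppendSketch {
    static class EditLogStream implements Closeable {
        boolean closed = false;
        @Override
        public void close() { closed = true; }
    }

    static EditLogStream lastOpened;  // exposed so the demo can inspect it

    static EditLogStream open() {
        lastOpened = new EditLogStream();
        return lastOpened;
    }

    // Stand-in for ensureAppendEditlogFile(): open the stream, and if anything
    // after the open fails, close it before propagating the exception.
    public static EditLogStream ensureAppendEditlogFile(boolean failAfterOpen)
            throws IOException {
        EditLogStream out = open();
        try {
            if (failAfterOpen) {
                throw new IOException("simulated append-setup failure");
            }
            return out;
        } catch (IOException e) {
            out.close();  // the missing step in the original bug
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            ensureAppendEditlogFile(true);
        } catch (IOException expected) {
            System.out.println("stream closed after failure: " + lastOpened.closed);
        }
    }
}
```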
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630022#comment-14630022 ] Wangda Tan commented on YARN-3885: -- Patch LGTM, +1, will commit soon. Thanks [~ajithshetty]. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch In the preemption policy, {{ProportionalCapacityPreemptionPolicy.cloneQueues}} contains a piece of code to calculate {{untouchable}} that doesn't consider all the children; it considers only the immediate children. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
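The bug described is the classic difference between summing over immediate children and recursing over the whole subtree. A toy sketch for illustration, with plain ints standing in for Resources; none of these names are the real cloneQueues code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates the difference the comment describes: aggregating a quantity
// over immediate children only vs. over all descendants of a queue.
public class SubtreeSumSketch {
    static class Queue {
        int untouchable;                       // amount that must not be preempted here
        List<Queue> children = new ArrayList<>();
        Queue(int untouchable) { this.untouchable = untouchable; }
    }

    // Buggy variant: looks one level deep only
    static int immediateChildrenOnly(Queue q) {
        int sum = 0;
        for (Queue c : q.children) sum += c.untouchable;
        return sum;
    }

    // Fixed variant: recurses over the whole subtree
    static int allDescendants(Queue q) {
        int sum = 0;
        for (Queue c : q.children) sum += c.untouchable + allDescendants(c);
        return sum;
    }

    public static void main(String[] args) {
        Queue root = new Queue(0);
        Queue child = new Queue(10);
        Queue grandchild = new Queue(5);       // a queue more than 2 levels deep
        root.children.add(child);
        child.children.add(grandchild);
        System.out.println("immediate only: " + immediateChildrenOnly(root)); // 10
        System.out.println("whole subtree:  " + allDescendants(root));        // 15
    }
}
```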
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629701#comment-14629701 ] kyungwan nam commented on YARN-3931: Hi, I couldn't reassign it to myself. I think I don't have the privilege to assign issues. default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629714#comment-14629714 ] Varun Saxena commented on YARN-3877: [~chris.douglas], I have uploaded a new patch. Kindly review. To avoid timing issues in the test, I added code to wait for the thread to enter sleep (the TIMED_WAITING state) before calling interrupt. YarnClientImpl.submitApplication swallows exceptions Key: YARN-3877 URL: https://issues.apache.org/jira/browse/YARN-3877 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.7.2 Reporter: Steve Loughran Assignee: Varun Saxena Priority: Minor Attachments: YARN-3877.01.patch, YARN-3877.02.patch When {{YarnClientImpl.submitApplication}} spins waiting for the application to be accepted, any interruption during its sleep() calls is logged and swallowed. This makes it hard to interrupt the thread during shutdown. Really it should throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
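The stabilization trick described in the comment (interrupt only once the thread is verifiably inside sleep()) can be sketched in isolation. This is a stand-alone demo under stated assumptions, not the actual YARN-3877 test code:

```java
// Start a worker that sleeps (standing in for YarnClientImpl's submission
// poll loop), wait until it is really inside sleep() by polling its thread
// state for TIMED_WAITING, then interrupt it. Without the state check, the
// interrupt could land before the sleep and be missed by the test.
public class InterruptDuringSleepDemo {
    static volatile boolean sawInterrupt = false;

    public static boolean run() throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000L);
            } catch (InterruptedException e) {
                sawInterrupt = true;  // the interrupt landed inside sleep()
            }
        });
        worker.start();
        // Poll until the worker is actually parked in sleep()
        while (worker.getState() != Thread.State.TIMED_WAITING) {
            Thread.yield();
        }
        worker.interrupt();
        worker.join();
        return sawInterrupt;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interrupted inside sleep(): " + run());
    }
}
```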
[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3932: --- Attachment: ApplicationReport.jpg SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: ApplicationReport.jpg The Application Resource Report is shown wrong when a node label is used. 1. Submit an application with a NodeLabel. 2. Check the RM UI for resources used: Allocated CPU VCores and Allocated Memory MB are always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
Bibin A Chundatt created YARN-3932: -- Summary: SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt The Application Resource Report is shown wrong when a node label is used. 1. Submit an application with a NodeLabel. 2. Check the RM UI for resources used: Allocated CPU VCores and Allocated Memory MB are always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1644) RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1644: Attachment: YARN-1644-YARN-1197.4.patch Updated this patch as dependent patch has been updated. RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing - Key: YARN-1644 URL: https://issues.apache.org/jira/browse/YARN-1644 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1644-YARN-1197.4.patch, YARN-1644.1.patch, YARN-1644.2.patch, YARN-1644.3.patch, yarn-1644.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629940#comment-14629940 ] Ming Ma commented on YARN-2578: --- Thanks [~iwasakims]. Is it similar to HADOOP-11252? Given your latest patch is in hadoop-common, it might be better to fix it as a HADOOP jira instead. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: YARN-2578.002.patch, YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should then re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services and make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads.
The main thread which should be used in the re-register is the Node Status Updater. This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630151#comment-14630151 ] Hadoop QA commented on YARN-433: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12740222/YARN-433.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 1ba2986 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8559/console | This message was automatically generated. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3868) ContainerManager recovery for container resizing
[ https://issues.apache.org/jira/browse/YARN-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-3868: Attachment: YARN-3868-YARN-1197.3.patch Update patch as dependent patches have been updated. ContainerManager recovery for container resizing Key: YARN-3868 URL: https://issues.apache.org/jira/browse/YARN-3868 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: MENG DING Assignee: MENG DING Attachments: YARN-3868-YARN-1197.3.patch, YARN-3868.1.patch, YARN-3868.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630251#comment-14630251 ] Subru Krishnan commented on YARN-3656: -- Thanks [~asuresh] for reviewing the patch. We did consider allowing declarative plugging of planners during the early stages of development but decided against it to keep the code base simpler and easier to grok, as the current algorithms themselves are non-trivial. We are open to doing this in the future as and when the need arises. LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Ishai Menache Assignee: Jonathan Yaniv Labels: capacity-scheduler, resourcemanager Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630167#comment-14630167 ] Wangda Tan commented on YARN-3784: -- Beyond the timeout, another thing we may need to consider is: after a container is removed from the to-be-preempted list, should we notify the scheduler/AM about that? This could happen if other applications release containers, or other queues/applications cancel resource requests. Currently ProportionalCPP can notify the scheduler many times for the same container; if we had to-preempt/remove-from-to-preempt events, we could also reduce the number of messages sent to the scheduler (the flood of messages could cause YARN-3508 to happen). Indicate preemption timeout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. This introduces a timeout duration along with that container list so that the AM knows how much time it will get to do a graceful shutdown of its containers (assuming a preemption policy is loaded in the AM). This will help in NM decommissioning scenarios, where the NM will be decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed forcefully by the RM after it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
[ https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630209#comment-14630209 ] Anubhav Dhoot commented on YARN-3900: - This is needed for YARN-3736. Without this the leveldb state store implementation of YARN-3736 actually causes a dump Protobuf layout of yarn_security_token causes errors in other protos that include it - Key: YARN-3900 URL: https://issues.apache.org/jira/browse/YARN-3900 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3900.001.patch, YARN-3900.001.patch Because of the subdirectory server used in {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} there are errors in other protos that include them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630275#comment-14630275 ] Joep Rottinghuis commented on YARN-3908: bq. In fact, I'm wondering if we should but info and events into a separate column family like what we did for configs/metrics? In principle we should keep everything in the same column family (fewer store files) unless: a) The items that we store require a different TTL, compression, etc. This is the case for metrics where we need a separate TTL. b) The columns are rather significant in size, and in many queries they'll be skipped (and specifically not used in push-down predicate ie. column value filters etc). This is the case for configuration. If we have many queries to just retrieve info fields and we skip configs in these, then iterating over just the rows in the info column family will have a benefit of not needing to access the config store files. Otherwise a separate column family just results in more store files and doesn't really gain us anything. Given the current code setup, switching column family is almost trivial, so given that there are no functionality differences, I'd say let's not even try to further optimize this until we have way more code in place. Then we can run large batches of historical job history files and other inputs (perhaps porting data from ATS v1) and then we can see the potential benefit or downside. The other reason to not do premature optimization is that I'm still thinking of adding a few more perf tweaks. Those would also just be performance optimizations, and not any functionality different, so also not a priority now. We should look at tuning all those things much later and together in a coherent way. Additional settings that we need to test are RPC compression, encoding of the store files and/or compression of the same. In short, let's focus on completing functionality and then tinker with these settings later. 
Bugs in HBaseTimelineWriterImpl --- Key: YARN-3908 URL: https://issues.apache.org/jira/browse/YARN-3908 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Attachments: YARN-3908-YARN-2928.001.patch, YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch 1. In HBaseTimelineWriterImpl, the info column family contains the basic fields of a timeline entity plus events. However, entity#info map is not stored at all. 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-433: --- Attachment: YARN-433.3.patch rebase the patch When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630283#comment-14630283 ] Wangda Tan commented on YARN-3932: -- [~bibinchundatt], I think we can add a method such as getTotalUsed in the ResourceUsage class, which will be more efficient than iterating over all liveContainers. This can be done in the near term. To make it fully correct, I think we need to return a usage-by-partition object to the application, which requires API changes. SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: ApplicationReport.jpg The application resource report shows wrong values when a node label is used. 1. Submit an application with a NodeLabel. 2. Check the RM UI for resources used. Allocated CPU VCores and Allocated Memory MB are always {{zero}}.
{code}
public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
  AggregateAppResourceUsage runningResourceUsage =
      getRunningAggregateAppResourceUsage();
  Resource usedResourceClone =
      Resources.clone(attemptResourceUsage.getUsed());
  Resource reservedResourceClone =
      Resources.clone(attemptResourceUsage.getReserved());
  return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
      reservedContainers.size(), usedResourceClone, reservedResourceClone,
      Resources.add(usedResourceClone, reservedResourceClone),
      runningResourceUsage.getMemorySeconds(),
      runningResourceUsage.getVcoreSeconds());
}
{code}
should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
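A minimal sketch of the getTotalUsed idea suggested above, using simplified stand-in classes rather than the real ResourceUsage/Resource APIs: usage is tracked per node-label partition, and the total sums across partitions instead of reading only the default partition.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for YARN's Resource/ResourceUsage; names and fields
// here are illustrative, not the actual org.apache.hadoop.yarn API.
class SimpleResource {
    long memoryMB;
    int vcores;
    SimpleResource(long memoryMB, int vcores) { this.memoryMB = memoryMB; this.vcores = vcores; }
    void add(SimpleResource other) { memoryMB += other.memoryMB; vcores += other.vcores; }
}

class PartitionedUsage {
    private final Map<String, SimpleResource> usedByPartition = new HashMap<>();

    void incUsed(String partition, SimpleResource r) {
        usedByPartition.computeIfAbsent(partition, p -> new SimpleResource(0, 0)).add(r);
    }

    // Reading only the default partition ("") under-reports when containers
    // run on labeled nodes -- the bug described in YARN-3932.
    SimpleResource getUsed(String partition) {
        return usedByPartition.getOrDefault(partition, new SimpleResource(0, 0));
    }

    // The proposed getTotalUsed: sum usage across all partitions, avoiding
    // an iteration over every live container.
    SimpleResource getTotalUsed() {
        SimpleResource total = new SimpleResource(0, 0);
        for (SimpleResource r : usedByPartition.values()) {
            total.add(r);
        }
        return total;
    }
}
```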
[jira] [Commented] (YARN-3914) Entity created time should be part of the row key of entity table
[ https://issues.apache.org/jira/browse/YARN-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630125#comment-14630125 ] Sangjin Lee commented on YARN-3914: --- [~zjshen], we have been discussing this. While adding entity creation time to the row key may solve this problem, the concern is that it may introduce others. If the row key is (user/cluster/flow/run/app_id/entity_type/created_time/entity_id), then even the most basic query for (entity_type + entity_id) will get much more complicated, right? We cannot expect readers to provide the creation time every time they query for an entity by id. Also, as you said, we cannot always accommodate different query vectors by adding more to the row key, or we would risk blowing up the row key size or breaking other queries. We should be really judicious about what goes into the row key... I think it's reasonable to expect that the entity id order would be either completely or nearly identical to the chronological order (e.g. app id, or container id). So perhaps we could rely on the entity id order to help mitigate this problem. Thoughts? Entity created time should be part of the row key of entity table - Key: YARN-3914 URL: https://issues.apache.org/jira/browse/YARN-3914 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Entity created time should be part of the row key of the entity table, between entity type and entity id. The reason to have it is to index the entities. Though we cannot index the entities for all kinds of information, indexing them by created time is very necessary. Without it, every query for the latest entities that belong to an application and a type will scan through all the entities that belong to them. For example, if we want to list the 100 latest started containers in a YARN app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
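The assumption above, that entity-id order tracks chronological order, holds for IDs like container IDs as long as the numeric suffix is encoded fixed-width, so that lexicographic (byte) order in the row key matches numeric order. A toy illustration, not the actual ATS row-key encoding:

```java
class EntityIdOrder {
    // Zero-pad the numeric suffix so that lexicographic order of the id
    // string matches the numeric (and hence roughly chronological) order.
    static String containerId(long clusterTs, int attempt, int container) {
        return String.format("container_%d_%02d_%06d", clusterTs, attempt, container);
    }
}
```

Without the padding, "container_..._10" would sort before "container_..._2" and break a scan that relies on id order.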
[jira] [Commented] (YARN-3635) Get-queue-mapping should be a common interface of YarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630139#comment-14630139 ] Wangda Tan commented on YARN-3635: -- Hi [~sandyr], Thanks for your comments. Actually, I read QueuePlacementPolicy/QueuePlacementRule from FS before working on this patch. The generic design of this patch is based on FS's queue placement policy structure, but with some changes. To your comments: bq. Is a common way of configuration proposed? No common configuration; it only defines a set of common interfaces. Since FS/CS have very different styles of configuration, rules are now created by the individual schedulers; see CapacityScheduler#updatePlacementRules as an example. bq. What steps are required for the Fair Scheduler to integrate with this? 1) Port the existing rules to the new APIs defined in the patch; this should be simple. 2) Change the configuration implementation to instantiate the newly defined PlacementRule; you may not need to change the existing configuration items themselves. 3) Change the FS workflow: with this patch, queue mapping happens before submission to the scheduler, so remove the queue-mapping logic from FS and create the queue if needed. bq. Each placement rule gets the chance to assign the app to a queue, reject the app, or pass. If it passes, the next rule gets a chance. The new APIs are very similar: non-null means determined; null means not determined; an exception is thrown when the app is rejected. You can take a look at {{org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementRule}}. bq. A placement rule can base its decision on: Yes, you can do all of them with the new API except "The set of queues given in the Fair Scheduler configuration". I was thinking about the necessity of passing the set of queues in the interface. In existing implementations of QueuePlacementPolicy, FS queues are only used to check the mapped queue's existence. I would prefer to delay that check until submission to the scheduler.
See my next comment about the create flag for more details. Another reason for not passing the queue name set via the interface is that queues are very dynamic. For example, if a user wants to submit an application to the queue with the lowest utilization, a set of queue names may not be enough. I would prefer to let the rule fetch what it needs from the scheduler. bq. Rules are marked as terminal if they will never pass. This helps to avoid misconfigurations where users place rules after terminal rules. I'm not sure if it is necessary; I think whether a rule is terminal should be determined at runtime, but I'm OK with it if you think it's a must-have. bq. Rules have a create attribute which determines whether they can create a new queue or whether they must assign to existing queues. I think whether a queue is creatable should be determined by the scheduler; it should be part of the scheduler configuration instead of the rule itself. You can put create in your implemented rules without any issue, but I prefer not to expose it in the public interface. bq. Currently the set of placement rules is limited to what's implemented in YARN. I.e. there's no public pluggable rule support. Agree, this is something we need to do in the future. For now, we can make queue mapping happen in a central place first. bq. Are there places where my summary would not describe what's going on in this patch? I think it covers most of my patch; you can also take a look at the patch to see if anything is unexpected :). Get-queue-mapping should be a common interface of YarnScheduler --- Key: YARN-3635 URL: https://issues.apache.org/jira/browse/YARN-3635 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Wangda Tan Assignee: Tan, Wangda Attachments: YARN-3635.1.patch, YARN-3635.2.patch, YARN-3635.3.patch, YARN-3635.4.patch, YARN-3635.5.patch, YARN-3635.6.patch Currently, both the fair and capacity schedulers support queue mapping, which lets the scheduler change the queue of an application after it is submitted to the scheduler.
One issue with doing this in a specific scheduler is: if the queue after mapping has a different maximum_allocation/default-node-label-expression from the original queue, {{validateAndCreateResourceRequest}} in RMAppManager checks the wrong queue. I propose to make queue mapping a common interface of the scheduler, and have RMAppManager set the queue after mapping before doing validations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
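The rule-chain semantics described in the comment above (non-null result = queue determined, null = rule passes to the next one, exception = application rejected) can be sketched with simplified types. This mirrors the shape of the proposed org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementRule but is not the actual interface; all names here are illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the placement-rule chain; the real interface takes
// an ApplicationSubmissionContext and user rather than plain strings.
interface Rule {
    // Non-null: queue determined. Null: not determined, try the next rule.
    // An unchecked exception models "application rejected".
    String getQueue(String user, String requestedQueue);
}

class RuleChain {
    static String place(List<Rule> rules, String user, String requested) {
        for (Rule r : rules) {
            String q = r.getQueue(user, requested);
            if (q != null) {
                return q; // first rule that decides wins
            }
        }
        return requested; // no rule decided; keep the requested queue
    }
}
```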
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630289#comment-14630289 ] Joep Rottinghuis commented on YARN-3908: Patch looks good with one comment. I completely overlooked the event info map, because it isn't part of the javadoc on the EntityTable. I should have double-checked but didn't. Thanks for catching this. [~sjlee0] I think it would be good to update the javadoc that describes the EntityTable in the EntityTable.java file. The same is probably missing from the doc Timeline service schema for native HBase tables (not sure which jira the PDF for that doc is attached to), because I copied it from the code. I don't think that the application table has been copied yet, so it won't be missing from there. Bugs in HBaseTimelineWriterImpl --- Key: YARN-3908 URL: https://issues.apache.org/jira/browse/YARN-3908 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Attachments: YARN-3908-YARN-2928.001.patch, YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch 1. In HBaseTimelineWriterImpl, the info column family contains the basic fields of a timeline entity plus events. However, entity#info map is not stored at all. 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630659#comment-14630659 ] Xianyin Xin commented on YARN-3931: --- This reminds me of an earlier problem I ran into. Hi [~Naganarasimha], can we consider removing the empty node label expression in the code? It doesn't seem to make sense to set a node label to an empty value; a node label expression should be either some_label or null. Just a rough thought, what do you think? default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: kyungwan nam Attachments: YARN-3931.001.patch * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" and "am-container-node-label-expression" * RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630669#comment-14630669 ] Ajith S commented on YARN-3885: --- Thanks [~leftnoteasy], [~xinxianyin] and [~sunilg] :) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch In {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, the code that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630694#comment-14630694 ] Hong Zhiguo commented on YARN-2306: --- Hi [~rchiang], do you mean running the unit test in the patch against trunk? leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the metrics for reservations (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-1974) add args for DistributedShell to specify a set of nodes on which the tasks run
[ https://issues.apache.org/jira/browse/YARN-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo resolved YARN-1974. --- Resolution: Not A Problem add args for DistributedShell to specify a set of nodes on which the tasks run -- Key: YARN-1974 URL: https://issues.apache.org/jira/browse/YARN-1974 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.7.0 Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-1974.patch It's very useful to execute a script on a specific set of machines for both testing and maintenance purposes. The args --nodes and --relax_locality are added to DistributedShell, together with a unit test using miniCluster. It has also been tested on our real cluster with the Fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630742#comment-14630742 ] Hong Zhiguo commented on YARN-2306: --- Updated the patch. I ran testReservationMetrics several times and see no failures now. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306-3.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the metrics for reservations (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
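The leak pattern reported here is a missing decrement: reservation metrics are incremented when a container is reserved but never released when the app attempt or node goes away. A minimal sketch of the symmetric bookkeeping, with simplified names rather than the real QueueMetrics API:

```java
class ReservationMetrics {
    private int reservedContainers;
    private long reservedMB;

    void reserve(long mb) {
        reservedContainers++;
        reservedMB += mb;
    }

    // The fix amounts to making every removal path (container completion,
    // app-attempt removal, node removal) release outstanding reservations,
    // so the metrics return to zero instead of leaking.
    void unreserve(long mb) {
        reservedContainers--;
        reservedMB -= mb;
    }

    int getReservedContainers() { return reservedContainers; }
    long getReservedMB() { return reservedMB; }
}
```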
[jira] [Updated] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3049: -- Attachment: YARN-3049-WIP.2.patch [~sjlee0] and [~gtCarrera9], thanks for reviewing the patch. I'm currently targeting an E2E reader POC, and I'll try to address your comments a bit later. I uploaded a new WIP patch, which basically makes the reader work E2E, though there are a couple of bugs. I'll spend some more time fixing them. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630688#comment-14630688 ] Hong Zhiguo commented on YARN-2768: --- [~kasha], could you please review the patch? optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of the profiling result. The clone of the Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of FairScheduler.update(). The code of FSAppAttempt.updateDemand:
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource total = Resources.multiply(r.getCapability(), r.getNumContainers());
        Resources.addTo(demand, total);
      }
    }
  }
}
{code}
The code of Resources.multiply:
{code}
public static Resource multiply(Resource lhs, double by) {
  return multiplyTo(clone(lhs), by);
}
{code}
The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
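The proposed optimization amounts to a multiply-and-accumulate directly into the existing demand object instead of cloning a Resource per request. A simplified sketch with a stand-in resource type, not the actual org.apache.hadoop.yarn.util.resource.Resources helpers:

```java
// Stand-in for YARN's Resource; fields are illustrative.
class Res {
    long memory;
    int vcores;
    Res(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
}

class DemandCalc {
    // Original pattern: Resources.multiply() clones the capability, then
    // addTo() adds the clone into demand, allocating one object per request.
    // Clone-free version: scale each request and add straight into demand.
    static void multiplyAndAddTo(Res demand, Res capability, int numContainers) {
        demand.memory += capability.memory * numContainers;
        demand.vcores += capability.vcores * numContainers;
    }
}
```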
[jira] [Updated] (YARN-3845) [YARN] YARN status in web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3845: --- Attachment: YARN-3845.patch [YARN] YARN status in web ui does not show correctly in IE 11 - Key: YARN-3845 URL: https://issues.apache.org/jira/browse/YARN-3845 Project: Hadoop YARN Issue Type: Bug Reporter: Jagadesh Kiran N Assignee: Mohammad Shahid Khan Priority: Trivial Attachments: IE11_yarn.gif, YARN-3845.patch In IE 11, the colors for the scheduler do not display properly. Other browsers show it correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630794#comment-14630794 ] Sunil G commented on YARN-3784: --- Yes [~leftnoteasy], thank you for sharing your thoughts. If I understood you correctly, there are chances that a to-be-preempted container will remain in FicaSchedulerApp until the allocate call comes from the AM. Within this duration, some more containers may have been freed or had their resource requests cancelled, in which case we should remove the container from the to-be-preempted list. I feel we can have a remove-from-to-preempt call in the scheduler, and ProportionalCPP can notify the app when such a scenario occurs. This can also be added as a new argument to the AM response. I will separate this improvement into another ticket. On your second point, I feel we can keep a synchronized getter API for the to-be-preempted containers present in FicaSchedulerApp (scheduler level). With this API, ProportionalCPP can check whether a container newly identified for preemption has already been reported as to-be-preempted at the app level. If so, ProportionalCPP need not raise another event to the scheduler. I'll separate this out as well if that's OK. Indicate preemption timout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. Introducing a timeout duration along with this container list lets the AM know how much time it has to do a graceful shutdown of its containers (assuming a preemption policy is loaded in the AM).
This will help in NM-decommissioning scenarios, where the NM will be decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed by the RM forcefully after it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3736) Persist the Plan information, ie. accepted reservations to the RMStateStore for failover
[ https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630793#comment-14630793 ] Hadoop QA commented on YARN-3736: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 17s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 7s | The applied patch generated 1 new checkstyle issues (total was 104, now 104). | | {color:green}+1{color} | whitespace | 1m 53s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 4m 1s | The patch appears to introduce 3 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 3m 6s | Tests passed in hadoop-yarn-server-applicationhistoryservice. | | {color:green}+1{color} | yarn tests | 51m 3s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 101m 56s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745739/YARN-3736.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ee36f4f | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8566/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8566/console | This message was automatically generated. Persist the Plan information, ie. accepted reservations to the RMStateStore for failover Key: YARN-3736 URL: https://issues.apache.org/jira/browse/YARN-3736 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler, resourcemanager Reporter: Subru Krishnan Assignee: Anubhav Dhoot Attachments: YARN-3736.001.patch, YARN-3736.001.patch We need to persist the current state of the plan, i.e. 
the accepted ReservationAllocations and the corresponding RLESparseResourceAllocations to the RMStateStore so that we can recover them on RM failover. This involves making all the reservation system data structures protobuf friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630733#comment-14630733 ] Ray Chiang commented on YARN-2306: -- Heh. That was two months ago. I believe I was referring to the unit test. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the metrics for reservations (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)