[jira] [Commented] (YARN-3553) TreeSet is not a nice container for organizing schedulableEntities.
[ https://issues.apache.org/jira/browse/YARN-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526295#comment-14526295 ] Xianyin Xin commented on YARN-3553: --- Thanks, [~leftnoteasy] and [~cwelch]. Now that it is not an issue, I will just close it. TreeSet is not a nice container for organizing schedulableEntities. --- Key: YARN-3553 URL: https://issues.apache.org/jira/browse/YARN-3553 Project: Hadoop YARN Issue Type: Wish Components: scheduler Reporter: Xianyin Xin In a TreeSet, an element is identified by the comparator, not by object reference. If any *attribute that is used for comparing two elements* of a specific element is modified by other methods, the TreeSet ends up in an unsorted state and cannot become sorted again unless we reconstruct another TreeSet from the elements. To avoid this, one must be *very careful* when trying to modify the attributes of an object (such as increasing or decreasing the used capacity of a schedulableEntity). An example is in AbstractComparatorOrderingPolicy.java, line 63, {code} protected void reorderSchedulableEntity(S schedulableEntity) { //remove, update comparable data, and reinsert to update position in order schedulableEntities.remove(schedulableEntity); updateSchedulingResourceUsage( schedulableEntity.getSchedulingResourceUsage()); schedulableEntities.add(schedulableEntity); } {code} This method removes the schedulableEntity first and then reinserts it so as to reorder the set. However, the changes to the schedulableEntity should be made between those two operations, and because the comparator of the class is not known here, we don't know which attributes of the schedulableEntity were changed. If we change the schedulableEntity outside the method and then inform the orderingPolicy that we made such a change, the operation schedulableEntities.remove(schedulableEntity) will not work correctly, since an element of a TreeSet is located via the comparator. Any implementing class of this abstract class should override this method, but few do. Another choice is to modify a schedulableEntity manually, but then we must not forget to reorder the set, and we must remember the order: remove, modify the attributes (used for comparing), insert; or use an iterator to mark the schedulableEntity so that we can remove and reinsert it correctly. YARN-897 is an example where we fell into this trap. If the comparator becomes more complex in the future, e.g., if we consider other types of resources in the comparator, such traps will multiply and be scattered everywhere, making it easy for a TreeSet to end up in an unsorted state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
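To make the trap concrete, here is a minimal, self-contained sketch; the Entity class, its usage field, and the comparator below are illustrative stand-ins, not the actual SchedulableEntity API.
{code}
import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetReorderDemo {
  static class Entity {
    final String id;
    int usage; // attribute used by the comparator
    Entity(String id, int usage) { this.id = id; this.usage = usage; }
  }

  public static void main(String[] args) {
    TreeSet<Entity> set = new TreeSet<>(
        Comparator.<Entity>comparingInt(e -> e.usage).thenComparing(e -> e.id));
    Entity a = new Entity("a", 1);
    set.add(a);
    set.add(new Entity("b", 2));
    set.add(new Entity("c", 3));

    // Wrong: mutate the comparison key while the element is still in the set.
    a.usage = 5;
    // Lookup now walks the wrong branch of the tree, so the element may not
    // be found and stays stranded in a mis-ordered position.
    System.out.println(set.remove(a)); // may print false

    // Safe pattern, as in reorderSchedulableEntity(): remove, mutate, re-add.
    // set.remove(entity); entity.usage = newValue; set.add(entity);
  }
}
{code}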
[jira] [Commented] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526296#comment-14526296 ] Chen He commented on YARN-1612: --- Thank you for reviewing, Karthik, I will update the patch tomorrow. FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526301#comment-14526301 ] Naganarasimha G R commented on YARN-3523: - Hi [~vinodkv], Point noted :). To my knowledge it was mistakenly set in 2 JIRAs and was not intentional (otherwise it would have been set to a version earlier than 2.8.0). My 2 cents here: if there is an option in JIRA to disable/enable editing for a particular group, can we apply that so that the Fix Version can be added/modified by committers only? Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3018: Attachment: YARN-3018-1.patch Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1: public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can these be unified, to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the default values, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
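For reference, a minimal sketch of how the code-side default is picked up, using the standard Hadoop Configuration API; the wrapper class here is made up for illustration.
{code}
import org.apache.hadoop.conf.Configuration;

public class NodeLocalityDelayDefaultDemo {
  static final String NODE_LOCALITY_DELAY =
      "yarn.scheduler.capacity.node-locality-delay";
  static final int DEFAULT_NODE_LOCALITY_DELAY = -1; // code-side default

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // With no capacity-scheduler.xml entry, the code default (-1) is returned,
    // while the shipped default xml would supply 40; hence the confusion.
    System.out.println(
        conf.getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY));
  }
}
{code}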
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526307#comment-14526307 ] nijel commented on YARN-3018: - Thanks [~leftnoteasy] Uploaded the patch Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526308#comment-14526308 ] Hadoop QA commented on YARN-3069: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 46s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 43s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 52s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 23s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 1m 55s | Tests failed in hadoop-yarn-common. | | | | 38m 49s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.conf.TestYarnConfigurationFields | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730122/YARN-3069.005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a319771 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7682/artifact/patchprocess/whitespace.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7682/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7682/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7682/console | This message was automatically generated. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. 
org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
security.applicationhistory.protocol.acl
yarn.app.container.log.backups
yarn.app.container.log.dir
yarn.app.container.log.filesize
yarn.client.app-submission.poll-interval
yarn.client.application-client-protocol.poll-timeout-ms
yarn.is.minicluster
yarn.log.server.url
yarn.minicluster.control-resource-monitoring
yarn.minicluster.fixed.ports
yarn.minicluster.use-rpc
yarn.node-labels.fs-store.retry-policy-spec
yarn.node-labels.fs-store.root-dir
yarn.node-labels.manager-class
yarn.nodemanager.container-executor.os.sched.priority.adjustment
yarn.nodemanager.container-monitor.process-tree.class
yarn.nodemanager.disk-health-checker.enable
yarn.nodemanager.docker-container-executor.image-name
yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
yarn.nodemanager.linux-container-executor.group
yarn.nodemanager.log.deletion-threads-count
yarn.nodemanager.user-home-dir
yarn.nodemanager.webapp.https.address
yarn.nodemanager.webapp.spnego-keytab-file
yarn.nodemanager.webapp.spnego-principal
yarn.nodemanager.windows-secure-container-executor.group
yarn.resourcemanager.configuration.file-system-based-store
yarn.resourcemanager.delegation-token-renewer.thread-count
yarn.resourcemanager.delegation.key.update-interval
yarn.resourcemanager.delegation.token.max-lifetime
yarn.resourcemanager.delegation.token.renew-interval
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526309#comment-14526309 ] Hadoop QA commented on YARN-3018: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 0m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 14s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | | | 0m 19s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730124/YARN-3018-1.patch | | Optional Tests | | | git revision | trunk / 3ba1836 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7683/artifact/patchprocess/whitespace.txt | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7683/console | This message was automatically generated. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1971) WindowsLocalWrapperScriptBuilder does not check for errors in generated script
[ https://issues.apache.org/jira/browse/YARN-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526344#comment-14526344 ] Remus Rusanu commented on YARN-1971: The problem is that there is no error check in the generated script. For comparison, the ContainerLaunch.WindowsShellScriptBuilder checks each line in the generated script by adding this line automatically in the script, after each command: {code} @if %errorlevel% neq 0 exit /b %errorlevel% {code} I'm not advocating checking for various error conditions before launching the script; I'm saying the generated script itself should have error checking and handling. WindowsLocalWrapperScriptBuilder does not check for errors in generated script -- Key: YARN-1971 URL: https://issues.apache.org/jira/browse/YARN-1971 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Minor Similar to YARN-1865. The DefaultContainerExecutor.WindowsLocalWrapperScriptBuilder builds a shell script that contains commands that potentially may fail: {code} pout.println("@echo " + containerIdStr + " > " + normalizedPidFile + ".tmp"); pout.println("@move /Y " + normalizedPidFile + ".tmp " + normalizedPidFile); {code} These can fail due to access permissions, the disk being out of space, bad hardware, cosmic rays, etc. There should be proper error checking to ease troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
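A rough sketch of the suggested direction, mirroring the WindowsShellScriptBuilder behavior quoted above; the class and method names here are illustrative, not the actual DefaultContainerExecutor code.
{code}
import java.io.PrintStream;

public class WindowsWrapperScriptSketch {
  // Emit the same error check WindowsShellScriptBuilder uses after each command.
  private static void errorCheck(PrintStream pout) {
    pout.println("@if %errorlevel% neq 0 exit /b %errorlevel%");
  }

  static void writeWrapperScript(PrintStream pout, String containerIdStr,
      String normalizedPidFile) {
    pout.println("@echo " + containerIdStr + " > " + normalizedPidFile + ".tmp");
    errorCheck(pout);
    pout.println("@move /Y " + normalizedPidFile + ".tmp " + normalizedPidFile);
    errorCheck(pout);
  }
}
{code}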
[jira] [Commented] (YARN-2775) There is no close method in NMWebServices#getLogs()
[ https://issues.apache.org/jira/browse/YARN-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526366#comment-14526366 ] Tsuyoshi Ozawa commented on YARN-2775: -- [~skrho], thank you for taking this issue. I agree that we need to close the files after creating the FileInputStream. How about using a try-with-resources statement, since we now only support JDK 7 or later? http://docs.oracle.com/javase/7/docs/technotes/guides/language/try-with-resources.html {code} try (final FileInputStream fis = ContainerLogsUtils.openLogFileForRead( containerIdStr, logFile, nmContext)) { // use fis } catch (IOException e) { // handle the failure } {code} There is no close method in NMWebServices#getLogs() --- Key: YARN-2775 URL: https://issues.apache.org/jira/browse/YARN-2775 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: skrho Priority: Minor Attachments: YARN-2775_001.patch If the getLogs method is called, FileInputStream objects accumulate in memory, because the FileInputStream object is not closed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
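A generic, self-contained illustration of the suggested pattern (plain java.io, not the NM code): the stream is closed automatically when the block exits, even if an exception is thrown.
{code}
import java.io.FileInputStream;
import java.io.IOException;

public class TryWithResourcesDemo {
  public static void main(String[] args) {
    try (FileInputStream fis = new FileInputStream(args[0])) {
      int first = fis.read(); // use the stream; no explicit close() needed
      System.out.println("first byte: " + first);
    } catch (IOException e) {
      System.err.println("failed to read " + args[0] + ": " + e);
    }
  }
}
{code}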
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526399#comment-14526399 ] Sunil G commented on YARN-3557: --- bq.Currently for centralized node label configuration, it only supports admin configure node label through CLI. Apart from the CLI and REST, do you mean exposing this configuration to a specific user (I assume this user would have some security approval in the cluster) so that this user can make the change via REST or the APIs? Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526416#comment-14526416 ] Dian Fu commented on YARN-3557: --- Hi [~sunilg], Thanks for your comments. {quote}Apart from the CLI and REST, do you mean exposing this configuration to a specific user (I assume this user would have some security approval in the cluster) so that this user can make the change via REST or the APIs?{quote} Exposing this configuration to a specific user could be one option, but it would require users to run a job that updates the labels periodically, which is complicated for users. If we can provide a method similar to YARN-2495 at the RM side, the user will just need to provide a script (which takes the node hostname/IP as input and outputs the node labels). Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526423#comment-14526423 ] Sunil G commented on YARN-2305: --- Yes, this can be closed. I have checked, and it is not occurring. Still, I will perform a few more tests, and if it persists, I will reopen. Thank you [~leftnoteasy] When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state, total memory is shown as 15GB and used as 15GB (while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3018: Attachment: YARN-3018-2.patch Updated the patch to remove the white spaces Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526435#comment-14526435 ] Hadoop QA commented on YARN-3018: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730139/YARN-3018-2.patch | | Optional Tests | | | git revision | trunk / bb9ddef | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7684/console | This message was automatically generated. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2293) Scoring for NMs to identify a better candidate to launch AMs
[ https://issues.apache.org/jira/browse/YARN-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526441#comment-14526441 ] Sunil G commented on YARN-2293: --- Hi [~zjshen] This work is moved to YARN-2005, I will share a basic prototype soon in that. This can be made as duplicated to YARN-2005. Scoring for NMs to identify a better candidate to launch AMs Key: YARN-2293 URL: https://issues.apache.org/jira/browse/YARN-2293 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Sunil G Assignee: Sunil G Container exit status from NM is giving indications of reasons for its failure. Some times, it may be because of container launching problems in NM. In a heterogeneous cluster, some machines with weak hardware may cause more failures. It will be better not to launch AMs there more often. Also I would like to clear that container failures because of buggy job should not result in decreasing score. As mentioned earlier, based on exit status if a scoring mechanism is added for NMs in RM, then NMs with better scores can be given for launching AMs. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2293) Scoring for NMs to identify a better candidate to launch AMs
[ https://issues.apache.org/jira/browse/YARN-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G resolved YARN-2293. --- Resolution: Duplicate Scoring for NMs to identify a better candidate to launch AMs Key: YARN-2293 URL: https://issues.apache.org/jira/browse/YARN-2293 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Sunil G Assignee: Sunil G Container exit status from NM is giving indications of reasons for its failure. Some times, it may be because of container launching problems in NM. In a heterogeneous cluster, some machines with weak hardware may cause more failures. It will be better not to launch AMs there more often. Also I would like to clear that container failures because of buggy job should not result in decreasing score. As mentioned earlier, based on exit status if a scoring mechanism is added for NMs in RM, then NMs with better scores can be given for launching AMs. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2256) Too many nodemanager and resourcemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526445#comment-14526445 ] Varun Saxena commented on YARN-2256: [~zjshen], just to brief you on the issue: in our setup we were getting too many audit logs related to container events. We also found some other unnecessary logs (not required for debugging) appearing frequently, and had raised another JIRA for that. So we internally took up the task of cleaning up these logs, which also made a slight improvement in the throughput of the running process (2.4.0). To resolve the problem, one option was to remove these logs completely, but we decided to support different log levels for audit logs, so that if some customer requires these logs we can enable them by merely changing the log4j properties. The scope of these 2 JIRAs is indeed interrelated, but I segregated them because I wasn't sure whether the community would accept support for different log levels. We can decide if we need either one of these. Too many nodemanager and resourcemanager audit logs are generated - Key: YARN-2256 URL: https://issues.apache.org/jira/browse/YARN-2256 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 2.4.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2256.patch Following audit logs are generated too many times (due to the possibility of a large number of containers) : 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a container 2. In RM - Audit logs corresponding to AM allocating a container and AM releasing a container We can have different log levels even for NM and RM audit logs and move these successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
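A small sketch of the log-level idea described above; the class and message format are illustrative, not the actual NMAuditLogger/RMAuditLogger code. Routine per-container success events go to DEBUG so they can be enabled through log4j properties when needed.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class AuditLogLevelSketch {
  private static final Log AUDITLOG = LogFactory.getLog(AuditLogLevelSketch.class);

  static void logContainerSuccess(String user, String operation, String containerId) {
    // Successful container events are frequent; keep them at DEBUG level.
    if (AUDITLOG.isDebugEnabled()) {
      AUDITLOG.debug("USER=" + user + "\tOPERATION=" + operation
          + "\tRESULT=SUCCESS\tCONTAINERID=" + containerId);
    }
  }
}
{code}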
[jira] [Commented] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526446#comment-14526446 ] Sunil G commented on YARN-2267: --- It would be a good feature if we could plug a few resource monitoring services into the RM, such as mentioned in *Scenario 1* above. Could you please share the design thoughts for this? The main question is how it can be done in a controlled way; by that I mean the introduction of a plugin should not conflict with the existing behavior of the schedulers, etc. Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any auxiliary services. For health/monitoring in RM, it's better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526447#comment-14526447 ] Varun Saxena commented on YARN-3148: Thanks [~gtCarrera] for looking at this. Will update the patch ASAP. allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Reporter: Prakash Ramachandran Assignee: Varun Saxena Attachments: YARN-3148.001.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526452#comment-14526452 ] Varun Saxena commented on YARN-2902: [~leftnoteasy], sorry for the delay. I was on long leave and have come back today. We are pretty clear on how to handle it for private resources (as per the comment you highlighted), but hadn't updated the patch, as I need to simulate and investigate further for public resources. I will check it and update ASAP. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3018: Attachment: YARN-3018-3.patch Re trigger the CIS. Patch was wrongly generated sorry for the noise Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch, YARN-3018-3.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526466#comment-14526466 ] Hadoop QA commented on YARN-3018: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 0m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 15s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | | | 0m 18s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730144/YARN-3018-3.patch | | Optional Tests | | | git revision | trunk / bb9ddef | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7685/console | This message was automatically generated. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch, YARN-3018-3.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-1662) Capacity Scheduler reservation issue cause Job Hang
[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G resolved YARN-1662. --- Resolution: Invalid Capacity Scheduler reservation issue cause Job Hang --- Key: YARN-1662 URL: https://issues.apache.org/jira/browse/YARN-1662 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.2.0 Environment: Suse 11 SP1 + Linux Reporter: Sunil G There are 2 node managers in my cluster. NM1 with 8GB NM2 with 8GB I am submitting a Job with below details: AM with 2GB Map needs 5GB Reducer needs 3GB slowstart is enabled with 0.5 10maps and 50reducers are assigned. 5maps are completed. Now few reducers got scheduled. Now NM1 has 2GB AM and 3Gb Reducer_1[Used 5GB] NM2 has 3Gb Reducer_2 [Used 3GB] A Map has now reserved(5GB) in NM1 which has only 3Gb free. It hangs forever. Potential issue is, reservation is now blocked in NM1 for a Map which needs 5GB. But the Reducer_1 hangs by waiting for few map ouputs. Reducer side preemption also not happened as few headroom is still available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1662) Capacity Scheduler reservation issue cause Job Hang
[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526473#comment-14526473 ] Sunil G commented on YARN-1662: --- Yes [~jianhe], we can close this issue. After YARN-1769, we have better reservation behavior too. I checked this and it's not happening now. Capacity Scheduler reservation issue cause Job Hang --- Key: YARN-1662 URL: https://issues.apache.org/jira/browse/YARN-1662 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.2.0 Environment: Suse 11 SP1 + Linux Reporter: Sunil G There are 2 node managers in my cluster. NM1 with 8GB NM2 with 8GB I am submitting a Job with below details: AM with 2GB Map needs 5GB Reducer needs 3GB slowstart is enabled with 0.5 10maps and 50reducers are assigned. 5maps are completed. Now few reducers got scheduled. Now NM1 has 2GB AM and 3Gb Reducer_1[Used 5GB] NM2 has 3Gb Reducer_2 [Used 3GB] A Map has now reserved(5GB) in NM1 which has only 3Gb free. It hangs forever. Potential issue is, reservation is now blocked in NM1 for a Map which needs 5GB. But the Reducer_1 hangs by waiting for few map outputs. Reducer side preemption also did not happen as some headroom is still available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526577#comment-14526577 ] Eric Payne commented on YARN-3097: -- {quote} -1 The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {quote} Since the only change in this patch is to change an info log message to a debug log message, no tests were included. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526578#comment-14526578 ] Naganarasimha G R commented on YARN-3562: - Seems to be some issue with Jenkins, compilation is passing and the test logs are showing as compilation issues ! unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3562-YARN-2928.001.patch *Issues reported from MAPREDUCE-6337* : A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044* : Test case failures and tools(FB CS) issues found : # find bugs issue : Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # find bugs issue : Boxing/unboxing to parse a primitive RMTimelineCollectorManager.postPut. Called method Long.longValue() Should call Long.parseLong(String) instead. # find bugs issue : DM_DEFAULT_ENCODING Called method new java.io.FileWriter(String, boolean) At FileSystemTimelineWriterImpl.java:\[line 86\] # hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus, refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers
[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526588#comment-14526588 ] Jun Gong commented on YARN-3474: [~vinodkv] Thank you for the explanation. Closing it now. Add a way to let NM wait RM to come back, not kill running containers - Key: YARN-3474 URL: https://issues.apache.org/jira/browse/YARN-3474 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3474.01.patch When RM HA is enabled and active RM shuts down, standby RM will become active, recover apps and attempts. Apps will not be affected. If there are some cases or bugs that cause both RM could not start normally(e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340]; RM could not connect with ZK well). NM will kill containers running on it when it could not heartbeat with RM for some time(max retry time is 15 mins by default). Then all apps will be killed. In production cluster, we might come across above cases and fixing these bugs might need time more than 15 mins. In order to let apps not be affected and killed by NM, YARN admin could set a flag(the flag is a znode '/wait-rm-to-come-back/cluster-id' in our solution) to tell NM wait for RM to come back and not kill running containers. After fixing bugs and RM start normally, clear the flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers
[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3474. Resolution: Invalid Add a way to let NM wait RM to come back, not kill running containers - Key: YARN-3474 URL: https://issues.apache.org/jira/browse/YARN-3474 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3474.01.patch When RM HA is enabled and active RM shuts down, standby RM will become active, recover apps and attempts. Apps will not be affected. If there are some cases or bugs that cause both RM could not start normally(e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340]; RM could not connect with ZK well). NM will kill containers running on it when it could not heartbeat with RM for some time(max retry time is 15 mins by default). Then all apps will be killed. In production cluster, we might come across above cases and fixing these bugs might need time more than 15 mins. In order to let apps not be affected and killed by NM, YARN admin could set a flag(the flag is a znode '/wait-rm-to-come-back/cluster-id' in our solution) to tell NM wait for RM to come back and not kill running containers. After fixing bugs and RM start normally, clear the flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526623#comment-14526623 ] Naganarasimha G R commented on YARN-2729: - Thanks for the review comments, [~vinodkv]. bq.SCRIPT_NODE_LABELS_PROVIDER and CONFIG_NODE_LABELS_PROVIDER are not needed, delete them, you have separate constants for their prefixes Actually these are not prefixes; as per [~Wangda]'s [comment|https://issues.apache.org/jira/browse/YARN-2729?focusedCommentId=14393545page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14393545] we had decided to have whitelisting for the provider: {{The option will be: yarn.node-labels.nm.provider = config/script/other-class-name.}} These are the modifications for it. bq. DISABLE_NODE_LABELS_PROVIDER_FETCH_TIMER doesn't need to be in YarnConfiguration As per one of Wangda's comments, possible values or default values of configurations had to be kept in YarnConfiguration, hence I placed it here; if required, as per your comment, I can move it to AbstractNodeLabelsProvider. bq. LOG is not used anywhere Are logs expected when the labels are set in {{setNodeLabels}}? I can add them here, but in any case there are logs in NodeStatusUpdaterImpl on successful and unsuccessful attempts. bq. BTW, assuming YARN-3565 goes in first, you will have to make some changes here. bq. I think the format expected from the command should be more structured. Specifically as we expect more per-label attributes in line with YARN-3565. I was thinking about this while working on YARN-3565, but didn't modify the NodeLabelsProvider, because currently the labels (currently partitions) that need to be sent from the NM have to be part of the RM's cluster NodeLabel set. So exclusiveness need not be sent from the NM to the RM, as we currently support specifying exclusiveness only while adding cluster node labels. So IMHO, if there is a plan to make this interface public and stable, it would be better to do these changes now; if not, it would be better to do them after the requirement for constraint labels comes in, so that there is more clarity on the structure. [~wangda] and you can share your opinions on this, and based on that I will do the modifications. bq. Not caused by your patch but worth fixing here. NodeStatusUpdaterImpl shouldn't worry about invalid label-set, previous-valid-labels and label validation. You should move all that functionality into NodeLabelsProvider. As per the class responsibility, I understand that NodeStatusUpdaterImpl is not supposed to have it, but as the provider might be expected to be public we had to ensure that * for every heartbeat, labels are sent across only if modified * basic validations are done before sending the modified labels These need to be done irrespective of the label provider (system's or user's), hence I kept them in NodeStatusUpdaterImpl. If they are required to be moved out, then we need to bring in some intermediate manager (/helper/delegator) class between NodeStatusUpdaterImpl and NodeLabelsProvider. Those changes were also from my previous patch, so no hard feelings in taking care of it if required :). bq. Can you add the documentation for setting this up too? I was planning to raise a JIRA for updating the documentation on top of NodeLabels, but the documentation for it is not yet completed.
If required, I can just add some pdf here. Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
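For illustration, a rough sketch of what a script-based provider could look like; the class shape and the comma-separated output convention are assumptions for this example, not the actual YARN-2729 interface.
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

public class ScriptNodeLabelsProviderSketch {
  private final String scriptPath;

  public ScriptNodeLabelsProviderSketch(String scriptPath) {
    this.scriptPath = scriptPath;
  }

  /** Runs the script and treats its last output line as a comma-separated label list. */
  public Set<String> fetchNodeLabels() throws IOException, InterruptedException {
    Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
    String lastLine = "";
    try (BufferedReader r =
             new BufferedReader(new InputStreamReader(p.getInputStream()))) {
      String line;
      while ((line = r.readLine()) != null) {
        lastLine = line.trim();
      }
    }
    if (p.waitFor() != 0) {
      throw new IOException("label script exited with code " + p.exitValue());
    }
    Set<String> labels = new HashSet<>();
    for (String l : lastLine.split(",")) {
      if (!l.trim().isEmpty()) {
        labels.add(l.trim());
      }
    }
    return labels;
  }
}
{code}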
[jira] [Commented] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526629#comment-14526629 ] Naganarasimha G R commented on YARN-3565: - Thanks for the review comments, [~vinodkv]. I agree with most of your suggestions but had a few queries overall: * Can there be changes again when labels as constraints are introduced? I am not sure exclusivity will have any significance with constraints, if we plan to make use of the NodeLabel class for constraints too. * Will the CLI also require changes for adding and removing cluster node labels and for mapping nodes to labels? * If RMNodeLabelsManager.replaceLabelsOnNode() needs to be modified, then I think we need to make YARN-3521 dependent on this JIRA, right? NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String - Key: YARN-3565 URL: https://issues.apache.org/jira/browse/YARN-3565 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Priority: Blocker Attachments: YARN-3565-20150502-1.patch Now NM HB/Register uses Set<String>; it will be hard to add new fields if we want to support specifying NodeLabel type such as exclusivity/constraints, etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3523: Attachment: YARN-3523.20150504-1.patch I have checked the 2.7.0 API docs, and neither this class nor its package is captured there. Hence I have modified the audience of the methods to private in this updated patch. Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
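For reference, a tiny sketch of the consistency being aimed for; the method signature is illustrative, not the full protocol. If the interface itself is @Private, its methods should not carry a broader @Public audience.
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Evolving;

@Private
@Evolving
interface AdminProtocolSketch {
  @Private
  void refreshQueues() throws Exception; // audience matches the interface
}
{code}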
[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain
[ https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2622: - Target Version/s: (was: 2.6.0) RM should put the application related timeline data into a secured domain - Key: YARN-2622 URL: https://issues.apache.org/jira/browse/YARN-2622 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the application related timeline data is put into the default domain. It is not secured. We should let RM to choose a secured domain to put the system metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526700#comment-14526700 ] Hadoop QA commented on YARN-2618: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12723515/YARN-2618-7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / bb9ddef | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7687/console | This message was automatically generated. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526710#comment-14526710 ] Hadoop QA commented on YARN-3523: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 55s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 2s | The applied patch generated 1 new checkstyle issues (total was 17, now 18). | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 6 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. | | | | 38m 5s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730182/YARN-3523.20150504-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / bb9ddef | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7686/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7686/artifact/patchprocess/whitespace.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7686/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7686/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7686/console | This message was automatically generated. Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526747#comment-14526747 ] Sunil G commented on YARN-3521: --- 1. bq.Should be exclusivity. Yes, I updated the same. 2. bq.Did we ever call these APIs stable? No. I have changed to a NodeLabelsInfo object and added a new getter which can supply a list/set of string names. 3. bq.Why are we not dropping the name-only records? I have removed *NodeLabelsName* and instead use *NodeLabelsInfo*, and also added a new getter which can give back the String names of the labels. NodeToLabelsName is renamed as NodeToLabelsInfo and internally it also uses NodeLabelInfo. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch In YARN-3413, yarn cluster CLI returns NodeLabel instead of String, we should make the same change in REST API side to make them consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
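For readers following along, a rough sketch of the kind of DAO described in points 2 and 3 above: a NodeLabelsInfo wrapper holding NodeLabelInfo records plus a convenience getter for the plain string names. The field and method names are guesses for illustration, not the contents of the attached patch.
{code}
import java.util.ArrayList;

import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "nodeLabelsInfo")
@XmlAccessorType(XmlAccessType.FIELD)
public class NodeLabelsInfo {

  @XmlElement(name = "nodeLabelInfo")
  private ArrayList<NodeLabelInfo> nodeLabelsInfo = new ArrayList<NodeLabelInfo>();

  public NodeLabelsInfo() {
    // JAXB requires a no-arg constructor.
  }

  public ArrayList<NodeLabelInfo> getNodeLabelsInfo() {
    return nodeLabelsInfo;
  }

  // Convenience getter for callers that only want the label names as strings.
  public ArrayList<String> getNodeLabelsName() {
    ArrayList<String> names = new ArrayList<String>();
    for (NodeLabelInfo label : nodeLabelsInfo) {
      names.add(label.getName());
    }
    return names;
  }
}
{code}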
[jira] [Updated] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3521: -- Attachment: 0004-YARN-3521.patch [~vinodkv] and [~leftnoteasy] Pls share your thoughts on this updated patch. IMO I also feel that NodeLabelManager apis can use Object rather than Strings. Admin interface can take this conversion logic. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, yarn cluster CLI returns NodeLabel instead of String, we should make the same change in REST API side to make them consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526755#comment-14526755 ] Jason Lowe commented on YARN-3097: -- +1, committing this. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3388) Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit
[ https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526763#comment-14526763 ] Nathan Roberts commented on YARN-3388: -- Yes. I have a patch which I think is close. I need to merge to the latest trunk, then I'll post it for review. Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit - Key: YARN-3388 URL: https://issues.apache.org/jira/browse/YARN-3388 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch When there are multiple active users in a queue, it should be possible for those users to make use of capacity up-to max_capacity (or close). The resources should be fairly distributed among the active users in the queue. This works pretty well when there is a single resource being scheduled. However, when there are multiple resources the situation gets more complex and the current algorithm tends to get stuck at Capacity. Example illustrated in subsequent comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526764#comment-14526764 ] Hudson commented on YARN-3097: -- FAILURE: Integrated in Hadoop-trunk-Commit #7723 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7723/]) YARN-3097. Logging of resource recovery on NM restart has redundancies. Contributed by Eric Payne (jlowe: rev 8f65c793f2930bfd16885a2ab188a9970b754974) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526775#comment-14526775 ] Jason Lowe commented on YARN-3554: -- YARN-3518 is a separate concern with different ramifications. We should discuss it there and not mix these two. bq. set this to a bigger value maybe based on network partition considerations not only for nm restart. What value do you propose? As pointed out earlier, anything over 10 minutes is pointless since the container allocation expires in that time. Is it common for network partitions to take longer than 3 minutes but less than 10 minutes? If so we should tune the value for that. If not then making the value larger just slows recovery time. bq. 3 mins seems dangerous, If rm fails over and the recover takes serval mins, nm maybe kill all containers, in production env, it's not expected. This JIRA is configuring the amount of time NM clients (i.e.: primarily ApplicationMasters and the RM when launching ApplicationMasters) will try to connect to a particular NM before failing. I'm missing how RM failover leads to a mass killing of containers due to this proposed change. This is not a property used by the NM, so the NM is not going to start killing all containers differently based on an updated value for it. The only case where the RM will use this property is when connecting to NMs to launch AM containers, and it will only do so for NMs that have recently heartbeated. Could you explain how this leads to all containers getting killed on a particular node? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526791#comment-14526791 ] Wei Yan commented on YARN-2618: --- Thanks, [~djp], I'll rebase the patch. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526934#comment-14526934 ] Naganarasimha G R commented on YARN-3554: - Hi [~jlowe], earlier my query about the ideal time and [~sandflee]'s comment were related to yarn.resourcemanager.connect.max-wait.ms, and as [~gtCarrera] mentioned, it was just for discussion purposes. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526957#comment-14526957 ] Wilfred Spiegelenburg commented on YARN-3491: - Can we clean up getInitializedLogDirs() and getInitializedLocalDirs() now that we're changing them? Neither of the methods needs to return anything since we do not use the return value. Also, renaming the methods would make it clearer: getInitializedLogDirs() -> initializeLogDirs(), getInitializedLocalDirs() -> initializeLocalDirs() PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
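A minimal sketch of the rename/cleanup suggested above, assuming the return values really are unused. The helper names and the dirsHandler-based bodies are illustrative, not the code in the attached patch.
{code}
// Before (shape only): values are computed and returned, but callers ignore them.
//   private List<String> getInitializedLocalDirs() { ... }
//   private List<String> getInitializedLogDirs() { ... }

// After: void methods whose names say what they do rather than what they return.
private void initializeLocalDirs() {
  for (String localDir : dirsHandler.getLocalDirs()) {
    checkAndInitializeLocalDir(localDir);   // hypothetical helper
  }
}

private void initializeLogDirs() {
  for (String logDir : dirsHandler.getLogDirs()) {
    checkAndInitializeLogDir(logDir);       // hypothetical helper
  }
}
{code}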
[jira] [Commented] (YARN-1564) add some basic workflow YARN services
[ https://issues.apache.org/jira/browse/YARN-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526963#comment-14526963 ] Zhijie Shen commented on YARN-1564: --- YARN-2928 is going to support flow as a first-class citizen. It will be great if we can coordinate on this between app management and monitoring. add some basic workflow YARN services - Key: YARN-1564 URL: https://issues.apache.org/jira/browse/YARN-1564 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Attachments: YARN-1564-001.patch Original Estimate: 24h Time Spent: 48h Remaining Estimate: 0h I've been using some alternative composite services to help build workflows of process execution in a YARN AM. They and their tests could be moved in YARN for the use by others -this would make it easier to build aggregate services in an AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526983#comment-14526983 ] Jason Lowe commented on YARN-3554: -- Ah, thanks [~Naganarasimha], sorry I missed that. We can continue discussing the proper RM connect wait time over at YARN-3518, as obviously I cannot keep them straight here. ;-) Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3422) relatedentities always return empty list when primary filter is set
[ https://issues.apache.org/jira/browse/YARN-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527012#comment-14527012 ] Zhijie Shen commented on YARN-3422: --- [~billie.rina...@gmail.com], thanks for explaining the rationale. Hence the attached patch is probably not the right fix. bq. In retrospect, the directional nature of the related entity relationship seems to make things more confusing. Perhaps it would be better if relatedness were bidirectional. I think directional may be okay, but the confusing part is that we're storing A -> B while we query B -> A, even though we always say related entities. In fact, we need to differentiate the two. When storing A, B resides in A's entity as the isRelatedTo entity, and when querying B, A is shown as the relatesTo entity. Of course, we could also query A, and B would be shown as the isRelatedTo entity, which is not supported here. This problem will be resolved in ATS v2. Moreover, it's also a limitation of the way we store the primary filter. The index table is a copy of the whole entity (only the information that comes with the current put), with the primary filter attached as the prefix of the key. It makes it expensive to define one primary key for an entity, and probably results in different snapshots of the entity with different primary keys. In this example, B doesn't have primary filter C, and even if we later add C for B, we will still not be able to get related entity A when querying B via primary filter C. That's one reason why I suggest using a reverse index in YARN-3448. However, for the current LeveldbTimelineStore, I'm not sure we have a quick way to resolve the problem. Thoughts? relatedentities always return empty list when primary filter is set --- Key: YARN-3422 URL: https://issues.apache.org/jira/browse/YARN-3422 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Chang Li Assignee: Chang Li Attachments: YARN-3422.1.patch When you curl for ats entities with a primary filter, the relatedentities fields always return empty list -- This message was sent by Atlassian JIRA (v6.3.4#6332)
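To make the directionality above concrete, here is a small client-side illustration using the public TimelineEntity API. The entity types, ids and the primary filter name are made up, and the comments restate the behaviour described in the comment rather than the store internals.
{code}
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

public class RelatedEntityDirectionExample {
  public static void main(String[] args) {
    // A is put with a pointer to B: the relation is carried by A's put.
    TimelineEntity a = new TimelineEntity();
    a.setEntityType("TYPE_A");
    a.setEntityId("A");
    a.addRelatedEntity("TYPE_B", "B");

    // B is put with primary filter C (whether at the same time or later).
    TimelineEntity b = new TimelineEntity();
    b.setEntityType("TYPE_B");
    b.setEntityId("B");
    b.addPrimaryFilter("C", "someValue");

    // Querying TYPE_B with primaryFilter C=someValue finds B, but the indexed
    // copy of B written under that filter only contains the information that
    // came with B's own put, so the relation recorded by A's put is missing
    // and relatedentities comes back empty for that query path.
  }
}
{code}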
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527020#comment-14527020 ] Vinod Kumar Vavilapalli commented on YARN-2618: --- Haven't looked at this so far, Tx for rekicking it Junping! Taking a quick look now.. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
Mit Desai created YARN-3573: --- Summary: MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
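A minimal sketch of the deprecation being proposed; the javadoc wording and the pointer to yarn.timeline-service.enabled are suggestions rather than the eventual patch, and the constructor body is elided.
{code}
/**
 * @deprecated Start the timeline server via configuration
 * (yarn.timeline-service.enabled) and use
 * {@link #MiniMRYarnCluster(String, int)} instead of passing a boolean flag.
 */
@Deprecated
public MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS) {
  // existing constructor body unchanged
}
{code}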
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527035#comment-14527035 ] Vinod Kumar Vavilapalli commented on YARN-2618: --- Okay, quickly scanned. Seems like you are having other related discussions at the umbrella ticket and other JIRAs. So please go ahead. Is this only for trunk or branch-2 also? Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai reassigned YARN-3573: --- Assignee: Mit Desai MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Attachment: YARN-1612-003.patch patch updated. FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Attachment: (was: YARN-1612-003.patch) FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Attachment: YARN-1612-003.patch FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-3573: Assignee: (was: Mit Desai) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527069#comment-14527069 ] Jian He commented on YARN-3480: --- [~hex108], generally, it's better to avoid a global config for an outlier app. 1. How often do you see an app fail with a large number of attempts? If it's limited to a few apps, I wouldn't worry so much. bq. make RM recover process much slower. 2. How much slower is it in reality in your case? We've done some benchmarking; recovering 10k apps (with 1 attempt each) on ZK is pretty fast, within 20 seconds or so. 3. Limiting the attempts to be recorded means we are losing history; it's a trade-off. My main point is that if you can provide some real numbers showing how slow the recovery process is in a real scenario, we can figure out where the bottleneck is and how to improve it. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMAppImpl and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
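For context, the existing global knob the description refers to can be raised as below. The proposed "max attempts to store" property does not exist at this point in the discussion, so no name is shown for it, and the value 10 is only an example.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmMaxAttemptsExample {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Raising the global retry budget; with work-preserving AM restarts this
    // also means more attempts are kept in RMAppImpl and the RMStateStore.
    conf.setInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 10);
    System.out.println(conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 2));
  }
}
{code}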
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527072#comment-14527072 ] Vinod Kumar Vavilapalli commented on YARN-3554: --- HADOOP-11398 and YARN-3238 are relevant in that they caused AM-NM communication to take a long time to time out. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527084#comment-14527084 ] Zhijie Shen commented on YARN-2267: --- Sunil, my 2 cents: if you can put together a detailed proposal doc to share with the community and use it for further discussion, it will be much easier to catch the community's eye and to understand your proposal. You may want to focus on stating your problem, why it's general, what the possible solutions are, what their pros and cons are, and so on. For example, Vinod may want to understand why we need to make monitoring an aux service instead of a built-in function of the RM. Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any Auxiliary services. For health/monitoring in RM, its better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bartosz Ługowski updated YARN-1621: --- Attachment: (was: YARN-1621.6.patch) Add CLI to list rows of task attempt ID, container ID, host of container, state of container -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch, YARN-1621.5.patch, YARN-1621.6.patch As more applications are moved to YARN, we need generic CLI to list rows of task attempt ID, container ID, host of container, state of container. Today if YARN application running in a container does hang, there is no way to find out more info because a user does not know where each attempt is running in. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers. {code:title=proposed yarn cli} $ yarn application -list-containers -applicationId appId [-containerState state of container] where containerState is optional filter to list container in given state only. container state can be running/succeeded/killed/failed/all. A user can specify more than one container state at once e.g. KILLED,FAILED. task attempt ID container ID host of container state of container {code} CLI should work with running application/completed application. If a container runs many task attempts, all attempts should be shown. That will likely be the case of Tez container-reuse application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bartosz Ługowski updated YARN-1621: --- Attachment: YARN-1621.6.patch Add CLI to list rows of task attempt ID, container ID, host of container, state of container -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch, YARN-1621.5.patch, YARN-1621.6.patch As more applications are moved to YARN, we need generic CLI to list rows of task attempt ID, container ID, host of container, state of container. Today if YARN application running in a container does hang, there is no way to find out more info because a user does not know where each attempt is running in. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers. {code:title=proposed yarn cli} $ yarn application -list-containers -applicationId appId [-containerState state of container] where containerState is optional filter to list container in given state only. container state can be running/succeeded/killed/failed/all. A user can specify more than one container state at once e.g. KILLED,FAILED. task attempt ID container ID host of container state of container {code} CLI should work with running application/completed application. If a container runs many task attempts, all attempts should be shown. That will likely be the case of Tez container-reuse application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527087#comment-14527087 ] Vinod Kumar Vavilapalli commented on YARN-3554: --- bq. Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals. For our users, we explicitly set yarn.client.nodemanager-connect.max-wait-ms to 60,000 (one minute). As HADOOP-11398 is still not in, this ends up becoming a 6-minute timeout (assuming each of the underlying RPC retries takes 1 sec * 50 times to finish (50 secs), plus a 10-second retry interval, causing 1 min per retry and 6 retries overall). Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
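The arithmetic in the parenthetical above, written out. All of the numbers are the assumptions stated in the comment (1 s per underlying RPC attempt, 50 RPC-level retries, a 10 s retry interval and a 60,000 ms configured max wait), not values read from the code.
{code}
long ipcAttemptMs     = 1000;              // ~1 s per underlying RPC attempt
long ipcRetries       = 50;                // RPC-level retries per connect attempt (~50 s)
long retryIntervalMs  = 10 * 1000;         // yarn.client.nodemanager-connect.retry-interval-ms
long perRetryMs       = ipcRetries * ipcAttemptMs + retryIntervalMs; // ~60 s per retry
long configuredWaitMs = 60 * 1000;         // yarn.client.nodemanager-connect.max-wait-ms
long retries          = configuredWaitMs / retryIntervalMs;          // 6 retries
long effectiveWaitMs  = retries * perRetryMs;                        // ~360 s, i.e. ~6 minutes
{code}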
[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3518: -- Assignee: sandflee default rm/am expire interval should not less than default resourcemanager connect wait time Key: YARN-3518 URL: https://issues.apache.org/jira/browse/YARN-3518 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Reporter: sandflee Assignee: sandflee Labels: configuration, newbie Attachments: YARN-3518.001.patch take am for example, if am can't connect to RM, after am expire (600s), RM relaunch am, and there will be two am at the same time until resourcemanager connect max wait time(900s) passed. DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000; DEFAULT_RM_AM_EXPIRY_INTERVAL_MS = 600000; DEFAULT_RM_NM_EXPIRY_INTERVAL_MS = 600000; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527100#comment-14527100 ] Vinod Kumar Vavilapalli commented on YARN-3518: --- We need to be careful here. Clients from gateway machines should be treated separately from AMs - a distinction we don't have today. It actually makes sense for clients to retry for a longer time than is usual for AMs. default rm/am expire interval should not less than default resourcemanager connect wait time Key: YARN-3518 URL: https://issues.apache.org/jira/browse/YARN-3518 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Reporter: sandflee Assignee: sandflee Labels: configuration, newbie Attachments: YARN-3518.001.patch take am for example, if am can't connect to RM, after am expire (600s), RM relaunch am, and there will be two am at the same time until resourcemanager connect max wait time(900s) passed. DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000; DEFAULT_RM_AM_EXPIRY_INTERVAL_MS = 600000; DEFAULT_RM_NM_EXPIRY_INTERVAL_MS = 600000; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
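For reference, the gap the description points at, using the default values listed above (a back-of-the-envelope sketch, not code from YarnConfiguration):
{code}
long connectMaxWaitMs = 15 * 60 * 1000; // DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS (900 s)
long amExpiryMs       = 600000;         // DEFAULT_RM_AM_EXPIRY_INTERVAL_MS (600 s)
long nmExpiryMs       = 600000;         // DEFAULT_RM_NM_EXPIRY_INTERVAL_MS (600 s)
// Window in which the old and the relaunched AM can both be alive:
long doubleAmWindowMs = connectMaxWaitMs - amExpiryMs; // ~300 s with the defaults
{code}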
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527614#comment-14527614 ] Naganarasimha G R commented on YARN-3562: - Thanks [~sjlee0], yes, lately I have been seeing some strange Jenkins output, and thanks for testing locally. But there might be some other unrelated test case failures since we are modifying the MiniYARNCluster, so I am not sure how to proceed in that case. Also, how do you kick off Jenkins? Delete and re-upload the patch? unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3562-YARN-2928.001.patch *Issues reported from MAPREDUCE-6337* : A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044* : Test case failures and tools(FB CS) issues found : # find bugs issue : Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # find bugs issue : Boxing/unboxing to parse a primitive RMTimelineCollectorManager.postPut. Called method Long.longValue() Should call Long.parseLong(String) instead. # find bugs issue : DM_DEFAULT_ENCODING Called method new java.io.FileWriter(String, boolean) At FileSystemTimelineWriterImpl.java:\[line 86\] # hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus, refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
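For reference, the three findbugs findings listed in the description map to fairly standard fixes; a hedged sketch follows, where the class and variable names (collectorAddr, previousAddr, timestampStr, path) are placeholders, not the branch code.
{code}
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// 1. String comparison: use equals(), not == / !=.
boolean same = collectorAddr != null && collectorAddr.equals(previousAddr);

// 2. Parse the primitive directly instead of boxing via Long.valueOf(...).longValue().
long timestamp = Long.parseLong(timestampStr);

// 3. DM_DEFAULT_ENCODING: give the writer an explicit charset instead of
//    relying on the platform default that new FileWriter(path, true) uses.
Writer out = new BufferedWriter(new OutputStreamWriter(
    new FileOutputStream(path, true), StandardCharsets.UTF_8));
{code}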
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527612#comment-14527612 ] Jian He commented on YARN-3018: --- Hi [~nijel], the code below in CapacitySchedulerConfiguration actually ends up using 0 instead. How about changing the default to 0 and simplifying the code below to {{return getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY);}}?
{code}
public int getNodeLocalityDelay() {
  int delay = getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY);
  return (delay == DEFAULT_NODE_LOCALITY_DELAY) ? 0 : delay;
}
{code}
Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch, YARN-3018-3.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
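In other words, once the code default matches the effective behaviour, the getter collapses to a single lookup; a sketch of the shape being suggested above:
{code}
// CapacitySchedulerConfiguration, with the default changed as suggested above.
public static final int DEFAULT_NODE_LOCALITY_DELAY = 0;

public int getNodeLocalityDelay() {
  return getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY);
}
{code}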
[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527758#comment-14527758 ] Hudson commented on YARN-2725: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7729 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7729/]) Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: 2.8.0 Attachments: YARN-2725.1.patch, YARN-2725.1.patch YARN-2721 found a race condition for ZK-specific retry semantics. We should add tests about the case of retry requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3375) NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner
[ https://issues.apache.org/jira/browse/YARN-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527756#comment-14527756 ] Hudson commented on YARN-3375: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7729 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7729/]) NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner -- Key: YARN-3375 URL: https://issues.apache.org/jira/browse/YARN-3375 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Devaraj K Assignee: Devaraj K Priority: Minor Fix For: 2.8.0 Attachments: YARN-3375.patch 1. NodeHealthScriptRunner.shouldRun() check is happening 3 times for starting the NodeHealthScriptRunner.
{code:title=NodeManager.java|borderStyle=solid}
if(!NodeHealthScriptRunner.shouldRun(nodeHealthScript)) {
  LOG.info("Abey khali");
  return null;
}
{code}
{code:title=NodeHealthCheckerService.java|borderStyle=solid}
if (NodeHealthScriptRunner.shouldRun(
    conf.get(YarnConfiguration.NM_HEALTH_CHECK_SCRIPT_PATH))) {
  addService(nodeHealthScriptRunner);
}
{code}
{code:title=NodeHealthScriptRunner.java|borderStyle=solid}
if (!shouldRun(nodeHealthScript)) {
  LOG.info("Not starting node health monitor");
  return;
}
{code}
2. If we don't configure a node health script, or the configured health script doesn't have execute permission, the NM logs the below message.
{code:xml}
2015-03-19 19:55:45,713 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Abey khali
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
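One possible shape of the cleanup, as a sketch under the assumption that the check is kept only where the runner is wired up and the log message is made meaningful; this is not the attached patch.
{code}
// NodeHealthCheckerService.java: decide once whether the health script runner
// should be added, and say why when it is skipped.
String nodeHealthScript =
    conf.get(YarnConfiguration.NM_HEALTH_CHECK_SCRIPT_PATH);
if (NodeHealthScriptRunner.shouldRun(nodeHealthScript)) {
  addService(nodeHealthScriptRunner);
} else {
  LOG.info("Not starting node health script runner: script '"
      + nodeHealthScript + "' is not configured or is not executable.");
}
{code}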
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527802#comment-14527802 ] zhihai xu commented on YARN-3491: - thanks [~wilfreds] for the review. I uploaded a new patch YARN-3491.004.patch, which addressed all your comments. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3573: --- Attachment: YARN-3573.patch MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Brahma Reddy Battula Attachments: YARN-3573.patch {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527639#comment-14527639 ] Wangda Tan commented on YARN-3514: -- [~cnauroth], I think this causes other problems in latest YARN as well, for example: If a user with name with mixed cases for example De, if we have a rule /L in kerberos side to make all names to lower case, when NM doing log aggregation, it will fail because user name doesn't match (in UserGroupInformation is de, but in OS). {code} java.io.IOException: Owner 'De' for path /hadoop/yarn2/log/application_1428608050835_0013/container_1428608050835_0013_01_06/stder r did not match expected owner 'de' at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285) at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenForRead(SecureIOUtils.java:219) at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:204) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue.secureOpenFile(AggregatedLogFormat.java:275) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue.write(AggregatedLogFormat.java:227) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl$ContainerLogAggregator.doContainer LogAggregation(AppLogAggregatorImpl.java:534) at ... {code} One possible solution is ignoring cases while compare user name, but that will be problematic when user De/de existed at the same time. Any thoughts? [~cnauroth]. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and set configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. 
The error that comes out of the RM logs: 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output: main : command provided 0 main : user is DOMAIN\hadoopuser main : requested yarn user is domain\hadoopuser org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) .Failing this attempt.. Failing the application.' However, when we look on the node launching the AM, we see this: [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache [root@rpb-cdh-kerb-2 usercache]# ls -l drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser
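For the "ignore case" option mentioned in the comment above, the owner check would look roughly like the simplified sketch below. The real SecureIOUtils.checkStat signature differs; this only illustrates the trade-off, and as noted it breaks down if De and de can exist as distinct local users.
{code}
import java.io.File;
import java.io.IOException;

// Hypothetical, simplified variant of the owner check in the secure-open path.
private static void checkOwnerIgnoreCase(File f, String owner, String expectedOwner)
    throws IOException {
  if (expectedOwner != null && !expectedOwner.equalsIgnoreCase(owner)) {
    throw new IOException("Owner '" + owner + "' for path " + f
        + " did not match expected owner '" + expectedOwner + "'");
  }
}
{code}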
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527640#comment-14527640 ] Jian He commented on YARN-3343: --- [~rohithsharma], is this still reproducible? It seems not on my side. TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3343.patch Error Message test timed out after 30000 milliseconds Stacktrace java.lang.Exception: test timed out after 30000 milliseconds at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getAllByName0(InetAddress.java:1246) at java.net.InetAddress.getAllByName(InetAddress.java:1162) at java.net.InetAddress.getAllByName(InetAddress.java:1098) at java.net.InetAddress.getByName(InetAddress.java:1048) at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563) at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527697#comment-14527697 ] Wangda Tan commented on YARN-3521: -- [~sunilg], Makes sense to me. bq. IMO I also feel that NodeLabelManager apis can use Object rather than Strings. Admin interface can take this conversion logic. Sorry, I didn't get this; currently addToCluserNodeLabels already takes objects instead of Strings, and you're using it in your patch. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, yarn cluster CLI returns NodeLabel instead of String, we should make the same change in REST API side to make them consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3574: -- Description: We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} {code} timeline daemon prio=10 tid=0x7f9b34d55000 nid=0x1d93 runnable [0x7f9b0cbbf000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) - locked 0xc0f522c8 (a java.io.BufferedInputStream) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.hadoop.metrics2.sink.timeline.AbstractTimelineMetricsSink.emitMetrics(AbstractTimelineMetricsSink.java:66) at org.apache.hadoop.metrics2.sink.timeline.HadoopTimelineMetricsSink.putMetrics(HadoopTimelineMetricsSink.java:203) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.consume(MetricsSinkAdapter.java:175) at
[jira] [Commented] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527698#comment-14527698 ] Jian He commented on YARN-3574: --- [~brahmareddy], I'm also not able to repro.. I wondered if any other folks have seen this issue before. we found this while doing ambari integration testing. I added one more stack trace for the blocking thread in the description. RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Brahma Reddy Battula We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} {code} timeline daemon prio=10 tid=0x7f9b34d55000 nid=0x1d93 runnable [0x7f9b0cbbf000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at 
java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) - locked 0xc0f522c8 (a java.io.BufferedInputStream) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at
[jira] [Commented] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527820#comment-14527820 ] Hadoop QA commented on YARN-1612: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 33s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 50s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 52m 36s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 52s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730302/YARN-1612-004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 551615f | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7695/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7695/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7695/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7695/console | This message was automatically generated. FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-004.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527835#comment-14527835 ] Hadoop QA commented on YARN-3134: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 51s | Pre-patch YARN-2928 compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 34s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 26m 0s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730332/YARN-3134-YARN-2928.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 557a395 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7698/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7698/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7698/console | This message was automatically generated. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527847#comment-14527847 ] Jun Gong commented on YARN-3480: [~jianhe], sorry for not specifying our scenario: RM HA is enabled, use ZK to store apps' info, most apps running in the cluster are long running(service) apps, yarn.resourcemanager.am.max-attempts is set to 1 because we have not patched YARN-611 and we want apps to retry more times. There are 10K apps with 1~1 attempts stored in ZK. It will take about 6 mins to recover those apps when RM HA. {quote} 1. How often do you see an app failed with a large number of attempts? If it's limited to a few apps. I wouldn't worry so much. 2. How slower it is in reality in your case? we've done some benchmark, recovering 10k apps(with 1 attempt) on ZK is pretty fast, within 20 seconds or so. {quote} Please see above. I think it will be OK for map-reduce jobs. But it might not be OK for service apps which have been running several months. {quote} 3. Limiting the attempts to be recorded means we are losing history. it's a trade off. {quote} Yes, I agree. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2921) MockRM#waitForState methods can be too slow and flaky
[ https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527855#comment-14527855 ] Tsuyoshi Ozawa commented on YARN-2921: -- [~leftnoteasy] thank you for pinging me. Yes, it looks related. Let me survey MockRM#waitForState methods can be too slow and flaky - Key: YARN-2921 URL: https://issues.apache.org/jira/browse/YARN-2921 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi Ozawa Attachments: YARN-2921.001.patch, YARN-2921.002.patch, YARN-2921.003.patch, YARN-2921.004.patch MockRM#waitForState methods currently sleep for too long (2 seconds and 1 second). This leads to slow tests and sometimes failures if the App/AppAttempt moves to another state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527616#comment-14527616 ] Brahma Reddy Battula commented on YARN-3574: [~jianhe] I would like to work on this.. I am not able to reproduce this .. can you please give scenario ..? RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} looks like the {{sinkThread.interrupt();}} in MetricsSinkAdapter#stop doesn't really interrupt the thread, which cause it to hang at join. This appears only once. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-3574: -- Assignee: Brahma Reddy Battula RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Brahma Reddy Battula We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} looks like the {{sinkThread.interrupt();}} in MetricsSinkAdapter#stop doesn't really interrupt the thread, which cause it to hang at join. This appears only once. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
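The hang described above is consistent with standard Java thread semantics: Thread.interrupt() only wakes a thread blocked in an interruptible operation (wait/sleep/join or interruptible NIO), while a plain blocking socket read, like the one in the timeline sink's stack trace, keeps blocking, so an unbounded Thread.join() in the stop path never returns. The following self-contained sketch (not taken from the Hadoop sources; class and thread names are invented) reproduces that behaviour:
{code}
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class InterruptJoinHangDemo {
  public static void main(String[] args) throws Exception {
    // A local server that accepts one connection but never writes anything,
    // so the reader thread below blocks forever in SocketInputStream.read().
    ServerSocket server = new ServerSocket(0);
    Socket client = new Socket("localhost", server.getLocalPort());
    Socket accepted = server.accept();

    Thread sinkThread = new Thread(() -> {
      try (InputStream in = client.getInputStream()) {
        in.read(); // blocking socket I/O: NOT responsive to Thread.interrupt()
      } catch (Exception ignored) {
        // not reached in this demo
      }
    }, "demo-sink-thread");
    sinkThread.start();

    Thread.sleep(200);        // let the reader block
    sinkThread.interrupt();   // analogous to sinkThread.interrupt() in stop()
    sinkThread.join(2000);    // bounded join, only so the demo terminates

    // Prints "true": the interrupt did not unblock the socket read, so an
    // unbounded join() here would hang exactly like the RM thread dump above.
    System.out.println("still alive after interrupt+join: " + sinkThread.isAlive());

    accepted.close();         // closing the peer socket is what actually unblocks read()
    client.close();
    server.close();
  }
}
{code}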
[jira] [Commented] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked deprecated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527619#comment-14527619 ] Brahma Reddy Battula commented on YARN-3573: [~mitdesai] Thanks for reporting. Attached the patch; kindly review. MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked deprecated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Brahma Reddy Battula Attachments: YARN-3573.patch {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timeline server started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
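For reference, the usual shape of such a change (shown here on a hypothetical stand-in class, not the actual patch) is to keep the boolean constructor for source compatibility, mark it @Deprecated with a pointer to the configuration-driven alternative, and let the preferred constructor take over:
{code}
/** Hypothetical stand-in for MiniMRYarnCluster, illustrating the pattern only. */
public class MiniClusterExample {
  private final boolean enableAHS;

  /** Preferred: whether to start the timeline server comes from configuration. */
  public MiniClusterExample(String testName, int noOfNMs) {
    // In the real cluster this flag would be read from the YARN configuration;
    // it is hard-coded here only to keep the sketch self-contained.
    this(testName, noOfNMs, false);
  }

  /**
   * @deprecated the timeline server should be enabled via the config value,
   * not via this boolean; kept only so existing callers keep compiling.
   */
  @Deprecated
  public MiniClusterExample(String testName, int noOfNMs, boolean enableAHS) {
    this.enableAHS = enableAHS;
  }
}
{code}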
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: YARN-3134-YARN-2928.003.patch Updated my patch according to the latest comments. I've rebased the patch to the latest YARN-2928 branch, with YARN-3551 in. In this version we're no longer swallowing exceptions. I have not made the change on the Phoenix connection string since, according to our previous discussion, we're planning to address this after we've decided which implementation to pursue in the future. A special note to [~zjshen]: I'm not sure my current way to access the singleData section of a TimelineMetric is correct (since the field no longer exists). It would be great if you can take a look at it. Thanks! [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527649#comment-14527649 ] Naganarasimha G R commented on YARN-3523: - In that case, better to remove @Stable and not add @Unstable... thoughts? Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make the class audience and method audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
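A minimal illustration of the consistency being discussed (using a hypothetical interface, not the real ResourceManagerAdministrationProtocol): declare the audience once at the type level and leave the methods unannotated, so a method-level @Public or @Stable cannot contradict the class-level @Private:
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;

// Illustrative only: the audience annotation lives on the type, and the
// methods inherit that contract instead of carrying their own, possibly
// conflicting, audience/stability annotations.
@Private
public interface AdminProtocolExample {
  void refreshQueues() throws Exception;
  void refreshNodes() throws Exception;
}
{code}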
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527699#comment-14527699 ] Dian Fu commented on YARN-3557: --- Hi [~leftnoteasy], Thanks a lot for your comments. What about the support of both distributed configuration and centralized configuration? Any thoughts about the solution I mentioned in the above comment? Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that be utilized for TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527702#comment-14527702 ] Vinod Kumar Vavilapalli commented on YARN-3514: --- I also doubt if this (the fix by the patch) is the only place where domain\login type of user-names will fail in YARN. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and set configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. The error that comes out of the RM logs: 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output: main : command provided 0 main : user is DOMAIN\hadoopuser main : requested yarn user is domain\hadoopuser org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) .Failing this attempt.. Failing the application.' However, when we look on the node launching the AM, we see this: [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache [root@rpb-cdh-kerb-2 usercache]# ls -l drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser but something else later attempts to use it as domain%5Chadoopuser. I’m not sure where or why the URL escapement converts the \ to %5C or why this is not consistent. 
I should also mention, for the sake of completeness, our auth_to_local rule is set up to map u...@domain.com to domain\user: RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527705#comment-14527705 ] Vinod Kumar Vavilapalli commented on YARN-3523: --- Makes sense. Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527752#comment-14527752 ] Hadoop QA commented on YARN-3562: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 53s | Pre-patch YARN-2928 compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 40s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 25s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 53m 18s | Tests failed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 2m 34s | Tests passed in hadoop-yarn-server-tests. | | {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 94m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestClientRMService | | | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730037/YARN-3562-YARN-2928.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 557a395 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7694/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/7694/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7694/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7694/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7694/console | This message was automatically generated. unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3562-YARN-2928.001.patch *Issues reported from MAPREDUCE-6337* : A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. 
For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044* : Test case failures and tools(FB CS) issues found : # find bugs issue : Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # find bugs issue : Boxing/unboxing to parse a primitive RMTimelineCollectorManager.postPut. Called method Long.longValue() Should call Long.parseLong(String) instead. # find bugs issue : DM_DEFAULT_ENCODING Called method new java.io.FileWriter(String, boolean) At FileSystemTimelineWriterImpl.java:\[line 86\] # hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus, refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.004.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527632#comment-14527632 ] Li Lu commented on YARN-3134: - And, one more thing: I'm closing all PreparedStatements implicitly in the try-with-resources statements. This statement will not swallow any exceptions (since there's no catch after it) but will guarantee the resource is released after the block's execution, even if exceptions are thrown. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134DataSchema.pdf Quote the introduction on the Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify our implementation of reading/writing data from/to HBase, and makes it easy to build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
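For readers unfamiliar with the construct, this is roughly what that pattern looks like; the JDBC URL, table and column names below are placeholders, not the schema from the attached patch:
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TryWithResourcesSketch {
  // Placeholder Phoenix connection string.
  private static final String JDBC_URL = "jdbc:phoenix:localhost";

  public static void writeRow(String id, long createdTime) throws SQLException {
    String sql = "UPSERT INTO entity (id, created_time) VALUES (?, ?)";
    try (Connection conn = DriverManager.getConnection(JDBC_URL);
         PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setString(1, id);
      ps.setLong(2, createdTime);
      ps.executeUpdate();
      conn.commit();
    }
    // No catch block: any SQLException still propagates to the caller, but
    // both the PreparedStatement and the Connection are closed first.
  }
}
{code}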
[jira] [Commented] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527706#comment-14527706 ] sandflee commented on YARN-3518: Agree, we should set the NM, AM and client values separately. default rm/am expire interval should not less than default resourcemanager connect wait time Key: YARN-3518 URL: https://issues.apache.org/jira/browse/YARN-3518 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Reporter: sandflee Assignee: sandflee Labels: configuration, newbie Attachments: YARN-3518.001.patch Take the AM for example: if the AM can't connect to the RM, then after the AM expiry interval (600s) the RM relaunches the AM, and there will be two AMs at the same time until the resourcemanager connect max wait time (900s) has passed. DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000; DEFAULT_RM_AM_EXPIRY_INTERVAL_MS = 600000; DEFAULT_RM_NM_EXPIRY_INTERVAL_MS = 600000; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527732#comment-14527732 ] Hadoop QA commented on YARN-3069: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 46s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 4m 45s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 7m 10s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | common tests | 23m 32s | Tests passed in hadoop-common. | | {color:green}+1{color} | mapreduce tests | 9m 42s | Tests passed in hadoop-mapreduce-client-app. | | {color:green}+1{color} | yarn tests | 1m 59s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | hdfs tests | 164m 48s | Tests failed in hadoop-hdfs. | | | | 246m 47s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.TestFileCreation | | | hadoop.hdfs.TestHDFSFileSystemContract | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730267/YARN-3069.006.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / bf70c5a | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-mapreduce-client-app test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7691/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7691/console | This message was automatically generated. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch The following properties are currently not defined in yarn-default.xml. 
These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
[jira] [Commented] (YARN-3547) FairScheduler: Apps that have no resource demand should not participate scheduling
[ https://issues.apache.org/jira/browse/YARN-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527827#comment-14527827 ] Hadoop QA commented on YARN-3547: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 49s | The applied patch generated 1 new checkstyle issues (total was 9, now 10). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 15s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 52m 58s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 89m 29s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730098/YARN-3547.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 551615f | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7696/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7696/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7696/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7696/console | This message was automatically generated. FairScheduler: Apps that have no resource demand should not participate scheduling -- Key: YARN-3547 URL: https://issues.apache.org/jira/browse/YARN-3547 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Xianyin Xin Assignee: Xianyin Xin Attachments: YARN-3547.001.patch, YARN-3547.002.patch, YARN-3547.003.patch At present, all of the 'running' apps participate the scheduling process, however, most of them may have no resource demand on a production cluster, as the app's status is running other than waiting for resource at the most of the app's lifetime. It's not a wise way we sort all the 'running' apps and try to fulfill them, especially on a large-scale cluster which has heavy scheduling load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
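Conceptually the improvement amounts to filtering before sorting: only apps with outstanding demand are sorted and offered resources. The sketch below uses made-up types and is meant only to show the idea, not the attached patch or the FairScheduler API:
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Made-up app type for the sketch; the scheduler really works on Schedulables. */
class DemoApp {
  final String name;
  final long demandMb;   // outstanding resource demand; 0 means nothing to ask for
  DemoApp(String name, long demandMb) { this.name = name; this.demandMb = demandMb; }
}

public class DemandFilterSketch {
  /** Only apps that still want resources are sorted and considered for the node. */
  static List<DemoApp> appsToSchedule(List<DemoApp> runningApps) {
    List<DemoApp> candidates = new ArrayList<>();
    for (DemoApp app : runningApps) {
      if (app.demandMb > 0) {     // zero-demand apps skip the sort entirely
        candidates.add(app);
      }
    }
    candidates.sort(Comparator.comparingLong(a -> a.demandMb));
    return candidates;
  }
}
{code}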
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527898#comment-14527898 ] Hadoop QA commented on YARN-3491: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 43s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 36s | The applied patch generated 3 new checkstyle issues (total was 177, now 178). | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 2s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 5m 57s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 42m 5s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730351/YARN-3491.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 338e88a | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7700/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7700/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7700/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7700/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7700/console | This message was automatically generated. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. 
Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
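One way to think about the cost (a conceptual sketch with invented names, not the attached patch): if the per-directory check is memoized, each public resource no longer pays roughly (number of local dirs) x 10+ ms, and the dispatcher thread is not held up:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Conceptual only: cache the expensive per-directory validation result. */
public class LocalDirCheckCache {
  private final Map<String, Boolean> checked = new ConcurrentHashMap<>();

  public boolean isDirUsable(String localDir) {
    // computeIfAbsent runs the slow check at most once per directory;
    // later localizations hit the cache instead of paying 10+ ms again.
    return checked.computeIfAbsent(localDir, this::slowCheckLocalDir);
  }

  private boolean slowCheckLocalDir(String dir) {
    // stands in for DiskChecker-style existence/permission/ownership checks
    return true;
  }
}
{code}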
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527904#comment-14527904 ] Sunil G commented on YARN-3521: --- [~leftnoteasy] Yes, it's not a valid point. replaceLabelsOnNode and removeFromClusterNodeLabels don't need the NodeLabel object; the name is enough. Please discard my earlier comment. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we should make the same change on the REST API side to keep them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked deprecated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527906#comment-14527906 ] Hadoop QA commented on YARN-3573: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 5m 9s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 28s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 19s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 32s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 40s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | mapreduce tests | 106m 29s | Tests passed in hadoop-mapreduce-client-jobclient. | | | | 122m 45s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730327/YARN-3573.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / 551615f | | hadoop-mapreduce-client-jobclient test log | https://builds.apache.org/job/PreCommit-YARN-Build/7699/artifact/patchprocess/testrun_hadoop-mapreduce-client-jobclient.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7699/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7699/console | This message was automatically generated. MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Brahma Reddy Battula Attachments: YARN-3573.patch {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527907#comment-14527907 ] Chris Nauroth commented on YARN-3514: - Looking at the original description, I see upper-case DOMAIN is getting translated to lower-case domain in this environment. It's likely that this environment would get an ownership mismatch error even after getting past the current bug. {code} drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser {code} Nice catch, Wangda. Is it necessary to translate to lower-case, or can the domain portion of the name be left in upper-case to match the OS level? bq. One possible solution is ignoring cases while compare user name, but that will be problematic when user De/de existed at the same time. I've seen a few mentions online that Active Directory is not case-sensitive but is case-preserving. That means it will preserve the case you used in usernames, but the case doesn't matter for comparisons. I've also seen references that DNS has similar behavior with regards to case. I can't find a definitive statement though that this is guaranteed behavior. I'd feel safer making this kind of change if we had a definitive reference. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and set configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. 
The error that comes out of the RM logs: 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output: main : command provided 0 main : user is DOMAIN\hadoopuser main : requested yarn user is domain\hadoopuser org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) .Failing this attempt.. Failing the application.' However, when we look on the node launching the AM, we see this: [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache [root@rpb-cdh-kerb-2 usercache]# ls -l drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser but something else later attempts to use it as domain%5Chadoopuser. I’m not sure where or why the URL escapement converts the \ to %5C or why this is not consistent. I should also mention, for the sake of completeness, our auth_to_local rule is set up to map u...@domain.com to domain\user: RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian JIRA (v6.3.4#6332)
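As a side note on where a path like domain%5Chadoopuser can come from: percent-encoding turns a backslash into %5C. The tiny demo below (standard JDK only; the actual YARN code path may use a different encoder) shows the transformation:
{code}
import java.net.URLEncoder;

public class BackslashEncodingDemo {
  public static void main(String[] args) throws Exception {
    String user = "domain\\hadoopuser";
    // Prints "domain%5Chadoopuser": the backslash is percent-encoded as %5C,
    // matching the directory name the localizer later fails to create/find.
    System.out.println(URLEncoder.encode(user, "UTF-8"));
  }
}
{code}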
[jira] [Commented] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527917#comment-14527917 ] Mohammad Shahid Khan commented on YARN-3560: The issue is happening due to incorrect hyperlink URL formation. The system always forms the URL with the default port, even when yarn.resourcemanager.webapp.address is configured with a different port number. Not able to navigate to the cluster from tracking url (proxy) generated after submission of job --- Key: YARN-3560 URL: https://issues.apache.org/jira/browse/YARN-3560 Project: Hadoop YARN Issue Type: Bug Reporter: Anushri Priority: Minor A standalone web proxy server is enabled in the cluster. When a job is submitted, the URL generated contains the proxy. Tracking this URL in the web page, if we try to navigate to the cluster links [about, applications, or scheduler], it gets redirected to some default port instead of the actual configured RM web port, so it throws "webpage not available". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
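As a rough illustration of the direction described in the comment (a sketch only, not the patch for this issue; buildClusterLink is a hypothetical helper): the cluster links should be derived from the configured yarn.resourcemanager.webapp.address rather than the compiled-in default port.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterLinkSketch {
  // Hypothetical helper: form the cluster link from the configured
  // yarn.resourcemanager.webapp.address instead of the default port.
  static String buildClusterLink(Configuration conf, String page) {
    String rmWebApp = conf.get(YarnConfiguration.RM_WEBAPP_ADDRESS,
        YarnConfiguration.DEFAULT_RM_WEBAPP_ADDRESS);
    return "http://" + rmWebApp + "/cluster/" + page;
  }

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.RM_WEBAPP_ADDRESS, "rmhost:25005"); // non-default port
    // Prints http://rmhost:25005/cluster/apps; code that ignores the configured
    // address and falls back to the default would point at the default port instead.
    System.out.println(buildClusterLink(conf, "apps"));
  }
}
{code}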
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527929#comment-14527929 ] Rohith commented on YARN-3343: -- [~jianhe] I was able to reproduce it. While debugging this issue, I found that the 30 sec timeout was too aggressive for the test to complete; on average, the test case took around 35-45 sec. Changed the timeout to 60 sec. TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3343.patch Error Message test timed out after 30000 milliseconds Stacktrace java.lang.Exception: test timed out after 30000 milliseconds at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getAllByName0(InetAddress.java:1246) at java.net.InetAddress.getAllByName(InetAddress.java:1162) at java.net.InetAddress.getAllByName(InetAddress.java:1098) at java.net.InetAddress.getByName(InetAddress.java:1048) at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563) at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
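The change described above amounts to raising the JUnit timeout on the test. A minimal sketch of that kind of change, assuming the standard JUnit 4 annotation used in the YARN tests (this is not the attached patch, and the class name here is made up):
{code}
import org.junit.Test;

public class TestTimeoutSketch {
  // Before: @Test(timeout = 30000) was too aggressive for a test that takes 35-45 sec.
  // After: allow 60 sec, as described in the comment above.
  @Test(timeout = 60000)
  public void testNodeUpdate() throws Exception {
    // test body elided; only the timeout value matters for this sketch
  }
}
{code}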
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: (was: YARN-3491.004.patch) PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch Based on profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms, so the total delay is approximately (number of local dirs) * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.004.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms, so the total delay is approximately (number of local dirs) * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
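To make the cost concrete, here is a self-contained sketch (all names hypothetical, not NodeManager code) of why the per-directory checks dominate: each addResource pays roughly (number of local dirs) * 10+ ms on the calling thread. One possible direction, moving the slow checks onto a worker pool so the dispatcher thread only enqueues work, is shown for contrast; it is not the attached patch.
{code}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AddResourceCostSketch {
  // Hypothetical stand-in for checkLocalDir: roughly 10+ ms per local dir.
  static void checkLocalDir(String dir) throws InterruptedException {
    Thread.sleep(10);
  }

  public static void main(String[] args) throws Exception {
    List<String> localDirs = Arrays.asList(
        "/data1/yarn/nm", "/data2/yarn/nm", "/data3/yarn/nm", "/data4/yarn/nm");

    // Serial cost as described above: every public resource pays
    // (number of local dirs) * 10+ ms on the calling (Dispatcher) thread.
    long start = System.nanoTime();
    for (String dir : localDirs) {
      checkLocalDir(dir);
    }
    System.out.printf("per-resource setup cost: ~%d ms%n",
        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));

    // One possible direction (sketch only): push the slow per-dir checks onto a
    // worker pool so the dispatcher thread is not blocked.
    ExecutorService pool = Executors.newFixedThreadPool(localDirs.size());
    for (String dir : localDirs) {
      pool.submit(() -> {
        checkLocalDir(dir);
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
  }
}
{code}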
[jira] [Updated] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3552: - Attachment: (was: 0001-YARN-3552.patch) RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers. {code} public static final ApplicationResourceUsageReport DUMMY_APPLICATION_RESOURCE_USAGE_REPORT = BuilderUtils.newApplicationResourceUsageReport(-1, -1, Resources.createResource(-1, -1), Resources.createResource(-1, -1), Resources.createResource(-1, -1), 0, 0); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
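For illustration only (not the attached patch; the helper name and class are hypothetical): since the dummy report uses -1 as an "unknown" marker, a display-side guard would avoid rendering the raw negative value for completed apps.
{code}
public class UsageReportDisplaySketch {
  // Hypothetical display helper: the dummy usage report uses -1 to mean "unknown",
  // so the web UI should not render the raw value for completed apps.
  static String displayContainers(int numUsedContainers) {
    return numUsedContainers < 0 ? "N/A" : String.valueOf(numUsedContainers);
  }

  public static void main(String[] args) {
    System.out.println(displayContainers(-1)); // N/A instead of -1
    System.out.println(displayContainers(3));  // 3
  }
}
{code}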