[jira] [Updated] (YARN-3763) Support fuzzy search in ATS
[ https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated YARN-3763: - Description: Currently ATS only supports exact match. Sometimes fuzzy match may be helpful when the entities in the ATS have a common prefix or suffix. Link with TEZ-2531 (was: Currently ATS only support exact match. Sometimes fuzzy match may be helpful when the entities in the ATS has some common prefix or suffix. ) > Support fuzzy search in ATS > --- > > Key: YARN-3763 > URL: https://issues.apache.org/jira/browse/YARN-3763 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Affects Versions: 2.7.0 >Reporter: Jeff Zhang > > Currently ATS only supports exact match. Sometimes fuzzy match may be helpful > when the entities in the ATS have a common prefix or suffix. Link with > TEZ-2531 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3763) Support for fuzzy search in ATS
Jeff Zhang created YARN-3763: Summary: Support for fuzzy search in ATS Key: YARN-3763 URL: https://issues.apache.org/jira/browse/YARN-3763 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.7.0 Reporter: Jeff Zhang Currently ATS only supports exact match. Sometimes fuzzy match may be helpful when the entities in the ATS have a common prefix or suffix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3763) Support fuzzy search in ATS
[ https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated YARN-3763: - Summary: Support fuzzy search in ATS (was: Support for fuzzy search in ATS) > Support fuzzy search in ATS > --- > > Key: YARN-3763 > URL: https://issues.apache.org/jira/browse/YARN-3763 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Affects Versions: 2.7.0 >Reporter: Jeff Zhang > > Currently ATS only supports exact match. Sometimes fuzzy match may be helpful > when the entities in the ATS have a common prefix or suffix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570284#comment-14570284 ] Sidharta Seethana commented on YARN-2194: - [~mjacobs], yes, that is what I am proposing. If we handle the path separation correctly, we should be able to continue using the current (deprecated, but still workable) mechanism for using cgroups. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570276#comment-14570276 ] Jeff Zhang commented on YARN-3755: -- Closing this as Won't Fix. > Log the command of launching containers > --- > > Key: YARN-3755 > URL: https://issues.apache.org/jira/browse/YARN-3755 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: YARN-3755-1.patch, YARN-3755-2.patch > > > In the resource manager log, YARN logs the command for launching the AM, > which is very useful. But there is no such log in the NM log for launching > containers. This makes it difficult to diagnose when containers fail to launch > due to some issue in the commands. Although users can look at the commands in > the container launch script file, this is an internal detail of YARN that > users usually don't know about. From the user's perspective, they only know > what commands they specified when building the YARN application. > {code} > 2015-06-01 16:06:42,245 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command > to launch container container_1433145984561_0001_01_01 : > $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true > -Dhadoop.metrics.log.level=WARN -Xmx1024m > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA > -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster > 1>/stdout 2>/stderr > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570275#comment-14570275 ] Jeff Zhang commented on YARN-3755: -- bq. How about we let individual frameworks like MapReduce/Tez log them as needed? That seems like the right place for debugging too - app developers don't always get access to the daemon logs. Makes sense. > Log the command of launching containers > --- > > Key: YARN-3755 > URL: https://issues.apache.org/jira/browse/YARN-3755 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: YARN-3755-1.patch, YARN-3755-2.patch > > > In the resource manager log, YARN logs the command for launching the AM, > which is very useful. But there is no such log in the NM log for launching > containers. This makes it difficult to diagnose when containers fail to launch > due to some issue in the commands. Although users can look at the commands in > the container launch script file, this is an internal detail of YARN that > users usually don't know about. From the user's perspective, they only know > what commands they specified when building the YARN application. > {code} > 2015-06-01 16:06:42,245 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command > to launch container container_1433145984561_0001_01_01 : > $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true > -Dhadoop.metrics.log.level=WARN -Xmx1024m > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA > -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster > 1>/stdout 2>/stderr > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570202#comment-14570202 ] Chun Chen commented on YARN-3749: - Thanks for reviewing the patch, [~zxu]! > We should make a copy of configuration when init MiniYARNCluster with > multiple RMs > -- > > Key: YARN-3749 > URL: https://issues.apache.org/jira/browse/YARN-3749 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chun Chen >Assignee: Chun Chen > Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, > YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch > > > When I was trying to write a test case for YARN-2674, I found the DS client > trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 > during RM failover, even though I initially set > yarn.resourcemanager.address.rm1=0.0.0.0:18032 and > yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is > in ClientRMService where the value of yarn.resourcemanager.address.rm2 > is changed to 0.0.0.0:18032. See the following code in ClientRMService: > {code} > clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, >YarnConfiguration.RM_ADDRESS, > > YarnConfiguration.DEFAULT_RM_ADDRESS, >server.getListenerAddress()); > {code} > Since we use the same configuration instance for rm1 and rm2 and init both > RMs before we start them, yarn.resourcemanager.ha.id is changed to rm2 > during init of rm2 and remains rm2 while rm1 is starting. > So I think it is safer to make a copy of the configuration when initializing > each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
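The proposed fix (giving each RM its own copy of the configuration) can be illustrated with a self-contained sketch. Plain java.util.Map stands in for Hadoop's Configuration here, and all names are illustrative rather than actual YARN APIs; Hadoop's Configuration class does provide a copy constructor serving the same purpose.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained analogue of the bug: both "RMs" are initialized with the
// SAME configuration object, so rm2's init overwrites state that rm1 still
// reads later. Initializing each RM with its own copy avoids the clobbering.
public class SharedConfDemo {
    // Mimics an init step that mutates the configuration it is given.
    static void initRm(Map<String, String> conf, String rmId) {
        conf.put("yarn.resourcemanager.ha.id", rmId);
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        initRm(shared, "rm1");
        initRm(shared, "rm2");
        // The shared conf now claims rm2, even when rm1 starts afterwards.
        assert "rm2".equals(shared.get("yarn.resourcemanager.ha.id"));

        // The fix: each RM is initialized against its own copy.
        Map<String, String> rm1Conf = new HashMap<>(shared);
        Map<String, String> rm2Conf = new HashMap<>(shared);
        initRm(rm1Conf, "rm1");
        initRm(rm2Conf, "rm2");
        assert "rm1".equals(rm1Conf.get("yarn.resourcemanager.ha.id"));
        assert "rm2".equals(rm2Conf.get("yarn.resourcemanager.ha.id"));
    }
}
```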
[jira] [Assigned] (YARN-3558) Additional containers getting reserved from RM in case of Fair scheduler
[ https://issues.apache.org/jira/browse/YARN-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-3558: - Assignee: Sunil G > Additional containers getting reserved from RM in case of Fair scheduler > > > Key: YARN-3558 > URL: https://issues.apache.org/jira/browse/YARN-3558 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Affects Versions: 2.7.0 > Environment: OS :Suse 11 Sp3 > Setup : 2 RM 2 NM > Scheduler : Fair scheduler >Reporter: Bibin A Chundatt >Assignee: Sunil G > Attachments: Amlog.txt, rm.log > > > Submit PI job with 16 maps > Total container expected : 16 MAPS + 1 Reduce + 1 AM > Total containers reserved by RM is 21 > Below set of containers are not being used for execution > container_1430213948957_0001_01_20 > container_1430213948957_0001_01_19 > RM Containers reservation and states > {code} > Processing container_1430213948957_0001_01_01 of type START > Processing container_1430213948957_0001_01_01 of type ACQUIRED > Processing container_1430213948957_0001_01_01 of type LAUNCHED > Processing container_1430213948957_0001_01_02 of type START > Processing container_1430213948957_0001_01_03 of type START > Processing container_1430213948957_0001_01_02 of type ACQUIRED > Processing container_1430213948957_0001_01_03 of type ACQUIRED > Processing container_1430213948957_0001_01_04 of type START > Processing container_1430213948957_0001_01_05 of type START > Processing container_1430213948957_0001_01_04 of type ACQUIRED > Processing container_1430213948957_0001_01_05 of type ACQUIRED > Processing container_1430213948957_0001_01_02 of type LAUNCHED > Processing container_1430213948957_0001_01_04 of type LAUNCHED > Processing container_1430213948957_0001_01_06 of type RESERVED > Processing container_1430213948957_0001_01_03 of type LAUNCHED > Processing container_1430213948957_0001_01_05 of type LAUNCHED > Processing container_1430213948957_0001_01_07 of type START > Processing 
container_1430213948957_0001_01_07 of type ACQUIRED > Processing container_1430213948957_0001_01_07 of type LAUNCHED > Processing container_1430213948957_0001_01_08 of type RESERVED > Processing container_1430213948957_0001_01_02 of type FINISHED > Processing container_1430213948957_0001_01_06 of type START > Processing container_1430213948957_0001_01_06 of type ACQUIRED > Processing container_1430213948957_0001_01_06 of type LAUNCHED > Processing container_1430213948957_0001_01_04 of type FINISHED > Processing container_1430213948957_0001_01_09 of type START > Processing container_1430213948957_0001_01_09 of type ACQUIRED > Processing container_1430213948957_0001_01_09 of type LAUNCHED > Processing container_1430213948957_0001_01_10 of type RESERVED > Processing container_1430213948957_0001_01_03 of type FINISHED > Processing container_1430213948957_0001_01_08 of type START > Processing container_1430213948957_0001_01_08 of type ACQUIRED > Processing container_1430213948957_0001_01_08 of type LAUNCHED > Processing container_1430213948957_0001_01_05 of type FINISHED > Processing container_1430213948957_0001_01_11 of type START > Processing container_1430213948957_0001_01_11 of type ACQUIRED > Processing container_1430213948957_0001_01_11 of type LAUNCHED > Processing container_1430213948957_0001_01_07 of type FINISHED > Processing container_1430213948957_0001_01_12 of type START > Processing container_1430213948957_0001_01_12 of type ACQUIRED > Processing container_1430213948957_0001_01_12 of type LAUNCHED > Processing container_1430213948957_0001_01_13 of type RESERVED > Processing container_1430213948957_0001_01_06 of type FINISHED > Processing container_1430213948957_0001_01_10 of type START > Processing container_1430213948957_0001_01_10 of type ACQUIRED > Processing container_1430213948957_0001_01_10 of type LAUNCHED > Processing container_1430213948957_0001_01_09 of type FINISHED > Processing container_1430213948957_0001_01_14 of type START > Processing 
container_1430213948957_0001_01_14 of type ACQUIRED > Processing container_1430213948957_0001_01_14 of type LAUNCHED > Processing container_1430213948957_0001_01_15 of type RESERVED > Processing container_1430213948957_0001_01_08 of type FINISHED > Processing container_1430213948957_0001_01_13 of type START > Processing container_1430213948957_0001_01_16 of type RESERVED > Processing container_1430213948957_0001_01_13 of type ACQUIRED > Processing container_1430213948957_0001_01_13 of type
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570171#comment-14570171 ] Zhijie Shen commented on YARN-3044: --- [~Naganarasimha], I'm fine with the last patch. Will do some local test. However, the patch doesn't apply because of YARN-1462. I think we need to add tag info for v2 publisher too. Would you mind taking care of it? > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044-YARN-2928.009.patch, YARN-3044.20150325-1.patch, > YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570131#comment-14570131 ] zhihai xu commented on YARN-3749: - Hi [~chenchun], thanks for updating the patch quickly. bq. only make a copy of the configuration in initResourceManager when there are multiple RMs. It is a nice optimization. The latest patch YARN-3749.7.patch LGTM. Also, the test failure (TestNodeLabelContainerAllocation) is not related to the patch. > We should make a copy of configuration when init MiniYARNCluster with > multiple RMs > -- > > Key: YARN-3749 > URL: https://issues.apache.org/jira/browse/YARN-3749 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chun Chen >Assignee: Chun Chen > Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, > YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch > > > When I was trying to write a test case for YARN-2674, I found the DS client > trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 > during RM failover, even though I initially set > yarn.resourcemanager.address.rm1=0.0.0.0:18032 and > yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is > in ClientRMService where the value of yarn.resourcemanager.address.rm2 > is changed to 0.0.0.0:18032. See the following code in ClientRMService: > {code} > clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, >YarnConfiguration.RM_ADDRESS, > > YarnConfiguration.DEFAULT_RM_ADDRESS, >server.getListenerAddress()); > {code} > Since we use the same configuration instance for rm1 and rm2 and init both > RMs before we start them, yarn.resourcemanager.ha.id is changed to rm2 > during init of rm2 and remains rm2 while rm1 is starting. > So I think it is safer to make a copy of the configuration when initializing > each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570082#comment-14570082 ] Hadoop QA commented on YARN-3762: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 28s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 53s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 51s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 32s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 25s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 88m 17s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12737043/yarn-3762-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c1d50a9 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8170/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8170/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8170/console | This message was automatically generated. > FairScheduler: CME on FSParentQueue#getQueueUserAclInfo > --- > > Key: YARN-3762 > URL: https://issues.apache.org/jira/browse/YARN-3762 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Critical > Attachments: yarn-3762-1.patch, yarn-3762-1.patch > > > In our testing, we ran into the following ConcurrentModificationException: > {noformat} > halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 > 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, > queueName=root.testyarnpool3, queueCurrentCapacity=0.0, > queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 > 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client > java.util.ConcurrentModificationException: > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
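For background, the exception in the stack trace above is what ArrayList's fail-fast iterator throws when the list is structurally modified during iteration. A minimal, self-contained illustration of the failure mode follows, along with one possible remedy (iterating over a snapshot via CopyOnWriteArrayList); the actual patch may well take a different approach, such as locking, and the class and queue names here are illustrative only.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Minimal illustration (not the actual FSParentQueue code): iterating a list
// of child queues while another code path adds a queue to it.
public class CmeDemo {
    // Returns true if iterating the list while adding to it throws a CME.
    static boolean throwsCme(List<String> queues) {
        try {
            for (String q : queues) {
                if (q.equals("root.a")) {
                    queues.add("root.c"); // structural modification mid-iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // ArrayList's fail-fast iterator detects the modification.
        assert throwsCme(new ArrayList<>(List.of("root.a", "root.b")));
        // CopyOnWriteArrayList iterates over a snapshot, so no CME.
        assert !throwsCme(new CopyOnWriteArrayList<>(List.of("root.a", "root.b")));
    }
}
```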
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570040#comment-14570040 ] Matthew Jacobs commented on YARN-2194: -- Thanks, [~sidharta-s]. So the change would be in how the container-executor accepts lists of paths, not attempting to re-mount the controllers, right? If I understand it correctly, that sounds like a good plan to me. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3510) Create an extension of ProportionalCapacityPreemptionPolicy which preempts a number of containers from each application in a way which respects fairness
[ https://issues.apache.org/jira/browse/YARN-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570039#comment-14570039 ] Craig Welch commented on YARN-3510: --- [~leftnoteasy] and I had some offline discussion. The patch currently here is simply meant to keep from unbalancing whatever allocation process is active by, generally, keeping relative usage between applications the same. It doesn't attempt to actively re-allocate in a way which achieves the overall allocation policy, i.e., "as if all the applications had started at once" (this is a more complex proposition, obviously). There's a desire to have this because, among other things, sometime down the road we may do preemption just among users/applications in a queue, and it will be necessary for the preemption to actively work toward the allocation goals to do that, rather than just maintain current levels. This will add some medium-level complexity to the current patch; the deltas with the current approach are: Since the effect of preemption on ordering for fairness doesn't occur until the container is released, and we want to consider it right away, we will need to retain info about "pending preemption" for comparison on the app resources (it will be a deduction from usage for ordering purposes, as if the preemption had already happened). The preemptEvenly loop will need to reorder the app which was preempted after each preemption and then restart the iteration over apps (not necessarily over all apps, again, just until the first preemption). > Create an extension of ProportionalCapacityPreemptionPolicy which preempts a > number of containers from each application in a way which respects fairness > > > Key: YARN-3510 > URL: https://issues.apache.org/jira/browse/YARN-3510 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Craig Welch >Assignee: Craig Welch > Attachments: YARN-3510.2.patch, YARN-3510.3.patch, YARN-3510.5.patch, > 
YARN-3510.6.patch > > > The ProportionalCapacityPreemptionPolicy preempts as many containers from > applications as it can during it's preemption run. For fifo this makes > sense, as it is prempting in reverse order & therefore maintaining the > primacy of the "oldest". For fair ordering this does not have the desired > effect - instead, it should preempt a number of containers from each > application which maintains a fair balance /close to a fair balance between > them -- This message was sent by Atlassian JIRA (v6.3.4#6332)
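The "pending preemption" deduction and per-preemption reordering described in the comment above can be sketched as a toy model. All names here are hypothetical, and simple container counts stand in for resources; this is a sketch of the idea, not the actual scheduler code.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch: repeatedly preempt one container from the app with the largest
// effective usage, where effective usage deducts containers already marked
// for preemption, and re-evaluate the ordering after every preemption.
public class PreemptEvenlySketch {
    static Map<String, Integer> planPreemptions(Map<String, Integer> usage, int toPreempt) {
        Map<String, Integer> pending = new TreeMap<>();
        for (String app : usage.keySet()) {
            pending.put(app, 0);
        }
        for (int i = 0; i < toPreempt; i++) {
            // Pick the app with the largest usage minus pending preemptions,
            // as if earlier preemptions in this run had already happened.
            String victim = Collections.max(usage.keySet(),
                Comparator.comparingInt(a -> usage.get(a) - pending.get(a)));
            pending.merge(victim, 1, Integer::sum);
        }
        return pending;
    }

    public static void main(String[] args) {
        Map<String, Integer> usage = new TreeMap<>(Map.of("app1", 5, "app2", 3));
        Map<String, Integer> pending = planPreemptions(usage, 4);
        // Preemptions spread so both apps end at the same effective usage (2).
        assert pending.get("app1") == 3 && pending.get("app2") == 1;
    }
}
```

Without the pending-preemption deduction, all four preemptions would hit the initially largest app, which is exactly the imbalance the comment wants to avoid.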
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570037#comment-14570037 ] Sidharta Seethana commented on YARN-2194: - There are two different issues here: * container-executor binary invocation uses ‘,’ as a separator when supplying a list of paths - which breaks when a path contains ‘,’ * cpu,cpuacct are mounted together by default on RHEL7 Now, for the latter issue: In {{CgroupsLCEResourcesHandler}}, the following steps occur: * If the {{yarn.nodemanager.linux-container-executor.cgroups.mount}} switch is enabled, the ‘cpu’ controller is explicitly mounted at the specified path. * (irrespective of the state of the switch) The {{/proc/mounts}} file (possibly updated by the previous step) is subsequently parsed to determine the mount locations for the various cgroup controllers - this parsing code seems to be correct even if cpu and cpuacct are mounted in one location. So, the thing we need to fix is the separator issue and we should be good. The important thing to remember is that there are *two* cgroups implementation classes ({{CgroupsLCEResourcesHandler}} and {{CGroupsHandlerImpl}}). Hopefully, this will be addressed soon (YARN-3542) - or we risk divergence. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
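The separator problem described above can be seen in isolation with a self-contained sketch. The '%' separator used here is just one possible replacement, chosen for illustration, and these class and path names are not the actual container-executor interface.

```java
// Sketch of the separator issue: on RHEL7 the combined controller mount
// "cpu,cpuacct" puts a comma inside the path itself, so a comma-separated
// list of cgroup paths can no longer be split unambiguously.
public class CgroupSeparatorDemo {
    static final String RHEL7_CPU_PATH = "/sys/fs/cgroup/cpu,cpuacct/yarn/container_01";
    static final String MEMORY_PATH = "/sys/fs/cgroup/memory/yarn/container_01";

    public static void main(String[] args) {
        // Joining with ',' and splitting back mangles the RHEL7 path:
        // two paths come back as three fragments.
        String commaList = RHEL7_CPU_PATH + "," + MEMORY_PATH;
        assert commaList.split(",").length == 3;

        // A separator that cannot appear in the paths (here '%', purely as
        // an example) round-trips cleanly.
        String percentList = RHEL7_CPU_PATH + "%" + MEMORY_PATH;
        String[] paths = percentList.split("%");
        assert paths.length == 2;
        assert paths[0].equals(RHEL7_CPU_PATH);
    }
}
```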
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570025#comment-14570025 ] Wangda Tan commented on YARN-3733: -- Took a look at the patch and discussion. Thanks for working on this, [~rohithsharma]. I think what [~sunilg] mentioned in https://issues.apache.org/jira/browse/YARN-3733?focusedCommentId=14568880&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14568880 makes sense to me. If the clusterResource is 0, we can compare individual resource types. It could be: {code} Returns >: when l.mem > right.mem || l.cpu > right.cpu Returns =: when (l.mem <= right.mem && l.cpu >= right.cpu) || (l.mem >= right.mem && l.cpu <= right.cpu) Returns <: when l.mem < right.mem || l.cpu < right.cpu {code} This produces the same result as the INF approach in the patch, but can also compare when both l/r have > 0 values. The reason I prefer this is: I'm sure the patch can solve the am-resource-percent problem, but with the suggested approach we can make sure we get a more reasonable result if we need to compare non-zero resources when clusterResource is zero (for example, sorting applications by their requirements when clusterResource is zero). And to avoid future regression, could you add a test to verify the am-resource-limit problem is solved? > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. 
Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent tasks > 5. Switch RM > Actual > = > For 12 jobs the AM gets allocated and all 12 start running > No other YARN child is initiated, *all 12 jobs stay in Running state forever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
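The comparison rules Wangda proposes above can be written out as a small sketch. The class and method names here are hypothetical (this is not the actual DominantResourceCalculator code), and mixed dominance (greater in one dimension, smaller in the other) is resolved as equality, matching the '=' clause of the rules.

```java
// Hedged sketch of the zero-cluster-resource comparison proposed above.
public class ZeroClusterResourceCompare {
    // Compare (lMem, lCpu) against (rMem, rCpu): greater in some dimension
    // and not smaller in the other => 1; the mirror case => -1; everything
    // else (equal, or each side dominant in one dimension) => 0.
    static int compare(long lMem, long lCpu, long rMem, long rCpu) {
        boolean lGreater = lMem > rMem || lCpu > rCpu;
        boolean rGreater = lMem < rMem || lCpu < rCpu;
        if (lGreater && !rGreater) {
            return 1;
        }
        if (rGreater && !lGreater) {
            return -1;
        }
        return 0;
    }

    public static void main(String[] args) {
        assert compare(2048, 2, 1024, 1) == 1;   // dominant in both dimensions
        assert compare(1024, 1, 2048, 2) == -1;  // dominated in both dimensions
        assert compare(1024, 1, 1024, 1) == 0;   // equal
        assert compare(2048, 1, 1024, 2) == 0;   // mixed dominance => "equal"
    }
}
```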
[jira] [Updated] (YARN-3534) Collect memory/cpu usage on the node
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-3534: -- Attachment: YARN-3534-10.patch Addressed some review comments. > Collect memory/cpu usage on the node > > > Key: YARN-3534 > URL: https://issues.apache.org/jira/browse/YARN-3534 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.0 >Reporter: Inigo Goiri >Assignee: Inigo Goiri > Attachments: YARN-3534-1.patch, YARN-3534-10.patch, > YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, > YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, > YARN-3534-9.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > YARN should be aware of the resource utilization of the nodes when scheduling > containers. To that end, this task will implement the collection of memory/cpu > usage on the node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570007#comment-14570007 ] Zhijie Shen commented on YARN-3725: --- bq. is there a JIRA for the longer term fix? Yeah, I've filed YARN-3761 previously. > App submission via REST API is broken in secure mode due to Timeline DT > service address is empty > > > Key: YARN-3725 > URL: https://issues.apache.org/jira/browse/YARN-3725 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, timelineserver >Affects Versions: 2.7.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Fix For: 2.7.1 > > Attachments: YARN-3725.1.patch > > > YARN-2971 changes TimelineClient to use the service address from the Timeline DT > to renew the DT instead of the configured address. This breaks the procedure of > submitting a YARN app via the REST API in secure mode. > The problem is that the service address is set by the client instead of the > server in Java code. The REST API response is an encoded token String, so it > is inconvenient to deserialize it, set the service address, and > serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569992#comment-14569992 ] Karthik Kambatla commented on YARN-3762: Changed it to critical and targeting 2.8.0, as it only fails the application and not the RM. > FairScheduler: CME on FSParentQueue#getQueueUserAclInfo > --- > > Key: YARN-3762 > URL: https://issues.apache.org/jira/browse/YARN-3762 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Critical > Attachments: yarn-3762-1.patch, yarn-3762-1.patch > > > In our testing, we ran into the following ConcurrentModificationException: > {noformat} > halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 > 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, > queueName=root.testyarnpool3, queueCurrentCapacity=0.0, > queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 > 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client > java.util.ConcurrentModificationException: > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3762: --- Priority: Critical (was: Blocker) Target Version/s: 2.8.0 (was: 2.7.1) > FairScheduler: CME on FSParentQueue#getQueueUserAclInfo > --- > > Key: YARN-3762 > URL: https://issues.apache.org/jira/browse/YARN-3762 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Critical > Attachments: yarn-3762-1.patch, yarn-3762-1.patch > > > In our testing, we ran into the following ConcurrentModificationException: > {noformat} > halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 > 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, > queueName=root.testyarnpool3, queueCurrentCapacity=0.0, > queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 > 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client > java.util.ConcurrentModificationException: > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3762: --- Attachment: yarn-3762-1.patch Sorry, I forgot to rebase and included some HDFS change as well. > FairScheduler: CME on FSParentQueue#getQueueUserAclInfo > --- > > Key: YARN-3762 > URL: https://issues.apache.org/jira/browse/YARN-3762 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Blocker > Attachments: yarn-3762-1.patch, yarn-3762-1.patch > > > In our testing, we ran into the following ConcurrentModificationException: > {noformat} > halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 > 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, > queueName=root.testyarnpool3, queueCurrentCapacity=0.0, > queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 > 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client > java.util.ConcurrentModificationException: > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3762: --- Attachment: yarn-3762-1.patch Here is a patch that protects FSParentQueue members with read-write locks. > FairScheduler: CME on FSParentQueue#getQueueUserAclInfo > --- > > Key: YARN-3762 > URL: https://issues.apache.org/jira/browse/YARN-3762 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Blocker > Attachments: yarn-3762-1.patch > > > In our testing, we ran into the following ConcurrentModificationException: > {noformat} > halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 > 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, > queueName=root.testyarnpool3, queueCurrentCapacity=0.0, > queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 > 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client > java.util.ConcurrentModificationException: > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
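The read-write-lock approach described in the comment above can be sketched as follows. This is an illustrative sketch of the pattern, not the actual contents of yarn-3762-1.patch; the class and method names are simplified stand-ins for FSParentQueue:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of guarding a parent queue's mutable child list with a
// read-write lock: readers (e.g. an aggregation like getQueueUserAclInfo)
// hold the read lock while iterating, writers hold the write lock, so
// iteration can no longer race with modification and throw
// ConcurrentModificationException.
public class ParentQueueSketch {
    private final List<String> childQueues = new ArrayList<>();
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public void addChildQueue(String name) {
        rwLock.writeLock().lock();
        try {
            childQueues.add(name);
        } finally {
            rwLock.writeLock().unlock();
        }
    }

    public List<String> snapshotChildQueues() {
        rwLock.readLock().lock();
        try {
            // Safe to iterate: no writer can mutate the list while the
            // read lock is held. Return a copy so callers never see live state.
            return new ArrayList<>(childQueues);
        } finally {
            rwLock.readLock().unlock();
        }
    }
}
```

Multiple readers can hold the read lock concurrently, so read-heavy paths such as ACL queries stay cheap while writes remain exclusive.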
[jira] [Created] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
Karthik Kambatla created YARN-3762: -- Summary: FairScheduler: CME on FSParentQueue#getQueueUserAclInfo Key: YARN-3762 URL: https://issues.apache.org/jira/browse/YARN-3762 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker In our testing, we ran into the following ConcurrentModificationException: {noformat} halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569918#comment-14569918 ] Vinod Kumar Vavilapalli commented on YARN-3725: --- [~zjshen], is there a JIRA for the longer term fix? > App submission via REST API is broken in secure mode due to Timeline DT > service address is empty > > > Key: YARN-3725 > URL: https://issues.apache.org/jira/browse/YARN-3725 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, timelineserver >Affects Versions: 2.7.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Fix For: 2.7.1 > > Attachments: YARN-3725.1.patch > > > YARN-2971 changes TimelineClient to use the service address from the Timeline DT > to renew the DT instead of the configured address. This breaks the procedure of > submitting a YARN app via the REST API in secure mode. > The problem is that the service address is set by the client instead of the > server in Java code. The REST API response is an encoded token String, so it is > inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569899#comment-14569899 ] Philip Langdale commented on YARN-2194: --- You can remount controllers if you retain the same combination as the existing mount point, so I guess you could replace the ',' with something your parsing code can handle (or you could fix the parsing code). In general, life is a lot easier if you can avoid remounting, as you then don't have to worry about managing the mounts' lifecycle. I'd argue the most robust thing to do is to discover the existing mount point from /proc/mounts and then use it if it's present (assuming the comma parsing can be fixed), and don't forget to respect the NodeManager's cgroup paths from /proc/self/mounts. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
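Discovering the existing controller mount point from /proc/mounts, as suggested above, amounts to splitting each line on whitespace and treating the options field as a comma-separated set, which also sidesteps the "cpu,cpuacct" comma problem. A minimal sketch under those assumptions (not the actual NodeManager parsing code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of locating the mount point of a cgroup controller by parsing
// /proc/mounts lines of the form:
//   cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,...,cpuacct,cpu 0 0
// The comma in "cpu,cpuacct" only breaks parsers that treat the whole
// option string as a single controller name; splitting on ',' fixes that.
public class CgroupMountFinder {
    /** Returns the mount path if this line mounts the given controller, else null. */
    public static String mountPointFor(String procMountsLine, String controller) {
        String[] fields = procMountsLine.trim().split("\\s+");
        // fields: device, mount point, fs type, options, dump, pass
        if (fields.length < 4 || !"cgroup".equals(fields[2])) {
            return null;
        }
        Set<String> options = new HashSet<>(Arrays.asList(fields[3].split(",")));
        return options.contains(controller) ? fields[1] : null;
    }
}
```

In practice one would read /proc/mounts (or /proc/self/mounts) line by line and return the first match per controller, falling back to mounting only when nothing is found.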
[jira] [Commented] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569869#comment-14569869 ] Hadoop QA commented on YARN-2392: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 23s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 9m 25s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 23s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 56s | The applied patch generated 2 new checkstyle issues (total was 244, now 245). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 46s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 42s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 52m 1s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 94m 38s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12737003/YARN-2392-002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 03fb5c6 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8169/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8169/console | This message was automatically generated. > add more diags about app retry limits on AM failures > > > Key: YARN-2392 > URL: https://issues.apache.org/jira/browse/YARN-2392 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-2392-001.patch, YARN-2392-002.patch, > YARN-2392-002.patch > > > # when an app fails the failure count is shown, but not what the global + > local limits are. If the two are different, they should both be printed. > # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569778#comment-14569778 ] Matthew Jacobs commented on YARN-2194: -- I'm confused, does this mean that you'll re-mount the cpu and cpuacct controllers? Do we know that other components in the RHEL7 world don't expect them to be in the default place? > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the user of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569773#comment-14569773 ] Jason Lowe commented on YARN-3585: -- +1 latest patch lgtm. Will commit this tomorrow if there are no objections. > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > Attachments: 0001-YARN-3585.patch, YARN-3585.patch > > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 
nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2392: - Priority: Minor (was: Major) > add more diags about app retry limits on AM failures > > > Key: YARN-2392 > URL: https://issues.apache.org/jira/browse/YARN-2392 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-2392-001.patch, YARN-2392-002.patch, > YARN-2392-002.patch > > > # when an app fails the failure count is shown, but not what the global + > local limits are. If the two are different, they should both be printed. > # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2392: - Attachment: YARN-2392-002.patch Patch 002 * in sync with trunk * uses String.format for a more readable format of the response * includes sliding window details in the message There's no test here, for which I apologise. To test this I'd need a test to trigger failures and look for the final error message, which seems excessive for a log tuning. If there's a test for the sliding-window retry that could be patched, I'll do it there. > add more diags about app retry limits on AM failures > > > Key: YARN-2392 > URL: https://issues.apache.org/jira/browse/YARN-2392 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-2392-001.patch, YARN-2392-002.patch, > YARN-2392-002.patch > > > # when an app fails the failure count is shown, but not what the global + > local limits are. If the two are different, they should both be printed. > # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
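A String.format-based diagnostic of the kind patch 002 describes, printing the failure count alongside both the per-app and global limits plus the sliding-window interval, might look like the sketch below. The message wording and variable names are illustrative, not the patch's actual text:

```java
// Illustrative diagnostic message construction: report the attempt count
// together with both retry limits and the failure-validity window, so a
// failed application's final status explains which limit was hit.
public class AmDiagnostics {
    public static String buildMessage(int failedAttempts, int globalMaxAttempts,
                                      int appMaxAttempts, long validityIntervalMs) {
        StringBuilder sb = new StringBuilder(String.format(
            "Application failed %d times", failedAttempts));
        if (validityIntervalMs > 0) {
            // Sliding-window retry: only failures within this window count.
            sb.append(String.format(" in the last %d ms", validityIntervalMs));
        }
        if (appMaxAttempts != globalMaxAttempts) {
            // Per the comment in the description: print both limits when they differ.
            sb.append(String.format(" (app limit = %d, global limit = %d).",
                appMaxAttempts, globalMaxAttempts));
        } else {
            sb.append(String.format(" (limit = %d).", globalMaxAttempts));
        }
        return sb.toString();
    }
}
```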
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569727#comment-14569727 ] zhihai xu commented on YARN-3591: - Hi [~lavkesh], I think we can create a separate JIRA for storing local Error directories in NM state store, which will be a good enhancement. thanks [~sunilg]! Adding a new API to get local error directories is also a good suggestion. But I think it will be enough to just check newErrorDirs instead of all errorDirs. To better support NM recovery and make DirsChangeListener interface simple, I propose the following changes: 1.In DirectoryCollection, notify listener when any set of dirs(localDirs, errorDirs and fullDirs) are changed The code change at {{DirectoryCollection#checkDirs}} looks like the following: {code} bool needNotifyListener = false; needNotifyListener = setChanged; for (String dir : preCheckFullDirs) { if (postCheckOtherDirs.contains(dir)) { needNotifyListener = true; LOG.warn("Directory " + dir + " error " + dirsFailedCheck.get(dir).message); } } for (String dir : preCheckOtherErrorDirs) { if (postCheckFullDirs.contains(dir)) { needNotifyListener = true; LOG.warn("Directory " + dir + " error " + dirsFailedCheck.get(dir).message); } } if (needNotifyListener) { for (DirsChangeListener listener : dirsChangeListeners) { listener.onDirsChanged(); } } {code} 2. add an API to get local error directories. As [~sunilg] suggested, We can add an API {{synchronized List getErrorDirs()}} in DirectoryCollection.java We also need add an API {{public List getLocalErrorDirs()}} in LocalDirsHandlerService.java, which will call {{DirectoryCollection#getErrorDirs}} 3. add a field {{Set preLocalErrorDirs}} in ResourceLocalizationService.java to store previous local error directories. {{ResourceLocalizationService#preLocalErrorDirs}} should be loaded from state store at the beginning if we support storing local Error directories in NM state store. 
4.The following is pseudo code for {{localDirsChangeListener#onDirsChanged}}: {code} Set curLocalErrorDirs = new HashSet(dirsHandler.getLocalErrorDirs()); List newErrorDirs = new ArrayList(); List newRepairedDirs = new ArrayList(); for (String dir : curLocalErrorDirs) { if (!preLocalErrorDirs.contains(dir)) { newErrorDirs.add(dir); } } for (String dir : preLocalErrorDirs) { if (!curLocalErrorDirs.contains(dir)) { newRepairedDirs.add(dir); } } for (String localDir : newRepairedDirs) { cleanUpLocalDir(lfs, delService, localDir); } if (!newErrorDirs.isEmpty()) { //As Sunil suggested, checkLocalizedResources will call removeResource on those localized resources whose parent is present in newErrorDirs. publicRsrc.checkLocalizedResources(newErrorDirs); for (LocalResourcesTracker tracker : privateRsrc.values()) { tracker.checkLocalizedResources(newErrorDirs); } } if (!newErrorDirs.isEmpty() || !newRepairedDirs.isEmpty()) { preLocalErrorDirs = curLocalErrorDirs; stateStore.storeLocalErrorDirs(StringUtils.arrayToString(curLocalErrorDirs.toArray(new String[0]))); } checkAndInitializeLocalDirs(); {code} 5. It will be better to move {{verifyDirUsingMkdir(testDir)}} right after {{DiskChecker.checkDir(testDir)}} in {{DirectoryCollection#testDirs}}, so we can detect the error directory before detecting the full directory. Please feel free to change or add more to my proposal. > Resource Localisation on a bad disk causes subsequent containers failure > - > > Key: YARN-3591 > URL: https://issues.apache.org/jira/browse/YARN-3591 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Lavkesh Lahngir >Assignee: Lavkesh Lahngir > Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, > YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch > > > It happens when a resource is localised on the disk, after localising that > disk has gone bad. NM keeps paths for localised resources in memory. 
At the > time of a resource request, isResourcePresent(rsrc) will be called, which calls > file.exists() on the localised path. > In some cases when the disk has gone bad, inodes are still cached and > file.exists() returns true, but at read time the file cannot be opened. > Note: file.exists() actually calls stat64 natively, which returns true because > it was able to find inode information from the OS. > A proposal is to call file.list() on the parent path of the resource, which > will call open() natively. If the disk is good it should return an array of > paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
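The check proposed in the description, listing the parent directory so the JVM performs a real open() and fails fast on a bad disk instead of trusting a possibly inode-cached file.exists(), can be sketched as follows (the class and method names here are hypothetical):

```java
import java.io.File;

// Sketch of the proposed resource-presence check: File.exists() can return
// a stale "true" from cached inode data on a failed disk, while
// File.list() on the parent performs a native open() of the directory and
// returns null when it cannot actually be read.
public class ResourcePresenceCheck {
    public static boolean isResourcePresent(File resource) {
        if (!resource.exists()) {
            return false;
        }
        File parent = resource.getParentFile();
        if (parent == null) {
            return true;  // resource is a filesystem root; nothing to list
        }
        String[] entries = parent.list();  // forces a real open() of the directory
        // On a healthy disk the listing yields at least one entry
        // (the resource itself); null or empty indicates a bad disk.
        return entries != null && entries.length >= 1;
    }
}
```

Note that listing a large localizer directory is more expensive than a stat, so a real implementation would weigh that cost on the hot path.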
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569648#comment-14569648 ] Siddharth Seth commented on YARN-1462: -- ApplicationReport.newInstance is used by mapreduce and Tez, and potentially other applications which may be modeled along the same AMs. It'll be useful to make the API change here compatible. This is along the lines of newInstances being used for various constructs like ContainerId, AppId, etc. With the change, I don't believe MR2.6 will work with a 2.8 cluster - depending on how the classpath is setup. > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
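One common way to keep a factory such as ApplicationReport.newInstance compatible when a field is added, as the comment requests, is to retain the old signature as an overload that delegates to the widened one with a default. This is purely a sketch of the pattern with simplified stand-in types, not the actual YARN change:

```java
import java.util.Collections;
import java.util.Set;

// Sketch of keeping an old factory method compatible when a new parameter
// (here: a hypothetical "tags" field) is added: the original signature
// stays and forwards to the widened one, so clients such as MR 2.6 that
// were compiled against the old API keep linking against a newer cluster.
public class ReportFactory {
    public static class Report {
        public final String appId;
        public final Set<String> tags;
        Report(String appId, Set<String> tags) {
            this.appId = appId;
            this.tags = tags;
        }
    }

    // New signature carrying the added field.
    public static Report newInstance(String appId, Set<String> tags) {
        return new Report(appId, tags);
    }

    // Old signature retained for compatibility; delegates with a default.
    public static Report newInstance(String appId) {
        return newInstance(appId, Collections.<String>emptySet());
    }
}
```

Removing the old overload instead of delegating is exactly the kind of change that breaks older clients at link time, depending on how the classpath is set up.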
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569646#comment-14569646 ] Hadoop QA commented on YARN-3069: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 46s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 2m 58s | Site still builds. | | {color:green}+1{color} | checkstyle | 1m 36s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 22s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 34s | Tests passed in hadoop-common. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. 
| | | | 72m 56s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736976/YARN-3069.011.patch | | Optional Tests | site javadoc javac unit findbugs checkstyle | | git revision | trunk / a2bd621 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8168/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8168/console | This message was automatically generated. > Document missing properties in yarn-default.xml > --- > > Key: YARN-3069 > URL: https://issues.apache.org/jira/browse/YARN-3069 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: BB2015-05-TBR, supportability > Attachments: YARN-3069.001.patch, YARN-3069.002.patch, > YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, > YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, > YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch > > > The following properties are currently not defined in yarn-default.xml. > These properties should either be > A) documented in yarn-default.xml OR > B) listed as an exception (with comments, e.g. for internal use) in the > TestYarnConfigurationFields unit test > Any comments for any of the properties below are welcome. 
> org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker > org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore > security.applicationhistory.protocol.acl > yarn.app.container.log.backups > yarn.app.container.log.dir > yarn.app.container.log.filesize > yarn.client.app-submission.poll-interval > yarn.client.application-client-protocol.poll-timeout-ms > yarn.is.minicluster > yarn.log.server.url > yarn.minicluster.control-resource-monitoring > yarn.minicluster.fixed.ports > yarn.minicluster.use-rpc > yarn.node-labels.fs-store.retry-policy-spec > yarn.node-labels.fs-store.root-dir > yarn.node-labels.manager-class > yarn.nodemanager.container-executor.os.sched.priority.adjustment > yarn.nodemanager.container-monitor.process-tree.class > yarn.nodemanager.disk-health-checker.enable > yarn.nodemanager.docker-container-executor.image-name > yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms > yarn.nodemanager.linux-container-executor.group > yarn.nodemanager.log.deletion-threads-count > yarn.nodemanager.user-home-dir > yarn.nodemanager.webapp.https.address > yarn.nodemanager.webapp.spnego-keytab-fil
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569548#comment-14569548 ] Sergey Shelukhin commented on YARN-1462: [~sseth] can you please comment on the above (use of Private API)? > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3069: - Attachment: YARN-3069.011.patch Thanks Akira! New patch with the following changes: - Fix description for yarn.node-labels.fs-store.retry-policy-spec - Remove YARN registry entries from yarn-default.xml - Remove one outdated entry yarn.application.classpath.prepend.distcache - Add entry for yarn.intermediate-data-encryption.enable I'll also go through the yarn-default.xml file once more to make sure no default values will change. > Document missing properties in yarn-default.xml > --- > > Key: YARN-3069 > URL: https://issues.apache.org/jira/browse/YARN-3069 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: BB2015-05-TBR, supportability > Attachments: YARN-3069.001.patch, YARN-3069.002.patch, > YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, > YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, > YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch > > > The following properties are currently not defined in yarn-default.xml. > These properties should either be > A) documented in yarn-default.xml OR > B) listed as an exception (with comments, e.g. for internal use) in the > TestYarnConfigurationFields unit test > Any comments for any of the properties below are welcome. 
> org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker > org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore > security.applicationhistory.protocol.acl > yarn.app.container.log.backups > yarn.app.container.log.dir > yarn.app.container.log.filesize > yarn.client.app-submission.poll-interval > yarn.client.application-client-protocol.poll-timeout-ms > yarn.is.minicluster > yarn.log.server.url > yarn.minicluster.control-resource-monitoring > yarn.minicluster.fixed.ports > yarn.minicluster.use-rpc > yarn.node-labels.fs-store.retry-policy-spec > yarn.node-labels.fs-store.root-dir > yarn.node-labels.manager-class > yarn.nodemanager.container-executor.os.sched.priority.adjustment > yarn.nodemanager.container-monitor.process-tree.class > yarn.nodemanager.disk-health-checker.enable > yarn.nodemanager.docker-container-executor.image-name > yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms > yarn.nodemanager.linux-container-executor.group > yarn.nodemanager.log.deletion-threads-count > yarn.nodemanager.user-home-dir > yarn.nodemanager.webapp.https.address > yarn.nodemanager.webapp.spnego-keytab-file > yarn.nodemanager.webapp.spnego-principal > yarn.nodemanager.windows-secure-container-executor.group > yarn.resourcemanager.configuration.file-system-based-store > yarn.resourcemanager.delegation-token-renewer.thread-count > yarn.resourcemanager.delegation.key.update-interval > yarn.resourcemanager.delegation.token.max-lifetime > yarn.resourcemanager.delegation.token.renew-interval > yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size > yarn.resourcemanager.metrics.runtime.buckets > yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs > yarn.resourcemanager.reservation-system.class > yarn.resourcemanager.reservation-system.enable > yarn.resourcemanager.reservation-system.plan.follower > yarn.resourcemanager.reservation-system.planfollower.time-step > 
yarn.resourcemanager.rm.container-allocation.expiry-interval-ms > yarn.resourcemanager.webapp.spnego-keytab-file > yarn.resourcemanager.webapp.spnego-principal > yarn.scheduler.include-port-in-node-name > yarn.timeline-service.delegation.key.update-interval > yarn.timeline-service.delegation.token.max-lifetime > yarn.timeline-service.delegation.token.renew-interval > yarn.timeline-service.generic-application-history.enabled > > yarn.timeline-service.generic-application-history.fs-history-store.compression-type > yarn.timeline-service.generic-application-history.fs-history-store.uri > yarn.timeline-service.generic-application-history.store-class > yarn.timeline-service.http-cross-origin.enabled > yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
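The rule stated in the description (every declared property must be documented in yarn-default.xml or listed as an exception in TestYarnConfigurationFields) can be sketched as follows. This is only an illustration of the set arithmetic involved; the real test uses reflection over the configuration classes and a proper XML parser, not the regex scan shown here, and all names in the sketch are illustrative:

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefaultXmlCheck {
    // Collect <name> entries from a yarn-default.xml-style document.
    // Regex scanning is a simplification for the sketch.
    static Set<String> documentedProperties(String defaultXml) {
        Set<String> names = new TreeSet<>();
        Matcher m = Pattern.compile("<name>([^<]+)</name>").matcher(defaultXml);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    // A declared property must be documented or explicitly excepted
    // (e.g. internal-use-only properties); anything left over fails the test.
    static Set<String> undocumented(Set<String> declared,
                                    Set<String> documented,
                                    Set<String> exceptions) {
        Set<String> missing = new TreeSet<>(declared);
        missing.removeAll(documented);
        missing.removeAll(exceptions);
        return missing;
    }
}
```

Each property in the list above would either shrink the `undocumented` set by gaining a yarn-default.xml entry, or move onto the exception list with a comment.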
[jira] [Assigned] (YARN-3761) Set delegation token service address at the server side
[ https://issues.apache.org/jira/browse/YARN-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3761: -- Assignee: Varun Saxena > Set delegation token service address at the server side > --- > > Key: YARN-3761 > URL: https://issues.apache.org/jira/browse/YARN-3761 > Project: Hadoop YARN > Issue Type: Improvement > Components: security >Reporter: Zhijie Shen >Assignee: Varun Saxena > > Nowadays, YARN components generate the delegation token without the service > address set, and leave it to the client to set. With our java client library, > it is usually fine. However, if users are using REST API, it's going to be a > problem: The delegation token is returned as a url string. It's so unfriendly > for the thin client to deserialize the url string, set the token service > address and serialize it again for further usage. If we move the task of > setting the service address to the server side, the client can get rid of > this trouble. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
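The proposal above can be sketched as follows. `Token` here is a minimal stand-in for `org.apache.hadoop.security.token.Token` (shown only so the sketch is self-contained), and `issueToken` is a hypothetical server-side step; the point is that the host:port service string gets filled in before the token leaves the server, so a thin REST client never has to deserialize, patch, and re-serialize it:

```java
import java.net.InetSocketAddress;

public class TokenServiceDemo {
    // Minimal stand-in for the real delegation token class.
    static class Token {
        private String service = "";
        void setService(String s) { service = s; }
        String getService() { return service; }
    }

    // Sketch of the proposed server-side step: the issuing component sets
    // its own bind address as the token's service before returning it.
    static Token issueToken(InetSocketAddress serverAddr) {
        Token t = new Token();
        t.setService(serverAddr.getHostString() + ":" + serverAddr.getPort());
        return t;
    }
}
```

Today this address-setting happens in the Java client library; moving it behind the REST endpoint means the URL-encoded token string a thin client receives is already usable as-is.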
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569480#comment-14569480 ] Sunil G commented on YARN-3591: --- If we add a new API which returns only the current set of error dirs (excluding full dirs) {code} synchronized List getErrorDirs() {code} then could we modify LocalResourcesTrackerImpl#checkLocalizedResources in such a way that we call *removeResource* on those localized resources whose parent is among the error dirs? > Resource Localisation on a bad disk causes subsequent containers failure > - > > Key: YARN-3591 > URL: https://issues.apache.org/jira/browse/YARN-3591 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Lavkesh Lahngir >Assignee: Lavkesh Lahngir > Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, > YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch > > > It happens when a disk goes bad after a resource has been localised on it. The NM keeps > paths for localised resources in memory. At the time of a resource request, > isResourcePresent(rsrc) will be called, which calls file.exists() on the localised path. > In some cases when the disk has gone bad, inodes are still cached and > file.exists() returns true. But at the time of reading, the file will not open. > Note: file.exists() actually calls stat64 natively, which returns true because > it was able to find inode information from the OS. > A proposal is to call file.list() on the parent path of the resource, which > will call open() natively. If the disk is good it should return an array of > paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
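The proposal in the description (replace a bare `file.exists()` with a listing of the parent directory) can be sketched as follows. This is a minimal illustration of the idea, not the actual `isResourcePresent` implementation; the class and method names here are hypothetical:

```java
import java.io.File;

public class BadDiskCheck {
    /**
     * Sketch of the proposed presence check. File.list() forces a native
     * open() of the parent directory, which fails on a bad disk even when
     * stale inode data makes file.exists() return true.
     */
    static boolean isResourcePresent(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            return resource.exists();
        }
        // list() returns null when the directory cannot be opened,
        // e.g. because the underlying disk has gone bad.
        String[] entries = parent.list();
        if (entries == null) {
            return false;
        }
        for (String name : entries) {
            if (name.equals(resource.getName())) {
                return true;
            }
        }
        return false;
    }
}
```

On a healthy disk this behaves like `exists()`; the difference only shows up when the directory itself is unreadable.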
[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569465#comment-14569465 ] Xuan Gong commented on YARN-3753: - Committed into branch-2.7. Thanks, Jian > RM failed to come up with "java.io.IOException: Wait for ZKClient creation > timed out" > - > > Key: YARN-3753 > URL: https://issues.apache.org/jira/browse/YARN-3753 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Fix For: 2.7.1 > > Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch > > > RM failed to come up with the following error while submitting an mapreduce > job. > {code:title=RM log} > 015-05-30 03:40:12,190 ERROR recovery.RMStateStore > (RMStateStore.java:transition(179)) - Error storing app: > application_1432956515242_0006 > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager > (ResourceManager.java:handle(750)) - Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. 
Cause: > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMac
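The `runWithCheck` wait that produces the exception in the trace above can be sketched as follows: block until a client becomes available or a deadline passes, then fail with the same `IOException` rather than hanging forever. `waitFor` and its `Supplier` hook are illustrative names, not the actual `ZKRMStateStore` code:

```java
import java.io.IOException;
import java.util.function.Supplier;

public class ZkWaitDemo {
    // Poll until the supplier yields a non-null client or the timeout
    // elapses; on timeout, surface the same error seen in the RM log.
    static <T> T waitFor(Supplier<T> supplier, long timeoutMs)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            T value = supplier.get();
            if (value != null) {
                return value;
            }
            Thread.sleep(10); // brief pause between connection checks
        }
        throw new IOException("Wait for ZKClient creation timed out");
    }
}
```

The failure mode in this JIRA is what happens when that timeout fires during a state-store write: the `IOException` propagates up through the state machine and the RM treats it as a fatal `STATE_STORE_OP_FAILED` event.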
[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569456#comment-14569456 ] Xuan Gong commented on YARN-3753: - +1, LGTM. Check this in > RM failed to come up with "java.io.IOException: Wait for ZKClient creation > timed out" > - > > Key: YARN-3753 > URL: https://issues.apache.org/jira/browse/YARN-3753 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch > > > RM failed to come up with the following error while submitting an mapreduce > job. > {code:title=RM log} > 015-05-30 03:40:12,190 ERROR recovery.RMStateStore > (RMStateStore.java:transition(179)) - Error storing app: > application_1432956515242_0006 > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager > (ResourceManager.java:handle(750)) - Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. 
Cause: > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.h
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569453#comment-14569453 ] Hadoop QA commented on YARN-2618: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12723515/YARN-2618-7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a2bd621 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8167/console | This message was automatically generated. > Avoid over-allocation of disk resources > --- > > Key: YARN-2618 > URL: https://issues.apache.org/jira/browse/YARN-2618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wei Yan >Assignee: Wei Yan > Labels: BB2015-05-TBR > Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, > YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch > > > Subtask of YARN-2139. > This should include > - Add API support for introducing disk I/O as the 3rd type resource. > - NM should report this information to the RM > - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569452#comment-14569452 ] Karthik Kambatla commented on YARN-2618: [~vvasudev] - thanks for the ping. I haven't had the time to do a thorough review of remaining tasks here, and hence avoided committing this. Do you have the cycles to help shepherd this work into the branch? And yes, we should true YARN-2139 up to trunk and commit this. > Avoid over-allocation of disk resources > --- > > Key: YARN-2618 > URL: https://issues.apache.org/jira/browse/YARN-2618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wei Yan >Assignee: Wei Yan > Labels: BB2015-05-TBR > Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, > YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch > > > Subtask of YARN-2139. > This should include > - Add API support for introducing disk I/O as the 3rd type resource. > - NM should report this information to the RM > - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569445#comment-14569445 ] Karthik Kambatla commented on YARN-2194: I haven't looked at it closely, but I think YARN doesn't pick the separator. If we could easily change the separator from within YARN, that is, without requiring any other environment changes by the admin, I'll be +1 for that change. By the way, Linux allows anything but '/' and '%' for filenames. So, picking ':' or '|' is only less likely to cause issues in the future. Who would have thought they would use ',' in a filename? If we continue with the patch posted here, I think [~mjacobs]' suggestion makes sense. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569437#comment-14569437 ] Varun Vasudev commented on YARN-2618: - [~kasha] - should we commit this to the YARN-2139 branch? Should we get the branch up to date with trunk first? > Avoid over-allocation of disk resources > --- > > Key: YARN-2618 > URL: https://issues.apache.org/jira/browse/YARN-2618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wei Yan >Assignee: Wei Yan > Labels: BB2015-05-TBR > Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, > YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch > > > Subtask of YARN-2139. > This should include > - Add API support for introducing disk I/O as the 3rd type resource. > - NM should report this information to the RM > - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569421#comment-14569421 ] Wei Yan commented on YARN-2194: --- [~sidharta-s], thanks for the advice. Using a different separator LGTM. That way, we can keep trusting the "cpu" controller, and it also helps avoid OS-specific changes. Comments? [~kasha], [~vinodkv], [~mjacobs]. And for the new CGroupsHandlerImpl, I didn't find any problem when I checked the patch. [~vvasudev], please correct me if I missed anything. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
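The separator discussion above can be sketched as follows. The NM hands the list of cgroup paths to the container-executor as one string; with ',' as the separator, RHEL7's "cpu,cpuacct" controller breaks the round trip. This sketch uses '%' purely for illustration (the class name and choice of separator are assumptions, not what any posted patch actually uses), and as noted in the thread, no single byte is perfectly safe in a file name, only less likely to collide:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CgroupArgCodec {
    // Hypothetical separator replacing ','; must never appear in the paths.
    private static final String SEP = "%";

    // Join cgroup paths into the single argument string passed across
    // the NM / container-executor boundary.
    static String encode(List<String> cgroupPaths) {
        return String.join(SEP, cgroupPaths);
    }

    // Split the argument string back into individual cgroup paths.
    static List<String> decode(String arg) {
        return Arrays.asList(arg.split(Pattern.quote(SEP)));
    }
}
```

With this encoding, a path containing "cpu,cpuacct" survives the round trip intact because the comma is never treated as a delimiter.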
[jira] [Created] (YARN-3761) Set delegation token service address at the server side
Zhijie Shen created YARN-3761: - Summary: Set delegation token service address at the server side Key: YARN-3761 URL: https://issues.apache.org/jira/browse/YARN-3761 Project: Hadoop YARN Issue Type: Improvement Components: security Reporter: Zhijie Shen Nowadays, YARN components generate the delegation token without the service address set, and leave it to the client to set. With our java client library, it is usually fine. However, if users are using REST API, it's going to be a problem: The delegation token is returned as a url string. It's so unfriendly for the thin client to deserialize the url string, set the token service address and serialize it again for further usage. If we move the task of setting the service address to the server side, the client can get rid of this trouble. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569381#comment-14569381 ] Sunil G commented on YARN-3754: --- [~bibinchundatt] Could you please also attach the NM logs here. > Race condition when the NodeManager is shutting down and container is launched > -- > > Key: YARN-3754 > URL: https://issues.apache.org/jira/browse/YARN-3754 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Sunil G >Priority: Critical > > A container is launched and returned to ContainerImpl after the > NodeManager has closed the DB connection, which results in > {{org.iq80.leveldb.DBException: Closed}}. > *Attaching the exception trace* > {code} > 2015-05-30 02:11:49,122 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Unable to update state store diagnostics for > container_e310_1432817693365_3338_01_02 > java.io.IOException: org.iq80.leveldb.DBException: Closed > at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.iq80.leveldb.DBException: Closed > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) > at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) > ... 15 more > {code} > We can add a check for whether the DB is closed while we move the container from the ACQUIRED > state. > As per the discussion in YARN-3585, this adds the same check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
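The check proposed in the description can be sketched as follows: state-store writes and shutdown synchronize on the same lock, so a container transition that arrives after the store has stopped is dropped instead of hitting `DBException: Closed`. This is an illustration of the pattern only; `SafeStateStore` is a hypothetical stand-in for `NMLeveldbStateStoreService`, with a `HashMap` in place of the real LevelDB:

```java
import java.util.HashMap;
import java.util.Map;

public class SafeStateStore {
    private final Object lock = new Object();
    private boolean closed = false;
    // Stand-in for the LevelDB instance; the real store writes via JniDB.put().
    private final Map<String, String> db = new HashMap<>();

    // Returns false (instead of throwing) when the NM is already shutting down.
    boolean storeContainerDiagnostics(String containerId, String diagnostics) {
        synchronized (lock) {
            if (closed) {
                return false; // shutdown in progress; skip the late update
            }
            db.put(containerId, diagnostics);
            return true;
        }
    }

    // Called from serviceStop(); after this, no further writes are attempted.
    void close() {
        synchronized (lock) {
            closed = true;
        }
    }
}
```

The essential point is that the closed flag and the write share one monitor, so the race between container launch and NM shutdown cannot interleave between the check and the put.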
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569347#comment-14569347 ] Vinod Kumar Vavilapalli commented on YARN-3755: --- We had this long ago in YARN, but removed it as the log files were getting inundated in large/high-throughput clusters. If you combine the command line with the environment (classpath etc.), this can get very long. How about we let individual frameworks like MapReduce/Tez log them as needed? That seems like the right place for debugging too: app developers don't always get access to the daemon logs. > Log the command of launching containers > --- > > Key: YARN-3755 > URL: https://issues.apache.org/jira/browse/YARN-3755 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: YARN-3755-1.patch, YARN-3755-2.patch > > > In the ResourceManager log, YARN logs the command for launching the AM, > which is very useful. But there is no such log in the NM log for launching > containers. It would be difficult to diagnose when containers fail to launch > due to some issue in the commands. Although users can look at the commands in > the container launch script file, that is an internal detail of YARN which > users usually don't know about. From the user's perspective, they only know what commands > they specified when building the YARN application. 
> {code} > 2015-06-01 16:06:42,245 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command > to launch container container_1433145984561_0001_01_01 : > $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true > -Dhadoop.metrics.log.level=WARN -Xmx1024m > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA > -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster > 1>/stdout 2>/stderr > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
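The trade-off raised in the comment above (useful launch-command logging vs. inundated daemon logs) can be sketched as follows: emit the expanded command only when requested and cap its length so large classpaths and environments cannot flood the log. The class name and the truncation approach are illustrative assumptions, not what any posted patch does:

```java
import java.util.List;

public class LaunchCommandLogger {
    // Join the launch command into one log-friendly line, truncating past
    // maxChars so a huge classpath/environment cannot inundate the log.
    static String formatForLog(List<String> command, int maxChars) {
        String joined = String.join(" ", command);
        if (joined.length() <= maxChars) {
            return joined;
        }
        return joined.substring(0, maxChars) + "...(truncated)";
    }
}
```

A caller would typically guard this with a debug-level check (e.g. `if (LOG.isDebugEnabled())`) so the formatting cost and the log volume are only paid when someone is actually diagnosing a launch failure.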
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569342#comment-14569342 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2162 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2162/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569329#comment-14569329 ] Sidharta Seethana commented on YARN-2194: - To clarify, my comment was with respect to this line in the description: {{The comma in the controller name leads to container launch failure.}} I believe switching separators or encoding arguments in some way is a better approach than requiring symlinks or transforming "cpu,cpuacct" to "cpu" as the controller name. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569322#comment-14569322 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #214 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/214/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3760) Log aggregation failures
[ https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569289#comment-14569289 ] Daryn Sharp commented on YARN-3760: --- Cancelled tokens trigger the retry proxy bug. > Log aggregation failures > - > > Key: YARN-3760 > URL: https://issues.apache.org/jira/browse/YARN-3760 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Daryn Sharp >Priority: Critical > > The aggregated log file does not appear to be properly closed when writes > fail. This leaves a lease renewer active in the NM that spams the NN with > lease renewals. If the token is marked not to be cancelled, the renewals > appear to continue until the token expires. If the token is cancelled, the > periodic renew spam turns into a flood of failed connections until the lease > renewer gives up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569278#comment-14569278 ] Devaraj K commented on YARN-41: --- Thanks a lot [~djp] for your review and comments, I really appreciate your help on reviewing the patch. > The RM should handle the graceful shutdown of the NM. > - > > Key: YARN-41 > URL: https://issues.apache.org/jira/browse/YARN-41 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Ravi Teja Ch N V >Assignee: Devaraj K > Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, > MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, > YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, > YARN-41-8.patch, YARN-41.patch > > > Instead of waiting for the NM expiry, RM should remove and handle the NM, > which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3760) Log aggregation failures
Daryn Sharp created YARN-3760: - Summary: Log aggregation failures Key: YARN-3760 URL: https://issues.apache.org/jira/browse/YARN-3760 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Daryn Sharp Priority: Critical The aggregated log file does not appear to be properly closed when writes fail. This leaves a lease renewer active in the NM that spams the NN with lease renewals. If the token is marked not to be cancelled, the renewals appear to continue until the token expires. If the token is cancelled, the periodic renew spam turns into a flood of failed connections until the lease renewer gives up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569267#comment-14569267 ] Karthik Kambatla commented on YARN-3753: Fix looks reasonable to me. > RM failed to come up with "java.io.IOException: Wait for ZKClient creation > timed out" > - > > Key: YARN-3753 > URL: https://issues.apache.org/jira/browse/YARN-3753 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch > > > RM failed to come up with the following error while submitting an mapreduce > job. > {code:title=RM log} > 015-05-30 03:40:12,190 ERROR recovery.RMStateStore > (RMStateStore.java:transition(179)) - Error storing app: > application_1432956515242_0006 > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager > (ResourceManager.java:handle(750)) - Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. 
Cause: > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) >
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569262#comment-14569262 ] Karthik Kambatla commented on YARN-2962: YARN-3643 should help alleviate most of the issues users face. This JIRA could be targeted only at trunk, without worrying about rolling upgrades. > ZKRMStateStore: Limit the number of znodes under a znode > > > Key: YARN-2962 > URL: https://issues.apache.org/jira/browse/YARN-2962 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch > > > We ran into this issue where we were hitting the default ZK server message > size configs, primarily because the message had too many znodes even though > individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
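One way to bound the child count of a single znode is to hash each application's sequence number into a fixed set of intermediate "bucket" znodes, so no single getChildren response grows without limit. This is only a sketch of the idea; the actual YARN-2962 patch may lay out the hierarchy differently, and the path names here are invented.

```java
public class ZnodeBuckets {
    /**
     * Hypothetical bucketing: route application_<cluster>_<seq> under one of
     * a fixed number of intermediate znodes, keyed on the sequence number,
     * so no single parent accumulates an unbounded child list (whose
     * getChildren response could exceed the ZK server's message-size limit).
     */
    static String bucketPath(String root, String appId, int buckets) {
        int seq = Integer.parseInt(appId.substring(appId.lastIndexOf('_') + 1));
        return root + "/bucket_" + (seq % buckets) + "/" + appId;
    }
}
```

A store using this layout lists one bucket at a time, trading a slightly deeper tree for bounded response sizes.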
[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3754: --- Target Version/s: 2.8.0 (was: 2.7.1) > Race condition when the NodeManager is shutting down and container is launched > -- > > Key: YARN-3754 > URL: https://issues.apache.org/jira/browse/YARN-3754 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Sunil G >Priority: Critical > > Container is launched and returned to ContainerImpl > NodeManager closed the DB connection, which results in > {{org.iq80.leveldb.DBException: Closed}}. > *Attaching the exception trace* > {code} > 2015-05-30 02:11:49,122 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Unable to update state store diagnostics for > container_e310_1432817693365_3338_01_02 > java.io.IOException: org.iq80.leveldb.DBException: Closed > at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.iq80.leveldb.DBException: Closed > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) > at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) > ... 15 more > {code} > We can add a check for whether the DB is closed while we move the container from the ACQUIRED > state, as per the discussion in YARN-3585. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
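The check suggested in the description — guard the state-store write on whether the DB has been closed — can be sketched as below. The class and method names mirror the real NM state store only loosely; this is an illustrative guard, not the actual patch.

```java
public class SafeStateStore {
    // Flipped by the NM's shutdown path before the leveldb handle is closed.
    static volatile boolean closed = false;

    /**
     * Guard sketch: skip the diagnostics write instead of letting the
     * underlying leveldb put() throw DBException: Closed during shutdown.
     * Returns whether the update was actually stored.
     */
    static boolean storeContainerDiagnostics(String containerId, String diagnostics) {
        if (closed) {
            return false;    // NM is shutting down; dropping this update is harmless
        }
        // db.put(diagnosticsKey(containerId), diagnostics.getBytes()) would go here
        return true;
    }
}
```

The launch thread racing with shutdown then degrades to a silently skipped update rather than an IOException bubbling out of the container state machine.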
[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3754: --- Priority: Critical (was: Major) Target Version/s: 2.7.1 > Race condition when the NodeManager is shutting down and container is launched > -- > > Key: YARN-3754 > URL: https://issues.apache.org/jira/browse/YARN-3754 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Sunil G >Priority: Critical > > Container is launched and returned to ContainerImpl > NodeManager closed the DB connection, which results in > {{org.iq80.leveldb.DBException: Closed}}. > *Attaching the exception trace* > {code} > 2015-05-30 02:11:49,122 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Unable to update state store diagnostics for > container_e310_1432817693365_3338_01_02 > java.io.IOException: org.iq80.leveldb.DBException: Closed > at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.iq80.leveldb.DBException: Closed > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) > at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) > ... 15 more > {code} > We can add a check for whether the DB is closed while we move the container from the ACQUIRED > state, as per the discussion in YARN-3585. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569237#comment-14569237 ] Jason Lowe commented on YARN-3758: -- First off, one should never set the heap size and the container size to the same value. The container size needs to be big enough to hold the entire process, not just the heap, so it needs to also consider the overhead of the JVM itself and any off-heap usage (e.g.: JVM code, data, thread stacks, shared libs, off-heap allocations, etc.). If you set the heap size to the same size as the container, then when the heap fills up, the process overall will be bigger than the heap size and YARN will kill the container. A couple of things to check: - Does the job configuration show that it is indeed asking for only 256 MB containers for tasks? Check the job configuration link for the job on the job history server, or the configuration link in the AM's UI while the job is running. - Check the RM logs to verify what minimum allocation size it is loading from the configs and what request size it is allocating per task. > The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not > working in container > > > Key: YARN-3758 > URL: https://issues.apache.org/jira/browse/YARN-3758 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > The first cluster is 5 nodes, default 1 application queue, Capacity scheduler, 8G > physical memory each node > The second cluster is 10 nodes, 2 application queues, fair-scheduler, 230G > physical memory each node > Whenever a mapreduce job is running, I want the resourcemanager to set the > minimum memory 256m for the container > So I changed the configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > 
mapreduce.reduce.memory.mb : 256 > In the first cluster, whenever a mapreduce job is running, I can see used memory > 256m in the web console ( http://installedIP:8088/cluster/nodes ) > But in the second cluster, whenever a mapreduce job is running, I can see used > memory 1024m in the web console ( http://installedIP:8088/cluster/nodes ) > I know the default memory value is 1024m, so if the memory setting is not changed, > the default value is used. > I have been testing for two weeks, but I don't know why the minimum memory > setting is not working in the second cluster > Why does this difference happen? > Am I setting the configuration wrong, or is there a bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
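Jason's sizing advice — the container request must cover the heap plus JVM overhead — can be written as a small rule of thumb. This is not an official Hadoop formula; the 25% fraction and 128 MB floor are illustrative defaults of the kind deployments commonly pick.

```java
public class ContainerSizing {
    /**
     * Rule-of-thumb sketch: a container must fit the heap plus the JVM's
     * own footprint (code, data, thread stacks, shared libs, off-heap
     * allocations). Overhead is the larger of a fixed floor and a fraction
     * of the heap; the specific numbers are assumptions, not Hadoop defaults.
     */
    static int containerMb(int heapMb, int overheadFloorMb, double overheadFraction) {
        int overhead = Math.max(overheadFloorMb, (int) (heapMb * overheadFraction));
        return heapMb + overhead;
    }
}
```

By this rule a 256 MB heap (-Xmx256m) wants a container request noticeably larger than 256 MB, which is exactly why setting mapreduce.map.memory.mb equal to the heap size invites container kills.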
[jira] [Updated] (YARN-3603) Application Attempts page confusing
[ https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3603: -- Attachment: ahs1.png > Application Attempts page confusing > --- > > Key: YARN-3603 > URL: https://issues.apache.org/jira/browse/YARN-3603 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.8.0 >Reporter: Thomas Graves >Assignee: Sunil G > Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png > > > The application attempts page > (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) > is a bit confusing on what is going on. I think the table of containers > there is for only Running containers and when the app is completed or killed > its empty. The table should have a label on it stating so. > Also the "AM Container" field is a link when running but not when its killed. > That might be confusing. > There is no link to the logs in this page but there is in the app attempt > table when looking at http:// > rm:8088/cluster/app/application_1431101480046_0003 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3603) Application Attempts page confusing
[ https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3603: -- Attachment: 0002-YARN-3603.patch Attaching an updated version of the patch, along with screenshots of the UI. [~tgraves] Could you please take a look at this? Thank you. > Application Attempts page confusing > --- > > Key: YARN-3603 > URL: https://issues.apache.org/jira/browse/YARN-3603 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.8.0 >Reporter: Thomas Graves >Assignee: Sunil G > Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png > > > The application attempts page > (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) > is a bit confusing on what is going on. I think the table of containers > there is for only Running containers and when the app is completed or killed > its empty. The table should have a label on it stating so. > Also the "AM Container" field is a link when running but not when its killed. > That might be confusing. > There is no link to the logs in this page but there is in the app attempt > table when looking at http:// > rm:8088/cluster/app/application_1431101480046_0003 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569182#comment-14569182 ] Junping Du commented on YARN-41: Thanks [~devaraj.k] for updating the patch and addressing the previous comments! Latest patch LGTM. +1. Will commit it tomorrow if there are no further comments on the code from other reviewers. In addition, given that the patch introduces a new SHUTDOWN category on NodeState, the UI, and cluster metrics: although it doesn't break any public APIs, we should mark this JIRA as incompatible, since its behavior in the UI, CLI, and metrics differs from previous releases (to notify users and third-party management & monitoring software). In general, I think it should be fine to keep the plan to include this patch in 2.x releases. However, please comment here to let us know if you have any concerns. > The RM should handle the graceful shutdown of the NM. > - > > Key: YARN-41 > URL: https://issues.apache.org/jira/browse/YARN-41 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Ravi Teja Ch N V >Assignee: Devaraj K > Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, > MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, > YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, > YARN-41-8.patch, YARN-41.patch > > > Instead of waiting for the NM expiry, RM should remove and handle the NM, > which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569191#comment-14569191 ] Hudson commented on YARN-1462: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #205 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/205/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569171#comment-14569171 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2144 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2144/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568947#comment-14568947 ] Junping Du commented on YARN-41: bq. Junping Du I have updated the patch with review comments. Can you have a look into this? Sorry for being late on this as taking travel last week. I will review your latest patch today. > The RM should handle the graceful shutdown of the NM. > - > > Key: YARN-41 > URL: https://issues.apache.org/jira/browse/YARN-41 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Ravi Teja Ch N V >Assignee: Devaraj K > Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, > MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, > YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, > YARN-41-8.patch, YARN-41.patch > > > Instead of waiting for the NM expiry, RM should remove and handle the NM, > which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568926#comment-14568926 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Yarn-trunk #946 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/946/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568920#comment-14568920 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #216 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/216/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt > AHS API and other AHS changes to handle tags for completed MR jobs > -- > > Key: YARN-1462 > URL: https://issues.apache.org/jira/browse/YARN-1462 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1462-branch-2.7-1.2.patch, > YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, > YARN-1462.3.patch > > > AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568913#comment-14568913 ] Hadoop QA commented on YARN-3733: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 6s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 54s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 33s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. 
| | | | 40m 10s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736802/0001-YARN-3733.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8166/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8166/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8166/console | This message was automatically generated. > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3759) Include command line, localization info and env vars on AM launch failure
Steve Loughran created YARN-3759: Summary: Include command line, localization info and env vars on AM launch failure Key: YARN-3759 URL: https://issues.apache.org/jira/browse/YARN-3759 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Steve Loughran Priority: Minor While trying to diagnose AM launch failures, it's important to be able to get at the final, expanded {{CLASSPATH}} and other env variables. We don't get that today: you can log the unexpanded values on the client, and tweak NM ContainerExecutor log levels to DEBUG to get some of this, but you don't get it in the task logs, and tuning the NM log level isn't viable on a large, busy cluster. Launch failures should include some env specifics: # list of env vars (ideally, full getenv values), with some stripping of "sensitive" options (I'm thinking AWS env vars here) # command line # path localisations These can go in the task logs; we don't need to include them in the application report. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
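The "stripping of sensitive options" step above can be sketched as a redaction pass over the environment before it is written to the task logs. The set of name patterns treated as sensitive here (AWS_*, *SECRET*, *TOKEN*, *PASSWORD*) is an illustrative assumption, not a list from the issue or from YARN.

```java
import java.util.Map;
import java.util.TreeMap;

public class EnvDumper {
    /**
     * Copy the env map, replacing values whose names look sensitive with a
     * placeholder. The AWS_* case comes from the issue description; the other
     * patterns are assumed examples. Sorted output keeps the log diffable.
     */
    static Map<String, String> redact(Map<String, String> env) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : env.entrySet()) {
            String k = e.getKey().toUpperCase();
            boolean sensitive = k.startsWith("AWS_") || k.contains("SECRET")
                    || k.contains("TOKEN") || k.contains("PASSWORD");
            out.put(e.getKey(), sensitive ? "<redacted>" : e.getValue());
        }
        return out;
    }
}
```

On launch failure, the NM would dump redact(container environment) alongside the command line and localization list into the task logs.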
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568880#comment-14568880 ] Sunil G commented on YARN-3733: --- Hi [~rohithsharma] Thanks for the detailed scenario. Scenario 4 is also possible, correct? clusterResource <0,0>: lhs <2,2> and rhs <3,2>. Currently getResourceAsValue gives back the max ratio of mem/vcores if dominant, else it gives the min ratio. If clusterResource is 0, then could we directly send the max of mem/vcores if dominant, and the min in the other case? This has to be made a better algorithm when more resource types come in; it is not completely perfect, as we treat memory and vcores leniently. Please share your thoughts. > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
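The fallback Sunil suggests — when clusterResource is zero and ratios are undefined, compare by the raw dominant (max) component, then the minor (min) one — can be sketched as below. This is one reading of the proposal, not the DominantResourceCalculator implementation; memory and vcores are deliberately treated interchangeably, which is the leniency the comment acknowledges.

```java
public class DominantCompare {
    /**
     * Sketch of a zero-cluster-resource fallback: with no cluster resource to
     * normalize against, compare the larger raw components first (the
     * "dominant" share), breaking ties on the smaller components.
     */
    static int compare(long lhsMem, long lhsVcores, long rhsMem, long rhsVcores) {
        long lMax = Math.max(lhsMem, lhsVcores);
        long rMax = Math.max(rhsMem, rhsVcores);
        if (lMax != rMax) {
            return Long.compare(lMax, rMax);
        }
        return Long.compare(Math.min(lhsMem, lhsVcores), Math.min(rhsMem, rhsVcores));
    }
}
```

For scenario 4, lhs <2,2> vs rhs <3,2>, this orders lhs before rhs instead of reporting them equal, which is the behavior the bug is about.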
[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: 0001-YARN-3733.patch The updated patch fixes the 2nd and 3rd scenarios in the above table (the scenarios of this issue) and refactors the test code. For an overall solution that also handles input combinations like the 4th and 5th rows of the table, we need to explore further how to define the fraction and how to decide which resource is dominant. Any suggestions on this? > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
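The fallback discussed in the comments above (when clusterResource is <0,0>, compare resources by their dominant component directly instead of by ratios, which would divide by zero) can be sketched in a small, self-contained way. Note this is only an illustration of the idea: `Resource` and `ZeroClusterCompare` are toy stand-ins, not Hadoop's actual `DominantResourceCalculator` API.

```java
// Toy sketch of the zero-cluster-resource fallback: compare by the larger
// component first (a stand-in for "dominant"), then by the smaller one.
// For lhs <2,2> vs rhs <3,2> this yields lhs < rhs, as scenario 4 expects.
final class Resource {
    final long memory;
    final long vcores;

    Resource(long memory, long vcores) {
        this.memory = memory;
        this.vcores = vcores;
    }
}

final class ZeroClusterCompare {
    static int compare(Resource lhs, Resource rhs) {
        long lhsMax = Math.max(lhs.memory, lhs.vcores);
        long rhsMax = Math.max(rhs.memory, rhs.vcores);
        if (lhsMax != rhsMax) {
            return Long.compare(lhsMax, rhsMax);  // dominant components differ
        }
        long lhsMin = Math.min(lhs.memory, lhs.vcores);
        long rhsMin = Math.min(rhs.memory, rhs.vcores);
        return Long.compare(lhsMin, rhsMin);      // tie-break on the other one
    }
}
```

As the comment notes, this treats memory and vcores leniently (they are not weighted against each other), so it is a stopgap rather than a general multi-resource solution.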
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568853#comment-14568853 ] Brahma Reddy Battula commented on YARN-3170: Updated the patch. Kindly review! > YARN architecture document needs updating > - > > Key: YARN-3170 > URL: https://issues.apache.org/jira/browse/YARN-3170 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Reporter: Allen Wittenauer >Assignee: Brahma Reddy Battula > Attachments: YARN-3170-002.patch, YARN-3170-003.patch, > YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, > YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, > YARN-3170-010.patch, YARN-3170.patch > > > The marketing paragraph at the top, "NextGen MapReduce", etc are all > marketing rather than actual descriptions. It also needs some general > updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568834#comment-14568834 ] Lavkesh Lahngir commented on YARN-3591: --- [~zxu]: Can we get away without storing into the NM state store? The other changes seem to be okay. It's not a big change in terms of code, but adding this to the NM state could be debatable. [~vvasudev]: Thoughts? > Resource Localisation on a bad disk causes subsequent containers failure > - > > Key: YARN-3591 > URL: https://issues.apache.org/jira/browse/YARN-3591 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Lavkesh Lahngir >Assignee: Lavkesh Lahngir > Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, > YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch > > > It happens when a resource is localised on the disk, after localising that > disk has gone bad. NM keeps paths for localised resources in memory. At the > time of resource request isResourcePresent(rsrc) will be called which calls > file.exists() on the localised path. > In some cases when the disk has gone bad, inodes are still cached and > file.exists() returns true. But at the time of reading, the file will not open. > Note: file.exists() actually calls stat64 natively which returns true because > it was able to find inode information from the OS. > A proposal is to call file.list() on the parent path of the resource, which > will call open() natively. If the disk is good it should return an array of > paths with length at-least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
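The proposal quoted in the issue (list the parent directory instead of trusting a cached `stat`) can be sketched with plain `java.io.File`. This is only an illustration of the technique, not the NM's actual `isResourcePresent` implementation; the class and method names are made up here.

```java
import java.io.File;

// Sketch of the proposed check: File#exists() stats a possibly-cached inode,
// whereas File#list() on the parent forces an open() of the directory on the
// underlying disk and returns null if that open fails.
final class ResourceCheck {
    static boolean isResourcePresent(File localizedPath) {
        File parent = localizedPath.getParentFile();
        if (parent == null) {
            return localizedPath.exists();   // no parent to list; fall back
        }
        String[] children = parent.list();   // null => open() failed (bad disk)
        if (children == null) {
            return false;
        }
        for (String name : children) {
            if (name.equals(localizedPath.getName())) {
                return true;
            }
        }
        return false;
    }
}
```

On a healthy disk this behaves like `exists()`; on a disk where only stale inode metadata survives, the `list()` call fails and the resource is treated as missing, which is exactly the distinction the issue is after.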
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568801#comment-14568801 ] Hadoop QA commented on YARN-3755: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 45s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 33s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 36s | The applied patch generated 1 new checkstyle issues (total was 58, now 58). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 7s | Tests passed in hadoop-yarn-server-nodemanager. 
| | | | 43m 31s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736781/YARN-3755-2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8165/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8165/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8165/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8165/console | This message was automatically generated. > Log the command of launching containers > --- > > Key: YARN-3755 > URL: https://issues.apache.org/jira/browse/YARN-3755 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: YARN-3755-1.patch, YARN-3755-2.patch > > > In the resource manager log, yarn would log the command for launching AM, > this is very useful. But there's no such log in the NN log for launching > containers. It would be difficult to diagnose when containers fails to launch > due to some issue in the commands. Although user can look at the commands in > the container launch script file, this is an internal things of yarn, usually > user don't know that. In user's perspective, they only know what commands > they specify when building yarn application. 
> {code} > 2015-06-01 16:06:42,245 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command > to launch container container_1433145984561_0001_01_01 : > $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true > -Dhadoop.metrics.log.level=WARN -Xmx1024m > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA > -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster > 1>/stdout 2>/stderr > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
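The improvement asks the NM to emit, for every container, the same kind of line the RM's AMLauncher already logs for AMs (quoted in the {code} block above). A minimal sketch of the shape of such a log call follows; the class name, method name, and message format are illustrative, not the actual ContainerLaunch code or the attached patch.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: build and emit a launch-command log line on the NM side,
// mirroring the RM-side "Command to launch container ..." message.
final class LaunchCommandLog {
    static String logLaunchCommand(String containerId, List<String> commands) {
        String line = "Command to launch container " + containerId + " : "
            + String.join(" ", commands);
        System.out.println("INFO ContainerLaunch: " + line);  // stand-in logger
        return line;
    }
}
```

Logging the command verbatim (unexpanded, as the launch script will see it) is what makes failures diagnosable without digging into the container launch script, which is the pain point the description raises.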
[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568797#comment-14568797 ] Hadoop QA commented on YARN-3753: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 53s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 48s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 27s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 6s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 88m 7s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736776/YARN-3753.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8164/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8164/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8164/console | This message was automatically generated. > RM failed to come up with "java.io.IOException: Wait for ZKClient creation > timed out" > - > > Key: YARN-3753 > URL: https://issues.apache.org/jira/browse/YARN-3753 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch > > > RM failed to come up with the following error while submitting an mapreduce > job. 
> {code:title=RM log} > 015-05-30 03:40:12,190 ERROR recovery.RMStateStore > (RMStateStore.java:transition(179)) - Error storing app: > application_1432956515242_0006 > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.ha
[jira] [Resolved] (YARN-3757) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] skrho resolved YARN-3757. - Resolution: Duplicate > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working in container > > > Key: YARN-3757 > URL: https://issues.apache.org/jira/browse/YARN-3757 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: Hadoop 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 node , default 1 application queue, 8G Physical memory > each node > Second cluster is 10 node, 2 application queuey, 230G Physical memory each > node > Wherever a mapreduce job is running, I want resourcemanager is to set the > minimum memory 256m to container > So I was changing configuration in yarn-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In First cluster whenever a mapreduce job is running , I can see used memory > 256m in web console( http://installedIP:8088/cluster/nodes ) > But In Second cluster whenever a mapreduce job is running , I can see used > memory 1024m in web console( http://installedIP:8088/cluster/nodes ) > I know default memory value is 1024m, so if there is not changing memory > setting, the default value is working. > I have been testing for two weeks, but I don't know why mimimum memory > setting is not working in second cluster > Why this difference is happened? > Am I wrong setting configuration? > or Is there bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3756) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] skrho resolved YARN-3756. - Resolution: Duplicate > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working in container > > > Key: YARN-3756 > URL: https://issues.apache.org/jira/browse/YARN-3756 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: hadoop 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 node , default 1 application queue, 8G Physical memory > each node > Second cluster is 10 node, 2 application queuey, 230G Physical memory each > node > Wherever a mapreduce job is running, I want resourcemanager is to set the > minimum memory 256m to container > So I was changing configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In First cluster whenever a mapreduce job is running , I can see used memory > 256m in web console( http://installedIP:8088/cluster/nodes ) > But In Second cluster whenever a mapreduce job is running , I can see used > memory 1024m in web console( http://installedIP:8088/cluster/nodes ) > I know default memory value is 1024m, so if there is not changing memory > setting, the default value is working. > I have been testing for two weeks, but I don't know why mimimum memory > setting is not working in second cluster > Why this difference is happened? > Am I wrong setting configuration? > or Is there bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568790#comment-14568790 ] Naganarasimha G R commented on YARN-3758: - YARN-3756 and YARN-3757 are the same as this issue! Can you close them? > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working in container > > > Key: YARN-3758 > URL: https://issues.apache.org/jira/browse/YARN-3758 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G > Physical memory each node > Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G > Physical memory each node > Wherever a mapreduce job is running, I want resourcemanager is to set the > minimum memory 256m to container > So I was changing configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In First cluster whenever a mapreduce job is running , I can see used memory > 256m in web console( http://installedIP:8088/cluster/nodes ) > But In Second cluster whenever a mapreduce job is running , I can see used > memory 1024m in web console( http://installedIP:8088/cluster/nodes ) > I know default memory value is 1024m, so if there is not changing memory > setting, the default value is working. > I have been testing for two weeks, but I don't know why mimimum memory > setting is not working in second cluster > Why this difference is happened? > Am I wrong setting configuration? > or Is there bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
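For reference, the properties the reporter lists are split between yarn-site.xml (read by the RM) and mapred-site.xml (read by the MR client); a sketch of the entries with the report's values:

```xml
<!-- yarn-site.xml: smallest allocation the RM will hand out -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>

<!-- mapred-site.xml: per-task container requests -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
```

Since the second cluster runs the fair scheduler, it is also worth checking `yarn.scheduler.increment-allocation-mb` (default 1024), which the fair scheduler uses to round container sizes up and the capacity scheduler does not; that could account for the 1024 MB observed there.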
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568789#comment-14568789 ] Varun Saxena commented on YARN-2962: I was waiting for input from [~vinodkv] and [~asuresh] so that we reach a common understanding on the backward compatibility part. Anyway, in the coming week I plan to upload a patch implementing one of the approaches discussed. > ZKRMStateStore: Limit the number of znodes under a znode > > > Key: YARN-2962 > URL: https://issues.apache.org/jira/browse/YARN-2962 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch > > > We ran into this issue where we were hitting the default ZK server message > size configs, primarily because the message had too many znodes even though > individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joep Rottinghuis updated YARN-3706: --- Attachment: YARN-3726-YARN-2928.004.patch YARN-3726-YARN-2928.004.patch : - fixed bug in cleanse (found thanks to unit test) - fixed value separator (was ! instead of ?). - Added readResult and readResults to EntityColumnPrefix (still need to add signature in interface). - Added initial unit test for TimeLineWriterUtils - Added relationship checking to TestTimelineWriterImpl > Generalize native HBase writer for additional tables > > > Key: YARN-3706 > URL: https://issues.apache.org/jira/browse/YARN-3706 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Joep Rottinghuis >Assignee: Joep Rottinghuis >Priority: Minor > Attachments: YARN-3706-YARN-2928.001.patch, > YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, > YARN-3726-YARN-2928.004.patch > > > When reviewing YARN-3411 we noticed that we could change the class hierarchy > a little in order to accommodate additional tables easily. > In order to get ready for benchmark testing we left the original layout in > place, as performance would not be impacted by the code hierarchy. > Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] skrho updated YARN-3758: Description: Hello there~~ I have 2 clusters First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G Physical memory each node Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G Physical memory each node Wherever a mapreduce job is running, I want resourcemanager is to set the minimum memory 256m to container So I was changing configuration in yarn-site.xml & mapred-site.xml yarn.scheduler.minimum-allocation-mb : 256 mapreduce.map.java.opts : -Xms256m mapreduce.reduce.java.opts : -Xms256m mapreduce.map.memory.mb : 256 mapreduce.reduce.memory.mb : 256 In First cluster whenever a mapreduce job is running , I can see used memory 256m in web console( http://installedIP:8088/cluster/nodes ) But In Second cluster whenever a mapreduce job is running , I can see used memory 1024m in web console( http://installedIP:8088/cluster/nodes ) I know default memory value is 1024m, so if there is not changing memory setting, the default value is working. I have been testing for two weeks, but I don't know why mimimum memory setting is not working in second cluster Why this difference is happened? Am I wrong setting configuration? or Is there bug? 
Thank you for reading~~ was: Hello there~~ I have 2 clusters First cluster is 5 node , default 1 application queue, 8G Physical memory each node Second cluster is 10 node, 2 application queuey, 230G Physical memory each node Wherever a mapreduce job is running, I want resourcemanager is to set the minimum memory 256m to container So I was changing configuration in yarn-site.xml & mapred-site.xml yarn.scheduler.minimum-allocation-mb : 256 mapreduce.map.java.opts : -Xms256m mapreduce.reduce.java.opts : -Xms256m mapreduce.map.memory.mb : 256 mapreduce.reduce.memory.mb : 256 In First cluster whenever a mapreduce job is running , I can see used memory 256m in web console( http://installedIP:8088/cluster/nodes ) But In Second cluster whenever a mapreduce job is running , I can see used memory 1024m in web console( http://installedIP:8088/cluster/nodes ) I know default memory value is 1024m, so if there is not changing memory setting, the default value is working. I have been testing for two weeks, but I don't know why mimimum memory setting is not working in second cluster Why this difference is happened? Am I wrong setting configuration? or Is there bug? 
Thank you for reading~~ > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working in container > > > Key: YARN-3758 > URL: https://issues.apache.org/jira/browse/YARN-3758 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G > Physical memory each node > Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G > Physical memory each node > Wherever a mapreduce job is running, I want resourcemanager is to set the > minimum memory 256m to container > So I was changing configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In First cluster whenever a mapreduce job is running , I can see used memory > 256m in web console( http://installedIP:8088/cluster/nodes ) > But In Second cluster whenever a mapreduce job is running , I can see used > memory 1024m in web console( http://installedIP:8088/cluster/nodes ) > I know default memory value is 1024m, so if there is not changing memory > setting, the default value is working. > I have been testing for two weeks, but I don't know why mimimum memory > setting is not working in second cluster > Why this difference is happened? > Am I wrong setting configuration? > or Is there bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated YARN-3755: - Attachment: YARN-3755-2.patch Upload new patch to address the checkstyle issue > Log the command of launching containers > --- > > Key: YARN-3755 > URL: https://issues.apache.org/jira/browse/YARN-3755 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: YARN-3755-1.patch, YARN-3755-2.patch > > > In the resource manager log, yarn would log the command for launching AM, > this is very useful. But there's no such log in the NN log for launching > containers. It would be difficult to diagnose when containers fails to launch > due to some issue in the commands. Although user can look at the commands in > the container launch script file, this is an internal things of yarn, usually > user don't know that. In user's perspective, they only know what commands > they specify when building yarn application. > {code} > 2015-06-01 16:06:42,245 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command > to launch container container_1433145984561_0001_01_01 : > $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true > -Dhadoop.metrics.log.level=WARN -Xmx1024m > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA > -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster > 1>/stdout 2>/stderr > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568735#comment-14568735 ] Hadoop QA commented on YARN-3749: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 20m 3s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 17s | The applied patch generated 1 new checkstyle issues (total was 212, now 213). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 6m 5s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 6m 58s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 60m 34s | Tests failed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 1m 51s | Tests passed in hadoop-yarn-server-tests. 
| | | | 121m 2s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736753/YARN-3749.7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8163/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8163/console | This message was automatically generated. 
> We should make a copy of configuration when init MiniYARNCluster with > multiple RMs > -- > > Key: YARN-3749 > URL: https://issues.apache.org/jira/browse/YARN-3749 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chun Chen >Assignee: Chun Chen > Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, > YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch > > > When I was trying to write a test case for YARN-2674, I found DS client > trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 > when RM failover. But I initially set > yarn.resourcemanager.address.rm1=0.0.0.0:18032, > yarn.resourcemanager.address.rm2=0.0.0.0:28032 After digging, I found it is > in ClientRMService where the value of yarn.resourcemanager.address.rm2 > changed to 0.0.0.0:18032. See the following code in ClientRMService: > {code} > clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, >YarnConfiguration.RM_ADDRESS, > > YarnConfiguration.DEFAULT_RM_ADDRESS, >server.getListenerAddr
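The aliasing bug described above (two RMs in one MiniYARNCluster sharing a single mutable configuration, so rm1's `updateConnectAddr` clobbers the address rm2 later reads) and the proposed fix (copy the configuration per RM) can be illustrated with a toy stand-in. `Config` here is a made-up simplification of Hadoop's `Configuration`; the real class does provide an equivalent `Configuration(Configuration)` copy constructor, which is the essential part of the fix.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for Hadoop's Configuration: a mutable key/value bag
// with a copy constructor for defensive copies.
final class Config {
    private final Map<String, String> props = new HashMap<>();

    Config() { }

    Config(Config other) {                 // defensive copy: the fix
        props.putAll(other.props);
    }

    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}

final class MiniClusterSketch {
    // Each RM gets its own copy of the base config, so setting rm1's
    // address can no longer overwrite what rm2 sees, and vice versa.
    static Config configForRm(Config base, String rmId, String address) {
        Config copy = new Config(base);    // copy, don't share
        copy.set("yarn.resourcemanager.address." + rmId, address);
        return copy;
    }
}
```

Without the copy, both `configForRm` calls would mutate the same object, reproducing the symptom in the report: the client connecting to both rm1 and rm2 at the same 0.0.0.0:18032 address.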
[jira] [Updated] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] skrho updated YARN-3758: Description: Hello there~~ I have 2 clusters. The first cluster has 5 nodes, a single default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node. Whenever a MapReduce job runs, I want the ResourceManager to allocate containers with a minimum memory of 256 MB. So I changed the following settings in yarn-site.xml & mapred-site.xml:
yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256
In the first cluster, whenever a MapReduce job runs, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But in the second cluster, whenever a MapReduce job runs, I see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). I know the default memory value is 1024 MB, so if the memory setting is not changed, the default value applies. I have been testing for two weeks, but I don't know why the minimum memory setting is not taking effect in the second cluster. Why does this difference happen? Am I setting the configuration wrong, or is there a bug?
Thank you for reading~~ was: Hello there~~ I have 2 clusters First cluster is 5 node , default 1 application queue, 8G Physical memory each node Second cluster is 10 node, 2 application queuey, 230G Physical memory each node Wherever a mapreduce job is running, I want resourcemanager is to set the minimum memory 256m to container So I was changing configuration in yarn-site.xml yarn.scheduler.minimum-allocation-mb : 256 mapreduce.map.java.opts : -Xms256m mapreduce.reduce.java.opts : -Xms256m mapreduce.map.memory.mb : 256 mapreduce.reduce.memory.mb : 256 In First cluster whenever a mapreduce job is running , I can see used memory 256m in web console( http://installedIP:8088/cluster/nodes ) But In Second cluster whenever a mapreduce job is running , I can see used memory 1024m in web console( http://installedIP:8088/cluster/nodes ) I know default memory value is 1024m, so if there is not changing memory setting, the default value is working. I have been testing for two weeks, but I don't know why mimimum memory setting is not working in second cluster Why this difference is happened? Am I wrong setting configuration? or Is there bug? 
Thank you for reading~~
> The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not
> working in container
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: skrho
>
> Hello there~~
> I have 2 clusters. The first cluster has 5 nodes, a single default
> application queue, and 8 GB of physical memory per node. The second
> cluster has 10 nodes, 2 application queues, and 230 GB of physical memory
> per node.
> Whenever a MapReduce job runs, I want the ResourceManager to allocate
> containers with a minimum memory of 256 MB. So I changed the following
> settings in yarn-site.xml & mapred-site.xml:
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m
> mapreduce.reduce.java.opts : -Xms256m
> mapreduce.map.memory.mb : 256
> mapreduce.reduce.memory.mb : 256
> In the first cluster, whenever a MapReduce job runs, I can see 256 MB of
> used memory in the web console ( http://installedIP:8088/cluster/nodes ).
> But in the second cluster, whenever a MapReduce job runs, I see 1024 MB of
> used memory in the web console ( http://installedIP:8088/cluster/nodes ).
> I know the default memory value is 1024 MB, so if the memory setting is
> not changed, the default value applies.
> I have been testing for two weeks, but I don't know why the minimum memory
> setting is not taking effect in the second cluster.
> Why does this difference happen?
> Am I setting the configuration wrong, or is there a bug?
> Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
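One thing worth ruling out here, sketched below: the YARN schedulers round every container request up to a multiple of yarn.scheduler.minimum-allocation-mb before granting it. This is a simplified illustration, not the actual Hadoop source (the real rounding lives in the scheduler's ResourceCalculator and also caps requests at the maximum allocation), but it reproduces both observations: a cluster that still shows 1024 MB containers for 256 MB requests is most likely still running with the default minimum of 1024 MB, i.e. the changed yarn-site.xml was never picked up by the second cluster's ResourceManager.

```java
public class NormalizeDemo {
    // Round a memory request up to a multiple of the scheduler's minimum
    // allocation, as the YARN schedulers do when granting containers
    // (simplified; the real code also enforces maximum-allocation-mb).
    static int normalize(int requestedMb, int minAllocMb) {
        int atLeastMin = Math.max(requestedMb, minAllocMb);
        // round up to the next multiple of the minimum allocation
        return ((atLeastMin + minAllocMb - 1) / minAllocMb) * minAllocMb;
    }

    public static void main(String[] args) {
        System.out.println(normalize(256, 256));  // 256  -> the first cluster's behavior
        System.out.println(normalize(256, 1024)); // 1024 -> the second cluster's behavior
    }
}
```

So the 1024 MB containers on the second cluster are consistent with its ResourceManager still using the default minimum allocation, which points at a configuration-distribution problem rather than a scheduler bug.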
[jira] [Created] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
skrho created YARN-3758: --- Summary: The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container Key: YARN-3758 URL: https://issues.apache.org/jira/browse/YARN-3758 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: skrho Hello there~~ I have 2 clusters. The first cluster has 5 nodes, a single default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node. Whenever a MapReduce job runs, I want the ResourceManager to allocate containers with a minimum memory of 256 MB. So I changed the following settings in yarn-site.xml:
yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256
In the first cluster, whenever a MapReduce job runs, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But in the second cluster, whenever a MapReduce job runs, I see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). I know the default memory value is 1024 MB, so if the memory setting is not changed, the default value applies. I have been testing for two weeks, but I don't know why the minimum memory setting is not taking effect in the second cluster. Why does this difference happen? Am I setting the configuration wrong, or is there a bug? Thank you for reading~~
[jira] [Created] (YARN-3757) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
skrho created YARN-3757: --- Summary: The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container Key: YARN-3757 URL: https://issues.apache.org/jira/browse/YARN-3757 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Environment: Hadoop 2.4.0 Reporter: skrho Hello there~~ I have 2 clusters. The first cluster has 5 nodes, a single default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node. Whenever a MapReduce job runs, I want the ResourceManager to allocate containers with a minimum memory of 256 MB. So I changed the following settings in yarn-site.xml:
yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256
In the first cluster, whenever a MapReduce job runs, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But in the second cluster, whenever a MapReduce job runs, I see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). I know the default memory value is 1024 MB, so if the memory setting is not changed, the default value applies. I have been testing for two weeks, but I don't know why the minimum memory setting is not taking effect in the second cluster. Why does this difference happen? Am I setting the configuration wrong, or is there a bug? Thank you for reading~~
[jira] [Created] (YARN-3756) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
skrho created YARN-3756: --- Summary: The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container Key: YARN-3756 URL: https://issues.apache.org/jira/browse/YARN-3756 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Environment: hadoop 2.4.0 Reporter: skrho Hello there~~ I have 2 clusters. The first cluster has 5 nodes, a single default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node. Whenever a MapReduce job runs, I want the ResourceManager to allocate containers with a minimum memory of 256 MB. So I changed the following settings in yarn-site.xml & mapred-site.xml:
yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256
In the first cluster, whenever a MapReduce job runs, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But in the second cluster, whenever a MapReduce job runs, I see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). I know the default memory value is 1024 MB, so if the memory setting is not changed, the default value applies. I have been testing for two weeks, but I don't know why the minimum memory setting is not taking effect in the second cluster. Why does this difference happen? Am I setting the configuration wrong, or is there a bug? Thank you for reading~~
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568700#comment-14568700 ] Jun Xu commented on YARN-2962: -- We suffered from this problem too. It seems this issue has been open for nearly half a year; is there any new progress? > ZKRMStateStore: Limit the number of znodes under a znode > > > Key: YARN-2962 > URL: https://issues.apache.org/jira/browse/YARN-2962 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch > > > We ran into this issue where we were hitting the default ZK server message > size configs, primarily because the message had too many znodes even though > individually they were all small.
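The natural remedy for "too many children under one znode" is to fan applications out across intermediate parent znodes. The sketch below illustrates that bucketing idea only; the path layout, method name, and split size are hypothetical and are not taken from the attached patches. Splitting the trailing digits of the application id off into a child znode bounds the number of children per parent, which keeps any single getChildren() response well under ZooKeeper's default 1 MB jute.maxbuffer limit.

```java
public class ZnodeBucketDemo {
    // Hypothetical layout: keep the last `splitDigits` digits of the app id
    // as a child znode under a parent named by the remaining prefix, so no
    // single parent accumulates an unbounded number of children.
    static String bucketedPath(String root, String appId, int splitDigits) {
        int cut = appId.length() - splitDigits;
        return root + "/" + appId.substring(0, cut) + "/" + appId.substring(cut);
    }

    public static void main(String[] args) {
        // application_1432956515242_0006 and ..._0007 land under one parent
        // bucket that holds at most 10^2 children.
        System.out.println(bucketedPath("/rmstore", "application_1432956515242_0006", 2));
        // -> /rmstore/application_1432956515242_00/06
    }
}
```

With a split of 2 digits each parent holds at most 100 children, so listing any one parent stays small no matter how many applications the RM has stored.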
[jira] [Updated] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3753: -- Attachment: YARN-3753.2.patch > RM failed to come up with "java.io.IOException: Wait for ZKClient creation > timed out" > - > > Key: YARN-3753 > URL: https://issues.apache.org/jira/browse/YARN-3753 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch > > > RM failed to come up with the following error while submitting a MapReduce > job. > {code:title=RM log} > 2015-05-30 03:40:12,190 ERROR recovery.RMStateStore > (RMStateStore.java:transition(179)) - Error storing app: > application_1432956515242_0006 > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager > (ResourceManager.java:handle(750)) - Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. 
Cause: > java.io.IOException: Wait for ZKClient creation timed out > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMa
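The failing frame at the top of both traces, ZKAction.runWithCheck, is a guarded wait for the ZooKeeper client handle. A rough sketch of that pattern follows; it is simplified, the class and method names below are not the actual Hadoop ones, and the real ZKRMStateStore ties its deadline to the ZK session timeout and wraps the wait in retry logic. The caller waits under a lock until a watcher thread publishes the connected client, and raises exactly this IOException when the deadline passes first.

```java
import java.io.IOException;

public class WaitForClientDemo {
    private final Object lock = new Object();
    private Object zkClient; // stands in for an org.apache.zookeeper.ZooKeeper handle

    // Wait up to timeoutMs for the client to be published, mirroring the
    // "Wait for ZKClient creation timed out" failure in the RM log above.
    Object waitForClient(long timeoutMs) throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            while (zkClient == null) { // loop guards against spurious wakeups
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    throw new IOException("Wait for ZKClient creation timed out");
                }
                lock.wait(remaining);
            }
            return zkClient;
        }
    }

    // Called by the connection watcher once the ZK session is established.
    void clientCreated(Object client) {
        synchronized (lock) {
            zkClient = client;
            lock.notifyAll();
        }
    }

    public static void main(String[] args) throws Exception {
        WaitForClientDemo demo = new WaitForClientDemo();
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            demo.clientCreated(new Object()); // watcher fires after 50 ms
        }).start();
        System.out.println(demo.waitForClient(1000) != null); // true
    }
}
```

The RM hitting this path at startup therefore means the connection watcher never fired within the deadline, i.e. the ZK session was not established in time rather than any state-store data being corrupt.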
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568682#comment-14568682 ] Rohith commented on YARN-3733: -- Updated the summary to match the defect. > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: SUSE 11 SP3, 2 NM, 2 RM > one NM - 3 GB, 6 vcores >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RMs and 2 NMs (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing the scheduler minimum allocation to 512 MB > 3. Configure the capacity scheduler with an AM limit of .5 (DominantResourceCalculator is configured) > 4. Submit 30 concurrent tasks > 5. Switch RM > Actual > = > AMs get allocated for 12 jobs and all 12 start running > No other YARN child task is initiated; *all 12 jobs stay in Running state forever* > Expected > === > Only 6 should be running at a time, since the max AM share allocated is .5 (3072 MB)
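The failure mode is easy to see in a condensed sketch of the dominant-share comparison. This is boiled down from the idea behind DominantResourceCalculator, not its actual signature (the real method takes Resource objects and breaks ties on the subordinate resource): each side's dominant share is its resource divided by the cluster total, so while the cluster resource is still empty right after failover, every division yields NaN or Infinity, compare() reports every pair as "equal", and the AM-limit check stops limiting.

```java
public class DominantShareDemo {
    // Condensed dominant-resource comparison: each side's share of a
    // resource is its usage divided by the cluster total, and the larger
    // (dominant) share decides the ordering.
    static int compare(int clusterMem, int clusterVcores,
                       int lhsMem, int lhsVcores,
                       int rhsMem, int rhsVcores) {
        double l = Math.max((double) lhsMem / clusterMem, (double) lhsVcores / clusterVcores);
        double r = Math.max((double) rhsMem / clusterMem, (double) rhsVcores / clusterVcores);
        return Double.compare(l, r);
    }

    public static void main(String[] args) {
        // Healthy cluster: 2048 MB / 2 vcores clearly dominates 1024 MB / 1 vcore.
        System.out.println(compare(6144, 6, 1024, 1, 2048, 2)); // negative: lhs < rhs

        // Empty cluster resource (e.g. before NMs re-register after an RM
        // switch): both dominant shares become Infinity, so the comparison
        // degenerates to 0 ("equal") and limit checks let everything through.
        System.out.println(compare(0, 0, 1024, 1, 2048, 2));    // 0
    }
}
```

That degenerate "everything is equal" result matches the reproduction above: with the comparison neutralized right after the RM switch, 12 AMs are admitted instead of the 6 that a .5 AM limit on 3072 MB should allow.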