[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631400#comment-14631400 ] Hudson commented on YARN-3930: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #248 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/248/]) YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown. (Dian Fu via wangda) (wangda: rev fa2b63ed162410ba05eadf211a1da068351b293a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Fix For: 2.8.0 Attachments: YARN-3930.001.patch When I tested the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception while {{ensureAppendEditlogFile}} is being called, which leaves the edit log output stream unclosed. As a result, the next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we are still the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
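The underlying pattern, closing the edit log stream even when an exception is thrown so that the HDFS lease is released, can be illustrated with a small sketch. This is not the committed YARN-3930 patch; the class, field and method names below are invented for illustration.
{code}
// Illustrative sketch only (not the YARN-3930 patch): always release the append
// stream, and with it the HDFS lease, even if append() or a write fails.
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

class EditLogAppendSketch {
  private static final Log LOG = LogFactory.getLog(EditLogAppendSketch.class);
  private final FileSystem fs;
  private final Path editLogPath;

  EditLogAppendSketch(FileSystem fs, Path editLogPath) {
    this.fs = fs;
    this.editLogPath = editLogPath;
  }

  void appendRecord(byte[] record) throws IOException {
    FSDataOutputStream out = null;
    try {
      out = fs.append(editLogPath);   // may throw, e.g. during lease recovery
      out.write(record);
      out.hflush();
    } finally {
      // Close in finally so an exception never leaves the lease held by this client.
      IOUtils.cleanup(LOG, out);
    }
  }
}
{code}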
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631401#comment-14631401 ] Hudson commented on YARN-3885: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #248 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/248/]) YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level. (Ajith S via wangda) (wangda: rev 3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch In the preemption policy, the piece of code in {{ProportionalCapacityPreemptionPolicy.cloneQueues}} that calculates the {{untouchable}} amount doesn't consider all the children; it considers only the immediate children. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
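The report reads more clearly with a small illustration: the untouchable amount has to be accumulated over a parent queue's whole subtree, not only its immediate children, or hierarchies deeper than two levels are computed incorrectly. The sketch below shows only that recursive-accumulation idea; the {{QueueNode}} type and its fields are invented and are not the {{ProportionalCapacityPreemptionPolicy}} code.
{code}
// Hypothetical sketch: accumulate the untouchable amount over the whole queue subtree.
import java.util.List;

class QueueNode {
  List<QueueNode> children;
  long guaranteed;  // resource guaranteed to this queue
  long used;        // resource currently used by this queue

  long untouchable() {
    if (children == null || children.isEmpty()) {
      // Leaf queue: usage within its guarantee cannot be preempted.
      return Math.min(used, guaranteed);
    }
    long total = 0;
    for (QueueNode child : children) {
      total += child.untouchable();   // recurse so grandchildren are counted too
    }
    return total;
  }
}
{code}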
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nam H. Do updated YARN-2681: Attachment: YARN-2681.005.patch Fixed javadoc warnings. Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Affects Versions: 2.5.1 Environment: Linux Reporter: Nam H. Do Labels: BB2015-05-TBR Fix For: 2.7.0 Attachments: Traffic Control Design.png, YARN-2681.001.patch, YARN-2681.002.patch, YARN-2681.003.patch, YARN-2681.004.patch, YARN-2681.005.patch, YARN-2681.patch To read/write data from HDFS, applications establish TCP/IP connections with the data node. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to apply filters to the appropriate connections. The current cgroups net_cls concept can be applied neither on the node where the container is launched nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by a single process, it is not possible to use net_cls to separate connections from different containers to the data node. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor the TCP/IP connections established by the container handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf Implementation: http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl.pdf http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl_UML_diagram.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
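Task 3 above can be expressed with plain Linux {{tc}} commands: an HTB class plus a {{u32}} filter keyed on the connection's address:port pair. The Java sketch below is one possible way to issue those commands from a helper; the device name, rate value and the helper class itself are assumptions and are not code from the YARN-2681 patches.
{code}
// Hedged sketch only: shell out to Linux tc to cap outgoing DataNode traffic for one
// container connection, identified by the container side's address:port.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TcShaperSketch {
  private static void tc(String... args) throws IOException, InterruptedException {
    List<String> cmd = new ArrayList<>(Arrays.asList("tc"));
    cmd.addAll(Arrays.asList(args));
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("tc failed: " + cmd);
    }
  }

  static void limitConnection(String dev, String containerIp, int containerPort,
                              String rate) throws IOException, InterruptedException {
    // Root HTB qdisc with a default class for unshaped traffic.
    tc("qdisc", "add", "dev", dev, "root", "handle", "1:", "htb", "default", "10");
    // Class that caps the shaped traffic at the requested rate, e.g. "50mbit".
    tc("class", "add", "dev", dev, "parent", "1:", "classid", "1:20",
       "htb", "rate", rate, "ceil", rate);
    // u32 filter: outgoing packets addressed to the container's ip:port go to class 1:20.
    tc("filter", "add", "dev", dev, "protocol", "ip", "parent", "1:0", "prio", "1",
       "u32", "match", "ip", "dst", containerIp + "/32",
       "match", "ip", "dport", String.valueOf(containerPort), "0xffff",
       "flowid", "1:20");
  }
}
{code}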
[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631222#comment-14631222 ] Hadoop QA commented on YARN-2003: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 53s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 10 new or modified test files. | | {color:green}+1{color} | javac | 8m 25s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 21s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 57s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 22s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 8s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | tools/hadoop tests | 0m 53s | Tests passed in hadoop-sls. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. | | {color:red}-1{color} | yarn tests | 52m 26s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 101m 32s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745796/0023-YARN-2003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ee36f4f | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8571/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-sls test log | https://builds.apache.org/job/PreCommit-YARN-Build/8571/artifact/patchprocess/testrun_hadoop-sls.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8571/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8571/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8571/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8571/console | This message was automatically generated. 
Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch, 0023-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it, so that it can later be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
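A minimal sketch of the idea in the description, assuming the priority is simply carried as a field on the app-attempt-added event so the scheduler can read it later; the class below is illustrative and is not the actual {{AppAttemptAddedSchedulerEvent}} change.
{code}
// Illustrative only: carry the submission-time priority on the scheduler event.
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.Priority;

class AppAttemptAddedEventSketch {
  private final ApplicationAttemptId attemptId;
  private final Priority appPriority;   // taken from the ApplicationSubmissionContext

  AppAttemptAddedEventSketch(ApplicationAttemptId attemptId, Priority appPriority) {
    this.attemptId = attemptId;
    this.appPriority = appPriority;
  }

  Priority getApplicationPriority() {
    return appPriority;   // the scheduler reads this when ordering applications
  }
}
{code}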
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631223#comment-14631223 ] Hudson commented on YARN-3535: -- FAILURE: Integrated in Hadoop-trunk-Commit #8179 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8179/]) YARN-3535. Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED (rohithsharma and peng.zhang via asuresh) (Arun Suresh: rev 9b272ccae78918e7d756d84920a9322187d61eed) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/ContainerRescheduledEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/SchedulerEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED --- Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During a rolling update of the NM, the AM's container start on that NM failed, and the job then hung. AM logs are attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
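The recovery idea named in the summary is to hand the resource requests of a killed-but-never-launched container back to the application, so that the scheduler asks for the capacity again. A sketch of that idea follows; the interfaces and method names are stand-ins, not the code added by this commit.
{code}
// Hedged sketch: on the ALLOCATED -> KILLED transition (e.g. via a container-rescheduled
// event), re-add the container's pending ResourceRequests to the application.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

class RescheduleOnKillSketch {
  interface AppAttempt {
    List<ResourceRequest> requestsForContainer(String containerId);
    void recoverResourceRequests(List<ResourceRequest> requests);
  }

  static void containerKilledBeforeLaunch(AppAttempt app, String containerId) {
    List<ResourceRequest> pending = app.requestsForContainer(containerId);
    if (pending != null && !pending.isEmpty()) {
      // Re-add the requests so the scheduler allocates a replacement container.
      app.recoverResourceRequests(pending);
    }
  }
}
{code}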
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631230#comment-14631230 ] Hudson commented on YARN-3885: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #259 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/259/]) YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level. (Ajith S via wangda) (wangda: rev 3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} this piece of code, to calculate {{untoucable}} doesnt consider al the children, it considers only immediate childern -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631229#comment-14631229 ] Hudson commented on YARN-3930: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #259 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/259/]) YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown. (Dian Fu via wangda) (wangda: rev fa2b63ed162410ba05eadf211a1da068351b293a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Fix For: 2.8.0 Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception when calling {{ensureAppendEditlogFile}} because of some reason which causes the edit log output stream isn't closed. This caused that the next time we call {{ensureAppendEditlogFile}}, lease recovery will failed because we are just the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631239#comment-14631239 ] Hudson commented on YARN-3930: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #989 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/989/]) YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown. (Dian Fu via wangda) (wangda: rev fa2b63ed162410ba05eadf211a1da068351b293a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/CHANGES.txt FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Fix For: 2.8.0 Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception when calling {{ensureAppendEditlogFile}} because of some reason which causes the edit log output stream isn't closed. This caused that the next time we call {{ensureAppendEditlogFile}}, lease recovery will failed because we are just the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631240#comment-14631240 ] Hudson commented on YARN-3885: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #989 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/989/]) YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level. (Ajith S via wangda) (wangda: rev 3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} this piece of code, to calculate {{untoucable}} doesnt consider al the children, it considers only immediate childern -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631390#comment-14631390 ] Hudson commented on YARN-3885: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2186 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2186/]) YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level. (Ajith S via wangda) (wangda: rev 3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} this piece of code, to calculate {{untoucable}} doesnt consider al the children, it considers only immediate childern -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631389#comment-14631389 ] Hudson commented on YARN-3930: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2186 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2186/]) YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown. (Dian Fu via wangda) (wangda: rev fa2b63ed162410ba05eadf211a1da068351b293a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/CHANGES.txt FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Fix For: 2.8.0 Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception when calling {{ensureAppendEditlogFile}} because of some reason which causes the edit log output stream isn't closed. This caused that the next time we call {{ensureAppendEditlogFile}}, lease recovery will failed because we are just the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631395#comment-14631395 ] Hadoop QA commented on YARN-3905: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 16s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 8m 31s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 23s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 35s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 7s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. | | | | 40m 44s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745819/YARN-3905.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 9b272cc | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8572/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8572/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8572/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8572/console | This message was automatically generated. Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch, YARN-3905.002.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. 
{noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
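The NPE above comes from converting a container for which no history data was stored. A guard of the following shape avoids the 500 error; this is a hedged sketch with simplified stand-in types, not Eric Payne's actual patch.
{code}
// Hedged sketch only: guard the conversion so a container with no stored history data
// does not blow up the Application History UI with a NullPointerException.
class ContainerReportGuardSketch {
  Object convertToContainerReport(Object containerHistoryData, String user) {
    if (containerHistoryData == null) {
      // Apps run after an RM restart may have no container history; report "missing"
      // to the web layer instead of throwing an NPE (which surfaces as HTTP 500).
      return null;
    }
    return buildReport(containerHistoryData, user);
  }

  private Object buildReport(Object data, String user) {
    // Placeholder for the existing field-by-field conversion logic.
    return data;
  }
}
{code}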
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3905: - Attachment: YARN-3905.002.patch Fixing checkstyle bug. I forgot to remove the now-unused {{ContainerID}} import. Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch, YARN-3905.002.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631298#comment-14631298 ] Hudson commented on YARN-3174: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #256 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/256/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631304#comment-14631304 ] Hudson commented on YARN-3885: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #256 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/256/]) YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level. (Ajith S via wangda) (wangda: rev 3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} this piece of code, to calculate {{untoucable}} doesnt consider al the children, it considers only immediate childern -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631297#comment-14631297 ] Hudson commented on YARN-3805: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #256 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/256/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3805.001.patch, YARN-3805.002.patch With YARN-90, the NodeManager is able to recover the status of a disk that was broken and later fixed, without restarting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631300#comment-14631300 ] Hudson commented on YARN-3535: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #256 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/256/]) YARN-3535. Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED (rohithsharma and peng.zhang via asuresh) (Arun Suresh: rev 9b272ccae78918e7d756d84920a9322187d61eed) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/SchedulerEventType.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/ContainerRescheduledEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED --- Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631302#comment-14631302 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #256 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/256/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
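The behaviour requested here, periodically re-testing failed directories and promoting them back to the good list once they pass, can be sketched as follows. Only {{DiskChecker.checkDir}} is a real Hadoop utility; everything else below is invented for illustration and is not the committed YARN-90 patch.
{code}
// Hedged sketch: re-check previously failed local dirs and reuse them once they pass.
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

class DiskRecheckSketch {
  private final List<String> goodDirs = new ArrayList<>();
  private final List<String> failedDirs = new ArrayList<>();

  // Intended to run on a timer, e.g. every disk-health-check interval.
  void recheckFailedDirs() {
    for (Iterator<String> it = failedDirs.iterator(); it.hasNext();) {
      String dir = it.next();
      try {
        DiskChecker.checkDir(new File(dir));   // verifies existence, dir-ness, permissions
        it.remove();
        goodDirs.add(dir);                     // disk is usable again without an NM restart
      } catch (DiskErrorException e) {
        // Still bad; keep it in the failed list and try again next round.
      }
    }
  }
}
{code}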
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631301#comment-14631301 ] Hudson commented on YARN-3930: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #256 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/256/]) YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown. (Dian Fu via wangda) (wangda: rev fa2b63ed162410ba05eadf211a1da068351b293a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Fix For: 2.8.0 Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception when calling {{ensureAppendEditlogFile}} because of some reason which causes the edit log output stream isn't closed. This caused that the next time we call {{ensureAppendEditlogFile}}, lease recovery will failed because we are just the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631317#comment-14631317 ] Bibin A Chundatt commented on YARN-3938: Hi [~leftnoteasy]. As I understand it, {{labelManager.getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource)}} will return {{0}}; that is the reason it goes wrong. Please correct me if I am wrong. Any thoughts? AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel Key: YARN-3938 URL: https://issues.apache.org/jira/browse/YARN-3938 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: Am limit for subqueue.jpg In the case of a leaf queue, the AM resource calculation is based on {{absoluteCapacityResource}}. Below is the calculation of the absolute capacity in {{LeafQueue#updateAbsoluteCapacityResource()}}: {code} private void updateAbsoluteCapacityResource(Resource clusterResource) { absoluteCapacityResource = Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource), queueCapacities.getAbsoluteCapacity(), minimumAllocation); } {code} If the default partition resource is zero, the AM resource for all leaf queues will be zero. A snapshot is also attached for the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
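The arithmetic described above is easy to reproduce with made-up numbers: when every node carries a label, the DEFAULT (NO_LABEL) partition has zero resources, so the absolute capacity resource and the AM limit derived from it both collapse to zero. The values in this sketch are hypothetical.
{code}
// Worked example with illustrative numbers (not from the JIRA).
class AmLimitArithmeticSketch {
  public static void main(String[] args) {
    long defaultPartitionMemoryMB = 0; // getResourceByLabel(NO_LABEL, clusterResource) when all nodes are labelled
    double absoluteCapacity = 0.5;     // hypothetical leaf-queue absolute capacity
    double maxAmResourcePercent = 0.1; // hypothetical maximum-am-resource-percent

    long absoluteCapacityMemoryMB = (long) (defaultPartitionMemoryMB * absoluteCapacity); // 0
    long amLimitMemoryMB = (long) (absoluteCapacityMemoryMB * maxAmResourcePercent);      // 0

    // With a 0 MB AM limit, no application master can be activated in the queue.
    System.out.println("AM limit (MB) = " + amLimitMemoryMB);
  }
}
{code}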
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631455#comment-14631455 ] Hudson commented on YARN-3930: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2205 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2205/]) YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown. (Dian Fu via wangda) (wangda: rev fa2b63ed162410ba05eadf211a1da068351b293a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/CHANGES.txt FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Fix For: 2.8.0 Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception when calling {{ensureAppendEditlogFile}} because of some reason which causes the edit log output stream isn't closed. This caused that the next time we call {{ensureAppendEditlogFile}}, lease recovery will failed because we are just the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631456#comment-14631456 ] Hudson commented on YARN-3885: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2205 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2205/]) YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level. (Ajith S via wangda) (wangda: rev 3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} this piece of code, to calculate {{untoucable}} doesnt consider al the children, it considers only immediate childern -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631454#comment-14631454 ] Hudson commented on YARN-3535: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2205 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2205/]) YARN-3535. Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED (rohithsharma and peng.zhang via asuresh) (Arun Suresh: rev 9b272ccae78918e7d756d84920a9322187d61eed) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestAbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/ContainerRescheduledEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/SchedulerEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED --- Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630842#comment-14630842 ] Hadoop QA commented on YARN-2306: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 9m 57s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 10m 1s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 28s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 2s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 40s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 52m 5s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 77m 36s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745751/YARN-2306-3.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / ee36f4f | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8567/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8567/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8567/console | This message was automatically generated. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306-3.patch, YARN-2306.patch This only applies to the fair scheduler. The capacity scheduler is OK. When an appAttempt or node is removed, the reservation metrics (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavkesh Lahngir updated YARN-3933: -- Summary: Resources(both core and memory) are being negative (was: Resources(bothe core and memory) are being negative) Resources(both core and memory) are being negative -- Key: YARN-3933 URL: https://issues.apache.org/jira/browse/YARN-3933 Project: Hadoop YARN Issue Type: Bug Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir In our cluster we are seeing available memory and cores going negative. Initial inspection: Scenario no. 1: In the capacity scheduler, the method allocateContainersToNode() checks whether there are excess container reservations for an application that are no longer needed; if so, it calls queue.completedContainer(), which drives the resources negative because those containers were never assigned in the first place. I am still looking through the code. Can somebody suggest how to simulate excess container assignments? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
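To illustrate the arithmetic behind the symptom (this is not the actual CapacityScheduler code path): if {{completedContainer()}} is invoked for a container whose resource was never added to the queue's usage, the subtraction drives the tracked resource below zero. A minimal sketch using the public {{Resource}}/{{Resources}} helpers follows; the guard shown at the end is a hypothetical illustration, not a proposed fix.

{code}
// Minimal sketch: releasing a container that was never charged to the queue
// makes the tracked usage negative.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NegativeUsageDemo {
  public static void main(String[] args) {
    Resource queueUsed = Resource.newInstance(1024, 1);     // 1 GB, 1 core currently in use
    Resource neverAllocated = Resource.newInstance(2048, 2);

    // completedContainer()-style release for a container that was never assigned:
    Resources.subtractFrom(queueUsed, neverAllocated);
    System.out.println(queueUsed);                          // <memory:-1024, vCores:-1>

    // Hypothetical guard: only subtract what was actually charged.
    Resource used = Resource.newInstance(1024, 1);
    if (Resources.fitsIn(neverAllocated, used)) {
      Resources.subtractFrom(used, neverAllocated);
    }
    System.out.println(used);                               // unchanged
  }
}
{code}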
[jira] [Created] (YARN-3936) Add metrics for RMStateStore
Ming Ma created YARN-3936: - Summary: Add metrics for RMStateStore Key: YARN-3936 URL: https://issues.apache.org/jira/browse/YARN-3936 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma It might be useful to collect some metrics w.r.t. RMStateStore such as: * Write latency * The ApplicationStateData size distribution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
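One possible shape for such metrics, using Hadoop's metrics2 library, is sketched below. The metric names, the quantile rollover window, and the idea of calling {{recordWrite()}} from the state store's write path are all assumptions for illustration; nothing here is part of this issue yet.

{code}
// Hypothetical sketch of RMStateStore metrics built on Hadoop metrics2.
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableQuantiles;
import org.apache.hadoop.metrics2.lib.MutableRate;

public class RMStateStoreMetrics {
  private final MetricsRegistry registry = new MetricsRegistry("RMStateStore");

  // Latency of state-store write operations (e.g. storing ApplicationStateData).
  private final MutableRate writeLatency =
      registry.newRate("stateStoreWrite", "RMStateStore write latency (ms)");

  // Distribution of serialized ApplicationStateData sizes, 60-second rollover window.
  private final MutableQuantiles stateSize = registry.newQuantiles(
      "appStateSize", "Serialized ApplicationStateData size", "ops", "bytes", 60);

  // Hypothetical hook the store's write path would call after each write.
  public void recordWrite(long latencyMs, int serializedBytes) {
    writeLatency.add(latencyMs);
    stateSize.add(serializedBytes);
  }
}
{code}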
[jira] [Commented] (YARN-3845) [YARN] YARN status in web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630869#comment-14630869 ] Hadoop QA commented on YARN-3845: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 24s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 53s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 45s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 3 new checkstyle issues (total was 70, now 70). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 51m 10s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 89m 45s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745746/YARN-3845.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ee36f4f | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8569/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8569/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8569/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8569/console | This message was automatically generated. [YARN] YARN status in web ui does not show correctly in IE 11 - Key: YARN-3845 URL: https://issues.apache.org/jira/browse/YARN-3845 Project: Hadoop YARN Issue Type: Bug Reporter: Jagadesh Kiran N Assignee: Mohammad Shahid Khan Priority: Trivial Attachments: IE11_yarn.gif, YARN-3845.patch In IE 11 , the color display is not proper for the scheduler . In other browser it is showing correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3934) Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used
[ https://issues.apache.org/jira/browse/YARN-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630884#comment-14630884 ] Sunil G commented on YARN-3934: --- As of now there are no checks for the size of ApplicationSubmissionContext while processing submitApplication in RM. I feel we can have a check for the size in RMAppManager for this. An upper check with the ZK's max size will be a good solution here. I will check whether we can get the object size from ZK here and will update. Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used -- Key: YARN-3934 URL: https://issues.apache.org/jira/browse/YARN-3934 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Use the following steps to test. 1. Set up ZK as the RM HA store. 2. Submit a job that refers to lots of distributed cache files with long HDFS path, which will cause the app state size to exceed ZK's max object size limit. 3. RM can't write to ZK and exit with the following exception. {noformat} 2015-07-10 22:21:13,002 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:944) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:941) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1083) {noformat} In this case, RM could have rejected the app during submitApplication RPC if the size of ApplicationSubmissionContext is too large. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
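A rough sketch of the kind of guard being discussed above follows. The configuration key is hypothetical, and treating the serialized protobuf size of the submission context as a proxy for what ends up in ZK is an assumption; neither is confirmed by this issue.

{code}
// Hypothetical sketch of rejecting oversized submissions at submitApplication time.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.impl.pb.ApplicationSubmissionContextPBImpl;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class SubmissionSizeCheck {
  // ZK's default jute.maxbuffer is roughly 1 MB; leave headroom for the rest of the app state.
  private static final String LIMIT_KEY =
      "yarn.resourcemanager.submission-context.max-size";   // hypothetical config key
  private static final long DEFAULT_LIMIT = 512 * 1024;

  public static void check(Configuration conf, ApplicationSubmissionContext ctx)
      throws YarnException {
    long limit = conf.getLong(LIMIT_KEY, DEFAULT_LIMIT);
    // Serialized proto size as a proxy for the eventual ZK record size (assumption).
    long size = ((ApplicationSubmissionContextPBImpl) ctx).getProto().getSerializedSize();
    if (size > limit) {
      throw new YarnException("ApplicationSubmissionContext is " + size
          + " bytes, exceeding the configured limit of " + limit
          + " bytes; rejecting the application at submission time.");
    }
  }
}
{code}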
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630860#comment-14630860 ] Rohith Sharma K S commented on YARN-3543: - Discussed with [~xgong] offline; as per the YARN-1462 [comment|https://issues.apache.org/jira/browse/YARN-1462?focusedCommentId=14568189page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14568189] discussion, ApplicationReport should be backward compatible. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3932: --- Attachment: TestResult.jpg SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg Application Resource Report shown wrong when node Label is used. 1.Submit application with NodeLabel 2.Check RM UI for resources used Allocated CPU VCores and Allocated Memory MB is always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3934) Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used
Ming Ma created YARN-3934: - Summary: Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used Key: YARN-3934 URL: https://issues.apache.org/jira/browse/YARN-3934 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Use the following steps to test. 1. Set up ZK as the RM HA store. 2. Submit a job that refers to lots of distributed cache files with long HDFS path, which will cause the app state size to exceed ZK's max object size limit. 3. RM can't write to ZK and exit with the following exception. {noformat} 2015-07-10 22:21:13,002 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:944) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:941) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1083) {noformat} In this case, RM could have rejected the app during submitApplication RPC if the size of ApplicationSubmissionContext is too large. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3932: --- Attachment: 0001-YARN-3932.patch SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3932.patch, ApplicationReport.jpg Application Resource Report shown wrong when node Label is used. 1.Submit application with NodeLabel 2.Check RM UI for resources used Allocated CPU VCores and Allocated Memory MB is always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630867#comment-14630867 ] Varun Saxena commented on YARN-3049: [~zjshen], should cluster ID be mandatory in REST URL ? We can assume it to be belonging to same cluster as where this timeline reader is running and take it from config, if its not supplied by client. Thats how I did it in YARN-3814. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3936) Add metrics for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630878#comment-14630878 ] Sunil G commented on YARN-3936: --- Hi [~mingma] I would like to work on this. Please let me know if you are looking into this. Add metrics for RMStateStore Key: YARN-3936 URL: https://issues.apache.org/jira/browse/YARN-3936 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma It might be useful to collect some metrics w.r.t. RMStateStore such as: * Write latency * The ApplicationStateData size distribution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3935) Support compression for RM HA ApplicationStateData
Ming Ma created YARN-3935: - Summary: Support compression for RM HA ApplicationStateData Key: YARN-3935 URL: https://issues.apache.org/jira/browse/YARN-3935 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma If we use ZK as the RM HA, it is possible for some application state to exceed the max object size imposed by ZK service. We can apply compression before storing the data to ZK. We might want to add the compression functionality at RMStateStore layer so that different store implementations can use it. The design might also want to take care of compatibility issue. After compression is enabled and RM restarts; the older state store should still be loaded properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
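A minimal sketch of the compression idea using plain java.util.zip is below. Relying on the gzip magic bytes to tell compressed records from legacy uncompressed ones is an assumption about how the compatibility concern could be handled, not a decided design, and where this sits (RMStateStore layer vs. individual stores) is left open as in the description.

{code}
// Hypothetical sketch: gzip the serialized state bytes before handing them to the
// store, and fall back to the raw bytes on load for data written before the feature.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StateCompression {
  public static byte[] compress(byte[] raw) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(bos);
    try {
      gz.write(raw);
    } finally {
      gz.close();
    }
    return bos.toByteArray();
  }

  public static byte[] maybeDecompress(byte[] stored) throws IOException {
    // Gzip data starts with the magic bytes 0x1f 0x8b; older, uncompressed records
    // are returned unchanged so RM restart can still load pre-compression state.
    if (stored.length < 2 || (stored[0] & 0xff) != 0x1f || (stored[1] & 0xff) != 0x8b) {
      return stored;
    }
    GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(stored));
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try {
      byte[] buf = new byte[4096];
      int n;
      while ((n = gz.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
    } finally {
      gz.close();
    }
    return bos.toByteArray();
  }
}
{code}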
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630861#comment-14630861 ] Hadoop QA commented on YARN-3535: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 18s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 47s | The applied patch generated 5 new checkstyle issues (total was 337, now 342). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 24s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 51m 21s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 89m 43s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745756/0006-YARN-3535.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ee36f4f | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8568/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8568/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8568/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8568/console | This message was automatically generated. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630877#comment-14630877 ] Bibin A Chundatt commented on YARN-3932: [~leftnoteasy] used {{attemptResourceUsage.getAllUsed()}} already available method. SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3932.patch, ApplicationReport.jpg Application Resource Report shown wrong when node Label is used. 1.Submit application with NodeLabel 2.Check RM UI for resources used Allocated CPU VCores and Allocated Memory MB is always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
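Based on the comment above, the change presumably amounts to swapping the label-unaware getter for {{getAllUsed()}} inside the method quoted in the description. A sketch of what that small change could look like (illustrative only, not the actual patch):

{code}
// Sketch only: aggregate usage across all partitions/labels instead of only the
// default (no-label) partition when building the usage report.
Resource usedResourceClone =
    Resources.clone(attemptResourceUsage.getAllUsed());   // was: getUsed()
Resource reservedResourceClone =
    Resources.clone(attemptResourceUsage.getReserved());
{code}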
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631501#comment-14631501 ] Jonathan Eagles commented on YARN-3905: --- +1. Committing this patch [~eepayne]. Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch, YARN-3905.002.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631530#comment-14631530 ] Hudson commented on YARN-3905: -- FAILURE: Integrated in Hadoop-trunk-Commit #8180 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8180/]) YARN-3905. Application History Server UI NPEs when accessing apps run after RM restart (Eric Payne via jeagles) (jeagles: rev 7faae0e6fe027a3886d9f4e290b6a488a2c55b3a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/CHANGES.txt Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch, YARN-3905.002.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
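The stack trace above points at the AM-container lookup performed while rendering the application page. As a purely illustrative sketch of the kind of defensive handling that turns the 500 into a degraded-but-usable page (not the committed AppBlock change), with {{fetchAmContainerReport}} and {{LOG}} standing in as hypothetical names:

{code}
// Purely illustrative pseudo-fix: guard the AM container lookup so a missing
// report (e.g. history written after an RM restart) does not NPE the whole page.
ContainerReport amContainer;
try {
  amContainer = fetchAmContainerReport(appAttempt);   // hypothetical helper
} catch (Exception e) {
  LOG.error("Failed to read the AM container of the application attempt "
      + appAttempt.getApplicationAttemptId() + ".", e);
  amContainer = null;
}
if (amContainer != null) {
  // render the AM container details; otherwise leave those cells empty
}
{code}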
[jira] [Created] (YARN-3937) Introducing REMOVE_CONTAINER_FROM_PREEMPTION event to notify Scheduler and AM when a container is no longer to be preempted
Sunil G created YARN-3937: - Summary: Introducing REMOVE_CONTAINER_FROM_PREEMPTION event to notify Scheduler and AM when a container is no longer to be preempted Key: YARN-3937 URL: https://issues.apache.org/jira/browse/YARN-3937 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Sunil G Assignee: Sunil G As discussed in YARN-3784, there are scenarios where other applications have released containers or the same application has revoked its resource requests. In these cases, we may no longer need to preempt a container that was marked for preemption earlier. Introduce a new event to remove such containers, if present, from the scheduler's to-be-preempted list, or to inform the AM about such a scenario. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
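A hypothetical sketch of the shape such an event could take is below. The class name follows the proposal, but nothing here comes from an actual patch; in the real ResourceManager this would presumably extend SchedulerEvent with a new REMOVE_CONTAINER_FROM_PREEMPTION value in SchedulerEventType, mirroring how other scheduler events are dispatched.

{code}
// Illustrative only: an event carrying containers that no longer need to be
// preempted, so the scheduler can drop them from its to-be-preempted set and
// the AM can be told the earlier preemption message is void.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class RemoveContainerFromPreemptionEvent {
  private final ApplicationAttemptId appAttemptId;
  private final List<ContainerId> containersNoLongerPreempted;

  public RemoveContainerFromPreemptionEvent(ApplicationAttemptId appAttemptId,
      List<ContainerId> containersNoLongerPreempted) {
    this.appAttemptId = appAttemptId;
    this.containersNoLongerPreempted = containersNoLongerPreempted;
  }

  public ApplicationAttemptId getAppAttemptId() {
    return appAttemptId;
  }

  public List<ContainerId> getContainersNoLongerPreempted() {
    return containersNoLongerPreempted;
  }
}
{code}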
[jira] [Updated] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3938: --- Summary: AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel (was: AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel Key: YARN-3938 URL: https://issues.apache.org/jira/browse/YARN-3938 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: Am limit for subqueue.jpg In case of leaf queue the AM resource calculation is based on {{absoluteCapacityResource}}. Below is the calculation for absolute capacity {{LeafQueue#updateAbsoluteCapacityResource()}} {code} private void updateAbsoluteCapacityResource(Resource clusterResource) { absoluteCapacityResource = Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource), queueCapacities.getAbsoluteCapacity(), minimumAllocation); } {code} If default partition resource is zero for all Leaf queue the resource for AM will be zero Snapshot also attached for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-3535: -- Component/s: resourcemanager fairscheduler capacityscheduler ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-3535: -- Fix Version/s: 2.8.0 ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631205#comment-14631205 ] Arun Suresh commented on YARN-3535: --- +1, Committing this shortly. Thanks to everyone involved. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero
Bibin A Chundatt created YARN-3938: -- Summary: AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero Key: YARN-3938 URL: https://issues.apache.org/jira/browse/YARN-3938 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical In case of leaf queue the AM resource calculation is based on {{absoluteCapacityResource}}. Below is the calculation for absolute capacity {{LeafQueue#updateAbsoluteCapacityResource()}} {code} private void updateAbsoluteCapacityResource(Resource clusterResource) { absoluteCapacityResource = Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource), queueCapacities.getAbsoluteCapacity(), minimumAllocation); } {code} If default partition resource is zero for all Leaf queue the resource for AM will be zero Snapshot also attached for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero
[ https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3938: --- Attachment: Am limit for subqueue.jpg AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero - Key: YARN-3938 URL: https://issues.apache.org/jira/browse/YARN-3938 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: Am limit for subqueue.jpg In case of leaf queue the AM resource calculation is based on {{absoluteCapacityResource}}. Below is the calculation for absolute capacity {{LeafQueue#updateAbsoluteCapacityResource()}} {code} private void updateAbsoluteCapacityResource(Resource clusterResource) { absoluteCapacityResource = Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource), queueCapacities.getAbsoluteCapacity(), minimumAllocation); } {code} If default partition resource is zero for all Leaf queue the resource for AM will be zero Snapshot also attached for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-3535: -- Summary: Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED (was: ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED --- Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2003: -- Attachment: 0023-YARN-2003.patch Thank you [~leftnoteasy] for the comments. Uploading a new patch addressing these. {{compareTo}} is used with priority of containers where lower integer value is highest in priority. Now its used in opposite context. Hence I added a comment there. Kindly review the same. Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch, 0023-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
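For context on the {{compareTo}} remark above: container priorities treat a lower integer value as more important, while application priority as used in this patch treats a higher integer value as more important. A small, self-contained illustration of the two opposite conventions (not the actual YARN classes):

{code}
// Illustrative only: the two opposite orderings a reader can trip over.
import java.util.Arrays;
import java.util.Collections;

public class PriorityOrderingDemo {
  public static void main(String[] args) {
    Integer[] priorities = {1, 5, 3};

    // Container-style: lower integer value means higher priority.
    Integer[] containers = priorities.clone();
    Arrays.sort(containers);
    System.out.println(Arrays.toString(containers));   // [1, 3, 5] -> 1 is served first

    // Application-style (this patch): higher integer value means higher priority.
    Integer[] apps = priorities.clone();
    Arrays.sort(apps, Collections.reverseOrder());
    System.out.println(Arrays.toString(apps));          // [5, 3, 1] -> 5 is served first
  }
}
{code}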
[jira] [Updated] (YARN-3937) Introducing REMOVE_CONTAINER_FROM_PREEMPTION event to notify Scheduler and AM when a container is no longer to be preempted
[ https://issues.apache.org/jira/browse/YARN-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3937: -- Issue Type: Sub-task (was: Bug) Parent: YARN-45 Introducing REMOVE_CONTAINER_FROM_PREEMPTION event to notify Scheduler and AM when a container is no longer to be preempted --- Key: YARN-3937 URL: https://issues.apache.org/jira/browse/YARN-3937 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Sunil G Assignee: Sunil G As discussed in YARN-3784, there are scenarios like few other applications released containers or same application has revoked its resource requests. In these cases, we may not have to preempt a container which would have been marked for preemption earlier. Introduce a new event to remove such containers if present in the to-be-preempted list of scheduler or inform AM about such a scenario. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3453) Fair Scheduler: Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
[ https://issues.apache.org/jira/browse/YARN-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-3453: -- Fix Version/s: 2.8.0 Fair Scheduler: Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing --- Key: YARN-3453 URL: https://issues.apache.org/jira/browse/YARN-3453 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Ashwin Shankar Assignee: Arun Suresh Fix For: 2.8.0 Attachments: YARN-3453.1.patch, YARN-3453.2.patch, YARN-3453.3.patch, YARN-3453.4.patch, YARN-3453.5.patch There are two places in the preemption code flow where DefaultResourceCalculator is used, even in DRF mode. This basically results in more resources getting preempted than needed, and those extra preempted containers aren’t even getting to the “starved” queue since the scheduling logic is based on DRF's calculator. The following are the two places: 1. {code:title=FSLeafQueue.java|borderStyle=solid} private boolean isStarved(Resource share) {code} A queue shouldn’t be marked as “starved” if the dominant resource usage is >= fair/minshare. 2. {code:title=FairScheduler.java|borderStyle=solid} protected Resource resToPreempt(FSLeafQueue sched, long curTime) {code} -- One more thing that I believe needs to change in DRF mode is: during a preemption round, if preempting a few containers satisfies the needs of a resource type, then we should exit that preemption round, since the containers we just preempted should bring the dominant resource usage to min/fair share. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
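To make the dominant-resource point concrete, here is a tiny, self-contained illustration of the difference between a memory-only starvation test and one based on the dominant share. This is plain arithmetic for illustration, not the FairScheduler code; the numbers are invented.

{code}
// Illustrative only: with usage <8 GB, 12 vcores> against a share of <16 GB, 10 vcores>,
// a memory-only test calls the queue starved, while the dominant-share test does not,
// because CPU (the dominant resource) is already above its share.
public class DrfStarvationDemo {
  static boolean starvedByMemoryOnly(long usedMB, long shareMB) {
    return usedMB < shareMB;
  }

  static boolean starvedByDominantShare(long usedMB, long usedVcores,
                                        long shareMB, long shareVcores) {
    double memShare = (double) usedMB / shareMB;
    double cpuShare = (double) usedVcores / shareVcores;
    double dominant = Math.max(memShare, cpuShare);
    return dominant < 1.0;   // starved only if even the dominant share is below the target
  }

  public static void main(String[] args) {
    System.out.println(starvedByMemoryOnly(8192, 16384));             // true
    System.out.println(starvedByDominantShare(8192, 12, 16384, 10));  // false
  }
}
{code}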
[jira] [Commented] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631652#comment-14631652 ] Wangda Tan commented on YARN-3938: -- Hi [~bibinchundatt], Thanks for reporting this issue; this is a known issue with node labels. Possible solutions: # Make {{maxAMResource = queue's-total-guaranteed-resource (sum of the queue's guaranteed resource on all partitions) * maxAmResourcePercent}}. This is straightforward, but it can also lead to too many AMs being launched under a single partition. # Compute maxAMResource per queue per partition. This keeps AM usage across partitions more balanced, but it can also make debugging harder (my application gets stuck because the AMResourceLimit for a partition is violated). I prefer the 1st solution since it's easier to understand and debug. Thoughts? And could I take over this issue if you haven't started yet? AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel Key: YARN-3938 URL: https://issues.apache.org/jira/browse/YARN-3938 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: Am limit for subqueue.jpg In case of leaf queue the AM resource calculation is based on {{absoluteCapacityResource}}. Below is the calculation for absolute capacity {{LeafQueue#updateAbsoluteCapacityResource()}} {code} private void updateAbsoluteCapacityResource(Resource clusterResource) { absoluteCapacityResource = Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource), queueCapacities.getAbsoluteCapacity(), minimumAllocation); } {code} If the default partition resource is zero, the AM resource for all leaf queues will be zero. Snapshot also attached for the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
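A rough sketch of option 1 above follows. How the queue exposes its per-partition guaranteed resource is left abstract; the map parameter here is a hypothetical stand-in for that accessor, and nothing in this sketch is from an actual patch.

{code}
// Hypothetical sketch of option 1: base the AM limit on the queue's guaranteed
// resource summed over every partition it has access to, not just DEFAULT.
import java.util.Map;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class AmLimitSketch {
  // partition name -> queue's guaranteed resource in that partition (assumed input)
  static Resource computeMaxAMResource(Map<String, Resource> guaranteedByPartition,
                                       float maxAmResourcePercent) {
    Resource total = Resource.newInstance(0, 0);
    for (Resource r : guaranteedByPartition.values()) {
      Resources.addTo(total, r);
    }
    return Resources.multiply(total, maxAmResourcePercent);
  }
}
{code}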
[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631623#comment-14631623 ] Wangda Tan commented on YARN-2003: -- Latest patch looks good, [~sunilg], could you take a look at failed tests? Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch, 0023-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631656#comment-14631656 ] Wangda Tan commented on YARN-3932: -- Thanks for update [~bibinchundatt], could you add a test for this to avoid future regression? SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg Application Resource Report shown wrong when node Label is used. 1.Submit application with NodeLabel 2.Check RM UI for resources used Allocated CPU VCores and Allocated Memory MB is always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-3905: -- Fix Version/s: 3.0.0 2.7.2 2.8.0 Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Fix For: 3.0.0, 2.8.0, 2.7.2 Attachments: YARN-3905.001.patch, YARN-3905.002.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631638#comment-14631638 ] Hadoop QA commented on YARN-2681: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 23m 21s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 13 new or modified test files. | | {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 3s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 3m 32s | The applied patch generated 1 new checkstyle issues (total was 221, now 221). | | {color:green}+1{color} | whitespace | 0m 46s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 26s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 8m 30s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | mapreduce tests | 9m 16s | Tests passed in hadoop-mapreduce-client-app. | | {color:green}+1{color} | mapreduce tests | 1m 46s | Tests passed in hadoop-mapreduce-client-core. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 6m 37s | Tests failed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 51m 49s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 129m 16s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestDeletionService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745824/YARN-2681.005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 9b272cc | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8573/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-mapreduce-client-app test log | https://builds.apache.org/job/PreCommit-YARN-Build/8573/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt | | hadoop-mapreduce-client-core test log | https://builds.apache.org/job/PreCommit-YARN-Build/8573/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8573/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8573/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8573/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8573/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8573/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8573/console | This message was automatically generated. Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Affects Versions: 2.5.1 Environment: Linux Reporter: Nam H. Do Labels: BB2015-05-TBR Fix For: 2.7.0 Attachments: Traffic Control Design.png, YARN-2681.001.patch, YARN-2681.002.patch, YARN-2681.003.patch, YARN-2681.004.patch, YARN-2681.005.patch, YARN-2681.patch To read/write data from HDFS on a data node, applications establish TCP/IP connections with the datanode. The HDFS read can be controlled by setting up the Linux Traffic Control (TC) subsystem on the data node to create filters on the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing
[jira] [Commented] (YARN-1645) ContainerManager implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631667#comment-14631667 ] Hadoop QA commented on YARN-1645: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 2s | Findbugs (version ) appears to be broken on YARN-1197. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 18s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 17s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 21s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 12s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 14s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 42m 45s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.TestContainerManager | | | hadoop.yarn.server.nodemanager.TestContainerManagerWithLCE | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745467/YARN-1645-YARN-1197.3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-1197 / 8041fd8 | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8575/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8575/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8575/console | This message was automatically generated. ContainerManager implementation to support container resizing - Key: YARN-1645 URL: https://issues.apache.org/jira/browse/YARN-1645 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1645-YARN-1197.3.patch, YARN-1645.1.patch, YARN-1645.2.patch, yarn-1645.1.patch Implementation of ContainerManager for container resize, including: 1) ContainerManager resize logic 2) Relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631732#comment-14631732 ] Hadoop QA commented on YARN-2003: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 26s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 10 new or modified test files. | | {color:green}+1{color} | javac | 8m 1s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 46s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 49s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 23s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 25s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 3s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | tools/hadoop tests | 0m 23s | Tests failed in hadoop-sls. | | {color:green}+1{color} | yarn tests | 0m 28s | Tests passed in hadoop-yarn-api. | | {color:red}-1{color} | yarn tests | 52m 4s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 99m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.sls.nodemanager.TestNMSimulator | | | hadoop.yarn.sls.appmaster.TestAMSimulator | | | hadoop.yarn.sls.TestSLSRunner | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745796/0023-YARN-2003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 7faae0e | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8574/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-sls test log | https://builds.apache.org/job/PreCommit-YARN-Build/8574/artifact/patchprocess/testrun_hadoop-sls.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8574/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8574/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8574/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8574/console | This message was automatically generated. 
Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch, 0023-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
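As a side note for readers following YARN-2003, here is a purely illustrative sketch of what carrying the priority on the app-attempt-added event could look like; the class name and constructor are hypothetical and not the actual patch:
{code}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.Priority;

// Illustrative only: an app-attempt-added event that also carries the
// application priority taken from the ApplicationSubmissionContext.
public class AppAttemptAddedWithPrioritySketch {
  private final ApplicationAttemptId attemptId;
  private final Priority appPriority;

  public AppAttemptAddedWithPrioritySketch(ApplicationAttemptId attemptId, Priority appPriority) {
    this.attemptId = attemptId;
    this.appPriority = appPriority;  // the scheduler can later order applications by this value
  }

  public Priority getAppPriority() {
    return appPriority;
  }
}
{code}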
[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3878: Attachment: YARN-3878.09_reprorace.pat_h Attaching a patch just to demonstrate the race. Since its trying to demonstrate the race it injects an artificial delay, hence not making it an official patch. Run the test testBlockNewEvents to show that an event can be in the queue while serviceStop happens. AsyncDispatcher can hang while stopping if it is configured for draining events on stop --- Key: YARN-3878 URL: https://issues.apache.org/jira/browse/YARN-3878 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Critical Fix For: 2.7.2 Attachments: YARN-3878.01.patch, YARN-3878.02.patch, YARN-3878.03.patch, YARN-3878.04.patch, YARN-3878.05.patch, YARN-3878.06.patch, YARN-3878.07.patch, YARN-3878.08.patch, YARN-3878.09.patch, YARN-3878.09_reprorace.pat_h The sequence of events is as under : # RM is stopped while putting a RMStateStore Event to RMStateStore's AsyncDispatcher. This leads to an Interrupted Exception being thrown. # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On {{serviceStop}}, we will check if all events have been drained and wait for event queue to drain(as RM State Store dispatcher is configured for queue to drain on stop). # This condition never becomes true and AsyncDispatcher keeps on waiting incessantly for dispatcher event queue to drain till JVM exits. *Initial exception while posting RM State store event to queue* {noformat} 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService (AbstractService.java:enterState(452)) - Service: Dispatcher entered state STOPPED 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) {noformat} *JStack of AsyncDispatcher hanging on stop* {noformat} AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e waiting on condition [0x7fb9654e9000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x000700b79250 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at
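To make the reported hang easier to follow, here is a condensed sketch of the drain-on-stop wait pattern described above; it is a simplification, not the exact AsyncDispatcher source:
{code}
// Simplified sketch of the drain-on-stop pattern (not the actual AsyncDispatcher
// code). "drained" is only ever set to true by the event-handling thread; if that
// thread is interrupted while the queue is still non-empty, nothing signals
// progress and serviceStop() never returns.
public class DrainOnStopSketch {
  private final Object waitForDrained = new Object();
  private volatile boolean drained = false;

  public void serviceStopSketch() throws InterruptedException {
    synchronized (waitForDrained) {
      while (!drained) {
        waitForDrained.wait(1000);  // relies on the handler thread to make progress
      }
    }
  }
}
{code}
If the handler thread has already been interrupted, nothing ever sets drained, so the loop above spins forever, which is the behavior captured in the jstack.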
[jira] [Commented] (YARN-3937) Introducing REMOVE_CONTAINER_FROM_PREEMPTION event to notify Scheduler and AM when a container is no longer to be preempted
[ https://issues.apache.org/jira/browse/YARN-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631677#comment-14631677 ] Wangda Tan commented on YARN-3937: -- [~sunilg], I agree with adding a separate event to the API/scheduler. Adding it to the scheduler is probably more important, since YARN-3769 can potentially leverage it. I don't have a solid design for YARN-3769 yet, but I think that if a container is removed from the to-be-preempted list, we shouldn't do lazy preemption for such containers. For API changes, I'm not sure we need them: a container can go on and off the list frequently, so we cannot guarantee that once a container is removed from the list it won't be marked again. Personally, I think we can make this an internal event first to avoid too much noise. Introducing REMOVE_CONTAINER_FROM_PREEMPTION event to notify Scheduler and AM when a container is no longer to be preempted --- Key: YARN-3937 URL: https://issues.apache.org/jira/browse/YARN-3937 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Sunil G Assignee: Sunil G As discussed in YARN-3784, there are scenarios where other applications have released containers or the same application has revoked its resource requests. In these cases, we may not have to preempt a container that was marked for preemption earlier. Introduce a new event to remove such containers from the scheduler's to-be-preempted list, if present, or to inform the AM about such a scenario. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3844) Make hadoop-yarn-project Native code -Wall-clean
[ https://issues.apache.org/jira/browse/YARN-3844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631720#comment-14631720 ] Colin Patrick McCabe commented on YARN-3844: +1. Thanks, Alan. Make hadoop-yarn-project Native code -Wall-clean Key: YARN-3844 URL: https://issues.apache.org/jira/browse/YARN-3844 Project: Hadoop YARN Issue Type: Sub-task Components: build Affects Versions: 2.7.0 Environment: As we specify -Wall as a default compilation flag, it would be helpful if the Native code was -Wall-clean Reporter: Alan Burlison Assignee: Alan Burlison Attachments: YARN-3844.001.patch, YARN-3844.002.patch, YARN-3844.007.patch As we specify -Wall as a default compilation flag, it would be helpful if the Native code was -Wall-clean -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631671#comment-14631671 ] Wangda Tan commented on YARN-3784: -- Hi [~sunilg], Thanks for your comments, I will post cancel-preemption event related comments to YARN-3937 soon. Indicate preemption timeout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. Introduce a timeout duration along with this container list so that the AM knows how much time it has to gracefully shut down its containers (assuming a preemption policy is loaded in the AM). This will help in NM decommissioning scenarios, where the NM is decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed forcefully by the RM once it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631691#comment-14631691 ] Wangda Tan commented on YARN-3784: -- Also, about this patch, same as commented by [~chris.douglas]: I found that the timeout sent to the AM is maxWaitTime, which I think should instead be how much time remains until the container is preempted. One solution could be to compute an absolute time for each to-be-preempted container, and compute the timeout when the AM pulls this information. Indicate preemption timeout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. Introduce a timeout duration along with this container list so that the AM knows how much time it has to gracefully shut down its containers (assuming a preemption policy is loaded in the AM). This will help in NM decommissioning scenarios, where the NM is decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed forcefully by the RM once it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
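A minimal sketch of that suggestion, assuming a hypothetical map from container to its absolute preemption deadline; the remaining time is derived only when the AM pulls the preemption message:
{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative only: store an absolute wall-clock deadline per to-be-preempted
// container and derive the remaining timeout at the moment the AM asks for it.
public class PreemptionDeadlineSketch {
  private final Map<ContainerId, Long> preemptionDeadlines = new HashMap<ContainerId, Long>();

  public void markForPreemption(ContainerId id, long maxWaitMs) {
    preemptionDeadlines.put(id, System.currentTimeMillis() + maxWaitMs);
  }

  /** Remaining time until forced preemption, computed when the AM heartbeats. */
  public long remainingMs(ContainerId id) {
    Long deadline = preemptionDeadlines.get(id);
    return deadline == null ? 0L : Math.max(0L, deadline - System.currentTimeMillis());
  }
}
{code}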
[jira] [Commented] (YARN-3934) Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used
[ https://issues.apache.org/jira/browse/YARN-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631783#comment-14631783 ] Karthik Kambatla commented on YARN-3934: Are we sure this is because of the size of a single ASC and not the number of applications at all? The latter can be fixed by setting the max-completed-applications. Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used -- Key: YARN-3934 URL: https://issues.apache.org/jira/browse/YARN-3934 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Use the following steps to test. 1. Set up ZK as the RM HA store. 2. Submit a job that refers to lots of distributed cache files with long HDFS path, which will cause the app state size to exceed ZK's max object size limit. 3. RM can't write to ZK and exit with the following exception. {noformat} 2015-07-10 22:21:13,002 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:944) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:941) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1083) {noformat} In this case, RM could have rejected the app during submitApplication RPC if the size of ApplicationSubmissionContext is too large. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
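For illustration only, a guard of the kind suggested in the description might look like the sketch below; the 1 MB limit mirrors ZooKeeper's default znode size, and the class name and exception message are assumptions, not the actual fix:
{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.impl.pb.ApplicationSubmissionContextPBImpl;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Illustrative only: estimate the serialized size of the ASC and reject the
// submission before it ever reaches the ZK state store.
public class SubmissionSizeCheckSketch {
  private static final int MAX_ASC_BYTES = 1024 * 1024; // roughly ZK's default znode limit

  public static void checkSize(ApplicationSubmissionContext context) throws YarnException {
    int size = ((ApplicationSubmissionContextPBImpl) context).getProto().getSerializedSize();
    if (size > MAX_ASC_BYTES) {
      throw new YarnException("ApplicationSubmissionContext is " + size
          + " bytes, which exceeds the " + MAX_ASC_BYTES + " byte limit");
    }
  }
}
{code}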
[jira] [Commented] (YARN-3853) Add docker container runtime support to LinuxContainterExecutor
[ https://issues.apache.org/jira/browse/YARN-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632026#comment-14632026 ] Varun Vasudev commented on YARN-3853: - Thanks for the patch [~sidharta-s]. One question - Can you explain what the purpose of {code} whitelist.add(YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME); {code} is? Some feedback on the patch: # Can we rename the DefaultLinuxContainerRuntime to ProcessContainerRuntime and rename DockerLinuxContainerRuntime to DockerContainerRuntime - both are already in the nodemanager.containermanager.linux.runtime package so the Linux seems redundant and I think Process is better than Default. # In LinuxContainerExecutor {code} + + public LinuxContainerExecutor() { + } + + // created primarily for testing + public LinuxContainerExecutor(LinuxContainerRuntime linuxContainerRuntime) { +this.linuxContainerRuntime = linuxContainerRuntime; + } {code} Maybe these should be protected? In addition, the VisibleForTesting annotation should be used # In LinuxContainerExecutor {code} - containerSchedPriorityIsSet = true; - containerSchedPriorityAdjustment = conf - .getInt(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY, - YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY); + containerSchedPriorityIsSet = true; + containerSchedPriorityAdjustment = conf + .getInt(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY, + YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY); } {code} Looks like the formatting is messed up. # In LinuxContainerExecutor, we've removed some debug statements; we should put them back in {code} -if (LOG.isDebugEnabled()) { - LOG.debug(Output from LinuxContainerExecutor's launchContainer follows:); - logOutput(shExec.getOutput()); -} {code} and {code} -if (LOG.isDebugEnabled()) { - LOG.debug(signalContainer: + Arrays.toString(command)); -} {code} # In ContainerLaunch.java {code} @Override +public void whitelistedEnv(String key, String value) throws IOException { + lineWithLenCheck(@set , key, =, value); + errorCheck(); +} {code} This code is exactly the same as the env() function. Maybe it should just call the env() function instead? # There are some unused imports in DockerLinuxContainerRuntime, DockerClient and TestDockerContainerRuntime Add docker container runtime support to LinuxContainterExecutor --- Key: YARN-3853 URL: https://issues.apache.org/jira/browse/YARN-3853 Project: Hadoop YARN Issue Type: Sub-task Components: yarn Reporter: Sidharta Seethana Assignee: Sidharta Seethana Attachments: YARN-3853.001.patch Create a new DockerContainerRuntime that implements support for docker containers via container-executor. LinuxContainerExecutor should default to current behavior when launching containers but switch to docker when requested. Overview === The current mechanism of launching/signaling containers is moved to its own (default) container runtime. In order to use docker container runtime a couple of environment variables have to be set. This will have to be revisited when we have a first class client side API to specify different container types and associated parameters. 
Using ‘pi’ as an example and using a custom docker image, this is how you could use the docker container runtime (LinuxContainerExecutor must be in use and the docker daemon needs to be running) : {code} export YARN_EXAMPLES_JAR=./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar bin/yarn jar $YARN_EXAMPLES_JAR pi -Dmapreduce.map.env=YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=ashahab/hadoop-trunk -Dyarn.app.mapreduce.am.env=YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=ashahab/hadoop-trunk -Dmapreduce.reduce.env=YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=ashahab/hadoop-trunk 4 1000 {code} LinuxContainerExecutor can delegate to either runtime on a per container basis. If the docker container type is selected, LinuxContainerExecutor delegates to the DockerContainerRuntime which in turn uses docker support in the container-executor binary to launch/manage docker containers ( see YARN-3852 ) . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
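To summarize the delegation model in code, a simplified, hypothetical sketch of picking a runtime per container from its launch environment follows; the real patch's class and method names may differ:
{code}
import java.util.Map;

// Illustrative only: choose the container runtime per container from its
// launch environment, defaulting to the plain process-based runtime.
public class RuntimeSelectionSketch {
  public static String pickRuntime(Map<String, String> env) {
    String type = env.get("YARN_CONTAINER_RUNTIME_TYPE");
    if ("docker".equals(type)) {
      return "docker";   // delegate to the Docker runtime (container-executor docker support)
    }
    return "default";    // existing process-based launch path
  }
}
{code}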
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632094#comment-14632094 ] Hadoop QA commented on YARN-3908: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 5s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 55s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 18s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 21s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 22s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 43m 4s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745903/YARN-3908-YARN-2928.005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / eb1932d | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8578/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8578/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8578/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8578/console | This message was automatically generated. Bugs in HBaseTimelineWriterImpl --- Key: YARN-3908 URL: https://issues.apache.org/jira/browse/YARN-3908 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Attachments: YARN-3908-YARN-2928.001.patch, YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.005.patch 1. In HBaseTimelineWriterImpl, the info column family contains the basic fields of a timeline entity plus events. However, entity#info map is not stored at all. 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2964: -- Labels: 2.6.1-candidate (was: ) RM prematurely cancels tokens for jobs that submit jobs (oozie) --- Key: YARN-2964 URL: https://issues.apache.org/jira/browse/YARN-2964 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Daryn Sharp Assignee: Jian He Priority: Blocker Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch The RM used to globally track the unique set of tokens for all apps. It remembered the first job that was submitted with the token. The first job controlled the cancellation of the token. This prevented completion of sub-jobs from canceling tokens used by the main job. As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM livelyness interval) after log aggregation completes. The result is an oozie job, ex. pig, that will launch many sub-jobs over time will fail if any sub-jobs are launched 10 min after any sub-job completes. If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632280#comment-14632280 ] zhihai xu commented on YARN-3798: - Thanks for the new patch [~ozawa]! the patch looks good to me except two nits: # Using {{rc == Code.OK.intValue()}} instead of {{rc == 0}} may be more maintainable and readable when checking the return value from AsyncCallback. # It may be better to add {{Thread.currentThread().interrupt();}} to restore the interrupted status after catching InterruptedException from {{syncInternal}}. ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED --- Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-2.7.002.patch, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.003.patch, YARN-3798-branch-2.7.004.patch, YARN-3798-branch-2.7.005.patch, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) 
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at
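The two nits above can be shown in a short, illustrative sketch; the surrounding class and the commented-out syncInternal call are hypothetical:
{code}
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.KeeperException.Code;

// Illustrative only: compare the async return code against Code.OK instead of
// the bare literal 0, and restore the interrupt flag after catching
// InterruptedException so callers still see the interruption.
public class ZkCallbackSketch implements AsyncCallback.VoidCallback {
  @Override
  public void processResult(int rc, String path, Object ctx) {
    if (rc == Code.OK.intValue()) {        // nit 1: more readable and maintainable than rc == 0
      // success handling...
    }
  }

  void syncAndWait() {
    try {
      // syncInternal(path);               // hypothetical blocking call that may be interrupted
      Thread.sleep(10);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // nit 2: restore the interrupted status
    }
  }
}
{code}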
[jira] [Updated] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-2890: -- Labels: 2.6.1-candidate (was: ) MiniYarnCluster should turn on timeline service if configured to do so -- Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Labels: 2.6.1-candidate Fix For: 2.8.0 Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
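Conceptually the fix is a configuration check before wiring up the timeline service; a minimal sketch (not the committed patch) follows:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative only: start the timeline service in the mini cluster only when
// the configuration actually enables it.
public class MiniClusterTimelineCheckSketch {
  public static boolean shouldStartTimelineService(Configuration conf) {
    return conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
  }
}
{code}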
[jira] [Updated] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
[ https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-2859: -- Labels: 2.6.1-candidate (was: ) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster -- Key: YARN-2859 URL: https://issues.apache.org/jira/browse/YARN-2859 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Hitesh Shah Assignee: Zhijie Shen Priority: Critical Labels: 2.6.1-candidate In mini cluster, a random port should be used. Also, the config is not updated to the host that the process got bound to. {code} 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer address: localhost:10200 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer web address: 0.0.0.0:8188 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
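The usual mini-cluster pattern is to request an ephemeral port and then publish the address the server actually bound to; a hedged sketch using the timeline web-app key follows (the helper names are made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative only: ask for an ephemeral port up front, then (after the web
// server has started) write back the address it actually bound to so tests can
// find it, instead of the hard-coded defaults shown in the description.
public class MiniClusterPortSketch {
  public static void useEphemeralPort(Configuration conf) {
    conf.set(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS, "localhost:0");
  }

  public static void publishBoundAddress(Configuration conf, String hostPort) {
    // hostPort would come from the started web server, e.g. "localhost:54321"
    conf.set(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS, hostPort);
  }
}
{code}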
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631832#comment-14631832 ] Jason Lowe commented on YARN-3535: -- Should this go in to 2.7.2? It's been seen by multiple users and seems appropriate for that release. Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED --- Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Fix For: 2.8.0 Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631911#comment-14631911 ] Wangda Tan commented on YARN-2003: -- It seems latest tests are all passed. But [~sunilg], for the failed test of previous build, it reports: {code} Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 60.123 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority testPriorityWithPendingApplications(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority) Time elapsed: 48.422 sec FAILURE! java.lang.AssertionError: Attempt state is not correct (timedout): expected: ALLOCATED actual: SCHEDULED at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:98) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:573) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testPriorityWithPendingApplications(TestApplicationPriority.java:315) {code} Is it caused by your patch or implementation of MockRM since it is related to your changes. Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch, 0023-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3934) Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used
[ https://issues.apache.org/jira/browse/YARN-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632017#comment-14632017 ] Ming Ma commented on YARN-3934: --- This is due to a single ASC object size. You can repro this with RM starting with empty state. So it is different from YARN-2962. Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used -- Key: YARN-3934 URL: https://issues.apache.org/jira/browse/YARN-3934 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Use the following steps to test. 1. Set up ZK as the RM HA store. 2. Submit a job that refers to lots of distributed cache files with long HDFS path, which will cause the app state size to exceed ZK's max object size limit. 3. RM can't write to ZK and exit with the following exception. {noformat} 2015-07-10 22:21:13,002 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:944) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:941) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1083) {noformat} In this case, RM could have rejected the app during submitApplication RPC if the size of ApplicationSubmissionContext is too large. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3700) ATS Web Performance issue at load time when large number of jobs
[ https://issues.apache.org/jira/browse/YARN-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3700: -- Labels: 2.6.1-candidate 2.7.2-candidate (was: ) ATS Web Performance issue at load time when large number of jobs Key: YARN-3700 URL: https://issues.apache.org/jira/browse/YARN-3700 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Labels: 2.6.1-candidate, 2.7.2-candidate Fix For: 2.8.0 Attachments: YARN-3700.1.patch, YARN-3700.2.1.patch, YARN-3700.2.2.patch, YARN-3700.2.patch, YARN-3700.3.patch, YARN-3700.4.patch Currently, we will load all the apps when we try to load the yarn timelineservice web page. If we have large number of jobs, it will be very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2816: -- Labels: 2.6.1-candidate (was: ) NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-2816.000.patch, YARN-2816.001.patch, YARN-2816.002.patch, leveldb_records.txt NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
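One defensive option consistent with the description is to treat a record without a start request as unrecoverable instead of dereferencing it; the sketch below is illustrative, not the committed fix:
{code}
import org.apache.hadoop.yarn.api.protocolrecords.StartContainerRequest;

// Illustrative only: guard against a partially written recovery record before
// dereferencing its start request, instead of letting the NPE take the NM down.
public class RecoveryGuardSketch {
  public static boolean isRecoverable(StartContainerRequest startRequest) {
    if (startRequest == null) {
      // A record missing its CONTAINER_REQUEST_KEY_SUFFIX entry is incomplete;
      // skip it (or remove it from the state store) rather than crash on recovery.
      return false;
    }
    return startRequest.getContainerLaunchContext() != null;
  }
}
{code}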
[jira] [Commented] (YARN-1645) ContainerManager implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632029#comment-14632029 ] Hadoop QA commented on YARN-1645: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 13s | Findbugs (version ) appears to be broken on YARN-1197. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 40s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 19s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 9s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 21s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 12s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 19s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 42m 50s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.TestContainerManager | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745884/YARN-1645-YARN-1197.4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-1197 / 8041fd8 | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8577/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8577/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8577/console | This message was automatically generated. ContainerManager implementation to support container resizing - Key: YARN-1645 URL: https://issues.apache.org/jira/browse/YARN-1645 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1645-YARN-1197.3.patch, YARN-1645-YARN-1197.4.patch, YARN-1645.1.patch, YARN-1645.2.patch, yarn-1645.1.patch Implementation of ContainerManager for container resize, including: 1) ContainerManager resize logic 2) Relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3908: -- Attachment: YARN-3908-YARN-2928.005.patch v.5 patch posted. The {{readTimeseriesResults()}} method has been renamed to {{readResultsWithTimestamps()}}. Hopefully it's a bit more appropriate. Bugs in HBaseTimelineWriterImpl --- Key: YARN-3908 URL: https://issues.apache.org/jira/browse/YARN-3908 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Attachments: YARN-3908-YARN-2928.001.patch, YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.005.patch 1. In HBaseTimelineWriterImpl, the info column family contains the basic fields of a timeline entity plus events. However, the entity#info map is not stored at all. 2. event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3654) ContainerLogsPage web UI should not have meta-refresh
[ https://issues.apache.org/jira/browse/YARN-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3654: -- Labels: 2.7.2-candidate (was: ) ContainerLogsPage web UI should not have meta-refresh - Key: YARN-3654 URL: https://issues.apache.org/jira/browse/YARN-3654 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.1 Reporter: Xuan Gong Assignee: Xuan Gong Labels: 2.7.2-candidate Fix For: 2.8.0 Attachments: YARN-3654.1.patch, YARN-3654.2.patch Currently, when we try to find the container logs for a finished application, the page redirects to the URL configured for yarn.log.server.url in yarn-site.xml. But in ContainerLogsPage, we are using meta-refresh: {code} set(TITLE, join("Redirecting to log server for ", $(CONTAINER_ID))); html.meta_http("refresh", "1; url=" + redirectUrl); {code} This does not work well in browsers that require meta-refresh to be enabled in their security settings, especially IE, where meta-refresh is considered a security hole. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632079#comment-14632079 ] Sunil G commented on YARN-2003: --- Hi Wangda. This MockRM issue was an intermittent one we faced earlier. This random failure was supposed to be fixed in YARN-3533. It did not happen because of my change, as I have not added any new API in MockRM now. YARN-3533 fixed this issue in launchAM; maybe the issue still exists for sendAMLaunched. I'll check this and, if needed, will open a test ticket to track it. Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch, 0023-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2340) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2340: -- Labels: 2.6.1-candidate (was: ) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby -- Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: Capacityscheduler with Queue a, b Reporter: Nishan Shetty Assignee: Rohith Sharma K S Priority: Critical Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: 0001-YARN-2340.patch While a job is in progress, set the queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the NPE below: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2414) RM web UI: app page will crash if app is failed before any attempt has been created
[ https://issues.apache.org/jira/browse/YARN-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2414: -- Labels: 2.6.1-candidate (was: ) RM web UI: app page will crash if app is failed before any attempt has been created --- Key: YARN-2414 URL: https://issues.apache.org/jira/browse/YARN-2414 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Zhijie Shen Assignee: Wangda Tan Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-2414.20141104-1.patch, YARN-2414.20141104-2.patch, YARN-2414.patch {code} 2014-08-12 16:45:13,573 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/app/application_1407887030038_0001 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:460) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1191) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at
[jira] [Updated] (YARN-3227) Timeline renew delegation token fails when RM user's TGT is expired
[ https://issues.apache.org/jira/browse/YARN-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3227: -- Labels: 2.6.1-candidate (was: ) Timeline renew delegation token fails when RM user's TGT is expired --- Key: YARN-3227 URL: https://issues.apache.org/jira/browse/YARN-3227 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Eagles Assignee: Zhijie Shen Priority: Critical Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-3227.1.patch, YARN-3227.test.patch When the RM user's kerberos TGT is expired, the RM renew delegation token operation fails as part of job submission. Expected behavior is that RM will relogin to get a new TGT. {quote} 2015-02-06 18:54:05,617 [DelegationTokenRenewer #25954] WARN security.DelegationTokenRenewer: Unable to add the application to the delegation token renewer. java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, Service: timelineserver.example.com:4080, Ident: (owner=user, renewer=rmuser, realUser=oozie, issueDate=1423248845528, maxDate=1423853645528, sequenceNumber=9716, masterKeyId=9) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:443) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$800(DelegationTokenRenewer.java:77) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:808) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:789) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Caused by: java.io.IOException: HTTP status [401], message [Unauthorized] at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:286) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:211) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:374) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:360) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$4.run(TimelineClientImpl.java:429) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:161) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:444) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:378) at org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81) at 
org.apache.hadoop.security.token.Token.renew(Token.java:377) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:532) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:529) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
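The expected relogin can be forced from the renewer side by refreshing the RM login user's TGT from its keytab before the timeline renew call. A minimal sketch of that idea (an assumption about where such a fix could go, not the attached YARN-3227.1.patch), using the existing UserGroupInformation API:
{code}
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;

public class TimelineTokenRenewHelper {
  // Re-acquire the RM login user's TGT from its keytab if it has expired, so that
  // the subsequent SPNEGO-authenticated renew request does not fail with HTTP 401.
  public static void reloginIfNeeded() throws IOException {
    UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
    if (loginUser.isFromKeytab()) {
      loginUser.checkTGTAndReloginFromKeytab();  // no-op if the TGT is still valid
    }
  }
}
{code}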
[jira] [Updated] (YARN-3393) Getting application(s) goes wrong when app finishes before starting the attempt
[ https://issues.apache.org/jira/browse/YARN-3393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3393: -- Labels: 2.6.1-candidate (was: ) Getting application(s) goes wrong when app finishes before starting the attempt --- Key: YARN-3393 URL: https://issues.apache.org/jira/browse/YARN-3393 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-3393.1.patch When generating the app report in ApplicationHistoryManagerOnTimelineStore, it checks if appAttempt == null. {code} ApplicationAttemptReport appAttempt = getApplicationAttempt(app.appReport.getCurrentApplicationAttemptId()); if (appAttempt != null) { app.appReport.setHost(appAttempt.getHost()); app.appReport.setRpcPort(appAttempt.getRpcPort()); app.appReport.setTrackingUrl(appAttempt.getTrackingUrl()); app.appReport.setOriginalTrackingUrl(appAttempt.getOriginalTrackingUrl()); } {code} However, {{getApplicationAttempt}} doesn't return null but throws ApplicationAttemptNotFoundException: {code} if (entity == null) { throw new ApplicationAttemptNotFoundException("The entity for application attempt " + appAttemptId + " doesn't exist in the timeline store"); } else { return convertToApplicationAttemptReport(entity); } {code} The two code paths aren't coupled well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
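A minimal sketch of how the caller could tolerate the exception instead of relying on a null return (a hypothetical illustration, not the attached YARN-3393.1.patch; variable names follow the snippet above):
{code}
ApplicationAttemptReport appAttempt = null;
try {
  appAttempt = getApplicationAttempt(app.appReport.getCurrentApplicationAttemptId());
} catch (ApplicationAttemptNotFoundException e) {
  // The app finished before any attempt was written to the timeline store;
  // leave the host/RPC/tracking fields of the report unset.
}
if (appAttempt != null) {
  app.appReport.setHost(appAttempt.getHost());
  app.appReport.setRpcPort(appAttempt.getRpcPort());
  app.appReport.setTrackingUrl(appAttempt.getTrackingUrl());
  app.appReport.setOriginalTrackingUrl(appAttempt.getOriginalTrackingUrl());
}
{code}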
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631816#comment-14631816 ] Hadoop QA commented on YARN-3878: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | patch | 0m 1s | The patch file was not named according to hadoop's naming conventions. Please see https://wiki.apache.org/hadoop/HowToContribute for instructions. | | {color:red}-1{color} | pre-patch | 15m 1s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:red}-1{color} | javac | 7m 37s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 25s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 21s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 33s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 1m 57s | Tests failed in hadoop-yarn-common. | | | | 38m 34s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.event.TestAsyncDispatcher | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745865/YARN-3878.09_reprorace.pat_h | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 419c51d | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/8576/artifact/patchprocess/diffJavacWarnings.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8576/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8576/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8576/console | This message was automatically generated. AsyncDispatcher can hang while stopping if it is configured for draining events on stop --- Key: YARN-3878 URL: https://issues.apache.org/jira/browse/YARN-3878 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Critical Fix For: 2.7.2 Attachments: YARN-3878.01.patch, YARN-3878.02.patch, YARN-3878.03.patch, YARN-3878.04.patch, YARN-3878.05.patch, YARN-3878.06.patch, YARN-3878.07.patch, YARN-3878.08.patch, YARN-3878.09.patch, YARN-3878.09_reprorace.pat_h The sequence of events is as under : # RM is stopped while putting a RMStateStore Event to RMStateStore's AsyncDispatcher. This leads to an Interrupted Exception being thrown. # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. 
On {{serviceStop}}, we will check if all events have been drained and wait for the event queue to drain (as the RM State Store dispatcher is configured to drain its queue on stop). # This condition never becomes true and AsyncDispatcher keeps waiting for the dispatcher event queue to drain until the JVM exits. *Initial exception while posting RM State store event to queue* {noformat} 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService (AbstractService.java:enterState(452)) - Service: Dispatcher entered state STOPPED 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) at
[jira] [Updated] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3216: - Description: Currently, max-am-resource-percentage considers default_partition only. When a queue can access multiple partitions, we should be able to compute max-am-resource-percentage based on that. Max-AM-Resource-Percentage should respect node labels - Key: YARN-3216 URL: https://issues.apache.org/jira/browse/YARN-3216 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Currently, max-am-resource-percentage considers default_partition only. When a queue can access multiple partitions, we should be able to compute max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632125#comment-14632125 ] Jian He commented on YARN-2005: --- It seems the patch will blacklist a node immediately once the AM container fails; I think we should blacklist a node only after a configurable threshold? Some apps may still want to be restarted on the same node for reasons like data locality - the AM does not want to transfer its local data to a different machine when restarted. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
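A rough sketch of the threshold idea raised in the comment above (purely hypothetical; the class, method, and threshold handling are illustrative and not taken from any attached patch):
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AMBlacklistTracker {
  private final int failureThreshold;  // configurable number of AM failures per node
  private final Map<String, Integer> amFailuresPerNode = new HashMap<String, Integer>();
  private final Set<String> blacklistedNodes = new HashSet<String>();

  public AMBlacklistTracker(int failureThreshold) {
    this.failureThreshold = failureThreshold;
  }

  // Record an AM container failure on a node; blacklist the node only once its
  // failure count reaches the configured threshold, not on the first failure.
  public void onAMContainerFailed(String nodeId) {
    Integer count = amFailuresPerNode.get(nodeId);
    int failures = (count == null) ? 1 : count + 1;
    amFailuresPerNode.put(nodeId, failures);
    if (failures >= failureThreshold) {
      blacklistedNodes.add(nodeId);
    }
  }

  public boolean isBlacklisted(String nodeId) {
    return blacklistedNodes.contains(nodeId);
  }
}
{code}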
[jira] [Updated] (YARN-1645) ContainerManager implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1645: Attachment: YARN-1645-YARN-1197.5.patch I think you are right that these functions don't need to be synchronized. Originally I was directly modifying container sizes in {{changeContainerResourceInternal}}, so I thought I needed to synchronize functions that may potentially access the same containers. This is no longer the case, as container sizes are now changed in ContainerImpl via events, and access to a container is already properly synchronized in ContainerImpl. Thanks for catching this. Attaching the updated patch. ContainerManager implementation to support container resizing - Key: YARN-1645 URL: https://issues.apache.org/jira/browse/YARN-1645 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1645-YARN-1197.3.patch, YARN-1645-YARN-1197.4.patch, YARN-1645-YARN-1197.5.patch, YARN-1645.1.patch, YARN-1645.2.patch, yarn-1645.1.patch Implementation of ContainerManager for container resize, including: 1) ContainerManager resize logic 2) Relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1645) ContainerManager implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632037#comment-14632037 ] Jian He commented on YARN-1645: --- Looks good overall; one question: why are these changed to be synchronized? {code} private synchronized void stopContainerInternal( private synchronized ContainerStatus getContainerStatusInternal( {code} ContainerManager implementation to support container resizing - Key: YARN-1645 URL: https://issues.apache.org/jira/browse/YARN-1645 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1645-YARN-1197.3.patch, YARN-1645-YARN-1197.4.patch, YARN-1645.1.patch, YARN-1645.2.patch, yarn-1645.1.patch Implementation of ContainerManager for container resize, including: 1) ContainerManager resize logic 2) Relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
[ https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632042#comment-14632042 ] Jian He commented on YARN-3900: --- lgtm Protobuf layout of yarn_security_token causes errors in other protos that include it - Key: YARN-3900 URL: https://issues.apache.org/jira/browse/YARN-3900 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3900.001.patch, YARN-3900.001.patch, YARN-3900.002.patch Because of the {{server}} subdirectory used in {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}, there are errors in other protos that include it. As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html {noformat} Any subdirectories under src/main/proto/ are treated as package structure for protobuf definition imports.{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1645) ContainerManager implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632200#comment-14632200 ] Hadoop QA commented on YARN-1645: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 29s | Findbugs (version ) appears to be broken on YARN-1197. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 47s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 49s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 20s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 8s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 24s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 17s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 43m 26s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745917/YARN-1645-YARN-1197.5.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-1197 / 8041fd8 | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8579/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8579/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8579/console | This message was automatically generated. ContainerManager implementation to support container resizing - Key: YARN-1645 URL: https://issues.apache.org/jira/browse/YARN-1645 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1645-YARN-1197.3.patch, YARN-1645-YARN-1197.4.patch, YARN-1645-YARN-1197.5.patch, YARN-1645.1.patch, YARN-1645.2.patch, yarn-1645.1.patch Implementation of ContainerManager for container resize, including: 1) ContainerManager resize logic 2) Relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-3938. -- Resolution: Duplicate I just found that I had already filed a JIRA for this issue, YARN-3216. Closing this as a duplicate. Thanks for reporting, [~bibinchundatt]. AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel Key: YARN-3938 URL: https://issues.apache.org/jira/browse/YARN-3938 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: Am limit for subqueue.jpg For a leaf queue, the AM resource calculation is based on {{absoluteCapacityResource}}. Below is the absolute capacity calculation in {{LeafQueue#updateAbsoluteCapacityResource()}} {code} private void updateAbsoluteCapacityResource(Resource clusterResource) { absoluteCapacityResource = Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource), queueCapacities.getAbsoluteCapacity(), minimumAllocation); } {code} If the DEFAULT_PARTITION resource is zero, the AM resource for every leaf queue will be zero. A snapshot is attached showing the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2246) Job History Link in RM UI is redirecting to the URL which contains Job Id twice
[ https://issues.apache.org/jira/browse/YARN-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2246: -- Labels: 2.6.1-candidate (was: ) Job History Link in RM UI is redirecting to the URL which contains Job Id twice --- Key: YARN-2246 URL: https://issues.apache.org/jira/browse/YARN-2246 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Devaraj K Assignee: Devaraj K Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: MAPREDUCE-4064-1.patch, MAPREDUCE-4064.patch, YARN-2246-3.patch, YARN-2246-4.patch, YARN-2246.2.patch, YARN-2246.patch {code:xml} http://xx.x.x.x:19888/jobhistory/job/job_1332435449546_0001/jobhistory/job/job_1332435449546_0001 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3239) WebAppProxy does not support a final tracking url which has query fragments and params
[ https://issues.apache.org/jira/browse/YARN-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3239: -- Labels: 2.6.1-candidate (was: ) WebAppProxy does not support a final tracking url which has query fragments and params --- Key: YARN-3239 URL: https://issues.apache.org/jira/browse/YARN-3239 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-3239.1.patch Examples of failures: Expected: {{http://uihost:8080/#/main/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424384418229_0005}} Actual: {{http://uihost:8080}} Tried with a minor change to remove the #. Saw a different issue: Expected: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424388018547_0001}} Actual: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez/}} yarn application -status appId returns the expected value correctly. However, invoking an http get on http://rm:8088/proxy/appId/ returns the wrong value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3207) secondary filter matches entities which do not have the key being filtered for.
[ https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3207: -- Labels: 2.6.1-candidate (was: ) secondary filter matches entities which do not have the key being filtered for. -- Key: YARN-3207 URL: https://issues.apache.org/jira/browse/YARN-3207 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Prakash Ramachandran Assignee: Zhijie Shen Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-3207.1.patch In the leveldb implementation of the TimelineStore, the secondary filter matches entities where the key being searched for is not present. For example, the query from the Tez UI http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar will match and return the entity even though there is no entity with otherinfo.foo defined. The issue seems to be in {code:title=LeveldbTimelineStore:675} if (vs != null && !vs.contains(filter.getValue())) { filterPassed = false; break; } {code} IMHO this should be vs == null || !vs.contains(filter.getValue()) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
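Spelled out, the correction suggested in the report above would look like the following (a sketch of the proposed change, not necessarily the attached YARN-3207.1.patch):
{code}
// Treat a missing key the same as a value mismatch: the entity should not pass the filter.
if (vs == null || !vs.contains(filter.getValue())) {
  filterPassed = false;
  break;
}
{code}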
[jira] [Assigned] (YARN-3936) Add metrics for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-3936: - Assignee: Sunil G Add metrics for RMStateStore Key: YARN-3936 URL: https://issues.apache.org/jira/browse/YARN-3936 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma Assignee: Sunil G It might be useful to collect some metrics w.r.t. RMStateStore such as: * Write latency * The ApplicationStateData size distribution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
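A purely illustrative sketch of what such metrics might look like using Hadoop's metrics2 library (the class, metric, and method names are hypothetical and not from any YARN-3936 patch):
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
import org.apache.hadoop.metrics2.lib.MutableRate;

@Metrics(about = "Metrics for RMStateStore", context = "yarn")
public class RMStateStoreMetrics {
  @Metric("State store write latency") MutableRate writeLatency;
  @Metric("Size of last serialized ApplicationStateData") MutableGaugeLong lastAppStateDataSize;

  public static RMStateStoreMetrics create() {
    return DefaultMetricsSystem.instance().register(
        "RMStateStoreMetrics", "Metrics for RMStateStore", new RMStateStoreMetrics());
  }

  // Record one state-store write: elapsed time in milliseconds and payload size in bytes.
  public void recordWrite(long elapsedMillis, long sizeBytes) {
    writeLatency.add(elapsedMillis);
    lastAppStateDataSize.set(sizeBytes);
  }
}
{code}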
[jira] [Updated] (YARN-1645) ContainerManager implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1645: Attachment: YARN-1645-YARN-1197.4.patch The {{testChangeContainerResource}} has dependency on YARN-3867 and YARN-1643. Will move the test case to YARN-1643. ContainerManager implementation to support container resizing - Key: YARN-1645 URL: https://issues.apache.org/jira/browse/YARN-1645 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1645-YARN-1197.3.patch, YARN-1645-YARN-1197.4.patch, YARN-1645.1.patch, YARN-1645.2.patch, yarn-1645.1.patch Implementation of ContainerManager for container resize, including: 1) ContainerManager resize logic 2) Relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631928#comment-14631928 ] Wangda Tan commented on YARN-3216: -- There are two approaches to doing that: - Make maxAMResource = queue's total guaranteed resource (sum of the queue's guaranteed resource over all partitions) * maxAMResourcePercent. This is straightforward, but it can also lead to too many AMs launched under a single partition. - Compute maxAMResource per queue per partition. This keeps AM usage across partitions more balanced, but it can make debugging harder ("my application is stuck because the AMResourceLimit of a partition is violated"). I prefer the first solution since it's easier to understand and debug. Max-AM-Resource-Percentage should respect node labels - Key: YARN-3216 URL: https://issues.apache.org/jira/browse/YARN-3216 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Currently, max-am-resource-percentage considers default_partition only. When a queue can access multiple partitions, we should be able to compute max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
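A rough sketch of the first approach (hypothetical; helpers such as {{getAccessibleNodeLabels()}} and the per-partition {{getAbsoluteCapacity(partition)}} are assumptions, loosely based on the snippet quoted in YARN-3938 above):
{code}
// Approach 1 (sketch): sum the queue's guaranteed resource across all partitions it
// can access, then apply the AM resource percentage to that total.
Resource totalGuaranteed = Resources.createResource(0, 0);
for (String partition : getAccessibleNodeLabels()) {
  Resources.addTo(totalGuaranteed,
      Resources.multiply(
          labelManager.getResourceByLabel(partition, clusterResource),
          queueCapacities.getAbsoluteCapacity(partition)));
}
Resource maxAMResource = Resources.multiply(totalGuaranteed, maxAMResourcePercent);
{code}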
[jira] [Updated] (YARN-2905) AggregatedLogsBlock page can infinitely loop if the aggregated log file is corrupted
[ https://issues.apache.org/jira/browse/YARN-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2905: -- Labels: 2.6.1-candidate (was: ) AggregatedLogsBlock page can infinitely loop if the aggregated log file is corrupted Key: YARN-2905 URL: https://issues.apache.org/jira/browse/YARN-2905 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Priority: Blocker Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: YARN-2905.patch If the AggregatedLogsBlock page tries to serve up a portion of a log file that has been corrupted (e.g.: like the case that was fixed by YARN-2724) then it can spin forever trying to seek to the targeted log segment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
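The usual defensive pattern for this class of bug is to stop as soon as the underlying stream makes no forward progress, instead of retrying forever. A generic, hypothetical sketch (not the attached YARN-2905.patch):
{code}
import java.io.IOException;
import java.io.InputStream;

public final class SkipUtil {
  // Skip up to 'toSkip' bytes, but bail out instead of spinning forever when the
  // stream cannot advance (e.g. a corrupted or truncated aggregated log segment).
  public static long skipSafely(InputStream in, long toSkip) throws IOException {
    long remaining = toSkip;
    while (remaining > 0) {
      long skipped = in.skip(remaining);
      if (skipped <= 0) {
        break;  // no progress: stop rather than loop indefinitely
      }
      remaining -= skipped;
    }
    return toSkip - remaining;  // number of bytes actually skipped
  }
}
{code}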
[jira] [Updated] (YARN-2917) Potential deadlock in AsyncDispatcher when system.exit called in AsyncDispatcher#dispatch and AsyncDispatcher#serviceStop from shutdown hook
[ https://issues.apache.org/jira/browse/YARN-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2917: -- Labels: 2.6.1-candidate (was: ) Potential deadlock in AsyncDispatcher when system.exit called in AsyncDispatcher#dispatch and AsyncDispatcher#serviceStop from shutdown hook Key: YARN-2917 URL: https://issues.apache.org/jira/browse/YARN-2917 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Priority: Critical Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: 0001-YARN-2917.patch, 0002-YARN-2917.patch I encountered a scenario where the RM hung while shutting down and kept logging {{2014-12-03 19:32:44,283 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain.}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631826#comment-14631826 ] Sangjin Lee commented on YARN-3908: --- Thanks for the comment [~gtCarrera9]. I agree the name is a bit awkward. Let me see if I can rename it to something more appropriate. Will update. Bugs in HBaseTimelineWriterImpl --- Key: YARN-3908 URL: https://issues.apache.org/jira/browse/YARN-3908 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Attachments: YARN-3908-YARN-2928.001.patch, YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch 1. In HBaseTimelineWriterImpl, the info column family contains the basic fields of a timeline entity plus events. However, the entity#info map is not stored at all. 2. event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631849#comment-14631849 ] Jian He commented on YARN-3878: --- Anubhav, thanks for reviewing the patch. I think given that we cannot guarantee shutdown will process all events - main dispatcher may also have some events pending which are not drained - in any case we are going to lose those events, to keep it simple, it's ok to not handle this rare condition. AsyncDispatcher can hang while stopping if it is configured for draining events on stop --- Key: YARN-3878 URL: https://issues.apache.org/jira/browse/YARN-3878 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Critical Fix For: 2.7.2 Attachments: YARN-3878.01.patch, YARN-3878.02.patch, YARN-3878.03.patch, YARN-3878.04.patch, YARN-3878.05.patch, YARN-3878.06.patch, YARN-3878.07.patch, YARN-3878.08.patch, YARN-3878.09.patch, YARN-3878.09_reprorace.pat_h The sequence of events is as under : # RM is stopped while putting a RMStateStore Event to RMStateStore's AsyncDispatcher. This leads to an Interrupted Exception being thrown. # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On {{serviceStop}}, we will check if all events have been drained and wait for event queue to drain(as RM State Store dispatcher is configured for queue to drain on stop). # This condition never becomes true and AsyncDispatcher keeps on waiting incessantly for dispatcher event queue to drain till JVM exits. *Initial exception while posting RM State store event to queue* {noformat} 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService (AbstractService.java:enterState(452)) - Service: Dispatcher entered state STOPPED 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) {noformat} *JStack of AsyncDispatcher hanging on stop* {noformat} AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e waiting on condition [0x7fb9654e9000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x000700b79250 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at