[jira] [Updated] (YARN-3271) FairScheduler: Move tests related to max-runnable-apps from TestFairScheduler to TestAppRunnability
[ https://issues.apache.org/jira/browse/YARN-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3271: Attachment: YARN-3271.1.patch Attaching the patch; kindly review. FairScheduler: Move tests related to max-runnable-apps from TestFairScheduler to TestAppRunnability --- Key: YARN-3271 URL: https://issues.apache.org/jira/browse/YARN-3271 Project: Hadoop YARN Issue Type: Improvement Reporter: Karthik Kambatla Assignee: nijel Attachments: YARN-3271.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3559) Mark org.apache.hadoop.security.token.Token as @InterfaceAudience.Public
[ https://issues.apache.org/jira/browse/YARN-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519137#comment-14519137 ] J.Andreina commented on YARN-3559: -- [~ste...@apache.org], I would like to work on this issue. If you have not already started working on it, shall I take this issue? Mark org.apache.hadoop.security.token.Token as @InterfaceAudience.Public Key: YARN-3559 URL: https://issues.apache.org/jira/browse/YARN-3559 Project: Hadoop YARN Issue Type: Improvement Components: security Affects Versions: 2.6.0 Reporter: Steve Loughran {{org.apache.hadoop.security.token.Token}} is tagged {{@InterfaceAudience.LimitedPrivate}} for HDFS and MapReduce. However, it is used throughout YARN apps, where both the clients and the AM need to work with tokens. This class and related classes all need to be declared public. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3535: - Attachment: YARN-3535-002.patch # Remove call of recoverResourceRequestForContainer from preemption to avoid duplication of recover RR. # Fix broken tests. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3559) Mark org.apache.hadoop.security.token.Token as @InterfaceAudience.Public
Steve Loughran created YARN-3559: Summary: Mark org.apache.hadoop.security.token.Token as @InterfaceAudience.Public Key: YARN-3559 URL: https://issues.apache.org/jira/browse/YARN-3559 Project: Hadoop YARN Issue Type: Improvement Components: security Affects Versions: 2.6.0 Reporter: Steve Loughran {{org.apache.hadoop.security.token.Token}} is tagged {{@InterfaceAudience.LimitedPrivate}} for HDFS and MapReduce. However, it is used throughout YARN apps, where both the clients and the AM need to work with tokens. This class and related classes all need to be declared public. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
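To illustrate why YARN application code touches this class directly, here is a minimal, self-contained sketch; the TokenDump class is made up for illustration, while UserGroupInformation and Token are the real Hadoop APIs the issue refers to. Any AM or client that inspects its credentials handles Token instances like this, which is the motivation for a Public audience tag.
{code}
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class TokenDump {
  public static void main(String[] args) throws IOException {
    // An AM or YARN client typically starts from its own UGI / credentials.
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    // Application code iterates Token objects directly, even though the class
    // is currently tagged LimitedPrivate({"HDFS", "MapReduce"}).
    for (Token<? extends TokenIdentifier> token : ugi.getTokens()) {
      System.out.println(token.getKind() + " for service " + token.getService());
    }
  }
}
{code}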
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519196#comment-14519196 ] Hudson commented on YARN-3485: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2110 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2110/]) YARN-3485. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies. (kasha) (kasha: rev 8f82970e0c247b37b2bf33aa21f6a39afa07efde) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Fix For: 2.7.1 Attachments: yarn-3485-1.patch, yarn-3485-2.patch, yarn-3485-3.patch, yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519222#comment-14519222 ] Hudson commented on YARN-3485: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #169 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/169/]) YARN-3485. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies. (kasha) (kasha: rev 8f82970e0c247b37b2bf33aa21f6a39afa07efde) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Fix For: 2.7.1 Attachments: yarn-3485-1.patch, yarn-3485-2.patch, yarn-3485-3.patch, yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519247#comment-14519247 ] Hudson commented on YARN-3485: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #178 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/178/]) YARN-3485. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies. (kasha) (kasha: rev 8f82970e0c247b37b2bf33aa21f6a39afa07efde) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Fix For: 2.7.1 Attachments: yarn-3485-1.patch, yarn-3485-2.patch, yarn-3485-3.patch, yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
Anushri created YARN-3560: - Summary: Not able to navigate to the cluster from tracking url (proxy) generated after submission of job Key: YARN-3560 URL: https://issues.apache.org/jira/browse/YARN-3560 Project: Hadoop YARN Issue Type: Bug Reporter: Anushri Priority: Minor A standalone web proxy server is enabled in the cluster. When a job is submitted, the generated tracking URL points at the proxy. Following this URL in the web page, if we try to navigate to the cluster links [About, Applications, or Scheduler], it gets redirected to some default port instead of the actual configured RM web port, and as such it throws "webpage not available". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3271) FairScheduler: Move tests related to max-runnable-apps from TestFairScheduler to TestAppRunnability
[ https://issues.apache.org/jira/browse/YARN-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519326#comment-14519326 ] Hadoop QA commented on YARN-3271: - (!) The patch artifact directory on has been removed! This is a fatal error for test-patch.sh. Aborting. Jenkins (node H4) information at https://builds.apache.org/job/PreCommit-YARN-Build/7537/ may provide some hints. FairScheduler: Move tests related to max-runnable-apps from TestFairScheduler to TestAppRunnability --- Key: YARN-3271 URL: https://issues.apache.org/jira/browse/YARN-3271 Project: Hadoop YARN Issue Type: Improvement Reporter: Karthik Kambatla Assignee: nijel Attachments: YARN-3271.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519278#comment-14519278 ] Hudson commented on YARN-3485: -- FAILURE: Integrated in Hadoop-Yarn-trunk #912 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/912/]) YARN-3485. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies. (kasha) (kasha: rev 8f82970e0c247b37b2bf33aa21f6a39afa07efde) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Fix For: 2.7.1 Attachments: yarn-3485-1.patch, yarn-3485-2.patch, yarn-3485-3.patch, yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2893: Attachment: YARN-2893.004.patch AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518761#comment-14518761 ] Sangjin Lee commented on YARN-3044: --- [~zjshen], sorry I missed your comment earlier... bq. Say we have a big cluster that can afford 5,000 concurrent containers... I follow your logic there. But I meant 5,000 containers allocated *per second*, not 5,000 concurrent containers. In a large cluster, it is entirely possible that containers are allocated and released on the order of thousands per second easily. Then, it follows we're already talking about 2 * 5,000 events per second in such a situation. And if we add more event types it is reasonable to expect each of them to happen as fast as 5,000 events per second. [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044-YARN-2928.004.patch, YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2176) CapacityScheduler loops over all running applications rather than actively requesting apps
[ https://issues.apache.org/jira/browse/YARN-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518781#comment-14518781 ] Xianyin Xin commented on YARN-2176: --- Sorry [~jlowe], I've made a mistake. What I was thinking of was the FairScheduler, where we re-sort all the apps on every scheduling pass. When the number of running apps is in the thousands, the time consumed by re-sorting is hundreds of milliseconds. You're right that the overhead in CS is low. CapacityScheduler loops over all running applications rather than actively requesting apps -- Key: YARN-2176 URL: https://issues.apache.org/jira/browse/YARN-2176 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.4.0 Reporter: Jason Lowe The capacity scheduler performance is primarily dominated by LeafQueue.assignContainers, and that currently loops over all applications that are running in the queue. It would be more efficient if we looped over just the applications that are actively asking for resources rather than all applications, as there could be thousands of applications running but only a few hundred that are currently asking for resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518917#comment-14518917 ] zhihai xu commented on YARN-2893: - The TestAMRestart failure is not related to my change; YARN-2483 tracks this test failure. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3552) RMServerUtils#DUMMY_APPLICATION_RESOURCE_USAGE_REPORT has negative numbers
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519096#comment-14519096 ] Rohith commented on YARN-3552: -- I think displaying 'N/A' in the UI is reasonable, and for REST we should keep the existing behavior since changing it affects compatibility. RMServerUtils#DUMMY_APPLICATION_RESOURCE_USAGE_REPORT has negative numbers --- Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Rohith Assignee: Rohith Priority: Trivial Attachments: 0001-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers. {code} public static final ApplicationResourceUsageReport DUMMY_APPLICATION_RESOURCE_USAGE_REPORT = BuilderUtils.newApplicationResourceUsageReport(-1, -1, Resources.createResource(-1, -1), Resources.createResource(-1, -1), Resources.createResource(-1, -1), 0, 0); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
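A minimal sketch of the idea discussed above; the helper class and method below are hypothetical, not the actual RM webapp code. The -1 placeholders from DUMMY_APPLICATION_RESOURCE_USAGE_REPORT stay in the report for REST compatibility and are translated to "N/A" only at UI display time.
{code}
public final class UsageReportDisplay {
  private UsageReportDisplay() {}

  // Returns "N/A" for the -1 placeholders; real (non-negative) values pass through unchanged.
  public static String displayValue(long value) {
    return value < 0 ? "N/A" : String.valueOf(value);
  }

  public static void main(String[] args) {
    System.out.println(displayValue(-1));   // N/A  (usage not reported)
    System.out.println(displayValue(4096)); // 4096 (a real value, e.g. memory in MB)
  }
}
{code}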
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519027#comment-14519027 ] Steve Loughran commented on YARN-3539: -- bq. we need to update all the API classes to remark them stable. Good point. My next patch will tag the relevant classes as @Evolving. Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, YARN-3539-003.patch, YARN-3539-004.patch The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3558) Additional containers getting reserved from RM in case of Fair scheduler
Bibin A Chundatt created YARN-3558: -- Summary: Additional containers getting reserved from RM in case of Fair scheduler Key: YARN-3558 URL: https://issues.apache.org/jira/browse/YARN-3558 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.7.0 Environment: OS :Suse 11 Sp3 Setup : 2 RM 2 NM Scheduler : Fair scheduler Reporter: Bibin A Chundatt Submit PI job with 16 maps Total container expected : 16 MAPS + 1 Reduce + 1 AM Total containers reserved by RM is 21 Below set of containers are not being used for execution container_1430213948957_0001_01_20 container_1430213948957_0001_01_19 RM Containers reservation and states {code} Processing container_1430213948957_0001_01_01 of type START Processing container_1430213948957_0001_01_01 of type ACQUIRED Processing container_1430213948957_0001_01_01 of type LAUNCHED Processing container_1430213948957_0001_01_02 of type START Processing container_1430213948957_0001_01_03 of type START Processing container_1430213948957_0001_01_02 of type ACQUIRED Processing container_1430213948957_0001_01_03 of type ACQUIRED Processing container_1430213948957_0001_01_04 of type START Processing container_1430213948957_0001_01_05 of type START Processing container_1430213948957_0001_01_04 of type ACQUIRED Processing container_1430213948957_0001_01_05 of type ACQUIRED Processing container_1430213948957_0001_01_02 of type LAUNCHED Processing container_1430213948957_0001_01_04 of type LAUNCHED Processing container_1430213948957_0001_01_06 of type RESERVED Processing container_1430213948957_0001_01_03 of type LAUNCHED Processing container_1430213948957_0001_01_05 of type LAUNCHED Processing container_1430213948957_0001_01_07 of type START Processing container_1430213948957_0001_01_07 of type ACQUIRED Processing container_1430213948957_0001_01_07 of type LAUNCHED Processing container_1430213948957_0001_01_08 of type RESERVED Processing container_1430213948957_0001_01_02 of type FINISHED Processing container_1430213948957_0001_01_06 of type START Processing container_1430213948957_0001_01_06 of type ACQUIRED Processing container_1430213948957_0001_01_06 of type LAUNCHED Processing container_1430213948957_0001_01_04 of type FINISHED Processing container_1430213948957_0001_01_09 of type START Processing container_1430213948957_0001_01_09 of type ACQUIRED Processing container_1430213948957_0001_01_09 of type LAUNCHED Processing container_1430213948957_0001_01_10 of type RESERVED Processing container_1430213948957_0001_01_03 of type FINISHED Processing container_1430213948957_0001_01_08 of type START Processing container_1430213948957_0001_01_08 of type ACQUIRED Processing container_1430213948957_0001_01_08 of type LAUNCHED Processing container_1430213948957_0001_01_05 of type FINISHED Processing container_1430213948957_0001_01_11 of type START Processing container_1430213948957_0001_01_11 of type ACQUIRED Processing container_1430213948957_0001_01_11 of type LAUNCHED Processing container_1430213948957_0001_01_07 of type FINISHED Processing container_1430213948957_0001_01_12 of type START Processing container_1430213948957_0001_01_12 of type ACQUIRED Processing container_1430213948957_0001_01_12 of type LAUNCHED Processing container_1430213948957_0001_01_13 of type RESERVED Processing container_1430213948957_0001_01_06 of type FINISHED Processing container_1430213948957_0001_01_10 of type START Processing container_1430213948957_0001_01_10 of type ACQUIRED Processing container_1430213948957_0001_01_10 of type LAUNCHED 
Processing container_1430213948957_0001_01_09 of type FINISHED Processing container_1430213948957_0001_01_14 of type START Processing container_1430213948957_0001_01_14 of type ACQUIRED Processing container_1430213948957_0001_01_14 of type LAUNCHED Processing container_1430213948957_0001_01_15 of type RESERVED Processing container_1430213948957_0001_01_08 of type FINISHED Processing container_1430213948957_0001_01_13 of type START Processing container_1430213948957_0001_01_16 of type RESERVED Processing container_1430213948957_0001_01_13 of type ACQUIRED Processing container_1430213948957_0001_01_13 of type LAUNCHED Processing container_1430213948957_0001_01_11 of type FINISHED Processing container_1430213948957_0001_01_16 of type START Processing container_1430213948957_0001_01_10 of type FINISHED Processing container_1430213948957_0001_01_15 of type START Processing container_1430213948957_0001_01_16 of type ACQUIRED Processing container_1430213948957_0001_01_15 of
[jira] [Updated] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2893: Attachment: (was: YARN-2893.004.patch) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518760#comment-14518760 ] Sangjin Lee commented on YARN-3044: --- It looks like some of the issues reported by the Jenkins build might be related to the patch? It would be great if you could look into them and see if we can resolve them. Some additional comments: (RMContainerEntity.java) - l.28-29: NM -> RM (TimelineServiceV2Publisher.java) - l.141: I would prefer an explicit entity.setQueue() over setting the info directly. Although it is currently equivalent, we should stick with the high-level methods we introduced, and that would stay robust even if we change how the queue is set. - l.147: how about using a simple for loop? - l.179: curious, we could add them to the entity as metrics, right? - l.300: unnecessary line? [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044-YARN-2928.004.patch, YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
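For the l.147 comment, a small illustration of the two loop styles under discussion; the map name and value type are stand-ins, not the actual TimelineServiceV2Publisher code. Both forms iterate the entry set; the enhanced for loop is simply easier to read.
{code}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class LoopStyles {
  public static void main(String[] args) {
    Map<String, Long> metrics = new HashMap<>();
    metrics.put("MEMORY_MB_SECONDS", 12288L);
    metrics.put("VCORE_SECONDS", 42L);

    // Explicit entry-set iterator (the style the patch reportedly uses).
    Iterator<Map.Entry<String, Long>> it = metrics.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> e = it.next();
      System.out.println(e.getKey() + " = " + e.getValue());
    }

    // Equivalent "simple for loop" suggested in the review.
    for (Map.Entry<String, Long> e : metrics.entrySet()) {
      System.out.println(e.getKey() + " = " + e.getValue());
    }
  }
}
{code}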
[jira] [Updated] (YARN-3484) Fix up yarn top shell code
[ https://issues.apache.org/jira/browse/YARN-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3484: Attachment: YARN-3484.002.patch bq. variables that are local to a function should be declared local. Fixed. bq. avoid using mixed case as per the shell programming guidelines Fixed. bq. yarnTopArgs is effectively a global. It should either get renamed to YARN_foo or another to not pollute the shell name space or another approach is process set_yarn_top_args as a subshell, reading its input directly to avoid the global entirely Fixed; renamed it to YARN_TOP_ARGS. bq. set_yarn_top_args should be hadoop_ something so as to not pollute the shell name space Fixed; changed the name to hadoop_set_yarn_top_args Fix up yarn top shell code -- Key: YARN-3484 URL: https://issues.apache.org/jira/browse/YARN-3484 Project: Hadoop YARN Issue Type: Bug Components: scripts Affects Versions: 3.0.0 Reporter: Allen Wittenauer Assignee: Varun Vasudev Attachments: YARN-3484.001.patch, YARN-3484.002.patch We need to do some work on yarn top's shell code. a) Just checking for TERM isn't good enough. We really need to check the return on tput, especially since the output will not be a number but an error string which will likely blow up the java code in horrible ways. b) All the single bracket tests should be double brackets to force the bash built-in. c) I'd think I'd rather see the shell portion in a function since it's rather large. This will allow for args, etc, to get local'ized and clean up the case statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3554: Attachment: YARN-3554-20150429-2.patch Hi [~jlowe], updating with 3 minutes as the timeout. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519568#comment-14519568 ] Hudson commented on YARN-3485: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2128 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2128/]) YARN-3485. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies. (kasha) (kasha: rev 8f82970e0c247b37b2bf33aa21f6a39afa07efde) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Fix For: 2.7.1 Attachments: yarn-3485-1.patch, yarn-3485-2.patch, yarn-3485-3.patch, yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519576#comment-14519576 ] Naganarasimha G R commented on YARN-3554: - Agree with you [~jlowe], but what do you feel the ideal timeout should be: 3 minutes or 5 minutes? Since you have more experience with large numbers of nodes and see frequent NM failures, perhaps you can suggest a better value here. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519603#comment-14519603 ] Jason Lowe commented on YARN-3554: -- I suggest we go with 3 minutes. The retry interval is 10 seconds, so we'll get plenty of retries in that time if the failure is fast (e.g.: unknown host, connection refused) and still get a few retries in if the failure is slow (e.g.: connection timeout). Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
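A sketch of the values being proposed above; the max-wait property name is quoted from the issue, while the retry-interval property name is the standard YARN client setting and is assumed here rather than taken from the patch. With a 10-second interval, a 3-minute cap still allows roughly 180 s / 10 s = 18 attempts for fast failures.
{code}
import org.apache.hadoop.conf.Configuration;

public class NmConnectDefaults {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Proposed cap: 3 minutes instead of the current 15-minute default.
    conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 3 * 60 * 1000L);   // 180000
    // Existing retry interval mentioned in the discussion (assumed key name).
    conf.setLong("yarn.client.nodemanager-connect.retry-interval-ms", 10 * 1000L); // 10000
    long maxWait = conf.getLong("yarn.client.nodemanager-connect.max-wait-ms", 0);
    long interval = conf.getLong("yarn.client.nodemanager-connect.retry-interval-ms", 1);
    System.out.println("approximate retries before giving up: " + (maxWait / interval)); // ~18
  }
}
{code}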
[jira] [Updated] (YARN-3362) Add node label usage in RM CapacityScheduler web UI
[ https://issues.apache.org/jira/browse/YARN-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3362: - Attachment: capacity-scheduler.xml YARN-3362.20150428-3-modified.patch Add node label usage in RM CapacityScheduler web UI --- Key: YARN-3362 URL: https://issues.apache.org/jira/browse/YARN-3362 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, webapp Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: CSWithLabelsView.png, Screen Shot 2015-04-29 at 11.42.17 AM.png, YARN-3362.20150428-3-modified.patch, YARN-3362.20150428-3.patch, capacity-scheduler.xml We don't have node label usage in the RM CapacityScheduler web UI now; without this, it is hard for users to understand what is happening on nodes that have labels assigned to them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3362) Add node label usage in RM CapacityScheduler web UI
[ https://issues.apache.org/jira/browse/YARN-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520074#comment-14520074 ] Wangda Tan commented on YARN-3362: -- bq. Well i understand that in the later patches we are targetting it more as partition than labels, but in that case shall i modify the same in other locations of WEB like node labels page, in CS page shall i mark it as Accessible Partitions ? Good point. I think we may need to keep it as "label" and do the renaming in a separate patch. bq. in CS page shall i mark it as Accessible Partitions We can keep calling it "label" to avoid confusion. bq. you mean if no node is mapped to cluster node label then not to show that Node Label ? In my mind, we show all node labels no matter whether they are mapped to nodes/queues or not. We can optimize this easily in the future; I prefer to keep the complete message before people post their comments. bq. you mean the existing names of metrics entries needs to be appended with (Partition=xxx) and not to show both right ? I think we need to show both (partition-specific and queue-general); the only change is to append (Node-Label=xxx). bq. Its great to hear its working fine, but it worked without any modifications to the patch ? Forgot to mention, I modified the patch a little bit and removed some of the avoid-displaying checks you mentioned at https://issues.apache.org/jira/browse/YARN-3362?focusedCommentId=14517364page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14517364. Uploading the modified patch as well as the CS config for you to test. Add node label usage in RM CapacityScheduler web UI --- Key: YARN-3362 URL: https://issues.apache.org/jira/browse/YARN-3362 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, webapp Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: CSWithLabelsView.png, Screen Shot 2015-04-29 at 11.42.17 AM.png, YARN-3362.20150428-3.patch We don't have node label usage in the RM CapacityScheduler web UI now; without this, it is hard for users to understand what is happening on nodes that have labels assigned to them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520115#comment-14520115 ] Li Lu commented on YARN-3411: - Thanks [~stack] for the quick info! Yes, let's go with HBase 1. We can figure out a solution for Phoenix later. In the worst case, we can rely on the snapshot version of Phoenix, which already works with HBase 1. [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520139#comment-14520139 ] Hadoop QA commented on YARN-2893: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 37s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 9m 6s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 9s | The applied patch generated 1 additional checkstyle issues. | | {color:green}+1{color} | install | 2m 0s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 34s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 51m 53s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 100m 3s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729253/YARN-2893.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 3dd6395 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7544/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7544/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7544/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7544/console | This message was automatically generated. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519848#comment-14519848 ] Vinod Kumar Vavilapalli commented on YARN-3561: --- Is this because the keep-containers flag is on? Why was the AM stopped and not the the app killed if that is what they want. Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Reporter: Gour Saha Priority: Critical Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes: *host-07* - where Slider AM was running *host-03* - where Storm NIMBUS container was running. *Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment. *NM log from host-07 where Slider AM container was running:* {noformat} 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021 2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users. 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021 2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING 2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING 2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/confdir/log4j-server.properties transitioned from
[jira] [Updated] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3521: -- Attachment: 0001-YARN-3521.patch Attaching an initial version. [~leftnoteasy], please review it: I have changed the method interfaces of *getClusterNodeLabels* and *addToClusterNodeLabels* to pass a *List<NodeLabelInfo>* argument. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we should make the same change on the REST API side to make them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
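A purely hypothetical illustration of the shape of this change; the class and method names below are stand-ins, not the actual YARN-3521 DAO classes. The point is that the REST API would return structured label objects carrying attributes such as exclusivity, instead of plain label-name strings, mirroring the NodeLabel change made for the CLI in YARN-3413.
{code}
import java.util.Arrays;
import java.util.List;

public class StructuredLabelsExample {
  // Stand-in for a structured label DTO: a name plus an exclusivity flag.
  static class NodeLabelInfo {
    final String name;
    final boolean exclusive;
    NodeLabelInfo(String name, boolean exclusive) { this.name = name; this.exclusive = exclusive; }
    @Override public String toString() { return name + " (exclusive=" + exclusive + ")"; }
  }

  // Before: the REST response was effectively a List<String> of label names.
  // After: structured objects, so clients can see per-label attributes.
  static List<NodeLabelInfo> getClusterNodeLabels() {
    return Arrays.asList(new NodeLabelInfo("gpu", true), new NodeLabelInfo("ssd", false));
  }

  public static void main(String[] args) {
    getClusterNodeLabels().forEach(System.out::println);
  }
}
{code}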
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519906#comment-14519906 ] Naganarasimha G R commented on YARN-3044: - Thanks for the review [~djp] [~sjlee0], bq. some of the issues reported by the jenkins build might be related to the patch? Some might be, but many (findbugs and test-case issues) are not related to this JIRA, hence I am planning to raise a separate JIRA to handle them. And some findbugs warnings (like the unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent) I am not planning to handle, as the code is the same as the earlier code and the checks don't make sense here. bq. l.147: how about using a simple for loop? Well, AFAIK it only affects readability here; I used the entry-set iterator as it is generally preferred in terms of performance and concurrency (not relevant here). If you feel readability is an issue then I can modify it to a simple loop :) bq. l.179: curious, we could add them to the entity as metrics, right? bq. So having them as events means that they should/will not be aggregated (e.g. from app -> flow). Is that the intent with these values (CPU and cores)? I'm not exactly clear what these values indicate. Well, initially I had the same thoughts as [~djp], but it might be required to be aggregated (e.g. from app -> flow) as its current value is already an aggregation over all containers. As mentioned earlier, planning to raise JIRAs for the following: # To enhance TestSystemMetricsPublisherForV2 to ensure that the test case verifies the published entity is populated as desired (similar to ATSv1). # To add an interface in TimelineClient to push application-specific configurations, as not all are captured as part of the RM. Please provide your opinion. Had one query: as earlier suggested by [~djp], where should the util class (package and class name) which converts the system entities to timeline entities and vice versa be added? Also, shall I handle this as part of this patch or the TestSystemMetricsPublisherForV2 enhancement patch? [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044-YARN-2928.004.patch, YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3362) Add node label usage in RM CapacityScheduler web UI
[ https://issues.apache.org/jira/browse/YARN-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519917#comment-14519917 ] Wangda Tan commented on YARN-3362: -- Hi Naga, Thanks for taking the initiative on this; I just tried to run the patch locally and it looks great! Some comments: 1) Show partition=partition-name for every partition; if the partition is the NO_LABEL partition, show that it is YARN.DEFAULT.PARTITION. 2) I think it's better to also show labels that are not accessible, especially for the non-exclusive node label case; we can optimize this in a future patch. This avoids people asking questions like "where is my label?". This includes all the existing avoid-displaying items in your current patch. But it's good to keep avoiding showing labels when there's no label in your cluster. 3) Show the partition for partition-specific queue metrics, which are: - Used Capacity: 0.0% - Absolute Used Capacity: 0.0% - Absolute Capacity: 50.0% - Absolute Max Capacity: 100.0% - Configured Capacity: 50.0% - Configured Max Capacity: 100.0% I suggest adding a (Partition=xxx) at the end of these metrics. I attached the queue hierarchy shown in my local cluster: https://issues.apache.org/jira/secure/attachment/12729256/Screen%20Shot%202015-04-29%20at%2011.42.17%20AM.png. It seems the multi-level hierarchy works well in my environment. Add node label usage in RM CapacityScheduler web UI --- Key: YARN-3362 URL: https://issues.apache.org/jira/browse/YARN-3362 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, webapp Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: CSWithLabelsView.png, Screen Shot 2015-04-29 at 11.42.17 AM.png, YARN-3362.20150428-3.patch We don't have node label usage in the RM CapacityScheduler web UI now; without this, it is hard for users to understand what is happening on nodes that have labels assigned to them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2893: Attachment: YARN-2893.004.patch AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2619) NodeManager: Add cgroups support for disk I/O isolation
[ https://issues.apache.org/jira/browse/YARN-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519814#comment-14519814 ] Sidharta Seethana commented on YARN-2619: - [~vvasudev] , thanks for refactoring the test to be cleaner. The corresponding changes seem good to me. NodeManager: Add cgroups support for disk I/O isolation --- Key: YARN-2619 URL: https://issues.apache.org/jira/browse/YARN-2619 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2619-1.patch, YARN-2619.002.patch, YARN-2619.003.patch, YARN-2619.004.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519840#comment-14519840 ] Hadoop QA commented on YARN-3554: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 34s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 30s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 7m 35s | There were no new checkstyle issues. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 46s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 30s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | | | 46m 59s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729227/YARN-3554-20150429-2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8f82970 | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7543/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7543/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7543/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7543/console | This message was automatically generated. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519858#comment-14519858 ] stack commented on YARN-3411: - bq. So there are some major changes between hbase 0.98 and hbase like the client facing APIs (HTableInterface, etc) have been deprecated and replaced with new interfaces. It would be a pity if you fellas were stuck on the 0.98 APIs. Phoenix is shaping up to do an RC that will work w/ hbase 1.x. [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3271) FairScheduler: Move tests related to max-runnable-apps from TestFairScheduler to TestAppRunnability
[ https://issues.apache.org/jira/browse/YARN-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519861#comment-14519861 ] Karthik Kambatla commented on YARN-3271: Thanks for working on this, [~nijel]. While at this, can we improve how we initialize the scheduler in {{TestAppRunnability#setUp}} as below? {code} Configuration conf = createConfiguration(); resourceManager = new MockRM(conf); resourceManager.start(); scheduler = (FairScheduler) resourceManager.getResourceScheduler(); {code} FairScheduler: Move tests related to max-runnable-apps from TestFairScheduler to TestAppRunnability --- Key: YARN-3271 URL: https://issues.apache.org/jira/browse/YARN-3271 Project: Hadoop YARN Issue Type: Improvement Reporter: Karthik Kambatla Assignee: nijel Attachments: YARN-3271.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
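For illustration, here is a minimal sketch of how the initialization suggested above could sit inside the test class, with a matching teardown. The JUnit @Before/@After structure and the local createConfiguration() placeholder are assumptions for the sketch, not part of the attached patch.
{code}
// Editorial sketch only; field and method placement are assumed, not taken from the patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;
import org.junit.After;
import org.junit.Before;

public class TestAppRunnabilitySetupSketch {
  private MockRM resourceManager;
  private FairScheduler scheduler;

  @Before
  public void setUp() {
    Configuration conf = createConfiguration();   // FairScheduler-specific test configuration
    resourceManager = new MockRM(conf);           // MockRM wires up the scheduler for us
    resourceManager.start();
    scheduler = (FairScheduler) resourceManager.getResourceScheduler();
  }

  @After
  public void tearDown() {
    if (resourceManager != null) {
      resourceManager.stop();                     // release RM resources between tests
    }
  }

  private Configuration createConfiguration() {
    // Placeholder so the sketch is self-contained; the real test would reuse the
    // FairScheduler test base class helper of the same name.
    return new Configuration();
  }
}
{code}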
[jira] [Updated] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-3561: Environment: debian 7 Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Environment: debian 7 Reporter: Gour Saha Priority: Critical Fix For: 2.6.1 Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes: *host-07* - where Slider AM was running *host-03* - where Storm NIMBUS container was running. *Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment. *NM log from host-07 where Slider AM container was running:* {noformat} 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021 2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users. 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021 2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING 2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING 2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/confdir/log4j-server.properties transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) -
[jira] [Updated] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2893: Attachment: (was: YARN-2893.004.patch) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3362) Add node label usage in RM CapacityScheduler web UI
[ https://issues.apache.org/jira/browse/YARN-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519960#comment-14519960 ] Naganarasimha G R commented on YARN-3362: - Thanks [~wangda], for reviewing and testing the patch. bq. partition=partition-name Well, I understand that in the later patches we are treating it more as a partition than as labels, but in that case shall I make the same change in other locations of the web UI, like the node labels page, and in the CS page shall I mark it as Accessible Partitions? bq. But it's good to keep avoiding showing labels when there are no labels in the cluster. Do you mean that if no node is mapped to a cluster node label, then we should not show that node label? bq. Show the partition for partition-specific queue metrics Do you mean that the existing metric entry names need to be appended with (Partition=xxx), and we should not show both, right? bq. The multi-level hierarchy seems to work well in my environment. It's great to hear it's working fine, but did it work without any modifications to the patch? If so, can you share your cluster setup (topology) and CS configuration offline, so that I can test it further. Add node label usage in RM CapacityScheduler web UI --- Key: YARN-3362 URL: https://issues.apache.org/jira/browse/YARN-3362 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, webapp Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: CSWithLabelsView.png, Screen Shot 2015-04-29 at 11.42.17 AM.png, YARN-3362.20150428-3.patch We don't have node label usage in the RM CapacityScheduler web UI now; without this, it is hard for users to understand what is happening on nodes that have labels assigned to them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519792#comment-14519792 ] Sangjin Lee commented on YARN-3044: --- Yes, that's kind of what I'm wondering about. So having them as events means that they should/will not be aggregated (e.g. from app to flow). Is that the intent with these values (CPU and cores)? I'm not exactly clear on what these values indicate. [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044-YARN-2928.004.patch, YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-3561: Fix Version/s: 2.6.1 Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Reporter: Gour Saha Priority: Critical Fix For: 2.6.1 Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes: *host-07* - where Slider AM was running *host-03* - where Storm NIMBUS container was running. *Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment. *NM log from host-07 where Slider AM container was running:* {noformat} 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021 2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users. 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021 2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING 2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING 2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/confdir/log4j-server.properties transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource
[jira] [Updated] (YARN-3362) Add node label usage in RM CapacityScheduler web UI
[ https://issues.apache.org/jira/browse/YARN-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3362: - Attachment: Screen Shot 2015-04-29 at 11.42.17 AM.png Add node label usage in RM CapacityScheduler web UI --- Key: YARN-3362 URL: https://issues.apache.org/jira/browse/YARN-3362 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, webapp Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: CSWithLabelsView.png, Screen Shot 2015-04-29 at 11.42.17 AM.png, YARN-3362.20150428-3.patch We don't have node label usage in the RM CapacityScheduler web UI now; without this, it is hard for users to understand what is happening on nodes that have labels assigned to them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3544) AM logs link missing in the RM UI for a completed app
[ https://issues.apache.org/jira/browse/YARN-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519964#comment-14519964 ] Xuan Gong commented on YARN-3544: - Originally, we called getContainerReport to get the AM container information (such as the container log url, nm address, startTime, etc). That works fine while the application and its container are running. But when the application is finished, we do not keep the finished container info, so we cannot get a finished container report from the RM. That is why we see the AM logs link in the web UI as N/A, along with other related attempt information. In this patch, instead of querying the container report, we get the attempt (AM container) information directly from AttemptInfo, which comes from RMAttempt. So whether the application is running or finished, we can get the related information and show it in the web UI. AM logs link missing in the RM UI for a completed app -- Key: YARN-3544 URL: https://issues.apache.org/jira/browse/YARN-3544 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.7.0 Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Attachments: Screen Shot 2015-04-27 at 6.24.05 PM.png, YARN-3544.1.patch AM log links should always be present (for both running and completed apps). Likewise, the node info is also empty. This is usually quite crucial when trying to debug where an AM was launched and which NM's logs to look at if the AM failed to launch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
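To illustrate the approach described in the comment above, the AM log link can be derived from the attempt's master container rather than from a container report. The helper below is a sketch only; the log-link pattern and the exact accessors used are assumptions, not the actual patch.
{code}
// Editorial sketch: build the AM logs link from RMAppAttempt data instead of a
// (possibly purged) container report. Names and URL pattern are illustrative.
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttempt;

public class AmLogsLinkSketch {
  public static String amLogsLink(RMAppAttempt attempt, String user) {
    Container amContainer = attempt.getMasterContainer();
    if (amContainer == null) {
      return "N/A";                                // the attempt never got an AM container
    }
    // NM web UI log location, assembled from the node HTTP address and container id.
    return "http://" + amContainer.getNodeHttpAddress()
        + "/node/containerlogs/" + amContainer.getId() + "/" + user;
  }
}
{code}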
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519911#comment-14519911 ] Gour Saha commented on YARN-3561: - Slider stop command was called which initiates the Slider Storm application to stop (and hence the Slider AM to stop). Which property sets the keep-containers flag on? Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Environment: debian 7 Reporter: Gour Saha Priority: Critical Fix For: 2.6.1 Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes: *host-07* - where Slider AM was running *host-03* - where Storm NIMBUS container was running. *Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment. *NM log from host-07 where Slider AM container was running:* {noformat} 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021 2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users. 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021 2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING 2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING 2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource
[jira] [Created] (YARN-3562) unit tests fail with the failure to bring up node manager
Sangjin Lee created YARN-3562: - Summary: unit tests fail with the failure to bring up node manager Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Priority: Minor A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
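The usual remedy for this kind of fixed-port collision in tests is to bind to port 0 so the OS picks a free ephemeral port, and then read the chosen port back. The snippet below is a generic illustration of that idea, not the NMCollectorService change itself.
{code}
// Generic sketch: bind to an ephemeral port instead of a hard-coded one.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class EphemeralPortExample {
  public static void main(String[] args) throws IOException {
    try (ServerSocket socket = new ServerSocket()) {
      socket.bind(new InetSocketAddress("localhost", 0)); // port 0 = let the OS choose
      int actualPort = socket.getLocalPort();             // the port tests should report/use
      System.out.println("Bound to ephemeral port " + actualPort);
    }
  }
}
{code}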
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520226#comment-14520226 ] Wangda Tan commented on YARN-3521: -- Hi Sunil, Thanks for working on this, some comments: NodelabelsInfo: (It should be NodeLabelInfo, right?) - nodeLabelName: no need to call {{new String()}} since it will always be initialized, and I prefer to call it name - nodeLabelExclusivity: rename to exclusivity, and the getter as well - Setters are not used by anybody and could be removed - I'm not sure if you need to add an empty constructor marked {{// JAXB needs this}} like the other info classes? - Could add a constructor of NodeLabelsInfo that receives a NodeLabel, which will be used by RMWebServices - We may need to add a separate NodeLabelsInfo that contains an ArrayList of NodeLabelInfo NodeToLabelsInfo: rename to NodeToLabelNames addToClusterNodeLabels now receives a Set as a parameter; I'm not sure if it works, could you add a test to verify add/get node labels? TestRMWebServicesNodeLabels will now fail Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we should make the same change on the REST API side to keep them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
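To make the review comments above concrete, here is a hypothetical sketch of what such a JAXB-friendly info class could look like, assuming NodeLabel exposes getName() and isExclusive(); the class is illustrative only and is not the attached patch.
{code}
// Illustrative sketch of a NodeLabelInfo along the lines of the review comments.
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;
import org.apache.hadoop.yarn.api.records.NodeLabel;

@XmlRootElement(name = "nodeLabelInfo")
@XmlAccessorType(XmlAccessType.FIELD)
public class NodeLabelInfo {
  private String name;          // label name; no need to pre-initialize with new String()
  private boolean exclusivity;  // whether the label/partition is exclusive

  public NodeLabelInfo() {
    // JAXB needs this
  }

  public NodeLabelInfo(NodeLabel label) {
    // Convenience constructor for use by RMWebServices, as suggested above.
    this.name = label.getName();
    this.exclusivity = label.isExclusive();
  }

  public String getName() {
    return name;
  }

  public boolean getExclusivity() {
    return exclusivity;
  }
}
{code}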
[jira] [Updated] (YARN-3563) Completed app shows -1 running containers on RM web UI
[ https://issues.apache.org/jira/browse/YARN-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3563: -- Attachment: Screen Shot 2015-04-29 at 2.11.19 PM.png Completed app shows -1 running containers on RM web UI -- Key: YARN-3563 URL: https://issues.apache.org/jira/browse/YARN-3563 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Reporter: Zhijie Shen Attachments: Screen Shot 2015-04-29 at 2.11.19 PM.png See the attached screenshot. I saw this issue with trunk. Not sure if it exists in branch-2.7 too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3563) Completed app shows -1 running containers on RM web UI
[ https://issues.apache.org/jira/browse/YARN-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3563: -- Component/s: webapp resourcemanager Completed app shows -1 running containers on RM web UI -- Key: YARN-3563 URL: https://issues.apache.org/jira/browse/YARN-3563 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Reporter: Zhijie Shen Attachments: Screen Shot 2015-04-29 at 2.11.19 PM.png See the attached screenshot. I saw this issue with trunk. Not sure if it exists in branch-2.7 too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3563) Completed app shows -1 running containers on RM web UI
Zhijie Shen created YARN-3563: - Summary: Completed app shows -1 running containers on RM web UI Key: YARN-3563 URL: https://issues.apache.org/jira/browse/YARN-3563 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen See the attached screenshot. I saw this issue with trunk. Not sure if it exists in branch-2.7 too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3551) Consolidate data model change according to the backend implementation
[ https://issues.apache.org/jira/browse/YARN-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520295#comment-14520295 ] Sangjin Lee commented on YARN-3551: --- I'm fine with using GenericObjectMapper for the serialization/deserialization of the appropriate types. The generics suggestion is mostly about strengthening the types on the user side of things, so it may not be critical. Consolidate data model change according to the backend implementation - Key: YARN-3551 URL: https://issues.apache.org/jira/browse/YARN-3551 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3551.1.patch, YARN-3551.2.patch, YARN-3551.3.patch Based on the comments on [YARN-3134|https://issues.apache.org/jira/browse/YARN-3134?focusedCommentId=14512080page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14512080] and [YARN-3411|https://issues.apache.org/jira/browse/YARN-3411?focusedCommentId=14512098page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14512098], we need to change the data model to restrict the data types of the info/config/metric sections. 1. Info: the value could be any kind of object that can be serialized/deserialized by Jackson. 2. Config: the value will always be assumed to be a String. 3. Metric: single data points or time series values have to be numbers for aggregation. Other than that, the info/start time/finish time of a metric do not seem necessary for storage. They should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
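A minimal sketch of the restricted data model described in the issue, using illustrative names rather than the actual timeline entity API: info values are arbitrary Jackson-serializable objects, config values are plain strings, and metric values are numbers keyed by timestamp so they can be aggregated.
{code}
// Illustrative data holder mirroring the three restrictions above; not project code.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RestrictedEntityData {
  // 1. Info: values may be any object Jackson can serialize/deserialize.
  private final Map<String, Object> info = new HashMap<String, Object>();
  // 2. Config: values are always treated as plain strings.
  private final Map<String, String> configs = new HashMap<String, String>();
  // 3. Metric: single values or a time series, restricted to numbers, keyed by timestamp.
  private final Map<String, TreeMap<Long, Number>> metrics =
      new HashMap<String, TreeMap<Long, Number>>();

  public void addInfo(String key, Object value) {
    info.put(key, value);
  }

  public void addConfig(String key, String value) {
    configs.put(key, value);
  }

  public void addMetric(String id, long timestamp, Number value) {
    TreeMap<Long, Number> series = metrics.get(id);
    if (series == null) {
      series = new TreeMap<Long, Number>();
      metrics.put(id, series);
    }
    series.put(timestamp, value);
  }
}
{code}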
[jira] [Resolved] (YARN-1876) Document the REST APIs of timeline and generic history services
[ https://issues.apache.org/jira/browse/YARN-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1876. --- Resolution: Duplicate Duplicate is the right resolution. Document the REST APIs of timeline and generic history services --- Key: YARN-1876 URL: https://issues.apache.org/jira/browse/YARN-1876 Project: Hadoop YARN Issue Type: Improvement Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: documentaion Attachments: YARN-1876.1.patch, YARN-1876.2.patch, YARN-1876.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled
[ https://issues.apache.org/jira/browse/YARN-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520293#comment-14520293 ] Jian He commented on YARN-3533: --- patch looks good to me, thanks [~adhoot] ! hopefully this can resolve some intermittent failures we've seen recently. Test: Fix launchAM in MockRM to wait for attempt to be scheduled Key: YARN-3533 URL: https://issues.apache.org/jira/browse/YARN-3533 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.6.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3533.001.patch MockRM#launchAM fails in many test runs because it does not wait for the app attempt to be scheduled before NM update is sent as noted in [recent builds|https://issues.apache.org/jira/browse/YARN-3387?focusedCommentId=14507255page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14507255] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-1876) Document the REST APIs of timeline and generic history services
[ https://issues.apache.org/jira/browse/YARN-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reopened YARN-1876: --- Document the REST APIs of timeline and generic history services --- Key: YARN-1876 URL: https://issues.apache.org/jira/browse/YARN-1876 Project: Hadoop YARN Issue Type: Improvement Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: documentaion Attachments: YARN-1876.1.patch, YARN-1876.2.patch, YARN-1876.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520312#comment-14520312 ] Vinod Kumar Vavilapalli commented on YARN-3539: --- bq. So I'm not sure if it's good timeline now, as we foresee in the near future, we're going to be upgraded to ATS v2, which may significantly refurnish the APIs. How about we simply say that people can continue to run the v1 Timeline Service (Single server backed by LevelDB) beyond Timeline Service next-gen? That way, older installations and apps can continue to use the old APIs, and the new APIs do not need to take the unknown burden of making the old APIs work on the newer framework. Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, YARN-3539-003.patch, YARN-3539-004.patch The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3564) TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly
Jian He created YARN-3564: - Summary: TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly Key: YARN-3564 URL: https://issues.apache.org/jira/browse/YARN-3564 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2868) FairScheduler: Metric for latency to allocate first container for an application
[ https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520196#comment-14520196 ] Vinod Kumar Vavilapalli commented on YARN-2868: --- Going through old tickets. I have two questions: # Why was this done in a scheduler-specific way? RMAppAttempt clearly knows when it requests the allocation and when it gets it. # It seems the patch only looks at the first AM container. What happens if we have a 2nd AM container? I accidentally closed this ticket, so it doesn't look like I can reopen it. If folks agree, I will open a new ticket. FairScheduler: Metric for latency to allocate first container for an application Key: YARN-2868 URL: https://issues.apache.org/jira/browse/YARN-2868 Project: Hadoop YARN Issue Type: Improvement Reporter: Ray Chiang Assignee: Ray Chiang Labels: metrics, supportability Fix For: 2.8.0 Attachments: YARN-2868-01.patch, YARN-2868.002.patch, YARN-2868.003.patch, YARN-2868.004.patch, YARN-2868.005.patch, YARN-2868.006.patch, YARN-2868.007.patch, YARN-2868.008.patch, YARN-2868.009.patch, YARN-2868.010.patch, YARN-2868.011.patch, YARN-2868.012.patch Add a metric to measure the latency between starting container allocation and the first container actually being allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
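A scheduler-agnostic version of the measurement described above could simply record both timestamps on the attempt and publish the difference. The sketch below uses hypothetical names and is not the committed patch.
{code}
// Illustrative latency tracker: record request and allocation times, report the delta.
public class AmAllocationLatencyTracker {
  private long requestTimeMs = -1;
  private long allocationTimeMs = -1;

  public void onAmContainerRequested() {
    requestTimeMs = System.currentTimeMillis();
  }

  public void onAmContainerAllocated() {
    allocationTimeMs = System.currentTimeMillis();
  }

  /** @return latency in milliseconds, or -1 if the allocation has not happened yet. */
  public long getAmAllocationLatencyMs() {
    if (requestTimeMs < 0 || allocationTimeMs < 0) {
      return -1;
    }
    return allocationTimeMs - requestTimeMs;
  }
}
{code}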
[jira] [Updated] (YARN-1317) Make Queue, QueueACLs and QueueMetrics first class citizens in YARN
[ https://issues.apache.org/jira/browse/YARN-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1317: -- Target Version/s: 2.8.0 (was: ) I'd like to at the least get some of this done in the 2.8 time-frame.. Make Queue, QueueACLs and QueueMetrics first class citizens in YARN --- Key: YARN-1317 URL: https://issues.apache.org/jira/browse/YARN-1317 Project: Hadoop YARN Issue Type: Improvement Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Today, we are duplicating the exact same code in all the schedulers. Queue is a top class concept - clientService, web-services etc already recognize queue as a top level concept. We need to move Queue, QueueMetrics and QueueACLs to be top level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520214#comment-14520214 ] Sangjin Lee commented on YARN-3051: --- {quote} My major concern about this proposal is compatibility. Previously in v1, timeline entity is globally unique, such that when fetching a single entity before, users only need to provide entity type, entity id. app id, entity type, entity id is required to locate one entity, and theoretically null, entity type, entity id will refer to multiple entities. It probably makes difficult to be compatible to existing use cases. {quote} To hash out that point, existing use cases which previously assumed that entity id was globally unique would continue to generate entity id's that are globally unique, right? Since existing use cases (w/o modification) would stick to globally unique entity id's in practice, redefining the uniqueness requirement to be in the scope of application should not impact existing use cases. Entity id's that are generated to be unique globally would trivially be unique within the application scope. The point here is that since this is in the direction of relaxing uniqueness, stricter use cases (existing use cases) should not be impacted. Let me know your thoughts. IMO, stating that the entity id's are unique within the scope of applications is not an invitation for frameworks to generate tons of redundant entity id's. Frameworks (MR, tez, ...) would likely continue to generate entity id's that are practically unique globally anyway. But the part of the timeline service, we don't have to have checks for enforcing global uniqueness. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
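Put differently, with application-scoped uniqueness an entity is addressed by the triple (app id, entity type, entity id), so two applications may reuse the same entity id without clashing. The key class below is a minimal illustration of that scoping, not the actual reader API.
{code}
// Illustrative composite key: uniqueness is enforced per application, not globally.
import java.util.Objects;

public final class EntityKey {
  private final String appId;
  private final String entityType;
  private final String entityId;

  public EntityKey(String appId, String entityType, String entityId) {
    this.appId = appId;
    this.entityType = entityType;
    this.entityId = entityId;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof EntityKey)) {
      return false;
    }
    EntityKey other = (EntityKey) o;
    return appId.equals(other.appId)
        && entityType.equals(other.entityType)
        && entityId.equals(other.entityId);
  }

  @Override
  public int hashCode() {
    return Objects.hash(appId, entityType, entityId);
  }
}
{code}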
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520319#comment-14520319 ] Vinod Kumar Vavilapalli commented on YARN-3539: --- In a way, I am saying that there will be v1 end-points and v2 end-points. V1 end-points go to the old Timeline Service and V2 end-points go to the next-gen Timeline Service. Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, YARN-3539-003.patch, YARN-3539-004.patch The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3477) TimelineClientImpl swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520328#comment-14520328 ] Vinod Kumar Vavilapalli commented on YARN-3477: --- This looks good to me. [~zjshen], can you look and do the honors? TimelineClientImpl swallows exceptions -- Key: YARN-3477 URL: https://issues.apache.org/jira/browse/YARN-3477 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0, 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-3477-001.patch, YARN-3477-002.patch If the timeline client fails more than the retry count, the original exception is not thrown. Instead, some runtime exception is raised saying the retries have run out. # The failing exception should be rethrown, ideally via NetUtils.wrapException, to include the URL of the failing endpoint # Otherwise, the raised RTE should (a) state that URL and (b) set the original fault as the inner cause -- This message was sent by Atlassian JIRA (v6.3.4#6332)
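The behavior the issue asks for can be illustrated with a generic retry wrapper that remembers the last failure and rethrows it with the failing endpoint attached as context. The names below are hypothetical; this is not TimelineClientImpl's code.
{code}
// Generic sketch: keep the original fault and surface it with the endpoint,
// instead of throwing a bare "retries run out" RuntimeException.
import java.io.IOException;
import java.net.URI;

public final class RetryWithCause {
  public interface Call<T> {
    T run() throws IOException;
  }

  public static <T> T callWithRetries(URI endpoint, int maxRetries, Call<T> call)
      throws IOException {
    IOException lastFailure = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.run();
      } catch (IOException e) {
        lastFailure = e;                       // remember the original fault
      }
    }
    // Report both the endpoint and the underlying cause to the caller.
    throw new IOException("Failed to reach " + endpoint + " after "
        + (maxRetries + 1) + " attempts", lastFailure);
  }
}
{code}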
[jira] [Commented] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled
[ https://issues.apache.org/jira/browse/YARN-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520339#comment-14520339 ] Jian He commented on YARN-3533: --- bq. getApplicationAttempt seems confusing, I just opened https://issues.apache.org/jira/browse/YARN-3546 to discuss this I replied on the jira. The TestContainerAllocation failure is unrelated to this patch. opening a new jira to fix that. committing this. Test: Fix launchAM in MockRM to wait for attempt to be scheduled Key: YARN-3533 URL: https://issues.apache.org/jira/browse/YARN-3533 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.6.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3533.001.patch MockRM#launchAM fails in many test runs because it does not wait for the app attempt to be scheduled before NM update is sent as noted in [recent builds|https://issues.apache.org/jira/browse/YARN-3387?focusedCommentId=14507255page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14507255] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520252#comment-14520252 ] Thomas Graves commented on YARN-3517: - changes look good, +1. thanks [~vvasudev] RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Thomas Graves Priority: Blocker Labels: security Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, YARN-3517.006.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520322#comment-14520322 ] Sangjin Lee commented on YARN-3044: --- {quote} Some might be but many(findbugs and testcase) are not related to this jira, hence planning to raise seperate jira to handle the same. And some findbugs (like Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent ) not planning to handle as its same as earlier code if checks doesnt make sense here {quote} Understood. We should try to resolve the ones that make sense but don't have to be pedantic. By the way, note that I filed a separate JIRA for the unit test issues that already exist on YARN-2928 (YARN-3562). {quote} Well AFAIK it only affects readability here and had taken entry set iterator here as its generally preferred in terms of performance and concurrency (not relevance here). If you feel readability is a issue then can modify to simple loop {quote} That's fine. It was a style nit (if that wasn't clear). [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044-YARN-2928.004.patch, YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-3517: - Assignee: Varun Vasudev (was: Thomas Graves) Seems like the JIRA assignee got mixed up, fixing.. RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Labels: security Fix For: 2.8.0 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, YARN-3517.006.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520332#comment-14520332 ] Sangjin Lee commented on YARN-3045: --- Hi [~Naganarasimha], I do have one quick question on the naming. I see a lot of names that include metrics, such as NMMetricsPublisher, NMMetricsEvent, NMMetricsEventType, and so on. And yet, they don't seem to involve metrics in the sense of timeline metrics. This is a source of confusion to me. Do we need metrics in these? They seem to be capturing purely lifecycle events. Could we change them to better names? [Event producers] Implement NM writing container lifecycle events to ATS Key: YARN-3045 URL: https://issues.apache.org/jira/browse/YARN-3045 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3045.20150420-1.patch Per design in YARN-2928, implement NM writing container lifecycle events and container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3406) Display count of running containers in the RM's Web UI
[ https://issues.apache.org/jira/browse/YARN-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520273#comment-14520273 ] Zhijie Shen commented on YARN-3406: --- The web UI seems to have a bug: YARN-3563 Display count of running containers in the RM's Web UI -- Key: YARN-3406 URL: https://issues.apache.org/jira/browse/YARN-3406 Project: Hadoop YARN Issue Type: Improvement Reporter: Ryu Kobayashi Assignee: Ryu Kobayashi Priority: Minor Fix For: 2.8.0 Attachments: YARN-3406.1.patch, YARN-3406.2.patch, screenshot.png, screenshot2.png Display the running containers in the All Applications list. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3546) AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it
[ https://issues.apache.org/jira/browse/YARN-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520292#comment-14520292 ] Jian He commented on YARN-3546: --- [~sandflee], inside the scheduler, every application only has one attempt, so the current attempt is the attempt corresponding to the appAttemptId. So the name 'getAppAttempt(attemptId)' matches the internal implementation. If you agree, we can close this jira. AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it - Key: YARN-3546 URL: https://issues.apache.org/jira/browse/YARN-3546 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: sandflee I'm not familiar with the scheduler; at first glance, I thought this function returns the schedulerAppAttempt info corresponding to the appAttemptId, but actually it returns the current schedulerAppAttempt. It seems to have misled others too, such as TestWorkPreservingRMRestart.waitForNumContainersToRecover and MockRM.waitForSchedulerAppAttemptAdded. Should I rename it to T getCurrentSchedulerApplicationAttempt(ApplicationId applicationid), or return null if the current attempt id does not equal the requested attempt id? Comments preferred! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
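For reference, the second option in the description (return null when the requested attempt is not the current one) might look roughly like the sketch below; the class and method names are illustrative, not scheduler code.
{code}
// Illustrative lookup: return the current attempt only if it matches the request.
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt;

public class CurrentAttemptLookupSketch {
  public static SchedulerApplicationAttempt getIfCurrent(
      Map<ApplicationId,
          ? extends SchedulerApplication<? extends SchedulerApplicationAttempt>> apps,
      ApplicationAttemptId requestedAttemptId) {
    SchedulerApplication<? extends SchedulerApplicationAttempt> app =
        apps.get(requestedAttemptId.getApplicationId());
    if (app == null) {
      return null;                                   // unknown application
    }
    SchedulerApplicationAttempt current = app.getCurrentAppAttempt();
    if (current == null
        || !current.getApplicationAttemptId().equals(requestedAttemptId)) {
      return null;                                   // requested attempt is not the live one
    }
    return current;
  }
}
{code}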
[jira] [Commented] (YARN-3563) Completed app shows -1 running containers on RM web UI
[ https://issues.apache.org/jira/browse/YARN-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520317#comment-14520317 ] Jason Lowe commented on YARN-3563: -- This sounds closely related to, if not a duplicate of, YARN-3552. Completed app shows -1 running containers on RM web UI -- Key: YARN-3563 URL: https://issues.apache.org/jira/browse/YARN-3563 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Reporter: Zhijie Shen Attachments: Screen Shot 2015-04-29 at 2.11.19 PM.png See the attached screenshot. I saw this issue with trunk. Not sure if it exists in branch-2.7 too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled
[ https://issues.apache.org/jira/browse/YARN-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520346#comment-14520346 ] Jian He commented on YARN-3533: --- committed to trunk and branch-2, thanks Anubhav ! Thanks [~sandflee], [~rohithsharma] for the review ! Test: Fix launchAM in MockRM to wait for attempt to be scheduled Key: YARN-3533 URL: https://issues.apache.org/jira/browse/YARN-3533 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.6.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3533.001.patch MockRM#launchAM fails in many test runs because it does not wait for the app attempt to be scheduled before NM update is sent as noted in [recent builds|https://issues.apache.org/jira/browse/YARN-3387?focusedCommentId=14507255page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14507255] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3564) TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly
[ https://issues.apache.org/jira/browse/YARN-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3564: -- Description: the test fails intermittently in jenkins https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/ (was: https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/) TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly --- Key: YARN-3564 URL: https://issues.apache.org/jira/browse/YARN-3564 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He the test fails intermittently in jenkins https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3564) TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly
[ https://issues.apache.org/jira/browse/YARN-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3564: -- Description: https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/ TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly --- Key: YARN-3564 URL: https://issues.apache.org/jira/browse/YARN-3564 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520355#comment-14520355 ] Thomas Graves commented on YARN-3517: - thanks [~vinodkv] I missed that. RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Labels: security Fix For: 2.8.0 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, YARN-3517.006.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520291#comment-14520291 ] Li Lu commented on YARN-3411: - Hi [~vrushalic] [~zjshen], just a quick thing to confirm: we want to use byte arrays for the config and info fields in both of our storage implementations. I'll convert the type for config and info in the Phoenix implementation to VARBINARY to be consistent with this design. [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
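As a generic illustration of keeping config/info values in a VARBINARY column (not the actual HBase or Phoenix schema code), the maps can be serialized to and from byte arrays, for example with Jackson; the helper names here are hypothetical.
{code}
// Generic sketch: encode a config/info map as bytes for a VARBINARY column and back.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.Map;

public class BytesCodecSketch {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static byte[] toBytes(Map<String, Object> values) throws IOException {
    return MAPPER.writeValueAsBytes(values);   // JSON-encode the map as a byte array
  }

  @SuppressWarnings("unchecked")
  public static Map<String, Object> fromBytes(byte[] raw) throws IOException {
    return MAPPER.readValue(raw, Map.class);   // decode the byte array back into a map
  }
}
{code}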
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520308#comment-14520308 ] Hudson commented on YARN-3517: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7701 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7701/]) YARN-3517. RM web ui for dumping scheduler logs should be for admins only (Varun Vasudev via tgraves) (tgraves: rev 2e215484bd05cd5e3b7a81d3558c6879a05dd2d2) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/security/ApplicationACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/CHANGES.txt RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Thomas Graves Priority: Blocker Labels: security Fix For: 2.8.0 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, YARN-3517.006.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3544) AM logs link missing in the RM UI for a completed app
[ https://issues.apache.org/jira/browse/YARN-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520378#comment-14520378 ] Hitesh Shah commented on YARN-3544: --- Doesn't the NM log link redirect to the log server after the logs have been aggregated? AM logs link missing in the RM UI for a completed app -- Key: YARN-3544 URL: https://issues.apache.org/jira/browse/YARN-3544 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.7.0 Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Attachments: Screen Shot 2015-04-27 at 6.24.05 PM.png, YARN-3544.1.patch AM log links should always be present ( for both running and completed apps). Likewise node info is also empty. This is usually quite crucial when trying to debug where an AM was launched and a pointer to which NM's logs to look at if the AM failed to launch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3564) TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly
[ https://issues.apache.org/jira/browse/YARN-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520506#comment-14520506 ] Hadoop QA commented on YARN-3564: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 5m 9s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 29s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 19s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 5m 26s | There were no new checkstyle issues. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 30s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 52m 15s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 73m 55s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729320/YARN-3564.1.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / 4c1af15 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7545/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7545/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7545/console | This message was automatically generated. TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly --- Key: YARN-3564 URL: https://issues.apache.org/jira/browse/YARN-3564 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3564.1.patch the test fails intermittently in jenkins https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: YARN-3134-YARN-2928.001.patch [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify how our implementation reads/writes data from/to HBase, and we can easily build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3534) Report node resource utilization
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520574#comment-14520574 ] Karthik Kambatla commented on YARN-3534: I notice the patch tries to provide utilization information in bytes for memory and float for CPU. Since the RM schedules in MB and vcores, seeing the utilization as rounded-up values in a Resource object is probably enough. Report node resource utilization Key: YARN-3534 URL: https://issues.apache.org/jira/browse/YARN-3534 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Assignee: Inigo Goiri Attachments: YARN-3534-1.patch, YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch Original Estimate: 336h Remaining Estimate: 336h YARN should be aware of the resource utilization of the nodes when scheduling containers. For this, this task will implement the NodeResourceMonitor and send this information to the Resource Manager in the heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3544) AM logs link missing in the RM UI for a completed app
[ https://issues.apache.org/jira/browse/YARN-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520365#comment-14520365 ] Zhijie Shen commented on YARN-3544: --- Xuan, thanks for the patch. I've tried your patch locally, and it brought the content back to the web UI. However, I've one concern. It seems that the link to the local log on NM is not useful after the app is finished, because the log is not supposed to be there any longer. So is this jira supposed to fix the regression, or ultimately provide a useful link to AM container log? Those seem to be different goals. /cc [~hitesh] AM logs link missing in the RM UI for a completed app -- Key: YARN-3544 URL: https://issues.apache.org/jira/browse/YARN-3544 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.7.0 Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Attachments: Screen Shot 2015-04-27 at 6.24.05 PM.png, YARN-3544.1.patch AM log links should always be present ( for both running and completed apps). Likewise node info is also empty. This is usually quite crucial when trying to debug where an AM was launched and a pointer to which NM's logs to look at if the AM failed to launch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
Wangda Tan created YARN-3565: Summary: NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String Key: YARN-3565 URL: https://issues.apache.org/jira/browse/YARN-3565 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Priority: Blocker Now NM HB/Register uses Set<String>, so it will be hard to add new fields if we want to support specifying NodeLabel attributes such as exclusivity/constraints, etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
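As background for why an object is preferable to a bare string in the heartbeat/register payload, the hypothetical value class below shows how a label can carry extra attributes (exclusivity, constraints, ...) without changing the field's type. It is a simplified illustration, not the actual org.apache.hadoop.yarn.api.records.NodeLabel API.
{code}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a label object can grow new attributes without changing
// the field type in the heartbeat/register request, whereas a Set<String> can
// only ever carry the label name.
public class SimpleNodeLabel {
  private final String name;
  private final boolean exclusive;   // example of an attribute a String cannot carry

  public SimpleNodeLabel(String name, boolean exclusive) {
    this.name = name;
    this.exclusive = exclusive;
  }

  public String getName() { return name; }
  public boolean isExclusive() { return exclusive; }

  public static void main(String[] args) {
    Set<SimpleNodeLabel> labels = new HashSet<>();
    labels.add(new SimpleNodeLabel("gpu", true));
    labels.add(new SimpleNodeLabel("spot", false));
    for (SimpleNodeLabel l : labels) {
      System.out.println(l.getName() + " exclusive=" + l.isExclusive());
    }
  }
}
{code}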
[jira] [Updated] (YARN-3473) Fix RM Web UI configuration for some properties
[ https://issues.apache.org/jira/browse/YARN-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3473: - Labels: BB2015-05-TBR (was: ) Fix RM Web UI configuration for some properties --- Key: YARN-3473 URL: https://issues.apache.org/jira/browse/YARN-3473 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-3473.001.patch Using the RM Web UI, the Tools-Configuration page shows some properties as something like BufferedInputStream instead of the appropriate .xml file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520473#comment-14520473 ] Zhijie Shen commented on YARN-3539: --- bq. That way, older installations and apps can continue to use the old APIs, and the new APIs do not need to take the unknown burden of making the old APIs work on the newer framework. This sounds like a more reasonable commitment for ATS v2. Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, YARN-3539-003.patch, YARN-3539-004.patch The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2868) FairScheduler: Metric for latency to allocate first container for an application
[ https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520480#comment-14520480 ] Ray Chiang commented on YARN-2868: -- I'll answer these in reverse order: 2) The first AM container is the easy one to measure. Subsequent measurements can be tricky since the request time will need to be recorded somewhere until the request is actually fulfilled. Tracking all the requests and corresponding fulfillments would be a lot more work and may warrant more sophisticated measurements. I haven't filed a JIRA for measuring the later containers. 1) Breaking this answer into several parts. I'm not going to remember all the iterations I went through but I'll answer as best I can. 1A) YARN-3105 covers the enhancements to StateMachine to record state transitions generically for metrics. [~jianhe] made the original suggestion. 1B) There were several factors for this. I think it was a combination of wanting queue-specific metrics, wanting to separate first allocation from later allocations, working with managed and unmanaged AMs, and a desire to get a more exact measurement with less overhead. I've deleted all my earliest attempts at this (i.e. those prior to the first patch on this JIRA), so I can't provide more specific information offhand. Let me know if that satisfactorily answers your questions. FairScheduler: Metric for latency to allocate first container for an application Key: YARN-2868 URL: https://issues.apache.org/jira/browse/YARN-2868 Project: Hadoop YARN Issue Type: Improvement Reporter: Ray Chiang Assignee: Ray Chiang Labels: metrics, supportability Fix For: 2.8.0 Attachments: YARN-2868-01.patch, YARN-2868.002.patch, YARN-2868.003.patch, YARN-2868.004.patch, YARN-2868.005.patch, YARN-2868.006.patch, YARN-2868.007.patch, YARN-2868.008.patch, YARN-2868.009.patch, YARN-2868.010.patch, YARN-2868.011.patch, YARN-2868.012.patch Add a metric to measure the latency between starting container allocation and the first container actually being allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
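A rough sketch of the bookkeeping such a metric needs: record when an attempt starts requesting containers and compute the elapsed time when its first container is allocated. The hook method names below are hypothetical, not FairScheduler's actual entry points.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Record when an application attempt starts asking for containers, and compute
// the latency when its first container is actually allocated.
public class FirstAllocationLatency {
  private final Map<String, Long> requestStartMillis = new ConcurrentHashMap<>();

  // Hypothetical hook: called when the attempt becomes runnable / starts requesting.
  public void onAttemptStarted(String appAttemptId) {
    requestStartMillis.putIfAbsent(appAttemptId, System.currentTimeMillis());
  }

  // Hypothetical hook: called on every container allocation for the attempt.
  // Returns the first-allocation latency in ms, or -1 if already reported.
  public long onContainerAllocated(String appAttemptId) {
    Long start = requestStartMillis.remove(appAttemptId);   // only the first call wins
    return start == null ? -1 : System.currentTimeMillis() - start;
  }

  public static void main(String[] args) throws InterruptedException {
    FirstAllocationLatency m = new FirstAllocationLatency();
    m.onAttemptStarted("appattempt_1");
    Thread.sleep(50);
    System.out.println("first allocation latency ms: " + m.onContainerAllocated("appattempt_1"));
    System.out.println("later allocations ignored: " + m.onContainerAllocated("appattempt_1"));
  }
}
{code}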
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520497#comment-14520497 ] Ray Chiang commented on YARN-3069: -- Thanks Akira! I've made those changes. I definitely left some empty descriptions in yarn-default.xml where I couldn't figure out what the property was for. I'll wait for more of your review before uploading a new patch. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class 
yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
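For any property in the list above, a quick way to check whether it is currently described in yarn-default.xml is to load only that resource and look the key up. A minimal sketch follows; the property chosen is just an example from the list, and the check assumes yarn-default.xml is on the classpath.
{code}
import org.apache.hadoop.conf.Configuration;

public class CheckYarnDefault {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);   // start from an empty config
    conf.addResource("yarn-default.xml");            // load only yarn-default.xml
    String key = "yarn.log.server.url";              // one of the properties listed above
    System.out.println(key + " present in yarn-default.xml? " + (conf.get(key) != null));
  }
}
{code}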
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520520#comment-14520520 ] Hadoop QA commented on YARN-3134: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729341/YARN-3134-YARN-2928.runJenkins.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 4c1af15 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7546/console | This message was automatically generated. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.runJenkins.001.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify how our implementation reads/writes data from/to HBase, and we can easily build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: (was: YARN-3134-YARN-2928.runJenkins.001.patch) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify how our implementation reads/writes data from/to HBase, and we can easily build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3482) Report NM resource capacity in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3482: --- Summary: Report NM resource capacity in heartbeat (was: Report NM available resources in heartbeat) Report NM resource capacity in heartbeat Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3564) TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly
[ https://issues.apache.org/jira/browse/YARN-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520544#comment-14520544 ] Wangda Tan commented on YARN-3564: -- +1, will commit later. TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly --- Key: YARN-3564 URL: https://issues.apache.org/jira/browse/YARN-3564 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3564.1.patch the test fails intermittently in jenkins https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled
[ https://issues.apache.org/jira/browse/YARN-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520380#comment-14520380 ] Hudson commented on YARN-3533: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7702 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7702/]) YARN-3533. Test: Fix launchAM in MockRM to wait for attempt to be scheduled. Contributed by Anubhav Dhoot (jianhe: rev 4c1af156aef4f3bb1d9823d5980c59b12007dc77) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Test: Fix launchAM in MockRM to wait for attempt to be scheduled Key: YARN-3533 URL: https://issues.apache.org/jira/browse/YARN-3533 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.6.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3533.001.patch MockRM#launchAM fails in many test runs because it does not wait for the app attempt to be scheduled before NM update is sent as noted in [recent builds|https://issues.apache.org/jira/browse/YARN-3387?focusedCommentId=14507255page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14507255] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
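The fix follows the usual poll-until-condition pattern: wait for the app attempt to reach the scheduled state before sending the NM update. A generic sketch of that pattern is below; the condition, timeout, and poll interval are illustrative, not MockRM's actual code.
{code}
import java.util.function.BooleanSupplier;

// Generic "wait until a condition holds or time out" helper, the same pattern
// the MockRM fix applies before sending the NM update to the scheduler.
public class WaitFor {
  public static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return true;                    // e.g. attempt state == SCHEDULED
      }
      Thread.sleep(pollMs);             // back off before re-checking
    }
    return false;                       // caller should fail the test on timeout
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.currentTimeMillis();
    // Toy condition: becomes true after roughly 200 ms.
    boolean ok = waitFor(() -> System.currentTimeMillis() - start > 200, 2000, 20);
    System.out.println("condition reached: " + ok);
  }
}
{code}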
[jira] [Commented] (YARN-3544) AM logs link missing in the RM UI for a completed app
[ https://issues.apache.org/jira/browse/YARN-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520379#comment-14520379 ] Hitesh Shah commented on YARN-3544: --- I meant redirect to the log server AM logs link missing in the RM UI for a completed app -- Key: YARN-3544 URL: https://issues.apache.org/jira/browse/YARN-3544 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.7.0 Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Attachments: Screen Shot 2015-04-27 at 6.24.05 PM.png, YARN-3544.1.patch AM log links should always be present ( for both running and completed apps). Likewise node info is also empty. This is usually quite crucial when trying to debug where an AM was launched and a pointer to which NM's logs to look at if the AM failed to launch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3445) Cache runningApps in RMNode for getting running apps on given NodeId
[ https://issues.apache.org/jira/browse/YARN-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520396#comment-14520396 ] Vinod Kumar Vavilapalli commented on YARN-3445: --- There is too much duplicate information already in NodeHeartbeatRequest, albeit for slightly different purposes. We need to consolidate the following (without breaking compatibility with previous releases), lest the heartbeat become heavier and heavier. - logAggregationReportsForApps added, but not released yet -- logAggregationReportsForApps itself is a map of ApplicationID with a nested LogAggregationReport.ApplicationID - duplicate AppID information - runningApplications in this patch - NodeStatus.keepAliveApplications /cc [~jianhe] [~leftnoteasy] Cache runningApps in RMNode for getting running apps on given NodeId Key: YARN-3445 URL: https://issues.apache.org/jira/browse/YARN-3445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3445-v2.patch, YARN-3445.patch Per discussion in YARN-3334, we need to filter out unnecessary collector info from the RM's heartbeat response. Our proposal is to add a cache of runningApps in RMNode, so the RM only sends back collectors for locally running apps. This is also needed for YARN-914 (graceful decommission): if an NM in the decommissioning stage has no running apps, it will get decommissioned immediately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: YARN-3134-YARN-2928.runJenkins.001.patch In the latest patch I addressed all previous comments, and changed the storage type for config and info to byte arrays. I've also revised the storage of metrics, which no longer uses startTime and endTime. Right now I'm focusing on storing singleData since we need to discuss more about storing and aggregating time series data. Renaming the patch to the new format so that we can try Jenkins on the YARN-2928 branch. I disabled the Phoenix test for now since it's blocked by YARN-3529. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.runJenkins.001.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify how our implementation reads/writes data from/to HBase, and we can easily build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3534) Report node resource utilization
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520563#comment-14520563 ] Karthik Kambatla commented on YARN-3534: Skimmed through the latest patch. High-level comments/questions: # Do we need a separate class/proto for ResourceUtilization? Could we just reuse Resource? That should make the patch significantly smaller. # Would be nice to have NodeResourceMonitor emit metrics for usage. We could do this in a follow-up JIRA. Report node resource utilization Key: YARN-3534 URL: https://issues.apache.org/jira/browse/YARN-3534 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Assignee: Inigo Goiri Attachments: YARN-3534-1.patch, YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch Original Estimate: 336h Remaining Estimate: 336h YARN should be aware of the resource utilization of the nodes when scheduling containers. For this, this task will implement the NodeResourceMonitor and send this information to the Resource Manager in the heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3534) Report node resource utilization
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520584#comment-14520584 ] Inigo Goiri commented on YARN-3534: --- The original reason for having ResourceUtilization was getting better granularity for the CPUs. We had some discussion about it in YARN-3481; take a look there and chime in. My original implementation used Resource, but when trying to schedule containers based on it, there were a lot of holes in the scheduling. Given this, I thought this patch was a good place to create this new utilization entity with CPU as a float. Regarding the metrics in the NodeResourceMonitor, I completely agree. I thought about doing it right away but as you mentioned, it seemed like a better idea to save it for another JIRA. Let's do that. Report node resource utilization Key: YARN-3534 URL: https://issues.apache.org/jira/browse/YARN-3534 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Assignee: Inigo Goiri Attachments: YARN-3534-1.patch, YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch Original Estimate: 336h Remaining Estimate: 336h YARN should be aware of the resource utilization of the nodes when scheduling containers. For this, this task will implement the NodeResourceMonitor and send this information to the Resource Manager in the heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
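A simplified, hypothetical version of the utilization entity described above, showing why a float CPU field preserves granularity that integer vcores would lose. This is an illustration only, not the ResourceUtilization class added by the patch.
{code}
// Hypothetical sketch of a node-utilization value object: memory tracked in
// bytes and CPU as a fraction of total capacity, instead of rounded MB/vcores.
public class SimpleResourceUtilization {
  private final long usedMemoryBytes;
  private final float cpuUsage;        // 0.0f .. 1.0f of the node's total CPU

  public SimpleResourceUtilization(long usedMemoryBytes, float cpuUsage) {
    this.usedMemoryBytes = usedMemoryBytes;
    this.cpuUsage = cpuUsage;
  }

  public long getUsedMemoryBytes() { return usedMemoryBytes; }
  public float getCpuUsage() { return cpuUsage; }

  public static void main(String[] args) {
    // 37% CPU on a node: a float keeps the fraction, while an integer vcore
    // count would round it and lose information the scheduler could use.
    SimpleResourceUtilization u =
        new SimpleResourceUtilization(6L * 1024 * 1024 * 1024, 0.37f);
    System.out.printf("mem=%d bytes, cpu=%.0f%%%n",
        u.getUsedMemoryBytes(), u.getCpuUsage() * 100);
  }
}
{code}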
[jira] [Commented] (YARN-3544) AM logs link missing in the RM UI for a completed app
[ https://issues.apache.org/jira/browse/YARN-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520586#comment-14520586 ] Zhijie Shen commented on YARN-3544: --- [~xgong], the patch doesn't apply for branch-2.7. It seems to be a non-trivial merge conflict. Would you please take a look? AM logs link missing in the RM UI for a completed app -- Key: YARN-3544 URL: https://issues.apache.org/jira/browse/YARN-3544 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.7.0 Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Attachments: Screen Shot 2015-04-27 at 6.24.05 PM.png, YARN-3544.1.patch AM log links should always be present ( for both running and completed apps). Likewise node info is also empty. This is usually quite crucial when trying to debug where an AM was launched and a pointer to which NM's logs to look at if the AM failed to launch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520599#comment-14520599 ] Hadoop QA commented on YARN-3134: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 47s | Pre-patch YARN-2928 compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:red}-1{color} | javac | 7m 58s | The applied patch generated 8 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 4m 5s | The applied patch generated 2 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 0m 41s | The patch appears to introduce 10 new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 40m 18s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-timelineservice | | | Found reliance on default encoding in org.apache.hadoop.yarn.server.timelineservice.storage.FileSystemTimelineWriterImpl.write(String, String, String, String, long, String, TimelineEntity, TimelineWriteResponse):in org.apache.hadoop.yarn.server.timelineservice.storage.FileSystemTimelineWriterImpl.write(String, String, String, String, long, String, TimelineEntity, TimelineWriteResponse): new java.io.FileWriter(String, boolean) At FileSystemTimelineWriterImpl.java:[line 86] | | | org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.tryInitTable() may fail to clean up java.sql.Statement on checked exception Obligation to clean up resource created at PhoenixTimelineWriterImpl.java:up java.sql.Statement on checked exception Obligation to clean up resource created at PhoenixTimelineWriterImpl.java:[line 227] is not discharged | | | org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.executeQuery(String) may fail to close Statement At PhoenixTimelineWriterImpl.java:Statement At PhoenixTimelineWriterImpl.java:[line 492] | | | A prepared statement is generated from a nonconstant String in org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.storeEntityVariableLengthFields(TimelineEntity, TimelineCollectorContext, Connection) At PhoenixTimelineWriterImpl.java:from a nonconstant String in org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.storeEntityVariableLengthFields(TimelineEntity, TimelineCollectorContext, Connection) At PhoenixTimelineWriterImpl.java:[line 389] | | | A prepared statement is generated from a nonconstant String in org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.storeEvents(TimelineEntity, TimelineCollectorContext, Connection) At PhoenixTimelineWriterImpl.java:from a nonconstant String in 
org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.storeEvents(TimelineEntity, TimelineCollectorContext, Connection) At PhoenixTimelineWriterImpl.java:[line 476] | | | A prepared statement is generated from a nonconstant String in org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.storeMetrics(TimelineEntity, TimelineCollectorContext, Connection) At PhoenixTimelineWriterImpl.java:from a nonconstant String in org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.storeMetrics(TimelineEntity, TimelineCollectorContext, Connection) At PhoenixTimelineWriterImpl.java:[line 433] | | | A prepared statement is generated from a nonconstant String in org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.write(String, String, String, String, long, String, TimelineEntities) At PhoenixTimelineWriterImpl.java:from a nonconstant String in org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.write(String, String, String, String, long, String, TimelineEntities) At PhoenixTimelineWriterImpl.java:[line 167] | | | org.apache.hadoop.yarn.server.timelineservice.storage.PhoenixTimelineWriterImpl.setBytesForColumnFamily(PreparedStatement, Map, int) makes inefficient use of keySet iterator instead of entrySet
[jira] [Updated] (YARN-3564) TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly
[ https://issues.apache.org/jira/browse/YARN-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3564: -- Attachment: YARN-3564.1.patch patch to fix the failure TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable fails randomly --- Key: YARN-3564 URL: https://issues.apache.org/jira/browse/YARN-3564 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3564.1.patch the test fails intermittently in jenkins https://builds.apache.org/job/PreCommit-YARN-Build/7467/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3392) Change NodeManager metrics to not populate resource usage metrics if they are unavailable
[ https://issues.apache.org/jira/browse/YARN-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot resolved YARN-3392. - Resolution: Duplicate Change NodeManager metrics to not populate resource usage metrics if they are unavailable -- Key: YARN-3392 URL: https://issues.apache.org/jira/browse/YARN-3392 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3392.prelim.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3544) AM logs link missing in the RM UI for a completed app
[ https://issues.apache.org/jira/browse/YARN-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520425#comment-14520425 ] Zhijie Shen commented on YARN-3544: --- bq. Doesn't the NM log link redirect to the log server after the logs have been aggregated? Thanks, Hitesh! I didn't notice this option before. I tried it locally, and the whole flow for the completed app's logs is working fine now. I will commit the patch later today unless there are further comments. AM logs link missing in the RM UI for a completed app -- Key: YARN-3544 URL: https://issues.apache.org/jira/browse/YARN-3544 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.7.0 Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Attachments: Screen Shot 2015-04-27 at 6.24.05 PM.png, YARN-3544.1.patch AM log links should always be present ( for both running and completed apps). Likewise node info is also empty. This is usually quite crucial when trying to debug where an AM was launched and a pointer to which NM's logs to look at if the AM failed to launch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)