[jira] [Commented] (YARN-3459) TestLog4jWarningErrorMetricsAppender breaks in trunk
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484798#comment-14484798 ]

Hadoop QA commented on YARN-3459:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12723835/apache-yarn-3459.0.patch
against trunk revision ab04ff9.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7251//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7251//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7251//console

This message is automatically generated.

TestLog4jWarningErrorMetricsAppender breaks in trunk
----------------------------------------------------

Key: YARN-3459
URL: https://issues.apache.org/jira/browse/YARN-3459
Project: Hadoop YARN
Issue Type: Bug
Reporter: Li Lu
Assignee: Li Lu
Priority: Blocker
Fix For: 2.7.0
Attachments: apache-yarn-3459.0.patch

TestLog4jWarningErrorMetricsAppender fails with the following message:
{code}
Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender
Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec <<< FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender
testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender)  Time elapsed: 2.01 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.junit.Assert.assertEquals(Assert.java:555)
    at org.junit.Assert.assertEquals(Assert.java:542)
    at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89)
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
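For context on why a purge test like this can fail intermittently: the assertion {{expected:<0> but was:<1>}} means one buffered message survived the purge window. A minimal JUnit sketch of the timing-sensitive pattern (illustrative names only, not the actual TestLog4jWarningErrorMetricsAppender code):
{code}
import static org.junit.Assert.assertEquals;

import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

import org.junit.Test;

public class PurgeTimingSketch {

  @Test
  public void testPurge() throws Exception {
    final AtomicInteger messages = new AtomicInteger();
    messages.incrementAndGet(); // one "error" message captured

    // Background purge, like an appender's cleanup task: clears the
    // counter 1 second from now.
    Timer timer = new Timer(true);
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        messages.set(0);
      }
    }, 1000);

    // Sleeping "just past" the purge interval is the fragile part: if the
    // timer thread fires late, the count is still 1 and the assert fails
    // with "expected:<0> but was:<1>", as in the report above.
    Thread.sleep(1100);
    assertEquals(0, messages.get());
  }
}
{code}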
[jira] [Updated] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-3391:
------------------------------
    Attachment: YARN-3391.4.patch

Clearly define flow ID/ flow run / flow version in API and storage
-------------------------------------------------------------------

Key: YARN-3391
URL: https://issues.apache.org/jira/browse/YARN-3391
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch

To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on:
- How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually?
- Should the flow run id be a number as opposed to a generic string?
- Default behavior for the flow run id if it is missing (i.e. the client did not set it)
- How do we handle flow attributes in case of nested levels of flows?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484813#comment-14484813 ]

Hadoop QA commented on YARN-3457:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12723815/YARN-3457.001.patch
against trunk revision ab04ff9.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7252//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7252//console

This message is automatically generated.

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
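A minimal sketch of the kind of null guard such a fix needs, reusing the method and field names from the snippets above (the actual YARN-3457.001.patch may structure it differently):
{code}
private void stopRecoveryStore() throws IOException {
  if (nmStore == null) {
    // serviceInit failed before the recovery store was created
    return;
  }
  nmStore.stop();
  // context is assigned late in serviceInit, so it can still be null here
  // even when nmStore exists.
  if (context != null && context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}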
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484815#comment-14484815 ]

Zhijie Shen commented on YARN-3391:
-----------------------------------

I created a new patch:

bq. So in general, I think we should use as much javadoc comments instead of inline comments for public APIs.

Move the comments into TimelineUtils and make them javadoc.

bq. We should add more info to LOG.warn messages, at least to tell user flow run should be numeric.

Improve the warn message.

bq. In addition, do we need to check negative value for flow run here?

According to Sangjin's given example, we usually want to identify a flow run by timestamp, which theoretically can be negative to represent sometime before 1970.

Clearly define flow ID/ flow run / flow version in API and storage
-------------------------------------------------------------------

Key: YARN-3391
URL: https://issues.apache.org/jira/browse/YARN-3391
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch

To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on:
- How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually?
- Should the flow run id be a number as opposed to a generic string?
- Default behavior for the flow run id if it is missing (i.e. the client did not set it)
- How do we handle flow attributes in case of nested levels of flows?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
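As a sketch of the numeric-validation behavior being discussed (the method name, LOG field, and rethrow behavior are illustrative assumptions, not the actual TimelineUtils API):
{code}
public static long parseFlowRunId(String flowRunIdStr) {
  try {
    // Negative values are deliberately accepted: a flow run is typically
    // identified by a timestamp, which can in principle be negative for
    // instants before 1970.
    return Long.parseLong(flowRunIdStr);
  } catch (NumberFormatException e) {
    // Tell the user explicitly that the flow run id must be numeric.
    LOG.warn("Flow run id should be a numeric value (e.g. a timestamp), but got: "
        + flowRunIdStr);
    throw e;
  }
}
{code}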
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484845#comment-14484845 ]

Tsuyoshi Ozawa commented on YARN-3457:
--------------------------------------

+1, committing this shortly.

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485071#comment-14485071 ]

Devaraj K commented on YARN-3225:
---------------------------------

Thanks [~djp] for your review.

bq. I think we should support a case where the Admin wants nodes to get decommissioned whenever all apps on these nodes finish. If so, shall we support a negative value (any one, or some special one, like -1) to specify this case?

If the user wants to achieve this, they can give some larger timeout value and wait for all nodes to get decommissioned gracefully (without forcing). Do we really need to provide special handling for this case?

bq. For NORMAL, shall we use "Decommission nodes in normal (old) way" instead, or something simpler - "Decommission nodes"?

I feel "Decommission nodes in normal way" would be ok, no need to mention the 'old' term. What is your opinion on this?

bq. IMO, the methods inside a class shouldn't be more public than the class itself? If we don't expect other projects to use the class, we always don't expect some methods to get used. The same problem happens in an old API, RefreshNodeRequest.java. I think we may need to fix both?

I agree, I will fix both of them.

bq. Why do we need this change? recordFactory.newRecordInstance(RefreshNodesRequest.class) will return something with DecommissionType.NORMAL as default. No?

It will not make any difference because NORMAL is the default. I made this change to make it consistent with the other decommission types.

New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
------------------------------------------------------------------------

Key: YARN-3225
URL: https://issues.apache.org/jira/browse/YARN-3225
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Junping Du
Assignee: Devaraj K
Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225.patch, YARN-914.patch

A new CLI (or an existing CLI with parameters) should put each node on the decommission list into decommissioning status and track a timeout, terminating the nodes that haven't finished by then.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
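To make the discussion concrete, a rough sketch of how a graceful option might map onto the request types mentioned above (RefreshNodesRequest, DecommissionType, recordFactory); the setter name and the CLI wiring are assumptions, not the final patch:
{code}
RefreshNodesRequest request =
    recordFactory.newRecordInstance(RefreshNodesRequest.class);
if (graceful) {
  // Nodes drain their running apps first; a tracked timeout decides when
  // any stragglers are decommissioned forcefully.
  request.setDecommissionType(DecommissionType.GRACEFUL);
} else {
  // NORMAL is already the default; setting it explicitly just keeps the
  // call sites consistent across decommission types, as argued above.
  request.setDecommissionType(DecommissionType.NORMAL);
}
adminProtocol.refreshNodes(request);
{code}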
[jira] [Commented] (YARN-3462) Patches applied for YARN-2424 are inconsistent between trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485151#comment-14485151 ]

Naganarasimha G R commented on YARN-3462:
-----------------------------------------

[~qwertymaniac]/[~aw] Can you guys take a look at this patch?

Patches applied for YARN-2424 are inconsistent between trunk and branch-2
--------------------------------------------------------------------------

Key: YARN-3462
URL: https://issues.apache.org/jira/browse/YARN-3462
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Sidharta Seethana
Assignee: Naganarasimha G R
Attachments: YARN-3462.20150508-1.patch

It looks like the changes for YARN-2424 are not the same for trunk (commit 7e75226e68715c3eca9d346c8eaf2f265aa70d23) and branch-2 (commit 5d965f2f3cf97a87603720948aacd4f7877d73c4). Branch-2 has a missing warning and the documentation is a bit different as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485118#comment-14485118 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
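Judging from the stack trace, the fix has to tolerate attempts that never received an AM container. A hedged sketch of that defensive shape (the rendering call and helper are illustrative, not the actual AppBlock code):
{code}
// getContainer may return null when the app was never assigned an AM container.
Container amContainer = getContainer(containerId, app);
if (amContainer == null) {
  // An ACCEPTED/unassigned application has no AM container yet; show a
  // placeholder instead of letting the whole page die with an NPE.
  html.p()._("No AM container information available for this attempt.")._();
} else {
  // render the AM container link as before
  ..
}
{code}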
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485122#comment-14485122 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
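An illustrative log4j sketch of "ad hoc" debug dumping in the spirit of the AdHocLogDumper added here: raise one logger to DEBUG, write to a separate file, and restore the old level after a fixed period. This is an assumption about the mechanism, not the actual AdHocLogDumper code:
{code}
import java.util.Timer;
import java.util.TimerTask;

import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class AdHocDebugSketch {
  public static void dumpFor(String loggerName, String file, long millis)
      throws Exception {
    final Logger logger = Logger.getLogger(loggerName);
    final Level oldLevel = logger.getLevel();
    final FileAppender appender =
        new FileAppender(new PatternLayout("%d %p %c: %m%n"), file);
    appender.setThreshold(Level.DEBUG);
    logger.addAppender(appender);
    logger.setLevel(Level.DEBUG);

    // Restore everything once the requested window elapses, so the extra
    // logging cannot affect the ResourceManager indefinitely.
    new Timer(true).schedule(new TimerTask() {
      @Override
      public void run() {
        logger.setLevel(oldLevel);
        logger.removeAppender(appender);
        appender.close();
      }
    }, millis);
  }
}
{code}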
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485127#comment-14485127 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485120#comment-14485120 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485114#comment-14485114 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485045#comment-14485045 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485049#comment-14485049 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485047#comment-14485047 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485054#comment-14485054 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485142#comment-14485142 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485138#comment-14485138 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485140#comment-14485140 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485147#comment-14485147 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485105#comment-14485105 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485107#comment-14485107 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485109#comment-14485109 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3326) ReST support for getLabelsToNodes
[ https://issues.apache.org/jira/browse/YARN-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485159#comment-14485159 ]

Naganarasimha G R commented on YARN-3326:
-----------------------------------------

Hi [~ozawa],
The test case failure is not related to this issue, and a separate JIRA has already been raised for it (YARN-2871).

ReST support for getLabelsToNodes
---------------------------------

Key: YARN-3326
URL: https://issues.apache.org/jira/browse/YARN-3326
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
Attachments: YARN-3326.20150310-1.patch, YARN-3326.20150407-1.patch, YARN-3326.20150408-1.patch

REST support to retrieve the LabelsToNodes mapping.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484931#comment-14484931 ] Bibin A Chundatt commented on YARN-3457: Thank you [~ozawa] for checking and committing the patch. NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown {quote} 2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3464: Description: Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. was: Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
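To make the race concrete, here is a sketch of one way to close it, assuming the DIE decision and the enqueue path can share the pending list's lock; the structure and the dying flag are hypothetical, not taken from a posted patch:
{code}
// Hypothetical sketch: decide DIE and reject late arrivals atomically.
private LocalizerAction checkPending() {
  synchronized (pending) {
    if (pending.isEmpty()) {
      dying = true;                 // hypothetical flag, checked below
      return LocalizerAction.DIE;
    }
    return LocalizerAction.LIVE;
  }
}

public void addResource(LocalizerResourceRequestEvent event) {
  synchronized (pending) {
    if (dying) {
      // Too late for this runner; the caller must start a new
      // ContainerLocalizer instead of silently dropping the request.
      throw new IllegalStateException("LocalizerRunner is shutting down");
    }
    pending.add(event);
  }
}
{code}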
[jira] [Created] (YARN-3465) use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl
zhihai xu created YARN-3465: --- Summary: use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484858#comment-14484858 ] Hudson commented on YARN-3457: -- FAILURE: Integrated in Hadoop-trunk-Commit #7531 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7531/]) YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown {quote} 2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
zhihai xu created YARN-3464: --- Summary: Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484917#comment-14484917 ] Ravi Prakash commented on YARN-3429: You may have inadvertently used the wrong JIRA number in your commit [~rkanter]. It ought to be YARN-3429 (instead of YARN-2429); I see comments on YARN-2429. TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_01 The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484920#comment-14484920 ] zhihai xu commented on YARN-3464: - This issue only happens for PRIVATE/APPLICATION resource localization. We saw this issue happen when PRIVATE LocalizerResourceRequestEvents interleaved with PUBLIC LocalizerResourceRequestEvents in the following order: PRIVATE1, PRIVATE2, ..., PRIVATEm, PUBLIC1, PUBLIC2, ..., PUBLICn, PRIVATEm+1, PRIVATEm+2. The last two PRIVATE LocalizerResourceRequestEvents are added after all previous m PRIVATE LocalizerResourceRequestEvents are LOCALIZED, due to the delay in processing the n PUBLIC LocalizerResourceRequestEvents. Then the container will stay in the LOCALIZING state until it is killed by the AM. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485466#comment-14485466 ] Craig Welch commented on YARN-3293: --- Overall +1 looks good to me. One additional thing occurred to me when looking it over again - I think that CapacitySchedulerHealthInfo in the web dao is, for the most part, cross-scheduler. Does it make sense to factor most of it up into a generalized SchedulerHealthInfo with all the common pieces and extend it (to CapacitySchedulerHealthInfo) just for the CS specific constructor? Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485392#comment-14485392 ] Sunil G commented on YARN-2003: --- Findbugs warnings are not related. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3388) Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit
[ https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485413#comment-14485413 ] Nathan Roberts commented on YARN-3388: -- Test failures don't appear related to patch. Ran failing tests locally and they pass. Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit - Key: YARN-3388 URL: https://issues.apache.org/jira/browse/YARN-3388 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch When there are multiple active users in a queue, it should be possible for those users to make use of capacity up-to max_capacity (or close). The resources should be fairly distributed among the active users in the queue. This works pretty well when there is a single resource being scheduled. However, when there are multiple resources the situation gets more complex and the current algorithm tends to get stuck at Capacity. Example illustrated in subsequent comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485451#comment-14485451 ] Jason Lowe commented on YARN-3466: -- This was caused by YARN-2943. A new column was added at the beginning of the table but table indices in the sorting metadata for the javascript were not updated accordingly. RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
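For illustration, the YARN web apps typically emit the jQuery DataTables config as a JS string built in Java; a hypothetical sketch of the index shift (the real columns and indices are in YARN-3466.001.patch):
{code}
// Hypothetical sketch: a node-labels column inserted near the front of the
// table shifts every aTargets index after it by one; stale indices silently
// break column sorting.
StringBuilder nodesTableInit = new StringBuilder()
    .append("{aoColumnDefs: [")
    .append("{'sType': 'title-numeric', 'aTargets': [6, 7]},") // was [5, 6]
    .append("{'bSearchable': false, 'aTargets': [8]}")          // was [7]
    .append("]}");
{code}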
[jira] [Updated] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3466: - Attachment: YARN-3466.001.patch Patch to bump the column indices to take into account the new node label column. This also restores the formatting of the code where the columns are defined so it's easier to see the column order and count them. [~leftnoteasy] or [~jianhe] please review. It would be nice to get this into 2.7. RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485446#comment-14485446 ] Zhijie Shen commented on YARN-3448: --- bq. In fact, all rolling dbs from now until ttl may be active. Yeah, actually this is the point I'd like to make. For example, if ttl = 10h and rolling period = 1h, we will have 10 active rolling dbs. Though dbs 2 - 10 are not current, they can't be deleted because they contain data that is still alive. Only rolling dbs from 11 and so on will be deleted. If, while ttl = 10h, we change the rolling period to 10h, we will only have 1 active 10h rolling db, and its size should be equivalent to the prior 10 1h rolling dbs. Therefore, my point is that if the rolling period is smaller than ttl, we still need to keep all the data alive; it's not necessary to separate the data into multiple dbs rather than keeping it together in the current db. One benefit I can think of for the multiple-rolling-db approach (as well as different dbs for different data types) is to increase concurrency. However, I didn't see us having multiple threads to write different dbs concurrently. Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.2.patch, YARN-3448.3.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are to relax some of the consistency constraints, other performance enhancing techniques can be employed to maximize the throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize the read cache effectiveness based on the unique usage patterns of each database. With 5 separate databases each lookup is much faster. This can also help with I/O to have the entity and index databases on separate disks. Rolling DBs for entity and index DBs. 99.9% of the data are in these two sections, at a 4:1 ratio (index to entity), at least for tez. We replace DB record removal with file system removal if we create a rolling set of databases that age out and can be efficiently removed. To do this we must place a constraint to always place an entity's events into its correct rolling db instance based on start time. This allows us to stitch the data back together while reading, with artificial paging. Relax the synchronous writes constraints. If we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes that can be much faster. Prefer sequential writes. Sequential writes can be several times faster than random writes. Spend some small effort arranging the writes in such a way that will trend towards sequential write performance over random write performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
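To make the rolling-instance idea concrete, a small sketch under stated assumptions: a fixed 1h rolling period and a db name derived from the period start; the class and method names are hypothetical:
{code}
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of bucketing entities into rolling db instances.
class RollingDbSketch {
  static final long PERIOD_MS = TimeUnit.HOURS.toMillis(1);

  // All events of an entity land in the instance owning its start time,
  // so readers can stitch instances back together in time order.
  static String dbNameFor(long entityStartTime) {
    long periodStart = (entityStartTime / PERIOD_MS) * PERIOD_MS;
    return "entitydb-" + periodStart;
  }

  // TTL enforcement becomes whole-instance (file system) removal: an
  // instance is deletable only once every record in it has aged out.
  static boolean deletable(long periodStart, long now, long ttlMs) {
    return periodStart + PERIOD_MS < now - ttlMs;
  }
}
{code}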
[jira] [Updated] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3466: - Affects Version/s: (was: 2.6.0) 2.7.0 RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniMRYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485303#comment-14485303 ] Mit Desai commented on YARN-2890: - [~hitesh] any comments on the latest patch? MiniMRYarnCluster should turn on timeline service if configured to do so Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485551#comment-14485551 ] Junping Du commented on YARN-2637: -- Hi [~cwelch] and [~jianhe], I think MAPREDUCE-6189 could be related to this patch. Can you take a look at it? Thanks! maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Fix For: 2.7.0 Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.38.patch, YARN-2637.39.patch, YARN-2637.40.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether an app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId()
        + " from user: " + application.getUser()
        + " activated in queue: " + getQueueName());
  }
}
{code}
For example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, 200 AMs can be launched. If each AM actually uses 5M (> minimum_allocation), all apps can still be activated, and they will occupy all resources of the queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
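Spelling out the arithmetic of the example above (sketch only; the numbers mirror the description, not scheduler code):
{code}
// Worked numbers from the example above:
long queueCapacityMb = 1000;                           // ~1G queue
double maxAmResourcePercent = 0.2;
long minAllocationMb = 1;
long maxAmResourceMb =
    (long) (queueCapacityMb * maxAmResourcePercent);   // 200 MB for AMs
long maxAmCount = maxAmResourceMb / minAllocationMb;   // 200 AMs activatable
long actualAmMb = 5;                                   // each AM really uses 5 MB
long amUsageMb = maxAmCount * actualAmMb;              // 1000 MB: the whole queue
{code}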
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485557#comment-14485557 ] Craig Welch commented on YARN-3293: --- Your call, I think it's also fine to wait to do this until we do FairScheduler integration when we are clear on exactly what needs to happen (it may be premature to do it now, not entirely sure), but ultimately I think as much as can be shared should be. Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485566#comment-14485566 ] Karthik Kambatla commented on YARN-3464: I have been investigating a similar issue. Initially I thought of the same race, but I am not sure that alone solves the issue. Looking at the code closely, I don't see any resources being removed from pending. So, pending shouldn't be empty after some of the resources have been downloaded. Related: YARN-3024 increases the frequency of this issue. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3361: - Attachment: YARN-3361.3.patch Thanks for your comments, [~vinodkv]/[~jianhe]: * Main code comments from Vinod: * bq. checkNodeLabelExpression: NPEs on labelExpression can happen? No, I removed the checks. bq. FiCaSchedulerNode: exclusive, setters, getters - exclusivePartition They're not used by anybody; removed. bq. ExclusiveType renames Done bq. AbstractCSQueue: 1. Change to nodePartitionToLookAt: Done 2. Now all queues check needResource 3. Renamed to hasPendingResourceRequest as suggested by Jian bq. checkResourceRequestMatchingNodeLabel can be moved into the application? Moved to SchedulerUtils bq. checkResourceRequestMatchingNodeLabel nodeLabelToLookAt arg is not used anywhere else. Done (merged it into SchedulerUtils.checkResourceRequestMatchingNodePartition) bq. addNonExclusiveSchedulingOpportunity Renamed to reset/addMissedNonPartitionedRequestSchedulingOpportunity bq. It seems like we are not putting absolute max-capacities on the individual queues when not-respecting-partitions. Describe why? Similarly, describe as to why user-limit-factor is ignored in the not-respecting-partitions mode. Done * Test code comments from Vinod: * bq. testNonExclusiveNodeLabelsAllocationIgnoreAppSubmitOrder Done bq. testNonExclusiveNodeLabelsAllocationIgnorePriority Renamed to testPreferenceOfNeedyPrioritiesUnderSameAppTowardsNodePartitions bq. Actually, now that I rename it that way, this may not be the right behavior. Not respecting priorities within an app can result in scheduling deadlocks: This will not lead to deadlock, because we separately count resource usage under each partition; priority=1 goes first on partition=y before priority=0 is fully satisfied only because priority=1 is the lowest priority that asks for partition=y. bq. testLabeledResourceRequestsGetPreferrenceInHierarchyOfQueue Renamed to testPreferenceOfQueuesTowardsNodePartitions bq. testNonLabeledQueueUsesLabeledResource Done bq. Let's move all these node-label related tests into their own test-case. Moved to TestNodeLabelContainerAllocation Added more tests: 1. Added testAMContainerAllocationWillAlwaysBeExclusive to make sure the AM will always be exclusive. 2. Added testQueueMaxCapacitiesWillNotBeHonoredWhenNotRespectingExclusivity to make sure max-capacities on individual queues are ignored when doing ignore-exclusivity allocation. * Main code comments from Jian: * bq. Merge queue#needResource and application#needResource Done; moved the common implementation to SchedulerUtils.hasPendingResourceRequest bq. Merge queue#needResource and application#needResource Done bq. Some methods like canAssignToThisQueue where both nodeLabels and exclusiveType are passed, it may be simplified by passing the current partitionToAllocate to simplify the internal if/else check. Actually, it will not simplify the logic much; I checked, and there are only a few places that can leverage nodePartitionToLookAt. I prefer to keep the semantics of SchedulingMode. bq. The following may be incorrect, as the current request may not be the AM container request, though null == rmAppAttempt.getMasterContainer() I understand masterContainer could be asynchronously initialized in RMApp, but the interval can be ignored; doing the null check here makes sure the AM container doesn't get allocated. bq. below if/else can be avoided if passing the nodePartition into queueCapacities.getAbsoluteCapacity(nodePartition), Done bq. the second limit won’t be hit? 
Yeah, it will not be hit, but setting it to maxUserLimit will enhance readability. bq. nonExclusiveSchedulingOpportunities#setCount - add(Priority) Done Attached new patch (ver.3) CapacityScheduler side changes to support non-exclusive node labels --- Key: YARN-3361 URL: https://issues.apache.org/jira/browse/YARN-3361 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3361.1.patch, YARN-3361.2.patch, YARN-3361.3.patch According to the design doc attached in YARN-3214, we need to implement the following logic in CapacityScheduler: 1) When allocating a resource request with no node-label specified, it should get preferentially allocated to nodes without labels. 2) When there are available resources on a node with a label, they can be used by applications in the following order: - Applications under queues which can access the label and ask for the same labeled resource. - Applications under queues which can access the label and ask for non-labeled resources. - Applications under queues which cannot access the label and ask for non-labeled resources. 3) Expose necessary information that can be used by preemption
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485607#comment-14485607 ] Zhijie Shen commented on YARN-3051: --- bq. My sense is that it should be fine to use the same time window for all metrics. Makes sense to me too. bq. Or we have to be handle it as part of a single query ? The result will just include the entity identifiers of the related entities. We then issue a separate query to pull the detailed info of each related entity. This also prevents the response from being nested. Otherwise, one entity is related to another, which is consequently related to yet another, and the response will be too big. And if A is related to B, and B is then related to A, JAX-RS will find the cyclic dependency and throw an exception. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051_temp.patch Per design in YARN-2928, create a backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
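A sketch of the two-step read described above, with a hypothetical reader method (defining the real read interface is what YARN-3051 is about):
{code}
// Hypothetical sketch: responses stay flat by returning only identifiers
// for related entities; details come from follow-up queries.
TimelineEntity primary = reader.getEntity(entityType, entityId);
for (Map.Entry<String, Set<String>> rel
    : primary.getRelatedEntities().entrySet()) {
  for (String relatedId : rel.getValue()) {
    // A separate query per related entity; nothing is nested, so a cycle
    // like A -> B -> A can never be expanded into the response.
    TimelineEntity related = reader.getEntity(rel.getKey(), relatedId);
  }
}
{code}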
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485199#comment-14485199 ] Junping Du commented on YARN-3225: -- Thanks [~devaraj.k] for replying. bq. If the user wants to achieve this, they can give some larger timeout value and wait for all nodes to get decommissioned gracefully(without forceful). Do we really need to provide special handling for this case? It would be great if we can support this case, because then users don't have to think up a large timeout value for an important job without knowing when it will end. Given this is a trivial effort compared with what you have already achieved, we'd better do it here instead of filing a separate JIRA. What do you think? bq. I feel Decommission nodes in normal way would be ok, no need to mention the 'old' term. What is your opinion on this? Yes. That sounds good. My previous point was not to mention decommissioning for the normal/previous decommission process, to get rid of any confusion. New parameter or CLI for decommissioning node gracefully in RMAdmin CLI --- Key: YARN-3225 URL: https://issues.apache.org/jira/browse/YARN-3225 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Devaraj K Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225.patch, YARN-914.patch New CLI (or existing CLI with parameters) should put each node on the decommission list into decommissioning status and track the timeout to terminate the nodes that haven't finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3127) Apphistory url crashes when RM switches with ATS enabled
[ https://issues.apache.org/jira/browse/YARN-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485235#comment-14485235 ] Xuan Gong commented on YARN-3127: - [~Naganarasimha] Thanks for working on this. I will take a look shortly. Apphistory url crashes when RM switches with ATS enabled Key: YARN-3127 URL: https://issues.apache.org/jira/browse/YARN-3127 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: RM HA with ATS Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Attachments: YARN-3127.20150213-1.patch, YARN-3127.20150329-1.patch 1. Start RM with HA and ATS configured and run some yarn applications 2. Once applications are finished successfully, start the timeline server 3. Now failover HA from active to standby 4. Access the timeline server URL IP:PORT/applicationhistory Result: Application history URL fails with the below info {quote} 2015-02-03 20:28:09,511 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the applications. java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643) at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:80) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) ... Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: The entity for application attempt appattempt_1422972608379_0001_01 doesn't exist in the timeline store at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplicationAttempt(ApplicationHistoryManagerOnTimelineStore.java:151) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.generateApplicationReport(ApplicationHistoryManagerOnTimelineStore.java:499) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAllApplications(ApplicationHistoryManagerOnTimelineStore.java:108) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:84) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:81) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) ... 51 more 2015-02-03 20:28:09,512 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /applicationhistory org.apache.hadoop.yarn.webapp.WebAppException: Error rendering block: nestLevel=6 expected 5 at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) {quote} Behaviour with AHS with a file based history store: - Apphistory url is working - No attempt entries are shown for each application. Based on initial analysis, when RM switches, application attempts from the state store are not replayed but only applications are. So when the /applicationhistory url is accessed, it tries for all attempt ids and fails -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485242#comment-14485242 ] Varun Vasudev commented on YARN-3348: - The last line "Moved the cache to YarnClientImpl where the hashcode doesn't show up" should be "Moved the cache to YarnClientImpl where the hashcode issue doesn't show up" Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485237#comment-14485237 ] Junping Du commented on YARN-3391: -- Thanks [~zjshen] for updating the patch! bq. According to Sangjin's given example, we usually want to identify a flow run by timestamp, which theoretically can be negative to represent sometime before 1970. Except for time travel, I don't believe any flow run running on hadoop and the new timeline service should happen before 1970. :) Anyway, we do have some practice of checking timestamp > 0 (like: MetricsRecordImpl), but in more cases it sounds like we didn't do this negative check for timestamps. Given this, I am fine with not checking here. The v4 patch looks good to me. [~sjlee0], [~vrushalic] and [~jrottinghuis], any additional comments on the patch? Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3348: Attachment: apache-yarn-3348.2.patch Thanks for the reviews [~aw] and [~jianhe]. bq. Why are we doing this manipulation here and not in the Java code? I get different values when I run the command in the yarn script vs spawn it via Java. From Java, I get lower values - 80x24, whereas the yarn script gives me 204x44. bq. backticks are antiquated in modern bash. Use $() construction Fixed. bq. What happens if tput gives you zero or an error because you are on a non-addressable terminal? (You can generally simulate this by unset TERM or equivalent env var) Thank you for pointing this out. I hadn't considered it. I've added additional checks in the script. If the values can't be determined either by the script or by the Java code, it sets it to 80x24. bq. “Unable to fetach cluster metrics” - typo Fixed. bq. exceeding 80 Column limit, Fixed. bq. the -rows, -cols options seems not having effect on my screen when I tried it, could you double check ? I found an issue with the cols option which I've fixed. Can you please try it again? bq. the ‘yarn top’ output is repeatedly showing up on terminal every $delay seconds. it’ll be better to only show that only once. I didn't understand this - do you mean that it shouldn't auto-refresh? bq. Does the patch only show root queue info ? should we show all queues info ? Queues can be specified as a comma separated string using the -queues option. By default, it shows information for the root queue. bq. “F + Enter : Select sort field” ; may be use ’S’ for sorting ? Fixed. bq. “Memory seconds(in GBseconds” - missing “)” Fixed {quote} It seems a bit odd to have this method in a public API record. Do you know why hashcode is not correct without this method ? Or we can just type cast it to GetApplicationsRequestPBImpl and use the method from there. // need this otherwise the hashcode doesn't get generated correctly request.initAllFields(); for the caching in ClientRMService. Do you think we can do the cache on client side ? that’ll save RPCs, especially if we have many top commands running on client side. {quote} Fixed. Moved the cache to YarnClientImpl where the hashcode doesn't show up. As to why it wasn't correct - I suspect it might be to do with lazy initialization but I'm not sure. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3293: Attachment: apache-yarn-3293.6.patch Uploaded a new patch with getters so that findbugs doesn't complain. Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485295#comment-14485295 ] Junping Du commented on YARN-3347: -- Hi [~xgong], thanks for reporting this issue and delivering a patch to fix it! This looks like a very helpful feature for troubleshooting. I went through the patch quickly and have some comments so far:
{code}
+Option amOption = new Option(AM_CONTAINER_OPTION, true,
+    "Prints the AM Container logs for this application. "
+    + "Specify comma-separated value to get logs for related AM Container. "
+    + "To get logs for all AM Containers, use -am ALL. "
+    + "To get logs for the latest AM Container, use -am -1. "
+    + "By default, it will only print out syslog. Work with -logFiles "
+    + "to get other logs");
{code}
For the comma-separated value, do we mean the attempt number? If so, maybe we should describe it more explicitly here? Also, can we use 0 (instead of -1) for the AM container of the latest attempt? If so, all negative values here would be illegal.
{code}
+if (getConf().getBoolean(YarnConfiguration.APPLICATION_HISTORY_ENABLED,
+    YarnConfiguration.DEFAULT_APPLICATION_HISTORY_ENABLED)) {
+  System.out.println("Please enable the application history service. Or ");
+}
{code}
Missing ! before getConf()? In the method printAMContainerLogsForRunningApplication(),
{code}
+boolean printAll = amContainers.contains("ALL");
+
+for (int i = 0; i < amContainersInfo.length(); i++) {
+  boolean printThis = amContainers.contains(Integer.toString(i + 1))
+      || (i == (amContainersInfo.length() - 1)
+          && amContainers.contains(Integer.toString(-1)));
+  if (printAll || printThis) {
+    String nodeHttpAddress =
+        amContainersInfo.getJSONObject(i).getString("nodeHttpAddress");
+    String containerId =
+        amContainersInfo.getJSONObject(i).getString("containerId");
+    String nodeId = amContainersInfo.getJSONObject(i).getString("nodeId");
+    if (nodeHttpAddress != null && containerId != null
+        && !nodeHttpAddress.isEmpty() && !containerId.isEmpty()) {
+      printContainerLogsFromRunningApplication(conf, appId, containerId,
+          nodeHttpAddress, nodeId, logFiles, logCliHelper, appOwner);
+    }
+  }
+}
+return 0;
+}
{code}
Sounds like we are re-ordering the sequence of the user's input, which seems unnecessary to me. I would suggest keeping the order of the user's input, or it could confuse people. Also, the logic here sounds not quite straightforward; I would expect something simpler, like the pseudo code below:
{code}
if (printAll) {
  go through amContainersInfo and print
}
for (amContainer : amContainers) {
  amContainer == -1 ? print amContainersInfo(last-one)
                    : print amContainersInfo(amContainer - 1);
}
{code}
Also, the method run(String[] args) looks very complex now. Can we do some refactoring there and put some comments inline? Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specific container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for the AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. 
We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
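A Java rendering of the pseudo code Junping suggests above; printAMContainer is a hypothetical helper, and this is a sketch rather than code from the patch:
{code}
// Sketch: print in the order the user gave, with -1 meaning the latest
// attempt's AM container. Assumes amContainersInfo is ordered by attempt.
if (printAll) {
  for (int i = 0; i < amContainersInfo.length(); i++) {
    printAMContainer(amContainersInfo.getJSONObject(i));
  }
} else {
  for (String amContainer : amContainers) {
    int attempt = Integer.parseInt(amContainer.trim());
    int index = (attempt == -1)
        ? amContainersInfo.length() - 1   // latest attempt
        : attempt - 1;                    // attempt numbers are 1-based
    printAMContainer(amContainersInfo.getJSONObject(index));
  }
}
{code}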
[jira] [Commented] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485717#comment-14485717 ] Wangda Tan commented on YARN-3466: -- Updated title and description, added node-label column to reflect changes in the patch. Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3466: - Description: The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. (was: The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. ) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485733#comment-14485733 ] Hudson commented on YARN-3429: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be) * hadoop-yarn-project/CHANGES.txt TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_01 The error logs is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI
Anthony Rojas created YARN-3467: --- Summary: Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI Key: YARN-3467 URL: https://issues.apache.org/jira/browse/YARN-3467 Project: Hadoop YARN Issue Type: New Feature Components: webapp, yarn Affects Versions: 2.5.0 Reporter: Anthony Rojas Priority: Minor The YARN REST API can report on the following properties: *allocatedMB*: The sum of memory in MB allocated to the application's running containers *allocatedVCores*: The sum of virtual cores allocated to the application's running containers *runningContainers*: The number of containers currently running for the application Currently, the RM Web UI does not report on these items (at least I couldn't find any entries within the Web UI). It would be useful for YARN Application and Resource troubleshooting to have these properties and their corresponding values exposed on the RM WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485728#comment-14485728 ] Hudson commented on YARN-3294: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period - Key: YARN-3294 URL: https://issues.apache.org/jira/browse/YARN-3294 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485726#comment-14485726 ] Hudson commented on YARN-3457: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * hadoop-yarn-project/CHANGES.txt NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore {code} @Override protected void serviceInit(Configuration conf) throws Exception { .. try { exec.init(); } catch (IOException e) { throw new YarnRuntimeException("Failed to initialize container executor", e); } this.context = createNMContext(containerTokenSecretManager, nmTokenSecretManager, nmStore); {code} context is null when service init fails {code} private void stopRecoveryStore() throws IOException { nmStore.stop(); if (context.getDecommissioned() && nmStore.canRecover()) { .. } } {code} Null pointer exception thrown {quote} 2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
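A minimal sketch of the kind of guard the fix needs (an assumed shape for illustration, not necessarily the committed patch): stopRecoveryStore has to tolerate a context that was never created because serviceInit failed early.
{code}
// Guard against serviceInit having failed before createNMContext ran.
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context != null && context.getDecommissioned() && nmStore.canRecover()) {
    // ... remove recovery state as before ...
  }
}
{code}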
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485724#comment-14485724 ] Hudson commented on YARN-3110: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/CHANGES.txt Few issues in ApplicationHistory web ui --- Key: YARN-3110 URL: https://issues.apache.org/jira/browse/YARN-3110 Project: Hadoop YARN Issue Type: Sub-task Components: applications, timelineserver Affects Versions: 2.6.0 Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch Application state and History link wrong when Application is in unassigned state 1. Configure capacity scheduler with queue size as 1 and max Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned from the resource manager side) 2. Submit application to the queue and check the state and link in Application history State = null and History link shown as N/A on the applicationhistory page Kill the same application. In the timeline server logs the below is shown when selecting the application link. {quote} 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38) at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485741#comment-14485741 ] Jian He commented on YARN-3136: --- [~sunilg], sorry for the late response. We can suppress the findbugs warning, given it's a non-issue. I found the below synchronization is added in the newest patch; I think it's not necessary? {code} synchronized (this) { appImpl = this.rmContext.getRMApps().get(appId); amContainerId = rmContext.getRMApps().get(appId) .getCurrentAppAttempt().getMasterContainer().getId(); } {code} getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
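A minimal sketch of the unsynchronized version being suggested, assuming RMContext#getRMApps is backed by a concurrent map (declarations simplified for illustration):
{code}
// A plain read from a ConcurrentMap needs no extra lock; reusing the first
// lookup also avoids fetching the same app twice.
RMApp appImpl = this.rmContext.getRMApps().get(appId);
ContainerId amContainerId =
    appImpl.getCurrentAppAttempt().getMasterContainer().getId();
{code}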
[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-3055: -- Attachment: YARN-3055.patch Haven't had a chance to run findbugs. Might grumble about sync dttr.applicationIds. Will check this afternoon. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See the following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485751#comment-14485751 ] Robert Kanter commented on YARN-3429: - Ya, sorry about that; I only noticed yesterday, and so I fixed CHANGES.txt to say YARN-3429. Unfortunately, I can't fix the git message or the Hudson comments in YARN-2429. TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken from appattempt_1427804754787_0001_01 The error log is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485775#comment-14485775 ] zhihai xu commented on YARN-3464: - [~kasha], thanks for the information. I just looked at YARN-3024; yes, it will make this issue happen more frequently. Before YARN-3024, private resources were localized one by one. The next one won't start until the current one finishes localization, so private resource localization takes longer. With YARN-3024, the localization is done in parallel, and multiple files can be localized at the same time. The chance of the ContainerLocalizer being killed while the last two PRIVATE LocalizerResourceRequestEvents are added is bigger. Yes, your suggestion is also what I thought. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list for LocalizerResourceRequestEvent is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, this LocalizerResourceRequestEvent will never be handled. Without ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in LOCALIZING state, until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485781#comment-14485781 ] Vinod Kumar Vavilapalli commented on YARN-3391: --- A cosmetic suggestion: flow_run - flow_run_name or flow_run_id ? Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485791#comment-14485791 ] Vrushali C commented on YARN-3391: -- [~vinodkv] , +1 for flow_run to be called as flow_run_id. It's a number (epoch timestamp). If we call it flow_run_name, that makes it sound like it's a string. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
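A minimal illustration of the point above, under the assumption stated in the comment that a flow run id is an epoch timestamp; the variable name is hypothetical:
{code}
// A flow run id as an epoch timestamp: a numeric long, not a string.
long flowRunId = System.currentTimeMillis();
{code}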
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485797#comment-14485797 ] Naganarasimha G R commented on YARN-3044: - Thanks for the review comments [~zjshen], bq. Can we use ContainerEntity. The events from RM are RM__EVENT, and those from NM are NM__EVENT. This approach should be fine, will update in the next patch. bq. I think we may overestimate the performance impact of writing NM lifecycles. Perhaps a more reasonable performance metric is {{cost of writing lifecycle events per container / cost of managing lifecycle per container * 100%}}. For example, if it is 2%, I guess it will probably be acceptable. Well, true, we might be underestimating the RM's ability to handle publishing of Container entities. But currently I have anyway made it configurable to publish Container entities from the RM side, and while measuring performance we can enable this and check the performance; if fine, then we can totally disable this configuration check and make the RM publish always. Your opinion? bq. I'm not sure if I understand this part correctly, but I incline that system timeline data (RM/NM) is controlled by cluster config and per cluster, while application data is controlled by framework or even per-application config. It may have some problem if the user is able to change the former config. For example, he can hide its application information from cluster admin. Maybe I didn't get this correctly: is it that you intend to say that framework/cluster config (which can impact the application execution) should be logged by RM/NM and other application-specific config can be logged by the AM? bq. Do you mean we should keep yarn.resourcemanager.system-metrics-publisher.enabled to control RM SMP, and create yarn.nodemanager.system-metrics-publisher.enabled to control NM SMP? No, I meant this comment of [~djp] {{We can have different entity types, e.g. NM_CONTAINER_EVENT, RM_CONTAINER_EVENT, for containers' event get posted from NM or RM then we can fully understand how the world could be different from NM and RM (i.e. start time, end time, etc.}} {{However, we can disable RM-side posting work in production environment by default.}} [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
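For reference, the RM-side switch discussed above already exists; a minimal sketch of toggling it programmatically (the NM-side equivalent is only being proposed in this thread and is an assumption, not an existing property):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Existing RM-side switch for the SystemMetricsPublisher.
Configuration conf = new YarnConfiguration();
conf.setBoolean("yarn.resourcemanager.system-metrics-publisher.enabled", true);
{code}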
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485794#comment-14485794 ] zhihai xu commented on YARN-3464: - I also created another JIRA, YARN-3465, which can help this issue and make sure localization happens in the correct order: PUBLIC, PRIVATE and APPLICATION. The issue in my case is also because the PRIVATE LocalResourceRequest is reordered to first and the APPLICATION LocalResourceRequest is reordered to last, with the PUBLIC LocalResourceRequest in the middle, which adds delay for the APPLICATION LocalResourceRequest. Because the entrySet order of a HashMap is not fixed, a LinkedHashMap should be used. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list for LocalizerResourceRequestEvent is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, this LocalizerResourceRequestEvent will never be handled. Without ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in LOCALIZING state, until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
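A minimal sketch of the ordering point above (hypothetical key names, not the actual NM code): iteration over a HashMap does not preserve insertion order, while a LinkedHashMap does.
{code}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
  public static void main(String[] args) {
    // Insertion order: PUBLIC, PRIVATE, APPLICATION.
    Map<String, String> hash = new HashMap<>();
    Map<String, String> linked = new LinkedHashMap<>();
    for (String v : new String[] {"PUBLIC", "PRIVATE", "APPLICATION"}) {
      hash.put(v, v);
      linked.put(v, v);
    }
    // HashMap iteration order depends on hashing and may differ from insertion.
    System.out.println(hash.keySet());
    // LinkedHashMap always iterates in insertion order.
    System.out.println(linked.keySet()); // [PUBLIC, PRIVATE, APPLICATION]
  }
}
{code}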
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798 ] Thomas Graves commented on YARN-3434: - [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again. I'll have a patch up for this shortly. I would appreciate it if you could take a look and give me feedback. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798 ] Thomas Graves edited comment on YARN-3434 at 4/8/15 6:59 PM: - [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again in assignContainer() if it skipped the checks in assignContainers() based on reservations but then is allowed by shouldAllocOrReserveNewContainer. I'll have a patch up for this shortly. I would appreciate it if you could take a look and give me feedback. was (Author: tgraves): [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again. I'll have a patch up for this shortly. I would appreciate it if you could take a look and give me feedback. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Rojas updated YARN-3467: Summary: Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI (was: Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI --- Key: YARN-3467 URL: https://issues.apache.org/jira/browse/YARN-3467 Project: Hadoop YARN Issue Type: New Feature Components: webapp, yarn Affects Versions: 2.5.0 Reporter: Anthony Rojas Priority: Minor The YARN REST API can report on the following properties: *allocatedMB*: The sum of memory in MB allocated to the application's running containers *allocatedVCores*: The sum of virtual cores allocated to the application's running containers *runningContainers*: The number of containers currently running for the application Currently, the RM Web UI does not report on these items (at least I couldn't find any entries within the Web UI). It would be useful for YARN Application and Resource troubleshooting to have these properties and their corresponding values exposed on the RM WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485792#comment-14485792 ] Rohith commented on YARN-3467: -- I think the ApplicationAttempt page would give this information. This page is very helpful for debugging the application. Would you have a look into this page? Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI --- Key: YARN-3467 URL: https://issues.apache.org/jira/browse/YARN-3467 Project: Hadoop YARN Issue Type: New Feature Components: webapp, yarn Affects Versions: 2.5.0 Reporter: Anthony Rojas Priority: Minor The YARN REST API can report on the following properties: *allocatedMB*: The sum of memory in MB allocated to the application's running containers *allocatedVCores*: The sum of virtual cores allocated to the application's running containers *runningContainers*: The number of containers currently running for the application Currently, the RM Web UI does not report on these items (at least I couldn't find any entries within the Web UI). It would be useful for YARN Application and Resource troubleshooting to have these properties and their corresponding values exposed on the RM WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485818#comment-14485818 ] Karthik Kambatla commented on YARN-3464: We can maybe discuss this more on YARN-3465, but I don't think having it sorted is necessary. The container cannot be started until all the resources are localized, so the order of their downloads shouldn't matter as long as they all get localized. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list for LocalizerResourceRequestEvent is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, this LocalizerResourceRequestEvent will never be handled. Without ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in LOCALIZING state, until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485827#comment-14485827 ] Robert Kanter commented on YARN-2423: - Thanks for the comments Steve; those definitely sound like good suggestions. However, I'm not going to spend time updating the patch again if we're not going to actually commit this, and it seems like we're not. If that ever changes, I'll make sure to incorporate them though. TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485834#comment-14485834 ] Thomas Graves commented on YARN-3434: - Note I had a reproducible test case for this. Set userlimit% to 100%, user limit factor to 1. 15 nodes, 20GB each. 1 queue configured for capacity 70, the 2nd queue configured for capacity 30. I started a sleep job needing 10 containers of 12GB each in the first queue. I then started a second job in the 2nd queue that needed 25 containers of 12GB each; the second job got containers but then had to reserve others while waiting for the first job to release some. Without this change, when the first job started releasing containers the second job would grab them and go over the user limit. With this fix it stayed within the user limit. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485396#comment-14485396 ] Sunil G commented on YARN-3136: --- Hi [~jlowe] and [~jianhe], could you please have a look at the comment above? getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
Jason Lowe created YARN-3466: Summary: RM nodes web page does not sort by node HTTP address or containers Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485461#comment-14485461 ] Daryn Sharp commented on YARN-3055: --- bq. It does seem odd to get the expiration date by renewing the token The expiration is metadata associated with the token that is only known to the token issuer's secret manager. The correct fix is for the renewer to not reschedule if the next expiration is the same as the last. The bug wasn't a real priority when tokens weren't renewed forever. If we regress to renewing forever, then it does become a problem. bq. I think currently the sub-job won't kill the overall workflow. Correct, I misread in my haste. It's rather the opposite: sub-jobs can override the original job's request to cancel the tokens. bq. I think overall the current patch will work, other than few comments I have. It works but not in a desirable way. Jason posted my patch that we use internally on YARN-3439, which is duped to this jira. I'm updating it to handle the proxy refresh cases and will post shortly. The current semantics of the conf setting and the 2.x changes have been nothing but production blockers. Ref counting will solve this once and for all. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See the following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
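A minimal sketch of the ref-counting idea mentioned above, under assumed names (the map, helper methods and overall shape are hypothetical, not the actual DelegationTokenRenewer code): each token tracks the applications using it, and the timer and cancellation only go away with the last app.
{code}
// Hypothetical sketch: remove a token only when its last referencing app exits.
private final Map<Token<?>, Set<ApplicationId>> tokenToApps = new HashMap<>();

synchronized void removeApplicationFromRenewal(ApplicationId appId, Token<?> token) {
  Set<ApplicationId> apps = tokenToApps.get(token);
  if (apps == null) {
    return;
  }
  apps.remove(appId);
  if (apps.isEmpty()) {
    tokenToApps.remove(token);
    cancelTimerTask(token);   // assumed helper: stop the renewal timer
    cancelToken(token);       // assumed helper: cancel with the issuer
  }
  // Otherwise keep the timer running so the remaining apps still get renewals.
}
{code}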
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485527#comment-14485527 ] Zhijie Shen commented on YARN-3044: --- Before screening the patch details, I have some high level comments: bq. IIUC you meant we will have RMContainerEntity having type as YARN_RM_CONTAINER and NMContainerEntity having type as YARN_NM_CONTAINER right ? Can we use ContainerEntity. The events from RM are RM__EVENT, and those from NM are NM__EVENT. bq. I'm very much concerned about the volume of writes that the RM collector would need to do, bq. I fully understand the concern from Sangjin Lee that RM may not afford tens of thousands of containers in a large cluster. I also think publishing all container lifecycle events from NM is likely to be a big cost in total, but I'd like to provide a point from another point of view. Say we have a big cluster that can afford 5,000 concurrent containers. RM has to maintain the lifecycle of these 5K containers, and I don't think a less powerful server can manage it, right? Assume we have such a powerful server to run the RM of a big cluster, will publishing lifecycle events be a big deal to the server? I'm not sure, but I can provide some hints. Now each container will write 2 events per lifecycle, and perhaps in the future we want to record each state transition, and result in ~10 events per lifecycle. Therefore, we have 10 * 5K lifecycle events, and they won't be written at the same moment because containers' lifecycles are usually async. Let's assume each container runs for 1h and lifecycle events are uniformly distributed; in each second, there will just be around 14 concurrent writes (10 * 5,000 events spread over 3,600 seconds) for a powerful server. I think we may overestimate the performance impact of writing NM lifecycles. Perhaps a more reasonable performance metric is {{cost of writing lifecycle events per container / cost of managing lifecycle per container * 100%}}. For example, if it is 2%, I guess it will probably be acceptable. bq. all configs will not be set as part of this so was there more planned for this from the framework side or each application needs to take care of this on their own to populate configuration information ? bq. In that sense, how about letting frameworks (namely AMs) write the configuration instead of RM? I'm not sure if I understand this part correctly, but I'm inclined to think that system timeline data (RM/NM) is controlled by cluster config and per cluster, while application data is controlled by framework or even per-application config. It may have some problem if the user is able to change the former config. For example, he can hide his application information from the cluster admin. bq. I have also incorporated the changes to support RMContainer metrics based on configuration (Junping's comments). Do you mean we should keep {{yarn.resourcemanager.system-metrics-publisher.enabled}} to control RM SMP, and create {{yarn.nodemanager.system-metrics-publisher.enabled}} to control NM SMP? [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485530#comment-14485530 ] Xuan Gong commented on YARN-3347: - Thanks for the review. bq. For comma-separated value, do we mean attempt number? If so, may be we should describe more explicitly here? Also, can we use 0 (instead of -1) for AM container of latest attempt. If so, all negative value here is illegal. Added. I prefer to use -1 for the latest AM Container. 0 in the list/array is the first element. bq. Missing ! before getConf()? Fixed bq. Sounds like we are re-order the sequence of user's input which seems unnecessary to me. I would suggest to keep order from user's input or it could confuse people. Fixed bq. Also, for method of run(String[] args), it looks very complexity for now. Can we do some refactor work there and put some comments inline? Yes, it indeed added some logic. Added some comments. Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specific container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for the AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
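Based on the -1 convention discussed above, usage would presumably look something like the following; the exact option name comes from the patch under review and is an assumption here:
{code}
# Fetch AM container logs for the latest application attempt (assumed syntax).
yarn logs -applicationId application_1427804754787_0001 -am -1
# Fetch AM container logs for specific attempts via a comma-separated list.
yarn logs -applicationId application_1427804754787_0001 -am 1,2
{code}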
[jira] [Updated] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3347: Attachment: YARN-3347.3.patch Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch, YARN-3347.3.patch Right now, we could specify applicationId, node http address and container ID to get the specific container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for the AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485541#comment-14485541 ] Varun Vasudev commented on YARN-3293: - Thanks for the review Craig! I thought about it but I didn't get a chance to look at the FairScheduler page. It should be pretty easy to pull out the block into its own class. Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3361: - Attachment: YARN-3361.4.patch Attached patch fixed several naming issues. (ver.4) CapacityScheduler side changes to support non-exclusive node labels --- Key: YARN-3361 URL: https://issues.apache.org/jira/browse/YARN-3361 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3361.1.patch, YARN-3361.2.patch, YARN-3361.3.patch, YARN-3361.4.patch According to the design doc attached in YARN-3214, we need to implement the following logic in CapacityScheduler: 1) When allocating a resource request with no node-label specified, it should get preferentially allocated to nodes without labels. 2) When there are available resources on a node with a label, they can be used by applications in the following order: - Applications under queues which can access the label and ask for the same labeled resource. - Applications under queues which can access the label and ask for non-labeled resource. - Applications under queues which cannot access the label and ask for non-labeled resource. 3) Expose necessary information that can be used by preemption policy to make preemption decisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485638#comment-14485638 ] Jian He commented on YARN-3055: --- Sure, looking forward to your patch. bq. The correct fix is for the renewer to not reschedule if the next expiration is the same as the last. Sorry, didn't get what you mean. Mind clarifying more? The renew call after getting the new token is solely to retrieve the expiration date for the token. I found that, given the RM renews all tokens at once for each app on app submission, if renew rescheduling becomes a DOS problem, then the app submission situation may be much worse. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See the following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3459) TestLog4jWarningErrorMetricsAppender breaks in trunk
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3459: - Fix Version/s: (was: 2.7.0) 2.8.0 TestLog4jWarningErrorMetricsAppender breaks in trunk Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3459: - Assignee: Varun Vasudev (was: Li Lu) Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485655#comment-14485655 ] Rohith commented on YARN-2684: -- [~kasha] kindly provide your thoughts on any more changes to be done as part of this JIRA. FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2684.patch YARN-2308 fixes this issue for CS, this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2901: Assignee: Wangda Tan (was: Varun Vasudev) Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Wangda Tan Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n' (20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender? (I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
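A minimal sketch of the custom-appender idea from the description, assuming log4j 1.x; the class and field names are hypothetical and this is not the actual Log4jWarningErrorMetricsAppender:
{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

// Counts WARN and ERROR events so a web UI could report them per time window.
public class WarnErrorCountingAppender extends AppenderSkeleton {
  private final AtomicLong errors = new AtomicLong();
  private final AtomicLong warnings = new AtomicLong();

  @Override
  protected void append(LoggingEvent event) {
    if (event.getLevel().equals(Level.ERROR)) {
      errors.incrementAndGet();
    } else if (event.getLevel().equals(Level.WARN)) {
      warnings.incrementAndGet();
    }
  }

  public long getErrorCount() { return errors.get(); }
  public long getWarningCount() { return warnings.get(); }

  @Override
  public void close() { }

  @Override
  public boolean requiresLayout() { return false; }
}
{code}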
[jira] [Commented] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485711#comment-14485711 ] Wangda Tan commented on YARN-3466: -- Tried in a local cluster, HTTP address, #containers and node-label sorting all work. +1. Pending Jenkins. RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3459: - Summary: Fix failiure of TestLog4jWarningErrorMetricsAppender (was: TestLog4jWarningErrorMetricsAppender breaks in trunk) Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2901: - Assignee: Varun Vasudev (was: Wangda Tan) Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n' (20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender? (I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3459) Fix failure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485667#comment-14485667 ] Hudson commented on YARN-3459: -- FAILURE: Integrated in Hadoop-trunk-Commit #7533 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7533/]) YARN-3459. Fix failure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java * hadoop-yarn-project/CHANGES.txt Fix failure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:<0> but was:<1> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
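As a general note on the flakiness above (a sketch only, not the committed fix; getErrorCounts() is a hypothetical accessor on the appender under test): assertions against a timer-driven purge are usually made robust by polling up to a deadline instead of asserting after a fixed sleep.
{code}
// Poll until the purge fires or a generous deadline passes, then assert.
long deadline = System.currentTimeMillis() + 10000;
while (appender.getErrorCounts().size() != 0
    && System.currentTimeMillis() < deadline) {
  Thread.sleep(100); // the purge runs on a timer; give it room to fire
}
Assert.assertEquals(0, appender.getErrorCounts().size());
{code}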
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485691#comment-14485691 ] Hudson commented on YARN-3110: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java Few issues in ApplicationHistory web ui --- Key: YARN-3110 URL: https://issues.apache.org/jira/browse/YARN-3110 Project: Hadoop YARN Issue Type: Sub-task Components: applications, timelineserver Affects Versions: 2.6.0 Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch The application state and History link are wrong when the application is in the unassigned state: 1. Configure the capacity scheduler with a queue size of 1 and an Absolute Max Capacity of 10.0% (the application state is then Accepted and Unassigned on the resource manager side). 2. Submit an application to the queue and check the state and link in the application history. State = null and the History link is shown as N/A on the applicationhistory page. Kill the same application. When the application link is selected, the timeline server logs show: {quote} 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
	at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
	at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
	at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
	at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
	at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
	at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
	at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
	at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
	at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
	at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
	at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
	at
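The trace above fails inside getContainer when the attempt has no AM container yet. A hedged sketch of the null-guard such a fix typically needs (the helper name and shape are illustrative, not the committed YARN-3110 patch):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptReport;
import org.apache.hadoop.yarn.api.records.ContainerReport;

// If the application is still unassigned there is no AM container yet;
// return null so the web view can render N/A instead of throwing an NPE.
private ContainerReport getAMContainerSafely(ApplicationAttemptReport attempt) {
  if (attempt == null || attempt.getAMContainerId() == null) {
    return null;
  }
  return getContainer(attempt.getAMContainerId()); // existing lookup
}
{code}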
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485693#comment-14485693 ] Hudson commented on YARN-3457: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager serviceInit fails, a NullPointerException is thrown from stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when serviceInit fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
The NullPointerException thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
	at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
	at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
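A minimal sketch of the guard the report calls for (an assumed shape, not necessarily the committed YARN-3457 patch): skip the context-dependent cleanup when serviceInit failed before createNMContext() ran.
{code}
private void stopRecoveryStore() throws IOException {
  if (nmStore == null) {
    return; // serviceInit failed before the recovery store was created
  }
  nmStore.stop();
  // context is null when serviceInit failed before createNMContext() ran
  if (context != null && context.getDecommissioned() && nmStore.canRecover()) {
    // .. unchanged: remove the recovery state directory
  }
}
{code}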
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485687#comment-14485687 ] Karthik Kambatla commented on YARN-3464: bq. Looking at the code closely, I don't see any resources being removed from pending. So, pending shouldn't be empty after some of the resources have been downloaded. Never mind. findNextResource has a call to iterator.remove(). In any case, I think the right approach is to send an explicit event to the localizer to indicate we are done localizing all the resources. On receiving this, the localizer tracker sends the DIE action. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical A race condition in LocalizerRunner causes container localization to time out. Currently LocalizerRunner kills the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty:
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after the LocalizerRunner has killed the ContainerLocalizer due to an empty pending list, that LocalizerResourceRequestEvent will never be handled. Without a ContainerLocalizer, LocalizerRunner#update is never called. The container stays in the LOCALIZING state until it is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
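A sketch of the direction Karthik proposes (the flag and the completion event are invented names for illustration): only issue DIE once completion has been signalled explicitly, so a request that races with a momentarily empty pending list is still served.
{code}
// Inside LocalizerRunner's heartbeat handling (illustrative shape only):
LocalizerAction action = LocalizerAction.LIVE;
if (allResourcesDone) {
  // Set only after an explicit "localization complete" event, so no
  // LocalizerResourceRequestEvent can arrive after this point.
  action = LocalizerAction.DIE;
} else if (pending.isEmpty()) {
  // An empty pending list is not proof that localization is finished:
  // keep the ContainerLocalizer alive and let it heartbeat again.
  action = LocalizerAction.LIVE;
}
{code}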
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485700#comment-14485700 ] Hudson commented on YARN-3429: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be) * hadoop-yarn-project/CHANGES.txt TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken from appattempt_1427804754787_0001_01 The error log is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485695#comment-14485695 ] Hudson commented on YARN-3294: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period - Key: YARN-3294 URL: https://issues.apache.org/jira/browse/YARN-3294 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
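For illustration, a hedged sketch of the ad-hoc dumping idea in log4j 1.x. This is not the actual AdHocLogDumper API; the class and method names here are assumptions. The idea: raise one logger to DEBUG with a dedicated file appender, then revert after the requested period.
{code}
import java.io.IOException;
import java.util.Timer;
import java.util.TimerTask;

import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class AdHocDebugDump {
  public static void dumpFor(String loggerName, String targetFile,
      long periodMs) throws IOException {
    final Logger logger = Logger.getLogger(loggerName);
    final Level previousLevel = logger.getLevel();
    final FileAppender appender =
        new FileAppender(new PatternLayout("%d{ISO8601} %p %c: %m%n"),
            targetFile);
    logger.addAppender(appender);
    logger.setLevel(Level.DEBUG); // one logger only; rest of the RM untouched
    new Timer("adhoc-log-dump", true).schedule(new TimerTask() {
      @Override
      public void run() {
        logger.setLevel(previousLevel); // restore the configured level
        logger.removeAppender(appender);
        appender.close();
      }
    }, periodMs);
  }
}
{code}
A web endpoint could invoke it as, say, AdHocDebugDump.dumpFor("org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity", "/tmp/cs-debug.log", 5 * 60 * 1000L).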
[jira] [Updated] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3466: - Summary: Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column (was: RM nodes web page does not sort by node HTTP address or containers) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address or the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485714#comment-14485714 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-trunk-Commit #7534 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7534/]) YARN-2901 addendum: Fixed findbugs warning caused by the previous patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n' (20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender? (I'm open to suggestions on alternate mechanisms for implementing this.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-3448: -- Attachment: YARN-3448.4.patch A couple more bug fixes in patch 4. Next I'll try out the index change you suggested above. Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read-cache effectiveness based on its unique usage patterns, making each lookup much faster; it can also help with I/O to have the entity and index databases on separate disks. Use rolling DBs for the entity and index databases: 99.9% of the data is in these two sections, at roughly a 4:1 ratio (index to entity), at least for Tez. We can replace record-at-a-time DB removal with file system removal if we create a rolling set of databases that age out and can be removed efficiently. To do this we must always place an entity's events into the correct rolling DB instance based on start time, which lets us stitch the data back together while reading, with artificial paging. Relax the synchronous-write constraint: if we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes, which can be much faster. Prefer sequential writes: they can be several times faster than random writes, so spend some small effort arranging writes in a way that trends toward sequential write performance over random write performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
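To make the start-time routing constraint concrete, a small sketch, generic over the DB handle type; every name here is illustrative rather than the patch's API. Entities are routed to a rolling instance keyed by the bucket of their start time, so aging out becomes a whole-instance removal rather than record-at-a-time deletes.
{code}
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Function;

public class RollingDbs<DB> {
  private final long rollPeriodMs;
  // bucket start time -> DB instance; ordered so reads can stitch the
  // buckets back together in time order (the "artificial paging" above)
  private final ConcurrentSkipListMap<Long, DB> instances =
      new ConcurrentSkipListMap<>();

  public RollingDbs(long rollPeriodMs) {
    this.rollPeriodMs = rollPeriodMs;
  }

  // All of an entity's events go to the bucket of the entity's start time,
  // never the current wall clock, so its data stays in one instance.
  public DB forStartTime(long entityStartTime, Function<Long, DB> open) {
    long bucket = (entityStartTime / rollPeriodMs) * rollPeriodMs;
    return instances.computeIfAbsent(bucket, open);
  }

  // Aging out old data is a cheap bulk removal of whole instances; a real
  // implementation would also close each DB and delete its files.
  public void evictOlderThan(long minStartTime) {
    instances.headMap((minStartTime / rollPeriodMs) * rollPeriodMs).clear();
  }
}
{code}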
[jira] [Updated] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3426: Attachment: YARN-3426-040815.patch Fixed the problem in hadoop-annotate for our javadoc doclet (missing @Private tags on some methods). Uploaded the new patch with the new API XMLs. Add jdiff support to YARN - Key: YARN-3426 URL: https://issues.apache.org/jira/browse/YARN-3426 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Priority: Blocker Attachments: YARN-3426-040615-1.patch, YARN-3426-040615.patch, YARN-3426-040715.patch, YARN-3426-040815.patch Maybe we'd like to extend our current jdiff tool for hadoop-common and HDFS to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)