[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030284#comment-14030284 ] haosdent commented on YARN-2140: Thanks, [~ywskycn]. Looking forward to your work. Add support for network IO isolation/scheduling for containers -- Key: YARN-2140 URL: https://issues.apache.org/jira/browse/YARN-2140 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration
[ https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030286#comment-14030286 ] Svetozar Ivanov commented on YARN-2156: --- Hm, if it is as you said, I would expect a warning message in my logs, because my configuration settings are completely ignored. ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration --- Key: YARN-2156 URL: https://issues.apache.org/jira/browse/YARN-2156 Project: Hadoop YARN Issue Type: Bug Reporter: Svetozar Ivanov org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService#serviceStart() method has mistakenly hardcoded AuthMethod.TOKEN as Hadoop security authentication. It looks like this:
{code}
@Override
protected void serviceStart() throws Exception {
  Configuration conf = getConfig();
  YarnRPC rpc = YarnRPC.create(conf);

  InetSocketAddress masterServiceAddress = conf.getSocketAddr(
      YarnConfiguration.RM_SCHEDULER_ADDRESS,
      YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
      YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);

  Configuration serverConf = conf;
  // If the auth is not-simple, enforce it to be token-based.
  serverConf = new Configuration(conf);
  serverConf.set(
      CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
      SaslRpcServer.AuthMethod.TOKEN.toString());
  ...
}
{code}
Obviously such code makes sense only if the CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION config setting is missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2152) Recover missing container information
[ https://issues.apache.org/jira/browse/YARN-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030292#comment-14030292 ] Hadoop QA commented on YARN-2152: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650228/YARN-2152.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3979//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3979//console This message is automatically generated. Recover missing container information - Key: YARN-2152 URL: https://issues.apache.org/jira/browse/YARN-2152 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2152.1.patch, YARN-2152.1.patch, YARN-2152.2.patch Container information such as container priority and container start time cannot be recovered because NM container today lacks such container information to send across on NM registration when RM recovery happens -- This message was sent by Atlassian JIRA (v6.2#6252)
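For readers following the recovery work in YARN-2152: a hedged sketch of the kind of per-container record the NM would need to report on re-registration so the RM can rebuild this state. The class and field names below are illustrative only, not the actual container-status API the patch touches.
{code}
// Illustrative only: the extra fields discussed above (priority, start time)
// carried alongside the container id when the NM re-registers after an RM restart.
public class RecoveredContainerInfo {
  private final String containerId;
  private final int priority;       // container priority at allocation time
  private final long creationTime;  // container start time on the NM

  public RecoveredContainerInfo(String containerId, int priority, long creationTime) {
    this.containerId = containerId;
    this.priority = priority;
    this.creationTime = creationTime;
  }

  public String getContainerId() { return containerId; }
  public int getPriority() { return priority; }
  public long getCreationTime() { return creationTime; }
}
{code}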
[jira] [Assigned] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2144: Assignee: Wangda Tan Add logs when preemption occurs --- Key: YARN-2144 URL: https://issues.apache.org/jira/browse/YARN-2144 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan There should be easy-to-read logs when preemption does occur. 1. For debugging purpose, RM should log this. 2. For administrative purpose, RM webpage should have a page to show recent preemption events. RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030433#comment-14030433 ] Tsuyoshi OZAWA commented on YARN-1919: -- Thanks for the review, Jian. [~kkambatl], could you check it? Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
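The fix under review here is about failing with a clear message instead of the NPE above. A minimal sketch of that idea, assuming YarnConfiguration.RM_CLUSTER_ID and YarnRuntimeException are available in scope; this is not the attached patch.
{code}
// Sketch only: validate the HA cluster id up front and fail with a readable
// message instead of hitting a NullPointerException later in EmbeddedElectorService.
String clusterId = conf.get(YarnConfiguration.RM_CLUSTER_ID);
if (clusterId == null || clusterId.isEmpty()) {
  throw new YarnRuntimeException(
      YarnConfiguration.RM_CLUSTER_ID + " is required when HA is enabled");
}
{code}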
[jira] [Commented] (YARN-2155) FairScheduler: Incorrect threshold check for preemption
[ https://issues.apache.org/jira/browse/YARN-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030524#comment-14030524 ] Hudson commented on YARN-2155: -- FAILURE: Integrated in Hadoop-Yarn-trunk #582 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/582/]) YARN-2155. FairScheduler: Incorrect threshold check for preemption. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1602295) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java FairScheduler: Incorrect threshold check for preemption --- Key: YARN-2155 URL: https://issues.apache.org/jira/browse/YARN-2155 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.5.0 Attachments: YARN-2155.patch {code} private boolean shouldAttemptPreemption() { if (preemptionEnabled) { return (preemptionUtilizationThreshold < Math.max( (float) rootMetrics.getAvailableMB() / clusterResource.getMemory(), (float) rootMetrics.getAvailableVirtualCores() / clusterResource.getVirtualCores())); } return false; } {code} preemptionUtilizationThreshold should be compared with allocatedResource instead of availableResource. -- This message was sent by Atlassian JIRA (v6.2#6252)
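As a reading aid for the snippet above: a minimal sketch of the corrected check the fix describes, comparing the threshold against utilization (allocated / total) rather than against available headroom. It reuses the fields from the snippet; getAllocatedMB/getAllocatedVirtualCores are assumed to exist on the root queue metrics, and this is not the committed patch verbatim.
{code}
// Sketch of the corrected check: preempt only when cluster utilization
// (allocated / total) in either dimension exceeds the configured threshold.
private boolean shouldAttemptPreemption() {
  if (!preemptionEnabled) {
    return false;
  }
  float memoryUtilization =
      (float) rootMetrics.getAllocatedMB() / clusterResource.getMemory();
  float coreUtilization =
      (float) rootMetrics.getAllocatedVirtualCores() / clusterResource.getVirtualCores();
  return preemptionUtilizationThreshold < Math.max(memoryUtilization, coreUtilization);
}
{code}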
[jira] [Commented] (YARN-1702) Expose kill app functionality as part of RM web services
[ https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030528#comment-14030528 ] Hudson commented on YARN-1702: -- FAILURE: Integrated in Hadoop-Yarn-trunk #582 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/582/]) YARN-1702. Added kill app functionality to RM web services. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1602298) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppState.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm Expose kill app functionality as part of RM web services Key: YARN-1702 URL: https://issues.apache.org/jira/browse/YARN-1702 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.5.0 Attachments: apache-yarn-1702.10.patch, apache-yarn-1702.11.patch, apache-yarn-1702.12.patch, apache-yarn-1702.13.patch, apache-yarn-1702.14.patch, apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, apache-yarn-1702.8.patch, apache-yarn-1702.9.patch Expose functionality to kill an app via the ResourceManager web services API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration
[ https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030639#comment-14030639 ] Daryn Sharp commented on YARN-2156: --- A warning doesn't make sense because it implies there is something you should change. There's not. The config setting, whether explicitly set or not, is entirely irrelevant. By design, yarn always uses tokens and these tokens carry essential information that is not otherwise obtainable for non-token authenticated connections. That's why token authentication is explicitly set. ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration --- Key: YARN-2156 URL: https://issues.apache.org/jira/browse/YARN-2156 Project: Hadoop YARN Issue Type: Bug Reporter: Svetozar Ivanov org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService#serviceStart() method has mistakenly hardcoded AuthMethod.TOKEN as Hadoop security authentication. It looks like that:
{code}
@Override
protected void serviceStart() throws Exception {
  Configuration conf = getConfig();
  YarnRPC rpc = YarnRPC.create(conf);

  InetSocketAddress masterServiceAddress = conf.getSocketAddr(
      YarnConfiguration.RM_SCHEDULER_ADDRESS,
      YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
      YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);

  Configuration serverConf = conf;
  // If the auth is not-simple, enforce it to be token-based.
  serverConf = new Configuration(conf);
  serverConf.set(
      CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
      SaslRpcServer.AuthMethod.TOKEN.toString());
  ...
}
{code}
Obviously such code makes sense only if CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION config setting is missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030643#comment-14030643 ] Robert Joseph Evans commented on YARN-2140: --- We are working on similar things for Storm. I am very interested in your design, because for any streaming system to truly have a chance on YARN, soft guarantees on network I/O are critical. There are several big problems with network I/O even if the user can effectively estimate what they will need. The first is that the resource is not limited to a single node in the cluster. The network has a topology, and a bottleneck can show up at any point in that topology. So you may think you are fine because each node in a rack is not scheduled to be using the full bandwidth that the network card(s) can support, but you can easily have saturated the top-of-rack switch without knowing it. To solve this problem you effectively have to know the topology of the application itself, so that you can schedule the node-to-node network connections within that application. If users don't know how much network they are going to use at a high level, they will never have any idea at a low level. But then you also have the big problem of batch being very bursty in its network usage. The only way to solve this is going to require network hardware support for prioritizing packets. But I'll wait for your design before writing too much more. Add support for network IO isolation/scheduling for containers -- Key: YARN-2140 URL: https://issues.apache.org/jira/browse/YARN-2140 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.2#6252)
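To make the top-of-rack point above concrete, here is a purely hypothetical sketch (not from any YARN patch) of the rack-level arithmetic: every node can be under its own NIC limit while the sum of per-node reservations still exceeds the rack's uplink. All names and units are invented for illustration.
{code}
import java.util.Map;

// Hypothetical illustration of the rack-level bottleneck described above.
public class RackBandwidthCheck {

  /**
   * @param scheduledEgressMbpsByNode scheduled egress bandwidth per node, in Mbit/s
   * @param torUplinkMbps capacity of the top-of-rack uplink, in Mbit/s
   * @return true if the aggregate reservation would saturate the uplink
   */
  public static boolean rackUplinkSaturated(
      Map<String, Long> scheduledEgressMbpsByNode, long torUplinkMbps) {
    long total = 0;
    for (long mbps : scheduledEgressMbpsByNode.values()) {
      total += mbps;  // each node may individually be well under its NIC limit
    }
    return total > torUplinkMbps;
  }
}
{code}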
[jira] [Commented] (YARN-2155) FairScheduler: Incorrect threshold check for preemption
[ https://issues.apache.org/jira/browse/YARN-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030651#comment-14030651 ] Hudson commented on YARN-2155: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1773 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1773/]) YARN-2155. FairScheduler: Incorrect threshold check for preemption. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1602295) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java FairScheduler: Incorrect threshold check for preemption --- Key: YARN-2155 URL: https://issues.apache.org/jira/browse/YARN-2155 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.5.0 Attachments: YARN-2155.patch {code} private boolean shouldAttemptPreemption() { if (preemptionEnabled) { return (preemptionUtilizationThreshold < Math.max( (float) rootMetrics.getAvailableMB() / clusterResource.getMemory(), (float) rootMetrics.getAvailableVirtualCores() / clusterResource.getVirtualCores())); } return false; } {code} preemptionUtilizationThreshold should be compared with allocatedResource instead of availableResource. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1702) Expose kill app functionality as part of RM web services
[ https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030655#comment-14030655 ] Hudson commented on YARN-1702: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1773 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1773/]) YARN-1702. Added kill app functionality to RM web services. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1602298) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppState.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm Expose kill app functionality as part of RM web services Key: YARN-1702 URL: https://issues.apache.org/jira/browse/YARN-1702 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.5.0 Attachments: apache-yarn-1702.10.patch, apache-yarn-1702.11.patch, apache-yarn-1702.12.patch, apache-yarn-1702.13.patch, apache-yarn-1702.14.patch, apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, apache-yarn-1702.8.patch, apache-yarn-1702.9.patch Expose functionality to kill an app via the ResourceManager web services API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
Ted Yu created YARN-2158: Summary: TestRMWebServicesAppsModification sometimes fails in trunk Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2155) FairScheduler: Incorrect threshold check for preemption
[ https://issues.apache.org/jira/browse/YARN-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030721#comment-14030721 ] Hudson commented on YARN-2155: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1800 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1800/]) YARN-2155. FairScheduler: Incorrect threshold check for preemption. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1602295) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java FairScheduler: Incorrect threshold check for preemption --- Key: YARN-2155 URL: https://issues.apache.org/jira/browse/YARN-2155 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.5.0 Attachments: YARN-2155.patch {code} private boolean shouldAttemptPreemption() { if (preemptionEnabled) { return (preemptionUtilizationThreshold < Math.max( (float) rootMetrics.getAvailableMB() / clusterResource.getMemory(), (float) rootMetrics.getAvailableVirtualCores() / clusterResource.getVirtualCores())); } return false; } {code} preemptionUtilizationThreshold should be compared with allocatedResource instead of availableResource. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1702) Expose kill app functionality as part of RM web services
[ https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030725#comment-14030725 ] Hudson commented on YARN-1702: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1800 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1800/]) YARN-1702. Added kill app functionality to RM web services. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1602298) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppState.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm Expose kill app functionality as part of RM web services Key: YARN-1702 URL: https://issues.apache.org/jira/browse/YARN-1702 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.5.0 Attachments: apache-yarn-1702.10.patch, apache-yarn-1702.11.patch, apache-yarn-1702.12.patch, apache-yarn-1702.13.patch, apache-yarn-1702.14.patch, apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, apache-yarn-1702.8.patch, apache-yarn-1702.9.patch Expose functionality to kill an app via the ResourceManager web services API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2022: -- Attachment: YARN-2022.5.patch Thank you, Mayank. I have updated the patch as per the comments. I also tested on a real cluster and found that AM containers are spared by the proportional policy. Basic scenarios were tested as part of this. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
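A hedged sketch of the "AM containers last" idea in the scenario above: when the policy picks victims, order the candidates so ordinary task containers are taken before any AM container. The Candidate type and its isAMContainer flag are invented for illustration; the real policy works on RMContainer objects, and this is not the attached patch.
{code}
import java.util.Comparator;
import java.util.List;

// Illustration only: sort preemption candidates so AM containers come last,
// i.e. J2/J3 map tasks above would be preempted before J3's AM.
public class PreemptionOrdering {

  /** Minimal stand-in for the container info the policy would inspect. */
  public static class Candidate {
    final String containerId;
    final boolean isAMContainer;  // assumed flag; real code would ask the RMContainer

    public Candidate(String containerId, boolean isAMContainer) {
      this.containerId = containerId;
      this.isAMContainer = isAMContainer;
    }
  }

  public static void orderForPreemption(List<Candidate> candidates) {
    // false (task container) sorts before true (AM container), so AM
    // containers are only reached when no task containers are left.
    candidates.sort(Comparator.comparing((Candidate c) -> c.isAMContainer));
  }
}
{code}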
[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030750#comment-14030750 ] Wei Yan commented on YARN-2140: --- Thanks for the comments, [~revans2]. Add support for network IO isolation/scheduling for containers -- Key: YARN-2140 URL: https://issues.apache.org/jira/browse/YARN-2140 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1885: -- Summary: RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts (was: RM may not send the finished signal to some nodes where the application ran after RM restarts) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts --- Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the cli but i can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030800#comment-14030800 ] Hadoop QA commented on YARN-2022: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650318/YARN-2022.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3980//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3980//console This message is automatically generated. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030812#comment-14030812 ] Xuan Gong commented on YARN-2146: - [~airbots] Hey, Chen. Have you figured out why this happens? I am very curious. Yarn logs aggregation error --- Key: YARN-2146 URL: https://issues.apache.org/jira/browse/YARN-2146 Project: Hadoop YARN Issue Type: Bug Reporter: Chen He When I run yarn logs -applicationId application_xxx /tmp/application_xxx, it creates the file, shows part of the logs on the terminal screen, and then reports the following error: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-2158: --- Assignee: Varun Vasudev TestRMWebServicesAppsModification sometimes fails in trunk -- Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Varun Vasudev Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1365: Attachment: YARN-1365.005.patch Addressed Jian's comments. Updated finishApplicationMaster to return resync when the application is not registered, as per the agreement. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030874#comment-14030874 ] Anubhav Dhoot commented on YARN-1365: - Hi [~jianhe], I addressed all your comments except "we can print the current state of RMAppAttempt also, which will be useful for debugging". There is no easy way to get to RMAppAttempt at that point, and I don't want to add a dependency on it just for logging. Let me know if you think there is an easy way to get to it. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2157) Document YARN metrics
[ https://issues.apache.org/jira/browse/YARN-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-2157: Attachment: YARN-2157.patch Attaching a patch. Document YARN metrics - Key: YARN-2157 URL: https://issues.apache.org/jira/browse/YARN-2157 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Akira AJISAKA Assignee: Akira AJISAKA Attachments: YARN-2157.patch YARN-side of HADOOP-6350. Add YARN metrics to Metrics document. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2157) Document YARN metrics
[ https://issues.apache.org/jira/browse/YARN-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030943#comment-14030943 ] Jian He commented on YARN-2157: --- [~ajisakaa], thanks for working on this! This will be useful. Document YARN metrics - Key: YARN-2157 URL: https://issues.apache.org/jira/browse/YARN-2157 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Akira AJISAKA Assignee: Akira AJISAKA Attachments: YARN-2157.patch YARN-side of HADOOP-6350. Add YARN metrics to Metrics document. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2052: -- Target Version/s: 2.5.0 ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030955#comment-14030955 ] Hadoop QA commented on YARN-1365: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650337/YARN-1365.005.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3981//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3981//console This message is automatically generated. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.005.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2157) Document YARN metrics
[ https://issues.apache.org/jira/browse/YARN-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030958#comment-14030958 ] Hadoop QA commented on YARN-2157: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650339/YARN-2157.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3982//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3982//console This message is automatically generated. Document YARN metrics - Key: YARN-2157 URL: https://issues.apache.org/jira/browse/YARN-2157 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Akira AJISAKA Assignee: Akira AJISAKA Attachments: YARN-2157.patch YARN-side of HADOOP-6350. Add YARN metrics to Metrics document. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030966#comment-14030966 ] Vinod Kumar Vavilapalli commented on YARN-2052: --- bq. e.g. container_XXX_1000 after epoch 1. This scheme won't work with a single reserved digit for epochs and a large number of restarts over time. Here's my summary of what I think we should do: The current ContainerID format is {code} ContainerID { applicationAttemptID containerIDInt } {code} Let's just add a new field {code} + rmIdentifier {code} Old code (state-store, history-server etc) will not read it and that's fine. The only problem is users who are interpreting container_ID strings themselves. That is NOT supported. We should modify ConverterUtils to support the new-field, and that should do. Thoughts? ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030971#comment-14030971 ] Vinod Kumar Vavilapalli commented on YARN-2052: --- I forgot to add one more note that I myself ran into in an offline discussion with [~jianhe]: the new field can be the RMIdentifier, which today is backed by the start timestamp. But two RMs (active/standby) started at the same time can potentially clash w.r.t. timestamps. We can choose this to be timestamp+host-name etc., or simply a UUID. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
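For illustration of the format discussion in the two comments above, here is a hypothetical sketch of a container-id string that carries an extra RM identifier next to the existing applicationAttemptId + sequence number. The layout and field widths are assumptions made up for this example, not the format the project settled on.
{code}
// Hypothetical only: append an RM identifier so container ids stay unique
// across RM restarts; existing pieces are the cluster timestamp, app id,
// attempt id and the monotonically increasing container sequence number.
public class ContainerIdSketch {
  public static String toStringWithRmIdentifier(
      long rmIdentifier, long clusterTimestamp, int appId, int attemptId, long containerSeq) {
    return String.format("container_%d_%d_%04d_%02d_%06d",
        rmIdentifier, clusterTimestamp, appId, attemptId, containerSeq);
  }
}
{code}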
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030979#comment-14030979 ] Hadoop QA commented on YARN-1365: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650340/YARN-1365.005.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3983//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3983//console This message is automatically generated. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.005.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031001#comment-14031001 ] Jian He commented on YARN-1885: --- bq. The application list should only be respected when the node is not inactive? For nodes that expired but rejoin with earlier running applications, if the application has completed by that time, I think we should also send the app-finished signal? RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts --- Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the cli but i can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031007#comment-14031007 ] Hadoop QA commented on YARN-2074: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650352/YARN-2074.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3984//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3984//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3984//console This message is automatically generated. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2074: -- Attachment: YARN-2074.5.patch Fixed the findbugs warnings. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
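A minimal sketch of the rule this issue asks for, assuming the AM container's exit status is available and that ContainerExitStatus.PREEMPTED marks preempted containers; the helper name is invented for illustration and this is not the attached patch.
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;

// Illustrative helper: a preempted AM container should not count toward
// the application's max-attempt limit.
public class AMFailureAccounting {
  public static boolean countsTowardsMaxAttempts(int amContainerExitStatus) {
    switch (amContainerExitStatus) {
      case ContainerExitStatus.PREEMPTED:
        // Preemption is a scheduler decision, not an application failure.
        return false;
      default:
        return true;
    }
  }
}
{code}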
[jira] [Updated] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2158: Attachment: apache-yarn-2158.0.patch Patch to add debugging information to the test. TestRMWebServicesAppsModification sometimes fails in trunk -- Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Varun Vasudev Priority: Minor Attachments: apache-yarn-2158.0.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
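Not the attached patch, but an example of the kind of debugging information that helps with a failure like "app state incorrect": let the assertion report the observed state, e.g. by using assertEquals instead of a bare assertTrue, so a flaky run tells you which state was actually returned. The helper below is a fragment for a JUnit test class and is illustrative only.
{code}
import static org.junit.Assert.assertEquals;

// Illustrative only: assertEquals prints expected vs. actual on failure,
// so a flaky kill test reports the state it actually saw.
private void verifyAppState(String expectedState, String actualState) {
  assertEquals("app state incorrect", expectedState, actualState);
}
{code}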
[jira] [Commented] (YARN-2000) Fix ordering of starting services inside the RM
[ https://issues.apache.org/jira/browse/YARN-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031043#comment-14031043 ] Jian He commented on YARN-2000: --- Probably we can have the state-store stop last, so that all the other services are stopped first and won't accept more requests or send events to the state-store. Fix ordering of starting services inside the RM --- Key: YARN-2000 URL: https://issues.apache.org/jira/browse/YARN-2000 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The order of starting services in RM would be: - Recovery of the app/attempts - Start the scheduler and add scheduler app/attempts - Start ResourceTrackerService and re-populate the containers in scheduler based on the containers info from NMs - ApplicationMasterService either don't start or start but block until all the previous NMs registers. Other than these, there are other services like ClientRMService, Webapps which we need to think about the order too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2000) Fix ordering of starting services inside the RM
[ https://issues.apache.org/jira/browse/YARN-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031080#comment-14031080 ] Tsuyoshi OZAWA commented on YARN-2000: -- It sounds reasonable to me. Fix ordering of starting services inside the RM --- Key: YARN-2000 URL: https://issues.apache.org/jira/browse/YARN-2000 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The order of starting services in RM would be: - Recovery of the app/attempts - Start the scheduler and add scheduler app/attempts - Start ResourceTrackerService and re-populate the containers in scheduler based on the containers info from NMs - ApplicationMasterService either don’t start or start but block until all the previous NMs registers. Other than these, there are other services like ClientRMService, Webapps which we need to think about the order too. -- This message was sent by Atlassian JIRA (v6.2#6252)
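For context on the stop-ordering idea discussed above: CompositeService stops its child services in the reverse of the order they were added, so a service added first is stopped last. A generic sketch of that behaviour follows; the wiring is illustrative only and is not how the RM actually registers its state store.
{code}
import org.apache.hadoop.service.CompositeService;
import org.apache.hadoop.service.Service;

// Generic illustration: children added first are stopped last, so adding the
// store first means it is stopped after everything that might still send it events.
public class OrderedServices extends CompositeService {
  public OrderedServices(Service stateStore, Service dispatcher, Service rpcServer) {
    super(OrderedServices.class.getName());
    addService(stateStore);   // stopped last
    addService(dispatcher);
    addService(rpcServer);    // stopped first
  }
}
{code}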
[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031082#comment-14031082 ] Hadoop QA commented on YARN-2158: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650365/apache-yarn-2158.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3985//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3985//console This message is automatically generated. TestRMWebServicesAppsModification sometimes fails in trunk -- Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Varun Vasudev Priority: Minor Attachments: apache-yarn-2158.0.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031083#comment-14031083 ] Hadoop QA commented on YARN-2074: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650366/YARN-2074.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3986//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3986//console This message is automatically generated. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
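The core idea under discussion is to check why the AM container exited before counting the attempt against the max-attempts limit. A hedged sketch of that check, not necessarily how YARN-2074.5.patch implements it (YARN marks preempted containers with ContainerExitStatus.PREEMPTED):
{code}
// Hedged sketch of the idea, not the actual YARN-2074 patch: an AM attempt whose
// container was preempted should not count towards the AM failure limit.
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

final class AmFailurePolicy {
  /** Returns true only if this finished AM container should count as an AM failure. */
  static boolean countsTowardsAmFailures(ContainerStatus amContainerStatus) {
    return amContainerStatus.getExitStatus() != ContainerExitStatus.PREEMPTED;
  }
}
{code}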
[jira] [Created] (YARN-2159) allocateContainer() in SchedulerNode needs a clearer LOG.info message
Ray Chiang created YARN-2159: Summary: allocateContainer() in SchedulerNode needs a clearer LOG.info message Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor This bit of code: LOG.info("Assigned container " + container.getId() + " of capacity " + container.getResource() + " on host " + rmNode.getNodeAddress() + ", which currently has " + numContainers + " containers, " + getUsedResource() + " used and " + getAvailableResource() + " available"); results in a line like: 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Perhaps one of the following phrasings is better? - which has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available after allocation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2159) allocateContainer() in SchedulerNode needs a clearer LOG.info message
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2159: - Description: This bit of code: {quote} LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Perhaps one of the following phrasings is better? - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation was: This bit of code: LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); results in a line like: 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Perhaps one of the following phrasings is better? - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation allocateContainer() in SchedulerNode needs a clearer LOG.info message - Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability This bit of code: {quote} LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Perhaps one of the following phrasings is better? - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2159) allocateContainer() in SchedulerNode needs a clearer LOG.info message
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2159: - Description: This bit of code: {quote} LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Here is one suggested phrasing - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation was: This bit of code: {quote} LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Perhaps one of the following phrasings is better? - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation allocateContainer() in SchedulerNode needs a clearer LOG.info message - Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability This bit of code: {quote} LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Here is one suggested phrasing - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031120#comment-14031120 ] Chen He commented on YARN-2146: --- I think it is because of mismatching during log parsing. I found this problem while running a Pig on Tez job on Hadoop 2.4. Yarn logs aggregation error --- Key: YARN-2146 URL: https://issues.apache.org/jira/browse/YARN-2146 Project: Hadoop YARN Issue Type: Bug Reporter: Chen He when I run yarn logs -applicationId application_xxx /tmp/application_xxx. It creates file, also shows part of logs on the terminal screen, and reports following error: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031162#comment-14031162 ] Xuan Gong commented on YARN-2146: - bq. because of mismatching during log parsing When we aggregate the logs into HDFS, we write the file_name and the size of each file before we write the log contents. When the reader tries to read back the size of the log, somehow a mismatch happens, and that causes the exception. Not sure why this can happen. Yarn logs aggregation error --- Key: YARN-2146 URL: https://issues.apache.org/jira/browse/YARN-2146 Project: Hadoop YARN Issue Type: Bug Reporter: Chen He when I run yarn logs -applicationId application_xxx /tmp/application_xxx. It creates file, also shows part of logs on the terminal screen, and reports following error: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199 -- This message was sent by Atlassian JIRA (v6.2#6252)
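To make the failure mode concrete: each aggregated file is preceded by its name and its length recorded as text, and the reader trusts that length to know where the next header starts. If a previous entry's content ran past its declared length, the next Long.parseLong lands in the middle of log text and throws the NumberFormatException shown in the stack trace above. A simplified sketch of the header read (the real AggregatedLogFormat.LogReader differs in detail):
{code}
// Simplified sketch of the per-file header the reader expects; details differ from the
// real AggregatedLogFormat.LogReader, but the failure mode is the same: if the previous
// entry's content overran its declared length, parseLong() sees log text, not a number.
import java.io.DataInputStream;
import java.io.IOException;

final class LogEntryHeader {
  final String fileName;
  final long fileLength;

  LogEntryHeader(String fileName, long fileLength) {
    this.fileName = fileName;
    this.fileLength = fileLength;
  }

  static LogEntryHeader read(DataInputStream in) throws IOException {
    String name = in.readUTF();        // e.g. "stderr"
    String lengthStr = in.readUTF();   // length recorded at aggregation time
    return new LogEntryHeader(name, Long.parseLong(lengthStr));
  }
}
{code}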
[jira] [Commented] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031167#comment-14031167 ] Chen He commented on YARN-2146: --- If you take a look at the log aggregation code, you may get some hints there. What if the size of the file is not right? Yarn logs aggregation error --- Key: YARN-2146 URL: https://issues.apache.org/jira/browse/YARN-2146 Project: Hadoop YARN Issue Type: Bug Reporter: Chen He when I run yarn logs -applicationId application_xxx /tmp/application_xxx. It creates file, also shows part of logs on the terminal screen, and reports following error: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031172#comment-14031172 ] Chen He commented on YARN-2146: --- It is the same problem as YARN-1670. Yarn logs aggregation error --- Key: YARN-2146 URL: https://issues.apache.org/jira/browse/YARN-2146 Project: Hadoop YARN Issue Type: Bug Reporter: Chen He when I run yarn logs -applicationId application_xxx /tmp/application_xxx. It creates file, also shows part of logs on the terminal screen, and reports following error: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He reopened YARN-2146: --- Yarn logs aggregation error --- Key: YARN-2146 URL: https://issues.apache.org/jira/browse/YARN-2146 Project: Hadoop YARN Issue Type: Bug Reporter: Chen He when I run yarn logs -applicationId application_xxx /tmp/application_xxx. It creates file, also shows part of logs on the terminal screen, and reports following error: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2146) Yarn logs aggregation error
[ https://issues.apache.org/jira/browse/YARN-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He resolved YARN-2146. --- Resolution: Duplicate Yarn logs aggregation error --- Key: YARN-2146 URL: https://issues.apache.org/jira/browse/YARN-2146 Project: Hadoop YARN Issue Type: Bug Reporter: Chen He when I run yarn logs -applicationId application_xxx /tmp/application_xxx. It creates file, also shows part of logs on the terminal screen, and reports following error: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:566) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:139) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1670) aggregated log writer can write more log data than it says is the log length
[ https://issues.apache.org/jira/browse/YARN-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031175#comment-14031175 ] Chen He commented on YARN-1670: --- A similar error is reported in YARN-2146. [~mitdesai] is working on it. aggregated log writer can write more log data than it says is the log length Key: YARN-1670 URL: https://issues.apache.org/jira/browse/YARN-1670 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 0.23.10, 2.2.0 Reporter: Thomas Graves Assignee: Mit Desai Priority: Critical Fix For: 3.0.0, 0.23.11, 2.4.0, 2.5.0 Attachments: YARN-1670-b23.patch, YARN-1670-v2-b23.patch, YARN-1670-v2.patch, YARN-1670-v3-b23.patch, YARN-1670-v3.patch, YARN-1670-v4-b23.patch, YARN-1670-v4-b23.patch, YARN-1670-v4.patch, YARN-1670-v4.patch, YARN-1670.patch, YARN-1670.patch We have seen exceptions when using 'yarn logs' to read log files. at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:441) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:518) at org.apache.hadoop.yarn.logaggregation.LogDumper.dumpAContainerLogs(LogDumper.java:178) at org.apache.hadoop.yarn.logaggregation.LogDumper.run(LogDumper.java:130) at org.apache.hadoop.yarn.logaggregation.LogDumper.main(LogDumper.java:246) We traced it down to the reader trying to read the file type of the next file, but where it reads is still log data from the previous file. What happened was that the Log Length was written as a certain size but the log data was actually longer than that. Inside the write() routine in LogValue it first writes what the logfile length is, but then when it goes to write the log itself it just goes to the end of the file. There is a race condition here where, if someone is still writing to the file when it goes to be aggregated, the length written could be too small. We should have the write() routine stop when it has written whatever it said was the length. It would be nice if we could somehow tell the user it might be truncated but I'm not sure of a good way to do this. We also noticed a bug in readAContainerLogsForALogType where it is using an int for curRead whereas it should be using a long. while (len != -1 && curRead < fileLength) { This isn't actually a problem right now as it looks like the underlying decoder is doing the right thing and the len condition exits. -- This message was sent by Atlassian JIRA (v6.2#6252)
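The fix described above boils down to capping the copy at the length already recorded in the header, instead of copying to the current end of a file that may still be growing. A minimal sketch of a length-bounded copy (an illustration of the idea, not the actual LogValue.write() code):
{code}
// Minimal sketch of a length-bounded copy, illustrating the proposed fix rather than
// quoting LogValue.write(): never emit more than declaredLength bytes even if the
// source log has grown since the length was recorded, so the reader stays aligned.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BoundedLogCopy {
  static void copyAtMost(InputStream in, OutputStream out, long declaredLength)
      throws IOException {
    byte[] buf = new byte[64 * 1024];
    long remaining = declaredLength;   // a long, not an int (cf. the curRead note above)
    while (remaining > 0) {
      int read = in.read(buf, 0, (int) Math.min(buf.length, remaining));
      if (read == -1) {
        break;                         // source ended before declaredLength bytes
      }
      out.write(buf, 0, read);
      remaining -= read;
    }
  }
}
{code}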
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031223#comment-14031223 ] Tsuyoshi OZAWA commented on YARN-2052: -- [~jianhe] and [~vinodkv], thank you for the comments and suggestions! {quote} This scheme won't work with a single reserved digit for epochs and a large number of restarts over time. {quote} Yes, this is a case where integer overflow happens. We need to take that case into account. {quote} Old code (state-store, history-server etc) will not read it and that's fine. The only problem is users who are interpreting container_ID strings themselves. That is NOT supported. We should modify ConverterUtils to support the new-field, and that should do. {quote} Adding RM Id + hostname as the epoch sounds like a reasonable approach to me. If we suffix the epoch to the container id, the following code is also valid with the old {{ConverterUtils.toContainerId}}: {code} ContainerId id = TestContainerId.newContainerId(0, 0, 0, 0); String cid = ConverterUtils.toString(id); ContainerId gen = ConverterUtils.toContainerId(cid + "_uuid_rm1"); assertEquals(gen, id); // valid to parse even with old code {code} Therefore, I think {{container_XXX_000_uuid_rm1}} is a better format. I'll create a patch based on the idea. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
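The backward-compatibility argument rests on parsers reading the leading fields of container_<clusterTimestamp>_<appId>_<attemptId>_<containerId> and ignoring anything appended after them. A hypothetical sketch of such suffix-tolerant parsing (the class and the exact suffix layout are assumptions, not the final YARN-2052 format):
{code}
// Hypothetical sketch of suffix-tolerant ContainerId parsing; the familiar layout is
// container_<clusterTimestamp>_<appId>_<attemptId>_<containerId>, and anything after
// the fifth field (e.g. an epoch suffix such as "_uuid_rm1") is ignored, not rejected.
final class ContainerIdFields {
  final long clusterTimestamp;
  final int appId;
  final int attemptId;
  final long containerId;

  ContainerIdFields(long ts, int app, int attempt, long container) {
    this.clusterTimestamp = ts;
    this.appId = app;
    this.attemptId = attempt;
    this.containerId = container;
  }

  static ContainerIdFields parse(String s) {
    String[] parts = s.split("_");
    if (parts.length < 5 || !"container".equals(parts[0])) {
      throw new IllegalArgumentException("Invalid ContainerId: " + s);
    }
    // parts[5..], an optional epoch suffix, is deliberately ignored here.
    return new ContainerIdFields(
        Long.parseLong(parts[1]),
        Integer.parseInt(parts[2]),
        Integer.parseInt(parts[3]),
        Long.parseLong(parts[4]));
  }
}
{code}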
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031248#comment-14031248 ] Jian He commented on YARN-1365: --- Thanks for updating the patch. The debug logging can be wrapped with an isDebugEnabled condition: {code} LOG.debug("Skipping notifying ATTEMPT_ADDED"); {code} The following code is removed, but schedulers#addApplication is not handling the case of not sending APP_ACCEPTED events as we do for addApplicationAttempt. My point was we can do the same for both addApplication and addApplicationAttempt to not send dup events. Given this is not relevant to this patch itself, we can fix this separately if needed. {code} // ACCECPTED state can once again receive APP_ACCEPTED event, because on // recovery the app returns ACCEPTED state and the app once again go // through the scheduler and triggers one more APP_ACCEPTED event at // ACCEPTED state. .addTransition(RMAppState.ACCEPTE {code} This transition can never happen, right? Given that unregister also has to do a resync. {code} .addTransition(RMAppAttemptState.LAUNCHED, EnumSet.of(RMAppAttemptState.FINAL_SAVING, RMAppAttemptState.FINISHED), RMAppAttemptEventType.UNREGISTERED, new AMUnregisteredTransition()) {code} This piece of code is not needed; the previous launchAM already checks the app state internally. We can use MockRM.launchAndRegisterAM alternatively. The test case can be moved to TestWorkPreservingRMRestart {code} nm1.nodeHeartbeat(am0.getApplicationAttemptId(), 1, ContainerState.RUNNING); am0.waitForState(RMAppAttemptState.RUNNING); rm1.waitForState(app0.getApplicationId(), RMAppState.RUNNING); {code} *Just thinking*: Does it make sense to map AMCommand (shutdown, resync) to corresponding exceptions? The benefit is that we don't need to add extra fields in the AMS protocol response, and users not using AMRMClient will be forced to handle such conditions to work with RM restart. Thoughts? ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.005.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
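For the first review point, the conventional guard looks like the sketch below (the enclosing class is hypothetical; only the log message comes from the patch under review):
{code}
// Sketch of the requested guard: build and emit the debug message only when debug
// logging is enabled (commons-logging, as used by Hadoop at the time).
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

final class AttemptRecoveryLogging {
  private static final Log LOG = LogFactory.getLog(AttemptRecoveryLogging.class);

  static void logSkippedAttemptAdded() {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping notifying ATTEMPT_ADDED");
    }
  }
}
{code}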
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031258#comment-14031258 ] Tsuyoshi OZAWA commented on YARN-2052: -- {quote} The only problem is users who are interpreting container_ID strings themselves. That is NOT supported. {quote} Yeah, I think it is difficult to avoid the problem. But the interpreting logic itself doesn't change drastically with our approach because we don't change the order of attributes. IMHO, it's an acceptable approach. BTW, I found that ConverterUtils is marked as {{@Private}}. Should we make the class {{@Public}}? {code} @Private public class ConverterUtils { {code} ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031268#comment-14031268 ] Anubhav Dhoot commented on YARN-1365: - Agreed. I was trying to be consistent with AllocateResponse, but would prefer exceptions. The AM client will discover it automatically instead of it being hidden in a return value. I would prefer if AllocateResponse also used exceptions instead of AM commands; I can open a JIRA for it. Will address your other comments as well. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.005.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2159) allocateContainer() in SchedulerNode needs a clearer LOG.info message
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2159: - Attachment: YARN2159-01.patch Rearrange sentence as per initial suggestion. allocateContainer() in SchedulerNode needs a clearer LOG.info message - Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Attachments: YARN2159-01.patch This bit of code: {quote} LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Here is one suggested phrasing - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation -- This message was sent by Atlassian JIRA (v6.2#6252)
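Without quoting YARN2159-01.patch, a sketch of what the rearranged statement could look like, following the phrasing suggested in the description (string literals restored):
{code}
// Sketch of the suggested rephrasing: report the available resources as the state
// "after allocation", so "vCores:0 available" no longer reads like an error.
LOG.info("Assigned container " + container.getId() + " of capacity "
    + container.getResource() + " on host " + rmNode.getNodeAddress()
    + ", which has " + numContainers + " containers, " + getUsedResource()
    + " used and " + getAvailableResource() + " available after allocation");
{code}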
[jira] [Updated] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1885: - Attachment: YARN-1885.patch Thanks [~vinodkv] for your comments. I uploaded a patch that addresses all your comments. bq. AddNodeTransition: The application list should only be respected when the node is not inactive? Not sure if that is right or wrong, but that is how running-containers are treated today. Currently, the application list is respected in AddNodeTransition no matter whether the node is inactive or not. I think it's not a regression at least, and doing some extra clean-up is not bad, do you agree? RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts --- Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the cli but I can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031369#comment-14031369 ] Wangda Tan commented on YARN-1885: -- [~jianhe], bq. For nodes that expired but rejoin with earlier running applications, if the application by this time has completed, I think we should also send the app-finished signal ? This is the behavior in the existing patch. RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts --- Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the cli but I can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2159) allocateContainer() in SchedulerNode needs a clearer LOG.info message
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031368#comment-14031368 ] Hadoop QA commented on YARN-2159: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650415/YARN2159-01.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3987//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3987//console This message is automatically generated. allocateContainer() in SchedulerNode needs a clearer LOG.info message - Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Attachments: YARN2159-01.patch This bit of code: {quote} LOG.info(Assigned container + container.getId() + of capacity + container.getResource() + on host + rmNode.getNodeAddress() + , which currently has + numContainers + containers, + getUsedResource() + used and + getAvailableResource() + available); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity memory:1536, vCores:1 on host machine.host.domain.com:8041, which currently has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like vCores:0 available. Here is one suggested phrasing - which has 18 containers, memory:27648, vCores:18 used and memory:3072, vCores:0 available after allocation -- This message was sent by Atlassian JIRA (v6.2#6252)