[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013356#comment-14013356 ] Jian He commented on YARN-1366: --- The bulk of the patch here is MR changes. I think we should have an MR jira to track the MR changes. Both patches are very related, and the patch size seems reasonable to keep consolidated. It's fine to leave it as-is, but it would just be easier for the reviewer to have more context. AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM, things should proceed as normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
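To make the resync protocol described above concrete, here is a minimal sketch of the idea, assuming a hypothetical AM-side allocator; none of the class or method names below are real YARN API, and this is not the attached patch. On a resync signal the client resets the allocate sequence number to 0 and re-sends its entire outstanding request instead of shutting down.
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for an AM-side allocator that honors resync.
public class ResyncSketch {
  static class AllocateResponse {
    boolean resync;                                   // RM asked the AM to resync
    List<String> completed = new ArrayList<>();       // completions; may repeat across a resync
  }

  private int responseId = 0;                         // allocate RPC sequence number
  private final List<String> outstanding = new ArrayList<>();

  AllocateResponse allocate(List<String> newAsks) {
    outstanding.addAll(newAsks);                      // remember everything still pending
    AllocateResponse rsp = callRM(responseId, newAsks);
    if (rsp.resync) {
      responseId = 0;                                 // reset the sequence number
      rsp = callRM(responseId, outstanding);          // re-send the entire outstanding request
    }
    responseId++;
    return rsp;
  }

  // Stand-in for the real allocate RPC to the RM.
  private AllocateResponse callRM(int id, List<String> asks) {
    return new AllocateResponse();
  }
}
{code}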
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013398#comment-14013398 ] Sandy Ryza commented on YARN-2010: -- Agree with Vinod that non-secure cluster to secure cluster is not currently supported and bound to have tons of issues. I've come across other bugs that have turned out to stem from this. If this is the only situation where we could conceivably face this issue, I'm somewhat dubious about whether it needs to be fixed. On the other hand, in general, being defensive about allowing a transition to active even when an app recovery fails makes sense to me. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 
5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
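As a rough illustration of the defensive behavior discussed above (this is not the committed YARN-2010 fix, and every name in it is hypothetical), per-application recovery could be wrapped so that an exception like the IllegalArgumentException in the trace fails only that application rather than aborting the whole transition to active:
{code}
import java.util.List;

// Illustrative only: fail a single unrecoverable app instead of failing RM startup.
public class DefensiveRecoverySketch {
  static class StoredApp { String id; }

  void recoverAll(List<StoredApp> storedApps) {
    for (StoredApp app : storedApps) {
      try {
        recoverApplication(app);          // may throw, e.g. on a bad stored client-to-AM key
      } catch (Exception e) {
        markAppFailed(app, e);            // record this app as failed and keep the RM coming up
      }
    }
  }

  void recoverApplication(StoredApp app) throws Exception { /* replay stored state */ }

  void markAppFailed(StoredApp app, Exception cause) { /* move the app to a failed final state */ }
}
{code}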
[jira] [Updated] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2022: -- Attachment: YARN-2022-DesignDraft.docx Hi [~curino], I have attached a design draft document in which I tried to capture the corner cases. The draft also includes the approach to handle them. Please review it and share your thoughts. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
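A minimal sketch of the ordering idea in the description, assuming hypothetical types (this is not the attached design draft or patch): when the preemption policy builds its candidate list, sort AM containers to the end so that map containers from J2 and J3 are taken before any AM container.
{code}
import java.util.Comparator;
import java.util.List;

// Illustrative only: order preemption candidates so AM containers are picked last.
public class AmLastPreemptionSketch {
  static class Candidate {
    boolean isAmContainer;
    long allocationTime;      // e.g. prefer taking back the most recently started work first
  }

  static void orderForPreemption(List<Candidate> candidates) {
    candidates.sort(
        Comparator.comparing((Candidate c) -> c.isAmContainer)   // non-AM containers first
                  .thenComparing(c -> -c.allocationTime));       // then newest containers first
  }
}
{code}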
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013509#comment-14013509 ] Hadoop QA commented on YARN-2022: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647576/YARN-2022-DesignDraft.docx against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3865//console This message is automatically generated. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013520#comment-14013520 ] Hudson commented on YARN-2112: -- FAILURE: Integrated in Hadoop-Yarn-trunk #568 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/568/]) YARN-2112. Fixed yarn-common's pom.xml to include jackson dependencies so that both Timeline Server and client can access them. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598373) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/pom.xml Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml - Key: YARN-2112 URL: https://issues.apache.org/jira/browse/YARN-2112 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.5.0 Attachments: YARN-2112.1.patch Now YarnClient is using TimelineClient, which has dependency on jackson libs. However, the current dependency configurations make the hadoop-client artifect miss 2 jackson libs, such that the applications which have hadoop-client dependency will see the following exception {code} java.lang.NoClassDefFoundError: org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) at java.lang.ClassLoader.defineClass(ClassLoader.java:621) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.init(TimelineClientImpl.java:92) at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.init(ResourceMgrDelegate.java:88) at org.apache.hadoop.mapred.YARNRunner.init(YARNRunner.java:111) at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) at 
org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) at
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013607#comment-14013607 ] Rohith commented on YARN-1366: -- Let's keep this jira only for the YarnClient changes. I created MAPREDUCE-5910 for handling the MR side. AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM, things should proceed as normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-1366: - Attachment: YARN-1366.4.patch Attached an updated patch that addresses all of Anubhav's comments. This patch contains only YarnClient changes. The changes are: 1. Added a test that covers the scenario. 2. Added a core-site.xml for the test, with the IP-based check disabled. 3. Modified the yarn-client pom.xml to get yarn-common-test onto the test classpath. I am not changing the status to Patch Available since the test requires the patch from YARN-1365. AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM, things should proceed as normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013639#comment-14013639 ] Hudson commented on YARN-2112: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1759 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1759/]) YARN-2112. Fixed yarn-common's pom.xml to include jackson dependencies so that both Timeline Server and client can access them. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598373) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/pom.xml Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml - Key: YARN-2112 URL: https://issues.apache.org/jira/browse/YARN-2112 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.5.0 Attachments: YARN-2112.1.patch Now YarnClient is using TimelineClient, which has dependency on jackson libs. However, the current dependency configurations make the hadoop-client artifect miss 2 jackson libs, such that the applications which have hadoop-client dependency will see the following exception {code} java.lang.NoClassDefFoundError: org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) at java.lang.ClassLoader.defineClass(ClassLoader.java:621) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.init(TimelineClientImpl.java:92) at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.init(ResourceMgrDelegate.java:88) at org.apache.hadoop.mapred.YARNRunner.init(YARNRunner.java:111) at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) at 
org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) at
[jira] [Commented] (YARN-800) Clicking on an AM link for a running app leads to a HTTP 500
[ https://issues.apache.org/jira/browse/YARN-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013710#comment-14013710 ] Dave Disser commented on YARN-800: -- As a follow-up, I also notice that the proxying works correctly while the tracking URL is UNASSIGNED (the first couple seconds after AM container launch), but then HTTP 500 occurs shortly after. Clicking on an AM link for a running app leads to a HTTP 500 Key: YARN-800 URL: https://issues.apache.org/jira/browse/YARN-800 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Arpit Gupta Priority: Minor Clicking the AM link tries to open up a page with url like http://hostname:8088/proxy/application_1370886527995_0645/ and this leads to an HTTP 500 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013734#comment-14013734 ] Hadoop QA commented on YARN-1338: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647161/YARN-1338v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 16 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3866//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3866//console This message is automatically generated. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, YARN-1338v6.patch Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
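As a loose illustration of the recovery idea in the description above (not the actual YARN-1338 implementation; all names here are made up), the NM can persist a small record per completed localization and replay those records into its resource trackers on restart instead of deleting the cache directories:
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy stand-in for a local state store of localized resources.
public class LocalizedCacheRecoverySketch {
  static class LocalizedEntry { String resourceKey; String localPath; long size; }

  private final Map<String, LocalizedEntry> stateStore = new HashMap<>();

  void onLocalized(LocalizedEntry e) {
    stateStore.put(e.resourceKey, e);       // record the resource when localization completes
  }

  Map<String, LocalizedEntry> recover() {
    return new HashMap<>(stateStore);       // replayed into the resource trackers after restart
  }
}
{code}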
[jira] [Commented] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013757#comment-14013757 ] Hudson commented on YARN-596: - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1786 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1786/]) YARN-596. Use scheduling policies throughout the queue hierarchy to decide which containers to preempt (Wei Yan via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598197) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java Use scheduling policies throughout the queue hierarchy to decide which 
containers to preempt Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Wei Yan Fix For: 2.5.0 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share, is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
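For readers who have not followed the patch, here is a hedged sketch of what deciding preemption victims through the queue hierarchy can look like; the types are hypothetical rather than the real FSParentQueue/FSLeafQueue code. Each level asks its scheduling policy's comparator which child is most over its fair share and recurses until it reaches an application.
{code}
import java.util.Comparator;
import java.util.List;

// Illustrative only: hierarchical victim selection driven by a policy comparator.
public class HierarchicalPreemptionSketch {
  interface Schedulable { double usageOverFairShare(); }

  static class Queue implements Schedulable {
    List<Schedulable> children;                       // sub-queues or applications
    public double usageOverFairShare() { return 0.0; }
  }

  // Assumes a non-empty children list; the comparator ranks "most over fair share" highest.
  static Schedulable chooseVictim(Queue q, Comparator<Schedulable> policyComparator) {
    Schedulable victim = q.children.get(0);
    for (Schedulable s : q.children) {
      if (policyComparator.compare(s, victim) > 0) {
        victim = s;
      }
    }
    // Recurse through sub-queues; stop when an application-level schedulable is reached.
    return (victim instanceof Queue)
        ? chooseVictim((Queue) victim, policyComparator)
        : victim;
  }
}
{code}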
[jira] [Commented] (YARN-2107) Refactor timeline classes into server.timeline package
[ https://issues.apache.org/jira/browse/YARN-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013761#comment-14013761 ] Hudson commented on YARN-2107: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1786 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1786/]) YARN-2107. Refactored timeline classes into o.a.h.y.s.timeline package. Contributed by Vinod Kumar Vavilapalli. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598094) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AHSWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/EntityIdentifier.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/GenericObjectMapper.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/NameValuePair.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineReader.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineWriter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/package-info.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineClientAuthenticationService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineDelegationTokenSecretManagerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp *
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013777#comment-14013777 ] Karthik Kambatla commented on YARN-2010: Let me clarify a couple of things. It is true that the first time we encountered this was during an upgrade from non-secure to secure cluster. However, as I mentioned earlier in the JIRA, it is possible to run into this in other situations. Even in the case of upgrading from non-secure to secure cluster, I totally understand we can't support recovering running/completed applications. However, one shouldn't have to explicitly nuke the ZK store (which by the way is involved due to the ACLs-magic and lacks an rmadmin command) to be able to start the RM. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 
5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at
[jira] [Commented] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013766#comment-14013766 ] Hudson commented on YARN-2112: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1786 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1786/]) YARN-2112. Fixed yarn-common's pom.xml to include jackson dependencies so that both Timeline Server and client can access them. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598373) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/pom.xml Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml - Key: YARN-2112 URL: https://issues.apache.org/jira/browse/YARN-2112 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.5.0 Attachments: YARN-2112.1.patch Now YarnClient is using TimelineClient, which has dependency on jackson libs. However, the current dependency configurations make the hadoop-client artifect miss 2 jackson libs, such that the applications which have hadoop-client dependency will see the following exception {code} java.lang.NoClassDefFoundError: org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) at java.lang.ClassLoader.defineClass(ClassLoader.java:621) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.init(TimelineClientImpl.java:92) at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.init(ResourceMgrDelegate.java:88) at org.apache.hadoop.mapred.YARNRunner.init(YARNRunner.java:111) at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) at 
org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) at
[jira] [Commented] (YARN-2054) Better defaults for YARN ZK configs for retries and retry-inteval when HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013788#comment-14013788 ] Karthik Kambatla commented on YARN-2054: Saw this late - thanks for the review, [~ozawa] :) Better defaults for YARN ZK configs for retries and retry-inteval when HA is enabled Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.5.0 Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, yarn-2054-4.patch Currently, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulative 1000 seconds (500 retries x 2 seconds each) before the RM gives up trying to connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1877) Document yarn.resourcemanager.zk-auth and its scope
[ https://issues.apache.org/jira/browse/YARN-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1877: --- Summary: Document yarn.resourcemanager.zk-auth and its scope (was: ZK store: Add yarn.resourcemanager.zk-state-store.root-node.auth for root node auth) Document yarn.resourcemanager.zk-auth and its scope --- Key: YARN-1877 URL: https://issues.apache.org/jira/browse/YARN-1877 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-1877.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1877) ZK store: Add yarn.resourcemanager.zk-state-store.root-node.auth for root node auth
[ https://issues.apache.org/jira/browse/YARN-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013793#comment-14013793 ] Karthik Kambatla commented on YARN-1877: Thanks for investigating this, Robert. +1 - the description is missing a closing ); I'll add it at commit time. ZK store: Add yarn.resourcemanager.zk-state-store.root-node.auth for root node auth --- Key: YARN-1877 URL: https://issues.apache.org/jira/browse/YARN-1877 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-1877.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1877) Document yarn.resourcemanager.zk-auth and its scope
[ https://issues.apache.org/jira/browse/YARN-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1877: -- Assignee: Robert Kanter (was: Karthik Kambatla) Document yarn.resourcemanager.zk-auth and its scope --- Key: YARN-1877 URL: https://issues.apache.org/jira/browse/YARN-1877 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Priority: Critical Attachments: YARN-1877.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Better defaults for YARN ZK configs for retries and retry-inteval when HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013812#comment-14013812 ] Hudson commented on YARN-2054: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5631 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5631/]) YARN-2054. Better defaults for YARN ZK configs for retries and retry-inteval when HA is enabled. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598630) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStoreZKClientConnections.java Better defaults for YARN ZK configs for retries and retry-inteval when HA is enabled Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.5.0 Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, yarn-2054-4.patch Currently, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulative 1000 seconds (500 retries x 2 seconds each) before the RM gives up trying to connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013833#comment-14013833 ] Hudson commented on YARN-1338: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5632 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5632/]) YARN-1338. Recover localized resource cache state upon nodemanager restart (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598640) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalCacheDirectoryManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTracker.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/event/ResourceEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/event/ResourceRecoveredEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/DummyContainerManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestEventFlow.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/BaseContainerManagerTest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalCacheDirectoryManager.java *
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013834#comment-14013834 ] Hudson commented on YARN-2010: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5632 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5632/]) YARN-2010. Document yarn.resourcemanager.zk-auth and its scope. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598636) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 
5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
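The direction discussed above is to contain a per-application recovery failure rather than abort the RM's transition to active. Below is a minimal, self-contained sketch of that idea with made-up names and a stand-in for the missing-key failure; it does not reflect the actual RMAppManager code.
{code}
// Sketch only: illustrative names, not the real RMAppManager recovery path.
import java.util.LinkedHashMap;
import java.util.Map;

public class PerAppRecoverySketch {
  // Stand-in for recovering one stored application; apps submitted before
  // security was enabled are simulated as throwing the "Missing argument" error.
  static void recoverOne(String appId) {
    if (appId.endsWith("_pre_kerberos")) {
      throw new IllegalArgumentException("Missing argument");
    }
  }

  public static void main(String[] args) {
    String[] storedApps = {"application_0001", "application_0002_pre_kerberos"};
    Map<String, Exception> failed = new LinkedHashMap<>();
    for (String appId : storedApps) {
      try {
        recoverOne(appId);
      } catch (Exception e) {
        // Record the failure for this app instead of aborting the transition to active.
        failed.put(appId, e);
      }
    }
    System.out.println("RM up; failed recoveries: " + failed.keySet());
  }
}
{code}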
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013876#comment-14013876 ] Tsuyoshi OZAWA commented on YARN-2091: -- Thank you for the suggestion, Sandy. {quote} ContainerExitStatus should stay an int. While ContainerStatus.getExitStatus is technically marked Unstable, I'm sure changing this would break some applications. {quote} I agree with this. In the latest patch, ContainerExitStatus stays an int. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013880#comment-14013880 ] Wangda Tan commented on YARN-1368: -- [~jianhe], while reading YARN-2022, I don't know whether you considered masterContainer recovery in this patch or not. I haven't found it in the patch; I think we need to consider it if it's not here. For the preemption policy, it's important to identify AM containers before making preemption decisions. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, YARN-1368.4.patch, YARN-1368.5.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013890#comment-14013890 ] Tsuyoshi OZAWA commented on YARN-2091: -- Changelog in v4: * Added KILL_EXCEEDED_PMEM, KILL_EXCEEDED_VMEM to ContainerExitStatus. * Updated ContainersMonitorImpl to dispatch KILL_EXCEEDED_VMEM/KILL_EXCEEDED_PMEM. * If the exit reason is AM-aware ({{ContainerExitStatus#isAMAware()}}), it is passed to app masters. Otherwise, the exit reason is converted into ExitCode.TERMINATED.getExitCode() for backward compatibility. The AM-aware statuses are currently DISKS_FAILED, KILL_EXCEEDED_PMEM, and KILL_EXCEEDED_VMEM. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
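For orientation, a rough sketch of the shape the changelog describes: new int exit statuses, an isAMAware()-style check, and a fallback conversion for AMs that only understand the old codes. The numeric values, class name, and helper methods are illustrative assumptions, not the committed constants.
{code}
// Sketch of the described behavior; values and names are assumptions.
public final class ExitStatusSketch {
  public static final int TERMINATED = -100;          // hypothetical generic value
  public static final int DISKS_FAILED = -101;
  public static final int KILL_EXCEEDED_PMEM = -104;  // hypothetical value
  public static final int KILL_EXCEEDED_VMEM = -105;  // hypothetical value

  /** Exit reasons the AM is expected to understand and may act on. */
  public static boolean isAMAware(int exitStatus) {
    return exitStatus == DISKS_FAILED
        || exitStatus == KILL_EXCEEDED_PMEM
        || exitStatus == KILL_EXCEEDED_VMEM;
  }

  /** Pass AM-aware reasons through; map everything else to the generic code. */
  public static int toReportedStatus(int exitStatus) {
    return isAMAware(exitStatus) ? exitStatus : TERMINATED;
  }

  private ExitStatusSketch() {}
}
{code}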
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013926#comment-14013926 ] Tsuyoshi OZAWA commented on YARN-2091: -- Jenkins passed last night. It's ready for review. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2099) Preemption in fair scheduler should consider app priorities
[ https://issues.apache.org/jira/browse/YARN-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2099: -- Attachment: YARN-2099.patch Uploaded an initial patch to capture what we discussed. Need to add test cases once YARN-2098 is resolved. Preemption in fair scheduler should consider app priorities --- Key: YARN-2099 URL: https://issues.apache.org/jira/browse/YARN-2099 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.5.0 Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2099.patch Fair scheduler should take app priorities into account while preempting containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013998#comment-14013998 ] Jian He commented on YARN-1368: --- Wangda, AM container is just one type of container and should be covered already in the patch. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, YARN-1368.4.patch, YARN-1368.5.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2072) RM/NM UIs and webservices are missing vcore information
[ https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated YARN-2072: - Attachment: YARN-2072.patch RM/NM UIs and webservices are missing vcore information --- Key: YARN-2072 URL: https://issues.apache.org/jira/browse/YARN-2072 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 3.0.0, 2.4.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-2072.patch Change RM and NM UIs and webservices to include virtual cores. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1868) YARN status web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-1868: Hadoop Flags: Reviewed +1 for the patch. I'll commit this. YARN status web ui does not show correctly in IE 11 --- Key: YARN-1868 URL: https://issues.apache.org/jira/browse/YARN-1868 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 3.0.0 Reporter: Chuan Liu Assignee: Chuan Liu Labels: yxls123123 Attachments: YARN-1868.1.patch, YARN-1868.2.patch, YARN-1868.patch, YARN_status.png The YARN status web ui does not show correctly in IE 11. The drop down menu for app entries are not shown. Also the navigation menu displays incorrectly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1702) Expose kill app functionality as part of RM web services
[ https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014043#comment-14014043 ] Zhijie Shen commented on YARN-1702: --- [~vvasudev], thanks for the big patch! I've looked through it, and below are some high-level comments: 1. There are a lot of formatting changes in TestRMWebServicesApps, which seem not to be necessary and make the review harder. 2. getAppState seems not to be necessary, as we have getApp, which returns a full report including the state. 3. I'm not sure it is a good idea to have an updateAppState API that only allows changing the state to KILLED. Why not have killApp directly, accepting an appId?
{code}
+  @PUT
+  @Path("/apps/{appid}/state")
+  @Produces({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML })
+  @Consumes({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML })
+  public Response updateAppState(AppState targetState,
+      @Context HttpServletRequest hsr, @PathParam("appid") String appId)
+      throws AuthorizationException, YarnException, InterruptedException,
+      IOException {
{code}
4. We should make killApp work in insecure mode as well, as we can do it via RPC. 5. In YarnClientImpl, we have implemented the logic to keep sending the kill request until we get confirmation that the app is killed. IMHO, as the user of the REST API should be a thin client, we may want to implement this logic on the server side, blocking the response until we confirm that the app is killed. In RPC we have limited the number of concurrent threads; however, on the web side we don't have this limitation, right? 6. As to the authentication filter, I think it's not just a problem for killApp; the whole RM web interface is unprotected, but we can handle that issue separately. Some lessons from implementing security for the timeline server: a) It's better to have separate configs for the RM only, and load the authentication filter for the RM daemon only instead of for all daemons. b) The RM may also want the Kerberos + DT authentication style. Expose kill app functionality as part of RM web services Key: YARN-1702 URL: https://issues.apache.org/jira/browse/YARN-1702 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1702.10.patch, apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, apache-yarn-1702.8.patch, apache-yarn-1702.9.patch Expose functionality to kill an app via the ResourceManager web services API. -- This message was sent by Atlassian JIRA (v6.2#6252)
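For context, a minimal client-side sketch of exercising the PUT endpoint quoted above; the RM address, the /ws/v1/cluster path prefix, the application id, and the JSON body shape are assumptions for illustration, not the finalized API.
{code}
// Sketch: PUT {"state":"KILLED"} to an application's state sub-resource.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class KillAppViaRestSketch {
  public static void main(String[] args) throws Exception {
    String appId = "application_1400000000000_0001";  // hypothetical application id
    URL url = new URL("http://rm-host:8088/ws/v1/cluster/apps/" + appId + "/state");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    byte[] body = "{\"state\":\"KILLED\"}".getBytes(StandardCharsets.UTF_8);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body);
    }
    // A non-blocking server would likely answer before the kill completes.
    System.out.println("HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}
{code}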
[jira] [Commented] (YARN-1868) YARN status web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014050#comment-14014050 ] Hudson commented on YARN-1868: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5634 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5634/]) YARN-1868. YARN status web ui does not show correctly in IE 11. Contributed by Chuan Liu. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598686) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/HtmlPage.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/TestSubViews.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/view/TestHtmlPage.java YARN status web ui does not show correctly in IE 11 --- Key: YARN-1868 URL: https://issues.apache.org/jira/browse/YARN-1868 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 3.0.0 Reporter: Chuan Liu Assignee: Chuan Liu Labels: yxls123123 Fix For: 3.0.0, 2.5.0 Attachments: YARN-1868.1.patch, YARN-1868.2.patch, YARN-1868.patch, YARN_status.png The YARN status web ui does not show correctly in IE 11. The drop down menu for app entries are not shown. Also the navigation menu displays incorrectly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014079#comment-14014079 ] Bikas Saha commented on YARN-2091: -- Why is isAMAware needed? All values in ContainerExitStatus are public and hence user code should already be aware of them. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2072) RM/NM UIs and webservices are missing vcore information
[ https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014100#comment-14014100 ] Hadoop QA commented on YARN-2072: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647655/YARN-2072.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3867//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3867//console This message is automatically generated. RM/NM UIs and webservices are missing vcore information --- Key: YARN-2072 URL: https://issues.apache.org/jira/browse/YARN-2072 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 3.0.0, 2.4.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-2072.patch Change RM and NM UIs and webservices to include virtual cores. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2103) Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder
[ https://issues.apache.org/jira/browse/YARN-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014106#comment-14014106 ] Tsuyoshi OZAWA commented on YARN-2103: -- [~decster], thank you for the update! I think some test cases are missing, like calling functions before {{init()}} and calling {{deSerialize()}}. Do you mind adding these tests to your patch? They cover the overall functions in SerializedExceptionPBImpl.
{code}
  @Test
  public void testDeserialize() throws Exception {
    SerializedExceptionProto defaultProto =
        SerializedExceptionProto.newBuilder().build();
    Exception ex = new Exception("test exception");
    SerializedExceptionPBImpl pb = new SerializedExceptionPBImpl();
    try {
      pb.deSerialize();
      Assert.fail("deSerialize should throw YarnRuntimeException");
    } catch (YarnRuntimeException e) {
      Assert.assertEquals(ClassNotFoundException.class, e.getCause().getClass());
    }
    pb.init(ex);
    Assert.assertEquals(ex.toString(), pb.deSerialize().toString());
  }

  @Test
  public void testBeforeInit() throws Exception {
    SerializedExceptionProto defaultProto =
        SerializedExceptionProto.newBuilder().build();
    SerializedExceptionPBImpl pb1 = new SerializedExceptionPBImpl();
    Assert.assertNull(pb1.getCause());
    SerializedExceptionPBImpl pb2 = new SerializedExceptionPBImpl();
    Assert.assertEquals(defaultProto, pb2.getProto());
    SerializedExceptionPBImpl pb3 = new SerializedExceptionPBImpl();
    Assert.assertEquals(defaultProto.getTrace(), pb3.getRemoteTrace());
  }
{code}
Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder - Key: YARN-2103 URL: https://issues.apache.org/jira/browse/YARN-2103 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: YARN-2103.v1.patch, YARN-2103.v2.patch Bug 1:
{code}
    SerializedExceptionProto proto = SerializedExceptionProto
        .getDefaultInstance();
    SerializedExceptionProto.Builder builder = null;
    boolean viaProto = false;
{code}
Since viaProto is false, we should initialize builder rather than proto. Bug 2: the class does not provide hashCode() and equals() like other PBImpl records; this class is used in other records, so it may affect other records' behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1702) Expose kill app functionality as part of RM web services
[ https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014124#comment-14014124 ] Hadoop QA commented on YARN-1702: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644316/apache-yarn-1702.10.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3868//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3868//console This message is automatically generated. Expose kill app functionality as part of RM web services Key: YARN-1702 URL: https://issues.apache.org/jira/browse/YARN-1702 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1702.10.patch, apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, apache-yarn-1702.8.patch, apache-yarn-1702.9.patch Expose functionality to kill an app via the ResourceManager web services API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2115) Replace RegisterNodeManagerRequest#ContainerStatus with a new ContainerRecoveryReport
[ https://issues.apache.org/jira/browse/YARN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2115: -- Issue Type: Sub-task (was: Improvement) Parent: YARN-556 Replace RegisterNodeManagerRequest#ContainerStatus with a new ContainerRecoveryReport - Key: YARN-2115 URL: https://issues.apache.org/jira/browse/YARN-2115 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2115.1.patch This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new ContainerRecoveryReport to include all the necessary information for container recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014245#comment-14014245 ] Tsuyoshi OZAWA commented on YARN-2091: -- Makes sense. I found some test cases that check the exit code as ExitCode.TERMINATION.getCode(), and I thought we needed to preserve the semantics. These should be fixed, right? Thanks for the clarification. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2115) Replace RegisterNodeManagerRequest#ContainerStatus with a new ContainerRecoveryReport
[ https://issues.apache.org/jira/browse/YARN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2115: -- Attachment: YARN-2115.2.patch Thanks Vinod for the review, Addressed the comments accordingly. Replace RegisterNodeManagerRequest#ContainerStatus with a new ContainerRecoveryReport - Key: YARN-2115 URL: https://issues.apache.org/jira/browse/YARN-2115 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2115.1.patch, YARN-2115.2.patch This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new ContainerRecoveryReport to include all the necessary information for container recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2115) Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus
[ https://issues.apache.org/jira/browse/YARN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2115: -- Description: This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new NMContainerStatus to include all the necessary information for container recovery. (was: This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new ContainerRecoveryReport to include all the necessary information for container recovery.) Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus --- Key: YARN-2115 URL: https://issues.apache.org/jira/browse/YARN-2115 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2115.1.patch, YARN-2115.2.patch This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new NMContainerStatus to include all the necessary information for container recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2115) Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus
[ https://issues.apache.org/jira/browse/YARN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2115: -- Summary: Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus (was: Replace RegisterNodeManagerRequest#ContainerStatus with a new ContainerRecoveryReport) Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus --- Key: YARN-2115 URL: https://issues.apache.org/jira/browse/YARN-2115 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2115.1.patch, YARN-2115.2.patch This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new ContainerRecoveryReport to include all the necessary information for container recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014284#comment-14014284 ] Bikas Saha commented on YARN-2091: -- We can check all cases of ContainerKillEvent and add new ExitStatus values where it makes sense or use some good default value. If a test needs to change to account for a new value then we should change the test. There may be other cases of exit status being set or tested which are unrelated to container kill event. Those can stay out of the scope of this jira. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014304#comment-14014304 ] Sandy Ryza commented on YARN-1913: -- Thanks for the updated patch Wei. For queues, maxAMShare should be defined as a fraction of the queue's fair share, not maxShare. The majority of queues are configured with infinite maxResources. We need to be careful with this, as fair shares can change when queues are created dynamically. I think it might make sense to only allow the queue-level maxAMShare on leaf queues for the moment. I can't think of a strong reason somebody would want to set it on a parent queue, and doing this would allow us to avoid the complex logic in MaxRunningAppsEnforcer, and merely enforce the AM max share by checking in AppSchedulable.assignContainer. This is also what the Capacity Scheduler has at the moment. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
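A small sketch of the leaf-queue check described above: cap the resources held by AMs at maxAMShare times the queue's fair share before an AM container is placed. The single memory dimension and the method names are simplifying assumptions, not the Fair Scheduler code.
{code}
// Sketch only: one-dimensional "memory" stands in for the full Resource type.
public class MaxAMShareSketch {
  static boolean canRunAM(double queueFairShareMb, double maxAMShare,
                          double amResourceUsageMb, double newAmContainerMb) {
    double amLimitMb = queueFairShareMb * maxAMShare;
    return amResourceUsageMb + newAmContainerMb <= amLimitMb;
  }

  public static void main(String[] args) {
    double fairShare = 8192;   // MB of memory the leaf queue is entitled to
    double maxAMShare = 0.5;   // fraction of the fair share AMs may occupy
    System.out.println(canRunAM(fairShare, maxAMShare, 3072, 1024)); // true: 4096 <= 4096
    System.out.println(canRunAM(fairShare, maxAMShare, 4096, 1024)); // false: would exceed the AM limit
  }
}
{code}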
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014308#comment-14014308 ] Wei Yan commented on YARN-1913: --- Thanks, Sandy. Will update a patch. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1550: Attachment: YARN-1550.002.patch Added tests NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.patch Three steps: 1、debug at RMAppManager#submitApplication after the code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2、submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3、go to page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2115) Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus
[ https://issues.apache.org/jira/browse/YARN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014328#comment-14014328 ] Hadoop QA commented on YARN-2115: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647696/YARN-2115.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3869//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3869//console This message is automatically generated. Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus --- Key: YARN-2115 URL: https://issues.apache.org/jira/browse/YARN-2115 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2115.1.patch, YARN-2115.2.patch This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new NMContainerStatus to include all the necessary information for container recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014357#comment-14014357 ] Ashwin Shankar commented on YARN-1913: -- Hey [~sandyr], quick comment: bq. I think it might make sense to only allow the queue-level maxAMShare on leaf queues for the moment. I can't think of a strong reason somebody would want to set it on a parent queue For the NestedUserQueue rule, user queues would be created dynamically under a parent. For this use case, maxAMShare at the parent would be useful, since leaf user queues are not configured in the alloc xml. I see your point that it would complicate the logic in MaxRunningAppsEnforcer, but just wanted to bring this up in case you didn't consider this use case. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014370#comment-14014370 ] Hadoop QA commented on YARN-1550: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647711/YARN-1550.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3870//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3870//console This message is automatically generated. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.patch three Steps : 1、debug at RMAppManager#submitApplication after code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = Application with id + applicationId + is already present! Cannot add a duplicate!; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2、submit one application:hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3、go in page :http://ip:50030/cluster/scheduler and find 500 ERROR! the log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2116) TestRMAdminCLI#testTransitionToActive and testHelp fail on trunk
Jian He created YARN-2116: - Summary: TestRMAdminCLI#testTransitionToActive and testHelp fail on trunk Key: YARN-2116 URL: https://issues.apache.org/jira/browse/YARN-2116 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Two tests fail as follows: {code} testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.105 sec <<< ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.091 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) Results : {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
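One plausible reading of the UnsupportedOperationException in the first trace is a remove() call on a fixed-size list (the kind Arrays.asList returns), whose iterator rejects removal. A standalone illustration of that behavior, not the fix for the test:
{code}
// AbstractCollection.remove delegates to the iterator, and a fixed-size list's
// iterator rejects removal, which matches the stack trace above.
import java.util.Arrays;
import java.util.List;

public class FixedSizeListRemoveDemo {
  public static void main(String[] args) {
    List<String> targets = Arrays.asList("rm1", "rm2");  // fixed-size, backed by the array
    try {
      targets.remove("rm1");  // AbstractCollection.remove -> AbstractList$Itr.remove -> UOE
    } catch (UnsupportedOperationException e) {
      System.out.println("remove() unsupported on a fixed-size list: " + e);
    }
  }
}
{code}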
[jira] [Commented] (YARN-2115) Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus
[ https://issues.apache.org/jira/browse/YARN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014407#comment-14014407 ] Vinod Kumar Vavilapalli commented on YARN-2115: --- Looks good, +1. Checking this in. Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus --- Key: YARN-2115 URL: https://issues.apache.org/jira/browse/YARN-2115 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2115.1.patch, YARN-2115.2.patch This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new NMContainerStatus to include all the necessary information for container recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1368: -- Attachment: YARN-1368.7.patch Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, YARN-1368.4.patch, YARN-1368.5.patch, YARN-1368.7.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2117) Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block
Ted Yu created YARN-2117: Summary: Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block Key: YARN-2117 URL: https://issues.apache.org/jira/browse/YARN-2117 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor Here is the related code:
{code}
      Reader reader = new FileReader(signatureSecretFile);
      int c = reader.read();
      while (c != -1) {
        secret.append((char) c);
        c = reader.read();
      }
      reader.close();
{code}
If an IOException is thrown out of reader.read(), the reader would be left unclosed. -- This message was sent by Atlassian JIRA (v6.2#6252)
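A sketch of the fix the summary asks for, closing the reader even when read() throws; try-with-resources is shown here, and a finally block would achieve the same. The class name and the file path in main() are only placeholders.
{code}
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class ReadSecretSketch {
  static String readSecret(String signatureSecretFile) throws IOException {
    StringBuilder secret = new StringBuilder();
    try (Reader reader = new FileReader(signatureSecretFile)) {
      int c = reader.read();
      while (c != -1) {
        secret.append((char) c);
        c = reader.read();
      }
    } // reader is closed here even if read() threw
    return secret.toString();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(readSecret(args.length > 0 ? args[0] : "/tmp/secret-file"));
  }
}
{code}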
[jira] [Commented] (YARN-2115) Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus
[ https://issues.apache.org/jira/browse/YARN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014419#comment-14014419 ] Hudson commented on YARN-2115: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5639 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5639/]) YARN-2115. Replaced RegisterNodeManagerRequest's ContainerStatus with a new NMContainerStatus which has more information that is needed for work-preserving RM-restart. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1598790) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NMContainerStatus.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NMContainerStatusPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/Container.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/MockContainer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java Replace 
RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus --- Key: YARN-2115 URL: https://issues.apache.org/jira/browse/YARN-2115 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2115.1.patch, YARN-2115.2.patch This jira is protocol changes only to replace the ContainerStatus sent across via NM register call with a new NMContainerStatus to include all the necessary information for container recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2118) Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()
Ted Yu created YARN-2118: Summary: Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo() Key: YARN-2118 URL: https://issues.apache.org/jira/browse/YARN-2118 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor
{code}
      if (timelineEntity.getPrimaryFilters() != null &&
          timelineEntity.getPrimaryFilters().containsKey(
              TimelineStore.SystemFilter.ENTITY_OWNER)) {
        throw new YarnException(
{code}
getPrimaryFilters() returns a Map keyed by String. However, TimelineStore.SystemFilter.ENTITY_OWNER is an enum. Their types don't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
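A standalone illustration of the mismatch: a Map keyed by String never reports containsKey for an enum constant, so the guard above cannot fire. Converting the enum to its string form is shown as one possible repair, not necessarily the one chosen in the eventual patch; the map shape is simplified.
{code}
import java.util.HashMap;
import java.util.Map;

public class FilterKeyMismatchDemo {
  enum SystemFilter { ENTITY_OWNER }

  public static void main(String[] args) {
    Map<String, Object> primaryFilters = new HashMap<>();
    primaryFilters.put(SystemFilter.ENTITY_OWNER.toString(), "alice");

    // containsKey(Object) compiles with the enum constant but always returns false.
    System.out.println(primaryFilters.containsKey(SystemFilter.ENTITY_OWNER));            // false: wrong type
    System.out.println(primaryFilters.containsKey(SystemFilter.ENTITY_OWNER.toString())); // true
  }
}
{code}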
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014422#comment-14014422 ] Wei Yan commented on YARN-1913: --- Uploaded a new patch to address Sandy's comments. [~ashwinshankar77], if the leaf queue is not configured, the default AM resource limit is (leaf_queue_fair_share * 1.0f), still limited by its fair share. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014426#comment-14014426 ] Jian He commented on YARN-1368: --- bq. Kill container? Same for the following too? Good point, fixed. bq. Instead we should use getCurrentAttemptForContainer(ContainerId containerId)? I think the RMContainer should be created with the original attempt Id. The containerId to attemptId routing will happen automatically. bq. ContainerRecoveredTransition: Missing other transitions that a regular container goes through? Checked the code; we only need to send an event to update the ranNodes. Added here. Eventually, YARN-1885 should fix the ranNodes thing on recovery. bq. Kill the container when the following happens? I added a comment saying this condition can never happen. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, YARN-1368.4.patch, YARN-1368.5.patch, YARN-1368.7.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-2080: --- Attachment: YARN-2080.patch Attaching a patch file that wires the reservation APIs into existing YARN APIs. It introduces a new component *ReservationSystem* that essentially manages all the _Plans_ (#YARN-1709) configured in the ResourceSchedulers. The ReservationSystem is bootstrapped by ResourceManager if it is enabled in configuration. The ClientRMService has implementation of the reservation APIs which are additionally exposed via the YarnClient. Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan Attachments: YARN-2080.patch This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014439#comment-14014439 ] Wangda Tan commented on YARN-1368: -- [~jianhe], I mean, after RM restart and recovery, will RMAppAttempt.getMasterContainer return the correct master container or not? Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, YARN-1368.4.patch, YARN-1368.5.patch, YARN-1368.7.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014442#comment-14014442 ] Jian He commented on YARN-1368: --- bq. RMAppAttempt.getMasterContainer will return correct master container or not? Yes, RMAppAttemptImpl.recover does that. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, YARN-1368.4.patch, YARN-1368.5.patch, YARN-1368.7.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014445#comment-14014445 ] Hadoop QA commented on YARN-1368: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647729/YARN-1368.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 13 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3871//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3871//console This message is automatically generated. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, YARN-1368.4.patch, YARN-1368.5.patch, YARN-1368.7.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014447#comment-14014447 ] Hadoop QA commented on YARN-1913: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647733/YARN-1913.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3872//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3872//console This message is automatically generated. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
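To illustrate the kind of admission check such a limit implies, here is a small, self-contained sketch (the field names and the "negative means unlimited" convention are assumptions, not the attached patch's actual code): an AM is admitted only if the AM resource usage in the queue, plus the new AM's demand, stays within the configured fraction of the queue's share.
{code}
public class AmShareCheckSketch {

  /** Return true if another AM may start without exceeding the AM-share limit. */
  static boolean canRunAppAM(double queueFairShareMb,
                             double amResourceUsageMb,
                             double amDemandMb,
                             double maxAMShare) {
    // A negative value means "no limit" in this sketch.
    if (maxAMShare < 0) {
      return true;
    }
    double limit = queueFairShareMb * maxAMShare;
    return amResourceUsageMb + amDemandMb <= limit;
  }

  public static void main(String[] args) {
    // Queue share 10240 MB, AMs limited to 50% (5120 MB):
    // with 4096 MB of AMs running, a 2048 MB AM would need 6144 MB, so it is rejected.
    System.out.println(canRunAppAM(10240, 4096, 2048, 0.5)); // false
    // With only 2048 MB of AMs running, the same AM fits (4096 <= 5120).
    System.out.println(canRunAppAM(10240, 2048, 2048, 0.5)); // true
  }
}
{code}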
[jira] [Commented] (YARN-2118) Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()
[ https://issues.apache.org/jira/browse/YARN-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014481#comment-14014481 ] Zhijie Shen commented on YARN-2118: --- Ted, good catch! Do you want to pick up this issue? Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo() -- Key: YARN-2118 URL: https://issues.apache.org/jira/browse/YARN-2118 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu {code} if (timelineEntity.getPrimaryFilters() != null && timelineEntity.getPrimaryFilters().containsKey( TimelineStore.SystemFilter.ENTITY_OWNER)) { throw new YarnException( {code} getPrimaryFilters() returns a Map keyed by String. However, TimelineStore.SystemFilter.ENTITY_OWNER is an enum. Their types don't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
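A self-contained illustration of the mismatch: Map.containsKey() accepts Object, so passing the enum compiles, but a String-keyed map can never contain an enum key and the check is silently always false. Comparing on the enum's string form (which is presumably what the attached fix does) behaves as intended.
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class OwnerFilterCheckSketch {
  enum SystemFilter { ENTITY_OWNER }

  public static void main(String[] args) {
    Map<String, Set<Object>> primaryFilters = new HashMap<String, Set<Object>>();
    primaryFilters.put(SystemFilter.ENTITY_OWNER.toString(),
        Collections.<Object>singleton("alice"));

    // Enum key against a String-keyed map: compiles, but is always false.
    System.out.println(primaryFilters.containsKey(SystemFilter.ENTITY_OWNER));            // false

    // String form of the enum: true, which is what the guard intends to detect.
    System.out.println(primaryFilters.containsKey(SystemFilter.ENTITY_OWNER.toString())); // true
  }
}
{code}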
[jira] [Updated] (YARN-2118) Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()
[ https://issues.apache.org/jira/browse/YARN-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2118: -- Priority: Major (was: Minor) Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo() -- Key: YARN-2118 URL: https://issues.apache.org/jira/browse/YARN-2118 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu {code} if (timelineEntity.getPrimaryFilters() != null && timelineEntity.getPrimaryFilters().containsKey( TimelineStore.SystemFilter.ENTITY_OWNER)) { throw new YarnException( {code} getPrimaryFilters() returns a Map keyed by String. However, TimelineStore.SystemFilter.ENTITY_OWNER is an enum. Their types don't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2118) Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()
[ https://issues.apache.org/jira/browse/YARN-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reassigned YARN-2118: Assignee: Ted Yu Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo() -- Key: YARN-2118 URL: https://issues.apache.org/jira/browse/YARN-2118 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Attachments: yarn-2118-v1.txt {code} if (timelineEntity.getPrimaryFilters() != null && timelineEntity.getPrimaryFilters().containsKey( TimelineStore.SystemFilter.ENTITY_OWNER)) { throw new YarnException( {code} getPrimaryFilters() returns a Map keyed by String. However, TimelineStore.SystemFilter.ENTITY_OWNER is an enum. Their types don't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2118) Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()
[ https://issues.apache.org/jira/browse/YARN-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2118: - Attachment: yarn-2118-v1.txt Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo() -- Key: YARN-2118 URL: https://issues.apache.org/jira/browse/YARN-2118 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Attachments: yarn-2118-v1.txt {code} if (timelineEntity.getPrimaryFilters() != null && timelineEntity.getPrimaryFilters().containsKey( TimelineStore.SystemFilter.ENTITY_OWNER)) { throw new YarnException( {code} getPrimaryFilters() returns a Map keyed by String. However, TimelineStore.SystemFilter.ENTITY_OWNER is an enum. Their types don't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2118) Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()
[ https://issues.apache.org/jira/browse/YARN-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014494#comment-14014494 ] Hadoop QA commented on YARN-2118: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647748/yarn-2118-v1.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3873//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3873//console This message is automatically generated. Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo() -- Key: YARN-2118 URL: https://issues.apache.org/jira/browse/YARN-2118 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Attachments: yarn-2118-v1.txt {code} if (timelineEntity.getPrimaryFilters() != null && timelineEntity.getPrimaryFilters().containsKey( TimelineStore.SystemFilter.ENTITY_OWNER)) { throw new YarnException( {code} getPrimaryFilters() returns a Map keyed by String. However, TimelineStore.SystemFilter.ENTITY_OWNER is an enum. Their types don't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014529#comment-14014529 ] Jian He commented on YARN-1367: --- Thanks for working on the patch. The patch needs an update; can you please update it? A few initial comments: - Let's leave the containerId handling to YARN-2052 separately. - The extra ContainerReport in RegisterNodeManagerRequest is not needed any more. - The NM side may not need a config for whether work-preserving restart is enabled. Given that the RM already has this config, the RM should be able to instruct the NM to keep_containers_on_resync in the case of a work-preserving restart and kill_containers_on_resync in the case of a non-work-preserving restart. This also avoids config overhead on each NM. After restart NM should resync with the RM without killing containers - Key: YARN-1367 URL: https://issues.apache.org/jira/browse/YARN-1367 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1367.prototype.patch After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill its containers and instead inform the RM about all currently running containers, including their allocations etc. After re-registering, the NM should send all pending container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
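A hypothetical sketch of the last suggestion above: rather than an NM-side config, the RM's resync signal itself tells the NM whether to keep or kill its running containers. The enum values and handler below are illustrative names for the idea, not the attached patch's actual protocol.
{code}
public class ResyncHandlingSketch {

  /** Illustrative actions the RM could send in a heartbeat/resync response. */
  enum NodeAction { NORMAL, RESYNC_KEEP_CONTAINERS, RESYNC_KILL_CONTAINERS }

  static void onHeartbeatResponse(NodeAction action) {
    switch (action) {
      case RESYNC_KEEP_CONTAINERS:
        // Work-preserving RM restart: keep containers, re-register, report them to the RM.
        System.out.println("re-register with RM, keep and report running containers");
        break;
      case RESYNC_KILL_CONTAINERS:
        // Non-work-preserving restart: today's behavior, kill containers then re-register.
        System.out.println("kill containers, then re-register with RM");
        break;
      default:
        System.out.println("normal heartbeat");
    }
  }

  public static void main(String[] args) {
    onHeartbeatResponse(NodeAction.RESYNC_KEEP_CONTAINERS);
  }
}
{code}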