[jira] [Updated] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse
[ https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1897: -- Attachment: YARN-1897-4.patch Updated patch per Vinod's suggestions. 1. Clean up SignalContainerCommand. 2. Support signalContainersRequest. Define SignalContainerRequest and SignalContainerResponse - Key: YARN-1897 URL: https://issues.apache.org/jira/browse/YARN-1897 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch, YARN-1897.1.patch We need to define SignalContainerRequest and SignalContainerResponse first as they are needed by other sub-tasks. SignalContainerRequest should use OS-independent commands and provide a way for the application to specify a reason for diagnosis. SignalContainerResponse might be empty. -- This message was sent by Atlassian JIRA (v6.2#6252)
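For readers following the API discussion, a minimal sketch of the shape such records could take, assuming a platform-independent SignalContainerCommand enum as described above; the names and members below are illustrative, not the committed API from the attached patches:
{code}
// Hypothetical sketch of the proposed records; the real definitions live in
// the YARN-1897 patches.
public enum SignalContainerCommand {
  OUTPUT_THREAD_DUMP,  // e.g. maps to SIGQUIT on Linux
  GRACEFUL_SHUTDOWN,   // e.g. maps to SIGTERM
  FORCEFUL_SHUTDOWN    // e.g. maps to SIGKILL
}

public abstract class SignalContainerRequest {
  public abstract ContainerId getContainerId();
  public abstract SignalContainerCommand getCommand();
  // Free-form reason supplied by the application, recorded for diagnosis.
  public abstract String getReason();
}

// Possibly empty; it exists so the RPC has a well-defined return type.
public abstract class SignalContainerResponse {
}
{code}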
[jira] [Commented] (YARN-1803) Signal container support in nodemanager
[ https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002847#comment-14002847 ] Ming Ma commented on YARN-1803: --- Vinod, I have updated YARN-1897. Please let me know if you have other suggestions. I can also upload updated versions for the other subtasks that depend on YARN-1897. Signal container support in nodemanager --- Key: YARN-1803 URL: https://issues.apache.org/jira/browse/YARN-1803 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1803.patch It could include the following: 1. ContainerManager is able to process a new event type ContainerManagerEventType.SIGNAL_CONTAINERS coming from NodeStatusUpdater and deliver the request to ContainerExecutor. 2. Translate the platform-independent signal command to Linux-specific signals. Windows support will be tracked by another task. -- This message was sent by Atlassian JIRA (v6.2#6252)
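As a rough illustration of item 2, the platform-independent command could be translated to Linux signals along these lines (a hypothetical helper, not the attached patch; the command names repeat the assumption sketched above):
{code}
// Map an OS-independent command to a POSIX signal number. SIGQUIT (3) makes a
// JVM print a thread dump, SIGTERM (15) requests graceful shutdown, and
// SIGKILL (9) terminates the process outright.
public final class LinuxSignalTranslator {
  private LinuxSignalTranslator() {}

  public static int toLinuxSignal(SignalContainerCommand cmd) {
    switch (cmd) {
      case OUTPUT_THREAD_DUMP: return 3;   // SIGQUIT
      case GRACEFUL_SHUTDOWN:  return 15;  // SIGTERM
      case FORCEFUL_SHUTDOWN:  return 9;   // SIGKILL
      default:
        throw new IllegalArgumentException("Unknown command: " + cmd);
    }
  }
}
{code}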
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002858#comment-14002858 ] Rohith commented on YARN-1366: -- bq. If there's no RM restart, a normal app only calling unregister without calling register earlier will be just deemed as FINISHED ? is this acceptable? No. The mutual contract is that unregistration should not be called before registration (MR handles this: MAPREDUCE-5769), but as defensive programming this still has to be handled in YARN. What about storing information in ZooKeeper for registered applications? This could be read during recovery to move the application directly to RUNNING. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
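To make the resync contract concrete: on a resync command the AM resets its allocate sequence number to 0 and re-sends its entire outstanding ask. A minimal sketch of that loop follows; {{amRmProtocol}}, {{responseId}}, {{getProgress()}} and {{releaseList}} are assumed surrounding fields, and the real wiring would belong in AMRMClient rather than application code:
{code}
// Re-drive allocate after an RM restart: restart the sequence number at 0 and
// re-send the whole outstanding request, per the resync semantics above.
private AllocateResponse allocateWithResync(List<ResourceRequest> outstandingAsks)
    throws YarnException, IOException {
  while (true) {
    AllocateRequest request = AllocateRequest.newInstance(
        responseId, getProgress(), outstandingAsks, releaseList, null);
    AllocateResponse response = amRmProtocol.allocate(request);
    if (response.getAMCommand() == AMCommand.AM_RESYNC) {
      responseId = 0;  // the restarted RM has no memory of the previous sequence
      continue;        // re-send the entire outstanding ask
    }
    responseId++;
    // Completed containers may be reported more than once across a resync,
    // so consumers should de-duplicate by ContainerId.
    return response;
  }
}
{code}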
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002873#comment-14002873 ] Rohith commented on YARN-1366: -- Adding to the above point: enforce in AMRMClient that unregistration cannot be called before registering. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
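Such a client-side guard could be as small as the following sketch (illustrative only, with assumed field and delegate names; it is not the AMRMClient implementation):
{code}
// Enforce the register-before-unregister contract at the client.
private boolean registered = false;

public synchronized RegisterApplicationMasterResponse registerApplicationMaster(
    String appHostName, int appHostPort, String appTrackingUrl)
    throws YarnException, IOException {
  RegisterApplicationMasterResponse response =
      client.registerApplicationMaster(appHostName, appHostPort, appTrackingUrl);
  registered = true;
  return response;
}

public synchronized void unregisterApplicationMaster(FinalApplicationStatus appStatus,
    String appMessage, String appTrackingUrl) throws YarnException, IOException {
  if (!registered) {
    throw new IllegalStateException(
        "unregisterApplicationMaster called before registerApplicationMaster");
  }
  client.unregisterApplicationMaster(appStatus, appMessage, appTrackingUrl);
}
{code}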
[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse
[ https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002875#comment-14002875 ] Gera Shegalov commented on YARN-1897: - I am confused, [~mingma]. I thought we agreed to do it as YARN-1515. Define SignalContainerRequest and SignalContainerResponse - Key: YARN-1897 URL: https://issues.apache.org/jira/browse/YARN-1897 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch, YARN-1897.1.patch We need to define SignalContainerRequest and SignalContainerResponse first as they are needed by other sub-tasks. SignalContainerRequest should use OS-independent commands and provide a way for the application to specify a reason for diagnosis. SignalContainerResponse might be empty. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2077) JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs
[ https://issues.apache.org/jira/browse/YARN-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2077: - Affects Version/s: 2.4.0 JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs Key: YARN-2077 URL: https://issues.apache.org/jira/browse/YARN-2077 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2077.1.patch JobImpl#makeUberDecision usually logs why the job cannot be launched in Uber mode (e.g. too much RAM). The CPU condition, however, is currently not logged. We should log it when there are too many CPUs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2078) yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented
Tsuyoshi OZAWA created YARN-2078: Summary: yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented Key: YARN-2078 URL: https://issues.apache.org/jira/browse/YARN-2078 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial We should document the conditions under which uber mode is enabled. If not, users need to read the code. {code} boolean smallMemory = ( (Math.max(conf.getLong(MRJobConfig.MAP_MEMORY_MB, 0), conf.getLong(MRJobConfig.REDUCE_MEMORY_MB, 0)) <= sysMemSizeForUberSlot) || (sysMemSizeForUberSlot == JobConf.DISABLED_MEMORY_LIMIT)); boolean smallCpu = Math.max( conf.getInt( MRJobConfig.MAP_CPU_VCORES, MRJobConfig.DEFAULT_MAP_CPU_VCORES), conf.getInt( MRJobConfig.REDUCE_CPU_VCORES, MRJobConfig.DEFAULT_REDUCE_CPU_VCORES)) <= sysCPUSizeForUberSlot {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
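For context, the memory and CPU checks quoted above are only two of the conditions that get combined; paraphrasing JobImpl#makeUberDecision (simplified, so consult the source for the authoritative set), the decision looks roughly like:
{code}
// Rough paraphrase of JobImpl#makeUberDecision. The uber "slot" limits come
// from the MRAppMaster's own container size, i.e.
// yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.resource.cpu-vcores,
// which is why those settings affect uber mode and deserve documentation.
boolean uberEnabled =
    conf.getBoolean(MRJobConfig.JOB_UBERTASK_ENABLE, false);
boolean smallNumMapTasks = (numMapTasks <= sysMaxMaps);          // mapreduce.job.ubertask.maxmaps
boolean smallNumReduceTasks = (numReduceTasks <= sysMaxReduces); // mapreduce.job.ubertask.maxreduces
boolean smallInput = (dataInputLength <= sysMaxBytes);           // mapreduce.job.ubertask.maxbytes

isUber = uberEnabled && smallNumMapTasks && smallNumReduceTasks
    && smallInput && smallMemory && smallCpu;
{code}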
[jira] [Updated] (YARN-2078) yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented
[ https://issues.apache.org/jira/browse/YARN-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2078: - Attachment: YARN-2078.1.patch yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented -- Key: YARN-2078 URL: https://issues.apache.org/jira/browse/YARN-2078 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2078.1.patch We should document the conditions under which uber mode is enabled. If not, users need to read the code. {code} boolean smallMemory = ( (Math.max(conf.getLong(MRJobConfig.MAP_MEMORY_MB, 0), conf.getLong(MRJobConfig.REDUCE_MEMORY_MB, 0)) <= sysMemSizeForUberSlot) || (sysMemSizeForUberSlot == JobConf.DISABLED_MEMORY_LIMIT)); boolean smallCpu = Math.max( conf.getInt( MRJobConfig.MAP_CPU_VCORES, MRJobConfig.DEFAULT_MAP_CPU_VCORES), conf.getInt( MRJobConfig.REDUCE_CPU_VCORES, MRJobConfig.DEFAULT_REDUCE_CPU_VCORES)) <= sysCPUSizeForUberSlot {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2078) yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented
[ https://issues.apache.org/jira/browse/YARN-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2078: - Component/s: documentation yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented -- Key: YARN-2078 URL: https://issues.apache.org/jira/browse/YARN-2078 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2078.1.patch We should document the conditions under which uber mode is enabled. If not, users need to read the code. {code} boolean smallMemory = ( (Math.max(conf.getLong(MRJobConfig.MAP_MEMORY_MB, 0), conf.getLong(MRJobConfig.REDUCE_MEMORY_MB, 0)) <= sysMemSizeForUberSlot) || (sysMemSizeForUberSlot == JobConf.DISABLED_MEMORY_LIMIT)); boolean smallCpu = Math.max( conf.getInt( MRJobConfig.MAP_CPU_VCORES, MRJobConfig.DEFAULT_MAP_CPU_VCORES), conf.getInt( MRJobConfig.REDUCE_CPU_VCORES, MRJobConfig.DEFAULT_REDUCE_CPU_VCORES)) <= sysCPUSizeForUberSlot {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2078) yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented
[ https://issues.apache.org/jira/browse/YARN-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2078: - Affects Version/s: 2.4.0 yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented -- Key: YARN-2078 URL: https://issues.apache.org/jira/browse/YARN-2078 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2078.1.patch We should document the conditions under which uber mode is enabled. If not, users need to read the code. {code} boolean smallMemory = ( (Math.max(conf.getLong(MRJobConfig.MAP_MEMORY_MB, 0), conf.getLong(MRJobConfig.REDUCE_MEMORY_MB, 0)) <= sysMemSizeForUberSlot) || (sysMemSizeForUberSlot == JobConf.DISABLED_MEMORY_LIMIT)); boolean smallCpu = Math.max( conf.getInt( MRJobConfig.MAP_CPU_VCORES, MRJobConfig.DEFAULT_MAP_CPU_VCORES), conf.getInt( MRJobConfig.REDUCE_CPU_VCORES, MRJobConfig.DEFAULT_REDUCE_CPU_VCORES)) <= sysCPUSizeForUberSlot {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2077) JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs
[ https://issues.apache.org/jira/browse/YARN-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2077: - Component/s: client JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs Key: YARN-2077 URL: https://issues.apache.org/jira/browse/YARN-2077 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.4.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2077.1.patch JobImpl#makeUberDecision usually logs why the job cannot be launched in Uber mode (e.g. too much RAM). The CPU condition, however, is currently not logged. We should log it when there are too many CPUs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang updated YARN-2030: Attachment: YARN-2030.v1.patch Attaching patch. This is a code refactor; TestFSRMStateStore and TestZKRMStateStore already cover the code here, so no additional test is added. Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Attachments: YARN-2030.v1.patch Now the logic to handle different store events in handleStoreEvent() is as follows: {code} if (event.getType().equals(RMStateStoreEventType.STORE_APP) || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } ... try { if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT) || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) { ... } else { ... } } {code} This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this, even with no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
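The direction such a refactor could take, using the StateMachineFactory infrastructure already in yarn-common (a sketch only; the state and transition names are illustrative, see the attached patch for the real set):
{code}
// With a single pseudo-state, the factory replaces the nested if/else chains:
// each event type gets exactly one transition, and each transition class holds
// the logic that used to live in one branch of handleStoreEvent().
private static final StateMachineFactory<RMStateStore, RMStateStoreState,
    RMStateStoreEventType, RMStateStoreEvent> stateMachineFactory =
  new StateMachineFactory<RMStateStore, RMStateStoreState,
      RMStateStoreEventType, RMStateStoreEvent>(RMStateStoreState.DEFAULT)
    .addTransition(RMStateStoreState.DEFAULT, RMStateStoreState.DEFAULT,
        RMStateStoreEventType.STORE_APP, new StoreAppTransition())
    .addTransition(RMStateStoreState.DEFAULT, RMStateStoreState.DEFAULT,
        RMStateStoreEventType.UPDATE_APP, new UpdateAppTransition())
    .addTransition(RMStateStoreState.DEFAULT, RMStateStoreState.DEFAULT,
        RMStateStoreEventType.REMOVE_APP, new RemoveAppTransition())
    .installTopology();

private static class StoreAppTransition
    implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
  @Override
  public void transition(RMStateStore store, RMStateStoreEvent event) {
    // store the application state; notify the dispatcher on success or failure
  }
}
{code}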
[jira] [Assigned] (YARN-2051) Add more unit tests for PBImpl that didn't get covered
[ https://issues.apache.org/jira/browse/YARN-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang reassigned YARN-2051: --- Assignee: Binglin Chang Add more unit tests for PBImpl that didn't get covered -- Key: YARN-2051 URL: https://issues.apache.org/jira/browse/YARN-2051 Project: Hadoop YARN Issue Type: Test Reporter: Junping Du Assignee: Binglin Chang Priority: Critical From YARN-2016, we can see that bugs could exist in the PB implementations of the protocol. The bad news is that most of these PBImpls don't have any unit test to verify that the info is not lost or changed after serialization/deserialization. We should add more tests for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002937#comment-14002937 ] Hudson commented on YARN-2053: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5606 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5606/]) YARN-2053. Fixed a bug in AMS to not add null NMToken into NMTokens list from previous attempts for work-preserving AM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595116) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002941#comment-14002941 ] Hudson commented on YARN-2066: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5606 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5606/]) YARN-2066. Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder (Contributed by Hong Zhiguo) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595413) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Fix For: 2.4.1 Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
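For clarity, the fix implied by the description is to read both bounds from {{this.finish}} instead of {{start}} — roughly:
{code}
// Corrected: the finish-time range must come from this.finish, not start.
if (this.finish != null) {
  builder.setFinishBegin(finish.getMinimumLong());
  builder.setFinishEnd(finish.getMaximumLong());
}
{code}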
[jira] [Commented] (YARN-2078) yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented
[ https://issues.apache.org/jira/browse/YARN-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002981#comment-14002981 ] Hadoop QA commented on YARN-2078: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645748/YARN-2078.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3768//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3768//console This message is automatically generated. yarn.app.am.resource.mb/cpu-vcores affects uber mode but is not documented -- Key: YARN-2078 URL: https://issues.apache.org/jira/browse/YARN-2078 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2078.1.patch We should document the conditions under which uber mode is enabled. If not, users need to read the code. {code} boolean smallMemory = ( (Math.max(conf.getLong(MRJobConfig.MAP_MEMORY_MB, 0), conf.getLong(MRJobConfig.REDUCE_MEMORY_MB, 0)) <= sysMemSizeForUberSlot) || (sysMemSizeForUberSlot == JobConf.DISABLED_MEMORY_LIMIT)); boolean smallCpu = Math.max( conf.getInt( MRJobConfig.MAP_CPU_VCORES, MRJobConfig.DEFAULT_MAP_CPU_VCORES), conf.getInt( MRJobConfig.REDUCE_CPU_VCORES, MRJobConfig.DEFAULT_REDUCE_CPU_VCORES)) <= sysCPUSizeForUberSlot {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2077) JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs
[ https://issues.apache.org/jira/browse/YARN-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002983#comment-14002983 ] Hadoop QA commented on YARN-2077: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645746/YARN-2077.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3767//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3767//console This message is automatically generated. JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs Key: YARN-2077 URL: https://issues.apache.org/jira/browse/YARN-2077 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.4.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2077.1.patch JobImpl#makeUberDecision usually logs why the job cannot be launched in Uber mode (e.g. too much RAM). The CPU condition, however, is currently not logged. We should log it when there are too many CPUs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse
[ https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003009#comment-14003009 ] Hadoop QA commented on YARN-1897: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645735/YARN-1897-4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3771//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3771//console This message is automatically generated. Define SignalContainerRequest and SignalContainerResponse - Key: YARN-1897 URL: https://issues.apache.org/jira/browse/YARN-1897 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch, YARN-1897.1.patch We need to define SignalContainerRequest and SignalContainerResponse first as they are needed by other sub-tasks. SignalContainerRequest should use OS-independent commands and provide a way for the application to specify a reason for diagnosis. SignalContainerResponse might be empty. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2075) TestRMAdminCLI consistently fail on trunk
[ https://issues.apache.org/jira/browse/YARN-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003027#comment-14003027 ] Hadoop QA commented on YARN-2075: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645730/YARN-2075.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3769//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3769//console This message is automatically generated. TestRMAdminCLI consistently fail on trunk - Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Attachments: YARN-2075.patch {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec <<< FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec <<< ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003035#comment-14003035 ] Hadoop QA commented on YARN-941: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645713/YARN-941.preview.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3770//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3770//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3770//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3770//console This message is automatically generated. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003208#comment-14003208 ] Hadoop QA commented on YARN-2030: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645754/YARN-2030.v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3772//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3772//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3772//console This message is automatically generated. Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Attachments: YARN-2030.v1.patch Now the logic to handle different store events in handleStoreEvent() is as follows: {code} if (event.getType().equals(RMStateStoreEventType.STORE_APP) || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } ... try { if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT) || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) { ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } ... if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) { ... } else { ... } } ... } else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) { ... } else { ... } } {code} This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this, even with no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003248#comment-14003248 ] Hudson commented on YARN-2066: -- FAILURE: Integrated in Hadoop-Yarn-trunk #562 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/562/]) YARN-2066. Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder (Contributed by Hong Zhiguo) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595413) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Fix For: 2.4.1 Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003257#comment-14003257 ] Hudson commented on YARN-2053: -- FAILURE: Integrated in Hadoop-Yarn-trunk #562 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/562/]) YARN-2053. Fixed a bug in AMS to not add null NMToken into NMTokens list from previous attempts for work-preserving AM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595116) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003292#comment-14003292 ] Hudson commented on YARN-2053: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1754 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1754/]) YARN-2053. Fixed a bug in AMS to not add null NMToken into NMTokens list from previous attempts for work-preserving AM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595116) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003283#comment-14003283 ] Hudson commented on YARN-2066: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1754 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1754/]) YARN-2066. Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder (Contributed by Hong Zhiguo) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595413) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Fix For: 2.4.1 Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2079) Recover NonAggregatingLogHandler state upon nodemanager restart
Jason Lowe created YARN-2079: Summary: Recover NonAggregatingLogHandler state upon nodemanager restart Key: YARN-2079 URL: https://issues.apache.org/jira/browse/YARN-2079 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Jason Lowe The state of NonAggregatingLogHandler needs to be persisted so logs are properly deleted across a nodemanager restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003460#comment-14003460 ] Hudson commented on YARN-2066: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1780 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1780/]) YARN-2066. Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder (Contributed by Hong Zhiguo) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595413) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Fix For: 2.4.1 Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003469#comment-14003469 ] Hudson commented on YARN-2053: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1780 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1780/]) YARN-2053. Fixed a bug in AMS to not add null NMToken into NMTokens list from previous attempts for work-preserving AM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595116) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003440#comment-14003440 ] bc Wong commented on YARN-941: -- Hi [~xgong], thanks for the patch! I'm interested in talking through the changes and their security implications, for everybody who's following along. I think the following are worth highlighting: # The token update mechanism is via the AM heartbeat. So if the previous AMRM token has been compromised, the attacker can get the new token. ** I don't think it's a big problem as the RM will only hand out the new token in _exactly_ one AllocateResponse (except for the case of RM restart). So if the attacker has the new token, the real AM won't, and it'll die and the token will get revoked. # How frequently a running AM gets an updated token is at the mercy of the configuration (the roll interval and activation delay). In addition, whenever the RM restarts, all AMs will get a new token on the next heartbeat. ** Should the RM check that the roll interval and activation delay are both shorter than the token expiration interval? # The client app is not responsible for renewing the token. The RM will renew it proactively and update the apps. ** The loss of control may be inconvenient to the app. The AM must also heartbeat frequently enough to catch the update in time. In practice, it's not an issue. But it still makes me slightly uncomfortable, since the client is usually the one renewing its credentials in all other security protocols I know of. Here, the RM doesn't have any explicit logic to update an AMRM token before it expires. The math just generally works out if the admin sets the token expiry, roll interval and activation delay to the right values.\\ \\ Again, I think this is better than making it the AM's responsibility to get a new token, which puts more burden on the AM. I just want to bring this up for discussion. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
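For those following along, the AM-side pickup under this design would be roughly the sketch below. It assumes the heartbeat response exposes the rolled token via something like {{getAMRMToken()}} — an assumption based on the preview patches, not a settled API:
{code}
// After each heartbeat, adopt a rolled AMRM token if the RM handed one out.
AllocateResponse response = amRmProtocol.allocate(allocateRequest);
org.apache.hadoop.yarn.api.records.Token rolled = response.getAMRMToken();
if (rolled != null) {
  Token<AMRMTokenIdentifier> amrmToken = new Token<AMRMTokenIdentifier>(
      rolled.getIdentifier().array(), rolled.getPassword().array(),
      new Text(rolled.getKind()), new Text(rolled.getService()));
  // Replace the token in the current UGI so subsequent RPCs authenticate
  // with the new secret before the old one expires.
  UserGroupInformation.getCurrentUser().addToken(amrmToken);
}
{code}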
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003489#comment-14003489 ] Junping Du commented on YARN-1338: -- [~jlowe], thanks again for your patch here! A few comments so far: One question in general: besides the null store and the leveldb store, I saw a memory store implemented there but no usage so far. Does it help in some scenario, or is it only for test purposes? In NodeManager#serviceInit() {code} if (recoveryEnabled) { ... + nmStore = new NMLeveldbStateStoreService(); +} else { + nmStore = new NMNullStateStoreService(); } +nmStore.init(conf); +nmStore.start(); {code} Can we abstract the code starting from the if block into a method, something like initializeNMStore(conf)? That would make NodeManager#serviceInit() simpler. In yarn_server_nodemanager_recovery.proto, {code} +message LocalizedResourceProto { + optional LocalResourceProto resource = 1; + optional string localPath = 2; + optional int64 size = 3; +} {code} Does size here represent the size of the local resource? If so, it may duplicate the size within LocalResourceProto. In ResourceLocalizationService.java {code} + //Recover localized resources after an NM restart + public void recoverLocalizedResources(RecoveredLocalizationState state) + throws URISyntaxException { + ... + for (Map.Entry<ApplicationId, LocalResourceTrackerState> appEntry : + userResources.getAppTrackerStates().entrySet()) { +ApplicationId appId = appEntry.getKey(); +... +recoverTrackerResources(tracker, appEntry.getValue()); + } +} + } {code} Maybe we should check that appResourceState (appEntry.getValue())'s localizedResources and inProgressResources are not empty before recovering it, as we check for userResourceState? In NMMemoryStateStoreService#loadLocalizationState() {code} ... +if (tk.appId == null) { + rur.privateTrackerState = loadTrackerState(ts); +} else { + rur.appTrackerStates.put(tk.appId, loadTrackerState(ts)); +} ... {code} Maybe even in the case tk.appId != null, we should load the private resource state as well? Given that the patch is quite big, I haven't finished my review although I have walked through it a few times. More comments may come later. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
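The serviceInit suggestion above amounts to something like this sketch (the method name initializeNMStore comes from the comment itself; the body just relocates the quoted lines):
{code}
// Suggested extraction: keep NodeManager#serviceInit() short by moving the
// store selection and startup into one helper.
private void initializeNMStore(Configuration conf) throws IOException {
  if (recoveryEnabled) {
    nmStore = new NMLeveldbStateStoreService();
  } else {
    nmStore = new NMNullStateStoreService();
  }
  nmStore.init(conf);
  nmStore.start();
}
{code}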
[jira] [Commented] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003548#comment-14003548 ] Jason Lowe commented on YARN-2050: -- +1 lgtm. Committing this. Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-2050-2.patch, YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2076) Minor error in TestLeafQueue files
[ https://issues.apache.org/jira/browse/YARN-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2076: -- Attachment: YARN-2076.patch Minor error in TestLeafQueue files -- Key: YARN-2076 URL: https://issues.apache.org/jira/browse/YARN-2076 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Chen He Assignee: Chen He Priority: Minor Labels: test Attachments: YARN-2076.patch numNodes should be 2 instead of 3 in testReservationExchange() since only two nodes are defined. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003675#comment-14003675 ] Hadoop QA commented on YARN-1680: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645816/YARN-1680.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3773//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3773//console This message is automatically generated. availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers because, for the reducer preemption calculation, the headroom includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that includes their free memory). -- This message was sent by Atlassian JIRA (v6.2#6252)
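The remedy under discussion is to subtract the blacklisted nodes' free resources from the headroom the RM reports in the heartbeat. A rough sketch, with {{blacklistedNodesOf()}} as an assumed helper (this is not the attached patch):
{code}
// Report headroom that excludes capacity the AM can never use because it
// blacklisted the nodes holding it; clamp at zero to avoid negative headroom.
Resource headroom = Resources.clone(application.getHeadroom());
for (SchedulerNode node : blacklistedNodesOf(application)) {
  Resources.subtractFrom(headroom, node.getAvailableResource());
}
if (headroom.getMemory() < 0) {
  headroom.setMemory(0);
}
if (headroom.getVirtualCores() < 0) {
  headroom.setVirtualCores(0);
}
allocateResponse.setAvailableResources(headroom);
{code}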
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003678#comment-14003678 ] Bikas Saha commented on YARN-1366: -- bq. If there's no RM restart, a normal app only calling unregister without calling register earlier will be just deemed as FINISHED ? is this acceptable? bq. What about storing information on zk for registered application. Catching incorrect unregistration before registration should have always been there. Is this a regression in the patch or an existing bug? Should we consider the possibility of allowing unregister without register? What are the downsides? As long as we can make sure that unregister is coming from the latest version of the app. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1365: Attachment: YARN-1365.002.patch Added ApplicationMasterService changes to send SHUTDOWN for an attempt that's not known, and RESYNC for allocate if the AM has not registered after restart. Added more unit tests that verify these. Still pending: how to handle unregister after restart for an unregistered AM. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2076) Minor error in TestLeafQueue files
[ https://issues.apache.org/jira/browse/YARN-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003703#comment-14003703 ] Hadoop QA commented on YARN-2076: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645818/YARN-2076.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3774//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3774//console This message is automatically generated. Minor error in TestLeafQueue files -- Key: YARN-2076 URL: https://issues.apache.org/jira/browse/YARN-2076 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Chen He Assignee: Chen He Priority: Minor Labels: test Attachments: YARN-2076.patch numNodes should be 2 instead of 3 in testReservationExchange() since only two nodes are defined. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003782#comment-14003782 ] Hadoop QA commented on YARN-1365: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645826/YARN-1365.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3775//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3775//console This message is automatically generated. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003796#comment-14003796 ] Marcelo Vanzin commented on YARN-941: - Apologies for jumping in the middle of the conversation. I don't have a lot of background in the YARN code here, but from this bug and some internal discussions I have a question for people who are more familiar with this code: What is the purpose of this renewal mechanism? So far it seems to me that it's an attack mitigation feature. An attacker who is able to get the token would only be able to use it while the original application (i) is running and (ii) keeps renewing the token. If that's true, it sounds to me like the problem is actually that it's possible to sniff the token in the first place. Wouldn't it be better, at that point, to have a protocol that doesn't allow that? Either using full-blown encryption for the RPC channels, or if that's deemed too expensive, some mechanism where tokens are negotiated instead of sent in plain text over the wire. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.patch When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1935) Security for timeline server
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1935: -- Attachment: Timeline_Kerberos_DT_ACLs.2.patch Timeline Security Diagram.pdf Hi folks, I've just attached a diagram Timeline Security Diagram.pdf to demonstrate the rough workflow of the timeline security. In general, it consists of two parts: authentication and authorization. *1. Authentication* a) When authentication is enabled, a customized authentication filter will be loaded into the webapp of the timeline server, which prevents unauthorized users from accessing any timeline web resources. The filter allows users to: * negotiate the authentication via HTTP SPNEGO, and log in with a Kerberos principal and keytab; and * request a delegation token after Kerberos login and use it for follow-up secured communication. b) TimelineClient is adapted to pass the authentication before putting the timeline data. It can choose to append the Kerberos token or a delegation token to the HTTP request. The rationale behind supporting delegation tokens is to allow the AM and other containers, where Kerberos credentials are not available, to use TimelineClient to put the timeline data in a secured manner. c) TimelineClient also has an API to get a delegation token from the timeline server (actually from the customized authentication filter). When security and the timeline service are both enabled and YarnClient is used to submit an application, YarnClient will automatically call TimelineClient to get a delegation token and put it into the application submission context, such that the AM can use the passed-in delegation token to communicate with the timeline server securely. d) Any tool that supports SPNEGO/Kerberos, such as Firefox or curl, can access the three GET APIs of the timeline server to query the timeline data. *2. Authorization* Once the request from an authenticated user passes the customized authentication filter, it will be processed by the timeline web services. Here we use the ACLs manager to determine whether the user of the request has access to the requested data. The basic rules are as follows: * The access control granularity is the entity, which means a user can either access all the information of an entity and its events, or none of it. * Currently we only allow the owner of the entity to access it. In the future, we can simply extend the rule to allow admins and users/groups on the access control list. *Configuration* Finally, to enable timeline security, we need to set up Kerberos. In addition, there are a number of configurations to set: * Make use of the filter initializer to set up the customized authentication filter; the configuration is much like the hadoop-auth style; and * ACLs are controlled by the YARN ACLs configuration, as for other YARN daemons. I also uploaded my newest uber patch Timeline_Kerberos_DT_ACLs.2.patch to demonstrate how the design is implemented. Security for timeline server Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Zhijie Shen Attachments: Timeline Security Diagram.pdf, Timeline_Kerberos_DT_ACLs.2.patch, Timeline_Kerberos_DT_ACLs.patch Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
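To make step c) concrete, a small hedged sketch of fetching a timeline delegation token from the client side, assuming the TimelineClient API described above (createTimelineClient plus a getDelegationToken(renewer) call) and a Kerberos login that has already happened; the renewer principal below is a placeholder.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineTokenFetch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    // Assumes Kerberos login has already happened (e.g. via kinit or keytab).
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(conf);
    client.start();
    try {
      // Placeholder renewer principal; per the description, YarnClient would
      // do this at submission time and stash the token in the app
      // submission context for the AM to use.
      Token<?> token = client.getDelegationToken("rm/_HOST@EXAMPLE.COM");
      System.out.println("Timeline delegation token: " + token);
    } finally {
      client.stop();
    }
  }
}
{code}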
[jira] [Updated] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1709: --- Description: This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over-time. (was: This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The inventory subsystem is conceptually a plan of how the capacity scheduler will be configured over-time.) Summary: Admission Control: Reservation subsystem (was: Admission Control: inventory subsystem) Admission Control: Reservation subsystem Key: YARN-1709 URL: https://issues.apache.org/jira/browse/YARN-1709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subramaniam Krishnan This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over-time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
Subramaniam Krishnan created YARN-2080: -- Summary: Admission Control: Integrate Reservation subsystem with ResourceManager Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over-time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-2080: --- Description: This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. (was: This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over-time.) Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2074: - Assignee: Jian He (was: Vinod Kumar Vavilapalli) Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003892#comment-14003892 ] Jian He commented on YARN-2074: --- I'd like to work on this. Taking this over. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003960#comment-14003960 ] Vinod Kumar Vavilapalli commented on YARN-2074: --- [~sunilg], Agree that as much as possible we should avoid killing the AM during preemption, so we should look at YARN-2022 orthogonally. This one focuses only on the point that in the case where this cannot be avoided, it shouldn't be counted towards AM failures. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1569: Attachment: yarn-1569.patch For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Assignee: zhihai xu Priority: Minor Labels: newbie Attachments: yarn-1569.patch Per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't check so far (no bug there now) but should be improved to match FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
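For illustration, a hedged sketch of the guarded-cast pattern the checklist calls for, along the lines of what FairScheduler already does. The event classes and accessors follow the real scheduler event types, but the handler body is abbreviated, and addNode/removeNode/LOG are assumed surrounding context rather than quoted code.
{code}
// Inside a scheduler's handle(SchedulerEvent event) method:
switch (event.getType()) {
case NODE_ADDED:
  if (!(event instanceof NodeAddedSchedulerEvent)) {
    throw new RuntimeException("Unexpected event type: " + event);
  }
  NodeAddedSchedulerEvent nodeAddedEvent = (NodeAddedSchedulerEvent) event;
  addNode(nodeAddedEvent.getAddedRMNode());
  break;
case NODE_REMOVED:
  if (!(event instanceof NodeRemovedSchedulerEvent)) {
    throw new RuntimeException("Unexpected event type: " + event);
  }
  removeNode(((NodeRemovedSchedulerEvent) event).getRemovedRMNode());
  break;
default:
  LOG.error("Unknown event arrived at scheduler: " + event.toString());
}
{code}
The instanceof guard costs one check per event but turns a silent ClassCastException into an explicit, diagnosable failure when an event is mis-routed.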
[jira] [Commented] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003994#comment-14003994 ] Vinod Kumar Vavilapalli commented on YARN-1938: --- Looks good to me too. Can you add the new configs into yarn-default.xml? Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1938.1.patch, YARN-1938.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1938: -- Target Version/s: 2.5.0 Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1938.1.patch, YARN-1938.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1338: - Attachment: YARN-1338v5.patch Thanks for the review, Junping! Attaching a patch to address your comments with specific responses below. bq. beside null store and a leveled store, I saw a memory store implemented there but no usage so far. Does it helps in some scenario or only for test purpose? It's only for use in unit tests, which is why it's located under src/test/. It stores state in the memory of the JVM itself, so it's not very useful for real-world recovery scenarios. The state is lost when the NM crashes/exits. bq. Can we abstract code since if block into a method, something like: initializeNMStore(conf)? which can make NodeManager#serviceInit() simpler. Done. bq. Does size here represent for size of local resource? If so, may be duplicated with the size within LocalResourceProto? As I understand it, they are slightly different. The size in the LocalResourceProto is the size of the resource that will be downloaded, while the size in LocalizedResource (and also persisted in LocalizedResourceProto) is the size of the resource on the local disk. These can be different if the resource is uncompressed/unarchived after downloading (e.g.: a .tar.gz resource). bq. May be we should check appResourceState(appEntry.getValue)’s localizedResources and inProgressResources is not empty before recover it as we check for userResourceState? Done. I also added a LocalResourceTrackerState#isEmpty method to make the code a bit cleaner. bq. May be even in case tk.appId !=null, we should load private resource state as well? No, if tk.appId is not null then this is state for an app-specific resource tracker and not for a private resource tracker. See the javadoc for NMStateStoreService#startResourceLocalization or NMStateStoreService#finishResourceLocalization for some hints, and I also added some comments to the NMMemoryStateStoreService to clarify how the user and appId are used to discern public vs. private vs. app-specific trackers. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
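The public vs. private vs. app-specific distinction in the last answer can be summarized in a small sketch. The (user, appId) pair mirrors what the comment describes being stored with each recovered tracker key; the class and method names here are placeholders, not the actual NM state-store types.
{code}
public class TrackerKind {
  // Which local resource tracker a recovered resource belongs to, derived
  // from the (user, appId) pair stored with it.
  static String trackerFor(String user, String appId) {
    if (appId != null) {
      return "app-specific tracker for " + user + "/" + appId;
    }
    if (user != null) {
      return "private tracker for user " + user;
    }
    return "public (shared) tracker";
  }

  public static void main(String[] args) {
    System.out.println(trackerFor(null, null));          // public cache
    System.out.println(trackerFor("alice", null));       // alice's private cache
    System.out.println(trackerFor("alice", "app_0001")); // app-specific cache
  }
}
{code}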
[jira] [Commented] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004063#comment-14004063 ] Hadoop QA commented on YARN-1569: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645863/yarn-1569.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3777//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3777//console This message is automatically generated. For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Assignee: zhihai xu Priority: Minor Labels: newbie Attachments: yarn-1569.patch Per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't check so far (no bug there now) but should be improved to match FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.3.patch I created a new patch, which will no longer rely on HADOOP-10596, given it is still arguable how we should fix initSpnego of HttpServer2. In this patch, I worked around it by using the filter initializer approach introduced by hadoop-auth to load TimelineAuthenticationFilter, though it is not consistent with the existing YARN-style SPNEGO configuration. Hopefully folks are fine with the workaround to make the timeline security available ASAP. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2074: -- Attachment: YARN-2074.2.patch Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1938: -- Attachment: YARN-1938.3.patch Thanks for the review, Vinod and Varun. I added the configs to yarn-default.xml as well in the newest patch. Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1938.1.patch, YARN-1938.2.patch, YARN-1938.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004138#comment-14004138 ] Hadoop QA commented on YARN-1938: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645907/YARN-1938.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3780//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3780//console This message is automatically generated. Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1938.1.patch, YARN-1938.2.patch, YARN-1938.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004146#comment-14004146 ] Hadoop QA commented on YARN-2074: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645906/YARN-2074.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3781//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3781//console This message is automatically generated. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2073: --- Attachment: yarn-2073-1.patch Added a unit test - the test fails without the fix. Also, moved a bunch of helper code from TestFairScheduler to FairSchedulerTestBase. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse
[ https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004166#comment-14004166 ] Ming Ma commented on YARN-1897: --- Chatted with Gera offline. The definition of the SignalContainer* APIs is needed for other subtasks including YARN-1515, so we will resolve the SignalContainer* API issues in this jira; after that is done, the other subtasks can continue. Here are a couple of open issues. 1. Support for a list of containers. The latest patch in this jira just supports a flat list of SignalContainerRequests, regardless of whether they target the same container or not. Gera's patch in YARN-1515 groups all commands for the same container together via signalContainerRequest.getSignals(). Either approach works. I don't have a strong preference either way, given the most common use case is a single container, although signalContainers is more consistent with startContainers. 2. Support for SIGTERM + delay + SIGKILL as used in stopContainers. The latest YARN-1515 patch introduces a Pause method so that containers can pause in between signals. We need something like that to support the YARN-1515 scenario, or we can provide some new SignalContainerCommand like sleep. Really appreciate any comments on this. Define SignalContainerRequest and SignalContainerResponse - Key: YARN-1897 URL: https://issues.apache.org/jira/browse/YARN-1897 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch, YARN-1897.1.patch We need to define SignalContainerRequest and SignalContainerResponse first as they are needed by other sub tasks. SignalContainerRequest should use OS-independent commands and provide a way for the application to specify a reason for diagnosis. SignalContainerResponse might be empty. -- This message was sent by Atlassian JIRA (v6.2#6252)
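To make open issue 1 concrete, a hedged sketch of the two request shapes under discussion; the SignalContainerCommand values and the string container ids are placeholders standing in for the real API types, not the committed design.
{code}
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SignalShapes {
  enum SignalContainerCommand { OUTPUT_THREAD_DUMP, GRACEFUL_SHUTDOWN, FORCEFUL_SHUTDOWN }

  public static void main(String[] args) {
    // Flat form (this jira's latest patch): one entry per (container, signal)
    // pair; duplicates of the same container id are allowed.
    List<String[]> flat = Arrays.asList(
        new String[] {"container_01", SignalContainerCommand.OUTPUT_THREAD_DUMP.name()},
        new String[] {"container_01", SignalContainerCommand.GRACEFUL_SHUTDOWN.name()});

    // Grouped form (YARN-1515's patch): all signals for one container kept
    // together, retrievable via something like getSignals().
    Map<String, List<SignalContainerCommand>> grouped =
        new TreeMap<String, List<SignalContainerCommand>>();
    grouped.put("container_01", Arrays.asList(
        SignalContainerCommand.OUTPUT_THREAD_DUMP,
        SignalContainerCommand.GRACEFUL_SHUTDOWN));

    System.out.println("flat entries: " + flat.size());
    System.out.println("grouped: " + grouped);
  }
}
{code}
The grouped form makes the SIGTERM + delay + SIGKILL sequence of open issue 2 easier to express per container, while the flat form keeps the wire protocol closer to startContainers.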
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004175#comment-14004175 ] Wei Yan commented on YARN-2073: --- [~kasha], if we move preemption-related test code to a separate .java file, we may also need to move the previous preemption-related test functions (testChoiceOfPreemptedContainers and testPreemptionDecision) to the new file. And as a next step, will we divide TestFairScheduler into several test files according to the different scheduler operations? FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004181#comment-14004181 ] Karthik Kambatla commented on YARN-2073: bq. we may also need to move the previous preemption-related test functions (testChoiceOfPreemptedContainers and testPreemptionDecision) to the new file Moving them might require slightly more work, and I was planning on doing that in a separate JIRA along with splitting the tests into multiple files. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004198#comment-14004198 ] Hudson commented on YARN-2050: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5607 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5607/]) YARN-2050. Fix LogCLIHelpers to create the correct FileContext. Contributed by Ming Ma (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596310) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Fix For: 3.0.0, 2.5.0 Attachments: YARN-2050-2.patch, YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2073: --- Attachment: yarn-2073-2.patch Thanks Wei. Updated patch to address the nits. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004241#comment-14004241 ] Hadoop QA commented on YARN-2073: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645920/yarn-2073-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3783//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3783//console This message is automatically generated. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004279#comment-14004279 ] Rohith commented on YARN-1366: -- bq. Catching incorrect unregistration before registration should have always been there. Is this a regression in the patch or an existing bug? This is not a bug in the existing code. Unregister in ApplicationMasterService checks whether the app is registered; otherwise it throws InvalidApplicationMasterRequestException. bq. Should we consider the possibility of allowing unregister without register? Yes, because we need to differentiate between an AM whose last heartbeat was sent to the RM before the RM restarted and that is now unregistering, versus an application master sending unregister without ever registering. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004290#comment-14004290 ] Sandy Ryza commented on YARN-2073: -- There are some situations where preemption with free resources on the cluster is the right thing to do. For example, if I'm requesting 2 GB containers, I have no resources, and 100 nodes on the cluster each have 1GB remaining, containers should get preempted on my behalf. There are also cases arising from requests with strict locality - the cluster might have resources available because I'm waiting on a subset of nodes. (In this case, we'd probably want to make sure preemption only happens on the nodes being waited for; otherwise we'd kill containers needlessly). If the goal is to make sure that we aren't preempting on behalf of an application that's actually receiving resources, it might also be worth considering time-based approaches. E.g. only preempt on behalf of an application that hasn't received resources in some amount of time. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
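A hedged sketch of the time-based alternative floated at the end of the previous comment: treat an app as starved, and thus worth preempting on behalf of, only if it has outstanding demand and has been granted nothing for a configurable window. All names here are placeholders, not FairScheduler internals.
{code}
public class StarvationCheck {
  static class AppStatus {
    long outstandingDemandMb;   // resources the app is still asking for
    long lastAllocationTimeMs;  // last time this app was granted a container
  }

  // Preempt on behalf of an app only if it is actually being starved:
  // it wants more resources and has received none for the whole window.
  static boolean isStarved(AppStatus app, long nowMs, long windowMs) {
    return app.outstandingDemandMb > 0
        && (nowMs - app.lastAllocationTimeMs) > windowMs;
  }
}
{code}
This sidesteps the free-capacity question entirely: an app slowly filling 1GB slivers or waiting on strict-locality nodes keeps resetting its clock, so no preemption fires while it is actually making progress.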
[jira] [Updated] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang updated YARN-2030: Attachment: YARN-2030.v2.patch Attaching v2 patch to fix findbugs warnings. Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Attachments: YARN-2030.v1.patch, YARN-2030.v2.patch Now the logic to handle different store events in handleStoreEvent() is as follows:
{code}
if (event.getType().equals(RMStateStoreEventType.STORE_APP)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
    ...
  } else {
    ...
  }
  ...
  try {
    if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
      ...
    } else {
      ...
    }
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
  ...
} else {
  ...
}
{code}
This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this, even if there are no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
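One hedged way to picture the proposed simplification: replace the chained equals() checks with a single table-driven dispatch keyed on the event type. This sketch uses a plain EnumMap rather than YARN's StateMachineFactory, and the enum values mirror the event types above while the Handler interface and bodies are placeholders.
{code}
import java.util.EnumMap;
import java.util.Map;

public class StoreDispatch {
  enum EventType { STORE_APP, UPDATE_APP, STORE_APP_ATTEMPT, UPDATE_APP_ATTEMPT, REMOVE_APP }

  interface Handler { void handle(Object event) throws Exception; }

  private final Map<EventType, Handler> handlers =
      new EnumMap<EventType, Handler>(EventType.class);

  StoreDispatch() {
    handlers.put(EventType.STORE_APP, new Handler() {
      public void handle(Object event) { /* store the app state */ }
    });
    handlers.put(EventType.UPDATE_APP, new Handler() {
      public void handle(Object event) { /* update the app state */ }
    });
    // ... remaining event types registered the same way
  }

  // Each event type resolves to exactly one handler; no nested type checks.
  void handleStoreEvent(EventType type, Object event) throws Exception {
    Handler h = handlers.get(type);
    if (h == null) {
      throw new IllegalStateException("Unknown event type: " + type);
    }
    h.handle(event);
  }
}
{code}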
[jira] [Commented] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004307#comment-14004307 ] Sandy Ryza commented on YARN-2012: --
{code}
+      defaultQueueName = "root." + defaultQueueName;
{code}
This should go inside the initializeFromXml method.
{code}
+    if (configuredQueues.get(FSQueueType.LEAF).contains(defaultQueueName)
+        || configuredQueues.get(FSQueueType.PARENT).contains(
+            defaultQueueName)) {
+      return defaultQueueName;
+    }
+  }
   return "root." + YarnConfiguration.DEFAULT_QUEUE_NAME;
{code}
I think it's a little confusing for the rule to fall back to default. Can we let this part be handled by the create logic in assignAppToQueue? Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004315#comment-14004315 ] Karthik Kambatla commented on YARN-2073: Sandy - you make very good points. In other words, we want to have an absoluteMinSharePreemptionTimeout. Now, the question becomes whether we should express this as a separate timeout config or as a scaling factor that determines this absolute timeout for both min-share and fair-share. Also, we can make it a per-queue config or a single factor for the cluster. Eventually, we need a better story for preemption. Currently, it is like a spray gun: we preempt some resources and hope that helps this application. Instead, we should preempt resources that match the application's ask. In that case, this new config will be moot. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004335#comment-14004335 ] Junping Du commented on YARN-2030: -- Hi [~decster], thanks for taking on this effort. I will review your patch. Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Attachments: YARN-2030.v1.patch, YARN-2030.v2.patch Now the logic to handle different store events in handleStoreEvent() is as follows:
{code}
if (event.getType().equals(RMStateStoreEventType.STORE_APP)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
    ...
  } else {
    ...
  }
  ...
  try {
    if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
      ...
    } else {
      ...
    }
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
  ...
} else {
  ...
}
{code}
This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this, even if there are no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2081) TestDistributedShell fails after YARN-1962
Hong Zhiguo created YARN-2081: - Summary: TestDistributedShell fails after YARN-1962 Key: YARN-2081 URL: https://issues.apache.org/jira/browse/YARN-2081 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor
java.lang.AssertionError: expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:198)
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2081) TestDistributedShell fails after YARN-1962
[ https://issues.apache.org/jira/browse/YARN-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-2081: -- Attachment: YARN-2081.patch TestDistributedShell fails after YARN-1962 -- Key: YARN-2081 URL: https://issues.apache.org/jira/browse/YARN-2081 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2081.patch
java.lang.AssertionError: expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:198)
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2051) Add more unit tests for PBImpl that didn't get covered
[ https://issues.apache.org/jira/browse/YARN-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004371#comment-14004371 ] Binglin Chang commented on YARN-2051: - I thought about this; most of the PB serde validation involves the following procedure: 1. set a property on the record using value v0; 2. get the proto object from the record; 3. create a new record from the proto object; 4. get the property from the new record as value v1, and validate v0 == v1. This can be automated for all set/get pairs: we just need to use reflection to find all get/set pairs of the record class and test each pair. By doing this, we save lots of testing code. In the future, when we add new properties to a record, there is no need to add or change the testing code :) Note: these records look like Java beans, but many of them do not follow strict Java bean conventions. I tried to leverage commons-beanutils, but it seems it is not flexible enough; we will post a patch soon. Add more unit tests for PBImpl that didn't get covered -- Key: YARN-2051 URL: https://issues.apache.org/jira/browse/YARN-2051 Project: Hadoop YARN Issue Type: Test Reporter: Junping Du Assignee: Binglin Chang Priority: Critical From YARN-2016, we can see that bugs could exist in the PB implementations of the protocols. The bad news is that most of these PBImpls don't have any unit test to verify the info is not lost or changed after serialization/deserialization. We should add more tests for it. -- This message was sent by Atlassian JIRA (v6.2#6252)
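A hedged sketch of the reflection-driven round trip outlined in the comment above. Rebuilder stands in for the PBImpl record-to-proto-to-record step, and the getter/setter matching is simplified to same-named get/set pairs whose single parameter type matches the sample value (so primitive-typed setters are skipped here).
{code}
import java.lang.reflect.Method;

public class PBRoundTripCheck {
  interface Rebuilder<T> {
    // Steps 2-3: record -> proto -> new record (PBImpl-specific).
    T rebuild(T record) throws Exception;
  }

  // Steps 1 and 4 for every matching setX/getX pair on the record class.
  static <T> void checkAllPairs(Class<T> clazz, T record,
      Rebuilder<T> rebuilder, Object sampleValue) throws Exception {
    for (Method setter : clazz.getMethods()) {
      if (!setter.getName().startsWith("set")
          || setter.getParameterTypes().length != 1
          || !setter.getParameterTypes()[0].isInstance(sampleValue)) {
        continue;
      }
      Method getter = clazz.getMethod("get" + setter.getName().substring(3));
      setter.invoke(record, sampleValue);        // 1. set value v0
      T reborn = rebuilder.rebuild(record);      // 2-3. proto round trip
      Object v1 = getter.invoke(reborn);         // 4. get value v1
      if (!sampleValue.equals(v1)) {
        throw new AssertionError(getter.getName() + " lost value: "
            + sampleValue + " != " + v1);
      }
    }
  }
}
{code}
A real test would loop over sample values of every property type (and, as noted, cope with the records that bend the bean conventions), but the skeleton above is the whole idea: one generic check instead of hand-written assertions per field.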
[jira] [Commented] (YARN-1872) TestDistributedShell occasionally fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004373#comment-14004373 ] Hong Zhiguo commented on YARN-1872: --- Binglin, I got the same failure. The phenomenon and cause of your failure are different from the one reported by Ted Yu. I fixed it in YARN-2081. TestDistributedShell occasionally fails in trunk Key: YARN-1872 URL: https://issues.apache.org/jira/browse/YARN-1872 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Attachments: TestDistributedShell.out, YARN-1872.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console : TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and TestDistributedShell#testDSShell timed out. -- This message was sent by Atlassian JIRA (v6.2#6252)