[jira] [Created] (YARN-1192) Update HAServiceState to STOPPING on RM#stop()
Karthik Kambatla created YARN-1192: -- Summary: Update HAServiceState to STOPPING on RM#stop() Key: YARN-1192 URL: https://issues.apache.org/jira/browse/YARN-1192 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla Post HADOOP-9945, we should update HAServiceState in RMHAProtocolService to STOPPING on stop(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
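A minimal sketch of the requested change, assuming the STOPPING value added to HAServiceState by HADOOP-9945 (the field and methods shown here are illustrative, not the actual RMHAProtocolService code):
{code}
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;

public class RMHAProtocolServiceSketch {
  private HAServiceState haState = HAServiceState.INITIALIZING;

  // Invoked from RM#stop(): mark the service as STOPPING so HA clients
  // querying the state no longer see ACTIVE or STANDBY during shutdown.
  public synchronized void serviceStop() {
    haState = HAServiceState.STOPPING;
  }

  public synchronized HAServiceState getServiceState() {
    return haState;
  }
}
{code}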
[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1185: - Summary: FileSystemRMStateStore can leave partial files that prevent subsequent recovery (was: FileSystemRMStateStore doesn't use temporary files when writing data) bq. The RM will not start if there is anything wrong with the stored state. So if some write is partial/empty it will not start. The concern I have about that approach is that it requires manual intervention from ops when there is a problem, and the current scheme can lead to that situation occurring because the RM can crash at arbitrary points. I think the RM should try to prevent that situation from occurring and/or have the ability to automatically recover from that situation if it does occur. The RM could skip the corrupted info and continue if the info is deemed not critical to the overall recovery process. Then we're only involving ops if the corruption is very serious. {quote} So we could do the following. Storing app data may continue to be optimistic and since that's the main workload we continue to do what we do today. Storing global data (mainly the security stuff) can change to be more atomic. {quote} That sounds reasonable, especially if the RM is more robust during recovery. I understand it's a tradeoff between reliability and performance, especially with the RPC overhead when talking to HDFS and the potentially high rate of state churn. Thanks for the informative discussion, [~bikassaha]! Updating the summary to better reflect the problem and not a particular solution. FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
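A minimal sketch of the write-to-a-temporary-file-then-rename approach described above, assuming the Hadoop FileSystem API (the class, method, and path names are illustrative, not the actual FileSystemRMStateStore patch):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicStateWriteSketch {
  public static void writeState(FileSystem fs, Path dest, byte[] data) throws Exception {
    Path tmp = new Path(dest.getParent(), dest.getName() + ".tmp");
    FSDataOutputStream out = fs.create(tmp, true);
    try {
      out.write(data);
    } finally {
      out.close();
    }
    // Only a fully written file is ever moved into place; a crash before this
    // point leaves at most a stray .tmp file, never a partial destination file.
    fs.rename(tmp, dest);
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.getLocal(new Configuration());
    writeState(fs, new Path("/tmp/rmstore/app_0001"), "state".getBytes());
  }
}
{code}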
[jira] [Commented] (YARN-1024) Define a virtual core unambiguously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766124#comment-13766124 ] Eli Collins commented on YARN-1024: --- bq. keeping virtual cores to express parallelism sounds good as it is clear it is not a real core. Hm, I read this the other way. If a framework asks for three vcores on a host it intends to run some code on three real physical cores at the same time. If a long-lived framework wants to reserve 2 cores per host it would ask for 2 cores (and 100% YCU per core). Sandy's proposal, switching to cores and YCU instead of just vcores, is equivalent to the proposal above of getting rid of vcores and supporting fractional cores. A vcore becomes a core and YCU is just a way to express that you want a fraction of a core. Sounds good to me. Define a virtual core unambiguously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy Attachments: CPUasaYARNresource.pdf We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For example, here is Amazon EC2's definition of an ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.* -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1193) ResourceManager.clusterTimeStamp should be reset when RM transitions to active
Karthik Kambatla created YARN-1193: -- Summary: ResourceManager.clusterTimeStamp should be reset when RM transitions to active Key: YARN-1193 URL: https://issues.apache.org/jira/browse/YARN-1193 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla ResourceManager.clusterTimeStamp is used to generate application-ids. Currently, when the RM transitions active-standby-active back and forth, the clusterTimeStamp stays the same, leading to apps getting the same ids as jobs from before. This leads to other races, e.g., in the staging directory. To avoid this, it is better to reset it on every transition to Active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
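A minimal sketch of why resetting the timestamp matters (illustrative only, not the actual RM code): application ids embed clusterTimeStamp, so reusing the old value across an active-standby-active cycle reproduces the same ids.
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;

public class ClusterTimeStampSketch {
  static long clusterTimeStamp = System.currentTimeMillis();

  static void transitionToActive() throws InterruptedException {
    Thread.sleep(10);
    // Proposed behaviour: take a fresh timestamp on every transition to
    // Active so newly generated ids cannot collide with pre-failover ones.
    clusterTimeStamp = System.currentTimeMillis();
  }

  public static void main(String[] args) throws InterruptedException {
    ApplicationId first = ApplicationId.newInstance(clusterTimeStamp, 1);
    transitionToActive();
    ApplicationId second = ApplicationId.newInstance(clusterTimeStamp, 1);
    System.out.println(first + " vs " + second); // differ in the timestamp part
  }
}
{code}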
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766141#comment-13766141 ] Junping Du commented on YARN-311: - Thanks, Luke, for the review and comments! Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch As the first step, we go for resource change on the RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. This jira will only contain changes in the scheduler. The flow to update a node's resource and make resource scheduling aware of it is: 1. The resource update comes through the admin API to the RM and takes effect on RMNodeImpl. 2. When the next NM heartbeat for updating status comes, the RMNode's resource change is detected and the delta resource is added to the SchedulerNode's availableResource before actual scheduling happens. 3. The scheduler does resource allocation according to the new availableResource in SchedulerNode. For more design details, please refer to the proposal and discussions in the parent JIRA: YARN-291. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
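A minimal sketch of step 2 above, using the Resources utility class (illustrative; not the actual SchedulerNode code): apply the delta between the new total and the old total to the node's available resource when the next heartbeat is processed.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NodeResourceUpdateSketch {
  public static void main(String[] args) {
    Resource oldTotal = Resource.newInstance(8 * 1024, 8);
    Resource newTotal = Resource.newInstance(12 * 1024, 12); // set via the admin API
    Resource available = Resource.newInstance(2 * 1024, 3);  // before the update

    Resource delta = Resources.subtract(newTotal, oldTotal);
    Resources.addTo(available, delta); // scheduling then uses the new value

    System.out.println("available after update: " + available);
  }
}
{code}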
[jira] [Created] (YARN-1186) Add support for simulating several important behaviors in the MRAM to yarn scheduler simulator
Wei Yan created YARN-1186: - Summary: Add support for simulating several important behaviors in the MRAM to yarn scheduler simulator Key: YARN-1186 URL: https://issues.apache.org/jira/browse/YARN-1186 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Add support for simulating some important behaviors in the MRAM (such as slowstart, headroom, etc) to the Yarn scheduler load simulator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1027: --- Attachment: yarn-1027-7.patch Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766162#comment-13766162 ] Karthik Kambatla commented on YARN-1027: bq. Should createAndInit/Start/Stop methods in RM be synchronized? Can they race with other activity in the RM happening on the dispatcher thread? stop() is equivalent to stopping the RM previously. createAndInit/start also don't change the behavior in any way. Their callers themselves are synchronized, so I don't see the need to synchronize these as well. bq. Was getClusterTimeStamp() addition necessary? It's good to keep refactorings separate. Filed another subtask under YARN-149 to address this. Fixed all other comments. Running a pseudo-cluster with YARN-1068 applied - will update here on the scenarios shortly. Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator
Wei Yan created YARN-1187: - Summary: Add discrete event-based simulation to yarn scheduler simulator Key: YARN-1187 URL: https://issues.apache.org/jira/browse/YARN-1187 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Following the discussion in YARN-1021: discrete event simulation decouples the simulation run from any real-world clock. This allows users to step through the execution, set debug points, and get a deterministic re-execution. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766174#comment-13766174 ] Hadoop QA commented on YARN-1027: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602941/yarn-1027-7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1908//console This message is automatically generated. Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1194) TestContainerLogsPage test fails on trunk
Roman Shaposhnik created YARN-1194: -- Summary: TestContainerLogsPage test fails on trunk Key: YARN-1194 URL: https://issues.apache.org/jira/browse/YARN-1194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Priority: Minor Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1027: --- Attachment: yarn-1027-7.patch Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-7.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-905) Add state filters to nodes CLI
[ https://issues.apache.org/jira/browse/YARN-905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765778#comment-13765778 ] Wei Yan commented on YARN-905: -- [~sandyr], [~vinodkv] Confused last time. I'll close YARN-1126 and update a patch supporting new features here. Add state filters to nodes CLI -- Key: YARN-905 URL: https://issues.apache.org/jira/browse/YARN-905 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Wei Yan Attachments: Yarn-905.patch, YARN-905.patch, YARN-905.patch It would be helpful for the nodes CLI to have a node-states option that allows it to return nodes that are not just in the RUNNING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1193) ResourceManager.clusterTimeStamp should be reset when RM transitions to active
[ https://issues.apache.org/jira/browse/YARN-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1193: --- Attachment: yarn-1193-1.patch Straight-forward patch. ResourceManager.clusterTimeStamp should be reset when RM transitions to active - Key: YARN-1193 URL: https://issues.apache.org/jira/browse/YARN-1193 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-1193-1.patch ResourceManager.clusterTimeStamp is used to generate application-ids. Currently, when the RM transitions active-standby-active back and forth, the clusterTimeStamp stays the same, leading to apps getting the same ids as jobs from before. This leads to other races, e.g., in the staging directory. To avoid this, it is better to reset it on every transition to Active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1130) Improve the log flushing for tasks when mapred.userlog.limit.kb is set
[ https://issues.apache.org/jira/browse/YARN-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Han updated YARN-1130: --- Attachment: YARN-1130.patch Improve the log flushing for tasks when mapred.userlog.limit.kb is set -- Key: YARN-1130 URL: https://issues.apache.org/jira/browse/YARN-1130 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Reporter: Paul Han Attachments: YARN-1130.patch When the userlog limit is set with something like this:
{code}
<property>
  <name>mapred.userlog.limit.kb</name>
  <value>2048</value>
  <description>The maximum size of user-logs of each task in KB. 0 disables the cap.</description>
</property>
{code}
the log entries will be truncated randomly for the jobs. The log size is left between 1.2MB and 1.6MB. Since the log is already limited, avoiding log truncation is crucial for the user. The other issue with the current impl (org.apache.hadoop.yarn.ContainerLogAppender) is that log entries will not be flushed to file until the container shuts down and the log manager closes all appenders. If the user wants to see the log during task execution, that is not supported. Will propose a patch to add a flush mechanism and also flush the log when the task is done. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
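A minimal sketch of the kind of flush mechanism proposed above, assuming log4j 1.2's FileAppender (the class name and the 5-second interval are illustrative, not the actual YARN-1130 patch):
{code}
import java.util.Timer;
import java.util.TimerTask;
import org.apache.log4j.FileAppender;

public class FlushingContainerLogAppender extends FileAppender {
  private final Timer flushTimer = new Timer("container-log-flush", true);

  @Override
  public void activateOptions() {
    super.activateOptions();
    // Periodically flush buffered entries so users can follow the log while
    // the task is still running, instead of waiting for the appender to close.
    flushTimer.schedule(new TimerTask() {
      @Override
      public void run() {
        synchronized (FlushingContainerLogAppender.this) {
          if (qw != null) {
            qw.flush();
          }
        }
      }
    }, 5000L, 5000L);
  }

  @Override
  public synchronized void close() {
    flushTimer.cancel();
    super.close(); // flushes and closes the file when the task is done
  }
}
{code}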
[jira] [Commented] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766203#comment-13766203 ] Hadoop QA commented on YARN-1027: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602948/yarn-1027-7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1909//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1909//console This message is automatically generated. Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-7.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766208#comment-13766208 ] Omkar Vinit Joshi commented on YARN-1189: - Yes this is clearly a leak... I had this locally when I was working on it..somehow missed the change.. It should be called when application completely finishes..Attaching a quick patch... NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi reassigned YARN-1189: --- Assignee: Omkar Vinit Joshi NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1189: Attachment: YARN-1189-20130912.1.patch NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1189-20130912.1.patch The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765878#comment-13765878 ] Zhijie Shen commented on YARN-867: -- Think about the problem again. Essentially, problem is the implementation of AuxiliaryService may throw RuntimeException (or other Throwable), and fail the thread of NM dispatcher. Wrapping the calling statements with try/catch can basically prevent NM failure. The next task is to handle the throwable from AuxiliaryService. In previous thread, what we plan to do is to fail the container directly, and let the AM know that the container is failed due to AUXSERVICE_FAILED. For MR, it may be okay, because without ShuffleHandler, MR jobs cannot run properly. However, should NM always make the decision to fail the container? I'm concerned that: 1. NM doesn't know what the AuxiliaryService serves the application and how important it is. 2. NM doesn't know how critical the exception is, or whether it is transit or reproducible. Therefore, if the application can toleran Isolation of failures in aux services -- Key: YARN-867 URL: https://issues.apache.org/jira/browse/YARN-867 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Critical Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.sampleCode.2.patch Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results any non-IOException will cause the NM's async dispatcher to exit as the service's INIT APP event is not handled properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n2.patch Attaching an updated patch. Updated the name of the wait method. Changed the way it gets notifications when app masters get registered/unregistered so now ApplicationAttemptId is used as the key. MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766221#comment-13766221 ] Hadoop QA commented on YARN-1189: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602954/YARN-1189-20130912.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.application.TestApplication org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestLocalResourcesTrackerImpl {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1910//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1910//console This message is automatically generated. NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1189-20130912.1.patch The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1068) Add admin support for HA operations
[ https://issues.apache.org/jira/browse/YARN-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1068: --- Attachment: yarn-1068-1.patch Updated patch to capture the updates to YARN-1027. Add admin support for HA operations --- Key: YARN-1068 URL: https://issues.apache.org/jira/browse/YARN-1068 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-1068-1.patch, yarn-1068-prelim.patch To transitionTo{Active,Standby} etc. we should support admin operations the same way DFS does. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-1184) ClassCastException is thrown during preemption when a huge job is submitted to queue B whose resources are used by a job in queue A
[ https://issues.apache.org/jira/browse/YARN-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned YARN-1184: --- Assignee: Devaraj K ClassCastException is thrown during preemption when a huge job is submitted to queue B whose resources are used by a job in queue A --- Key: YARN-1184 URL: https://issues.apache.org/jira/browse/YARN-1184 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: J.Andreina Assignee: Devaraj K Preemption is enabled. Queues = a, b; a capacity = 30%, b capacity = 70%. Step 1: Assign a big job to queue a (so that job_a will utilize some resources from queue b). Step 2: Assign a big job to queue b. The following exception is thrown at the Resource Manager: {noformat} 2013-09-12 10:42:32,535 ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception. java.lang.ClassCastException: java.util.Collections$UnmodifiableSet cannot be cast to java.util.NavigableSet at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getContainersToPreempt(ProportionalCapacityPreemptionPolicy.java:403) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:202) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:173) at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72) at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
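A minimal, self-contained illustration of the exception in the stack trace above (not the actual RM code): an unmodifiable view cannot be cast back to NavigableSet, so the usual fix is to copy into a sorted set rather than cast.
{code}
import java.util.Collections;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

public class NavigableSetCastSketch {
  public static void main(String[] args) {
    NavigableSet<Integer> live = new TreeSet<Integer>();
    live.add(1);
    Set<Integer> view = Collections.unmodifiableSet(live);

    // Throws java.lang.ClassCastException at runtime, as in the trace above:
    // NavigableSet<Integer> bad = (NavigableSet<Integer>) view;

    // Copying avoids the cast entirely:
    NavigableSet<Integer> copy = new TreeSet<Integer>(view);
    System.out.println(copy);
  }
}
{code}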
[jira] [Commented] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765884#comment-13765884 ] Zhijie Shen commented on YARN-867: -- Sorry to post the broken comment before. Thinking about the problem again: essentially, the problem is that the implementation of AuxiliaryService may throw a RuntimeException (or other Throwable) and fail the NM dispatcher thread. Wrapping the calling statements with try/catch can basically prevent NM failure. The next task is to handle the throwable from the AuxiliaryService. In the previous thread, what we planned to do was fail the container directly, and let the AM know that the container failed due to AUXSERVICE_FAILED. For MR, it may be okay, because without the ShuffleHandler, MR jobs cannot run properly. However, should the NM always make the decision to fail the container? I'm concerned that: 1. The NM doesn't know what the AuxiliaryService does for the application and how important it is. 2. The NM doesn't know how critical the exception is, or whether it is transient or reproducible. So what if the application can tolerate the AuxiliaryService failure? For example, if the AuxiliaryService just does some node-local monitoring work, the application can complete with the AuxiliaryService not working. Therefore, I'm wondering whether we should leave the decision to the AM. The application knows how to handle the exception best. The NM just needs to expose the failure of the AuxiliaryService to the application in some way. Thoughts? Isolation of failures in aux services -- Key: YARN-867 URL: https://issues.apache.org/jira/browse/YARN-867 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Critical Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.sampleCode.2.patch Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results in any non-IOException will cause the NM's async dispatcher to exit as the service's INIT APP event is not handled properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
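A minimal, self-contained sketch of the try/catch isolation discussed above (the handler interface and names are illustrative, not the actual NM/AuxServices code):
{code}
public class AuxServiceIsolationSketch {
  interface AuxHandler {
    void onAppInit(String appId);
  }

  static void dispatchSafely(AuxHandler handler, String appId) {
    try {
      handler.onAppInit(appId);
    } catch (Throwable t) {
      // The dispatcher thread survives a misbehaving aux service; the failure
      // would then be surfaced to the container/AM rather than killing the NM.
      System.err.println("Aux service failed for " + appId + ": " + t);
    }
  }

  public static void main(String[] args) {
    dispatchSafely(new AuxHandler() {
      @Override
      public void onAppInit(String appId) {
        throw new RuntimeException("bad data from application");
      }
    }, "application_1");
    System.out.println("dispatcher still alive");
  }
}
{code}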
[jira] [Commented] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766227#comment-13766227 ] Karthik Kambatla commented on YARN-1027: For manual testing, I applied the patches yarn-1027-7.patch, yarn-1193-1.patch (update clusterTimeStamp on transitionToActive), and yarn-1068-1.patch (support for admin commands). Testing steps: # *Start the RM*. Verified it was in Standby mode - Log, webui check, netstat for ports check. jmap histo showed 168465 objects with 19476960 bytes. # *Transition to Active*. Verified it was in Active mode - Log, webui, NM connected, netstat for ports check. jmap histo showed 253430 objects with 31171544 bytes. # *Run MR pi job*. Job finished successfully. WebUI worked as expected. jmap histo showed 288406 objects with 35726096 bytes. # *Transition to Standby*. Verified it was in Standby mode. jmap histo showed 282392 objects with 33975600 bytes. # Repeated steps 2, 3, 4 once more. # *Stop the RM*. RM stopped as expected and the logs didn't show any untoward exceptions etc. Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-7.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766237#comment-13766237 ] Andrey Klochkov commented on YARN-1183: --- bq. MiniYARNCluster is used by several tests. This might bite us if and when we run tests in parallel. Concurrency level won't make any difference even with that. BTW I'm actually running MR tests in parallel now. That's when this issue with cluster shutdown working incorrectly becomes more evident. Thanks for catching the issue with the synchronized block; fixing it. MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1195) RM may relaunch already KILLED / FAILED jobs after RM restarts
Jian He created YARN-1195: - Summary: RM may relaunch already KILLED / FAILED jobs after RM restarts Key: YARN-1195 URL: https://issues.apache.org/jira/browse/YARN-1195 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Just like YARN-540: the RM restarts after the job is killed/failed, but before the app state info is cleaned from the store. The next time the RM comes back, it will relaunch the job again. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1194) TestContainerLogsPage test fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Shaposhnik updated YARN-1194: --- Attachment: YARN-1194.patch.txt TestContainerLogsPage test fails on trunk - Key: YARN-1194 URL: https://issues.apache.org/jira/browse/YARN-1194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Priority: Minor Attachments: YARN-1194.patch.txt Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765904#comment-13765904 ] Karthik Kambatla commented on YARN-1183: bq. It may be an unnecessary optimization in the testing code MiniYARNCluster is used by several tests. This might bite us if and when we run tests in parallel. bq. Can you advise on how to get ApplicationID from RegisterApplicationMasterRequest/RegisterApplicationMasterResponse? Here, using host:port should be good - only a single application runs on the host:port at any point. Also, in the following code, the while() should also be inside the synchronized block. Otherwise, it is possible to lose notifications and wait longer than needed.
{code}
while (!appMasters.isEmpty() && System.currentTimeMillis() - started < timeoutMillis) {
  synchronized (appMasters) {
    appMasters.wait(1000);
  }
}
{code}
MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
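A minimal sketch of the suggested fix, reusing the variable names from the snippet above (illustrative, not the final patch): acquire the monitor before checking the condition so a notification arriving between the check and the wait() cannot be lost.
{code}
synchronized (appMasters) {
  while (!appMasters.isEmpty()
      && System.currentTimeMillis() - started < timeoutMillis) {
    appMasters.wait(1000);
  }
}
{code}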
[jira] [Commented] (YARN-1194) TestContainerLogsPage test fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766245#comment-13766245 ] Hadoop QA commented on YARN-1194: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602959/YARN-1194.patch.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1911//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1911//console This message is automatically generated. TestContainerLogsPage test fails on trunk - Key: YARN-1194 URL: https://issues.apache.org/jira/browse/YARN-1194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Priority: Minor Fix For: 2.1.1-beta Attachments: YARN-1194.patch.txt Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-540: - Attachment: YARN-540.7.patch Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766274#comment-13766274 ] Hadoop QA commented on YARN-540: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602971/YARN-540.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1912//console This message is automatically generated. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1191) [YARN-321] Update artifact versions for application history service
[ https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-1191: Summary: [YARN-321] Update artifact versions for application history service (was: [YARN-321] Compilation is failing for YARN-321 branch) [YARN-321] Update artifact versions for application history service --- Key: YARN-1191 URL: https://issues.apache.org/jira/browse/YARN-1191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1191-1.patch Compilation is failing for YARN-321 branch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1191) [YARN-321] Update artifact versions for application history service
[ https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766276#comment-13766276 ] Devaraj K commented on YARN-1191: - +1, Patch looks good to me. [YARN-321] Update artifact versions for application history service --- Key: YARN-1191 URL: https://issues.apache.org/jira/browse/YARN-1191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1191-1.patch Compilation is failing for YARN-321 branch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1196) LocalDirsHandlerService never changes failedDirs back to normal even when these disks turn good
Nemon Lou created YARN-1196: --- Summary: LocalDirsHandlerService never changes failedDirs back to normal even when these disks turn good Key: YARN-1196 URL: https://issues.apache.org/jira/browse/YARN-1196 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.1-beta Reporter: Nemon Lou A simple way to reproduce it: 1. Change the access mode of one node manager's local-dirs to 000. After a few seconds, this node manager will become unhealthy. 2. Change the access mode of one node manager's local-dirs back to normal. The node manager is still unhealthy with all local-dirs in bad state even after a long time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1196) LocalDirsHandlerService never changes failedDirs back to normal even when these disks turn good
[ https://issues.apache.org/jira/browse/YARN-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nemon Lou updated YARN-1196: Description: A simple way to reproduce it: 1. Change the access mode of one node manager's local-dirs to 000. After a few seconds, this node manager will become unhealthy. 2. Change the access mode of the node manager's local-dirs back to normal. The node manager is still unhealthy with all local-dirs in bad state even after a long time. was: A simple way to reproduce it: 1. Change the access mode of one node manager's local-dirs to 000. After a few seconds, this node manager will become unhealthy. 2. Change the access mode of one node manager's local-dirs back to normal. The node manager is still unhealthy with all local-dirs in bad state even after a long time. LocalDirsHandlerService never changes failedDirs back to normal even when these disks turn good -- Key: YARN-1196 URL: https://issues.apache.org/jira/browse/YARN-1196 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.1-beta Reporter: Nemon Lou A simple way to reproduce it: 1. Change the access mode of one node manager's local-dirs to 000. After a few seconds, this node manager will become unhealthy. 2. Change the access mode of the node manager's local-dirs back to normal. The node manager is still unhealthy with all local-dirs in bad state even after a long time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
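A minimal sketch of the behaviour the report asks for, assuming a periodic re-check of previously failed dirs (the method and the goodDirs/failedDirs names are illustrative, not the actual LocalDirsHandlerService fields):
{code}
import java.io.File;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

public class DirRecheckSketch {
  public static void recheck(List<String> goodDirs, List<String> failedDirs) {
    for (Iterator<String> it = failedDirs.iterator(); it.hasNext();) {
      String dir = it.next();
      try {
        DiskChecker.checkDir(new File(dir)); // passes again once permissions are restored
        it.remove();
        goodDirs.add(dir);                   // move the dir back to the good list
      } catch (DiskErrorException e) {
        // still bad; keep it in failedDirs until the next check
      }
    }
  }
}
{code}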
[jira] [Commented] (YARN-1197) Add container merge support in YARN
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766343#comment-13766343 ] Wangda Tan commented on YARN-1197: -- I don't know whether it is possible to add this on the RM or NM side. And I think it should make it easier to move some existing applications (OpenMPI, PBS, etc.) to the YARN platform, because such applications have their own daemons in their original implementations, and container merge can help them leverage their original logic with fewer modifications to become residents of YARN :) Welcome your suggestions and comments! -- Thanks, Wangda Add container merge support in YARN --- Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN does not support merging several containers on one node into a big container, which would let us incrementally ask for resources, merge them into a bigger one, and launch our processes. The user scenario is: some applications (like OpenMPI) have their own daemon on each node (one for each node) in their original implementation, and their user's processes are directly launched by the local daemon (like the task-tracker in MRv1, but per-application). Many functionalities depend on the pipes created when a process is forked by its parent, like IO-forwarding and process monitoring (it does more logic than what the NM does for us), and losing them may cause some scalability issues. A very common resource request in the MPI world is: give me 100G of memory in the cluster, and I will launch 100 processes within this resource. In current YARN, we have the following two choices to make this happen: 1) Send allocation requests with 1G memory iteratively, until we get 100G of memory in total. Then ask the NMs to launch the 100 MPI processes. That will cause some problems like no support for IO-forwarding, process monitoring, etc. as mentioned above. 2) Send a larger resource request, like 10G. But we may encounter the following problems: 2.1 Such a large resource request is hard to get at one time. 2.2 We cannot use more resources than the number we specified on the node (we can only launch one daemon on one node). 2.3 It is hard to decide how much resource to ask for. So my proposal is: 1) We can incrementally send resource requests with small resources like before, until we get enough resources in total. 2) Merge resources on the same node, making only one big container on each node. 3) Launch a daemon on each node, and the daemon will spawn its local processes and manage them. For example, we need to run 10 processes, 1G each, and finally we got containers 1, 2, 3, 4, 5 on node1, containers 6, 7, 8 on node2, and containers 9, 10 on node3. Then we will: merge [1, 2, 3, 4, 5] into container_11 with 5G, launch a daemon, and the daemon will launch 5 processes; merge [6, 7, 8] into container_12 with 3G, launch a daemon, and the daemon will launch 3 processes; merge [9, 10] into container_13 with 2G, launch a daemon, and the daemon will launch 2 processes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1197) Add container merge support in YARN
Wangda Tan created YARN-1197: Summary: Add container merge support in YARN Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN does not support merging several containers on one node into a big container, which would let us incrementally ask for resources, merge them into a bigger one, and launch our processes. The user scenario is: some applications (like OpenMPI) have their own daemon on each node (one for each node) in their original implementation, and their user's processes are directly launched by the local daemon (like the task-tracker in MRv1, but per-application). Many functionalities depend on the pipes created when a process is forked by its parent, like IO-forwarding and process monitoring (it does more logic than what the NM does for us), and losing them may cause some scalability issues. A very common resource request in the MPI world is: give me 100G of memory in the cluster, and I will launch 100 processes within this resource. In current YARN, we have the following two choices to make this happen: 1) Send allocation requests with 1G memory iteratively, until we get 100G of memory in total. Then ask the NMs to launch the 100 MPI processes. That will cause some problems like no support for IO-forwarding, process monitoring, etc. as mentioned above. 2) Send a larger resource request, like 10G. But we may encounter the following problems: 2.1 Such a large resource request is hard to get at one time. 2.2 We cannot use more resources than the number we specified on the node (we can only launch one daemon on one node). 2.3 It is hard to decide how much resource to ask for. So my proposal is: 1) We can incrementally send resource requests with small resources like before, until we get enough resources in total. 2) Merge resources on the same node, making only one big container on each node. 3) Launch a daemon on each node, and the daemon will spawn its local processes and manage them. For example, we need to run 10 processes, 1G each, and finally we got containers 1, 2, 3, 4, 5 on node1, containers 6, 7, 8 on node2, and containers 9, 10 on node3. Then we will: merge [1, 2, 3, 4, 5] into container_11 with 5G, launch a daemon, and the daemon will launch 5 processes; merge [6, 7, 8] into container_12 with 3G, launch a daemon, and the daemon will launch 3 processes; merge [9, 10] into container_13 with 2G, launch a daemon, and the daemon will launch 2 processes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766290#comment-13766290 ] Bikas Saha commented on YARN-540: - bq. Delete throws exception in case of not-existing If that is the case, then why didn't this code in the previous patch cause an exception to be thrown for a normal job? It is removing an app that should already have been removed after unregister. {code} + // application completely done and remove from state store. + // App state may be already removed during RMAppFinishingOrRemovingTransition. + RMStateStore store = app.rmContext.getStateStore(); + store.removeApplication(app) {code} bq. it should not be possible to generate RMAppEventType.ATTEMPT_FAILED event at that state Can the app crash while it's waiting to be unregistered? Will that generate an ATTEMPT_FAILED? Can the node crash and cause an ATTEMPT_FAILED? If yes, then these would apply to the FINISHING state also. bq. In case of REMOVING, return YARNApplicationState as RUNNING, makes sense? In general an app can be removed while it's in the ACCEPTED state as well (kill app after submission); these should also go through the REMOVING state. So it's not necessarily the case that the app state will always be RUNNING. We probably need to save the previous state and return that while the app is in the REMOVING state. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
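Since the same application can legitimately be removed twice (once during RMAppFinishingOrRemovingTransition and again in the final transition), one way to keep the second call harmless is for the store to treat "already gone" as success. The sketch below is only an assumption loosely modeled on a FileSystem-backed store, not the actual YARN-540 patch; the path layout and method name are illustrative.
{code}
// Minimal sketch: make removal idempotent so a duplicate remove does not throw.
public synchronized void removeApplicationState(ApplicationId appId) throws Exception {
  Path appDir = new Path(rmAppRoot, appId.toString());
  if (fs.exists(appDir)) {
    fs.delete(appDir, true);   // recursively delete app and attempt state
  } else {
    LOG.info("Application " + appId + " was already removed from the state store");
  }
}
{code}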
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766597#comment-13766597 ] Hadoop QA commented on YARN-1183: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603037/YARN-1183--n3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1916//console This message is automatically generated. MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766572#comment-13766572 ] Xuan Gong commented on YARN-978: bq. I'm fine with remove it, but trackingUrl is on web UI according to the latest patches in YARN-954 and YARN-1023. If it is to be removed, we should leave note there. Yes, we might still need trackingUrl. Originally, I thought the trackingUrl would be set to null, so we might not need it. But I checked the code again. Actually, this is from the ApplicationMaster: {code} resourceManager.unregisterApplicationMaster(appStatus, appMessage, null); {code} Since this ApplicationMaster can be rewritten or provided by the client, this can change too (at least the MR ApplicationMaster sets the trackingUrl to a non-null value). So we can add trackingUrl to the ApplicationAttemptReport. And I agree that the logUrl should go to the ContainerReport. [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation -- Key: YARN-978 URL: https://issues.apache.org/jira/browse/YARN-978 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Xuan Gong Fix For: YARN-321 Attachments: YARN-978-1.patch, YARN-978.2.patch, YARN-978.3.patch, YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch, YARN-978.7.patch We dont have ApplicationAttemptReport and Protobuf implementation. Adding that. Thanks, Mayank -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
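For context, the record being shaped by this discussion would look roughly like the sketch below. The field set and accessor names are assumptions in YARN's usual abstract-record style, not the committed YARN-978 API.
{code}
// Rough sketch of the fields under discussion for ApplicationAttemptReport.
// logUrl is deliberately absent since, per the comment above, it belongs in ContainerReport.
public abstract class ApplicationAttemptReport {
  public abstract ApplicationAttemptId getApplicationAttemptId();
  public abstract String getHost();
  public abstract int getRpcPort();
  public abstract String getTrackingUrl();   // kept, per the discussion above
  public abstract String getDiagnostics();
  public abstract YarnApplicationAttemptState getYarnApplicationAttemptState();
  public abstract ContainerId getAMContainerId();
}
{code}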
[jira] [Commented] (YARN-1157) ResourceManager UI has invalid tracking URL link for distributed shell application
[ https://issues.apache.org/jira/browse/YARN-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766565#comment-13766565 ] Hadoop QA commented on YARN-1157: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603030/YARN-1157.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1915//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1915//console This message is automatically generated. ResourceManager UI has invalid tracking URL link for distributed shell application -- Key: YARN-1157 URL: https://issues.apache.org/jira/browse/YARN-1157 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.1.1-beta Attachments: YARN-1157.1.patch Submit YARN distributed shell application. Goto ResourceManager Web UI. The application definitely appears. In Tracking UI column, there will be history link. Click on that link. Instead of showing application master web UI, HTTP error 500 would appear. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1194) TestContainerLogsPage fails with native builds
[ https://issues.apache.org/jira/browse/YARN-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766547#comment-13766547 ] Hudson commented on YARN-1194: -- SUCCESS: Integrated in Hadoop-trunk-Commit #4408 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4408/]) YARN-1194. TestContainerLogsPage fails with native builds. Contributed by Roman Shaposhnik (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522968) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java TestContainerLogsPage fails with native builds -- Key: YARN-1194 URL: https://issues.apache.org/jira/browse/YARN-1194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Priority: Minor Attachments: YARN-1194.patch.txt Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-905) Add state filters to nodes CLI
[ https://issues.apache.org/jira/browse/YARN-905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766550#comment-13766550 ] Hadoop QA commented on YARN-905: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603027/YARN-905-addendum.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1914//console This message is automatically generated. Add state filters to nodes CLI -- Key: YARN-905 URL: https://issues.apache.org/jira/browse/YARN-905 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Wei Yan Attachments: YARN-905-addendum.patch, Yarn-905.patch, YARN-905.patch, YARN-905.patch It would be helpful for the nodes CLI to have a node-states option that allows it to return nodes that are not just in the RUNNING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1194) TestContainerLogsPage test fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766516#comment-13766516 ] Jason Lowe commented on YARN-1194: -- +1, lgtm. TestContainerLogsPage test fails on trunk - Key: YARN-1194 URL: https://issues.apache.org/jira/browse/YARN-1194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Priority: Minor Fix For: 2.1.1-beta Attachments: YARN-1194.patch.txt Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1194) TestContainerLogsPage fails with native builds
[ https://issues.apache.org/jira/browse/YARN-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1194: - Fix Version/s: (was: 2.1.1-beta) Summary: TestContainerLogsPage fails with native builds (was: TestContainerLogsPage test fails on trunk) Adjusting summary in preparation for commit since this affects more than trunk. Also the Fix Version normally should not be set until the patch has been committed. TestContainerLogsPage fails with native builds -- Key: YARN-1194 URL: https://issues.apache.org/jira/browse/YARN-1194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Priority: Minor Attachments: YARN-1194.patch.txt Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1078) TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows
[ https://issues.apache.org/jira/browse/YARN-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766509#comment-13766509 ] Hudson commented on YARN-1078: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1547 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1547/]) YARN-1078. TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows. Contributed by Chuan Liu. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522644) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows - Key: YARN-1078 URL: https://issues.apache.org/jira/browse/YARN-1078 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.1.1-beta Reporter: Chuan Liu Assignee: Chuan Liu Priority: Minor Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-1078.2.patch, YARN-1078.3.patch, YARN-1078.branch-2.patch, YARN-1078.patch The three unit tests fail on Windows due to host name resolution differences on Windows, i.e. 127.0.0.1 does not resolve to host name localhost. {noformat} org.apache.hadoop.security.token.SecretManager$InvalidToken: Given Container container_0__01_00 identifier is not valid for current Node manager. Expected : 127.0.0.1:12345 Found : localhost:12345 {noformat} {noformat} testNMConnectionToRM(org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater) Time elapsed: 8343 sec FAILURE! 
org.junit.ComparisonFailure: expected:[localhost]:12345 but was:[127.0.0.1]:12345 at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyResourceTracker6.registerNodeManager(TestNodeStatusUpdater.java:712) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at $Proxy26.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:212) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:149) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyNodeStatusUpdater4.serviceStart(TestNodeStatusUpdater.java:369) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:101) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:213) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNMConnectionToRM(TestNodeStatusUpdater.java:985) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n4.patch MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183--n4.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1157) ResourceManager UI has invalid tracking URL link for distributed shell application
[ https://issues.apache.org/jira/browse/YARN-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766556#comment-13766556 ] Xuan Gong commented on YARN-1157: - The reason is that RMAppAttemptImpl::generateProxyUriWithoutScheme(String) {code} return result.toASCIIString().substring(HttpConfig.getSchemePrefix().length()); {code} can return an empty String, but WebAppProxyServlet only checks whether the urlString is null or not; we should also check for an empty string. ResourceManager UI has invalid tracking URL link for distributed shell application -- Key: YARN-1157 URL: https://issues.apache.org/jira/browse/YARN-1157 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.1.1-beta Attachments: YARN-1157.1.patch Submit YARN distributed shell application. Goto ResourceManager Web UI. The application definitely appears. In Tracking UI column, there will be history link. Click on that link. Instead of showing application master web UI, HTTP error 500 would appear. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
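A minimal sketch of the check being described, with illustrative names rather than the exact WebAppProxyServlet code:
{code}
// Treat an empty tracking URL the same as a missing one so the proxy can render a
// friendly page instead of failing with HTTP 500. Helper names are assumptions.
String originalUri = appReport.getOriginalTrackingUrl();
if (originalUri == null || originalUri.trim().isEmpty()) {
  notifyAMNotRunning(resp, appId);   // hypothetical helper: show an informative page
  return;
}
// ... otherwise build the proxied URI from originalUri as before
{code}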
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n3.patch Attaching an updated patch MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766634#comment-13766634 ] Hadoop QA commented on YARN-1183: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603041/YARN-1183--n4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1917//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1917//console This message is automatically generated. MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183--n4.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1193) ResourceManger.clusterTimeStamp should be reset when RM transitions to active
[ https://issues.apache.org/jira/browse/YARN-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766645#comment-13766645 ] Bikas Saha commented on YARN-1193: -- Isn't this already fixed in YARN-1027? Did you mean to open this jira to create a getClusterTimestamp() method? ResourceManger.clusterTimeStamp should be reset when RM transitions to active - Key: YARN-1193 URL: https://issues.apache.org/jira/browse/YARN-1193 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-1193-1.patch ResourceManager.clusterTimeStamp is used to generate application-ids. Currently, when the RM transitions to active-standby-active back and forth, the clusterTimeStamp stays the same leading to apps getting the same ids as jobs from before. This leads to other races in staging directory etc. To avoid this, it is better to set it on every transition to Active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766690#comment-13766690 ] Karthik Kambatla commented on YARN-1183: Looks good to me. +1. Thanks Andrey. Observation: Not that we should change anything. We are storing the timestamp of when the appMaster registered, but not using it anywhere yet. MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183--n4.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1197) Add container merge support in YARN
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-1197: - Description: Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is described in the comments. was: Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is, In some applications (like OpenMPI) has their own daemons in each node (one for each node) in their original implementation, and their user's processes are directly launched by its local daemon (like task-tracker in MRv1, but it's per-application). Many functionalities are depended on the pipes created when a process forked by its father, like IO-forwarding, process monitoring (it will do more logic than what NM did for us) and may cause some scalability issues. A very common resource request in MPI world is, give me 100G memory in the cluster, I will launch 100 processes in this resource. In current YARN, we have following two choices to make this happen, 1) Send allocation request with 1G memory iteratively, until we got 100G memories in total. Then ask NM launch such 100 MPI processes. That will cause some problems like cannot support IO-forwarding, processes monitoring, etc. as mentioned above. 2) Send a larger resource request, like 10G. But we may encounter following problems, 2.1 Such a large resource request is hard to get at one time. 2.2 We cannot use other resources more than the number we specified in the node (we can only launch one daemon in one node). 2.3 Hard to decide how much resource to ask. So my proposal is, 1) We can incrementally send resource request with small resources like before, until we get enough resources in total 2) Merge resource in the same node, make only one big container in each node 3) Launch daemons in each node, and the daemon will spawn its local processes and manage them. For example, We need to run 10 processes, 1G for each, finally we got container 1, 2, 3, 4, 5 in node1. container 6, 7, 8 in node2. container 9, 10 in node3. Then we will, merge [1, 2, 3, 4, 5] to container_11 with 5G, launch a daemon, and the daemon will launch 5 processes merge [6, 7, 8] to container_12 with 3G, launch a daemon, and the daemon will launch 3 processes merge [9, 10] to container_13 with 2G, launch a daemon, and the daemon will launch 2 processes Add container merge support in YARN --- Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-905) Add state filters to nodes CLI
[ https://issues.apache.org/jira/browse/YARN-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-905: - Attachment: YARN-905-addendum.patch Add state filters to nodes CLI -- Key: YARN-905 URL: https://issues.apache.org/jira/browse/YARN-905 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Wei Yan Attachments: YARN-905-addendum.patch, YARN-905-addendum.patch, Yarn-905.patch, YARN-905.patch, YARN-905.patch It would be helpful for the nodes CLI to have a node-states option that allows it to return nodes that are not just in the RUNNING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1193) ResourceManger.clusterTimeStamp should be reset when RM transitions to active
[ https://issues.apache.org/jira/browse/YARN-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766684#comment-13766684 ] Karthik Kambatla commented on YARN-1193: [~bikassaha], looks like I misunderstood your comments on YARN-1027. The latest patch there (yarn-1027-7.patch) doesn't have this. clusterTimeStamp is a public static final variable in ResourceManager. To set it when transitioning to Active, we need to make it non-final, which would expose a public static variable. This change, I thought, should go with making it private and adding a public get method. Do you suggest I merge this back into YARN-1027? Or is it okay to handle it separately? I think it might be cleaner to handle it separately so it is easier for people to understand why a particular change has been made. ResourceManger.clusterTimeStamp should be reset when RM transitions to active - Key: YARN-1193 URL: https://issues.apache.org/jira/browse/YARN-1193 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-1193-1.patch ResourceManager.clusterTimeStamp is used to generate application-ids. Currently, when the RM transitions to active-standby-active back and forth, the clusterTimeStamp stays the same leading to apps getting the same ids as jobs from before. This leads to other races in staging directory etc. To avoid this, it is better to set it on every transition to Active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
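A minimal sketch of the refactoring being discussed, with assumed names (the change that actually lands in YARN-1027 may differ): make the field private and non-final, expose a getter, and reset it on every transition to Active.
{code}
// Illustrative only; names and surrounding structure are assumptions.
private static long clusterTimeStamp = System.currentTimeMillis();

public static long getClusterTimeStamp() {
  return clusterTimeStamp;
}

private static void setClusterTimeStamp(long timestamp) {
  clusterTimeStamp = timestamp;
}

// Invoked when the RM becomes Active, so new application ids cannot collide with
// ids generated before the failover.
synchronized void transitionToActive() throws Exception {
  setClusterTimeStamp(System.currentTimeMillis());
  startActiveServices();   // hypothetical helper for the rest of the transition
}
{code}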
[jira] [Commented] (YARN-1197) Add container merge support in YARN
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766650#comment-13766650 ] Bikas Saha commented on YARN-1197: -- bq. 1) We can incrementally send resource request with small resources like before, until we get enough resources in total bq. 2) Merge resource in the same node, make only one big container in each node When the RM is asked for a container, this is what the RM already does: it incrementally adds reserved space on a node until it can allocate the full resources desired by the container, and then it assigns the container to the app. So it's not clear how making small allocations and then merging them in the app is going to help. By asking the RM directly for 10G of resources we can ensure that the RM will eventually give us that. If we ask for ten 1G resources, we are not guaranteed that the RM will give them to us on the same node, and thus the overall request may be unsatisfiable. Add container merge support in YARN --- Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1193) ResourceManger.clusterTimeStamp should be reset when RM transitions to active
[ https://issues.apache.org/jira/browse/YARN-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766701#comment-13766701 ] Bikas Saha commented on YARN-1193: -- The patch in YARN-1027 is incorrect without the clustertimestamp modification. However, we don't need to add a getClusterTimeStamp method in that patch itself; that can be done in a separate jira. So I meant that we can do the method refactoring separately. Sorry for not being clear. However, in the interest of getting YARN-1027 done I will take back those comments. So let's keep those changes in YARN-1027. I am closing this jira. ResourceManger.clusterTimeStamp should be reset when RM transitions to active - Key: YARN-1193 URL: https://issues.apache.org/jira/browse/YARN-1193 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-1193-1.patch ResourceManager.clusterTimeStamp is used to generate application-ids. Currently, when the RM transitions to active-standby-active back and forth, the clusterTimeStamp stays the same leading to apps getting the same ids as jobs from before. This leads to other races in staging directory etc. To avoid this, it is better to set it on every transition to Active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (YARN-1193) ResourceManger.clusterTimeStamp should be reset when RM transitions to active
[ https://issues.apache.org/jira/browse/YARN-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha resolved YARN-1193. -- Resolution: Duplicate Assignee: (was: Karthik Kambatla) YARN-1027 fixes this. ResourceManger.clusterTimeStamp should be reset when RM transitions to active - Key: YARN-1193 URL: https://issues.apache.org/jira/browse/YARN-1193 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Attachments: yarn-1193-1.patch ResourceManager.clusterTimeStamp is used to generate application-ids. Currently, when the RM transitions to active-standby-active back and forth, the clusterTimeStamp stays the same leading to apps getting the same ids as jobs from before. This leads to other races in staging directory etc. To avoid this, it is better to set it on every transition to Active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1078) TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows
[ https://issues.apache.org/jira/browse/YARN-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766401#comment-13766401 ] Hudson commented on YARN-1078: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #331 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/331/]) YARN-1078. TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows. Contributed by Chuan Liu. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522644) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows - Key: YARN-1078 URL: https://issues.apache.org/jira/browse/YARN-1078 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.1.1-beta Reporter: Chuan Liu Assignee: Chuan Liu Priority: Minor Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-1078.2.patch, YARN-1078.3.patch, YARN-1078.branch-2.patch, YARN-1078.patch The three unit tests fail on Windows due to host name resolution differences on Windows, i.e. 127.0.0.1 does not resolve to host name localhost. {noformat} org.apache.hadoop.security.token.SecretManager$InvalidToken: Given Container container_0__01_00 identifier is not valid for current Node manager. Expected : 127.0.0.1:12345 Found : localhost:12345 {noformat} {noformat} testNMConnectionToRM(org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater) Time elapsed: 8343 sec FAILURE! 
org.junit.ComparisonFailure: expected:[localhost]:12345 but was:[127.0.0.1]:12345 at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyResourceTracker6.registerNodeManager(TestNodeStatusUpdater.java:712) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at $Proxy26.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:212) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:149) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyNodeStatusUpdater4.serviceStart(TestNodeStatusUpdater.java:369) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:101) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:213) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNMConnectionToRM(TestNodeStatusUpdater.java:985) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1197) Add container merge support in YARN
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766646#comment-13766646 ] Bikas Saha commented on YARN-1197: -- (Copying description into comments to reduce email size.) In some applications (like OpenMPI) has their own daemons in each node (one for each node) in their original implementation, and their user's processes are directly launched by its local daemon (like task-tracker in MRv1, but it's per-application). Many functionalities are depended on the pipes created when a process forked by its father, like IO-forwarding, process monitoring (it will do more logic than what NM did for us) and may cause some scalability issues. A very common resource request in MPI world is, give me 100G memory in the cluster, I will launch 100 processes in this resource. In current YARN, we have following two choices to make this happen, 1) Send allocation request with 1G memory iteratively, until we got 100G memories in total. Then ask NM launch such 100 MPI processes. That will cause some problems like cannot support IO-forwarding, processes monitoring, etc. as mentioned above. 2) Send a larger resource request, like 10G. But we may encounter following problems, 2.1 Such a large resource request is hard to get at one time. 2.2 We cannot use other resources more than the number we specified in the node (we can only launch one daemon in one node). 2.3 Hard to decide how much resource to ask. So my proposal is, 1) We can incrementally send resource request with small resources like before, until we get enough resources in total 2) Merge resource in the same node, make only one big container in each node 3) Launch daemons in each node, and the daemon will spawn its local processes and manage them. For example, We need to run 10 processes, 1G for each, finally we got container 1, 2, 3, 4, 5 in node1. container 6, 7, 8 in node2. container 9, 10 in node3. Then we will, merge [1, 2, 3, 4, 5] to container_11 with 5G, launch a daemon, and the daemon will launch 5 processes merge [6, 7, 8] to container_12 with 3G, launch a daemon, and the daemon will launch 3 processes merge [9, 10] to container_13 with 2G, launch a daemon, and the daemon will launch 2 processes Add container merge support in YARN --- Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1027: --- Attachment: yarn-1027-8.patch Including the patch from YARN-1193. Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-7.patch, yarn-1027-8.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1189: Priority: Blocker (was: Major) NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766713#comment-13766713 ] Omkar Vinit Joshi commented on YARN-1189: - Looks good to me. Thanks, [~jlowe]. NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta, 2.1.1-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-975) Adding HDFS implementation for grouped reading and writing interfaces of history storage
[ https://issues.apache.org/jira/browse/YARN-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-975: --- Attachment: YARN-975.5.patch Attaching rebased patch. Thanks, Mayank Adding HDFS implementation for grouped reading and writing interfaces of history storage Key: YARN-975 URL: https://issues.apache.org/jira/browse/YARN-975 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-975.1.patch, YARN-975.2.patch, YARN-975.3.patch, YARN-975.4.patch, YARN-975.5.patch HDFS implementation should be a standard persistence strategy of history storage -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766738#comment-13766738 ] Daryn Sharp commented on YARN-1189: --- Oops, I thought the .1 patch was the latest so I didn't see the test. NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta, 2.1.1-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-975) Adding HDFS implementation for grouped reading and writing interfaces of history storage
[ https://issues.apache.org/jira/browse/YARN-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766774#comment-13766774 ] Mayank Bansal commented on YARN-975: Patch needs rebasing Adding HDFS implementation for grouped reading and writing interfaces of history storage Key: YARN-975 URL: https://issues.apache.org/jira/browse/YARN-975 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-975.1.patch, YARN-975.2.patch, YARN-975.3.patch, YARN-975.4.patch, YARN-975.5.patch HDFS implementation should be a standard persistence strategy of history storage -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1027) Implement RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766750#comment-13766750 ] Hadoop QA commented on YARN-1027: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603056/yarn-1027-8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1919//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1919//console This message is automatically generated. Implement RMHAProtocolService - Key: YARN-1027 URL: https://issues.apache.org/jira/browse/YARN-1027 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: test-yarn-1027.patch, yarn-1027-1.patch, yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-4.patch, yarn-1027-5.patch, yarn-1027-6.patch, yarn-1027-7.patch, yarn-1027-7.patch, yarn-1027-8.patch, yarn-1027-including-yarn-1098-3.patch, yarn-1027-in-rm-poc.patch Implement existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1194) TestContainerLogsPage fails with native builds
[ https://issues.apache.org/jira/browse/YARN-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766638#comment-13766638 ] Roman Shaposhnik commented on YARN-1194: [~jlowe] thanks a lot for a quick review/commit! TestContainerLogsPage fails with native builds -- Key: YARN-1194 URL: https://issues.apache.org/jira/browse/YARN-1194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Priority: Minor Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-1194.patch.txt Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
Omkar Vinit Joshi created YARN-1198: --- Summary: Capacity Scheduler headroom calculation does not work as expected Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Today the headroom calculation (for an app) takes place only when * a new node is added to or removed from the cluster * a new container is assigned to the application. However, there are potentially a lot of situations which are not considered in this calculation: * If a container finishes, the headroom for that application changes and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue, then ** if one of app1's containers finishes, not only app1's but also app2's AM should be notified about the change in headroom; ** similarly, if a container is assigned to either app1 or app2, both AMs should be notified of their new headroom; ** to simplify the whole communication process it would be ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue, then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but that would not be backward compatible). * Also, when an admin refreshes the queues, headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
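To illustrate the "headroom per user per LeafQueue" idea from the YARN-1198 description above, a very rough sketch follows. The names and the simplified formula are assumptions; the real CapacityScheduler computation involves more inputs (user-limit-factor, cluster resource changes, and so on).
{code}
// Illustrative only: headroom = what the user may still get under the user limit,
// capped by what the queue itself can still hand out.
Resource computeUserHeadroom(LeafQueue queue, String user) {
  Resource userLimit = queue.computeUserLimit(user);                  // hypothetical helper
  Resource userConsumed = queue.getUser(user).getConsumedResources();
  Resource queueAvailable = Resources.subtract(
      queue.getMaximumCapacityResource(), queue.getUsedResources());  // hypothetical accessors
  return Resources.componentwiseMin(
      Resources.subtract(userLimit, userConsumed), queueAvailable);
}
{code}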
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766787#comment-13766787 ] Arun C Murthy commented on YARN-311: Can I get a few more days to review this? Thanks. Also, let's put this in 2.3.0 (not 2.1.1). Thanks. Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch As the first step, we go for resource change on RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. In this jira, we will only contain changes in scheduler. The flow to update node's resource and awareness in resource scheduling is: 1. Resource update is through admin API to RM and take effect on RMNodeImpl. 2. When next NM heartbeat for updating status comes, the RMNode's resource change will be aware and the delta resource is added to schedulerNode's availableResource before actual scheduling happens. 3. Scheduler do resource allocation according to new availableResource in SchedulerNode. For more design details, please refer proposal and discussions in parent JIRA: YARN-291. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-311: --- Target Version/s: 2.3.0 (was: 2.1.1-beta) Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch As the first step, we go for resource change on RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. In this jira, we will only contain changes in scheduler. The flow to update node's resource and awareness in resource scheduling is: 1. Resource update is through admin API to RM and take effect on RMNodeImpl. 2. When next NM heartbeat for updating status comes, the RMNode's resource change will be aware and the delta resource is added to schedulerNode's availableResource before actual scheduling happens. 3. Scheduler do resource allocation according to new availableResource in SchedulerNode. For more design details, please refer proposal and discussions in parent JIRA: YARN-291. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
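A minimal sketch of the heartbeat-driven update flow described in the issue, under illustrative names (NodeView, memory-only resources); this is not the YARN-311 patch itself, only the delta-on-heartbeat idea:
{code}
public class NodeView {
    private long totalMb;      // total resource the RM believes the node has
    private long availableMb;  // resource still free for scheduling

    public NodeView(long totalMb) {
        this.totalMb = totalMb;
        this.availableMb = totalMb;
    }

    /** Called on the next NM heartbeat after an admin updated the node's resource. */
    public void applyResourceUpdate(long newTotalMb) {
        long deltaMb = newTotalMb - totalMb;   // may be negative when shrinking the node
        totalMb = newTotalMb;
        availableMb += deltaMb;                // scheduler sees the change before allocating
    }

    public long getAvailableMb() {
        return availableMb;
    }
}
{code}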
[jira] [Commented] (YARN-451) Add more metrics to RM page
[ https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766759#comment-13766759 ] Sangjin Lee commented on YARN-451: -- I am pretty close to getting a patch ready for review on this. A quick question before that however: the proposed change contains changes in YARN (changes in message definition to carry this extra info, and subsequent UI changes) and mapreduce (mapreduce application providing this information). Should I create two sub-tasks (one for YARN and one for MAPREDUCE) and provide separate patches for them? Add more metrics to RM page --- Key: YARN-451 URL: https://issues.apache.org/jira/browse/YARN-451 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Lohit Vijayarenu Priority: Minor ResourceManager webUI shows list of RUNNING applications, but it does not tell which applications are requesting more resource compared to others. With cluster running hundreds of applications at once it would be useful to have some kind of metric to show high-resource usage applications vs low-resource usage ones. At the minimum showing number of containers is good option. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-540: - Attachment: YARN-540.8.patch bq. why didnt this code in the previous patch cause an exception to be thrown for a normal job? Because I added the check in RMAppRemovingTransition instead of FinalTransition. bq. Can the app crash while its waiting to be unregistered. Will that generate an ATTEMPT_FAILED? Can the node crash and cause an ATTEMPT_FAILED. The AppAttempt is already in the FINISHING state when the App is in the REMOVING state: if the app crashes, the attempt will receive a CONTAINER_FINISHED event and go to the FINISHED state. If the node crashes, the attempt should receive an EXPIRE event and go to the FINISHED state as well. bq. We probably need to save the previous state and return that while the app is in REMOVING state. Yes, I added a function to return the previous state while the App is in the REMOVING state. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.8.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
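The "return the previous state while the app is in REMOVING" idea can be sketched as follows; AppState and the field names here are illustrative stand-ins, not the RMAppImpl code:
{code}
public class AppStatusView {
    public enum AppState { RUNNING, REMOVING, FINISHED }

    private AppState state = AppState.RUNNING;
    private AppState stateBeforeRemoving = AppState.RUNNING;

    public void startRemoving() {
        stateBeforeRemoving = state;   // remember what clients were last told
        state = AppState.REMOVING;
    }

    /** What external callers see: the internal REMOVING state is hidden behind the previous state. */
    public AppState reportedState() {
        return state == AppState.REMOVING ? stateBeforeRemoving : state;
    }
}
{code}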
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766728#comment-13766728 ] Daryn Sharp commented on YARN-1189: --- +1 but a test, even a mock that spies appFinished would be great to avoid a regression NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta, 2.1.1-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
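A hedged sketch of the kind of spy-based regression test suggested above. The TokenSecretManager and AppLifecycle types are stand-ins rather than the real NodeManager classes; only the Mockito spy/verify pattern is the point:
{code}
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.verify;

import org.junit.Test;

public class TestAppFinishedNotification {

    static class TokenSecretManager {
        void appFinished(String appId) { /* prune per-app state */ }
    }

    static class AppLifecycle {
        private final TokenSecretManager secretManager;
        AppLifecycle(TokenSecretManager secretManager) { this.secretManager = secretManager; }
        void finishApplication(String appId) {
            // ... other application teardown ...
            secretManager.appFinished(appId);   // the notification this JIRA is about
        }
    }

    @Test
    public void appFinishedIsForwardedToSecretManager() {
        TokenSecretManager manager = spy(new TokenSecretManager());
        new AppLifecycle(manager).finishApplication("app_0001");
        verify(manager).appFinished("app_0001");
    }
}
{code}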
[jira] [Commented] (YARN-953) [YARN-321] Change ResourceManager to use HistoryStorage to log history data
[ https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766872#comment-13766872 ] Hadoop QA commented on YARN-953: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603086/YARN-953.4.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1922//console This message is automatically generated. [YARN-321] Change ResourceManager to use HistoryStorage to log history data --- Key: YARN-953 URL: https://issues.apache.org/jira/browse/YARN-953 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: YARN-953.1.patch, YARN-953.2.patch, YARN-953.3.patch, YARN-953.4.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1189: - Attachment: YARN-1189-20130913.txt Thanks, Omkar. Patch looks good to me. Here's the patch with a unit test to make sure we're calling the token secret manager when the application is finished. NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-953) [YARN-321] Change ResourceManager to use HistoryStorage to log history data
[ https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-953: - Attachment: YARN-953.4.patch Rebase the patch [YARN-321] Change ResourceManager to use HistoryStorage to log history data --- Key: YARN-953 URL: https://issues.apache.org/jira/browse/YARN-953 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: YARN-953.1.patch, YARN-953.2.patch, YARN-953.3.patch, YARN-953.4.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1001) YARN should provide per application-type and state statistics
[ https://issues.apache.org/jira/browse/YARN-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766878#comment-13766878 ] Srimanth Gunturi commented on YARN-1001: What we need is a call like {{/appStates?type=mapreduce}}, and that would give per-state counts of the various MR apps. It should also include a total count of MR apps. Something like {noformat} { total: 10, submitted: 10, running: 3, pending: 4, completed: 2, killed: 1, failed: 1 } {noformat} YARN should provide per application-type and state statistics - Key: YARN-1001 URL: https://issues.apache.org/jira/browse/YARN-1001 Project: Hadoop YARN Issue Type: Task Components: api Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-1001.1.patch, YARN-1001.2.patch, YARN-1001.3.patch In Ambari we plan to show for MR2 the number of applications finished, running, waiting, etc. It would be efficient if YARN could provide per application-type and state aggregated counts. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
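For illustration only, a monitoring tool such as Ambari might query the proposed call like this; the base path, port, and response shape follow the comment above and are assumptions, not a committed YARN API:
{code}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AppStatsClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://rm-host:8088/ws/v1/cluster/appStates?type=mapreduce"))
            .header("Accept", "application/json")
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // e.g. {"total":10,"submitted":10,"running":3,...} per the proposal above
        System.out.println(response.body());
    }
}
{code}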
[jira] [Updated] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-978: --- Attachment: YARN-978.8.patch New patch adds TrackingUrl back, and remove the logUrl(This will be exposed by containerReport) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation -- Key: YARN-978 URL: https://issues.apache.org/jira/browse/YARN-978 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Xuan Gong Fix For: YARN-321 Attachments: YARN-978-1.patch, YARN-978.2.patch, YARN-978.3.patch, YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch, YARN-978.7.patch, YARN-978.8.patch We dont have ApplicationAttemptReport and Protobuf implementation. Adding that. Thanks, Mayank -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766916#comment-13766916 ] Hadoop QA commented on YARN-978: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603089/YARN-978.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1923//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1923//console This message is automatically generated. [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation -- Key: YARN-978 URL: https://issues.apache.org/jira/browse/YARN-978 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Xuan Gong Fix For: YARN-321 Attachments: YARN-978-1.patch, YARN-978.2.patch, YARN-978.3.patch, YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch, YARN-978.7.patch, YARN-978.8.patch We dont have ApplicationAttemptReport and Protobuf implementation. Adding that. Thanks, Mayank -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766926#comment-13766926 ] Bikas Saha commented on YARN-540: - bq. Because I added a check in RMAppRemovingTransition instead of FinalTransition The check in RMAppRemovingTransition will pass in the normal case because the app has unregistered and this is the first call to remove app. Then in the end when the app container exits then FinalTransition is called and there is no check at that time. so removeapp will be called a second time and the delete will throw an exception. Is that not the flow? Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.8.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (YARN-1178) TestContainerLogsPage#testContainerLogPageAccess is failing
[ https://issues.apache.org/jira/browse/YARN-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-1178. -- Resolution: Duplicate Marking this as a duplicate of YARN-1194 since that already has a patch posted. TestContainerLogsPage#testContainerLogPageAccess is failing --- Key: YARN-1178 URL: https://issues.apache.org/jira/browse/YARN-1178 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Eagles Test is failing after YARN-649. This test is only run in native mode mvn clean test -Pnative -Dtest=TestContainerLogsPage -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766499#comment-13766499 ] Hadoop QA commented on YARN-1189: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603017/YARN-1189-20130913.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1913//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1913//console This message is automatically generated. NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-905) Add state filters to nodes CLI
[ https://issues.apache.org/jira/browse/YARN-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-905: - Attachment: YARN-905-addendum.patch Uploaded an addendum patch that fixes case-insensitive handling of "all" and the exception thrown on invalid input. Add state filters to nodes CLI -- Key: YARN-905 URL: https://issues.apache.org/jira/browse/YARN-905 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Wei Yan Attachments: YARN-905-addendum.patch, Yarn-905.patch, YARN-905.patch, YARN-905.patch It would be helpful for the nodes CLI to have a node-states option that allows it to return nodes that are not just in the RUNNING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
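An illustrative sketch of the behaviour the addendum targets (not the actual yarn CLI code): accept the state filter case-insensitively, treat "all" specially, and fail with a readable message on unknown values. NodeStateFilter and its enum are assumptions that only mirror a subset of the real NodeState values:
{code}
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

public class NodeStateFilter {
    public enum NodeState { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED }

    public static Set<NodeState> parse(String arg) {
        if (arg.equalsIgnoreCase("all")) {
            return EnumSet.allOf(NodeState.class);   // "all", "ALL", "All" are all accepted
        }
        Set<NodeState> states = EnumSet.noneOf(NodeState.class);
        for (String token : arg.split(",")) {
            try {
                states.add(NodeState.valueOf(token.trim().toUpperCase(Locale.ENGLISH)));
            } catch (IllegalArgumentException e) {
                throw new IllegalArgumentException(
                    "Invalid node state: " + token + " (expected one of "
                    + EnumSet.allOf(NodeState.class) + " or 'all')");
            }
        }
        return states;
    }
}
{code}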
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766935#comment-13766935 ] Mayank Bansal commented on YARN-978: Looks good +1 Thanks, Mayank [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation -- Key: YARN-978 URL: https://issues.apache.org/jira/browse/YARN-978 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Xuan Gong Fix For: YARN-321 Attachments: YARN-978-1.patch, YARN-978.2.patch, YARN-978.3.patch, YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch, YARN-978.7.patch, YARN-978.8.patch We dont have ApplicationAttemptReport and Protobuf implementation. Adding that. Thanks, Mayank -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1157) ResourceManager UI has invalid tracking URL link for distributed shell application
[ https://issues.apache.org/jira/browse/YARN-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1157: Attachment: YARN-1157.1.patch Trivial patch, no test cases added ResourceManager UI has invalid tracking URL link for distributed shell application -- Key: YARN-1157 URL: https://issues.apache.org/jira/browse/YARN-1157 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.1.1-beta Attachments: YARN-1157.1.patch Submit YARN distributed shell application. Goto ResourceManager Web UI. The application definitely appears. In Tracking UI column, there will be history link. Click on that link. Instead of showing application master web UI, HTTP error 500 would appear. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766943#comment-13766943 ] Jian He commented on YARN-540: -- bq. Is that not the flow? Yeah, I think I missed that in the previous patch. That previous patch should throw exception for a normal job.. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.8.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1001) YARN should provide per application-type and state statistics
[ https://issues.apache.org/jira/browse/YARN-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766950#comment-13766950 ] Zhijie Shen commented on YARN-1001: --- Having talked to [~srimanth.gunturi] offline, here are more specifications for the API: 1. The API takes exactly one applicationType for now. If no applicationType, or more than one, is specified, we throw an exception. We may support multiple applicationTypes in the future. 2. The API takes zero to many states. If no state is specified, we enumerate all states of RMApp and return the count for each state. If states are specified, we return only the counts for those states. 3. We output the results as follows: {code} <appStatInfo> <statItem> <state>submitted</state> <type>mapreduce</type> <count>10</count> </statItem> <statItem> <state>running</state> <type>mapreduce</type> <count>3</count> </statItem> </appStatInfo> {code} We don't list the total count separately, as it can be derived by summing up all the counts, and it does not fit the schema. YARN should provide per application-type and state statistics - Key: YARN-1001 URL: https://issues.apache.org/jira/browse/YARN-1001 Project: Hadoop YARN Issue Type: Task Components: api Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-1001.1.patch, YARN-1001.2.patch, YARN-1001.3.patch In Ambari we plan to show for MR2 the number of applications finished, running, waiting, etc. It would be efficient if YARN could provide per application-type and state aggregated counts. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
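A minimal sketch of the aggregation described in points 1 and 2, using illustrative types (a simple App class and a reduced State enum instead of RMApp); an empty state set stands for "enumerate all states":
{code}
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AppStateStats {
    public enum State { SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    public static class App {
        final String type;
        final State state;
        public App(String type, State state) { this.type = type; this.state = state; }
    }

    /** Count apps of the given type per state; an empty state set means "all states". */
    public static Map<State, Long> countByState(List<App> apps, String type, Set<State> states) {
        Set<State> wanted = states.isEmpty() ? EnumSet.allOf(State.class) : states;
        Map<State, Long> counts = new EnumMap<>(State.class);
        for (State s : wanted) {
            counts.put(s, 0L);            // zero counts still show up in the result
        }
        for (App app : apps) {
            if (app.type.equalsIgnoreCase(type) && counts.containsKey(app.state)) {
                counts.merge(app.state, 1L, Long::sum);
            }
        }
        return counts;
    }
}
{code}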
[jira] [Commented] (YARN-1184) ClassCastException is thrown during preemption When a huge job is submitted to a queue B whose resources is used by a job in queueA
[ https://issues.apache.org/jira/browse/YARN-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767039#comment-13767039 ] Hadoop QA commented on YARN-1184: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603114/Y1184-0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1924//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1924//console This message is automatically generated. ClassCastException is thrown during preemption When a huge job is submitted to a queue B whose resources is used by a job in queueA --- Key: YARN-1184 URL: https://issues.apache.org/jira/browse/YARN-1184 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.1.0-beta Reporter: J.Andreina Assignee: Devaraj K Fix For: 2.1.1-beta Attachments: Y1184-0.patch preemption is enabled. Queue = a,b a capacity = 30% b capacity = 70% Step 1: Assign a big job to queue a ( so that job_a will utilize some resources from queue b) Step 2: Assigne a big job to queue b. Following exception is thrown at Resource Manager {noformat} 2013-09-12 10:42:32,535 ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception. java.lang.ClassCastException: java.util.Collections$UnmodifiableSet cannot be cast to java.util.NavigableSet at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getContainersToPreempt(ProportionalCapacityPreemptionPolicy.java:403) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:202) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:173) at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72) at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
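The stack trace above comes from casting a Collections.unmodifiableSet wrapper back to NavigableSet, which the wrapper does not implement. Below is a small standalone demonstration and a defensive alternative (copying into a TreeSet, or using Collections.unmodifiableNavigableSet on Java 8+); it is illustrative only and not the ProportionalCapacityPreemptionPolicy fix:
{code}
import java.util.Collections;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

public class UnmodifiableNavigableDemo {
    public static void main(String[] args) {
        NavigableSet<Integer> containers = new TreeSet<>(Set.of(1, 2, 3));
        Set<Integer> readOnly = Collections.unmodifiableSet(containers);

        // (NavigableSet<Integer>) readOnly  -> ClassCastException, as in the report above

        // Safe alternative: take a sorted copy when a NavigableSet view is really needed.
        NavigableSet<Integer> sortedCopy = new TreeSet<>(readOnly);
        System.out.println(sortedCopy.descendingSet());
    }
}
{code}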
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767095#comment-13767095 ] Junping Du commented on YARN-311: - Sure. Thanks for review. Arun! Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch As the first step, we go for resource change on RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. In this jira, we will only contain changes in scheduler. The flow to update node's resource and awareness in resource scheduling is: 1. Resource update is through admin API to RM and take effect on RMNodeImpl. 2. When next NM heartbeat for updating status comes, the RMNode's resource change will be aware and the delta resource is added to schedulerNode's availableResource before actual scheduling happens. 3. Scheduler do resource allocation according to new availableResource in SchedulerNode. For more design details, please refer proposal and discussions in parent JIRA: YARN-291. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1170) yarn proto definitions should specify package as 'hadoop.yarn'
[ https://issues.apache.org/jira/browse/YARN-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1170: Assignee: Binglin Chang yarn proto definitions should specify package as 'hadoop.yarn' -- Key: YARN-1170 URL: https://issues.apache.org/jira/browse/YARN-1170 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy Assignee: Binglin Chang Priority: Blocker Attachments: YARN-1170.v1.patch yarn proto definitions should specify package as 'hadoop.yarn' similar to protos with 'hadoop.common' 'hadoop.hdfs' in Common HDFS respectively. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1170) yarn proto definitions should specify package as 'hadoop.yarn'
[ https://issues.apache.org/jira/browse/YARN-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1170: Attachment: YARN-1170.v1.patch yarn proto definitions should specify package as 'hadoop.yarn' -- Key: YARN-1170 URL: https://issues.apache.org/jira/browse/YARN-1170 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy Assignee: Binglin Chang Priority: Blocker Attachments: YARN-1170.v1.patch, YARN-1170.v1.patch yarn proto definitions should specify package as 'hadoop.yarn' similar to protos with 'hadoop.common' 'hadoop.hdfs' in Common HDFS respectively. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767124#comment-13767124 ] Hadoop QA commented on YARN-540: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603143/YARN-540.9.patch against trunk revision . {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1927//console This message is automatically generated. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.8.patch, YARN-540.9.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1170) yarn proto definitions should specify package as 'hadoop.yarn'
[ https://issues.apache.org/jira/browse/YARN-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767123#comment-13767123 ] Arun C Murthy commented on YARN-1170: - +1, I'll commit after jenkins is back. Thanks [~decster]! yarn proto definitions should specify package as 'hadoop.yarn' -- Key: YARN-1170 URL: https://issues.apache.org/jira/browse/YARN-1170 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy Assignee: Binglin Chang Priority: Blocker Attachments: YARN-1170.v1.patch, YARN-1170.v1.patch yarn proto definitions should specify package as 'hadoop.yarn' similar to protos with 'hadoop.common' 'hadoop.hdfs' in Common HDFS respectively. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-540: - Attachment: YARN-540.9.patch Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.8.patch, YARN-540.9.patch, YARN-540.9.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1170) yarn proto definitions should specify package as 'hadoop.yarn'
[ https://issues.apache.org/jira/browse/YARN-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767160#comment-13767160 ] Hadoop QA commented on YARN-1170: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603144/YARN-1170.v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1926//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1926//console This message is automatically generated. yarn proto definitions should specify package as 'hadoop.yarn' -- Key: YARN-1170 URL: https://issues.apache.org/jira/browse/YARN-1170 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy Assignee: Binglin Chang Priority: Blocker Attachments: YARN-1170.v1.patch, YARN-1170.v1.patch yarn proto definitions should specify package as 'hadoop.yarn' similar to protos with 'hadoop.common' 'hadoop.hdfs' in Common HDFS respectively. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1001) YARN should provide per application-type and state statistics
[ https://issues.apache.org/jira/browse/YARN-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767159#comment-13767159 ] Hadoop QA commented on YARN-1001: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603145/YARN-1001.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1925//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1925//console This message is automatically generated. YARN should provide per application-type and state statistics - Key: YARN-1001 URL: https://issues.apache.org/jira/browse/YARN-1001 Project: Hadoop YARN Issue Type: Task Components: api Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-1001.1.patch, YARN-1001.2.patch, YARN-1001.3.patch, YARN-1001.4.patch In Ambari we plan to show for MR2 the number of applications finished, running, waiting, etc. It would be efficient if YARN could provide per application-type and state aggregated counts. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1001) YARN should provide per application-type and state statistics
[ https://issues.apache.org/jira/browse/YARN-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767162#comment-13767162 ] Xuan Gong commented on YARN-1001: - One quick comment: Do we suppose to use YarnApplicationState instead of exposing the real RMApp state ? It is true that most of them are one-to-one matching. Also ApplicationCLI uses YarnApplicationState, do they need to be consistent ? YARN should provide per application-type and state statistics - Key: YARN-1001 URL: https://issues.apache.org/jira/browse/YARN-1001 Project: Hadoop YARN Issue Type: Task Components: api Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-1001.1.patch, YARN-1001.2.patch, YARN-1001.3.patch, YARN-1001.4.patch In Ambari we plan to show for MR2 the number of applications finished, running, waiting, etc. It would be efficient if YARN could provide per application-type and state aggregated counts. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira